March 10, 2020
This article explores how organisations can improve their resilience to threats such as flood, pandemic and terrorism by using the services of nearshore outsource systems development companies and, by doing so, ensure the quality of software support to live services regardless of the situation.
The current coronavirus pandemic (Covid-19) is not only a threat to public health; it is also a threat to business operations in organisations of all types (private, public and third sector). Keeping operations going is always harder when a significant number of staff are sick and absent, but in the current pandemic it is the health-risk mitigation itself (closing offices and requiring staff to work from home) that is the immediate threat to business continuity.
When those staff include IT personnel, the risk increases further. IT systems development, IT support and IT operations (including service delivery) are critical functions in business operations.
For most organisations there would be no business operations if IT operations stopped (ref. the widespread cancelling of medical surgeries in the UK National Health Service during the ‘WannaCry’ ransomware attack in 2017); no IT operations if IT support stopped; and, weaker IT support if IT systems development personnel were unavailable.
Though the criticality of software developers in live service delivery might not be obvious to some, they are the people who provide third-line ‘developer’ support to service delivery, including investigating and resolving service outages, and fixing critical software bugs.
As well as playing an important support role, the ‘day-job’ of software developers has changed in recent years as organisations move away from off-line project working to more agile ways of working. These include the ‘DevOps’ approach and other methods where continuous development, delivery and deployment of new code into production is the norm (i.e. developers are increasingly plugged in directly to IT operations and business operations, and without them it would be impossible to maintain the integrity of customer service).
The argument made here is that working with nearshore outsource systems development companies helps to ensure the quality of software support to live services regardless of the situation. The article recognises that nearshore outsource companies may also be disrupted by external events, but argues that geographic dispersion is one of the most fundamental risk mitigation strategies in emergency planning, as is organisational diversity (including using third-party suppliers). Where the organisations at risk include ‘critical national infrastructure’ (healthcare, transport, financial exchanges etc.), these types of mitigation are best practice to the point that not using them is hard to justify.
Despite working from home or working remotely being more common than ever, and despite a common assumption that organisations will always have disaster recovery and business continuity plans to cover situations such as the current pandemic, the reality is that many gaps remain in the majority of organisations (some have no plans at all).
Emergency planning (aka contingency planning), disaster recovery and business continuity are important and well-established subjects, organisational functions and professions in their own rights.
Disaster and Emergency Planning degrees are offered by many universities worldwide.
It takes knowledge, skill and experience to be an effective leader in the subject, and yet there is a mountain of case-study literature on organisations that were put out of business either by failing to plan for emergency situations, or by an important supplier of theirs failing to do so. In almost all cases they had failed to use qualified experts to help draw up, practise and maintain their plans.
Things aren’t getting any better as there is no finite unchanging list of organisations in the world. Instead, the (virtual) list changes daily as old companies close or are absorbed, and new companies spring up.
With business continuity planning not being a mandatory requirement for operating most organisation types, there is no chance of the planning gap ever being closed for good. Even where regulators require firms to have plans and methods for managing contingency risk, those plans and methods have often been shown to be worthless when needed, e.g. requiring firms to have a second data centre for disaster recovery purposes but not stipulating or enforcing the minimum distance from the primary centre or its level of natural protection (say, elevation as a defence against flood risk). Some of the case studies of failure in the face of disaster are almost funny in their ineptitude (e.g. emergency office space and infrastructure for earthquake mitigation being in the same street as the primary offices and infrastructure), while others are tragic. Even when lives are not lost, many jobs are lost each year due to a failure to plan for contingency situations (including cyber-crime such as ransomware and denial-of-service attacks). Unfortunately, business continuity plans are often treated as shelf-ware (similar to IT policy documents) in that they exist to tick a box rather than being living documents and processes that are practised and improved (and people trained) on a continuous basis.
In the case of IT operations, there are some fundamental approaches to business continuity that are used so commonly as to be part of normal day-to-day working. They include backing up data and hardware redundancy. Luckily, hardware redundancy and recovering from disaster are much easier these days than they used to be.
In previous years, hardware redundancy would have needed extra machines located in a DR suite in a data centre in a different building, ideally 50 miles or more from the primary site. Far too often, there were either no extra machines (every one bought for the purpose having been repurposed for live production or used by developers or testers for project work), or they were in the same room as production machines, or they were located nearby (i.e. in the same area of geographic risk).
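The separation rule of thumb mentioned above (a secondary site 50 miles or more from the primary) is simple enough to check automatically. The following is a minimal sketch in Python, assuming site locations are available as latitude/longitude pairs. The function names, the 50-mile default and the example coordinates are illustrative assumptions, not part of any regulatory standard. Straight-line distance says nothing about shared flood plains, power grids or fault lines, so a check like this is a first filter only.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def distance_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, using the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def sites_sufficiently_separated(primary, secondary, min_miles=50.0):
    """True if the secondary site is at least min_miles from the primary.

    primary and secondary are (latitude, longitude) tuples in degrees.
    The 50-mile default is the article's rule of thumb, not a standard.
    """
    return distance_miles(*primary, *secondary) >= min_miles


if __name__ == "__main__":
    # Hypothetical sites: approximate coordinates, for illustration only.
    primary = (51.5074, -0.1278)       # central London
    same_street = (51.5080, -0.1270)   # a few hundred yards away
    other_city = (53.4808, -2.2426)    # Manchester

    print(sites_sufficiently_separated(primary, same_street))
    print(sites_sufficiently_separated(primary, other_city))
```

A check like this could run whenever a site register is updated, flagging recovery sites that sit inside the same area of geographic risk as the site they are meant to protect.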
In terms of readiness of the machines for disaster recovery purposes, best practice evolved from the hot (primary):cold (secondary) model, to hot:warm and then hot:hot (where production is shared 50:50, with each site using 50% or less of its capacity, thus allowing one site to handle 100% of the load if required). This evolution happened because hot:cold got a terrible reputation for reliability, and it was almost always quicker to recover your primary site than to start, test and go live on the secondary site. With local hardware virtualisation for data centres being followed by genuinely (and massively) virtual Cloud computing, organisations should be free from a lot of their hardware risk these days and this aspect of business continuity should be covered (as long as they are exploiting Cloud computing to the full).
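The hot:hot arithmetic can be made explicit. The sketch below, in Python, checks whether the combined load would still fit if any one site were lost. It assumes all sites have equal capacity and that load can be moved freely between them, which is a simplification; the function name and the sample figures are hypothetical.

```python
def survives_loss_of_one_site(site_utilisations):
    """Return True if the total load fits on the remaining sites after
    losing any single site.

    site_utilisations: the fraction of capacity in use at each site
    (0.0 to 1.0). Assumes every site has the same capacity and that
    load can be redistributed freely between sites.
    """
    if any(not 0.0 <= u <= 1.0 for u in site_utilisations):
        raise ValueError("utilisation must be between 0.0 and 1.0")
    total_load = sum(site_utilisations)             # in units of one site
    surviving_capacity = len(site_utilisations) - 1  # lose one site
    return total_load <= surviving_capacity


if __name__ == "__main__":
    # Two sites at 50% each: one site can carry everything.
    print(survives_loss_of_one_site([0.5, 0.5]))
    # Two sites at 60% each: the survivor would need 120% capacity.
    print(survives_loss_of_one_site([0.6, 0.6]))
```

The same check shows why a hot:hot pair quietly stops being a disaster recovery solution once utilisation on either side creeps above the 50% mark.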
In the case of IT support, many business continuity plans have been found to be worthless for people reasons – e.g. nearly all staff (in every business and IT function) lived close enough to the office for everyone to be flooded and without power at the same time (including at home and in secondary office accommodation). I have seen that happen to the primary office, data centre, secondary data centre, emergency office accommodation, homes and people of a large organisation during a major storm. That firm could have reassured the regulator (at least up to that point) that they had disaster recovery plans in place, but the regulator had not validated the quality of the plan. The firm was out of action for days, and hobbled for weeks, which had a major impact on their clients. This is not a rare exception as anyone who works in emergency planning will tell you.
‘People’ is the factor most often overlooked in the average business continuity plan. Without people, nothing will keep working for long.
As mentioned earlier, IT operations and IT support rely more and more on the services of software developers. There are very few organisations that could keep going for more than a day or two (with confidence) without them. And yet they often live and work in the same place as the primary office, and are subject to the same geographic risks (flood, earthquake, terrorism, epidemic etc). One effective way of mitigating this risk is to separate the people geographically.
However, this is easier said than done.
I watched one bank try to do this only to find that the location where they wanted to set up a second development centre did not contain the industry in which they were engaged. Thus, nobody locally had relevant domain experience and any of their current staff relocating there would not have been able to change companies locally within their chosen industry if they so desired.
The idea was abandoned, when they could (and arguably should) have looked at gaining a nearshore partner to mitigate their people risk. With sufficient numbers of people in a partner firm developing and supporting their critical business systems, they would have reduced their risk considerably.
If something had put the primary offices and teams out of commission for a few days an outsource supplier could have taken the full support load (perhaps sacrificing development work temporarily).
The reason I do not cite offshore firms as a possible solution (i.e. those located further away than nearshore firms) is that having your partner team working close to the same time zone as you (plus or minus 3 hours) is invaluable during an emergency. Greater distance represents a risk in its own right and makes working in an urgent manner more difficult. ‘Nearshore’ for companies in the UK and the rest of Western and Northern Europe (including Scandinavia) should be regarded as Eastern Europe (where high-quality systems development outsourcing is a huge industry). For the USA and Canada it typically means South America (especially Argentina, Brazil and Chile, but other countries are catching up), though the sizes of the USA and Canada mean that onshore solutions can also work.
Though the benefits of using nearshore outsource companies for business continuity purposes may be clear for events such as flooding, are they as clear for pandemics?
Perhaps not as clear, but benefits exist and the business case for adopting nearshore partners as a risk mitigation strategy for all major risks (including pandemics) is strong.
The most important thing is that the business continuity plan for any organisation must manage the two factors of risk to the lowest possible levels at all times. Those factors are probability and impact. Organisations can reduce probability for some risks more than others (e.g. by not locating themselves in areas subject to floods, and by having good physical and IT security), but good emergency planning requires us to assume the worst has happened (e.g. a major IT network breach) and to know what we are going to do to reduce negative impacts on the organisation and its customers. No organisation can stop a pandemic (yet), but when pandemics or other events occur, having a nearshore IT partner that has been helping develop and support your services (i.e. not just used in contingency situations) will reduce the impact. Even if everyone in all locations contracts the virus eventually, they are unlikely to all be infected concurrently.
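The two factors can be put into simple arithmetic. The Python sketch below multiplies probability by impact to give an exposure figure, and estimates the chance that every support team is unavailable at the same time. All figures are invented for illustration, and the second function assumes that team outages are independent of one another. That assumption is generous during a pandemic, where outages are correlated, so the result is best read as a lower bound on the risk rather than a forecast.

```python
import math


def risk_exposure(probability, impact):
    """Expected loss: probability of the event (0.0 to 1.0) times its impact.

    impact can be in any consistent unit, e.g. cost per day of outage.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0.0 and 1.0")
    return probability * impact


def chance_all_teams_unavailable(team_probabilities):
    """Probability that every team is unavailable at once.

    ASSUMPTION: outages are independent between teams. Real events
    (pandemics especially) are correlated, so treat this as optimistic.
    """
    if any(not 0.0 <= p <= 1.0 for p in team_probabilities):
        raise ValueError("probabilities must be between 0.0 and 1.0")
    return math.prod(team_probabilities)


if __name__ == "__main__":
    # Hypothetical figures: a 20% chance the in-house team is out of
    # action in a given week, 10% for a nearshore partner team.
    in_house_only = chance_all_teams_unavailable([0.2])
    with_partner = chance_all_teams_unavailable([0.2, 0.1])

    outage_cost = 500_000  # invented cost of losing all support for a week
    print(risk_exposure(in_house_only, outage_cost))
    print(risk_exposure(with_partner, outage_cost))
```

Even with rough inputs, laying the numbers out this way shows how a second, geographically separate team reduces the probability side of the calculation while leaving the impact side for the rest of the plan to address.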
The fact that the nearshore team may remain co-located while you are not is a big plus, as is their long experience in remote working. Though ‘working from home’ is more common than ever, it is rare that an entire IT department has to do so at the same time. Again, there are case studies galore on what really happens when entire teams work from home for days or weeks. It is not what people expect, and some of the things that happen can be surprising.
This is not as intuitive a subject as people assume and it does not always come down to ‘common sense’ (though lack of common sense always becomes a factor when no plans exist).
To summarise: data, hardware and systems redundancy are an important part of ensuring business continuity in emergency situations. This is well known even if not well-practised. However, there is an important aspect that is often overlooked: people (expertise and experience).
Spreading your people and expertise geographically is one of the best things that any organisation can do to reduce the impact of any unwanted event.
For IT operations and business operations, spreading expertise has to include spreading software development and support personnel. Without that knowledge and experience being available at all times, business operations cannot be relied upon to continue. Using a nearshore outsource partner on an ongoing basis to help develop and support your critical business systems will give you this assurance of continuity and will help you manage your way through any emergency with confidence.
Cliff Moyce, March 2020
Cliff Moyce has had responsibility for emergency planning in several organisations as Chief Operating Officer. He has also led and managed many responses to emergency events such as flooding and terrorist attacks. Though he has mainly worked in financial services, capital markets and fintech, he also ran the Environment Agency in South-East Wales for a time and led their response to major flooding in 2008.
Cliff may be contacted here: https://www.linkedin.com/in/cliff-moyce-b3a8651/