British Airways recently demonstrated to the world the value of a well-rehearsed disaster recovery procedure. Unfortunately for BA, it did so by showing the cost of not having one. Like most organizations that are critically dependent on their IT infrastructure, BA has certainly invested heavily in disaster recovery, with backup data centers, duplicate systems, and so on. However, it became apparent that what they lacked was the experience of switching between the primary and DR centers smoothly and without data loss.
Of course, many pages have been written recently on the BA meltdown and the likely £150m compensation bill. Rightly, the BA management has been roundly condemned. Yet, while I am sure I would have been quick to criticize had I been stuck in Heathrow Terminal 5 that weekend, I do believe that those of us who work in financial services shouldn't be so fast to throw stones.
If you were to ask 100 CIOs when they last exercised their DR switch process, the answers might be surprising. Almost all will say they have tested their DR systems with the latest releases, security patches and more. However, relatively few will have tested how successfully the switch itself can be made between primary and DR systems, along with the attendant data recovery processes. Given that this switching process usually relies heavily on the accurate execution of manual steps, and given how poorly people under pressure tend to execute manual steps, surely this is the most important aspect of actually testing a DR system?
So, if you agree with my assertion about testing the DR switch, you might ask why it doesn't happen. Conducting such exercises is anything but simple or non-intrusive. In fact, performing such a drill involves so many moving parts that the risk of creating a disaster while trying to prevent one is real. Thus, many organizations, including BA apparently, fail to use realistic test transaction loads on their DR systems while testing them. To create such a load and direct it at two different target systems simultaneously, you need a sophisticated test environment that can send the required load and traffic mix to multiple concurrent destinations and automatically validate the results.
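The core idea of such an environment can be sketched as a dual-target harness: replay the same transaction mix against both the primary and the DR target concurrently, then automatically flag any divergence in the results. The sketch below is a minimal, hypothetical illustration in Python; the two target functions stand in for what would, in a real harness, be network calls into the two data centers, and all names are assumptions for illustration only.

```python
import concurrent.futures

# Hypothetical stand-ins for the primary and DR systems. In a real
# harness these would be network calls into the two data centers,
# driven by captured or synthesized production traffic.
def primary_system(txn):
    return {"id": txn["id"], "balance": txn["amount"] * 2}

def dr_system(txn):
    return {"id": txn["id"], "balance": txn["amount"] * 2}

def run_dual_target_load(transactions):
    """Send the same transaction mix to both targets concurrently
    and automatically validate that the responses agree.
    Returns the ids of any transactions whose results diverge."""
    mismatches = []
    with concurrent.futures.ThreadPoolExecutor() as pool:
        for txn in transactions:
            p = pool.submit(primary_system, txn)
            d = pool.submit(dr_system, txn)
            # Automatic validation: the DR system must produce the
            # same result as the primary for the same input.
            if p.result() != d.result():
                mismatches.append(txn["id"])
    return mismatches

# A small synthetic load; an empty mismatch list means the targets agree.
load = [{"id": i, "amount": 100 + i} for i in range(50)]
print(run_dual_target_load(load))
```

The point of the sketch is the shape of the harness, not the toy targets: the same generated load goes to both destinations at once, and validation is automated rather than left to a human comparing screens during a drill.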
Surprisingly, many mission-critical systems are still tested by teams of subject matter experts using a mixture of manual techniques, home-grown macros and a range of incompatible end-point simulators. This Heath-Robinson approach to testing yields, at best, a way to exercise basic functionality, but it is wholly inadequate for resilience testing of systems and the critical manual processes around them. So the lack of testing for DR scenarios is largely due to the lack of suitable test environments, and the result was thousands of customers spending more of their holiday in an airport than on their favorite beach.
Financial services providers are not immune to similar impacts, as outages at major banks have shown on many occasions. As with the BA case, the damage done is more far-reaching than any compensation that may be required. In fact, the damage to BA is probably a multiple of the estimate above in terms of brand value. Unlike in bygone years, such damage to brand now spreads almost exponentially, thanks to the ubiquity of digital devices that put social media in every customer's hands. That link alone can make one disenchanted customer's experience that of millions of other prospective and current customers.
And still there seems to be a laissez-faire view of testing in too many executive suites, even when those suites change owners because of failures directly associated with an organization's testing environment. At some point, with complexity in IT infrastructures growing at almost the same pace as the reach of digital into our lives, there will be a tipping point that effectively culls the herd, separating organizations already modernizing their testing environments from those who continue to whistle past the graveyard.