A couple of high-profile system failures were back in the news this week.
First, we got a look at the letter that Charlotte Hogg, Chief Executive Officer of Visa Europe, sent to the UK Parliament detailing the issues that caused significant impact to UK cardholders on June 1st.
Given the position that Visa holds within the electronic payments industry, it isn’t surprising that they are subject to high levels of accountability and public scrutiny. And, while we never like to see major system failures, it is always interesting to see how the affected organization responds.
In an attempt to minimize the scale of the problem, Visa provided statistics to indicate that fewer cardholders were impacted than originally suggested - perhaps only 9% of the transactions failed to process correctly, out of the 25.2 million transactions attempted during the outage period. Still, a significant number of cardholders were impacted, enough to make headlines across the globe and initiate the parliamentary inquiry. I am sure it was difficult for CEO Hogg to write “Given that consumers may have access to another payment brand's credit or debit card, cash, or Faster Payments, many more purchases are likely to have been completed. For context, we estimate that 40 percent of debit cardholders in the UK also carry a credit card from MasterCard or American Express.”
While the effort to calculate the damages caused to impacted cardholders and merchants is ongoing, there is really no way to estimate the damage done to the Visa brand.
Another thing in the Visa letter that caught my eye right away was the reference to a “very rare, partial failure” of a switch that directs transactions for processing. What stuck out to me was that this specific term was mentioned 6 times in the letter (with one additional "rare" thrown in for good measure). I suppose this is another effort to defend the brand, but is this a legitimate defense?
Something for everyone to remember here is that “stuff” happens in all complex systems, even if you are Visa. All payment systems today are complex. There are just too many components, connections, and configurations for us to consider otherwise. I would argue that there is really no longer any such thing as a “very rare” failure. We know that strange and unusual things will happen, which means planning, preparation, and testing are key. It is OK for Visa to say now that “We are performing a wide review of other failure scenarios, and testing our responses, so that our teams are always prepared to respond to an event, no matter how rare or exceptional”, yet the same opportunity existed before the outage.
It also seems that Visa put too much reliance on people to understand and respond to problems in their environment. The Visa letter states “… it took far longer than it normally would
Again, it’s difficult to imagine why these capabilities were not already in place in the Visa Europe systems. The Visa letter mentions that “The manufacturer has provided us with recommendations on software for automating the monitoring and shutdown of the switch in the event of a similar type of malfunction.” I doubt that this software popped into existence in the past 30 days. Rather, I would guess that the manufacturer has actually been providing this same guidance for years, but could never get the purchase approved. But this is the real world. Decisions are made every day to skip tests, cut costs and take risks. Sometimes things work out and sometimes the stuff hits the fan. There is always more that can be done. Funny how a parliamentary inquiry can change your priorities.
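To make the idea of "automating the monitoring and shutdown of the switch" a little more concrete, here is a minimal sketch in Python. It is not Visa's or the manufacturer's software; the class name `SwitchMonitor`, the window size, and the thresholds are all hypothetical. It simply shows one way a partial failure (the switch is up, but some share of transactions fail) could be detected and acted on without waiting for a person to notice.

```python
from collections import deque


class SwitchMonitor:
    """Hypothetical watchdog: tracks recent transaction outcomes for one
    switch and isolates it when the failure rate suggests a partial failure."""

    def __init__(self, window=100, min_samples=20, max_failure_rate=0.05):
        # Sliding window of recent outcomes: True = success, False = failure.
        self.outcomes = deque(maxlen=window)
        self.min_samples = min_samples            # avoid reacting to tiny samples
        self.max_failure_rate = max_failure_rate  # assumed tolerance, not a real figure
        self.isolated = False

    def failure_rate(self):
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def should_isolate(self):
        return (len(self.outcomes) >= self.min_samples
                and self.failure_rate() > self.max_failure_rate)

    def record(self, success):
        """Record one transaction outcome and isolate the switch if needed."""
        self.outcomes.append(success)
        if not self.isolated and self.should_isolate():
            # A real system would reroute traffic and alert operators here.
            self.isolated = True


# Usage: ten good transactions, then a run of failures.
monitor = SwitchMonitor(window=50, min_samples=10, max_failure_rate=0.2)
for _ in range(10):
    monitor.record(True)
for _ in range(3):
    monitor.record(False)
print(monitor.isolated)
```

The minimum sample size is there so a single failed transaction on a quiet switch does not trigger a shutdown; a real deployment would also need to watch the watchdog itself.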
The other news was about the TSB debacle that took place earlier this year (also in the UK). While the disaster occurred back in April, we are just getting a look at the preliminary investigative report written by IBM and submitted to the Treasury Select Committee of MPs investigating the massive impact on bank customers. The report is definitely worth a look.
Key observations from IBM include:
- “A combination new applications, advanced use of microservices, combined with the use of active-active data centres have resulted in compounded risk in production.” i.e. the systems are complex.
- “The complexity results in a broad range of technical and functional problems that are hard to diagnose.” There is a need for automated tools to help manage the complex environment.
- “Performance testing did not provide the required evidence of capacity and the lack of active-active test environments have materialised risk due to issues with global load balancing (GLB) across data centres”. Financial services companies need to integrate performance testing along with functional testing to ensure systems operate correctly at scale.
- “IBM has not seen evidence of the application of a rigorous set of go-live criteria to prove production readiness.” In other words, there was not enough planning, preparation, and testing.
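As a rough illustration of what a "rigorous set of go-live criteria" might look like when it is enforced by a tool rather than by good intentions, here is a small Python sketch. The criteria names and evidence strings are invented for the example and are not taken from the IBM report; the only point is that a release is blocked unless every criterion has both passed and has evidence attached.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One hypothetical go-live criterion and the evidence backing it."""
    name: str
    passed: bool
    evidence: str = ""


def go_live_decision(criteria):
    """Return (ready, blockers). Ready only if every criterion has passed
    AND has non-empty evidence; a pass with no proof still blocks."""
    blockers = [c.name for c in criteria
                if not (c.passed and c.evidence.strip())]
    return (not blockers, blockers)


# Illustrative checklist; names and evidence are assumptions for the example.
checklist = [
    Criterion("functional regression suite", True, "test run report"),
    Criterion("performance test at peak load", False),
    Criterion("active-active failover rehearsal", True, ""),
]

ready, blockers = go_live_decision(checklist)
print(ready)
print(blockers)
```

Treating "passed but undocumented" as a blocker is a deliberate choice here, since the complaint in the report is about missing evidence, not just missing tests.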
My point here is not to disparage either Visa or TSB. It is merely to illustrate that there are no longer any short-cuts. All payment systems are now so complex that the only way to effectively monitor, manage and test them is via automated systems. All possible, as well as impossible, failure scenarios should be considered, evaluated and planned for - including the failure of the automated systems.
There is just no way to escape the reality that all complex systems are subject to the same immutable law: stuff happens. Or in the words of Robert Burns: “The best-laid schemes o' mice an' men gang aft agley.”