It looks like most of the organizations affected by the recent CrowdStrike incident have managed to bring their operations back under control and return to normal. Sadly, for some of these companies, the effort was very painful and very expensive.
Delta Air Lines in particular was severely impacted by the outage and endured several exceedingly difficult days of public derision and ridicule as it struggled to recover. In addition to the damage done to Delta’s share price and reputation, CEO Ed Bastian has publicly stated that the CrowdStrike incident cost the airline $500 million.
Unfortunately, Delta will likely face more pain and expense before it can close the chapter on this incident. The US Department of Transportation has opened an investigation to determine why Delta was impacted so badly and why it took so much longer to recover than other airlines.
Now that Delta has been able to put a dollar figure on the cost of the incident, it comes as no surprise that the airline will try to recover some or all of the expense. CrowdStrike – which has also taken a huge hit to its reputation and share price – is not likely to start writing checks that could total billions of dollars. The legal wrangling may take months or even years to sort out, with a successful outcome not guaranteed for either organization. More pain, more expense.
With this in mind, it is important for everyone associated with the payment industry to fully appreciate how expensive a major outage can be for their organization.
The immediate costs associated with an incident and its remediation may be just the proverbial tip of the iceberg. The ripple effects of an outage can be larger and more severe – including diminished customer trust, loss of business, long-term brand damage, and negative impacts on shareholder value.
O, What a Tangled Web We Weave…
It is also critically important to understand how interconnected and codependent we have all become. Organizations of all sizes and from every location, as well as millions of individuals, find themselves deeply embedded in a complex web of interconnected networks, computer systems, and applications. And for good reason: this “fabric” facilitates global communications and commerce, the efficient exchange of data, and operational efficiency.
This dependence on shared infrastructure and services, however, introduces significant systemic risk. An incident or outage in any of these common services can propagate across a network, impacting every connected entity. This risk was clearly illustrated by the recent CrowdStrike outage, where a defect in a SaaS software update was quickly distributed to all subscribers, leading to widespread disruptions across the globe.
And while the CrowdStrike outage was unprecedented in terms of its scale and cost, it was not an isolated incident. According to research by PagerDuty, a global leader in digital operations management, the surge in IT incidents is being driven by increased complexity, rapid expansion of digital services, and insufficient investment in IT infrastructure maintenance.
So, even while many companies were still cleaning up from the CrowdStrike mess, we saw more widespread outages affecting both organizations and consumers on a broad scale.
Long gone are the days when IT geeks read about tech meltdowns in monthly computer rags. In the age of watchdog websites like Downdetector and ThousandEyes, virtually every system failure, network hiccup, and internet outage has become easily visible to any user who wants to access the sites 24/7 from their mobile phone.
These websites empower the average web user to track and report on the health of major companies with unprecedented ease. From Amazon to Zoom and from Microsoft to Oracle, no organization can hide from these watchful eyes – or avoid their vengeful wrath. The message is clear: bad stuff happens, and everyone will know exactly when – and how often – it happens to your company.
The CRN website tracks cloud outages and has published its list of the 10 Biggest Cloud Outages of 2024 (January 1 to June 30, so not including the CrowdStrike incident). It is not a pretty picture and contains details of outages from several of the largest IT and SaaS providers on the planet, including Microsoft, Oracle, and Salesforce, among others.
An Ounce of Prevention… Why Payments Testing is Critical
Given both the increasing frequency and escalating costs of IT outages, every organization must pay close attention to what is happening around it and learn what it can from every incident.
As painful as it may be, many IT organizations and service providers publish specific details of the issues that have affected them and their clients. These incident reports are valuable resources that savvy IT operations teams in other companies can learn from to improve the reliability of their systems and better protect themselves.
Not only is it important to learn from these mistakes, but IT organizations must also stay current with the latest technologies and industry best practices, investing in the people and resources needed to fortify their defenses against internal and external threats.
The importance of a comprehensive and robust testing infrastructure as a critical component in any strategic plan to prevent and respond to IT incidents cannot be overstated. A long-accepted rule of business is that you cannot manage what you cannot measure. It is similarly important to recognize that if it isn’t tested, it isn’t trusted.
When it comes to the issue of testing, companies that invest in modern payment testing solutions will be able to integrate, automate, and optimize their testing operations, enabling them to expand test coverage, improve quality, and respond more quickly to incidents and outages.
Simply automating the most common testing scenarios will free up resources to focus on other higher-value tasks, like upgrading infrastructure or addressing the technical debt that weighs so heavily on many organizations.
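As a rough illustration of what automating the most common scenarios can look like, here is a minimal, table-driven sketch in Python. Everything in it is an assumption made for the example: the `authorize` function is a stand-in for a real system under test, the `Scenario` fields are hypothetical, and the response codes are illustrative values rather than any particular network's specification.

```python
from dataclasses import dataclass

# Illustrative response codes (assumed for this sketch); a real test suite
# would use the codes defined by its own network or message specification.
APPROVED = "00"
INSUFFICIENT_FUNDS = "51"
EXPIRED_CARD = "54"


@dataclass(frozen=True)
class Scenario:
    """One common test case: inputs plus the response code we expect."""
    name: str
    amount_cents: int
    balance_cents: int
    card_expired: bool
    expected_code: str


def authorize(amount_cents: int, balance_cents: int, card_expired: bool) -> str:
    """Hypothetical stand-in for the payment system under test."""
    if card_expired:
        return EXPIRED_CARD
    if amount_cents > balance_cents:
        return INSUFFICIENT_FUNDS
    return APPROVED


# The "most common scenarios" live in one table, so adding coverage
# means adding a row rather than writing a new test by hand.
SCENARIOS = [
    Scenario("approve within balance", 2_500, 10_000, False, APPROVED),
    Scenario("decline over balance", 12_000, 10_000, False, INSUFFICIENT_FUNDS),
    Scenario("decline expired card", 2_500, 10_000, True, EXPIRED_CARD),
]


def run_scenarios(scenarios):
    """Run every scenario and return the names of the ones that failed."""
    failures = []
    for s in scenarios:
        actual = authorize(s.amount_cents, s.balance_cents, s.card_expired)
        if actual != s.expected_code:
            failures.append(s.name)
    return failures


if __name__ == "__main__":
    failed = run_scenarios(SCENARIOS)
    print("all scenarios passed" if not failed else f"failed: {failed}")
```

Because the scenarios are plain data, a suite like this can run on every change without manual effort, which is the kind of routine work that frees people up for higher-value tasks.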
According to Eric Johnson, the CIO at PagerDuty: “The costs of these incidents are significant both financially and in lost consumer trust, which is why companies need to invest in automation to mitigate the risk and shorten the time an incident lasts. Investing in automation needs to be at the top of IT leaders’ priority lists.”
Paragon Can Help You Build Robust, Automated Testing Processes
The recent CrowdStrike outage highlights the risks associated with the complex, interconnected, and interdependent IT world that businesses operate in today. The incident also emphasizes the need for every organization operating in one of these high-tech ecosystems to take the associated risk seriously and accept responsibility for protecting itself and its shareholders to the greatest degree possible.
Modern financial services organizations require robust, agile testing capabilities to ensure that everything that can be tested gets tested.
For more than 30 years, Paragon Application Systems has been delivering innovative testing solutions that empower banks, networks, processors, and merchants to streamline their testing processes, enhance accuracy, expand coverage, and minimize risk.
By working with Paragon, organizations gain access to industry expertise, a proven track record, and a corporate commitment to delivering exceptional customer support. This strategic approach to partnership helps clients achieve the highest levels of system availability, reliability, and customer satisfaction in an increasingly connected, complex, and competitive landscape.
Interested in learning more about how we can help you? Talk to one of our payment testing experts today.