The Blue Screen of Death (BSOD) is the archenemy of IT administrators worldwide. It tends to pop up when you least expect it, causing disruption and mayhem across Windows systems. Some argue that its emoticon frown is a more hated character than Clippy.
Many organizations manage software across systems so that if there is an issue, it can be resolved with minimal impact on customers. When your business relies on software supplied by an international vendor, however, there is a risk that one of its updates can hamper your operations.
For CrowdStrike Falcon customers, July 19, 2024 will go down as a case study for accelerated MBA online students to remember: a day when a massive cyber disruption tested business continuity planning and showed the world just how badly vulnerable IT systems can impact the global economy. How can businesses better prepare so that, in the future, they can respond effectively to business disruption?
The CrowdStrike Outage – What Was It?
To understand how the CrowdStrike outage occurred, it’s helpful to understand how different programs work within computers. Many operating systems, such as Windows, have programs running in different layers – known as the application layer and the kernel layer.
While these programs need access to computing power, they often don’t need access to all of a computer’s components. To ensure a computer can run programs for a user while also staying functional, a layer known as the kernel manages the conflicts that occur when different programs compete for the computer’s resources.
Think of it as a traffic marshal: a kernel helps to guide traffic, manage conflict, and interface between programs (the vehicles) and the physical hardware (the road they drive on). If there’s a logic fault in the kernel (say the marshal gives conflicting instructions), everything stops, and you effectively have a traffic jam.
Most programs, such as word processors and web browsers, operate at an application level; however, in some cases, security programs, such as CrowdStrike Falcon, operate at the kernel level. This is because of the role that they play in monitoring systems for infections and intrusions, such as kernel-based malware.
On July 19, 2024, CrowdStrike issued an update for its security program, Falcon. The update contained a logic fault that crashed the kernel on Windows machines, triggering the Blue Screen of Death and a reboot loop. Affected systems were unusable until IT teams could remediate them.
The error resulted in 8.5 million computers going out of action simultaneously across a range of industries, from healthcare to transport and broadcast media. Hospitals canceled non-urgent surgeries across the United States, transport hubs such as airports saw extensive delays that lasted into the weekend, and some news channels went offline as their underlying technology faulted around them.
Understanding Organizational Risks
A week after the incident, CrowdStrike advised that some 97% of impacted machines had been restored. The remainder were still out of action, showing that a single error can take a long time to remediate fully. What can businesses learn from this incident?
Firstly, it’s important to recognize that technology comes with risks. Whether it’s the impacts of power outages, malware, or a system update gone wrong, an organization will likely need to prepare for the worst.
A great way to understand an organization’s risks is to map them out. It is crucial to have employees in the business who understand the software architecture the organization depends on. The map should be updated regularly as new tools are introduced, so that it accurately reflects any risks that may be present.
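For teams that keep this kind of map in a spreadsheet or script, the idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the component names, vendors, and business systems below are hypothetical, and the rule used here (flag any software that more than one business system depends on) is just one simple way to surface shared points of failure.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Component:
    """One piece of software in the organization's inventory (all values hypothetical)."""
    name: str
    vendor: str
    runs_in_kernel: bool          # kernel-level software can take the whole machine down
    systems: Tuple[str, ...]      # business systems that depend on this component


def shared_dependencies(inventory: List[Component], min_systems: int = 2) -> List[Component]:
    """Return components that several business systems rely on, most widely shared first."""
    shared = [c for c in inventory if len(c.systems) >= min_systems]
    return sorted(shared, key=lambda c: (-len(c.systems), c.name))


# Hypothetical inventory for illustration only.
INVENTORY = [
    Component("endpoint-security-agent", "Vendor A", True,
              ("check-in kiosks", "staff laptops", "booking servers")),
    Component("word-processor", "Vendor B", False, ("staff laptops",)),
    Component("database-engine", "Vendor C", False, ("booking servers", "payroll")),
]

for component in shared_dependencies(INVENTORY):
    level = "kernel-level" if component.runs_in_kernel else "application-level"
    print(component.name, len(component.systems), level)
# prints:
# endpoint-security-agent 3 kernel-level
# database-engine 2 application-level
```

In this made-up inventory, the kernel-level security agent sits underneath three business systems at once, which is exactly the kind of concentration a risk map is meant to make visible.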
Be Prepared
For many businesses, strategic planning, such as a business continuity plan and a crisis response plan, can be incredibly useful for preparing for the next bad event.
A business continuity plan essentially functions as a backup plan. It’s an agreed-upon set of actions that a business will take when a threat to the company materializes, such as a system outage or disaster. This plan typically outlines the major stakeholders that need to be notified, what steps need to be taken, and what backup systems or processes need to be followed in the case of an outage. It may also include any steps that need to be taken in terms of applying for compensation or seeking redress through insurers or partners.
A crisis response plan functions in a similar way; however, it includes one vital additional element: the communication strategy that an organization will use when an incident occurs. Communication during a crisis is always critical, and it’s important that all organizations are aware of the benefits that having both a business continuity plan and a crisis response plan can provide.
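The elements of a business continuity plan described above can also be written down as a simple structured record, which makes gaps easy to spot. The Python sketch below is a hypothetical illustration rather than a standard template: the field names, the example threat, and the choice to treat the compensation steps as optional are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ContinuityPlan:
    """A business continuity plan reduced to the sections named in the article."""
    threat: str
    stakeholders_to_notify: List[str] = field(default_factory=list)
    response_steps: List[str] = field(default_factory=list)
    backup_processes: List[str] = field(default_factory=list)
    # Treated as optional here, since a plan "may also include" these steps.
    compensation_steps: List[str] = field(default_factory=list)


def missing_sections(plan: ContinuityPlan) -> List[str]:
    """List the core sections of the plan that are still empty."""
    required = ("stakeholders_to_notify", "response_steps", "backup_processes")
    return [name for name in required if not getattr(plan, name)]


# Hypothetical plan for illustration only.
plan = ContinuityPlan(
    threat="Faulty security update leaves machines in a reboot loop",
    stakeholders_to_notify=["IT operations lead", "Customer support manager"],
    response_steps=["Identify affected machines", "Apply the vendor's remediation"],
)

print(missing_sections(plan))
# prints: ['backup_processes']
```

Here the check shows that nobody has yet written down which backup systems or processes staff should fall back on, a gap that is far cheaper to find during a review than during an outage.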
It’s important that these plans are regularly reviewed and validated, especially as new tools and platforms enter an organization. After all, the last thing any business in crisis needs is an outdated or no longer fit-for-purpose plan. Whether it’s a cyber-attack, a rogue software update, or simple human error, it’s clear that businesses must adequately prepare for incidents that occur on their watch. With the recent CrowdStrike outage costing one large U.S. airline $500 million in lost revenue in a matter of days, it’s a good time to ask: when was the last time you checked your business continuity plan?