CrowdStrike-Microsoft outage: A technological wake-up call
How a software glitch exposed the fragility of global systems
It's a Hollywood script that almost writes itself. One by one, computers around the world crash and display a blue error screen. Airports, seaports, hospitals, banks, broadcasters, government offices, and hundreds of thousands of other entities are paralyzed or forced to rely on ancient solutions such as pen and paper. The public at home watches in astonishment: is this the moment when the world will stand still?
But reality proves stranger than any screenwriter's imagination. A screenwriter might have gone for an action thriller; what we got is closer to an unfortunate comedy of errors. The culprit was not sophisticated hackers on a mission from an unknown power, but a glitch in software code from a company most of the world has never heard of. And instead of everything being resolved in a tense climactic scene full of explosions, the fix was as mundane as they come, straight out of the novice computer technician's handbook: try turning the computer off and on 15 times.
The global computer crash, caused by a malfunction in the software of the cybersecurity company CrowdStrike, began on the night between Thursday and Friday (Israel time) and paralyzed computer systems all over the world. In Israel, a number of hospitals, branches of Super Pharm, the H&M chain, banks, car rental companies (which switched to filling out forms manually), and more were affected. Abroad, airlines, banks, stock exchanges, and media outlets were hit. Many airlines were forced to suspend operations, and airports had to issue boarding passes and perform security checks manually, which created delays and congestion worldwide. The British news network Sky News had to stop its live broadcasts for many hours, a rare and perhaps unprecedented event. Other victims include hotels (which had to check in guests manually), the London Stock Exchange (which could not publish reports), government bodies (NASA and the FBI announced that they were affected by the malfunction), the public transportation networks in New York and Washington DC, and countless others.
Although the fault has been detected and isolated, it will take days or weeks to repair all the affected computers and restore them to operation, because each one requires manual intervention. With the event still unfolding, it is too early to assess the scope and magnitude of the damage caused by the malfunction, but it is expected to be significant and interconnected.
For example, the malfunction disrupted the regular operations of the port of Rotterdam in the Netherlands, the largest seaport in Europe. According to the port, this will affect the 3,000 companies that use its services. And that is just one result of disruptions in one place. According to analysts, together with disruptions at additional ports and other port-related services, this could create long-term disruptions in the global logistics system, which currently operates with no room for maneuver.
It will be some time before the effects of the malfunction can be fully assessed. But the cause is already well known. It lies with the cybersecurity company CrowdStrike, one of the largest in the world and considered a major competitor of Wiz and Palo Alto Networks. Before the malfunction, it traded on Nasdaq at a valuation of $83.5 billion (since then, it has lost about $9 billion of its value).
The source of the problem is the company's EDR (endpoint detection and response) product for Windows computers. This software is installed on the computer, scans it in search of various threats, and responds automatically—a kind of upgraded antivirus. However, while antivirus software can detect existing threats based on an updated database, EDR software can detect unknown threats by analyzing suspicious behavior.
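The difference between the two approaches can be sketched in a few lines of Python. This is a toy illustration only, not how CrowdStrike's product or any real antivirus works: the hash database, the event names, the weights, and the alert threshold are all invented for the example.

```python
import hashlib

# Hypothetical signature database: hashes of files already known to be malicious.
KNOWN_BAD_HASHES = {hashlib.sha256(b"known-malware-sample").hexdigest()}

# Hypothetical weights for behaviors that look suspicious regardless of the file.
SUSPICIOUS_ACTIONS = {
    "disable_backups": 3,
    "inject_into_process": 4,
    "mass_file_encrypt": 5,
}


def signature_match(file_bytes: bytes) -> bool:
    """Antivirus-style check: flag only files whose hash is already in the database."""
    return hashlib.sha256(file_bytes).hexdigest() in KNOWN_BAD_HASHES


def behavior_score(events: list[str]) -> int:
    """EDR-style check: add up the weights of the suspicious actions observed."""
    return sum(SUSPICIOUS_ACTIONS.get(event, 0) for event in events)


def behavior_alert(events: list[str], threshold: int = 6) -> bool:
    """Raise an alert when the observed behavior crosses the (assumed) threshold."""
    return behavior_score(events) >= threshold
```

In this sketch, a never-before-seen file passes the signature check because its hash is not in the database, but a program that starts encrypting files in bulk and disabling backups would still trip the behavior check.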
Naturally, EDR software requires frequent updates, and because of the importance and frequency of these updates, they are installed automatically by default. Updates are routinely tested and verified before distribution to ensure they do not damage the client's system. This time, for an unknown reason (possibly human error, possibly a failure of the tools used to verify the update), a software update was distributed that caused the malfunction. Because EDR software requires access to the most privileged levels of an operating system to function properly, a failure in it can disrupt the core of the system.
In this case, the CrowdStrike software glitch disrupted the activity of Windows' core components, leading to a complete system crash. Since CrowdStrike's solution is one of the most common in the world, and since Windows is the preferred operating system for many business applications, the result was the collapse of millions of computer systems globally, and chaos in its wake.
According to Nadav Avital, Director of Threat Research at Imperva, "the malfunction highlights the critical need for risk management in the supply chain. Organizations must make sure that their software suppliers employ strict security measures at all stages of development and distribution. In addition, backup systems and emergency procedures must be established that will enable rapid recovery in the event of similar incidents. By taking these measures, we can reduce the risk of similar incidents in the future and ensure the continuity of business activity in a safe and reliable manner."
Amiram Shachar, founder and CEO of Upwind, added that some lessons can already be learned from the incident: "For CrowdStrike and similar providers, it is essential to thoroughly investigate each version update before releasing it to customers, while understanding that a technical fault can cause significant damage."
CrowdStrike's failure is drawing a lot of fire, and rightfully so. It will pay for it in the form of huge lawsuits, fines, regulatory investigations, and more. There is a real concern for the stability of the company in the medium term. CrowdStrike "has done more to disrupt global business activity than all the damaging attacks combined," Michael Henry, chairman of the cyber defense firm Accelerynt, told Bloomberg. "It demonstrates how much risk we take when we deploy protection software: If these guys get it wrong, they can take down your entire business."
This connects to the second level of failure: that of CrowdStrike's customers. All affected customers, without exception, received CrowdStrike's updates automatically, without first testing them in a secure environment. For an organization with hundreds, thousands, or tens of thousands of computers, particularly one that operates critical services such as a hospital or an airline, this is abnormal behavior that goes against best practices. Those practices call for an independent, isolated test of every application and update, especially one that has access to the core of the system. The responsibility may fall on the shoulders of CrowdStrike, but its customers, who could have prevented the malfunction if only they had acted more responsibly, bear part of the blame.
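What such a practice might look like can be sketched as a staged rollout: the update goes first to an isolated test group, then to a small pilot group, and only then to production, stopping as soon as machines fail a health check. The Python below is a minimal sketch under stated assumptions; the ring names, the `apply_update` and `is_healthy` callbacks, and the failure threshold are hypothetical and do not describe any vendor's actual deployment tooling.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Ring:
    """A group of machines that receives an update at the same stage."""
    name: str
    hosts: Sequence[str]


def staged_rollout(
    rings: Sequence[Ring],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    max_failure_rate: float = 0.0,
) -> list[str]:
    """Update one ring at a time and halt before the next ring if too many hosts fail.

    Returns the names of the rings that completed successfully.
    """
    completed = []
    for ring in rings:
        failures = 0
        for host in ring.hosts:
            apply_update(host)
            if not is_healthy(host):
                failures += 1
        rate = failures / len(ring.hosts) if ring.hosts else 0.0
        if rate > max_failure_rate:
            break  # later rings never receive the faulty update
        completed.append(ring.name)
    return completed


if __name__ == "__main__":
    rings = [
        Ring("isolated-test", ["lab-1", "lab-2"]),
        Ring("pilot", ["desk-1", "desk-2", "desk-3"]),
        Ring("production", ["srv-1", "srv-2"]),
    ]
    touched = []
    # Simulate a faulty update: every machine that receives it becomes unhealthy.
    done = staged_rollout(rings, touched.append, lambda host: False)
    print("completed rings:", done)
    print("machines touched:", touched)
```

With a faulty update, the sketch stops after the isolated test ring, so the pilot and production machines are never touched. The trade-off is speed: security updates reach most machines later than they would under fully automatic distribution.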
But there is also a deeper problem than an isolated failure or the irresponsibility of companies. The glitch exposes the fragility of the global computing system, which relies on a handful of large vendors, especially when it comes to core applications. Virtually all endpoint computers in the world run one of a limited number of operating systems (Microsoft's Windows, Apple's macOS, Chrome OS, or open-source Linux), with Microsoft controlling almost three-quarters of the market. In enterprise cyber software, CrowdStrike, the second-largest player, has an 18% market share, according to IDC, with 29,000 customers. When one of these links in the chain breaks, the shock waves spread widely, and the result is significant damage to the global economy.
The next step should be an examination of the basic conditions that led to the current situation: why a handful of companies, some unknown to the general public, have the power to cause such wide-ranging disruptions in the global economy. How can we introduce more diversity into this concentrated market? This is an urgent task that regulators and the industry itself should take on. Because next time there’s a failure of this magnitude, it is not certain that we will get off so lightly.