Dror Bereznitsky, Chief Product Officer at Lightrun

Opinion
Software Failures: What companies and development teams should learn from McDonald's and CrowdStrike

"Software failures are a significant challenge for development teams, requiring a systemic approach to development, testing, and quick response," writes Dror Bereznitsky, Chief Product Officer at Lightrun.

In recent years, a growing number of technology companies have been exposed to the consequences of widespread software failures, a phenomenon that developers need to understand and work to prevent. Two prominent and widely discussed cases from recent months are the IT system failure at McDonald's and the software update failure at CrowdStrike. These cases illustrate the broad impact such failures can have on businesses and on software developers themselves. Such failures affect not only the companies involved but also consumers and, in some cases, the entire world. In an era where technology has become a cornerstone of our lives, systemic failures can disrupt the lives of millions, damage the economy, and erode consumer trust. This dependence on technology requires developers to prioritize system resilience and quality assurance so that minor faults do not escalate into global events with far-reaching consequences.
Dror Bereznitsky, Chief Product Officer at Lightrun
(Photography: Nir Selkman)
McDonald's: Configuration Error and Its Impacts
In March 2024, McDonald's experienced a global IT system failure that affected ordering systems at its restaurants worldwide. The failure occurred during a configuration change made by a third-party provider and disrupted ordering, which led to the temporary closure of restaurants in dozens of countries. As a result, developers and system managers at the company had to handle a widespread outage across global markets while working to restore services quickly and efficiently. This case highlights the importance of thorough testing before updates and changes are applied to production systems, especially critical systems relied upon by millions of users. Although the failure was fixed within hours in many markets, the reputational and business damage had already been done.
CrowdStrike: Cybersecurity and Software Failures
Another case, considered one of the largest IT failures in history, was caused by a routine software update released in July 2024 by CrowdStrike, one of the world's leading cybersecurity companies. The update triggered a widespread system failure that led to an unprecedented crisis. The company's customers found themselves with disabled computer systems, and their sense of security turned into an existential concern. This technological disaster not only undermined customer confidence but also paralyzed the activities of many organizations that relied on CrowdStrike's digital security solutions. For software development organizations, this case is a loud wake-up call that underlines the heavy responsibility they carry and the broad impact their changes can have.
These events underscore the importance of Integrated Systems Management, a topic that has not received enough attention from companies and managers until recently. Integrated Systems Management refers to all the components and processes involved in the development, distribution, and use of software. Failure to detect problems and vulnerabilities in any of the links in the chain can expose organizations to significant risks.
What Should Companies and Development Teams Learn?
Automation of Testing and CI/CD Processes: Failures like these highlight the importance of automation in development and testing processes. Automation can reduce the chance of human error, like the one that occurred at CrowdStrike, and catch problems early in the process, before they reach the production system.
Observability and Continuous Monitoring: Observability is an approach that covers all aspects of the system, including monitoring, logging, and analysis to understand software behavior in depth. Continuous monitoring of system performance is critical. In many cases, active monitoring and observability can prevent large-scale failures and detect issues early. Real-time monitoring provides developers with information and insights on system and application performance, enabling immediate decision-making and helping prevent severe failures.
Communication and Proactiveness: In crisis situations, communication is critical. Development teams must be prepared to respond quickly and efficiently, providing accurate and clear information to other teams within the company and to management. Communication and transparency help build trust with customers and users and improve the chances of successful recovery from the failure.
Adoption of Advanced Technologies: Using artificial intelligence and machine learning can improve the processes of identifying and responding to issues in IT systems. These technologies can analyze software performance and security in real time, raise alerts about emerging problems, and help prevent severe failures. Dynamic Observability - the ability to add logs and metrics in real time without code changes or affecting system performance - is an advanced approach to Observability that allows developers to understand and respond to problems more accurately and efficiently while saving costs.
Backups and Disaster Recovery: IT departments play a critical role in defining backup and Disaster Recovery policies. It is important to ensure that all core systems are backed up and designed for quick recovery in case of a failure to minimize damage to customers and the business.
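Several of the lessons above (automated checks before release, early detection through monitoring, and limiting the damage a bad change can cause) can be combined in a staged rollout. The following is a minimal Python sketch under stated assumptions, not a description of how McDonald's, CrowdStrike, or Lightrun deploy changes. The function names, the required configuration fields, and the stage sizes are all hypothetical; `apply` and `is_healthy` stand in for whatever deployment and monitoring tooling a team already has.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class RolloutResult:
    completed: bool
    hosts_updated: list
    reason: str


def validate_config(config: dict) -> list:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    # Hypothetical required fields, chosen only for illustration.
    for key in ("version", "endpoint", "timeout_seconds"):
        if key not in config:
            problems.append(f"missing required field: {key}")
    timeout = config.get("timeout_seconds")
    if timeout is not None and (
        not isinstance(timeout, (int, float)) or timeout <= 0
    ):
        problems.append("timeout_seconds must be a positive number")
    return problems


def staged_rollout(
    config: dict,
    hosts: Sequence[str],
    apply: Callable[[str, dict], None],
    is_healthy: Callable[[str], bool],
    stages: Sequence[float] = (0.01, 0.1, 1.0),
) -> RolloutResult:
    """Apply a change to growing fractions of hosts, halting on the first failure."""
    # Gate 1: automated validation before anything reaches production.
    problems = validate_config(config)
    if problems:
        return RolloutResult(False, [], "validation failed: " + "; ".join(problems))

    updated = []
    for fraction in stages:
        # Cumulative number of hosts that should be updated after this stage.
        target = max(1, int(len(hosts) * fraction))
        for host in hosts[len(updated):target]:
            apply(host, config)
            updated.append(host)
        # Gate 2: monitoring check on every host updated so far.
        unhealthy = [h for h in updated if not is_healthy(h)]
        if unhealthy:
            return RolloutResult(
                False,
                updated,
                f"halted at {fraction:.0%}: {len(unhealthy)} unhealthy host(s)",
            )
    return RolloutResult(True, updated, "rollout complete")
```

With 100 hosts and the default stages, a faulty change that breaks the first host is stopped after a single machine instead of reaching all of them. The sketch only halts the rollout; restoring the affected hosts would fall to the backup and disaster recovery procedures described above.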
Software failures are a significant challenge for development teams, requiring a systemic approach to development, testing, and quick response. These cases emphasize the importance of careful planning, risk management, and the adoption of advanced technologies in development processes, so that failures can be handled while bugs in applications are still being fixed. By preventing application downtime, these measures save companies from revenue losses estimated at hundreds of millions of dollars a year and protect their reputations. Developers should focus not only on building software that functions well but also on making it resilient to future failures and disruptions, which is essential for business success in today's complex reality.
Dror Bereznitsky is the Chief Product Officer at Lightrun, a global leader in Developer Observability.