
Blue Screen Of Death - Lessons Learned From The Largest IT Outage In History


Arun Mamgai

The recent CrowdStrike and Microsoft outage caused a massive IT disruption, including flight delays and interruptions to financial and healthcare services. Although the outage was not caused by a cyber-attack, it shows how deeply daily life depends on technology and how severe the disruption could be if a similar failure were triggered by bad actors. There are also reports that attackers tried to exploit the CrowdStrike outage by attempting to infiltrate secure networks.

On July 19, 2024, CrowdStrike released a faulty update to its Falcon security software that caused approximately 8.5 million Microsoft Windows devices to crash and display a "Blue Screen of Death." The crashes disrupted critical services and business operations, resulting in the largest IT outage on record, which continued over the weekend. The blackout hit systems used by essential industries, causing long lines at airports, flight cancellations, payment issues at banks, and problems with hospital appointment systems.

We worked with cybersecurity and data science specialist Arun Mamgai (LinkedIn profile: https://www.linkedin.com/in/arun-mamgai-10656a4) to understand the root cause of the CrowdStrike outage and the remediation steps that could prevent similar incidents in the future. Mr. Mamgai recently appeared on a national news channel to share his expert opinion on this topic with viewers. He offers a distinctive perspective on the interplay between cybersecurity and generative artificial intelligence, and his work on cybersecurity, data privacy, and artificial intelligence has been influential in discussions about the future of digital transformation.

He has provided the recommendations below based on his recent articles "Cybersecurity and Generative AI: Friend or Foe of the digital transformation" and "How CISOs Can Take Advantage of the Balanced Scorecard Method," published by ISACA (the Information Systems Audit and Control Association), a leading global association providing cybersecurity guidance, governance, and benchmark tools.

  1. Software Supply Chain Security - Industries worldwide were disrupted on July 19 because fewer than 1% of Windows computers (about 8.5 million devices) experienced a "Blue Screen of Death" after a faulty update from the vendor CrowdStrike. Inadequate quality testing by CrowdStrike triggered the chaos, but the extreme dependency across the software supply chain must also be carefully reviewed and protected. Third-party access to the Windows kernel, such as CrowdStrike's, should be subject to checks and balances to avoid a similar collapse in the future. A Software Bill of Materials (SBOM) lists all underlying software components and their metadata, which helps organizations proactively flag vulnerable or malicious dependencies.

  2. Automatic Upgrade - This outage impacted Windows devices that were configured for automatic upgrades, meaning new software is installed as soon as a new version becomes available. Organizations should revisit this approach and avoid automatic upgrades unless mandated by the company's information security team. Automatic rollouts to field devices must be thoroughly analyzed because the consequences of faulty software landing in widely adopted field solutions are far-reaching.

  3. Automate Quality Testing - A faulty update deployed to the Falcon sensors on customers' Windows devices was the root cause of this issue. Organizations must ask vendors to test each new version in a lower (pre-production) environment before pushing changes to production. CrowdStrike must evaluate its internal pipeline processes because this issue could have been easily detected in a lower environment. The pipeline discrepancy and manual error led to this catastrophe, and both can be avoided by automating the pipeline and quality testing.

  4. Enhance Deployment Process - The new software upgrade was made available across all regions at the same time. The impact could have been minimized if the change had first been deployed to a limited set of customers or regions. Before any new change is deployed, an independent deployment risk assessment and quality review must be enforced.

  5. Partnership between Microsoft and its partners - This issue was reported only on Windows devices (Mac and Linux devices were not affected). After Microsoft reached an agreement with the European Commission in 2009 over competition concerns, it provided privileged kernel access to its partners, including CrowdStrike. This Blue Screen of Death issue would have been avoidable if CrowdStrike had not had kernel access. Microsoft should therefore revisit how it grants kernel-level access to its partners, because it is a critical risk to Windows devices if any third-party vendor can cause a "Blue Screen of Death" through a faulty deployment. Trust in Windows devices must be restored, and any upgrade to kernel drivers must undergo extensive testing.

  6. Business Continuity and Disaster Recovery - Customers across industries lost billions of dollars because a single technology upgrade created a tsunami of downstream consequences. Organizations must analyze their business continuity and disaster recovery plans periodically and be prepared to remain productive during a disruptive event like this one. Additionally, they should hold vendors accountable for such incidents or opt for insurance coverage. The liability limit in each vendor contract should be thoroughly reviewed to ensure it covers significant impacts in the event of a malfunction or disruption.
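Recommendations 2 to 4 above (controlled upgrades, automated checks, and limited initial deployment) are often combined into a staged, or "ring-based," rollout. The Python sketch below is only an illustration of that idea under stated assumptions: the `Ring` class, the `staged_rollout` function, the host names, and the 1% failure budget are all hypothetical and do not describe CrowdStrike's or Microsoft's actual pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Ring:
    """A hypothetical deployment ring: a named group of hosts."""
    name: str
    hosts: List[str]


def staged_rollout(
    rings: List[Ring],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    max_failure_rate: float = 0.01,  # assumed failure budget per ring
) -> List[str]:
    """Deploy ring by ring and stop once a ring exceeds the failure budget.

    Returns the names of the rings that were deployed and passed the check.
    """
    completed: List[str] = []
    for ring in rings:
        for host in ring.hosts:
            deploy(host)
        # Automated health check after each ring, before widening the rollout.
        failures = sum(1 for host in ring.hosts if not is_healthy(host))
        failure_rate = failures / len(ring.hosts) if ring.hosts else 0.0
        if failure_rate > max_failure_rate:
            # Halt: later (larger) rings never receive the update.
            break
        completed.append(ring.name)
    return completed


# Illustrative run with made-up hosts: in this toy model the update
# crashes every host whose name ends in "-win".
rings = [
    Ring("internal", ["lab-1", "lab-2"]),
    Ring("canary", ["c-1", "c-2-win", "c-3"]),
    Ring("broad", ["p-1-win", "p-2-win", "p-3"]),
]
updated: List[str] = []
result = staged_rollout(
    rings,
    deploy=updated.append,
    is_healthy=lambda host: not host.endswith("-win"),
)
print(result)   # ['internal']
print(updated)  # ['lab-1', 'lab-2', 'c-1', 'c-2-win', 'c-3']
```

In this toy run the faulty update reaches the small internal and canary rings, the canary health check fails, and the broad ring is never touched. That is the kind of containment a limited initial deployment is meant to provide.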

Arun Mamgai has recommended the "Generative AI - Friend or Foe for Digital Transformation", "Balanced Scorecard Method", and "Zero Trust MLOps" approaches, which are proving transformative for CISOs because of their holistic approach to protecting enterprise assets from malicious threats. According to Arun Mamgai, managing software supply chain integrity, monitoring the security of container images and runtime environments, and enforcing compliance policies can be overwhelming for enterprises. These tasks require continuous monitoring and management across all five stages of the MLOps process. Organizations must invest in an enterprise platform that provides the multiple dimensions of an enterprise-ready Zero Trust MLOps. At the same time, a balanced scorecard-based cybersecurity strategy map can reduce business risks, increase productivity, enhance customer trust, and help enterprises grow without the fear of a data breach or technology disruption.

Arun Mamgai's recommended solution identifies cybersecurity performance metrics across the Financial, Customer, Internal, and Innovation and Learning categories to help CISOs and their teams focus on the issues that matter most. It prioritizes activities according to business priorities and provides a path for security state analysis, data aggregation and correlation, enforcement of automation and AI-based defense policies, and multi-factor authentication (MFA) to avoid a single source of breach.
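To make the four-category scorecard more concrete, here is a minimal Python sketch of how metrics might be grouped and scored. Only the four category names come from the text above. The metric names, sample values, and the "attainment against target" formula are illustrative assumptions, not the method described in the ISACA article.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Metric:
    perspective: str  # one of the four scorecard categories
    name: str         # hypothetical metric name
    actual: float
    target: float
    higher_is_better: bool = True

    def attainment(self) -> float:
        """Fraction of the target achieved, capped at 1.0 (assumed formula)."""
        if self.higher_is_better:
            ratio = self.actual / self.target
        else:
            ratio = self.target / self.actual if self.actual else 1.0
        return min(ratio, 1.0)


def scorecard(metrics: List[Metric]) -> Dict[str, float]:
    """Average attainment per perspective, rounded to two decimals."""
    grouped: Dict[str, List[float]] = {}
    for m in metrics:
        grouped.setdefault(m.perspective, []).append(m.attainment())
    return {p: round(sum(v) / len(v), 2) for p, v in grouped.items()}


# Made-up sample data for illustration only.
metrics = [
    Metric("Financial", "incident cost avoided (USD thousands)", 400, 500),
    Metric("Customer", "security questionnaires answered on time (%)", 90, 100),
    Metric("Internal", "mean time to patch (days)", 20, 10, higher_is_better=False),
    Metric("Internal", "accounts protected by MFA (%)", 90, 100),
    Metric("Innovation and Learning", "staff completing security training (%)", 75, 100),
]
scores = scorecard(metrics)
weakest = min(scores, key=scores.get)
print(scores)
print(weakest)  # Internal
```

With these sample numbers the Internal category scores lowest, which is the kind of signal a CISO could use to decide where to focus first.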