Operations Leadership Lessons from the Crowdstrike Incident

Much has been written about the whys and wherefores of the recent Crowdstrike incident. Without dwelling too much on the past (you can get the background here), the question is, what can we do to plan for the future? We asked our expert analysts what concrete steps organizations can take.

Don’t Trust Your Vendors

Does that sound harsh? It should. We have zero trust in networks or infrastructure and access management, but then we allow ourselves to assume software and service providers are 100% watertight. Security is about the permeability of the overall attack surface—just as water will find a way through, so will risk.

Crowdstrike was previously the darling of the industry, and its brand carried considerable weight. Organizations tend to think, “It’s a security vendor, so we can trust it.” But you know what they say about assumptions…. No vendor, especially a security vendor, should be given special treatment.

Incidentally, for Crowdstrike to declare that this event wasn’t a security incident completely missed the point. Whatever the cause, the impact was denial of service and both business and reputational damage.

Treat Every Update as Suspicious

Security patches aren’t always treated the same as other patches. They may be triggered or requested by security teams rather than ops, and they may be (perceived as) more urgent. However, there’s no such thing as a minor update in security or operations, as anyone who has experienced a bad patch will know.

Every update should be vetted, tested, and rolled out in a way that manages the risk. Best practice may be to test on a smaller sample of machines first, for example in a sandbox or through a limited install, and only then to do the wider rollout. If you can’t do that for whatever reason (perhaps contractual), consider yourself working at risk until sufficient time has passed.
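The "smaller sample first, then wider rollout" idea can be sketched as a ring-based schedule that halts as soon as an early ring looks unhealthy. This is a minimal illustration, not a description of any vendor's tooling: the function names, the ring fractions, and the `apply_update` and `is_healthy` callbacks are all hypothetical placeholders for whatever deployment and monitoring hooks an organization actually has.

```python
import random


def plan_rollout(hosts, ring_fractions=(0.01, 0.10, 1.0), seed=0):
    """Split hosts into successive rings (hypothetical helper).

    Each fraction is the cumulative share of the fleet that should have
    the update once that ring completes, e.g. 1%, then 10%, then 100%.
    """
    shuffled = list(hosts)
    random.Random(seed).shuffle(shuffled)  # avoid always testing the same hosts
    rings, done = [], 0
    for fraction in ring_fractions:
        # Cumulative target for this ring, always adding at least one host
        # while any remain, and never exceeding the fleet size.
        target = max(done + 1, round(len(shuffled) * fraction))
        target = min(target, len(shuffled))
        rings.append(shuffled[done:target])
        done = target
    return rings


def staged_rollout(rings, apply_update, is_healthy, max_failure_rate=0.0):
    """Apply an update ring by ring, halting when a ring fails too often.

    apply_update(host) and is_healthy(host) are assumed callbacks supplied
    by the caller; they stand in for real deployment and health checks.
    """
    updated = []
    for index, ring in enumerate(rings):
        if not ring:
            continue
        failures = 0
        for host in ring:
            apply_update(host)
            updated.append(host)
            if not is_healthy(host):
                failures += 1
        if failures / len(ring) > max_failure_rate:
            # Stop here: later rings never receive the update.
            return {"halted_at_ring": index, "updated": updated}
    return {"halted_at_ring": None, "updated": updated}
```

With a fleet of 200 hosts and the default fractions, a bad update that breaks every machine it touches would stop after the first ring of two hosts rather than reaching all 200. In practice a soak period between rings would also be needed; that is left out here for brevity.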

For example, the Crowdstrike patch was an obligatory install; however, some organizations we speak to managed to block the update using firewall settings. One organization used its SSE platform to block the update servers once it identified the bad patch. Because it had good alerting, the SecOps team needed only about 30 minutes to recognize the problem and deploy the block.

Another throttled the Crowdstrike updates to 100Mb per minute; only six hosts and 25 endpoints were hit before it set the limit to zero.

Minimize Single Points of Failure

Back in the day, resilience came through duplication of specific systems––the so-called “2N+1” where N is the number of components. With the advent of cloud, however, we’ve moved to the idea that all resources are ephemeral, so we don’t have to worry about that sort of thing. Not true.

Ask the question: “What happens if it fails?” where “it” can mean any element of the IT architecture. For example, if you choose to work with a single cloud provider, look at specific dependencies: is the exposure a single virtual machine or a whole region? In this case, the Microsoft Azure issue was confined to storage in the Central US region. For the record, “it” can and should also refer to the detection and response agent itself.
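One simple way to start asking "What happens if it fails?" is to list, for each capability the business relies on, which components can provide it, and then flag any component that is the only provider of something. The sketch below assumes a hand-maintained mapping; the capability and component names are invented for illustration and do not describe any real environment.

```python
def single_points_of_failure(providers):
    """Return components that are the sole provider of a capability.

    providers maps a capability name to the set of components able to
    deliver it (a hypothetical inventory kept by the operations team).
    The result maps each sole-provider component to the capabilities
    that would be lost if it failed.
    """
    spofs = {}
    for capability, components in providers.items():
        if len(components) == 1:
            (only,) = components
            spofs.setdefault(only, []).append(capability)
    return {component: sorted(caps) for component, caps in spofs.items()}


# Illustrative inventory; every name here is made up.
inventory = {
    "object storage": {"cloud-region-a"},
    "endpoint protection": {"edr-agent"},
    "dns": {"dns-provider-1", "dns-provider-2"},
    "key escrow": {"cloud-region-a"},
}

report = single_points_of_failure(inventory)
```

Here `report` would list `cloud-region-a` (for object storage and key escrow) and `edr-agent` (for endpoint protection), while DNS is not flagged because it has two providers. A real analysis would also need to follow transitive dependencies, which this flat mapping does not capture.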

In all cases, do you have another place to fail over to should “it” no longer function? Comprehensive duplication is (largely) impossible for multi-cloud environments. A better approach is to define which systems and services are business critical based on the cost of an outage, then to spend money on mitigating those risks. See it as insurance: a necessary spend.
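Ranking systems by the cost of an outage, and weighing that against the cost of mitigation, can be expressed as simple arithmetic. The sketch below is one possible framing, not an established model: the `Service` fields, the idea of estimating outage hours per year, and all figures are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    outage_cost_per_hour: float            # estimated business cost of downtime
    expected_outage_hours_per_year: float  # estimate, e.g. from past incidents
    mitigation_cost_per_year: float        # e.g. a standby environment
    mitigation_reduction: float            # share of outage hours avoided, 0..1

    def expected_annual_loss(self):
        return self.outage_cost_per_hour * self.expected_outage_hours_per_year

    def mitigation_net_benefit(self):
        # Loss avoided by the mitigation, minus what the mitigation costs.
        avoided = self.expected_annual_loss() * self.mitigation_reduction
        return avoided - self.mitigation_cost_per_year


def prioritize(services):
    """Order services from highest to lowest expected annual loss."""
    return sorted(services, key=lambda s: s.expected_annual_loss(), reverse=True)


# Made-up figures for two hypothetical services.
payments = Service("payments", 50_000, 4, 60_000, 0.75)
wiki = Service("internal wiki", 200, 10, 5_000, 0.5)

ranked = [s.name for s in prioritize([wiki, payments])]
```

With these invented numbers, payments carries an expected annual loss of 200,000 and its mitigation has a net benefit of 90,000, so the "insurance" pays for itself. The wiki's mitigation has a net benefit of minus 4,000, so the spend is harder to justify.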

Treat Backups as Critical Infrastructure

Each layer of backup and recovery infrastructure counts as a critical business function and should be hardened as much as possible. Unless data exists in three places, it’s unprotected: with only one backup, you won’t know which copy is correct when it disagrees with the original. Failure also often occurs between the host and the online backup, so you need an offline backup as well.
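The two conditions in the paragraph above (data in three places, at least one copy offline) are easy to check mechanically against an inventory of copies. The following is a minimal sketch under that reading; the `Copy` record, the location names, and the wording of the findings are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    location: str   # where this copy of the data lives
    offline: bool   # True if the copy is not reachable from the live host


def backup_gaps(copies, min_places=3):
    """Return a list of findings; an empty list means both checks pass."""
    gaps = []
    places = {c.location for c in copies}
    if len(places) < min_places:
        gaps.append(f"data exists in {len(places)} place(s); need {min_places}")
    if not any(c.offline for c in copies):
        gaps.append("no offline copy")
    return gaps


# Illustrative inventories; the locations are invented.
weak = [Copy("primary host", False), Copy("online backup", False)]
better = weak + [Copy("offline vault", True)]
```

Calling `backup_gaps(weak)` reports both problems, while `backup_gaps(better)` returns an empty list. The check says nothing about whether a restore actually works, which still has to be tested.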

The Crowdstrike incident cast a light on enterprises that lacked a baseline of failover and recovery capability for critical server-based systems. In addition, you need to have confidence that the environment you are spinning up is “clean” and resilient in its own right.

In this incident, a common issue was that Bitlocker encryption keys were stored in a database on a server that was itself “protected” by Crowdstrike, so the keys needed for recovery sat on machines that had also failed. To mitigate this, consider using a completely different set of security tools for backup and recovery so that the two do not share a point of failure.

Plan, Test, and Revise Failure Processes

Disaster recovery (and this was a disaster!) is not a one-shot operation. It may feel burdensome to constantly think about what could go wrong, so don’t––but perhaps worry quarterly. Conduct a thorough assessment of points of weakness in your digital infrastructure and operations, and look to mitigate any risks.

As per one discussion, all risk is business risk, and the board is in place as the ultimate arbiter of risk management. It is everyone’s job to communicate risks and their business ramifications––in financial terms––to the board. If the board chooses to ignore these, then they have made a business decision like any other.

The risk areas highlighted in this case are bad patches, the wrong kinds of automation, too much vendor trust, lack of resilience in secrets management (such as Bitlocker keys), and failure to test recovery plans for both servers and edge devices.

Look to Resilient Automation

The Crowdstrike situation illustrated a dilemma: we can’t 100% trust automated processes, yet the only way we can deal with technology complexity is through automation. The lack of an automated fix was a major element of the incident, as it required companies to “hand touch” each device, globally.

The answer is to insert humans and other technologies into processes at the right points. Crowdstrike has already acknowledged the inadequacy of its quality testing processes; this was not a complex patch, and it would likely have been found to be buggy had it been tested properly. Similarly, all organizations need to have testing processes up to scratch.

Emerging technologies like AI and machine learning could help predict and prevent similar issues by identifying potential vulnerabilities before they become problems. They can also be used to create test data, harnesses, scripts, and so on, to maximize test coverage. However, if left to run without scrutiny, they could also become part of the problem.

Revise Vendor Due Diligence

This incident has illustrated the need to review and “test” vendor relationships, not just in terms of services provided but also contractual arrangements for unexpected incidents (including redress clauses that enable you to seek damages) and, indeed, how vendors respond. Perhaps Crowdstrike will be remembered more for how the company, and CEO George Kurtz, responded than for the issues caused.

No doubt lessons will continue to be learned. Perhaps we should have independent bodies audit and certify the practices of technology companies. Perhaps it should be mandatory for service providers and software vendors to make it easier to switch or duplicate functionality, rather than the walled garden approaches that are prevalent today.

Overall, though, the old adage applies: “Fool me once, shame on you; fool me twice, shame on me.” We know for a fact that technology is fallible, yet we hope with every new wave that it has become in some way immune to its own risks and the entropy of the universe. With technological nirvana postponed indefinitely, we must take the consequences on ourselves.

Contributors: Chris Ray, Paul Stringfellow, Jon Collins, Andrew Green, Chet Conforte, Darrel Kent, Howard Holton