When a CrowdStrike glitch caused global chaos, it provided us with an opportunity to delve deeper into how Windows works and why this failure led to a worldwide disaster.
We’ll start with the concept of “rings” in the computer security sense, not the ones you wear on your fingers.
Protection rings. According to former Microsoft engineer Dave Plummer, Microsoft’s operating system uses a system of “rings” to separate running code into two privilege levels.
Ring 0. The first is kernel mode, or Ring 0, which is reserved for the operating system. This is where code interacts directly with the hardware, manages memory, and schedules threads. Kernel mode has the most privileges and unrestricted access to all memory.
Ring 3. The second is user mode, which on x86 processors corresponds to Ring 3 (Rings 1 and 2 exist in the hardware, but Windows doesn’t use them). It’s used for processes and apps with normal (and limited) user privileges. In this mode, apps can only access memory pages that the kernel allows.
If a user app’s thread of execution needs access to privileged resources, it can trigger a controlled transition into the kernel, known as a system call, to request temporary and limited access to a protected resource such as the computer’s memory. The kernel then evaluates the request and grants access if it’s valid.
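The request-and-evaluate pattern described above can be observed from an ordinary program. What follows is a minimal sketch in Python, not a description of Windows internals: it assumes that `os.open` and `os.write` are passed through to the operating system more or less directly, and the helper `request_write` is a hypothetical name chosen for illustration. The program asks to write to a file twice, once through a handle opened for writing and once through a handle opened as read-only, and reports whether each request was granted.

```python
import os
import tempfile


def request_write(fd: int, data: bytes) -> str:
    """Ask the operating system to write `data` to `fd` and report the outcome.

    Hypothetical helper for illustration: the program never touches the disk
    itself, it only submits a request that the system may accept or reject.
    """
    try:
        written = os.write(fd, data)
        return f"granted: {written} bytes written"
    except OSError as exc:
        return f"refused: {exc.strerror}"


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")

    # Handle opened for writing: the request is valid and is granted.
    writable_fd = os.open(path, os.O_CREAT | os.O_WRONLY)
    granted = request_write(writable_fd, b"hello")
    os.close(writable_fd)

    # Handle opened read-only: the same request is evaluated and refused.
    read_only_fd = os.open(path, os.O_RDONLY)
    refused = request_write(read_only_fd, b"again")
    os.close(read_only_fd)

print(granted)
print(refused)
```

In both cases the app keeps running. A refused request comes back as an error the program can handle, which is the point of forcing user-mode code to ask rather than act.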
Crashes in user mode are typically non-fatal, but in kernel mode, they can be catastrophic. As Plummer indicates, these two modes differ in the scope of crashes that can occur during their execution.
When an app’s code fails in user mode, the app itself crashes. In this case, a familiar window appears, giving users the option to wait for a response or to close the program directly. However, if the code of a privileged process in kernel mode fails, the entire system crashes. This is when the infamous “blue screen of death,” or BSoD, may appear.
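The user-mode half of this contrast is easy to reproduce safely. The sketch below, written in Python as an illustration, starts a child process that deliberately aborts and then shows that the parent program carries on. The names `CRASHING_APP` and `run_and_report` are hypothetical, and the exact exit code depends on the operating system, so the code only checks that it is non-zero. There is no safe equivalent for the kernel-mode case, which is precisely the problem the article describes.

```python
import subprocess
import sys

# Source of a tiny "app" that crashes on purpose (hypothetical example).
CRASHING_APP = "import os; os.abort()"


def run_and_report(source: str) -> int:
    """Run `source` in a separate user-mode process and return its exit code."""
    result = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,  # keep the child's crash output off our console
    )
    return result.returncode


exit_code = run_and_report(CRASHING_APP)

# Reaching this line shows the crash stayed inside the child process.
parent_still_running = True
print(f"child exited abnormally (code {exit_code}); parent is still running")
```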
Falcon runs in kernel mode. The “Falcon sensor,” a cybersecurity solution developed by CrowdStrike, operates in kernel mode (theoretically) to gain access to system resources and defend against various cyberattacks. It acts like an “anti-malware” program for servers, monitoring app behavior to identify and prevent potential attacks. CrowdStrike created a device driver to access the kernel and obtain the necessary resources for the sensor to function effectively and safeguard the servers.
A certified driver. To ensure that drivers work as expected, even with privileged access, Microsoft provides WHQL (Windows Hardware Quality Labs) certification to guarantee that these components have been validated by both the developer and Microsoft. They test the components on various platforms and configurations and digitally “sign” them to receive certification.
Conflicting updates. The WHQL certification applies only if the driver remains unchanged. This presents an issue for CrowdStrike, considering its driver needs to constantly evolve to combat new threats. This means that the driver would require ongoing re-evaluation to maintain certification, similar to the process for updating graphics card drivers. However, CrowdStrike uses “definition files” that the driver processes but doesn’t directly include. These dynamic files are regularly updated and provide the Falcon sensor with information about new threats, allowing it to incorporate these updates into its detection processes.
Definitions that are actually programs. One might expect these updates to be simple text documents with information about new threats. However, in the case of CrowdStrike, Plummer points out that these definition files could actually be full-blown programs in PE (Portable Executable) format, the standard format for Windows executables. This kind of code is common in the cybersecurity and reverse-engineering world, and the driver executes it. “What you’ve got then is unsigned code of unknown provenance running in full kernel mode,” Plummer explains. Even a simple bug in these programs (definitions) could potentially cause a complete system crash, which is exactly what happened on Friday.
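To make the difference between a text document and a PE program concrete, here is a minimal Python sketch that checks whether a blob of bytes carries the two markers every PE file has: the “MZ” bytes at the very start, and a “PE\0\0” signature at the offset stored in position 0x3C of the header. It says nothing about what CrowdStrike’s files actually contain. The function names `looks_like_pe` and `make_minimal_header` are hypothetical, and the test header is fabricated purely to exercise the check.

```python
import struct

DOS_HEADER_SIZE = 0x40      # size of the legacy DOS header at the start of a PE file
PE_OFFSET_FIELD = 0x3C      # where the header stores the offset of the PE signature
PE_SIGNATURE = b"PE\x00\x00"


def looks_like_pe(blob: bytes) -> bool:
    """Return True if `blob` starts like a Portable Executable file."""
    if len(blob) < DOS_HEADER_SIZE or blob[:2] != b"MZ":
        return False
    # The offset is a 4-byte little-endian integer.
    (pe_offset,) = struct.unpack_from("<I", blob, PE_OFFSET_FIELD)
    # An out-of-range offset yields an empty slice, which fails the comparison.
    return blob[pe_offset:pe_offset + 4] == PE_SIGNATURE


def make_minimal_header() -> bytes:
    """Build just enough bytes to pass the check (for demonstration only)."""
    dos_header = bytearray(DOS_HEADER_SIZE)
    dos_header[0:2] = b"MZ"
    struct.pack_into("<I", dos_header, PE_OFFSET_FIELD, DOS_HEADER_SIZE)
    return bytes(dos_header) + PE_SIGNATURE


print(looks_like_pe(make_minimal_header()))    # True
print(looks_like_pe(b"threat-name=example"))   # False: plain text
print(looks_like_pe(bytes(128)))               # False: a file of zeros
```

A check like this only identifies the format. It does nothing to establish who wrote the code or whether it is free of bugs, which is the concern Plummer raises.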
A code full of zeros? When analyzing what caused the IT outage, CrowdStrike mentioned a “logic error” without providing specifics. However, cybersecurity experts have suggested different theories about the issue. Some believed the update contained only zeros, but that theory was later dismissed. Others suggested the problem was linked to a null pointer dereference in the driver’s C++ code.
An apparently colossal CrowdStrike error. All the above suggests that the company failed to properly check the update to its definition file, which led to a major security flaw when executed in kernel mode.
This is concerning because CrowdStrike’s security solutions are used on critical servers. Now the question is whether this problem will inspire cybercriminals to try and exploit similar vulnerabilities in Windows systems.
This article was written by Javier Pastor and originally published in Spanish on Xataka.
Image | Dave’s Garage