Preliminary investigation report on the “largest IT incident in history”: 97% of affected systems restored. Was the culprit just a routine update?

cls.cn ·  Jul 26 21:49

① CrowdStrike, the company behind the “biggest IT outage in history,” published a preliminary investigation report on its official website explaining the root cause: a mistake in a “routine operational update.” ② The vast majority of the 8.5 million Windows computers that crashed in the incident have returned to normal operation.

Cailian Press, July 26: Last Friday (July 19), a faulty software update from US cybersecurity giant CrowdStrike triggered the “biggest IT outage in history,” crashing 8.5 million Windows computers worldwide. IT systems around the world were disrupted, large numbers of flights were grounded, and businesses were forced to suspend operations.

This Friday, Beijing time, a week after the incident, CrowdStrike published a preliminary investigation report on its official website, explaining that the root cause was simply a mistake in a “routine operational update.”

CrowdStrike also reported that as of 5 p.m. Pacific time on July 24 (8 a.m. Beijing time on July 25), more than 97% of Windows sensors were back online compared with before the content update.

The Windows sensors CrowdStrike refers to are the sensors of Falcon, its cybersecurity platform, running on Windows systems. This means that on the vast majority of the 8.5 million Windows computers that crashed after last week's update, both Windows and the Falcon security software are running normally again.

CrowdStrike CEO George Kurtz said on LinkedIn: “We know our work isn't done yet, and we're still committed to restoring every affected system. For customers who are still affected, please know we won't be resting until we fully recover... I'm sorry for the damage caused by this outage and personally apologize to all those affected.”

According to insurer Parametrix, the outage, which lasted several days, caused losses of around $5.4 billion to Fortune 500 companies. Since last Friday's incident, CrowdStrike's share price has fallen by about 25% in total.

Why did Falcon trigger such a serious accident?

Falcon is CrowdStrike's most popular cybersecurity platform. To understand this incident, we first need to understand how Falcon defends a system.

Falcon is “Endpoint Detection and Response” (EDR) software. It uses sensors to monitor everything that happens on the computer where it is installed, looks for signs of malicious activity, and responds immediately and flexibly.

Consider an analogy. If a computer system is a residential community, then a traditional firewall is like the doorman guarding the gate, and anti-virus software is like the community's security guards. They check and identify suspicious people entering the community (especially known offenders) and throw them out. However, they usually identify threats only from known attack signatures, so they can miss advanced or previously unknown threats.

EDR software such as Falcon is more like an intelligent surveillance system for the community. The sensors are cameras installed in every corner: they continuously watch everything that happens and every move each person makes. When they spot something suspicious (for example, someone in the community contacting a suspected hacker), they use artificial intelligence, big data, and other technologies to analyze, assess, and predict the threat, and then respond flexibly and autonomously.

Therefore, EDR software is more capable of defending against cyber threats than traditional network security systems, and the countermeasures it can take in the face of threats are also more flexible and intelligent than traditional network security software.

For example, traditional security software can usually only quarantine or delete infected files once a virus is detected. EDR software, by contrast, can cut off network communication on its own when it detects that the computer may be talking to a suspected hacker, or anticipate a threat and raise the monitoring level when it spots suspicious activity on a system.
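The contrast between the two approaches can be sketched in a few lines of Python. This is a toy illustration only, not how Falcon or any real product works: the hash set, the host name `c2.example.invalid`, the event format, and both function names are hypothetical.

```python
import hashlib
from typing import Dict, List

# Hypothetical signature database: hashes of files already known to be malicious.
KNOWN_BAD_HASHES = {hashlib.sha256(b"known malware sample").hexdigest()}

# Hypothetical list of hosts that the monitoring side treats as suspicious.
SUSPICIOUS_HOSTS = {"c2.example.invalid"}


def signature_scan(file_bytes: bytes) -> bool:
    """Traditional approach: flag a file only if its hash is already known."""
    return hashlib.sha256(file_bytes).hexdigest() in KNOWN_BAD_HASHES


def behaviour_check(events: List[Dict]) -> List[str]:
    """EDR-style approach: watch runtime events and decide on a response."""
    actions = []
    for event in events:
        if event["type"] == "network" and event["host"] in SUSPICIOUS_HOSTS:
            actions.append(f"block connection from process {event['pid']}")
    return actions


# A new variant slips past the signature scan because its hash is unknown...
print(signature_scan(b"new variant"))
# ...but its behaviour at runtime can still trigger a response.
print(behaviour_check([{"type": "network", "host": "c2.example.invalid", "pid": 42}]))
```

The second function only works because it sees live activity on the machine, which is exactly why this kind of software needs deep access to the system.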

Compared with traditional anti-virus software, Falcon is clearly more comprehensive and intelligent. But because it must monitor the computer extensively and in detail (including the traffic the computer sends over the Internet, running programs, and open files), it needs access to many internal parts of the system. In other words, Falcon is tightly coupled to Microsoft Windows, and its system privileges are far higher than those of traditional security software.

Therefore, when EDR software such as Falcon fails, it is more likely to bring down the entire Windows system, as the widespread crash of Windows computers around the world last Friday showed.

CrowdStrike reviews the cause of the incident in detail

One week after the incident, CrowdStrike released a preliminary review report explaining what happened.

CrowdStrike wrote in the report that at 12:09 Beijing time on July 19, 2024, it released a content configuration update for the Windows sensor, intended to collect telemetry on possible new threat techniques. For CrowdStrike this was a regular operational update. According to the company, similar updates are pushed several times a day.

Unexpectedly, however, the update caused Windows systems that were online between 12:09 and 13:27 Beijing time to crash en masse with a blue screen. CrowdStrike emphasized that Mac and Linux hosts were unaffected, as were Windows hosts that were not online or connected during that window.

The crash was triggered by a flaw in the update content that CrowdStrike's validation checks failed to detect. When the Falcon sensor loaded the updated content, the flaw caused an out-of-bounds memory read, which crashed Windows.
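To make the failure mode concrete, here is a minimal sketch of how content that refers to data the program does not have leads to an out-of-bounds read, and how a validation step can reject it first. It is not CrowdStrike's code or file format; the function names and the index-based "content" are assumptions made purely for illustration. Python raises an exception on such a read, whereas in native code the same mistake reads arbitrary memory and can crash the process or, in a driver, the operating system.

```python
from typing import List, Sequence


def find_out_of_range(referenced: Sequence[int], available: int) -> List[int]:
    """Return every referenced index that is outside the inputs actually supplied."""
    return [i for i in referenced if not 0 <= i < available]


def load_content(inputs: Sequence[str], referenced: Sequence[int]) -> List[str]:
    """Validate the content before reading, instead of trusting it blindly."""
    bad = find_out_of_range(referenced, len(inputs))
    if bad:
        # Reject the update rather than read past the end of the inputs.
        raise ValueError(f"content references missing inputs: {bad}")
    return [inputs[i] for i in referenced]


inputs = ["a", "b", "c"]                      # the program supplies three inputs
print(load_content(inputs, [0, 2]))           # valid content loads normally
print(find_out_of_range([0, 2, 3], len(inputs)))  # index 3 would be out of bounds
```

A check of this kind only helps if it runs against the same number of inputs the program supplies in production, which is why gaps in validation can let a flawed update through.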

At 13:27 Beijing time on July 19, the flaw in this content update was fixed. Systems that came online after that time, or that were not connected during the window, were not affected by the crash.
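The exposure window can be worked out from the two timestamps given above. The sketch below converts them to UTC and computes the length of the window; treating the start as inclusive and the end as exclusive is an assumption, and `was_exposed` is a hypothetical helper, not anything from the report.

```python
from datetime import datetime, timedelta, timezone

# China Standard Time is UTC+8 all year (no daylight saving).
BEIJING = timezone(timedelta(hours=8))

release = datetime(2024, 7, 19, 12, 9, tzinfo=BEIJING)   # faulty update released
fix = datetime(2024, 7, 19, 13, 27, tzinfo=BEIJING)      # flaw fixed


def was_exposed(online_at: datetime) -> bool:
    """True if a host was online at a moment inside the faulty-update window."""
    return release <= online_at < fix


window_minutes = int((fix - release).total_seconds() // 60)
release_utc = release.astimezone(timezone.utc)
fix_utc = fix.astimezone(timezone.utc)

print(window_minutes)                                             # 78
print(release_utc.strftime("%H:%M"), fix_utc.strftime("%H:%M"))   # 04:09 05:27
```

In other words, the faulty content was available for 78 minutes, from 04:09 to 05:27 UTC.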

CrowdStrike said that in the future, it will strengthen software testing processes, optimize error handling mechanisms, refine deployment strategies, and adopt measures such as third-party verification to prevent similar incidents from happening again.

In addition to the initial incident review report, CrowdStrike promises to publicly release a full root cause analysis once the investigation is complete.

edit/lambor

