Tuesday, March 29, 2022

Why the SOC Needs to Learn from the Aviation Industry

The cybersecurity industry has spent a lot of time talking about improving the analyst experience without making significant improvements, as much of the effort has been focused on finding a silver bullet solution. Combine that with a global pandemic and things are only getting worse. A recent study published by Devo, the 2021 SOC Performance Report, found that on a 10-point scale, where 10 indicates SOC staff have a “very painful” experience performing their jobs, 72% of respondents rated the pain of SOC analysts at a 7 or above.


Instead of searching for the aforementioned silver bullet to alleviate SOC pain, I wanted to focus on one of its top sources, alert fatigue, and how the cybersecurity industry might take a page from another field to find a solution.

In the SOC Performance Report, a whopping 61% said a cause of SOC pain was that there are too many alerts to chase. I think it’s safe to draw the connection that “alert fatigue” will expand to “posture fatigue” and “policy fatigue,” as it adversely affects both recruitment and all too critical retention of experienced SOC professionals.

Alert fatigue may exit the aircraft

So, if we can’t figure it out within the security industry, let’s learn from others. Many non-cyber industries and professions suffer similarly from alert fatigue, and perhaps the cybersecurity industry can reapply some of their learnings. Across these fellow sufferers of alert fatigue, if we ask the question “how do alarms, warnings, and alerts differ?” I think we’ll find much similarity and overlap in the answers, in both the theory and practice of how human operators are supposed to respond and how they respond in reality.

For the purposes of this article, I want to take the aviation industry as our example for the SOC. It has navigated many of the problems SOC operators face today and has made the most progress in governing and managing the ergonomics of sensory overload and automation. Picture this: the inside of an airplane cockpit, with all its knobs, buttons, lights, and alerts, isn’t too dissimilar to the combined dashboards SOC analysts have to navigate when triaging, investigating, and responding to threats.

In 1988, The Washington Post reported on a “glass cockpit” syndrome in the aviation industry that reads eerily like what many say or think about the SOC today. Researchers from the American Psychological Association noted that pilots would “fall victim to information overload and ignore the many bits of data pouring from myriad technical systems,” and that in airline crashes they studied it was found that “black box recordings showed that the crews talked about ‘how the systems sure were screwed up’ but did not verify what was wrong. In both cases, the systems worked but crews failed to check the information and crashed.”

Similarly, research published in 2001 by the Royal Institute of Technology examined “the alarm problem” in aviation, meaning, “in the most critical situations with the highest cognitive load for the pilots, the technology lets you down.” The report noted that “the warning system of the modern cockpits are not always easy to use and understand. The tendency is to overload the display with warnings, cautions and inoperative system information accompanied by various audio warnings.” It went on to identify one of the main problems caused by this overload as “a cognitive problem of understanding and evaluating from the displayed information which is the original fault and which are the consecutive faults.” Sound familiar? You would likely hear something extremely similar from someone working in today’s SOC.

In the decades that followed, aircraft cockpit design has progressively applied new learnings and automation to dynamically manage alert volume and the attention of the pilot to priorities. In the Royal Institute of Technology’s report, researchers identified accident simulation as an effective tool for improving cockpit alert systems, finding more associable ways to present alerts such as differentiating sounds and the introduction of context, which would allow pilots to “immediately understand what part or function of the aircraft is suffering a malfunction.” More context would also include guidance on what to do next. In its conclusion the study noted:

Such simulations would hopefully result in less cognitive stress on behalf of the pilots: they would know that they have started to solve the right problem. They would not have to worry that they have entered the checklist at the wrong place. With a less stressful situation even during malfunctions there is greater hope for correct actions being taken, leading to increased flight safety.

SOC systems need to embrace and apply many of these same learnings, which have accumulated over decades in aviation. The majority of the cybersecurity industry seems to have only gotten as far as color-coding alert and warning significance, leaving the analyst faced with a hundred flashing red priorities, even after triage. It’s no surprise that analysts are both overwhelmed and unable to respond to complex threats across a broadening attack surface.

Beware of Autopilot

When it comes to solving the issue of alert fatigue, automation is typically one of the first things to come to mind. The same went for aviation in 1988, where the previously mentioned Washington Post report quoted researchers saying what could have been taken right from a security trade publication in 2022:

Research is badly needed to understand just how much automation to introduce — and when to introduce it — in situations where the ultimate control and responsibility must rest with human operators, said psychologist Richard Pew, manager of the experimental psychology department at BBN Systems and Technologies Corp. in Cambridge, Mass.

“Everywhere we look we see the increasing use of technology,” Pew said. “In those situations where the operator has to remain in control, I think that we have to be very careful about how much automation we add.”

The growing use of high-tech devices in the cockpit or on ships can have two seemingly contradictory effects. One response is to lull crew members into a false sense of security. They “regard the computer’s recommendation as more authoritative than is warranted,” Pew said. “They tend to rely on the system and take a less active role in control.” Sometimes crews are so mesmerized by technological hardware that they are lulled into what University of Texas psychologist Robert Helmreich calls “automation complacency.”

And while automation of course has an important part to play in incident response and investigation — just as it does in modern aircraft cockpit design — it comes with some key warnings:

  1. Situational awareness is lost. Automation is often brittle, unable to operate outside of the situations it is programmed for, and subject to inappropriate performance due to faulty sensors or limited knowledge about a situation.
  2. Automation creates high workload spikes (such as when routine changes or a problem occurs) and long periods of boredom (in which attention wavers and response to exceptions may be missed). If you’re staffing for automation-level activities, how do you manage capacity for spikes?

The SOC Earns its Wings

As an industry, we have to take a page from the aircraft handbook: avoid increasing cognitive demands, workload, and distractions, and make tasks easier to perform. But we must also learn to manage automation failures and exceptions better.

  • Embrace AI and autocomplete: Like the more advanced sentence autocomplete functions appearing in email and word processing applications, SOC analysts are still in charge of managing an incident, but there is an opportunity to further guide and preemptively enrich a threat investigation, thereby increasing the speed and robustness of response.
  • Distill and prioritize at the incident level, not the alert level: It’s not about filtering/correlating/aggregating alerts, it’s about contextualizing both events and alerts in the background and only articulating an incident in plain single-sentence language. Analysts can double-click down from there.
  • Leverage a community of experts: As attack surfaces increase and vertical technology specialization becomes tougher for in-house SOCs to cover (particularly in times of competing incident prioritization), it becomes increasingly important to be able to “phone-a-friend” and access an on-demand global pool of expert talent. It’s like having several Boeing engineers sitting in the cockpit with the pilot to troubleshoot a problem with the plane.
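The second point above, distilling at the incident level rather than the alert level, can be sketched in a few lines. This is a minimal illustration, not a real SIEM integration: the alert fields (`host`, `rule`, `severity`, `ts`) and the one-hour grouping window are assumptions chosen for the example.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical alert shape; field names are illustrative, not any real SIEM schema.
@dataclass
class Alert:
    host: str
    rule: str
    severity: int  # 1 (low) .. 5 (critical)
    ts: int        # epoch seconds

def to_incidents(alerts, window=3600):
    """Group alerts sharing a host within a time window into one incident,
    summarized as a single plain-language sentence, sorted by severity."""
    by_host = defaultdict(list)
    for a in sorted(alerts, key=lambda a: a.ts):
        by_host[a.host].append(a)

    # Split each host's alert stream wherever the gap exceeds the window.
    buckets = []
    for host, items in by_host.items():
        bucket = [items[0]]
        for a in items[1:]:
            if a.ts - bucket[-1].ts <= window:
                bucket.append(a)
            else:
                buckets.append(bucket)
                bucket = [a]
        buckets.append(bucket)

    incidents = []
    for bucket in buckets:
        rules = sorted({a.rule for a in bucket})
        incidents.append({
            "severity": max(a.severity for a in bucket),
            "summary": f"{bucket[0].host}: {len(bucket)} related alerts "
                       f"({', '.join(rules)}) within {window // 60} minutes.",
            "alerts": bucket,  # the detail is still there to double-click into
        })
    # Highest-severity incident first: one headline, not a hundred red lights.
    return sorted(incidents, key=lambda i: -i["severity"])
```

The analyst sees one ranked sentence per incident, with the raw alerts retained underneath for drill-down, rather than triaging each alert in isolation.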

-- Gunter Ollmann

First Published: Medium - March 29, 2022