Operational Technology (OT) Cybersecurity Resilience: Beyond Compliance, Toward Durability

By Eric Cosman

 

Critical infrastructure facilities run on operational technology that was designed for reliability and longevity, not for an era of persistent adversarial pressure. Protecting these systems has long been seen primarily as a defensive exercise. The emphasis has been on building stronger perimeters, restricting access, patching vulnerabilities, and preventing unauthorized activity. The objective was straightforward: keep attackers out.

While this is a laudable objective, it is not fully achievable, for several reasons. Threats and vulnerabilities may not be known, and even when they are, their sheer volume can make them difficult or even impossible to address. Nor can a defender count on stopping every attack by a determined adversary. Protection and prevention alone are simply not enough.

A different perspective on cybersecurity

The cybersecurity response must accept this reality and focus on resilience as well as prevention. Resilience is different from being secure. It is the ability of an organization’s systems (e.g., industrial control systems (ICS), SCADA, building automation, and other technologies that control physical processes) to withstand, respond to, and recover from inevitable cybersecurity incidents while maintaining safe and reliable operations. Essential operations must continue safely and reliably despite cyber threats. This shift in thinking changes how organizations approach cybersecurity investments, architecture decisions, and operational planning.

Consider a simple military analogy. Victory is not achieved merely by avoiding attacks and repelling invaders. Attacks are inevitable, and those who can survive them and continue to fight will ultimately prevail.

The inevitability of disruption

Experience has shown that even highly mature organizations suffer cybersecurity incidents. Ransomware attacks have disrupted OT systems around the world. Supply chain compromises have introduced vulnerabilities through trusted software providers. Nation-state actors have demonstrated the ability to target critical infrastructure. Asset owners must therefore assume that failures, compromises, and unexpected events will occur. Resilience begins with accepting this reality and preparing for it.

Rather than asking, “How do we prevent every incident?” resilient OT systems require a different set of questions, beginning with:

  • What happens if a critical system is compromised?
  • Can operations continue safely?
  • How quickly can we detect abnormal conditions?
  • How effectively can we contain the impact?
  • How long would recovery take?
  • What lessons will we learn afterward?

These questions focus on operational outcomes rather than technical controls alone.

Resilience domains

For OT systems, resilience spans five interlocking capability domains, with each building on the previous one.

  1. Visibility – Includes a complete, current asset inventory with network topology and change management.
  2. Vulnerability management – Includes risk-informed prioritization; patch if possible, use compensating controls if not.
  3. Protective controls – Includes network segmentation, DMZ architecture, and multi-factor authentication for remote access.
  4. Detection – Includes continuous passive OT monitoring, SIEM integration, and behavioral baselining.
  5. Response and recovery – Includes OT-specific incident response playbooks, tested backups, defined return-to-operations procedures, and regular exercises.
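As a rough illustration of the "risk-informed prioritization; patch if possible, use compensating controls if not" idea in the second domain, the sketch below ranks findings and assigns a treatment. The scoring formula, the 1-to-5 scales, and all names (`Finding`, `risk_score`, `triage`) are assumptions made for this example, not an established scoring method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    asset: str
    severity: int      # hypothetical 1-5 scale for the vulnerability itself
    criticality: int   # hypothetical 1-5 scale for the asset's role in operations
    exposed: bool      # reachable from a less-trusted network zone?
    patchable: bool    # can a patch be applied without unacceptable downtime?


def risk_score(finding: Finding) -> int:
    """Illustrative score: severity times criticality, doubled if exposed."""
    return finding.severity * finding.criticality * (2 if finding.exposed else 1)


def triage(findings: list[Finding]) -> list[tuple[str, str]]:
    """Rank findings by descending score and choose a treatment for each."""
    ranked = sorted(findings, key=risk_score, reverse=True)
    return [
        (f.asset, "patch" if f.patchable else "compensating control")
        for f in ranked
    ]


findings = [
    Finding("historian", severity=2, criticality=2, exposed=False, patchable=True),
    Finding("legacy-plc", severity=5, criticality=5, exposed=False, patchable=False),
    Finding("eng-workstation", severity=3, criticality=4, exposed=True, patchable=True),
]
plan = triage(findings)
```

In this sketch the unpatchable legacy controller still ranks first; it simply receives a compensating control rather than a patch.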

The first four of these are addressed in most cybersecurity programs. They provide the asset visibility and anomaly detection that make early identification of intrusions possible. But detection is only one part of the resilience equation, and in most programs it is the part that gets funded. The rest depends on tested recovery capability, and this is where programs most commonly fall short.

Recovery requires offline configuration backups, tested and validated restoration procedures, defined recovery time and point objectives for each system criticality tier, and a team that has rehearsed the playbook under realistic conditions. Most asset owners have some of these, but few have all of them, and fewer still have tested the end-to-end processes.
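One way to make these requirements checkable is to record recovery objectives per criticality tier and compare them with the results of restoration drills. The following is a minimal sketch; the tier names, the target values, and the `DrillResult` fields are hypothetical and would need to come from an organization's own risk assessment.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class RecoveryObjective:
    rto: timedelta  # recovery time objective: longest tolerable restoration time
    rpo: timedelta  # recovery point objective: oldest tolerable backup


# Hypothetical tiers and targets, for illustration only.
OBJECTIVES = {
    "safety-critical": RecoveryObjective(rto=timedelta(hours=4), rpo=timedelta(hours=24)),
    "production": RecoveryObjective(rto=timedelta(hours=24), rpo=timedelta(days=7)),
    "support": RecoveryObjective(rto=timedelta(days=3), rpo=timedelta(days=30)),
}


@dataclass(frozen=True)
class DrillResult:
    system: str
    tier: str
    restore_duration: Optional[timedelta]  # None means restoration was never tested
    backup_age: timedelta
    offline_copy: bool


def readiness_gaps(result: DrillResult) -> list[str]:
    """Return the recovery gaps found for one system (empty list means none)."""
    objective = OBJECTIVES[result.tier]
    gaps = []
    if not result.offline_copy:
        gaps.append("no offline backup copy")
    if result.restore_duration is None:
        gaps.append("restoration never tested")
    elif result.restore_duration > objective.rto:
        gaps.append("measured restore time exceeds RTO")
    if result.backup_age > objective.rpo:
        gaps.append("backup older than RPO")
    return gaps


line_plc = DrillResult("plc-line-1", "production", timedelta(hours=30), timedelta(days=2), True)
old_hmi = DrillResult("hmi-2", "support", None, timedelta(days=45), False)
```

Here `readiness_gaps(line_plc)` reports only the missed restore time, while `readiness_gaps(old_hmi)` reports three gaps. The untested system is the one that looks fine on paper until someone asks for evidence.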

You can buy detection. Recovery has to be earned.

It’s more than technology

While technologies such as firewalls, intrusion detection systems, or backup solutions are important, resilience is as much about organizational capability as technical capability. A resilient organization combines people, processes, and technology to maintain operational effectiveness during adverse conditions.

Consider an industrial facility that experiences a ransomware attack affecting engineering workstations. The technology controls may help contain the malware, but the organization’s response will ultimately depend on factors such as whether:

  • personnel understand their roles during an incident,
  • communication channels remain functional,
  • backup procedures have been tested,
  • management can make timely decisions, and
  • operations personnel can safely continue production.

The ability to address these factors can determine whether an incident becomes a minor disruption or a major crisis.

Technology

Resilience must be addressed long before an incident occurs. System architecture decisions can significantly influence how well an organization withstands cyber disruptions. Resilient architectures must incorporate features such as segmentation, redundancy, secure remote access, backup communication paths, and clearly defined trust boundaries. The goal is not necessarily to eliminate every failure, but to ensure that failures remain manageable. Resilient OT architectures assume that incidents will occur and seek to prevent localized problems from becoming enterprise-wide disruptions.
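Segmentation and clearly defined trust boundaries can be expressed as an explicit allow-list of permitted paths between zones, against which observed traffic is checked. The sketch below assumes a simple four-zone layout; the zone names and the permitted pairs are illustrative, not a reference architecture.

```python
# Hypothetical zones and permitted zone-to-zone paths (source, destination).
# Anything not listed is treated as crossing a trust boundary without approval.
ALLOWED_CONDUITS = {
    ("enterprise", "dmz"),
    ("dmz", "enterprise"),
    ("dmz", "supervisory"),
    ("supervisory", "dmz"),
    ("supervisory", "control"),
    ("control", "supervisory"),
}


def is_flow_allowed(src_zone: str, dst_zone: str) -> bool:
    """Traffic within a zone is allowed; cross-zone traffic needs a listed conduit."""
    if src_zone == dst_zone:
        return True
    return (src_zone, dst_zone) in ALLOWED_CONDUITS


def violations(flows: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return the observed flows that bypass the defined trust boundaries."""
    return [flow for flow in flows if not is_flow_allowed(*flow)]


observed = [
    ("enterprise", "dmz"),
    ("enterprise", "control"),   # skips the DMZ and supervisory zones
    ("control", "control"),
    ("control", "enterprise"),   # direct path out of the control zone
]
bad_flows = violations(observed)
```

Keeping the list small and explicit is the point: a localized compromise in the enterprise zone then has no direct path to the control zone.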

People and process

In many cases, organizational preparedness matters as much as technical preparedness. Technology may enable resilience, but people ultimately deliver it. Operators, engineers, maintenance personnel, cybersecurity professionals, and leadership teams all play critical roles during a cyber event. Their ability to collaborate effectively often determines the outcome.

Mature organizations invest heavily in training, exercises, and cross-functional planning. Cybersecurity incidents rarely fit neatly within organizational boundaries. Successful response requires coordination across disciplines that may not routinely work together. Organizations that conduct tabletop exercises frequently discover that communication gaps, unclear responsibilities, and decision-making bottlenecks present greater risks than technical vulnerabilities.

Resilience is strengthened when people understand not only their own responsibilities but also how their actions affect the broader operation.

Resilience in practice

The value of resilience is most obvious when preventive measures fail. In nearly every major incident, asset owners find that the decisive factor was not whether attackers gained access, but how effectively the organization responded after access was achieved. This is why leading industrial organizations increasingly view cybersecurity as an operational resilience challenge rather than solely a security challenge.

Several well-known incidents demonstrate this:

  • Triton / Trisis (2017) – Attackers targeted a safety instrumented system (SIS) at a petrochemical facility. The attack triggered abnormal conditions that led to an operational shutdown before physical consequences occurred. The facility had multiple layers of protection, and the operation of its independent safety mechanisms was the clearest demonstration of the value of resilience.
  • NotPetya (2017) – Several major manufacturers experienced significant operational outages, with some facilities temporarily losing production capability because business and operational systems became unavailable simultaneously. Better resilience might have limited propagation across sites, reduced restoration time from weeks to days, protected critical production environments from enterprise disruptions, or reduced financial losses.
  • Colonial Pipeline (2021) – The important lesson was that a compromise of business systems resulted in the shutdown of a critical physical operation. Improved resilience might have reduced the duration of the shutdown, reduced fuel supply disruptions, increased confidence in operational continuity, or reduced economic and reputational impacts.

Final thoughts

Recovery is a competitive advantage. While prevention is often the primary focus, recovery is where resilience is truly demonstrated. The ability to restore operations quickly results in less downtime, lower financial losses, and reduced operational risk. The asset owner is also better positioned to maintain customer confidence and regulatory trust.

Recovery procedures must be documented and practiced. Periodic testing under realistic conditions is critical. Just as emergency shutdown systems are routinely tested, cybersecurity recovery capabilities should be exercised before they are needed. When a major incident occurs, it is too late to discover that backups are incomplete or recovery procedures are outdated.

Resilience requires a broader perspective, one that acknowledges that cybersecurity is not simply about keeping adversaries out of networks. It is about ensuring that industrial operations can continue safely, reliably, and effectively in the face of uncertainty. Cybersecurity is an operational capability, a leadership responsibility, and increasingly a business imperative. In an era of expanding connectivity and evolving cyber threats, resilience may ultimately become the most important measure of cybersecurity success.

 

