In-Depth

Cloud Resiliency Expert Dives Deep into Chaos Engineering and Chaos Monkey

Modern cloud environments are built on distributed services, APIs, cloud provider dependencies, identity systems, containers and automation layers that can fail in ways that are difficult to predict. For IT teams responsible for resilience, the challenge is not just recovering from outages or attacks after they happen, but understanding how systems and response teams behave under stress before a real incident puts business operations, customer trust and compliance obligations at risk.

At today's online Cyber Resilience for Cloud-Native Infrastructure Summit hosted by Virtualization & Cloud Review , longtime technology author Brien Posey focused much of his session on chaos engineering, including its Netflix origins and that company's Chaos Monkey tool and the operational risks of testing resilience by intentionally creating failures.

"The ultimate purpose of chaos engineering is not to create failure, it's to reduce the fear of failure."

Brien Posey, Microsoft MVP

Posey's session, "Expert Strategies: Building Resilient Incident Response for Modern Cloud Environments," was part of the Rubrik-sponsored virtual event, which focused on cloud security posture, hybrid and multicloud resilience, automation, orchestration and proactive recovery planning. It is being made available for on-demand replay at the link above.

Chaos Engineering and Security Resilience
[Click on image for larger view.] Chaos Engineering and Security Resilience (source: Brien Posey).

Posey defined chaos engineering as the practice of deliberately introducing failures into systems to see what breaks, how teams respond and whether recovery mechanisms work as intended. He said the approach is based on the premise that systems will fail eventually, so organizations should practice handling those failures before they occur during real incidents.

"Doing chaos engineering experiments builds real world skills that you just really can't get anywhere else," Posey said during the Q&A, when asked why organizations should run such tests if they can be dangerous.

Posey also tied chaos engineering to a broader incident response theme: cloud incidents often become business incidents. He noted that outages and breaches can affect customer trust, revenue, compliance, operations and reputation, and that response teams may need participation from engineering, security, legal, communications, executive leadership, vendors and cloud providers.

From Netflix to Chaos Monkey
Posey traced the rise of chaos engineering to Netflix, which he said faced growing pains as it scaled a large distributed cloud infrastructure. Servers crashed, networks failed, regions became unstable and dependencies timed out, he said. Netflix's conclusion, according to Posey, was that engineers should not encounter those failure modes for the first time during a live production outage.

What Is Chaos Monkey?
[Click on image for larger view.] What Is Chaos Monkey? (source: Brien Posey).

That led to Chaos Monkey, a Netflix-created tool that Posey described as an automated program designed to implement failures and expose fragility inside the company's infrastructure.

Posey described Chaos Monkey during the Q&A as a Netflix-specific tool rather than a product organizations can simply buy and deploy. Netflix publishes Chaos Monkey as an open source project, but its documentation says this version requires Spinnaker to manage applications and a MySQL-compatible database, and that deployment involves manual build-and-deploy steps.

Posey said Chaos Monkey was designed to terminate production servers and instances so engineers would design systems that could survive infrastructure failure automatically. The point, he said, was not to randomly destroy systems, but to test whether the architecture was resilient enough to withstand component failure.

He also distinguished Chaos Monkey-style testing from unmanaged disruption. Chaos experiments, he said, should be scoped, conducted with engineering awareness and performed in controlled ways. In his view, the goal is confidence building, not destruction.

Why Chaos Engineering Can Backfire
Posey repeatedly warned that chaos engineering is not a starting point for organizations that have not built the necessary operational foundations. Before advanced experiments, he said, teams need strong observability, rollback capability, incident response procedures, automation, reliable monitoring and executive buy-in.

Maturity Prerequisites
[Click on image for larger view.] Maturity Prerequisites (source: Brien Posey).

He emphasized that chaos engineering exposes weakness but does not fix it automatically. Teams still have to analyze the results and make the environment less fragile after the experiment ends.

Posey also warned that security-oriented chaos engineering can carry additional risk. Examples include expiring certificates, revoking credentials, simulating identity and access management compromise, disabling security information and event management pipelines or creating ransomware-like conditions. Those activities can confuse responders, interfere with active incidents or create real attack opportunities if they are handled poorly.

Another risk, Posey said, is what happens when a real incident occurs during a chaos test. Responders may have to determine whether abnormal behavior is expected from the experiment or is evidence of a genuine failure. He called that a "signal collision problem," because the test and the real incident can produce overlapping or conflicting symptoms.

The Biggest Risk: Turning Experiments Into Outages
[Click on image for larger view.] The Biggest Risk: Turning Experiments Into Outages (source: Brien Posey).

Posey said poor implementation can turn chaos engineering into "an incident generator" in its own right. The biggest risk is accidentally turning experiments into outages, especially if tests are poorly scoped, insufficiently monitored or run against fragile systems.

He cited cascading failure as a particular concern in cloud environments. A team may disable one component, only to discover that other services depend on it. The initial failure can then trigger retries, resource exhaustion, database overload, API throttling or other downstream effects.

Start Small, Control the Blast Radius
Posey recommended that organizations start with lab environments, then move gradually to low-risk systems and small, reversible failures. He said organizations should measure carefully, expand gradually and control the blast radius so experiments do not affect broader systems, users or business operations.

"The important thing is to take it slow," Posey said. "Measure everything carefully, expand very gradually, and definitely the most important point on this slide, be mindful of blast radius control, so as to limit the systems affected."

He also recommended building custom automation rather than assuming a third-party tool will behave correctly in every environment. In the Q&A, he said each environment is unique and a tool that works in one environment may not produce the same results in another. He suggested PowerShell as one possible option for building controlled testing tools because of its visibility into operating systems, including non-Windows systems where PowerShell is available.

Best Practices Before Running Experiments
Posey's best-practice guidance centered on preparation and control. He said every experiment should have a kill switch. Teams should define abort conditions before the test starts, such as latency thresholds, error-rate spikes, dependency instability, unrelated alerts or changes in cloud provider health.

Define Clear Abort Conditions Before Starting
[Click on image for larger view.] Define Clear Abort Conditions Before Starting (source: Brien Posey).

He also recommended maintaining experiment visibility so participants know when a test is running, which systems are affected, which symptoms are expected, what the timing is, how large the blast radius could be and how rollback will work. Experiment telemetry should be tagged so alerts generated by a test can be distinguished from unrelated anomalies.

Other recommendations included avoiding experiments during elevated-risk periods such as product launches, peak traffic, migrations, active incidents, severe weather events or periods of cloud provider instability. Posey said teams also need clear incident command authority so someone has explicit power to terminate an experiment and redirect responders if a real incident occurs.

Posey also stressed observability. Teams should have centralized logging, tracing, metrics, dependency mapping and alert correlation before injecting failure into production systems. "You never ever ever want to inject failure into a system that you can't observe," he said.

He recommended game-day exercises before production chaos testing, comparing the approach to rehearsals he experienced while testing commercial spacesuits in zero gravity. Those simulations, he said, can help teams practice communication, escalation, rollback and decision-making before committing to a live experiment.

The Rest of the Session
While chaos engineering was the central focus, Posey also covered related incident response practices. He said documentation alone is not enough if runbooks are not read, tabletop exercises are not updated and playbooks fail under pressure. He also discussed purple team exercises, in which attack and defense teams collaborate rather than operate strictly as adversaries, and he recommended simulated outages and realistic incident response practice.

In the Q&A, Posey said one practical first step for improving incident response across hybrid and multicloud environments is to formalize the incident response procedure and identify who is responsible for what. Asked which metrics teams should track to assess whether cloud resilience and recovery strategies are improving, he named mean time to resolution as one useful measure, while noting that the answer depends on the environment, workloads and tests being run.

Posey closed by framing chaos engineering as a staged maturity model. Early tests might involve killing servers or restarting containers. More mature programs can move into dependency failures such as latency injection, API failures and DNS disruptions, then into regional failure testing and, eventually, organizational scenarios involving simultaneous incidents, conflicting telemetry, communication breakdowns and operational ambiguity.

"The ultimate purpose of chaos engineering is not to create failure, it's to reduce the fear of failure," Posey said. "Organizations become resilient when they practice uncertainty before the uncertainty becomes real."

And More
While replays are convenient and informative -- especially up-to-date sessions that just concluded -- attending live events offers advantages, including the ability to ask specific implementation questions and receive guidance in real time. With that in mind, here are some upcoming online webcasts from Virtualization & Cloud Review:

Featured

Subscribe on YouTube