Joyent Explains How 'Fat-Finger' Typo Crashed Its Cloud -- Virtualization Review

By David Ramel
05/29/2014

As promised, cloud provider Joyent Inc. has published a full explanation of how its eastern U.S. data center crashed Tuesday, knocking some customers offline for more than two hours.

The company said the outage was caused by a series of events and circumstances triggered by an operator's typo while issuing a command to reboot a set of selected systems after an upgrade. Joyent's four data centers all run on the company's own SmartDataCenter product.

"The command to reboot the select set of new systems that needed to be updated was mis-typed, and instead specified all servers in the datacenter," Joyent explained in a blog post yesterday. "Unfortunately the tool in question does not have enough input validation to prevent this from happening without extra steps/confirmation, and went ahead and issued a reboot command to every server in us-east-1 availability zone without delay."

Systems came back online piecemeal. While downtime for customers ranged from 20 minutes to 2.5 hours, 90 percent were back online in less than an hour. "The extended API outage time was due to the need for manual recovery of stateful components in the control plane," Joyent said in its extremely detailed statement. "While the system is designed to handle 2F+1 failures of any stateful system, rebooting the data center resulted in complete failure of all these components, and they did not maintain enough history to bring themselves online. This is partially by design, as we would rather systems be unavailable than come up 'split brain' and suffer data loss as a result. That said, we have identified several ways we can make this recovery much faster."
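
The "2F+1" in Joyent's statement echoes the usual majority-quorum arithmetic, in which a cluster of 2F+1 members can lose F of them and still make progress. What follows is a minimal sketch of that arithmetic, assuming simple majority quorums. The statement does not say which protocol Joyent's control plane uses, and the function names here are illustrative only.

```python
def quorum_size(members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """Largest F such that the cluster still has 2F+1 or more members."""
    return (members - 1) // 2


def has_quorum(members: int, alive: int) -> bool:
    """True if enough members are up to form a majority."""
    return alive >= quorum_size(members)


# A five-member cluster (F = 2) survives two failures...
print(has_quorum(5, alive=3))
# ...but a full data center reboot leaves no members up, so no quorum exists.
print(has_quorum(5, alive=0))
```

With every member down at once, no majority holds the latest state, which is consistent with Joyent's description of components that had to be recovered manually rather than risk coming up "split brain."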

Folks using beta of Cage, Joyent is having some downtime at the moment, but they're on it: https://t.co/VA5DD7KUc2
— Cage (@cageapp) May 27, 2014

Some machines came back online later because of a known, transient bug in a legacy network card driver that causes reboots to fail about 10 percent of the time. The company said it will consider speeding up the migration of customers from legacy systems to newer equipment. That is the third step of a three-part effort to prevent the failure from recurring and to ensure faster recovery from any future downtime.

"First, we will be dramatically improving the tooling that humans (and systems) interact with such that input validation is much more strict and will not allow for all servers, and control plane servers to be rebooted simultaneously," the company said. "Secondly, we are determining what extra steps in the control plane recovery can be done such that we can safely reboot all nodes simultaneously without waiting for operator intervention. We will not be able to serve requests during a complete outage, but we will ensure that we can record state in each node such that we can recover without human intervention."
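
The stricter input validation Joyent describes might look something like the sketch below. It is not Joyent's tooling: the `Server` type, the `validate_reboot` function, the `confirm_above` threshold and the error class are all hypothetical names chosen for illustration. The guard refuses a selection that covers every server or the entire control plane, and it demands explicit confirmation for any large selection.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Server:
    """Hypothetical inventory record; not part of any Joyent API."""
    name: str
    control_plane: bool = False


class UnsafeRebootError(ValueError):
    """Raised when a reboot request is too broad to run as typed."""


def validate_reboot(targets: Iterable[Server], inventory: Iterable[Server],
                    confirm_above: float = 0.25,
                    confirmed: bool = False) -> List[Server]:
    """Return the servers to reboot, or raise if the request looks unsafe."""
    target_set = set(targets)
    inventory_set = set(inventory)

    if not target_set:
        raise UnsafeRebootError("no servers selected")
    if not target_set <= inventory_set:
        raise UnsafeRebootError("selection contains unknown servers")
    # Never reboot the whole data center in one command.
    if target_set == inventory_set:
        raise UnsafeRebootError("refusing to reboot every server at once")
    # Never take down the entire control plane in one command.
    control = {s for s in inventory_set if s.control_plane}
    if control and control <= target_set:
        raise UnsafeRebootError("refusing to reboot the whole control plane at once")
    # Large selections need an explicit extra confirmation step.
    if len(target_set) / len(inventory_set) > confirm_above and not confirmed:
        raise UnsafeRebootError("large selection requires explicit confirmation")
    return sorted(target_set, key=lambda s: s.name)
```

Under these assumptions, a mistyped selector that expands to the full inventory fails before any reboot command is issued, instead of running "without delay."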

After the outage, company CTO Bryan Cantrill posted an initial explanation of the incident on Hacker News. The post garnered 126 comments, one of which stated: "Back in the day we used to say there are two types of network engineers, those that have dropped a backbone and those that will drop a backbone."

Several commenters and other social media posts expressed concern for the operator who made the mistake, but Cantrill said in one article that the operator would not be punished. The person was already "mortified," he said, and nothing the company could do would make the operator feel worse -- which it wouldn't want to try to do anyway.