Server Reboot Crashes Joyent Cloud -- Virtualization Review

The Cloud Report

You can implement all the security precautions you want, but data center and cloud outages are often just accidents from within, as happened yesterday when an operator's "fat finger" brought down a data center operated by cloud provider Joyent Inc.

"Due to an operator error, all compute nodes in us-east-1 were simultaneously rebooted," the company said in a support notification. The problem was apparently cleared up in about an hour, with the company providing a 6:45 p.m. EDT update that all compute nodes and virtual machines were back online.

The "high-performance cloud infrastructure and Big Data analytics company" operates three data centers in the U.S., including the one that went down in northern Virginia. The others are in the Bay Area and Las Vegas. The company lists nearly 30 corporate customers on its Web site, including ModCloth, Voxer, Wanelo, Quizlet and Digital Chocolate.

"It should go without saying that we're mortified by this," said Joyent CTO Bryan Cantrill in a post on Hacker News. "While the immediate cause was operator error, there are broader systemic issues that allowed a fat finger to take down a data center."

Although the exact duration of the outage wasn't specified, recovery apparently took longer than hoped. "Some compute nodes are already back up, but due to very high load on the control plane, this is taking some time," the initial support notification said.

Cantrill apparently wasn't happy with the recovery process.

"As soon as we reasonably can, we will be providing a full postmortem of this: how this was architecturally possible, what exactly happened, how the system recovered, and what improvements we are/will be making to both the software and to operational procedures to assure that this doesn't happen in the future (and that the recovery is smoother for failure modes of similar scope)," he said in his Hacker News post.

Yesterday's outage wasn't the company's first. One Hacker News poster complained that a Joyent server hosting some of his Web sites and mail servers was down for two days in February 2011.

Despite all the attention paid since then to high availability, redundancy and failover systems, notable cloud outages continue to occur. The Joyent outage happened less than two weeks after Adobe Systems Inc.'s Creative Cloud services were down for about a day because of a database maintenance error, leaving designers unable to access their tools (Adobe had announced the previous year it was discontinuing packaged software or downloaded programs in favor of cloud-only services).

And although neither that incident nor the Joyent outage was as serious as well-publicized security breaches (such as the one Adobe experienced last October, when 38 million accounts were compromised), such outages continue to frustrate users and damage the reputations of individual cloud providers and the cloud-hosting industry in general.

For example, in February of last year, Microsoft's Azure cloud platform suffered a worldwide storage outage attributed to an expired SSL certificate. Some eight months later, another Azure outage of more than 20 hours was attributed to a system subcomponent.

And just last week, Microsoft experienced a less severe issue in which some customers of its compute cloud service apparently had access problems, according to information on the company's Azure Service Dashboard. No cause was given for the incident, described as a "performance degradation" rather than a service interruption, but it suggests that problems of all kinds may be going unreported in mainstream media.

We'll wait to get more details on the Joyent outage and the steps that can be taken to prevent such problems, but outages like these will certainly continue. As one commenter on Hacker News said about Cantrill's post: "[Stuff] happens. You deal with it, then do what you can to keep it from happening again."

Then some more stuff happens.

Posted by David Ramel on 05/28/2014 at 9:44 AM