News

Amazon Provides Explanation for Cloud Outage in Dublin Last Week
In a detailed post-mortem that included an apology, Amazon Web Services explained the causes of last week's massive cloud outage at its Dublin data center.
The failure was originally attributed to a lightning strike, but Amazon said it is not clear what caused the failure of a transformer that led to a power outage in the data center on Aug. 7. In any case, the subsequent malfunction of a programmable logic controller (PLC), which is designed to ensure synchronization between generators, prevented the cutover to a backup generator.
It all went downhill from there. Without utility power, and with the backup generators disabled, there wasn't enough power for all the servers in the Availability Zone to continue operating, Amazon said. The uninterruptible power supplies (UPSs) also quickly drained, resulting in power loss to most of the EC2 instances and 58 percent of the Elastic Block Storage (EBS) volumes in the Availability Zone.
Power was also lost to the EC2 networking gear that connects the Availability Zone to the Internet and to other Amazon Availability Zones. That caused further connectivity problems, and customers received errors when they targeted API requests at the impacted Availability Zone.
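Amazon's post-mortem doesn't say how customers coped with those errors, but a common client-side mitigation for this kind of zone-level failure is to retry API calls with exponential backoff and, where possible, fall back to a different Availability Zone. The following is only a minimal, hypothetical sketch in Python using the boto3 SDK (which isn't mentioned in the article and post-dates the incident); the AMI ID, instance type and zone names are placeholders.

```python
import time

import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="eu-west-1")

def launch_with_fallback(ami_id, zones=("eu-west-1a", "eu-west-1b"), retries=4):
    """Try to launch an instance, backing off on errors and moving to the
    next Availability Zone if one keeps returning failures."""
    for zone in zones:
        delay = 1
        for attempt in range(retries):
            try:
                resp = ec2.run_instances(
                    ImageId=ami_id,             # placeholder AMI ID
                    InstanceType="m1.small",    # example instance type
                    MinCount=1,
                    MaxCount=1,
                    Placement={"AvailabilityZone": zone},
                )
                return resp["Instances"][0]["InstanceId"]
            except ClientError as err:
                print(f"{zone}, attempt {attempt + 1}: {err}")
                time.sleep(delay)
                delay *= 2  # exponential backoff before the next try
    raise RuntimeError("All zones and retries exhausted")
```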
Ultimately, Amazon was able to bring some of the backup generators online manually, which restored power to many of the EC2 instances and EBS volumes, but it took longer to restore power to the networking devices.
Restoration of EBS took longer due to the atypically large number of EBS volumes that lost power. There wasn't enough spare capacity to support re-mirroring, Amazon said. That required the company to truck in more servers, which was a logistical challenge because it was nighttime.
Another problem: When EC2 instances and all nodes containing EBS volume replicas lost power at the same time, Amazon said it couldn't verify that all of the writes to all of the nodes were "completely consistent." That being the case, it assumed those volumes were in an inconsistent state, even though they may have actually been consistent.
"Bringing a volume back in an inconsistent state without the  customer being aware could cause undetectable, latent data corruption issues  which could trigger a serious impact later," Amazon said.  "For the volumes we assumed were inconsistent,  we produced a recovery snapshot to enable customers to create a new volume and  check its consistency before trying to use it. The process of producing  recovery snapshots was time-consuming because we had to first copy all of the  data from each node to Amazon Simple Storage Service (Amazon S3), process that  data to turn it into the snapshot storage format, and re-copy the data to make  it accessible from a customer's account. Many of the volumes contained a lot of  data (EBS volumes can hold as much as 1 TB per volume)."
It took until Aug. 10 to make 98 percent of the recovery snapshots available, Amazon said, with the remaining ones requiring manual intervention. The power outage also had a significant impact on Amazon's Relational Database Service (RDS).
Furthermore, Amazon engineers discovered a bug in the EBS software, unrelated to the power outage, that affected the cleanup of snapshots.
So what is Amazon going to do to prevent a repeat of last  week's events? 
For one, the company is planning to add redundancy and greater isolation to its PLCs "so they are insulated from other failures." Amazon said it is working with its vendors to deploy isolated backup PLCs. "We will deploy this as rapidly as possible," the company said.
Amazon also said it will implement better load balancing to take failed API management hosts out of production. And for EBS, the company said it will "drastically reduce the long recovery time required to recover stuck or inconsistent EBS volumes" during a major disruption.
During Amazon's last major outage in late April, the company received a lot of heat for its poor communication. "Based on prior customer feedback, we communicated more frequently during this event on our Service Health Dashboard than we had in other prior events, we had evangelists tweet links to key early dashboard updates, we staffed up our AWS support team to handle much higher forum and premium support contacts, and we tried to give an approximate time-frame early on for when the people with extra-long delays could expect to start seeing recovery," the company said.
For those awaiting recovery of snapshots, Amazon said it did not know how long the process would take "or we would have shared it." To improve communications, Amazon indicated it will expedite the staffing of the support team in the early hours of an event and will aim to make it easier for customers and Amazon to determine if their resources have been impacted.
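Amazon didn't say what form that impact reporting will take, but customers can already approximate it themselves by listing the EC2 and EBS resources they have in the affected Availability Zone. Here is a small sketch, again assuming the boto3 SDK and a hypothetical zone name (the post-mortem doesn't identify the zone).

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")
AFFECTED_ZONE = "eu-west-1a"  # hypothetical; the affected zone isn't named

# EBS volumes in the affected zone, with their current state.
volumes = ec2.describe_volumes(
    Filters=[{"Name": "availability-zone", "Values": [AFFECTED_ZONE]}]
)
for vol in volumes["Volumes"]:
    print(vol["VolumeId"], vol["State"])

# EC2 instances in the affected zone, with their current state.
reservations = ec2.describe_instances(
    Filters=[{"Name": "availability-zone", "Values": [AFFECTED_ZONE]}]
)
for res in reservations["Reservations"]:
    for inst in res["Instances"]:
        print(inst["InstanceId"], inst["State"]["Name"])
```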
Amazon said it will give affected customers a 10-day credit equal to 100 percent of their usage of Elastic Block Storage volumes, EC2 instances and RDS database instances that were running in the affected Availability Zone in the Dublin data center.
Moreover, customers impacted by the EBS software bug that deleted blocks in their snapshots will receive a 30-day credit for 100 percent of their EBS usage in the Dublin region. Those customers will also have access to the company's Premium Support Engineers if they still require help recovering from the outage, Amazon said.

About the Author
                    Jeffrey Schwartz is editor of Redmond magazine and also covers cloud computing for Virtualization Review's Cloud Report. In addition, he writes the Channeling the Cloud column for Redmond Channel Partner. Follow him on Twitter @JeffreySchwartz.