News

Microsoft: Pandemic Traffic Caused Azure Outage

Microsoft last week confirmed that a March outage affecting Azure cloud services in Europe and the United Kingdom was caused by the COVID-19 pandemic.

Cloud computing platforms and services have become instrumental in the face of a huge surge in remote work and stay-at-home directives that result in people connecting with services such as Microsoft Teams, Skype and so on.

For the most part, the platforms have held up well under the new strains, but the Azure cloud experienced a hiccup on March 24.

Azure Pipelines, which provides CI/CD DevOps functionality, suffered a "mid-week slowdown as tech world goes remote," DevClass reported on March 26. Earlier in the month, Microsoft Teams suffered some issues, mainly in Europe, ZDNet reported.

"We have seen a 775 percent increase in Teams' calling and meeting monthly users in a one month period in Italy, where social distancing or shelter in place orders have been enforced," Microsoft said in a March 28 update on cloud services continuity. What's more, "We have seen a very significant spike in Teams usage, and now have more than 44 million daily users. Those users generated over 900 million meeting and calling minutes on Teams daily in a single week."

The Number of Impacted Organizations During Outage (source: Microsoft).

Last week, Microsoft shed more light on the March 24 European outage in a "post mortem" titled "Hosted Pools Availability Degradation."

"From March 24th - 26th, 2020 many customers in Europe and the United Kingdom experienced delays in their builds and releases targeting our hosted Windows and Linux agents. This incident was caused by VM capacity constraints arising from the global health pandemic that led to increased machine reimage times and then increased wait times for available agents. Many customers experienced significant delays in their pipelines over multiple days. We sincerely apologize for the impact of this incident."

The company also acknowledged "our communication during this incident was also problematic."

Microsoft last month said the Azure platform is prioritizing new cloud computing capacity that was added in response.

"Our top priority remains support for critical health and safety organizations and ensuring remote workers stay up and running with the core functionality of Teams," the company said. Azure is providing the highest level of monitoring for the following:

  • First Responders (fire, EMS, and police dispatch systems)
  • Emergency routing and reporting applications
  • Medical supply management and delivery systems
  • Applications to alert emergency response teams for accidents, fires, and other issues
  • Healthbots, health screening applications, and websites
  • Health management applications and record systems

Microsoft is also limiting free offers and some resources for new subscriptions.

Last week, after an extremely detailed explanation of the technical factors underlying the outage, Microsoft outlined the steps it's taking "To avoid incidents like this in the future":

  • We are improving our monitoring and alerting by keeping a closer watch on compute allocation failure rates and agent reimage times. This supplements the monitoring and alerting we already had in place for agent availability and pipeline wait times.
  • If we notice impending issues, we will quickly expand our use of ephemeral OS disks for Linux agents, nested virtualization for Windows agents, or both. In the absence of issues, we will continue rolling these changes out slowly, following our normal safe deployment practices.
  • We are improving our live-site processes to ensure that initial communication of pipeline delay incidents happens on the same schedule as other incident types.

"Again, we understand the impact that these types of pipeline delays have, especially in combination with a delay in incident communication, and we sincerely apologize for the incident," concluded Chad Kimes, Director of Engineering, who wrote last week's post mortem.

Last month, Microsoft said users can consult the following resources for more information on how Azure is holding up under the new strains:

About the Author

David Ramel is an editor and writer at Converge 360.
