The Cranky Admin

How To Survive the Next Amazonpocalypse

There are numerous cloud options out there. Learn what they are.

In case you somehow slept through Feb. 28, 2017, Amazon's public cloud had a major outage that succeeded in "breaking" large portions of the Internet. Breathless reporting from all points has concentrated on pointing fingers or tallying the list of those affected. Here at Virtualization Review, however, we focus on more practical concerns affecting everyday IT practitioners. Practical concerns, like "I told you so."

If you're reading this blog on Virtualization Review, there's a pretty good chance you know your hypervisors from your bare metal and your firewalls from your reverse proxies. I don't need to tell any of you why Amazon's outage happened, or even that it was inevitable. You all knew that already. You are the ones saying "I told you so."

Our bosses turn to us to keep IT running. We in turn rely on hardware and software vendors, service providers and public cloud providers. Our job is to find the right mix to meet the needs we're confronted with. Sometimes we don't call it right, and that's on us. All too often, however, someone higher up the food chain overrules us, and that's on them.

You Can't Always Get What You Want
The problem with saying "I told you so" is that it doesn't really help you get what you want. Rubbing some empty suit's nose in the fact that they decided to put all their eggs in a basket run by a company that built a pan-global empire out of grinding its suppliers down to the lowest possible bidder will just make said suit defensive. Nobody likes being confronted with their mistakes.

There are alternative approaches to making the infrastructure under our care more resilient, even if they're not nearly as satisfying. The first thing we should all be doing is gathering banners for the next battle. There are excellent post-Amazonpocalypse analyses, such as this one by the inestimable Dan Kusnetzky, who has weighed in on disaster planning aspects. Curating this sort of work from across the Web will help us make our case the next time some pointy-haired boss Dilberts us with a "why isn't this in the cloud yet?"

Of course, the world isn't so simple that the answer to Amazon's outage is some sort of knee-jerk "the cloud is bad." Yes, an outage occurred, but all the same reasons why the cloud was worth consideration in the first place still apply. We do, however, have to change the questions we ask, and of whom.

Assessing the Options
If you want to build a resilient and highly available IT solution, there are three main routes to victory. You can: 1) control all aspects of the solution yourself, which typically means controlling at least two physical sites, 2) go the hybrid route where one site is owned and operated by someone else, or 3) fully outsource everything to, for example, a public cloud provider.

Controlling the whole thing yourself is considered "old school" today. That view is actually rational. Unless your organization is large enough to justify having multiple datacenters in different geographical locations, using colocation facilities to house gear for your second site or using a public cloud provider as a disaster recovery location is the most economical and pragmatic approach. Both are plentiful, and you can mix and match to meet your needs.

These hybrid solutions are the new normal. Keep some workloads (and associated data) onsite, run the rest elsewhere, and have backups run across the multiple points of presence. If something goes splork, you can flip the switch, light up the backups and continue on your merry way.

Of course, public cloud providers claim to offer all of this -- and more -- within their own infrastructure. The big four have multiple regions, numerous Infrastructure-as-a-Service (IaaS) and Platform-as-a-Service (PaaS) options, backups, snapshots and so on. As Amazon demonstrated so ably, however, if the control plane goes out underneath it all, things can break so badly that even Amazon can't get into its own dashboard to update outages.

This brings us to the world of cloud brokers. The chances of one public cloud provider having an outage aren't very high. The chances of multiple public cloud providers having outages at the same time could politely be called statistically unlikely. The chances that major public cloud providers and regional service provider offerings all go down together rest in the "well, the world is pretty much taking that day off anyway, so why worry about it" realm.
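
The arithmetic behind that claim can be sketched in a few lines of Python. The availability figures below are illustrative assumptions, not published numbers for any provider, and the calculation assumes outages are independent of one another, which shared dependencies (a common DNS service, say) can violate.

```python
from math import prod

HOURS_PER_YEAR = 24 * 365


def joint_outage_probability(availabilities):
    """Chance that every listed provider is down at once.

    Assumes outages are statistically independent, which is an
    idealization: shared dependencies make real outages correlate.
    """
    return prod(1.0 - a for a in availabilities)


# Hypothetical availability figures, chosen only for illustration.
scenarios = {
    "one provider": [0.999],
    "two providers": [0.999, 0.999],
    "two providers plus a regional host": [0.999, 0.999, 0.995],
}

for label, availabilities in scenarios.items():
    p = joint_outage_probability(availabilities)
    hours = p * HOURS_PER_YEAR
    print(f"{label}: about {hours:.6f} hours of total outage per year")
```

Each added provider multiplies the joint probability by another small number, which is why the combined figure falls off so quickly. The caveat is that a broker only delivers this benefit if the workload can genuinely run on each provider.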

System administrators have options for resiliency and high availability, both on-site and off. Checking that these have been taken advantage of -- and testing to ensure they continue to be operational -- is where the real challenge lies.

Vetting Virtual Vendor Veracity
With hybrid solutions becoming the norm, some part of our infrastructure moves beyond our control. This could be as simple as trusting a colocation facility to house some of our servers, or it could be full-blown hybrid cloud solutions, such as Microsoft's Azure Stack.

Before engaging these solutions, we need to verify that they do what it says on the tin. We need to ensure that the portions of the solution we don't control are fit for purpose and we need to establish some means of regularly auditing everything to ensure that corners weren't cut when we weren't looking.

Hybrid solutions require that we regularly audit not only the offsite elements, but also the software responsible for making sure that A connects to B and data gets where it's supposed to go. Patches can break things just as easily as a "backhoe vs. fiber optic" incident at a remote service provider can. Trust needs to be earned, not given away freely.

Where things get a lot more difficult for us is Software-as-a-Service (SaaS) providers. Unlike IaaS or PaaS, where we can engage with cloud brokers on our own to mitigate the risks of a public cloud provider outage, SaaS providers merely provide an Internet-delivered application. We don't get the luxury of using a cloud broker with them.

Applications, Not Infrastructure
Here is where it's most important to tread carefully. Business processes are built around applications, not infrastructure. Applications are vital, hard to migrate away from and provide lock-in that can go to the very core of an organization.

Before engaging with a SaaS provider, it's our job as system administrators to carefully vet them. It's our responsibility to make sure that they have taken appropriate steps to make the infrastructure they employ resilient and highly available. It's our job to regularly audit this and to raise the alarm if something changes.

The sad reality is that we won't always be listened to. Vendors using public cloud infrastructure as their backend sell lies about uptime and misleading marketing about reliability. That's life.

What we can do is take opportunities like this recent AWS outage to gather banners and make sure we're prepared for the next argument, should that need arise. We can and should develop objective metrics that we want SaaS providers to meet before we engage with them, and a means of scoring those providers when our counsel is sought before relying on one.
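
One way such a scorecard could look is sketched below in Python. The criteria, weights and the must-have rule are all hypothetical, invented here for illustration; every organization would need to substitute the metrics it actually cares about.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float            # relative importance, set by your organization
    must_have: bool = False  # failing a must-have disqualifies the provider


# Hypothetical criteria and weights, for illustration only.
CRITERIA = [
    Criterion("runs in more than one region or provider", 3.0, must_have=True),
    Criterion("supports data export in an open format", 2.0),
    Criterion("publishes post-incident reports", 1.0),
    Criterion("status page hosted off its own infrastructure", 1.0),
]


def score_provider(answers, criteria=CRITERIA):
    """Return a 0-100 score, or 0.0 if any must-have criterion is unmet.

    `answers` maps a criterion name to True or False; a missing answer
    counts as False, so unverified claims earn nothing.
    """
    total = sum(c.weight for c in criteria)
    earned = 0.0
    for c in criteria:
        met = answers.get(c.name, False)
        if c.must_have and not met:
            return 0.0
        if met:
            earned += c.weight
    return round(100.0 * earned / total, 1)


# Example: a made-up vendor that meets two of the four criteria.
vendor_answers = {
    "runs in more than one region or provider": True,
    "supports data export in an open format": True,
}
print(score_provider(vendor_answers))
```

Treating missing answers as failures is a deliberate choice in this sketch: it puts the burden on the vendor to demonstrate resilience rather than on us to disprove it.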

The Alternative View
Walk into the meeting and tell them that you never want to say "I told you so." Offer alternatives. And if you're large enough, negotiate with the SaaS vendor. It may just be that they themselves weren't aware of how they could improve their offering.

In a world of public cloud solutions, this sober second sight is what we get paid for.

About the Author

Trevor Pott is a full-time nerd from Edmonton, Alberta, Canada. He splits his time between systems administration, technology writing, and consulting. As a consultant he helps Silicon Valley startups better understand systems administrators and how to sell to them.
