The Cranky Admin
IT Monitoring in a Hybrid Cloud World
Tracking what is going where now requires a completely different strategy.
One of the Web sites for which I am responsible is down. Determining why it's down is a bit of a journey. Just 10 years ago, figuring out what had gone wrong, fixing the problem and altering procedures to prevent recurrence would have been relatively easy. Today, however, hybrid IT is the new normal, and solving these sorts of problems can be quite complex.
Ten years ago I had all my clients hosting their Web sites on their own servers. On behalf of my clients, we ran email servers, DNS servers, caching, load balancing, intrusion detection, front-end, database, a box full of crazed squirrels, you name it. None of the datacenters I oversee are large, but at their peak, several of them ran a few thousand workloads.
This was in the days before desired state configuration and the "pets vs. cattle" debate. There were a lot of pets in these datacenters.
Pets and Cattle
A typical datacenter for me had some fibre, a backup VDSL connection, and only a handful of public-facing workloads. Not a lot of upstream, a whole lot of downstream. We'd have redundant air conditioning with outside air, UPSes, compute, storage and networking; the really rich folks would have a generator. Five or six of these would end up being a lot of work for two sysadmins.
As you can imagine, workloads were sent into the public cloud. Web-facing stuff first, because it had a lot of infrastructure "baggage." Ever more mission-critical workloads moved until -- seemingly without anyone noticing -- any given client's workloads were scattered about the continent: some in the on-premises datacenter, some in hosted solutions at our local service provider, and some in the public cloud.
Despite the geographic dispersal of workloads amongst various providers, however, any given client's workloads remained critically conjoined. What was out in the public cloud fed into the on-premises systems, and everything had to be synchronized to the hosted systems for backups. If the wrong bit fell over, everything could go sideways.
Tracing a Web Site Outage
The first problem with my Web site outage is that I didn't notice the outage; a customer did. It is embarrassing, but also an important bit of information. Either my monitoring software couldn't detect the problem, my monitoring software couldn't alert me to the problem, or my monitoring software is also down.
This could be useful additional diagnostic information for me, or a separate fire to put out. I won't know until I'm a little further down the rabbit hole, but it is troubling.
Having spent years with pre-virtualized one-application-per-metal-box workloads, whenever something stops working my first instinct is to look for hardware failure. Today, that would mean seeing if the virtual servers, hosting provider or public cloud had fallen over.
A quick look-see shows that I can connect to all the relevant management portals, and those portals claim all the workloads are up and running. Unfortunately, I can't seem to log in to any of these workloads using SSH. This is alarming.
The hosting provider gives me console access to workloads -- something that, sadly, my public cloud provider does not -- and I am able to quickly assess that the various Web site-related workloads are up and running, have Internet access, and otherwise seem healthy, happy and enjoying life. They are not currently handling customer traffic, which means that the switchover mechanism believes the primary workloads are still active.
I get an email on my phone, so something has to be working with the public cloud hosted workloads; part of the mobile email service chain lives there. I hop on Slack and ask a few of my sysadmin buddies to test my Web site. Some of them can get there, some of them can't.
While I pour coffee into my face and curse the very concept of 6 a.m., a phone call comes in from a panicked sales manager: only orders from one specific Web site have shown up in the point of sale system overnight. Five other Web sites haven't logged a single order.
Rather than drag you through each troubleshooting stage, I'll jump right to the end: the answer was DNS. More specifically, the outsourced DNS provider had a really interesting oopsie where half of their resolvers wouldn't resolve half of our domain names and the other half worked perfectly. This broke nearly everything, and we weren't prepared for it.
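In hindsight, the quickest way to spot that kind of split is to stop trusting whatever resolver your machine happens to be using and ask each of the provider's nameservers the same questions directly. Here's a minimal sketch of that check in Python; it assumes the dnspython package is installed, and the nameserver addresses and domain names are placeholders rather than anything belonging to my actual clients or their provider:

    # Query each of the DNS provider's nameservers directly and compare answers.
    # Sketch only: the nameserver IPs and domains below are placeholders, and it
    # assumes the dnspython package (pip install dnspython) is available.
    import dns.resolver
    import dns.exception

    NAMESERVERS = ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"]  # placeholder IPs
    DOMAINS = ["www.example.com", "shop.example.com", "mail.example.com"]  # placeholders

    for server in NAMESERVERS:
        resolver = dns.resolver.Resolver(configure=False)  # ignore the local /etc/resolv.conf
        resolver.nameservers = [server]
        resolver.lifetime = 5  # seconds before giving up on this server

        for domain in DOMAINS:
            try:
                answer = resolver.resolve(domain, "A")
                addresses = ", ".join(rdata.address for rdata in answer)
                print(f"{server} -> {domain}: {addresses}")
            except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer,
                    dns.resolver.NoNameservers, dns.exception.Timeout) as exc:
                print(f"{server} -> {domain}: FAILED ({exc.__class__.__name__})")

Run against the real provider, a grid like this makes a half-broken DNS service obvious in seconds: some servers answer, some time out or come back empty, and the pattern jumps right out.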
Old Monitoring in a New World
For most of our customers, the monitoring suite lives with the hosting provider. There, it's just one more VM on a box full of various VMs. Because it lives in what amounts to the backup location, it's not part of either of the primary production infrastructures (the public cloud and the on-premises datacenter), so it seemed like a reasonable place to put a widget that checks to make sure both of those sets of workloads are up and public-facing.
In the case of my early morning outage, because there was nothing actually wrong with the Web servers themselves, and because the hosting provider provides a caching DNS server, the monitoring solution didn't see anything wrong. It could resolve domain names, get to the relevant Web sites, see email passing and so forth.
Back in the day when everything ran from a single site, this was fine. Either things worked, or they didn't. If they didn't work, wait a given number of minutes, then flip over to the disaster recovery site. Life was simple.
Today, however, there are so many links in the chain that we have to change how we monitor them. DNS, for example, clearly needs to be monitored from multiple points around the world so that we can ensure that resolution doesn't become split-brained. Currently, none of our customers use geo-DNS-based content delivery networks for regional Web site delivery, but it's been discussed. That would add yet another layer of monitoring complexity, but this sort of design work can't be ignored.
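You don't need probes on five continents to get started, either. Even a cheap approximation -- asking several independent public resolvers the same question and flagging any disagreement -- would have caught my outage. Here's a sketch along those lines, again assuming dnspython is installed; the domain is a placeholder and the resolver addresses are the usual well-known public ones:

    # Compare answers for the same name from several independent public resolvers
    # and complain when they disagree. A monitoring sketch, not a full solution:
    # the domain is a placeholder and dnspython is assumed to be installed.
    import sys
    import dns.resolver
    import dns.exception

    PUBLIC_RESOLVERS = {
        "Google": "8.8.8.8",
        "Cloudflare": "1.1.1.1",
        "Quad9": "9.9.9.9",
        "OpenDNS": "208.67.222.222",
    }
    DOMAIN = "www.example.com"  # placeholder

    results = {}
    for name, address in PUBLIC_RESOLVERS.items():
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = [address]
        resolver.lifetime = 5
        try:
            answer = resolver.resolve(DOMAIN, "A")
            results[name] = frozenset(rdata.address for rdata in answer)
        except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer,
                dns.resolver.NoNameservers, dns.exception.Timeout):
            results[name] = frozenset()  # treat a failure as an empty answer

    for name, addresses in results.items():
        print(f"{name}: {sorted(addresses) or 'no answer'}")

    # If the resolvers don't all agree, something upstream is split-brained.
    if len(set(results.values())) > 1:
        print("WARNING: resolvers disagree about", DOMAIN)
        sys.exit(1)

The obvious caveat: once geo-DNS content delivery enters the picture, resolvers in different regions are supposed to disagree, which is exactly why that design adds another layer of monitoring complexity.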
Hybrid IT Is the New Normal
None of my clients are doing anything that, by today's standards, is particularly novel or difficult. Web sites hosted in the cloud update the point of sale software on-premises. They receive updates from that same software. Orders for some customers arrive via FTP or SSH and are funneled to the on-premises servers for processing.
There is middleware that collects order tracking information from manufacturing, invoicing from points of sale, information from the e-stores and logistics information from the couriers. All of this is wrapped up and sent to customers in various forms: there are emails, desktop and mobile Web sites and SMS pushes. I think one client even has a mobile app. The middleware also tracks some advertising data from ad networks and generates reports.
Somewhere in there is email. Inbound email goes through some hosted anti-spam and security solutions. Outbound email comes from dozens of different pieces of software that will forward through smart hosts at various points until they are funneled through the main server located in the cloud. Email can originate from end users or from office printers, manufacturing equipment, the SIP phone system or any of dozens of other bits of machinery.
None of the clients I act as sysadmin for currently has more than 200 users. Most are in the 50-user range. None of the technology they have deployed is even as complicated as a hybrid Exchange setup or hybrid Active Directory.
Despite this, these small businesses are thoroughly enmeshed in hybrid IT. This multi-site, multi-provider technological interconnectivity means changing how we think about monitoring.
Hybrid IT is not a novelty. It's not tomorrow's technology. It's the everyday business of everyday companies, right now, today. Are you ready?
About the Author
Trevor Pott is a full-time nerd from Edmonton, Alberta, Canada. He splits his time between systems administration, technology writing, and consulting. As a consultant he helps Silicon Valley startups better understand systems administrators and how to sell to them.