Mean Time To Innocence: A Primer for Admins -- Virtualization Review

Proper monitoring helps prove that it's not your fault, and find out what, exactly, is to blame.

By Trevor Pott
03/27/2017

When something goes wrong in the cloud, it might not be a system administrator's fault, but it absolutely is their responsibility. End users don't care where an application or service is hosted; if it involves a computer -- even tangentially -- it's IT's problem. Dealing with this means investing in decent tools.

I ran into this issue recently with a client. For years the client has used an application to send images from their site to a supplier site for processing. They would send individual orders of 10 images, maybe 100 images. Over the DOCSIS 3 "business cable" broadband package they have, transmitting these orders took anywhere from a few minutes to, at most, an hour.

Whenever an order took an hour to send, there'd be grumbling, but this was still accepted as normal, especially during "silly season" when six or seven individuals would be sending up orders. Recently, the client decided to make a fairly dramatic change to their business practices, without considering the impact it would have on their IT at all.

The client decided that in the name of efficiency they would send their images out to a third-party service for retouch instead of doing it in-house. The third-party service could do it cheaper and more efficiently than they could in-house, and they even managed to integrate right in with the same image-handling application my client uses for all their other work.

At first blush, this seems rational. Regular readers will, I'm sure, be able to guess what happened next.

Big Package, Small Pipe
Because images were now cheap to retouch, the client was eager to ship them off in bulk. When they got the images back, they could submit onesie and twosie orders, as per normal, all properly rendered to size for printing. So the client sends their first order off for retouching: 3,450 images at an average of 11MB each, or a little north of 37GB of data.

Through a cable modem.

A couple of days after this started I got an angry e-mail from the client saying that something must be horribly wrong with their network. They've tried multiple times to send this order, but after an hour of transmission, at most 2 percent of the order has gone through. They retry sending, but it never seems to help. Why, they ask, is everything broken?

The customer has had various network problems over the years. Applications can hog bandwidth. Someone tries watching the playoffs in the browser of their VDI instance. Once, a spambot got installed on someone's desktop and hogged all the upstream. These things happen.

I walked through the process of verifying that everything on the network was as it should be: nothing was flattening the connection, VDI seemed smooth and responsive, none of the primitive tripwires I'd set up had been triggered... and then another e-mail came in.

Citing deep frustration, the client included an image showing just what it was they were actually trying to do, revealing their new business practice and remarkable upload attempt. I informed them that, all things being equal, their Internet connection was capable of uploading about 1GB per hour and that the upload they were attempting would probably take about two days.
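The arithmetic behind that estimate is easy to sketch. The snippet below is a back-of-the-envelope calculation using the figures from the order (3,450 images at about 11MB each) and the roughly 1GB-per-hour upload rate. The function name and the decimal convention of 1GB = 1,000MB are assumptions made for illustration.

```python
def upload_hours(image_count, avg_image_mb, upload_gb_per_hour):
    """Estimate how many hours an upload takes at a steady rate."""
    total_gb = image_count * avg_image_mb / 1000  # decimal units: 1GB = 1,000MB
    return total_gb / upload_gb_per_hour


hours = upload_hours(3450, 11, 1.0)
print(f"{hours:.1f} hours, or about {hours / 24:.1f} days")
```

At a perfectly steady 1GB per hour that works out to a bit more than a day and a half of continuous uploading, so "about two days" is a fair estimate once daytime slowdowns and retries are allowed for.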

This was not considered an acceptable response.

Users Don't Think Like Admins
In the mind of the user, "uploading an order" took between a few minutes and an hour. The size of the order didn't factor into their calculations, nor did the question of to whom they were sending it. Speed had something to do with "the network" and was my responsibility. To the client, Internet connectivity is Internet connectivity, and they pay for the fastest connection the local ISP will offer.

There's no reason or explanation that the client can understand for why an upload might take two days. It isn't how they want this to work. It isn't how they planned on it working, and if it keeps working like this, it could jeopardize a cost reduction project they've been working on for months.

And we haven't even touched on the bit where the service they're sending their images to rate-limits individual inbound connections, meaning that no matter how big a pipe the customer purchases, those image transmissions will take exactly as long as they currently take.
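Why a bigger pipe doesn't help can be shown in a couple of lines: a single upload runs no faster than the slower of the local link and the remote per-connection cap. This is only a sketch. The function name is made up, and the cap of 2.2Mbit/s (roughly 1GB per hour) and the link speeds are hypothetical numbers, not figures from the service or the ISP.

```python
def effective_mbps(link_mbps, per_connection_limit_mbps):
    """A single upload runs no faster than the slower of the two limits."""
    return min(link_mbps, per_connection_limit_mbps)


# Hypothetical numbers: a remote cap of 2.2 Mbit/s per connection
# (about 1GB per hour) against three possible local upload speeds.
REMOTE_CAP_MBPS = 2.2
for link_mbps in (2.2, 10, 100):
    capped = effective_mbps(link_mbps, REMOTE_CAP_MBPS)
    print(f"{link_mbps:>5} Mbit/s link -> {capped} Mbit/s effective")
```

Under these assumptions, every upgrade beyond the remote cap buys nothing for a single connection.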

Monitoring for the Cloudy Era
It's relatively easy for a virtualization administrator to collect statistics on network and storage utilization, storage IOPS, server overhead and so on, when it's all within the boundaries of their own network. Most of our systems are busy screaming that information out over SNMP for anything that cares to listen.
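As a rough illustration of how little work this takes inside your own network, link utilization can be derived from two readings of an interface octet counter such as IF-MIB's ifOutOctets. The sketch below assumes the two readings have already been collected by whatever poller is in use; the function name, the sample values and the 32-bit default are assumptions, and it only copes with a single counter wrap between polls.

```python
def utilization_percent(octets_start, octets_end, interval_s, link_bps, counter_bits=32):
    """Percent of link capacity used between two octet-counter readings."""
    # Modulo arithmetic absorbs one wrap of the counter between the readings.
    delta_octets = (octets_end - octets_start) % (2 ** counter_bits)
    bits_per_second = delta_octets * 8 / interval_s
    return bits_per_second / link_bps * 100


# Hypothetical readings taken 300 seconds apart on a 10 Mbit/s uplink.
print(utilization_percent(1_000_000, 38_500_000, 300, 10_000_000))
```

On fast links a 32-bit counter can wrap more than once between polls, which is why the 64-bit ifHCOutOctets counter is usually preferred where the device supports it.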

Proving what's happening at the edge of the network and beyond is a lot more difficult. Try explaining to a customer who actively doesn't want to know about computers why cable modems have upload speeds that vary with the time of day, or that they can't get at their ERP software because Amazon East decided to have a little lie down.

This brings up the importance of next-generation monitoring tools. These are monitoring tools that are SaaS-application-aware, can test network paths across the Internet, belong to larger networks that track Internet activity around the world and can present that data in a manner customers can understand.

In a cloudy world, the most important concept for systems administrators is mean time to innocence. If we can prove that whatever is broken isn't our doing, then we have the data to determine where things did go wrong. This helps us find out who we should be contacting to get it fixed, and should provide the information they'll need in order to fix it.

In a cloudy world, a goodly portion of systems administration shifts from pulling the levers and twiddling the knobs to solve the problem ourselves to being "ever so slightly more professional and polite" finger pointers. We collect data so that others can fix problems, or to demonstrate to customers that they can't change the truth by ordering it so.

And for the customer trying to upload 37GB over a cable modem? Turns out it's a lot less frustrating -- and goes a lot faster -- if they start the upload when they leave work for the night. The monitoring software shows upload speeds increase dramatically after hours, something the customer never would have believed if it hadn't come in pretty graph form.
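The kind of summary behind such a graph can be approximated by averaging throughput samples per hour of the day. The following is a minimal sketch, not how any particular monitoring product works; the function name is an assumption and the sample readings are made up purely to show the shape of the output.

```python
from collections import defaultdict
from statistics import mean


def mean_mbps_by_hour(samples):
    """Average upload throughput per hour of day from (hour, mbps) pairs."""
    by_hour = defaultdict(list)
    for hour, mbps in samples:
        by_hour[hour].append(mbps)
    return {hour: mean(values) for hour, values in sorted(by_hour.items())}


# Made-up readings for illustration only, not measurements from the client's site.
samples = [(10, 2.0), (10, 2.4), (14, 2.2), (22, 6.0), (22, 7.0)]
for hour, mbps in mean_mbps_by_hour(samples).items():
    print(f"{hour:02d}:00  {mbps:.1f} Mbit/s")
```

Fed with real measurements, a table like this is often enough to show a customer when the line is at its slowest and when a large upload should be scheduled.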

About the Author

Trevor Pott is a full-time nerd from Edmonton, Alberta, Canada. He splits his time between systems administration, technology writing, and consulting. As a consultant he helps Silicon Valley startups better understand systems administrators and how to sell to them.