The Cranky Admin
Looking at IT Monitoring From Both Sides Now
Today's complex environments require an understanding of the difference between how a user sees a problem and how a sysadmin sees a problem, and how to translate between the two.
For the user, everything is about the app and their experience of using it. Sysadmins have to look at layers or dependencies when troubleshooting. And the problem that prompted the user to call support may not be occurring when a sysadmin goes looking for it, making time-stamped records important.
This difference between how a user sees a problem and how a sysadmin sees a problem can sometimes be vast, leading to problems in translating between the two. Sysadmins the world over make jokes about how stupid users are. Users the world over think sysadmins are unhelpful, arrogant pains in the neck. Both views are born from a frustration in being unable to bridge the gap in individual experiences, and a lot of that boils down to the tools we use.
For a user, the physical device they use is just a tool. The network is something they may not even know exists. Anything on the other side of the network from their endpoint is a black box they have no interest in understanding.
This is a perfectly rational approach to using a computer. I am not, for example, particularly interested in the nitty-gritty details of fuel distribution in the planes I fly in, nor do I care about the details of instrument landing system protocols. I get on the plane, it flies to my destination, I get off. It is a tool to me, and nothing more.
I chose the plane instead of the ubiquitous car metaphor for a reason. For many of us, planes are a not-quite-necessary evil. There are almost always alternatives to taking a plane, but those are slower, less efficient and rarely an option for serious business use.
Perhaps more to the point, a lot of users feel about computer support the way most of us feel about traveling by air.
Everyone hates border guards. They're suspicious, overworked, underpaid, undertrained and ultimately capricious. Planes are cramped, awful spaces all on their own. Having to deal with border guards just makes everything about flying that much worse.
Everyone hates sysadmins. They're suspicious, overworked, underpaid, undertrained and ultimately capricious. Computers are bizarre, poorly designed devices all on their own. Having to deal with sysadmins just makes everything about computing that much worse.
Thinking about people as data packets helps. Neither border guards nor sysadmins exist to make the lives of individuals easier. They exist to protect the system as a whole. Individual complaints are diagnostic, but what matters is that the system as a whole keeps running.
If there's a problem, you isolate and remediate. You don't want to turn off the whole datacenter (or airport) unless you absolutely have to. By the same token, it's better to kill an individual workload (or screw up the travel plans of an individual) than to start bogging down the entire system for everyone.
From the sysadmin's point of view, order is required. There is no room for exceptions. You force all workloads to behave according to pre-defined rules and everything works smoothly. Anything that breaks the rules is quarantined and forensically examined. Like controlling people in an airport, systems administration is much easier when one can emit rigid fiats and ruthlessly enforce compliance.
Over the past 20 years, however, a few things have changed.
Two decades of unrelenting fearmongering have convinced the world's citizens that it's perfectly normal and acceptable to turn our airports into a depressing cross between a cattle processing facility and a prison. We willingly surrender our rights, speak in hushed tones, keep our heads down and don't look anyone in the eye.
Those same two decades of unrelenting fearmongering have not, however, convinced users to accept this as regards computers. There's cloud computing out there. Everyone has smartphones. You can connect laptops to the Internet using the cellular network. In other words, there are options available to work around demanding systems administrators; and when pushed, that's exactly what people do.
Choice changes everything, and this is where the air travel analogy breaks down.
Border guards get the option of forcing everyone to comply. In some cases, quite literally at gunpoint. Systems administrators of today, however, don't get to take that easy path of building a system where everything follows the rules and you just look for what's out of bounds if something stops working. In order to keep users using systems as intended, they have to provide systems that users actually want to use.
Reverse the Polarity
In the 90s, things involving computers occurred largely consecutively. I wrote a file in a word processor. I saved the file. I then transferred the file somewhere. I opened it on another computer. Backups were regularly done to an attached tape drive.
Unless there was a lot of scripted batch processing occurring in the background, each was a distinct action that I had to consciously perform. This has radically changed today.
When I open my word processor, it checks with my operating system to see what my identity is. This is then verified against a cloud server to make sure that I'm allowed to use that word processor, and may even involve authenticating against that cloud service. As I type, my document is spell- and grammar-checked in real time, with missed words and word selections streamed back to the mothership's big data system. At some point, my word processor updates its checkers with the results of the big data crunching.
My document is also auto-saved regularly. This is kept not on my computer, but on the company server. A copy of each save is sent up to my personal cloud storage, which performs versioning. The server storing the master copy also regularly versions any manual changes to the master file made when I push the "save" button. The master file is backed up every night, and those backups are versioned, with copies of the backups also kept in the cloud.
All of this busywork happens behind the scenes; but for me as an end user, the experience hasn't changed much in decades. I still open a word processor, type my document, save it and close the word processor. The bit that I don't have to worry about is moving it over to a floppy disk: today, my smartphone and my notebooks, all of which are attached to my cloud account, automatically download the latest master copy of the documents I am working on.
The user experience may not have changed much, but the underlying technological components have. In the 90s, creating a document took only my PC. Today the creation of that document also involves several servers and networks, many of which I don't control.
At a minimum there are my file, directory, DNS and backup servers; the cloud servers for authentication, application distribution and patching, Big Data analytics and results distribution; and the DNS and backup servers for all of those. My network, the networks of the cloud and DNS providers, the Internet service providers and all the backbone providers in between are also involved.
The border guard gets to look for anomalies. This approach presumes that the rules are well known and agreed upon. Twenty years ago, this worked in computers; today it doesn't.
Systems administrators can't even see all aspects of the system anymore, and it grows more dynamic and complex by the day. Continuous delivery, A/B testing and other modern application development and service delivery models mean that it is entirely possible that two users sitting side by side aren't being delivered the same version of an application.
The dynamicity of network routing and DNS, as well as the ubiquity of load balancers, reverse proxies and so forth, also means that the electronic pathways taken by a user to access an application -- or parts of an application -- change. Not only can these vary between users, or between instances of application launch, but they can change while a user is in the middle of using an application.
Imagine trying to keep control of an airport if Gate 23 was Flight 4438 to San Francisco for some passengers, but for others it was Gate 28, and for still others Gate 45. All gates somehow got the passengers to the same flight, but one is just down the hall from the security terminal, one requires passengers to go back outside and then walk across the tarmac and the third requires passengers to dig a tunnel starting in the back of the bathroom.
Oh, and the tickets in passengers' hands can change on the whim of the airline and the airline won't tell the border guards about any changes.
This is IT today. We can no longer reasonably create a model of how things "should" work based on some static blueprint and then find out what's anomalous. This means we need tools that let us see the world from the user's perspective and work backwards.
Profile, Capture and Analysis
The perfect tool would allow systems administrators to profile applications from launch to closure and tie this in to log file capture systems at the device, network and cloud provider level, spitting out a single analysis. Imagine launching your word processor using this tool, and having that tool capture every network request, every local system I/O and every library call made.
That capture data could then be lined up with logs from switches, servers and data made available via APIs from cloud providers (and, in a perfect world, ISPs and backhaul providers). The analysis software could then look for failures. The application tried to look up authentication.cloudapp.com, but the DNS server couldn't find it? Well, that's interesting…
Perhaps more importantly, all our software needs to be aware of dynamicity, from asset management to network discovery and from log file analysis to auditing. We need to collect information on how the applications we have behave, when they change, what sorts of changes are normal and what sorts of changes aren't.
Essentially, we need to be able to detect the difference between A/B testing in a delivered application, a cloud outage and the malicious injection of code. We need to be able to figure out if something is user error, the network (our fault) or a cloud provider (not our fault, but we have to take the blame for it anyway).
No vendor can build software that can infer what's normal and what's not simply from application behavior. No sysadmin can divine this from log files. Ultimately, we need the participation of application vendors in this process. We need those vendors to provide, via API, information about how all versions of their applications (including test versions) should behave so that our analysis software can compare it to the data we gather.
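To make that wish concrete, here is a minimal sketch in Python of what the comparison might look like. Everything in it is an assumption made for illustration: no vendor publishes such a manifest today, and the `MANIFEST` structure, the version strings, the `example.com` hostnames and the `classify` function are all invented. It only shows the idea of checking what an application was observed contacting against what the vendor says each version (test versions included) should contact.

```python
# Hypothetical vendor manifest: application version -> hosts it should contact.
# The shape and the contents are invented; no such vendor API exists today.
MANIFEST = {
    "5.1": {"authentication.cloudapp.com", "updates.example.com"},
    "5.2-test": {
        "authentication.cloudapp.com",
        "updates.example.com",
        "beta.example.com",
    },
}


def classify(version, observed_hosts, manifest=MANIFEST):
    """Compare hosts seen in captures against the published expectation."""
    expected = manifest.get(version)
    observed = set(observed_hosts)
    if expected is None:
        # A version the vendor never declared is itself worth investigating.
        return {"known_version": False,
                "unexpected": sorted(observed),
                "missing": []}
    return {
        "known_version": True,
        # Contacted, but not declared: possible injected code.
        "unexpected": sorted(observed - expected),
        # Declared, but never reached: possible outage or network fault.
        "missing": sorted(expected - observed),
    }


# A user on the declared A/B test build: nothing unusual.
ab_test = classify("5.2-test", ["authentication.cloudapp.com",
                                "updates.example.com",
                                "beta.example.com"])

# A user on the stable build talking to a host nobody declared.
suspect = classify("5.1", ["authentication.cloudapp.com",
                           "collector.example.net"])
```

With a declared test build, the side-by-side users from earlier stop looking like an anomaly, while an undeclared host or an unreachable declared one stands out.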
To my knowledge, no such application yet exists. More's the pity. We're all stuck trying to assemble something like it, using the best of the tools we have available.
What's clear is that getting as many logs as possible into the same place, lined up precisely by time, is critical. We need to see what applications are trying to do, so that we can make rational inferences about where things have gone wrong.
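As a rough illustration of lining logs up by time, the Python sketch below merges per-source logs into one timeline and pulls out everything that happened within a few seconds of the moment a user says the problem occurred. The log format (an ISO 8601 timestamp, a space, then a message), the sample entries and the function names are all assumptions made for this example; real logs need per-source parsers and clock synchronization before a comparison like this means anything.

```python
from datetime import datetime, timedelta
from heapq import merge


def parse(source, lines):
    """Yield (timestamp, source, message) from 'ISO-timestamp message' lines."""
    for line in lines:
        stamp, message = line.split(" ", 1)
        yield datetime.fromisoformat(stamp), source, message


def timeline(logs):
    """Merge per-source logs, each already in time order, into one timeline."""
    return list(merge(*(parse(src, lines) for src, lines in logs.items())))


def around(events, when, seconds=5):
    """Return the events within `seconds` of the time the user reported."""
    window = timedelta(seconds=seconds)
    return [event for event in events if abs(event[0] - when) <= window]


# Hypothetical sample data, invented for illustration.
logs = {
    "endpoint": [
        "2024-01-01T09:00:00+00:00 word processor launched",
        "2024-01-01T09:00:02+00:00 auth request to authentication.cloudapp.com failed",
    ],
    "dns": [
        "2024-01-01T09:00:01+00:00 NXDOMAIN authentication.cloudapp.com",
        "2024-01-01T09:30:00+00:00 cache flushed",
    ],
}

events = timeline(logs)
reported = datetime.fromisoformat("2024-01-01T09:00:02+00:00")
for stamp, source, message in around(events, reported):
    print(stamp.isoformat(), source, message)
```

Timestamps carry an explicit UTC offset so that entries from different sources compare correctly. The user's complaint gives the starting time, and the merged view shows the DNS failure sitting between the application launch and the failed authentication request.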
Ultimately, monitoring software has to evolve. It has to be not only application aware, but able to analyze the whole chain of causality from application to cloud and back again. In the meantime, let's try to keep the users from digging tunnels to their destination from the back of the bathroom.
Trevor Pott is a full-time nerd from Edmonton, Alberta, Canada. He splits his time between systems administration, technology writing, and consulting. As a consultant he helps Silicon Valley startups better understand systems administrators and how to sell to them.