The Cranky Admin

When It Comes To IT Monitoring, Size Matters

New tools are emerging that filter intelligently for alerts and events.

Different-sized organizations have different IT needs. They may be trying to juggle information between silos of specialists, or they may be a small business admin forced by necessity to develop their IT staff into generalists. One problem that all organizations face, however, is getting the right information in front of the right people so that they can make decisions.

I have talked about the differences in how users and administrators see problems, and about the communication problems inherent in the siloed structure of IT teams. I've prattled on about IT monitoring and mean time to innocence.

All of these topics have, in one way or another, touched upon the importance of making sure IT staffs have the right tools for the job. These tools include monitoring software, log file collection, analytics software and so forth. There is another issue that is tightly coupled with these topics, and that is information overload.

Information overload affects all systems administrators, regardless of organization size. Different organizations, however, are affected differently, and there is no one solution that solves all problems.

The Case Study
When we're talking about information overload, size is everything. Scale matters in a way that's difficult to comprehend unless you've worked in a really large shop.

In an SMB, one server doing something strange can flood an Inbox with a few thousand irrelevant messages that take a systems administrator half a cup of coffee to isolate, fix and clean up. In an enterprise, a bad patch can cause hundreds of thousands of systems to spontaneously start emitting an endless stream of error messages, flooding all available communications channels and even bringing organizations to their knees.

I have seen this with my own eyes. A bad patch hit three datacenters' worth of render nodes: 150,000 machines. Nobody noticed because the patch didn't trigger anything odd in the canary group; but when the New Year arrived and the leap second was to be applied, everything went nuts.

150,000 servers lost their minds. They started flooding logs with error messages from a service that had fallen over because it didn't understand the leap second. This in turn caused regular bleating from the monitoring agent, which considered the loss of the service critical.

The resultant mess crashed the organization's e-mail servers, flooded the local telco's SMS capabilities and racked up thousands of dollars in SMS charges (this was back in the days when you paid per text). It even resulted in at least one administrator's child smashing the company phone repeatedly with a frying pan, because after 15 minutes of constant vibrating, falling on the floor and slowly creeping toward the family dog, he was absolutely convinced the phone was possessed.

It was hard to argue he was wrong.

Consolidating Identical Alerts
While such a scenario is an obvious example of information overload brought about by an exceptional circumstance, information overload happens in far more mundane and operating-as-designed circumstances. Again, scale changes everything.

For a small business with 40 servers, there is benefit in having each of those servers notify the administrator when there are patches ready. For an organization with 150,000 servers, that's insane. Most of those are likely to be identical, so is there really a point in having 80,000 servers all say "patch me" at the same time?

In the case of patches, we have patch management software: System Center (paid) or WSUS (free) for Windows, and Satellite (paid) or Spacewalk (free) for Linux. Servers connected to these systems will have their patches managed centrally. Ideally, instead of emitting complaints about their patch status to an administrator on an individual basis, the administrator receives a summary report on a regular basis from the patch management server.

This idea of consolidation is important. One summary report that describes a common problem in aggregate, instead of a seemingly unlimited stream of noise, is something our poor old primate brains need in order to do anything useful.

If you saw an inbox full of 150,000 seemingly identical e-mails, it would be rational to highlight them all and delete them. Amidst the inundation of identical alerts, however, there might very well have been something interesting and problematic that went unnoticed. Therein lies the problem.
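The core of that consolidation step can be sketched in a few lines. This is a toy illustration, not any vendor's algorithm: it fingerprints alerts by their raw message text, where a real correlator would first normalize out timestamps, hostnames and ticket IDs. The host and message names are invented.

```python
from collections import Counter

def consolidate(alerts):
    """Collapse identical alerts into one summary line per unique message.

    `alerts` is a list of (host, message) tuples. The fingerprint here is
    just the message text, so 150,000 identical complaints become a single
    line with a count -- something a primate brain can actually act on.
    """
    counts = Counter(message for _host, message in alerts)
    return [f"{count}x: {message}" for message, count in counts.most_common()]

alerts = [
    ("render-001", "leap-second service down"),
    ("render-002", "leap-second service down"),
    ("render-003", "leap-second service down"),
    ("db-01", "disk 95% full"),
]
for line in consolidate(alerts):
    print(line)
# 3x: leap-second service down
# 1x: disk 95% full
```

The interesting item, the one server with the full disk, survives as its own line instead of drowning on page 4,000 of an inbox.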

Signal-To-Noise Ratio
Even after consolidating the obvious alert spam, administrators are left with the task of separating housekeeping messages from genuine problems. Small shops used to be able to do this without any software to help sort one from the other. Large shops have been working on this problem for decades, with varying degrees of success.

Even in small shops, however, the number of workloads under management is increasing to the point where this may no longer be feasible. This is in large part because smaller organizations likely have only one or two technical staff members. These generalists must manage everything from hardware to software and even cloud services; there may be fewer workloads in total than in a large enterprise, but individuals in the SMB are responsible for such a diversity of management tasks that they can be quickly overwhelmed.

One of the big problems that any administrator can face is the monotony of such messages. Even if what turns up in our inbox is a summary instead of individual alerts, seeing the same summary every single morning when we log in quickly reduces its relevance to us.

We are either going to log in to the patch management server, release patches to a canary group, pursue feedback, then release to production, or we're not. Seeing that summary e-mail in our Inbox isn't really going to change our routine any, and so we mentally filter it out.

The danger here is that in getting used to mentally filtering out certain classes of e-mail, whether based on sender, subject or what-have-you, we might miss a genuine call for help. Our patch management summary might, for example, contain information that a group of servers had failed a critical patch, something we might not see until we go looking for it. That can go to bad places pretty quickly.
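One crude defense is to have the tooling, rather than our tired brains, split the routine from the exceptional. The sketch below is keyword-based triage at its most naive; the keyword list, report lines and patch identifier are all invented for illustration, and real products score severity from structured fields rather than grepping text.

```python
# Words that suggest a summary line needs a human, not a mental filter.
# This list is a hypothetical example, not a recommended taxonomy.
ATTENTION = ("failed", "critical", "error", "timeout")

def triage(report_lines):
    """Split summary-report lines into routine noise and urgent items."""
    routine, urgent = [], []
    for line in report_lines:
        bucket = urgent if any(k in line.lower() for k in ATTENTION) else routine
        bucket.append(line)
    return routine, urgent

report = [
    "412 servers patched successfully",
    "38 servers pending reboot",
    "6 servers FAILED critical patch",
]
routine, urgent = triage(report)
print("Needs attention:", urgent)
# Needs attention: ['6 servers FAILED critical patch']
```

The point is not the implementation; it's that the failed-patch line gets surfaced separately, so the habit of mentally deleting the morning summary doesn't swallow it.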

A New Hope
All is not lost. There is hope. Two similar and closely linked categories of software exist to help us solve the above problem. They are called Alert Correlation and Event Correlation. The canonical startup in this area is Big Panda. They have made a name for themselves solving exactly the problems described above.

Their success, combined with increased demand from existing customers, has led the big names in monitoring software to build solutions into their suites as well. These range from Big Panda-like Alert/Event Correlation solutions to more innovative solutions that veer into Big Data analysis.

Monitoring solutions that monitor multiple applications, hardware devices and services can not only pull together like alerts and suppress stuff you don't want to see, they can also be set up to do service-level correlation.

"This service relies on these applications, which rely on these servers, these OSes and this hardware. Don't send me 15 alerts when it all goes squirrely; give me a summary of what's wrong with the whole related stack."
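A request like that boils down to walking a dependency map. The toy model below rolls component failures up to the top-level service they sit beneath, so two component alerts become one service-level summary; the service names and topology are entirely hypothetical.

```python
# Hypothetical dependency map: service -> the things it relies on.
DEPENDS_ON = {
    "webshop": ["app-tier", "db-tier"],
    "app-tier": ["app-01", "app-02"],
    "db-tier": ["db-01"],
}

def affected_services(failed, graph=DEPENDS_ON):
    """Roll failed components up into one summary per top-level service."""
    children = {c for kids in graph.values() for c in kids}
    roots = [n for n in graph if n not in children]  # services nothing depends on

    def descendants(node):
        out = []
        for child in graph.get(node, []):
            out.append(child)
            out.extend(descendants(child))
        return out

    summary = {}
    for root in roots:
        hits = [c for c in descendants(root) if c in failed]
        if hits:
            summary[root] = hits
    return summary

print(affected_services({"app-01", "db-01"}))
# {'webshop': ['app-01', 'db-01']}
```

Instead of separate pages for an app server and a database, the on-call admin gets a single alert: the webshop is squirrely, and here are the two broken pieces underneath it.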

Alert Correlation and Event Correlation are rapidly evolving elements of the monitoring and analysis landscape, but they have very quickly become absolutely indispensable pieces of the systems administration toolchain. They are one more item we all need to invest in, but they promise to help us regain some semblance of our sanity.

About the Author

Trevor Pott is a full-time nerd from Edmonton, Alberta, Canada. He splits his time between systems administration, technology writing, and consulting. As a consultant he helps Silicon Valley startups better understand systems administrators and how to sell to them.

