In-Depth
IT Monitoring Must-Haves
I can't tell you which monitoring solution is right for your mix of workloads and infrastructure, but I can tell you what to keep an eye out for.
Have you ever tried to create monitoring and alerting for 40 applications, operating system environments (OSEs) and underlying infrastructure? How about 400? Or 4,000? As the ratio of applications to administrators grows, ease of use in monitoring applications matters more than ever.
I've written and discarded this piece a dozen times. If I write what I feel to be true, this article will seem like a cynical, even vicious defenestration of monitoring vendors. If I try to tone it down, I can't help but read it as though I'm pandering to someone. I've circled this article like a cat all weekend, and ultimately, I've decided to write what I know.
Let's skip to the truth of things: Monitoring applications are awful. Every. Single. One.
There's variability in the awfulness. Some are so awful as to be downright unusable. Some lack any meaningful attempt at providing critical features like operations dashboards or reporting. Some are just so miserable to install and configure that they almost never get used as designed. Some are just too pricey.
Every monitoring application out there has some horrible, glaring flaw that makes systems administrators loathe and abhor it. Ultimately, we approach our monitoring applications the way we approach our politicians: instead of getting what we want or feel is required, we choose the option we hope will do the least damage.
So how do we choose?
Sleep Is Good
As a sysadmin, I'm going to start with what affects me directly. I'm not 20 anymore. I can't go five years on four hours of sleep a night. Because of this, I need a tool that will help me go from "Huh, who's this, do you know it's 4 a.m.?" to, "Oh, I see what the problem is," without requiring coffee.
I want to roll over in bed, open my notebook, see what's wrong, and fix whatever it is before I become fully awake. To that end, any monitoring solution I let my clients spend money on must have three things: automatic discovery of new infrastructure components and workloads, event correlation, and automated root cause analysis.
I don't want to have things go boom because someone added something to the network that I didn't know about. Similarly, I'm not interested in wading through a ton of alerts in the wee hours of the morning. What I want is a solution that says, "The Web server is not working because the storage array went kablooie." I will then reboot the storage array and go back to sleep.
Sleep is good.
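For the curious, that correlation step boils down to something like the following toy sketch. The dependency map and node names are invented for illustration, and a real product would build this graph through automatic discovery, but the principle is the same: only page on the alerts that nothing else explains.

```python
# Toy event-correlation sketch. The dependency graph and node names are
# invented; real monitoring products discover this topology automatically.

DEPENDS_ON = {
    "web-server": ["storage-array"],
    "database":   ["storage-array"],
}

def root_causes(alerting):
    """Return only the alerts not explained by a failing dependency."""
    alerting = set(alerting)
    return {
        node for node in alerting
        if not any(dep in alerting for dep in DEPENDS_ON.get(node, []))
    }

# The Web server and database are down because the array went kablooie:
# three alerts correlate down to one root cause, one page, more sleep.
print(root_causes(["web-server", "database", "storage-array"]))
# -> {'storage-array'}
```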
Management at Scale
If I have a tiny shop with 40 workloads and 10 pieces of physical infrastructure, I can solve my problems with any of a number of monitoring applications. I wouldn't even need to install one on-premises: at that scale I could use a cloud-based monitoring-as-a-service solution from some startup and all would be good. But this changes when I start looking at 400 workloads.
At this scale, monitoring solutions are unworkable unless they come with a strong policy engine. Grouping workloads and infrastructure components allows them to be approached in a human-comprehensible fashion. This blob is dev and test. That blob is finance. This blob over here is mission-critical stuff.
More important, a grouping-and-policy-based approach to monitoring allows for fine-tuning of alerting and remediation. Some groups of infrastructure and workloads aren't important enough to wake a sysadmin. Others are.
Similarly, there are some workloads that have known issues: If you see an alert of this type from this group, then run this script to remediate the problem. Automation is your friend.
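To make that concrete, here's a minimal sketch of how a grouping-and-policy engine might route alerts. The group names, severity thresholds and remediation script path are hypothetical stand-ins, not any particular vendor's API.

```python
# Sketch of group-and-policy alert routing (all names hypothetical).
# Each group declares whether it can wake a human, and known issues map
# to remediation scripts that run instead of paging anyone.

import subprocess

POLICIES = {
    "dev-test":         {"page_on_call": False, "min_severity": "critical"},
    "finance":          {"page_on_call": True,  "min_severity": "warning"},
    "mission-critical": {"page_on_call": True,  "min_severity": "info"},
}

# (group, alert type) -> script that fixes a known issue.
KNOWN_ISSUES = {
    ("finance", "app-pool-hung"): "/opt/remediation/recycle_app_pool.sh",
}

SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}

def handle_alert(group, alert_type, severity):
    script = KNOWN_ISSUES.get((group, alert_type))
    if script:
        # Known issue with a scripted fix: remediate, don't wake anyone.
        subprocess.run([script], check=False)
        return "remediated"
    policy = POLICIES.get(group, {"page_on_call": True, "min_severity": "info"})
    if (policy["page_on_call"]
            and SEVERITY_RANK[severity] >= SEVERITY_RANK[policy["min_severity"]]):
        return "page"  # wake the sysadmin
    return "log"       # record it; deal with it in the morning

print(handle_alert("dev-test", "disk-full", "warning"))       # -> 'log'
print(handle_alert("mission-critical", "node-down", "info"))  # -> 'page'
# handle_alert("finance", "app-pool-hung", "warning") would run the
# remediation script and return 'remediated' without paging anyone.
```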
Monitoring solutions that intend to be useful at any sort of scale also must have a modern API. I personally prefer a JSON-based RESTful API, in large part because REST is the de facto API standard today, and it's good to keep your scripting simple.
An API allows for infrastructure automation. Monitoring can be assigned to new workloads as they're instantiated, and those workloads can be placed into the appropriate groups. With a proper policy engine, administrators configure monitoring solutions at a group level, and workloads added in this fashion inherit the configuration of the parent group.
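As a sketch of what that looks like in practice, here's roughly how a provisioning script might enroll a freshly instantiated workload. The endpoint paths and payload fields are invented for illustration; substitute whatever your vendor's RESTful API actually exposes.

```python
# Hypothetical enrollment call against an invented monitoring API.
import requests

MONITOR_API = "https://monitor.example.com/api/v2"
HEADERS = {"Authorization": "Bearer <token>"}

def enroll_workload(hostname, group):
    """Register a new workload and drop it into a group.

    With a proper policy engine, the workload inherits the group's
    monitoring configuration; nothing is configured per-host.
    """
    resp = requests.post(
        f"{MONITOR_API}/workloads",
        headers=HEADERS,
        json={"hostname": hostname, "group": group},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

# Called from the same automation that instantiated the workload:
# enroll_workload("web-042.example.com", "finance")
```

The point is that the provisioning automation does the enrolling; no human ever touches the monitoring console.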
Templates are another critical feature in which far too many monitoring solution vendors don't invest. My No. 1 beef with monitoring vendors is that most of them seem to provide templates for a handful of "enterprise" workloads, and then hand-wave customers off in the general direction of "the community" if they want anything more.
Unfortunately, few monitoring vendors have successfully cultivated a decent community repository of templates, in part because few of them bother to invest in their communities. Building new templates for each piece of infrastructure purchased or each workload type spun up is doable at 40 workloads. At 400 workloads this is a full-time job. At 4,000 workloads it's downright impossible.
Managing monitoring solutions at scale requires finding a vendor that continually invests in its product and in the community that surrounds it, and that provides features relevant to managing workloads at scale. It's a big ask, especially if you have a heterogeneous environment with infrastructure components and workloads from multiple vendors. But it's increasingly non-negotiable: shops with only 40 workloads are a rarity, and for most organizations that number isn't decreasing.
Monitoring at Scale
In addition to managing the monitoring solution at scale, it would be super if the monitoring solution provided useful information at scale. Virtually every monitoring solution available allows drilling down into the details of an individual workload or infrastructure component. Providing meaningful insights about how workloads are operating at a wider view is where things get tricky.
A large part of this relies on having a UI that's designed for scale. When you look at a datacenter as a whole -- or worldwide operations as a whole -- are you provided with information like, "X number of critical events," or something more meaningful? This is another place where event correlation and automated root cause analysis really matter.
Looking at a worldwide operations view, the number of events isn't relevant. What's relevant is the number and type of services that are impacted, and how severe those impacts are. If a low-traffic Web server is suffering performance problems but is still functional, that isn't a priority when there's a mission-critical service outage to deal with.
This is where features like customizable Network Operations Center (NOC) dashboards really start to shine. I don't need a monitoring solution that pesters me with every single event from every single item in my infrastructure.
I need a monitoring solution that extracts signal from that noise and helps me prioritize what to fix. I also need to be able to see what's relevant at a glance, which is the whole purpose of a NOC, and why dashboards are so critical. Fully interactive dashboards, where administrators can drill down into individual faulting elements and initiate remediation directly, have proven invaluable as well.
As an adjunct to the dashboard, advanced search capabilities are absolutely vital at scale. It doesn't take much for an organization's IT to grow beyond an administrator's ability to remember what everything is and where it's located, especially in today's hybrid multi-cloud world. The ability to search for workloads or infrastructure based on a wide range of criteria helps solve problems quickly.
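In API terms, that search might look something like the following. The filter fields and response shape are invented for illustration, but the idea is to slice the whole estate by operating system, location, tag or alert status in a single query.

```python
# Hypothetical search query against an invented monitoring API.
import requests

resp = requests.get(
    "https://monitor.example.com/api/v2/search",
    headers={"Authorization": "Bearer <token>"},
    params={
        "type": "workload",
        "os": "windows",
        "location": "aws-us-east-1",
        "status": "alerting",
    },
    timeout=10,
)
resp.raise_for_status()
for item in resp.json().get("results", []):
    print(item["hostname"], item["group"], item["status"])
```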
Compliance
Reporting and analytics round out my must-haves. If none of the other features discussed here are pushing monitoring vendors to up their game, the money to be made off of regulatory compliance hopefully will.
The European Union's General Data Protection Regulation (GDPR) enters into enforcement on May 25, 2018, and far more organizations worldwide are subject to it than realize it. There are other regulations, but the GDPR is the one that gets all the press, and it's the one causing my clients to sit up and take notice.
For many organizations complying with regulations requires passing audits. Passing audits increasingly involves being able to show auditors how your IT is designed and configured, and whether it's operating as designed. Assembling that information by hand would be a month's work at 40 workloads. It's functionally impossible at 400, and grounds for being shot into the sun at 4,000.
Monitoring solutions absolutely must come with top-notch, auditor-satisfying reporting and analytics. There's no excuse for not having these capabilities, and if a monitoring solution doesn't have them, nobody should pay money for it.
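What I'd like to hand an auditor is the output of something like this: a single call that exports a configuration-and-compliance report for everything in scope. The endpoint and fields here are hypothetical, but this is the shape of the capability I'm asking vendors for.

```python
# Hypothetical report-export call against an invented monitoring API.
import requests

resp = requests.post(
    "https://monitor.example.com/api/v2/reports",
    headers={"Authorization": "Bearer <token>"},
    json={
        "type": "configuration-compliance",
        "scope": {"tag": "gdpr-scope"},
        "format": "pdf",
    },
    timeout=30,
)
resp.raise_for_status()
with open("compliance_report.pdf", "wb") as f:
    f.write(resp.content)  # the report an auditor actually reads
```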
Feet to the Fire
Every monitoring solution falls down on at least one of the items discussed here. That's just the way things are. The real question is: If we revisit this discussion a year from now, will that still be true?
Which vendors actually care? Which vendors still believe in supporting on-premises datacenters? Which vendors are committed to supporting heterogeneous environments? Which vendors are willing to continue developing new templates forever? Which vendors are going to invest in reporting?
I can't tell you which monitoring solution is right for your mix of workloads and infrastructure. I can only tell you what to keep an eye out for. And to say this: Hold your monitoring vendor's feet to the fire. Nothing will improve in this space unless we all do it together.
The number of workloads under management isn't going down, and we need better tools to manage them.
About the Author
Trevor Pott is a full-time nerd from Edmonton, Alberta, Canada. He splits his time between systems administration, technology writing, and consulting. As a consultant he helps Silicon Valley startups better understand systems administrators and how to sell to them.