The Cranky Admin
The Art and Science of Workload Monitoring
What different kinds of monitoring are available for workloads, and what do the numbers you get back actually mean?
Traditional IT monitoring solutions provide a variety of raw metrics, often with the ability to set thresholds. When a metric moves above or below the desired threshold, an alert is generated. This is a fairly simplistic system with two critical flaws.
The first is that the cause of an IT problem is usually complex. The alert we receive might correctly indicate the source of the problem, or it might fire because something in another part of the IT stack has gone wrong, leaving us to unpack the chain. This is something previously discussed here at length.
The second flaw in traditional monitoring is that administrators don't always understand raw metrics. Misunderstanding can lead to monitoring the wrong thing. In turn, this can lead to inadequate or inappropriate alert generation. Misunderstanding of raw metrics can also lead to setting inappropriate thresholds for alerts, or to attempting to fix the wrong elements of the IT stack during troubleshooting.
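To make that model concrete, here's a minimal sketch of threshold-based alerting, assuming a Unix host. It polls one raw metric -- the one-minute load average -- against an arbitrary static bound; the notify() hook and the threshold value are placeholders, not a recommendation:

```python
# A minimal threshold alert: poll one raw metric, compare it against a static
# bound, and fire when the bound is exceeded. notify() stands in for whatever
# paging or ticketing integration is actually in use.
import os
import time

LOAD_THRESHOLD = 8.0   # assumption: an arbitrary bound chosen for this host
POLL_SECONDS = 60

def notify(message: str) -> None:
    # Placeholder for email/pager/webhook delivery.
    print(f"ALERT: {message}")

def watch_load() -> None:
    while True:
        one_min, _, _ = os.getloadavg()   # Unix-only, standard library
        if one_min > LOAD_THRESHOLD:
            notify(f"1-minute load average {one_min:.2f} "
                   f"exceeds threshold {LOAD_THRESHOLD}")
        time.sleep(POLL_SECONDS)

if __name__ == "__main__":
    watch_load()
```

The script can report that the metric moved. It cannot say why it moved, and it has no idea what else in the stack pushed it over the line -- which is exactly the problem with both flaws above.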
Hunting for Bottlenecks
Consider for a moment an application that appears to be responding to requests more slowly than normal. We'll put the inadequacy of human interpretation of average response times to one side, and assume that we have some empirical means to measure application responsiveness. Many things could be the bottleneck that's slowing down the application.
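As a stand-in for that empirical measurement, something as simple as timing a handful of requests and reporting percentiles rather than a bare average will do. The health-check URL below is a placeholder assumption:

```python
# Rough empirical responsiveness check: time a batch of requests against a
# hypothetical health endpoint and report median and p95 rather than a mean.
import statistics
import time
import urllib.request

URL = "http://app.example.internal/health"   # assumption: placeholder endpoint
SAMPLES = 20

def sample_response_times(url: str, samples: int) -> list:
    times = []
    for _ in range(samples):
        start = time.perf_counter()
        with urllib.request.urlopen(url, timeout=10) as resp:
            resp.read()
        times.append(time.perf_counter() - start)
    return times

if __name__ == "__main__":
    t = sorted(sample_response_times(URL, SAMPLES))
    print(f"median: {statistics.median(t) * 1000:.1f} ms")
    print(f"p95:    {t[int(len(t) * 0.95) - 1] * 1000:.1f} ms")
```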
The easiest bottleneck to spot and understand is a CPU bottleneck. If the application consumes all available CPU capacity, it's pretty clear that's the bottleneck. If the application is heavily single-threaded, and thus unable to make adequate use of additional processor cores, that will also be fairly self-evident.
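A quick way to see both conditions is to sample per-core utilization rather than the overall figure. The following is a Linux-only sketch that reads /proc/stat directly; one core pinned near 100 percent while the rest sit idle is the classic signature of a single-threaded bottleneck:

```python
# Sample per-core busy time from /proc/stat (Linux) twice and report how busy
# each core was over the interval.
import time

def read_cpu_times():
    times = {}
    with open("/proc/stat") as f:
        for line in f:
            # Per-core lines look like "cpu0 ...", the aggregate line is "cpu ..."
            if line.startswith("cpu") and line[3].isdigit():
                fields = line.split()
                values = [int(v) for v in fields[1:]]
                idle = values[3] + values[4]      # idle + iowait
                times[fields[0]] = (sum(values), idle)
    return times

def per_core_busy_percent(interval: float = 1.0):
    before = read_cpu_times()
    time.sleep(interval)
    after = read_cpu_times()
    busy = {}
    for core, (total_a, idle_a) in after.items():
        total_b, idle_b = before[core]
        total_delta = total_a - total_b
        idle_delta = idle_a - idle_b
        busy[core] = (100.0 * (total_delta - idle_delta) / total_delta
                      if total_delta else 0.0)
    return busy

if __name__ == "__main__":
    for core, pct in sorted(per_core_busy_percent().items()):
        print(f"{core}: {pct:5.1f}% busy")
```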
In the case of the CPU being the bottleneck, either you can provide additional CPU resources or you can't. In most cases, if additional CPU power can be provided, that's the cheapest route. If the hardware wall has been hit, however, then to solve CPU problems you need to dig into the application's code and start optimizing.
Customers deploying commercial, off-the-shelf applications don't have this luxury. They can submit feature requests, perhaps even pay for some custom coding, but that's usually where it ends. If the code they are using is open source, they might be able to contribute to the project, but that isn't guaranteed.
The network as a bottleneck is remarkably similar to CPU. There isn't a whole lot a customer can do to resolve it, short of providing bigger network pipes or paying to rewrite the code. The latter is unlikely to produce real-world benefits: when applications are network-bound, it's rarely because they're inefficiently filling the pipes with unnecessary overhead.
CPU and networking are the easy bottlenecks; the rest are hard.
Storage Headaches
Storage is the one everyone tends to get wrong. That's because so many different things can impact storage, storage comes in so many different flavors, and there are so many ways to solve storage issues. Storage is a complex beast. There isn't one thing that causes performance issues, and there isn't only one solution.
Without trying to teach a master class on storage and related problems, there are four primary things to consider when talking about storage: Input/Output Operations Per Second (IOPS), latency, disk queues and throughput. Each of these metrics tells us something different about our storage, is affected by different factors, and matters differently to different applications.
There are two basic storage request patterns: lots of little requests scattered across the disk, and fewer big, sequential requests. Lots of little requests drive high IOPS, while big sequential requests drive high throughput. Low latency is always good, but it's absolutely critical for heavily random workloads -- typical of applications that make lots of little requests.
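These numbers can be pulled straight from the kernel's own counters. The following is a rough, Linux-only sketch that derives IOPS, throughput and a crude average latency from two samples of /proc/diskstats; the device name is an assumption to adjust for the disk actually under investigation:

```python
# Derive IOPS, throughput and rough average latency for one block device
# from two samples of /proc/diskstats (Linux).
import time

DEVICE = "sda"          # assumption: placeholder device name
SECTOR_BYTES = 512      # /proc/diskstats counts 512-byte sectors

def read_diskstats(device: str):
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                return {
                    "reads": int(fields[3]),
                    "read_ms": int(fields[6]),
                    "writes": int(fields[7]),
                    "write_ms": int(fields[10]),
                    "sectors": int(fields[5]) + int(fields[9]),
                }
    raise ValueError(f"device {device!r} not found in /proc/diskstats")

def sample(device: str, interval: float = 5.0) -> None:
    a = read_diskstats(device)
    time.sleep(interval)
    b = read_diskstats(device)
    ios = (b["reads"] - a["reads"]) + (b["writes"] - a["writes"])
    io_ms = (b["read_ms"] - a["read_ms"]) + (b["write_ms"] - a["write_ms"])
    print(f"IOPS:        {ios / interval:.1f}")
    print(f"Throughput:  {(b['sectors'] - a['sectors']) * SECTOR_BYTES / interval / 1e6:.1f} MB/s")
    print(f"Avg latency: {io_ms / ios:.2f} ms per I/O" if ios else "Avg latency: n/a (idle)")

if __name__ == "__main__":
    sample(DEVICE)
```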
Disk queues are an important measure of the underlying hardware. From a hardware point of view, most of the time we want storage devices and controllers with the largest possible queue depth. There is no point in our storage sitting idle, and large queue depths make sure that when there is lots to do, operating systems and applications can transact with storage with minimal waiting.
Operating systems don't report queue depth directly. There are many layers where queues can exist, and not all of them are directly visible to the operating system, especially with network-attached storage. This leads us to monitoring disk queue length instead, which is a measure of how many I/O operations are outstanding.
Spinning magnetic drives use an elevator algorithm to look at all the pending I/Os in their visible queue and find the optimal pattern for moving the magnetic arm that reads and writes from the media. Flash drives similarly prefer to be handed entire blocks' worth of writes, because overwriting data on flash requires an erase first, and the block -- which contains multiple pages -- is the smallest unit of a flash drive that can be erased.
So keeping the queues on individual storage media full is good. Filling up the queues on the storage controllers all the way could be bad. And what's reported by the operating system as disk queue length is only tangentially related to either of them.
A disk queue length that stays above 1 means your application is issuing requests faster than your storage can actually service them. Queue depth is more nebulous: deep queues filled with pending I/Os are not actually a bad thing, as long as they're in the right place in the storage stack.
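Since disk queue length is the one number operating systems do expose, it's worth knowing where it comes from. The sketch below mirrors how iostat derives its average queue size on Linux -- the delta of the weighted-I/O-time counter in /proc/diskstats divided by elapsed wall-clock time -- with the device name again a placeholder:

```python
# Estimate average disk queue length from /proc/diskstats (Linux): the delta
# of the "weighted ms spent doing I/O" counter divided by elapsed time.
import time

DEVICE = "sda"   # assumption: adjust to the device under investigation

def read_queue_counters(device: str):
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                in_flight = int(fields[11])     # I/Os currently in progress
                weighted_ms = int(fields[13])   # weighted ms spent doing I/O
                return in_flight, weighted_ms
    raise ValueError(f"device {device!r} not found")

def average_queue_length(device: str, interval: float = 5.0) -> float:
    _, before = read_queue_counters(device)
    time.sleep(interval)
    in_flight, after = read_queue_counters(device)
    avg_queue = (after - before) / (interval * 1000.0)
    print(f"in flight right now: {in_flight}")
    print(f"average queue length over {interval:.0f}s: {avg_queue:.2f}")
    return avg_queue

if __name__ == "__main__":
    average_queue_length(DEVICE)
```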
And that's just the basic hardware.
Bottlenecks Come In Layers
Much of the storage in a datacenter is network attached, and network-attached storage is affected by network performance. If an application starts to slow down and it appears to be storage related, it might be the storage hardware -- and it might not. It all depends on what else you've got sharing that same network, and/or how much network aggregation and contention there is between hosts that might be asking for storage services and the actual devices delivering them.
RAM utilization can also affect storage performance. RAM as a bottleneck is a tricky thing. Applications and operating systems store vital bits of themselves in RAM. RAM is also used by applications, operating systems and hypervisors for caching.
A system that performed beautifully yesterday might slow to a crawl today -- not because there's anything inherently wrong with the underlying storage layer, but simply because some change reduced the amount of RAM used for some layer of cache. Without that frequently accessed data living in lightning-quick RAM, the application is forced to go to the disk for every request. Every request for data that used to be cached is time that could have been spent by the storage system doing something else.
The important concept here is that of the consequence cascade. A minor and seemingly irrelevant change to something way over here can cascade through multiple layers of technology to make it seem like the problem is over there.
Change Tracking
All this means that threshold-based monitoring is outdated to the point of pointlessness. We could investigate and troubleshoot consequence cascades when our organizations had a few dozen applications under management -- today they can easily have thousands.
Relying on administrators to understand what all the raw metrics mean and how they can affect a given application is asking for trouble. None of us can keep track of all of that, especially if we don't work with a given application on a regular basis.
When setting up monitoring for an application, it helps to be able to discover, identify and log all the other elements in a datacenter upon which that service depends. That way, when another administrator -- or our future selves -- comes across it, we don't have to reverse-engineer the problem every single time in order to fix it.
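What that log looks like matters less than that it exists and lives next to the monitoring configuration. A trivial sketch -- invented names, no pretense of being a standard schema -- might be nothing more than a JSON manifest per application:

```python
# Keep the dependency map with the monitoring config: a simple structured
# manifest per application. All names and hosts here are placeholders.
import json

MANIFEST = {
    "application": "orders-web",
    "depends_on": [
        {"type": "database", "name": "orders-db", "host": "db01.example.internal"},
        {"type": "nfs-share", "name": "invoice-archive", "host": "filer02.example.internal"},
        {"type": "service", "name": "auth-gateway", "host": "auth01.example.internal"},
        {"type": "network", "name": "storage-vlan", "id": "vlan 240"},
    ],
    "last_reviewed": "2017-03-01",
}

def save_manifest(path: str = "orders-web.deps.json") -> None:
    # Drop the manifest next to the monitoring configuration so the next
    # person troubleshooting this application starts with a map, not a blank page.
    with open(path, "w") as f:
        json.dump(MANIFEST, f, indent=2)

if __name__ == "__main__":
    save_manifest()
```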
Up to Scratch?
Not all monitoring software is up to the task, especially given how dynamic the modern datacenter can be. We need monitoring software that tells us when something is out of bounds. We also need monitoring software that snapshots entire interrelated stacks of parameters.
Most importantly, we need software that detects changes to environments and provides us the ability to easily traverse the logs of the multiple services, applications, operating systems, hypervisors and physical devices that each hold some piece of the puzzle.
About the Author
Trevor Pott is a full-time nerd from Edmonton, Alberta, Canada. He splits his time between systems administration, technology writing, and consulting. As a consultant he helps Silicon Valley startups better understand systems administrators and how to sell to them.