Troubleshooting In Complex IT Environments -- Virtualization Review

Troubleshooting In Complex IT Environments

Where monitoring falls short.

By Trevor Pott
05/15/2017

In the datacenter, everything is connected. With hybrid IT being the new normal, the number of technologies underlying a seemingly simple support ticket can be staggering. Solving problems and closing out the day's tickets can involve troubleshooting both on premises and off.

It's not enough to talk about tech problems in the abstract. Real-world examples make problems more concrete and easier to understand. Being a systems administrator has furnished me with enough examples of doing it wrong to last a lifetime, but the most important lesson I've learned about troubleshooting is that nothing beats understanding how all the moving parts interact.

Business Processes, File Shares and You
Let's consider a fairly normal Web serving setup that caused a fun problem. The Web site in question serves up some custom middleware that allows both staff and customers to track orders through a manufacturing process; one day it developed a problem serving up some images.

The Web site is a combination of PHP scripts, generated HTML, a couple of databases and an astonishing number of images. For the Web site to "work" -- that is, deliver content to users in a timely manner -- a number of different systems across the organization need to function.

The Web site isn't static; at any given point in time a dozen different systems and at least as many staff could be generating files or making changes. There are thousands of generated images, static pages and database entries on a slow day.

One day a ticket showed up that said the Web site was having trouble serving images. Helpdesk staff went to the site; it seemed to be serving images just fine. The ticket was bounced back to the user requesting examples of images that wouldn't load. The user provided them, and the images would load for some of the helpdesk staff but not others. Being an intermittent problem, it was kicked upstairs to the systems administrators.

Intermittent problems are the most frustrating problems to diagnose, and this particular issue proved no exception. Despite there being umpteen log files on the Web server, none of them actually logged a relevant issue. The file server that served up the files to the Web server was similarly unhelpful.

Back To Basics
In order to solve the problem we had to go back to basics and tease apart each element of the stack. The images are hosted on a Windows file server. This is because the Windows applications that generate the images need to store them on an SMB share, and because business processes require that these images be easily available to, and editable by, staff members. Connecting to an FTP server to pull down images to edit, then pushing them back up to a Web server just wasn't going to fly.

Under basic testing, each piece of the puzzle worked just fine. The Linux server mounted the SMB share without complaint. The Apache Web server could see the relevant directory and serve images from it. It turns out, however, that a specific bug in Apache means that when serving files from mounted file-based network storage, one needs to make sure that EnableSendfile and EnableMMAP are set to “off” in the httpd.conf file.
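A check like this is easy to automate. What follows is a minimal Python sketch, not the tooling we used at the time: it scans the text of an httpd.conf for the two directives and reports any that are not explicitly set to off. The function name and the sample configuration are hypothetical, and the parser is deliberately naive. It ignores Include files and per-directory scoping, and only the last occurrence of each directive counts.

```python
# Directives that should be "off" when Apache serves files from network mounts.
REQUIRED_OFF = ("EnableSendfile", "EnableMMAP")


def find_unsafe_directives(conf_text):
    """Return the directives in REQUIRED_OFF that are not explicitly set to off.

    Assumption: a flat scan of the file is good enough for a first pass.
    Include files and <Directory> scoping are not followed.
    """
    last_value = {}
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        parts = line.split()
        for name in REQUIRED_OFF:
            # Apache directive names are case-insensitive.
            if parts[0].lower() == name.lower() and len(parts) >= 2:
                last_value[name] = parts[1].lower()
    return [name for name in REQUIRED_OFF if last_value.get(name) != "off"]


# Hypothetical sample: sendfile is off, but the MMAP line that matters says On.
sample = "EnableSendfile Off\n# EnableMMAP Off\nEnableMMAP On\n"
print(find_unsafe_directives(sample))  # ['EnableMMAP']
```

A directive that is missing altogether is reported too, since the sketch only trusts an explicit setting.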

This bug only shows up under specific circumstances, and is hard to catch if you can only test components one at a time. The ability to test the whole stack at once, to gather real-time information on performance and loading, and then compare this with logs and system events would have turned two days of troubleshooting into less than an hour's worth of effort.

Today, this same Web site solution now has tentacles in Amazon's public cloud and in four separate service provider hosted setups. It not only ties together information from on-premises manufacturing processes, but several outsourced manufacturing chains as well as numerous logistics companies. This makes the ability to see the whole stack of interconnecting elements all the more important.

Limits That Change
Another example of things gone wrong concerns an increasingly important element in a modern datacenter's infrastructure: the reverse proxy. As part of ongoing modernization and security efforts, network isolation had been taking place for several years. Anything that exposed a management Web site, for example, was placed behind a reverse proxy which contained various intrusion detection features.

For the most part, this worked remarkably well. Everything from VMware host Web sites to Dell iDRAC or Supermicro's IPMI Web sites worked from behind the proxy. All management Web site access worked just fine for over a year before we ran into a problem.

By the time we did finally run into a problem, almost everyone had forgotten the reverse proxy was even there. We had gotten used to using domain names to access management Web sites instead of IP addresses, and the reverse proxy just kept on doing its job, invisibly.

Eventually the day came when some virtual machine (VM) migrations were called for. At some point during the migration process, we needed to download VMs using VMware's Web-Based Datastore Browser. Downloads would start, everything would go well; but several minutes in, downloads would fail.

We crawled all over the error logs on the VMware hosts, storage servers and the vSphere server, to no avail. There were no logged errors, nor any apparent reason that these downloads should fail. And then we remembered the reverse proxy.

It turns out that the nginx setup on the reverse proxy had a 2GB file limit. This was a simple issue that cost hours, in large part because we didn't have the ability to even pin down which system was causing the problem.
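One way to narrow down a failure like this is to measure exactly where a transfer stops. The Python sketch below is illustrative only: the function name and the returned fields are hypothetical, and it works on any object with a read() method, such as the response returned by urllib.request.urlopen. A download that dies at the same byte count every time suggests a size limit somewhere in the path, and running the same measurement directly against the host and again through the proxy shows which hop imposes it.

```python
import io


def measure_transfer(stream, expected_bytes, chunk_size=1024 * 1024):
    """Read a stream to the end and report how far the transfer got.

    stream: any object with read(n) returning bytes.
    expected_bytes: the advertised size, e.g. from a Content-Length header.
    """
    received = 0
    error = None
    try:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break  # end of stream, whether or not all bytes arrived
            received += len(chunk)
    except OSError as exc:  # connection resets, timeouts and similar
        error = repr(exc)
    return {
        "received": received,
        "expected": expected_bytes,
        "complete": error is None and received == expected_bytes,
        "error": error,
    }


# Stand-in for a truncated download: 3,000 bytes arrive out of 5,000 promised.
result = measure_transfer(io.BytesIO(b"x" * 3000), expected_bytes=5000, chunk_size=1024)
print(result["received"], result["complete"])  # 3000 False
```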

Today, workload management consists of wrangling VMs on VMware and Scale Computing infrastructures, both on premises and on hosted service providers. Very soon it will also consist of handling VMware on Amazon's public cloud (via VMware on AWS).

Each location has its own security defenses, reverse proxies and so forth. Each location is another collection of moving parts that we as systems administrators need to keep in mind; not in order to serve workloads to our customers -- both internal and external -- but simply to access the management tools that let us diagnose the actual customer-facing workloads.

Having the Right Tools Is Important
At some point, it all gets too big. If my job were to coddle one application stack, I could easily remember all the layers in the stack.

Apache is a Docker container which sits on top of CentOS Linux. That Linux instance is presenting files to its containers, which are mounted using SMB and NFS shares for X and Y files, and via iSCSI for Z files. The Linux instance is inside a VM running on a Scale cluster, and the reverse proxy is nginx on CentOS on a VMware cluster. Internet is delivered via VLAN 1000 on the second NIC. SAN and NAS are VLAN 2000 on the third NIC. Customer access is VLAN 0 on the first NIC.

On and on and on; configuration details both minute and gross. Easy if your area of responsibility is one workload. Crushing if it's thousands.

Somewhere along the way we need a way to document all of this. What's connected to what, and how. We need more than Visio diagrams, spreadsheets and notes in a text file. We need to be able to enter this information into a system that will monitor each connection and offer up details of how the interconnected stack of technologies works when something goes sideways.

In essence, modern systems administration requires a means to augment the memory of systems administrators. A Google for datacenter infrastructure, both virtual and physical, on-premises and off. We need a means to visualize these issues, and to test each step in the chain, both individually and together, because the whole chain may behave differently than its individual links do when tested alone.
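Even a crude version of that memory aid is better than none. As a rough Python sketch under stated assumptions, the map of what connects to what can be written down as a dependency graph and walked from the bottom up, giving an order in which to test each link before testing the whole. The component names are hypothetical, loosely based on the examples in this article, and the function is an illustration rather than any vendor's product.

```python
# Hypothetical inventory: each component maps to what it directly depends on.
DEPENDS_ON = {
    "website": ["reverse-proxy", "apache-container"],
    "reverse-proxy": ["vmware-cluster"],
    "apache-container": ["centos-vm"],
    "centos-vm": ["scale-cluster", "smb-share", "nfs-share", "iscsi-lun"],
    "smb-share": ["windows-file-server"],
}


def dependency_chain(component, graph):
    """Return the component and everything it relies on, dependencies first.

    The order comes from a depth-first walk, so each item appears only
    after the things it depends on. The component itself comes last.
    """
    ordered, seen = [], set()

    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for dep in graph.get(node, []):
            visit(dep)
        ordered.append(node)

    visit(component)
    return ordered


# Test order for the Web server container, bottom of the stack first.
for name in dependency_chain("apache-container", DEPENDS_ON):
    print(name)
```

Testing in this order checks each link on its own; the last entry is the end-to-end test of the whole chain, which is where problems like the Apache and nginx examples above actually showed up.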

Datacenters aren't going to get less complex, and hybrid IT isn't going away. The question now is which vendors will deliver a solution to this increasingly important systems administration problem.

About the Author

Trevor Pott is a full-time nerd from Edmonton, Alberta, Canada. He splits his time between systems administration, technology writing, and consulting. As a consultant he helps Silicon Valley startups better understand systems administrators and how to sell to them.