Back to the Basics with Server & App Monitoring
IT monitoring is largely broken up into network monitoring and everything else. Trevor Pott looks at the fundamental underpinnings of both network monitoring and server and application monitoring.
Sometimes it's helpful to just go back to the basics. One of the most basic considerations in IT is monitoring. While there's room to quibble, in practice, IT monitoring is largely broken up into network monitoring and everything else. Let's take a look at the basics.
Network monitoring is typically a separate solution from server and application monitoring. Network monitoring concerns itself with the network environment through which applications will communicate. Application and server monitoring focuses on the apps themselves, and the infrastructure upon which they run, with some overlap between the two.
Network Monitoring and Analysis
Network monitoring is complex enough to be the subject not only of its own article, but probably a fairly thick book. Network monitoring is strongly wrapped up in modern IT security, because a lot of modern security relies on being able to observe or manipulate network data streams.
Here, network monitoring and security discussions would branch out into topics such as baselining, microsegmentation, honeypots, automated incident response and threat management via filtering. Network monitoring is also a particularly complex topic because a lot of it involves attempting to deduce information about networks you don't control. This is done by monitoring characteristics beyond the ones you truly want to observe.
You may want to know, for example, which network device along a given path is malfunctioning so that you can contact the appropriate owner and have them repair it. Not owning those network devices, however, the best you can usually do is send pings and traceroutes, watch what happens to the packets and make educated guesses. As cloud computing becomes a more important part of an organization's IT mix, this class of network monitoring and analysis becomes increasingly important.
Basic Network Monitoring
Apart from the more detailed monitoring of the network environment, there's a standard suite of basic network monitoring probes. These basic network metrics are typically under the control of systems administrators, and more directly impact the ability of workloads to function.
In order to avoid IP address exhaustion, a proper server and application monitoring solution needs to be able to keep track of IP addressing and subnetting. Ideally, this would include directly monitoring DHCP servers, both for availability and to read the status of their IP pools. Note that I said "server and application" monitoring here. This is because this level of monitoring is required by administrators keeping an eye on their applications, and cannot be the sole province of network monitoring applications.
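The pool-tracking half of this can be sketched with Python's standard `ipaddress` module. The lease list here is a hypothetical input; a real monitoring solution would read leases from the DHCP server itself:

```python
import ipaddress

def pool_utilization(subnet, leased_ips):
    """Return the fraction of usable host addresses in `subnet` that are
    currently leased. `leased_ips` is a hypothetical stand-in for data a
    real tool would pull from the DHCP server's lease table."""
    net = ipaddress.ip_network(subnet)
    usable = net.num_addresses - 2  # exclude network and broadcast
    leased = sum(1 for ip in leased_ips if ipaddress.ip_address(ip) in net)
    return leased / usable

# Only leases inside the subnet count: 2 of 254 usable addresses here.
leases = ["192.168.1.10", "192.168.1.11", "10.0.0.5"]
print(pool_utilization("192.168.1.0/24", leases))
```

Alerting when utilization crosses a threshold (say, 90 percent) is how exhaustion gets caught before clients start failing to obtain addresses.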
Similarly, server and application monitoring solutions should also be keeping an eye on whether or not different subnets can successfully talk to one another (network mapping), as well as checking to see if switches and routers are online and operational.
In a perfect world, server and application monitoring would also concern itself with observing the DNS environment. This includes checking to see if DNS servers are up, as well as that they can fully resolve canary domains deemed important to ongoing operations. In some configurations, this monitoring is also used to ensure that DNS cannot resolve some domains, which is a means of checking to see that DNS filtering aimed at filtering malicious domains is operational.
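Both halves of that check -- canaries that must resolve, and filtered domains that must not -- can be sketched in a few lines. The domain names are placeholders, and the resolver is injectable so the check can be pointed at any lookup function:

```python
import socket

def dns_health(canaries, blocked, resolve=socket.gethostbyname):
    """Check that important (canary) domains resolve, and that domains a
    DNS filter should block do NOT resolve. Domain names passed in are
    whatever your operations deem important; these are assumptions.
    Returns (canaries_ok, filtering_ok)."""
    def resolves(name):
        try:
            resolve(name)
            return True
        except OSError:
            return False
    return (all(resolves(d) for d in canaries),
            all(not resolves(d) for d in blocked))
```

Injecting `resolve` also makes the check testable without touching the live DNS environment.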
Isalive monitors are the most basic form of monitoring. When most administrators think about monitoring of any kind, they often think of isalive monitors.
Isalive monitors are simple checks, such as verifying that a device responds to a ping, or that a website returns HTTP 200 OK. Isalive monitors do not provide any information on whether a service is functioning correctly. They can only tell if a service is operational.
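One of the most basic isalive checks is simply attempting a TCP connection, which can be sketched with the standard library alone:

```python
import socket

def tcp_isalive(host, port, timeout=2.0):
    """Most basic form of isalive check: can we open a TCP connection at
    all? Says nothing about whether the service behind the port is
    actually functioning correctly."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A ping-based isalive works the same way conceptually, just at the ICMP layer instead of TCP.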
An operating system environment (OSE), for example, may respond to a ping even though the application that it hosts has crashed, or the OSE or application is otherwise unresponsive. In addition to being a basic measure of service availability, isalives are also useful as part of automation.
In the case of virtualization or public cloud Infrastructure as a Service (IaaS), for example, pings are often used to tell when a virtual machine has finished deployment, or has completed a reboot. Automation solutions will usually give a VM a fixed number of seconds to finish loading services after a ping is detected before proceeding to connect to it and apply the next steps in the automation process.
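That "ping, then wait a fixed grace period" pattern can be sketched as a small polling loop. The probe is injected, so any isalive check works; the timing parameters are assumptions, not values from any particular automation product:

```python
import time

def wait_for_boot(is_alive, grace_seconds=30, timeout=600, poll=5):
    """Poll an injected isalive check (e.g. a ping) until the VM responds,
    then wait a fixed grace period so services can finish loading before
    automation connects and applies its next steps. All timings here are
    illustrative assumptions."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_alive():
            time.sleep(grace_seconds)  # let services finish starting
            return True
        time.sleep(poll)
    return False
```

The grace period exists precisely because an isalive proves only that the network stack is up, not that the services the automation needs are ready.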
Because isalives don't return anything more than the most basic information, server and application monitoring solutions tend to have a suite of application-specific (or generic, but customizable) probes. These probes are designed to connect to an application or service and perform a basic, but important action that proves the application or service is operating as expected.
In the case of a Web server, a probe might be configured to do more than look for an HTTP 200 OK. Instead it might request the full Web page and look for a specific HTML tag, usually deliberately placed at the bottom of the page for this purpose. You might also perform a specific SQL query against a database server at regular intervals, or verify an appliance's availability by having an SSH probe log in and obtain the output of a script.
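The Web server case can be sketched as follows. The marker tag's `id` is an assumption -- in practice it is whatever your page template deliberately emits for this purpose:

```python
from urllib.request import urlopen

def page_healthy(url, marker='<span id="health-canary">'):
    """Deeper-than-isalive probe: fetch the full page and confirm that a
    marker deliberately placed at the bottom of the page is present.
    The marker id here is a hypothetical example."""
    try:
        with urlopen(url, timeout=5) as resp:
            body = resp.read().decode("utf-8", "replace")
            return resp.status == 200 and marker in body
    except OSError:
        return False
```

A Web server that answers 200 OK but serves a broken or empty page fails this probe, which is exactly the gap it is meant to close.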
Both network monitoring and server and application monitoring solutions collect event logs. Network monitoring solutions usually ship with the ability to pull log files from common switch and router models, while server and application monitoring solutions need to be able to collect logs from more diverse sources.
The Windows event log provides a single source to collect logs for Windows platforms. While it is possible for individual applications to create and store logs somewhere else, mainstream commercial, off-the-shelf (COTS) Windows applications tend to send at least error information to the Windows event log. Some applications will register fully with the event log system and do all of their log handling through there.
The Windows event log is in fact several different logs made available by the Windows Event Log service, which serves as a log aggregation mechanism for the Windows OSE.
Non-Windows systems typically use plain text log files, though some Linux distributions that use systemd store logs in journald's binary format by default, without also writing a plain text log file. Non-Windows systems also typically ship with syslogd or another syslog-compatible log aggregation service, which is comparable to the Windows Event Log service.
Connecting to an OSE's internal log aggregation service allows for monitoring that device, OSE, or application for warnings and errors. Monitoring and alerting on events in real time is of only limited usefulness, however, and so one of the major purposes of monitoring solutions is logfile aggregation.
Basic logfile aggregation for non-Windows systems is reasonably easy. Virtually every device and OSE that isn't Windows supports sending logs to a remote syslog server out of the box. This gets your logs into one place, but it doesn't let you do much with them.
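From an application's side, forwarding to a remote syslog server is a few lines with Python's standard library. The server hostname and port would be whatever your central aggregator uses; everything here is an illustrative sketch:

```python
import logging
import logging.handlers

def remote_syslog_logger(name, host, port=514):
    """Return a logger that forwards records over UDP to a remote syslog
    server -- the out-of-the-box forwarding path nearly every non-Windows
    device and OSE supports. Host and port are assumptions; point them at
    your real aggregator."""
    log = logging.getLogger(name)
    log.setLevel(logging.INFO)
    log.addHandler(logging.handlers.SysLogHandler(address=(host, port)))
    return log

# In production this would point at the central log server, e.g.:
# log = remote_syslog_logger("myapp", "syslog.example.internal")
# log.warning("disk usage above 90%")
```

This gets records off the box, which is the easy part; the analysis and reporting discussed next is where solutions differentiate themselves.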
Monitoring tools are chosen in large part because of the analytics and reporting capabilities they offer, and those capabilities are entirely dependent on the quality of the tool's logfile centralization. Efficient, performant logfile centralization that supports a diversity of devices, OSEs and applications provides the raw material for analysis and reporting that produces actionable information.
Monitoring applications also collect performance data from the solutions they monitor. Again, efficient centralization of this data is a key differentiator for monitoring vendors.
Some monitoring applications allow administrators to connect to a solution's performance monitoring capabilities to observe performance feedback in real time; however, performance counters can generate an overwhelming amount of information. This means that in practice not all performance data is collected, let alone stored or analyzed.
Unlike event log management, there isn't a lot of standardization of performance information. While individual OSEs tend to have a centralized performance monitoring service, it is common for applications to build their own performance monitoring service, rather than use the native functionality of the OSE.
Some solutions require a monitoring agent to be deployed to the OSE or device in order to collect all the required data. Agents may also be used to briefly buffer data -- typically performance data -- in order to help compensate for adverse network conditions.
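The buffering role an agent plays can be sketched as a small bounded queue: samples accumulate locally and are handed over in batches, so a brief network outage doesn't lose data. This is a conceptual sketch, not any vendor's agent design:

```python
from collections import deque

class SampleBuffer:
    """Sketch of an agent-side buffer: hold performance samples locally
    and hand them over in batches when the collector is reachable. The
    buffer is bounded, so the oldest samples are dropped if a network
    outage outlasts the buffer's capacity."""
    def __init__(self, capacity=1000):
        self._buf = deque(maxlen=capacity)

    def record(self, sample):
        self._buf.append(sample)  # silently evicts oldest when full

    def flush(self):
        """Return all buffered samples and start a fresh buffer."""
        batch, self._buf = list(self._buf), deque(maxlen=self._buf.maxlen)
        return batch
```

The bounded capacity is the key design choice: an agent must not consume unbounded memory on the monitored host just because the monitoring server went away.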
Agents are more likely to be required in order to provide in-depth monitoring of certain applications than they are to perform basic monitoring of a device or OSE. Agentless monitoring is possible using the SNMP protocol, or the Windows Management Instrumentation (WMI) service.
Outside of Windows, agentless monitoring is a more complex affair. Linux exposes a lot of data via the /proc and /sys file systems, and other non-Windows OSEs usually have similar functionality. There are also WBEM, D-Bus, OpenLMI (and many more), which can serve a purpose similar to WMI.
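As a taste of what /proc offers, here is a small parser for the key/value format of Linux's /proc/meminfo -- the kind of raw material an agentless monitor reads, often over SSH, when no agent is installed:

```python
def parse_meminfo(text):
    """Parse the key/value lines of Linux's /proc/meminfo into a dict of
    kilobyte values. Each line looks like 'MemTotal:  16384256 kB'."""
    stats = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            stats[key] = int(fields[0])  # value is in kB
    return stats

# On a real Linux host: stats = parse_meminfo(open("/proc/meminfo").read())
sample = "MemTotal:       16384256 kB\nMemAvailable:    8192128 kB"
print(parse_meminfo(sample)["MemAvailable"])  # -> 8192128
```

Every OSE exposes this sort of data differently, which is exactly why the breadth of a monitoring solution's support matters so much.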
As is normal in the open source world, the diversity of solutions has prevented any real standardization from occurring, making the completeness of a monitoring solution's support increasingly important in the modern heterogeneous, hybrid multi-cloud world.
There is a lot to monitor, and monitoring applications often represent a significant investment. Before selecting a monitoring application it helps to have a good idea of what you wish to monitor. It's also worth looking at the history of the vendors being considered, and how quickly they add support to their products for new applications, as well as whether they have a history of fully supporting heterogeneous environments.
Entirely apart from the cost of monitoring solutions, configuring monitoring is a time-consuming process. Making sure that you won't have to throw away your monitoring solution a few years down the road because it won't offer support for what your organization uses is important.
Organizations may rotate out servers, switches and even applications every few years. Monitoring solutions, however, need to stay useful for decades.
Trevor Pott is a full-time nerd from Edmonton, Alberta, Canada. He splits his time between systems administration, technology writing, and consulting. As a consultant he helps Silicon Valley startups better understand systems administrators and how to sell to them.