Dan's Take

Application Troubleshooting Gets Harder

A survey showed that more than a third of companies rely on user notifications to alert them to trouble, instead of monitoring tools.

Although I seldom comment on vendor-created surveys, one recently published by Stackify offered some interesting and quite possibly useful information about dealing with application performance problems and outages. In the company's words, "The survey showed that 37% of respondents rely on user notifications to identify issues, and many problems take more than a half day to rectify, but despite these alarming troubles, the survey also revealed that adoption of next generation unified application troubleshooting tools drastically improves response times and minimizes customer impact."

Key findings from the Stackify 2014 Application Troubleshooting Report include:

  • 85% of organizations are utilizing multiple internally developed applications, with more than one-third developing and supporting over 10 applications.
  • While logs and errors topped the list of data sources used to troubleshoot application issues, error aggregation tools fell behind infrastructure monitoring and notification tools in a list of the top tools.
  • Even with log management tools at the top of the list, at least one-third of organizations aren't using any tools, making application troubleshooting largely a manual process of collecting and correlating error, log and supporting data.
  • Organizations with standalone troubleshooting tools reported that 52% of issues take half a day to trace to a root cause, whereas those with integrated tools reported only 37%.
  • Organizations using integrated tools are able to resolve issues a full 80% of the time without impacting users, whereas those using standalone tools do so only 48% of the time.

It's clear that the IT administrator's life is very complicated and that he/she is expected to monitor many systems, which are made up of many distributed components, using a patchwork quilt of tools.

Dan's Take: Complex Systems Need Strong Troubleshooting Tools
Our production systems are increasingly made up of services that are distributed, multi-system and even multi-data-center processes. These production systems just might include components executing on mainframes, midrange UNIX or single-vendor operating systems, industry-standard systems executing Windows or Linux, and a host (pun intended) of other data center and communications components. When something slows down or stops completely, IT administrative staff may have a difficult time discovering what's actually happening, locating the error or errors, and quickly addressing the problem.

Many suppliers, including companies such as AppNeta, BMC, CA, Exoprise, Logentries, Loggly, HP, IBM, Stackify, Splunk and Sumo Logic, offer performance and systems management technology designed to dive into logs of operational and application data to hunt down the problem. Stackify's survey, while somewhat self-serving, also brings into focus what is typically happening in many data centers. It would be hard to contest the results of the survey.

About the Author

Daniel Kusnetzky, a reformed software engineer and product manager, founded Kusnetzky Group LLC in 2006. He's literally written the book on virtualization and often comments on cloud computing, mobility and systems software. He has been a business unit manager at a hardware company and head of corporate marketing and strategy at a software company.

