DevOps Agile IT: Monitoring to reduce Mean Time To Recovery (MTTR)

One of the goals of DevOps Agile IT is to reduce the Mean Time To Recovery (MTTR). The monitoring part of the DevOps toolchain plays a major part in measuring and reducing repair time. However, setting up decent monitoring is not an easy feat, especially in large enterprises with loads of legacy systems. I have seen several larger companies both fail and succeed at it, and I think it is worth sharing some of their successes and failures. You will find that some reductions in MTTR turn out to be quite simple.

Breaking down MTTR

Before I start talking about reducing MTTR, it helps to define the stages between the moment something breaks down and the recovery of the service. The first “stage”, of course, is preventing the need for repairs in the first place! Standardization, automation, testing, and infrastructure and application design are all key in preventing downtime, but the sad thing is that things will still break down from time to time. The stages I will focus on in this blog are detection, assignment, diagnosis and repair time.
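To make the four stages concrete, here is a minimal sketch of how a single incident could be broken down. The timestamp names (failed, detected, assigned, diagnosed, recovered) and the example incident are my own assumptions for illustration, not fields from any particular tool.

```python
from datetime import datetime

def stage_durations(incident):
    """Split one incident into the four stages, each in minutes."""
    order = ["failed", "detected", "assigned", "diagnosed", "recovered"]
    names = ["detection", "assignment", "diagnosis", "repair"]
    times = [datetime.fromisoformat(incident[key]) for key in order]
    # Each stage runs from one timestamp to the next one in the chain.
    return {
        name: (later - earlier).total_seconds() / 60
        for name, earlier, later in zip(names, times, times[1:])
    }

# Invented example incident, purely for illustration.
incident = {
    "failed": "2024-01-10T02:00:00",
    "detected": "2024-01-10T02:25:00",
    "assigned": "2024-01-10T02:40:00",
    "diagnosed": "2024-01-10T03:20:00",
    "recovered": "2024-01-10T03:35:00",
}
stages = stage_durations(incident)
time_to_recovery = sum(stages.values())
```

Averaging time_to_recovery over all incidents in a period gives the MTTR, and the per-stage numbers show where most of the time is lost.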

“If you want to start monitoring, the first thing you must realize is that you probably already are!”

Lots of the technology you bought comes with some form of (network) element manager, including its own monitoring capabilities. For example, Windows shops often run Microsoft System Center Operations Manager (SCOM). Inevitably, somebody in your organization has installed Nagios or built a custom-made monitoring tool. Previous attempts to set up monitoring have probably taken place. And there might even be an event console somewhere with a yucky IBM, HP or BMC logo on it, which hardly anybody pays attention to.

Start monitoring based on already available data

Don’t throw out those old shoes to build something new. What really helps you through those war-room nights is an enterprise-wide view of the monitoring information that is readily available, which can range from datacenter information all the way up the stack to business application data. You’ll be surprised by the amount of useful information you already have. So, what you need is a centralized data lake (for example Elasticsearch or Splunk) on top of the existing tools, and you need to make sure everybody shares their data into that data lake.
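Sharing data into one data lake works best if every tool's events end up in the same shape. Below is a rough sketch of that idea. The two payload formats (tool_a and tool_b) and all field names are hypothetical, and the actual shipping to Elasticsearch or Splunk is deliberately left out.

```python
def from_tool_a(raw):
    """Hypothetical payload shape: {'host': ..., 'msg': ..., 'ts': ...}."""
    return {"source": "tool_a", "host": raw["host"],
            "message": raw["msg"], "time": raw["ts"]}

def from_tool_b(raw):
    """Hypothetical payload shape: {'node': ..., 'text': ..., 'when': ...}."""
    return {"source": "tool_b", "host": raw["node"].lower(),
            "message": raw["text"], "time": raw["when"]}

# One normalizer per existing monitoring tool (names are assumptions).
NORMALIZERS = {"tool_a": from_tool_a, "tool_b": from_tool_b}

def normalize(source, raw):
    """Map a tool-specific event onto the shared data-lake schema."""
    return NORMALIZERS[source](raw)

event = normalize("tool_b", {"node": "WEB01", "text": "disk 91% full",
                             "when": "2024-01-10T02:25:00"})
```

The existing tools stay in place. Only a thin mapping layer per tool is added, which keeps the barrier for teams to share their data low.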

Notifications are key for proactive monitoring

A quick win for reducing detection time is to make sure you have central notification capabilities via phone, SMS or an app (e.g. by using PagerDuty).

For failures to be detected, it helps if somebody other than the customer is paying attention. You might have seen monitoring dashboards on a big screen in your office building, where things turn red if something breaks down, sometimes even with a nice beeping noise. Well, that won’t really help if you’re in bed, now will it?

Notification can cut considerable chunks off the MTTR, especially at night. Of course, some customer will eventually complain and notify you that something is wrong, but the relevant alerts on a dashboard might come from a related system, which means it can take quite some time to diagnose which system actually failed. If you’ve set up notification, you might even prevent a customer complaint if you’re lucky. A second quick win is the assignment time you save, because the notification can be directed straight to the responsible on-call engineer, instead of first diagnosing which part of your IT landscape failed and then making the call.
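The assignment part can be as simple as a lookup from the failing component to whoever is on call for it. The sketch below assumes a hypothetical on-call table and alert format. It only builds the notification and does not call PagerDuty or any other real service.

```python
# Hypothetical on-call rota: component type -> engineer on duty.
ON_CALL = {"database": "alice", "network": "bob"}
FALLBACK = "service-desk"  # used when nobody is registered for the component

def route_alert(alert):
    """Build a notification addressed straight to the responsible engineer."""
    recipient = ON_CALL.get(alert["component_type"], FALLBACK)
    # Assumption: critical alerts justify a phone call, the rest an SMS.
    channel = "phone" if alert["critical"] else "sms"
    return {
        "to": recipient,
        "channel": channel,
        "text": f"{alert['component']}: {alert['summary']}",
    }

notification = route_alert({
    "component_type": "database",
    "component": "billing-db",
    "summary": "not responding",
    "critical": True,
})
```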

Classify event data to reduce MTTR

Don’t be the boy who cried wolf. A common issue is a huge flood of event, alert or notification data, which leads to people ignoring the information. What a single engineer might not realize is that the total amount of event data at enterprise level can be humongous, so dumping a lot of monitoring data into the data lake without any “hygiene” is not helpful in itself, even though it might be well intended. What really helps is to define types of events and classify event data:

First, an event is a measurable occurrence whose relevance has not yet been determined (including metric data). These events should be classified as “Diagnostic” and can be used for deep analysis and for understanding the behavior of your IT system. Think about stuff like syslogs, application logs, SNMP traps, etc.
A special type of event is one that requires you to act. I’ll call those events alerts. For example, a warning that your disk is almost full.
The last type is a special alert that indicates a part of the IT landscape has broken down. I call this one a health-state alert.

Note: You can use event severity or criticality for this kind of classification, but often these are already in use without a consistent definition at enterprise level.
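The three types above could be captured in a few lines like the following. The flags component_down and action_required are assumptions of mine. In practice you would derive them from whatever fields your tools provide.

```python
from enum import Enum

class EventClass(Enum):
    DIAGNOSTIC = "diagnostic"      # relevance not yet determined
    ALERT = "alert"                # somebody has to act
    HEALTH_STATE = "health-state"  # part of the landscape has broken down

def classify(event):
    """Classify an event, checking the most specific type first."""
    if event.get("component_down"):
        return EventClass.HEALTH_STATE
    if event.get("action_required"):
        return EventClass.ALERT
    return EventClass.DIAGNOSTIC

examples = [
    {"message": "user logged in"},
    {"message": "disk 91% full", "action_required": True},
    {"message": "web server unreachable", "action_required": True,
     "component_down": True},
]
classes = [classify(e) for e in examples]
```

Checking the health-state flag first matters, because a health-state alert is also an alert and would otherwise never get its own class.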

The health-state alert is the type that will help you to quickly form a hypothesis of what might be going on, pinpoint root causes or the required support groups, and cut down the time spent on diagnosis. This works especially well when you understand the makeup of both the vertical relations (e.g. which application runs on which server) and the horizontal relations (e.g. application-to-application) in your IT landscape. How you can achieve that is a whole new blog of its own, so I won’t go into that here. But a simple tip is that most companies have some idea about the makeup of the important chains, so you can set up a single dashboard showing the health state of all related components for the parts of the chain you do understand (don’t forget to update it now and then). Also make sure you create capabilities to report planned downtime and mechanisms to ignore the related health-state alerts.
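Ignoring health-state alerts during planned downtime can be sketched as below. The downtime window, the component names and the dependency map are all invented for the example. The dependency map stands in for the vertical and horizontal relations you do know about.

```python
from datetime import datetime

# Hypothetical planned-downtime registrations.
PLANNED_DOWNTIME = [
    {"component": "billing-db",
     "start": datetime(2024, 1, 10, 1, 0),
     "end": datetime(2024, 1, 10, 3, 0)},
]

# Hypothetical relations: component -> components it depends on.
DEPENDS_ON = {"billing-app": ["billing-db"]}

def is_suppressed(alert, windows=PLANNED_DOWNTIME):
    """True if the component, or something it depends on, is in planned downtime."""
    related = {alert["component"], *DEPENDS_ON.get(alert["component"], [])}
    return any(
        w["component"] in related and w["start"] <= alert["time"] < w["end"]
        for w in windows
    )

during = is_suppressed({"component": "billing-app",
                        "time": datetime(2024, 1, 10, 2, 0)})
after = is_suppressed({"component": "billing-app",
                       "time": datetime(2024, 1, 10, 4, 0)})
```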

Automate the deployment of monitoring

Automate deployments of monitoring (agents) and the configuration of health-state monitoring as part of the stacks you deploy.

As well as having too many events, you can also have too few. The easiest way of making sure you get valuable health-state data is to automate the deployment of monitoring. There are some really obvious things you can set up monitoring for by default. Here are some examples:

Is the (web) server up-and-running?
Are required ports open?
Can the CPU cope with the load?
Is the back-end database running?

And you can set up proactive solutions, creating alerts for disk or memory capacity warnings, passwords or certificates that are about to expire, etc.
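Two of these default checks, “are required ports open?” and a disk capacity warning, fit in a few lines using only the Python standard library. This is a sketch of what an automatically deployed check could look like, not a replacement for a monitoring agent. The 90% threshold is an arbitrary assumption.

```python
import shutil
import socket

def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def disk_almost_full(path, threshold=0.9):
    """Return True if the used fraction of the disk holding path reaches the threshold."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total >= threshold
```

A deployment script could run checks like these right after rolling out a stack, for example port_is_open("localhost", 443) for a web server, and register them as health-state checks for that stack.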

Understanding the behavior of IT systems will really get you going!

Besides that, a very useful key indicator that something broke down is looking at flows, for example the number of transactions flowing through applications. This is slightly more advanced, though, since zero transactions might be either a failure or normal behavior (e.g. during the night).
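One simple way to deal with the “zero can be normal” problem is to compare the flow with what is usual for that hour of the day. The baseline numbers and the 20% floor below are made up for illustration. A real baseline would come from your own transaction history.

```python
# Hypothetical baseline: typical number of transactions per hour of the day.
BASELINE = {3: 0, 14: 500}

def flow_is_suspicious(hour, transactions, baseline=BASELINE, floor=0.2):
    """Flag a flow that drops far below what is normal for this hour."""
    expected = baseline.get(hour, 0)
    if expected == 0:
        # Quiet hour (or no history): zero transactions is normal behavior.
        return False
    return transactions < expected * floor

night = flow_is_suspicious(3, 0)       # nothing expected at 03:00
afternoon = flow_is_suspicious(14, 0)  # 500 expected at 14:00
```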

Let teams use the event data to expand on creating health state alerts. Create some dashboards on the number of errors and warnings per server over time and evaluate what errors or warnings were created at times of unavailability or bad performance. Understanding the application behavior in terms of transaction data on its own and in relation to errors and warnings is great input for the continuous improvement cycle and preventing down-time altogether. It is also useful for creating sound data models if you want to do more advanced things like utilizing anomaly detection.
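The data behind such a dashboard is little more than a count per server per time bucket. A minimal sketch, assuming events that carry a server, a level and an ISO timestamp (field names are my assumption):

```python
from collections import Counter
from datetime import datetime

def errors_per_server_hour(events):
    """Count errors and warnings per (server, hour) bucket."""
    counts = Counter()
    for event in events:
        if event["level"] in ("error", "warning"):
            # Truncate the timestamp to the start of its hour.
            hour = datetime.fromisoformat(event["time"]).replace(
                minute=0, second=0, microsecond=0)
            counts[(event["server"], hour.isoformat())] += 1
    return counts

counts = errors_per_server_hour([
    {"server": "web01", "level": "error", "time": "2024-01-10T02:05:00"},
    {"server": "web01", "level": "warning", "time": "2024-01-10T02:40:00"},
    {"server": "web01", "level": "info", "time": "2024-01-10T02:41:00"},
    {"server": "db01", "level": "error", "time": "2024-01-10T03:01:00"},
])
```

Laying these counts next to the moments of unavailability or bad performance is what shows which errors and warnings are worth turning into health-state alerts.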

“Monitoring often depends on local heroes and does not grow to a mature enterprise way of working.”

Once you understand your IT system behavior, repair time can be reduced

Since you don’t want to cause unnecessary downtime because, for example, you’ve automated restarting some daemon, it helps if you first really understand the behavior of your IT system. Often you will find that taking away root causes is the better solution. However, runbook automation combined with monitoring can establish some basic auto-healing and cut considerable amounts off your repair time. The runbook automation should then automate processes like troubleshooting, starting and stopping systems, triggering evasive routes, etc.
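To keep an automated restart from causing more downtime than it solves, the runbook can be given a guard: restart a limited number of times within a time window, and hand over to a human after that. The sketch below is one possible shape for this. The class name, the limits and the restart callable are assumptions, and the actual restart command is left to the caller.

```python
import time

class GuardedRestart:
    """Auto-restart a service, but escalate if it keeps failing."""

    def __init__(self, restart, max_restarts=3, window_seconds=3600,
                 clock=time.monotonic):
        self.restart = restart          # callable that performs the restart
        self.max_restarts = max_restarts
        self.window = window_seconds
        self.clock = clock              # injectable, which helps with testing
        self.history = []               # times of recent automatic restarts

    def handle_failure(self):
        now = self.clock()
        # Forget restarts that fall outside the time window.
        self.history = [t for t in self.history if now - t < self.window]
        if len(self.history) >= self.max_restarts:
            return "escalate"           # stop healing, page the engineer
        self.history.append(now)
        self.restart()
        return "restarted"

restarts = []
guard = GuardedRestart(lambda: restarts.append("daemon"), max_restarts=2,
                       clock=lambda: 0.0)
outcomes = [guard.handle_failure() for _ in range(3)]
```

A daemon that needs restarting three times in an hour is telling you about a root cause, which is exactly the case where escalating beats healing.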

Closing statements

Some of the things I’ve noted down might seem like common sense, but if there is one thing I have learned, it’s that it can be helpful to state the obvious! Decision makers in particular are not always aware of these things, but I have also encountered quite a few engineers with poor monitoring knowledge. Monitoring often depends on local heroes and does not grow to a mature enterprise way of working.

As you may have noticed, I haven’t written anything about monitoring MTTR itself to measure whether you’re making any progress. Again, that’s a whole new blog of its own. If I get sufficient positive response to this blog, I might consider sharing some more experiences.