Mean time to detect (MTTD) is one of the key performance indicators in incident management. It refers to the average amount of time it takes for an organization to discover, or detect, an incident. The sooner an organization finds out about a problem, the better. For instance, in software development, we know that bugs are cheaper to fix the sooner you find them. When it comes to system outages, every second of downtime adds to the financial loss, so you want to get your systems back online ASAP.
Because of that, it makes sense that you’d want to keep your organization’s MTTD values as low as possible. After all, you want to discover problems fast and solve them faster. However, there are more reasons why keeping a low value for MTTD is desirable, and we’ll address them today since this post is all about MTTD.
You’ll learn in more detail what MTTD represents inside an organization. You’ll see what detection time is and why it’s important. And since it wouldn’t make much sense to write a whole post about a metric without teaching how to calculate it, we’ll also show you how to calculate MTTD in practice. It’s probably easier than you imagine. Finally, we’ll look at how the metric fits into a DevOps practice and at the kind of tooling that makes it easier to track.
Sound good? Then let’s dig in.
Defining MTTD
MTTD stands for mean time to detect—although mean time to discover also works. MTTD is an essential indicator in the world of incident management. It indicates how long it takes for an organization to discover or detect problems.
Both the name and definition of this metric make its importance very clear. After all, we all want incidents to be discovered sooner rather than later, so we can fix them ASAP. The longer a problem goes unnoticed, the more time it has to wreak havoc inside a system.
However, that’s not the only reason why MTTD is so essential to organizations. There’s another, subtler reason we’ll examine next.
Why Keeping Your MTTD Down Matters So Much
The sooner you learn about issues inside your organization, the sooner you can fix them. When you have the opportunity to fix a problem sooner rather than later, you most likely should take it. Fixing problems as quickly as possible not only stops them from causing more damage; it’s also easier and cheaper.
MTTD is also a valuable metric for organizations adopting DevOps. Why is that? Simple: tracking and improving your organization’s MTTD can be a great way to evaluate the fitness of your incident management processes, including your log management and monitoring strategies.
Think about it: If an organization has a great incident management strategy in place, including solid monitoring and observability capabilities, it shouldn’t have trouble detecting issues quickly. In other words, low MTTD is evidence of healthy incident management capabilities. The opposite is also true: Taking too long to discover incidents isn’t bad only because of the incident itself. It’s also a testimony to how poor an organization’s monitoring approach is.
How Is MTTD Calculated?
Calculating mean time to detect isn’t hard at all. Start by measuring how much time passed between when an incident began and when someone discovered it. If an incident started at 8 PM and was discovered at 8:25 PM, detection took 25 minutes.
From there, you should use records of detection time from several incidents and then calculate the average detection time.
For instance, consider the following table:
Start time | Detection time | Elapsed time (minutes) |
2:35 AM | 3:35 AM | 60 |
4:13 AM | 5:30 AM | 77 |
5:10 PM | 5:55 PM | 45 |
1:55 AM | 2:25 AM | 30 |
The table above shows the start and detection times for four incidents, as well as the elapsed time, expressed in minutes.
To calculate the MTTD for the incidents above, simply add up the elapsed times and then divide by the number of incidents:
(60 + 77 + 45 + 30) / 4
The calculation above results in 53. So, the mean time to detect for the incidents listed in the table is 53 minutes.
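The same arithmetic can be sketched in a few lines of Python. This is only an illustration, not a prescribed implementation: the function names are made up for this example, the second row uses 5:30 AM (the detection time that matches the 77 minutes listed), and the code assumes each incident starts and is detected on the same day, since the table lists only clock times.

```python
from datetime import datetime

# (start, detected) clock times from the table above.
# Assumption: both times of each pair fall on the same calendar day.
INCIDENTS = [
    ("2:35 AM", "3:35 AM"),
    ("4:13 AM", "5:30 AM"),
    ("5:10 PM", "5:55 PM"),
    ("1:55 AM", "2:25 AM"),
]

TIME_FORMAT = "%I:%M %p"  # 12-hour clock with AM/PM


def minutes_to_detect(start: str, detected: str) -> float:
    """Elapsed minutes between an incident's start and its detection."""
    delta = datetime.strptime(detected, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return delta.total_seconds() / 60


def mean_time_to_detect(incidents) -> float:
    """Average detection time, in minutes, over a list of (start, detected) pairs."""
    elapsed = [minutes_to_detect(start, detected) for start, detected in incidents]
    return sum(elapsed) / len(elapsed)


print(mean_time_to_detect(INCIDENTS))  # 53.0
```

In a real system you would store full timestamps (date and time) rather than clock times, so that incidents spanning midnight are handled correctly.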
Going Further
This is just a simple example. Depending on your organization’s needs, you can make the MTTD calculation more sophisticated. For instance, an organization might want to remove outliers from its list of detection times, since values that are much higher or much lower than the rest can easily skew the resulting average.
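To see how much a single outlier can move the average, here is a minimal sketch. The 600-minute incident is invented for illustration, and the filtering rule (keep values within a factor of three of the median) is just one simple assumption, not a standard; your organization may prefer a different rule, or may simply report the median alongside the mean.

```python
from statistics import mean, median


def filter_outliers(minutes, factor=3.0):
    """Keep values within [median / factor, median * factor].

    The factor of 3 is an arbitrary, illustrative choice.
    """
    mid = median(minutes)
    return [m for m in minutes if mid / factor <= m <= mid * factor]


# The four detection times from the table, plus one made-up 600-minute outlier.
detection_minutes = [60.0, 77.0, 45.0, 30.0, 600.0]

print(mean(detection_minutes))                   # 162.4, dominated by the outlier
print(median(detection_minutes))                 # 60.0, barely affected
print(mean(filter_outliers(detection_minutes)))  # 53.0, outlier removed
```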
Also, bear in mind that not all incidents are created equal. They might differ in severity, for example. When allocating resources, it makes sense to prioritize issues that are more pressing, such as security breaches. That’s why some organizations choose to tier their incidents by severity. That way, you can calculate a value of MTTD for each of those layers, which might allow you to get a more detailed and granular view of your organization’s incident response capabilities.
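A per-severity calculation might look like the sketch below. The severity labels and the detection times for the "critical" tier are hypothetical, chosen only to show the grouping; the "minor" values reuse the four detection times from the table.

```python
from collections import defaultdict
from statistics import mean


def mttd_by_severity(incidents):
    """Return {severity: mean detection minutes} for (severity, minutes) pairs."""
    grouped = defaultdict(list)
    for severity, minutes in incidents:
        grouped[severity].append(minutes)
    return {severity: mean(values) for severity, values in grouped.items()}


# Hypothetical data: severity tier and minutes to detect.
incidents = [
    ("critical", 12.0),
    ("critical", 18.0),
    ("minor", 60.0),
    ("minor", 77.0),
    ("minor", 45.0),
    ("minor", 30.0),
]

print(mttd_by_severity(incidents))  # {'critical': 15.0, 'minor': 53.0}
```

Splitting the metric this way keeps a large number of slowly detected low-severity incidents from hiding how quickly (or slowly) the critical ones are caught.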
Finally, keep in mind that for something like MTTD to work, you need a way to keep track of when incidents start and when they are detected. In other words, your systems need to record information about specific events. That’s where monitoring and observability practices such as logging shine: they let organizations understand what is happening inside their systems by looking at the signals those systems emit.
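As a rough illustration of what such records make possible, the sketch below pairs "started" and "detected" events to derive detection times. The record layout (the `incident_id`, `event`, and `timestamp` fields), the incident IDs, and the dates are all assumptions invented for this example; real logging and monitoring tools each have their own formats.

```python
from datetime import datetime


def detection_minutes(events):
    """Map each incident ID to its minutes-to-detect.

    Incidents without both a 'started' and a 'detected' event are skipped.
    """
    started, detected = {}, {}
    for record in events:
        timestamp = datetime.fromisoformat(record["timestamp"])
        if record["event"] == "started":
            started[record["incident_id"]] = timestamp
        elif record["event"] == "detected":
            detected[record["incident_id"]] = timestamp
    return {
        incident_id: (detected[incident_id] - started[incident_id]).total_seconds() / 60
        for incident_id in started
        if incident_id in detected
    }


# Hypothetical event records, e.g. parsed from structured logs.
events = [
    {"incident_id": "INC-1", "event": "started", "timestamp": "2024-01-01T20:00:00"},
    {"incident_id": "INC-1", "event": "detected", "timestamp": "2024-01-01T20:25:00"},
    {"incident_id": "INC-2", "event": "started", "timestamp": "2024-01-02T23:50:00"},
    {"incident_id": "INC-2", "event": "detected", "timestamp": "2024-01-03T00:20:00"},
    {"incident_id": "INC-3", "event": "started", "timestamp": "2024-01-04T09:00:00"},
]

per_incident = detection_minutes(events)
print(per_incident)  # {'INC-1': 25.0, 'INC-2': 30.0}
print(sum(per_incident.values()) / len(per_incident))  # 27.5
```

Because the records carry full timestamps, an incident that starts before midnight and is detected after it (INC-2 here) is handled correctly, and an incident that has not been detected yet (INC-3) is simply left out of the average.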
Move Fast, Don’t Break Things. But If You Do, Learn About It ASAP!
In the ultra-competitive era we live in, tech organizations can’t afford to go slow. But they also can’t afford to ship low-quality software or allow their services to be offline for extended periods. That’s why adopting concepts like DevOps is so crucial for modern organizations. Undergoing a DevOps transformation can help organizations adopt the processes, approaches, and tools they need to go fast and not break things.
For DevOps teams, it’s essential to have metrics and indicators. You can use those to evaluate your organization’s effectiveness in handling incidents. Mean time to detect isn’t the only metric available to DevOps teams, but it’s one of the easiest to track.
MTTD is an essential metric for any organization that wants to avoid problems like system outages. The sooner you learn about an issue, the sooner you can fix it, and the less damage it can cause.
However, there’s another critical use case for this metric. It can serve as a thermometer, so to speak, for the health of an organization’s incident management capabilities. Think about it: if your organization has a great strategy for discovering outages and system flaws, you can likely respond to incidents, and fix them, quickly. The opposite is also true: if it takes too long to discover issues, that’s a sign that your organization might need to improve its incident management protocols.
If this sounds like your organization, don’t despair! Knowing how you can improve is half the battle. The next step is to arm yourself with tools that can help improve your incident management response. And like always, we’ve got you covered.
For example, a log management solution that offers real-time monitoring can be an invaluable addition to your workflow. Consider Scalyr, a comprehensive platform that will give you excellent visualization capabilities, super-fast search, and the ability to track many important metrics in real time. If your organization struggles with incident management and mean time to detect, Scalyr can help you get on track. Give Scalyr a try today.
Thanks for reading, and until next time!