Logs, metrics, and traces are often regarded as the three pillars of observability. But working with them separately doesn’t make your systems more observable. Worse, if you have to jump between three different tools every time you need to troubleshoot in production, you’re going to have a hard time finding the problem.
In this post, I’ll explain what each of these pillars is and when each one becomes useful, and I’ll include a few suggestions on how you can start implementing them. Some people argue that there are no “three pillars” of observability at all, so we’ll cover their opinions too.
Observability Pillars
Distributed systems have pushed us to make our systems more observable. When you work with multiple microservices developed by different teams, finding where the problem lives can be frustrating. I’ll never forget a time when I had to debug systems by SSHing into the servers and parsing the logs by hand. We were monitoring standard infrastructure metrics like CPU, memory, and networking, but each of them was telling us everything was fine, while an external health check tool was telling us that the system was intermittently down.
We evolved our setup by centralizing logs in one tool while still pulling metrics from another monitoring tool. Even so, we had a hard time whenever an unknown fault appeared: coordinating different tools, each giving us slightly different information, was painful.
If we take a holistic approach to logs, metrics, and traces, we can ease the pain of debugging. Instead of implementing a separate solution for each pillar, as if they were isolated sources of data, implement one solution for the system as a whole. When you ignore that shared context, you end up patching the system and accumulating technical debt.
The first source of information we commonly use to find a solution is metrics.
Metrics
A metric is a numeric value measured over an interval of time. Metrics consist of a set of attributes like a name, a timestamp, a value, and labels. For example, a server’s average CPU consumption over the last five minutes is a common metric. DevOps engineers, SREs, and sysadmins use metrics to trigger alerts when a number crosses a threshold. Metrics are also central to the SRE model: they help define service-level indicators (SLIs), service-level objectives (SLOs), and service-level agreements (SLAs).
We use metrics to determine the health of the system. Metrics are best known for describing the status of resources, but you can also instrument your code with libraries like OpenCensus to emit custom metrics and get better insights into the system (there’s a small sketch of what that looks like after this paragraph). Nonetheless, there’s a caveat with metrics: when an alert fires on a metric, we’re usually looking at a known problem that has happened before. Ideally, we know what the implications are when that metric crosses its threshold, and we know how to fix it (again, ideally). Other times, our metrics tell us the system is healthy, yet some users keep complaining that the system is down.
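Here’s that sketch: a minimal example of emitting a custom metric with the OpenCensus Python library, loosely based on its quickstart. The metric name (task_latency), the bucket boundaries, and the region tag are made-up examples, and a real setup would also register an exporter (Prometheus, Stackdriver, and so on) to ship the data somewhere.

from opencensus.stats import aggregation as aggregation_module
from opencensus.stats import measure as measure_module
from opencensus.stats import stats as stats_module
from opencensus.stats import view as view_module
from opencensus.tags import tag_key as tag_key_module
from opencensus.tags import tag_map as tag_map_module
from opencensus.tags import tag_value as tag_value_module

# A custom measure: how long a task took, in milliseconds.
m_latency_ms = measure_module.MeasureFloat(
    "task_latency", "Latency of a task", "ms")

# A tag key adds context (labels) to every recorded value.
key_region = tag_key_module.TagKey("region")

# A view aggregates recorded values into the time series a backend can read.
latency_view = view_module.View(
    "task_latency_distribution",
    "Distribution of task latencies",
    [key_region],
    m_latency_ms,
    aggregation_module.DistributionAggregation(
        [25.0, 50.0, 100.0, 200.0, 400.0, 800.0]))

stats = stats_module.stats
stats.view_manager.register_view(latency_view)

# Record one measurement, tagged with the region it came from.
mmap = stats.stats_recorder.new_measurement_map()
tmap = tag_map_module.TagMap()
tmap.insert(key_region, tag_value_module.TagValue("us-east-1"))
mmap.measure_float_put(m_latency_ms, 137.0)
mmap.record(tmap)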
As that caveat shows, metrics alone are not sufficient. You need more context. (One way to add context to a metric is with tags or labels, like the region tag above.) So, the next valuable source of information for observing and debugging systems is logs.
Logs
Logs are the information we tend to look at only when things go bad. A log is a line of text that describes an event that happened at a certain time. Depending on the system that produces them, logs sometimes come in plain text, although the trend now is to emit structured logs so that they can be parsed easily and queried effectively when debugging. A log consists of a timestamp and a payload that gives more context about the event.
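As an illustration of that trend, here’s a minimal sketch using only the Python standard library that emits each log event as a single JSON line with a timestamp and a payload. The logger name and fields such as user_id and http_status are made-up examples.

import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    # Render every record as one JSON line: timestamp, level, message, payload.
    def format(self, record):
        event = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Fields passed through `extra={"payload": {...}}` become part of the event.
        event.update(getattr(record, "payload", {}))
        return json.dumps(event)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Emits something like:
# {"timestamp": "2019-06-06T13:10:38+00:00", "level": "ERROR", "message": "payment failed", "user_id": 42, "http_status": 500}
logger.error("payment failed", extra={"payload": {"user_id": 42, "http_status": 500}})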
You can get logs from almost anywhere. Many applications emit logs to standard locations like /var/log/[service] on Linux, and each application has its own format. For example, a log line from NGINX looks like this:
13.249.65.159 - - [06/Jun/2019:19:10:38 +0600] "GET /blog/ HTTP/1.1" 200 177 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
You can see a pattern in the above log: white space separates each field, except for the datetime, which is enclosed in brackets ([ ]). Format is the most challenging part of working with logs. And what about the custom logs emitted by the applications you build? Well, you can define your own “standard” there too, for example by emitting structured logs in a consistent format across your services.
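To show what parsing a plain-text format like that looks like, here’s a small Python sketch with a simplified regular expression for NGINX’s default combined log format. It assumes the access log hasn’t been customized; real-world parsing usually needs a more forgiving pattern.

import re

# Simplified pattern for NGINX's default "combined" access log format.
LOG_PATTERN = re.compile(
    r'(?P<remote_addr>\S+) \S+ \S+ '          # client IP, ident, user
    r'\[(?P<time_local>[^\]]+)\] '            # datetime enclosed in brackets
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '   # request line
    r'(?P<status>\d{3}) (?P<bytes_sent>\d+) ' # status code and response size
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"')

line = ('13.249.65.159 - - [06/Jun/2019:19:10:38 +0600] "GET /blog/ HTTP/1.1" '
        '200 177 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')

match = LOG_PATTERN.match(line)
if match:
    entry = match.groupdict()
    print(entry["status"], entry["path"], entry["time_local"])
    # -> 200 /blog/ 06/Jun/2019:19:10:38 +0600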
When you see a pattern in the metrics, or you’re at risk of missing an SLO, you can investigate further by reading logs. For example, say you’re experiencing a considerable number of HTTP errors. To understand what might be causing them, you query the logs, ideally in a centralized location, for events with an HTTP status code of 500. By reading the stack traces attached to those events, you get a better idea of what’s happening and can start solving the problem.
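If your logs are centralized as structured, JSON-lines events, that kind of question can be a simple filter. The sketch below is hypothetical: the file name app.log.json and the http_status field are assumptions about how your events are shaped.

import json

# Collect the events that recorded an HTTP 500.
errors = []
with open("app.log.json") as log_file:          # one JSON event per line
    for raw_line in log_file:
        event = json.loads(raw_line)
        if event.get("http_status") == 500:
            errors.append(event)

print(f"{len(errors)} events with HTTP 500")
for event in errors[:5]:                        # sample a few to start the investigation
    print(event.get("timestamp"), event.get("message"))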
What I described above is what observable systems are about: you don’t have to SSH into the servers to know what’s happening. You should be able to observe production systems by asking questions from the outside. And you can get even better answers to those questions when you have traces of a request available.
Distributed Tracing
A trace represents the journey of a single request through the system and is made up of spans, where each span represents one unit of work in the code and carries a name, an ID, and timing information. When you stitch together the spans emitted by the different services in a distributed system, you can see the end-to-end flow of an execution path. With traces, you can tell which part of the system is taking the most time to process a request.
Traces are especially useful when you want to fix latency issues. I included the word “distributed” in the subhead above because traces shine in distributed systems, where it’s hard to connect all the calls involved in serving a single user request. Spans interconnect by passing a unique trace ID along with every call between services.
Just as you can derive metrics from logs, you can also derive traces from them. And you can instrument your code with OpenCensus to produce traces in your distributed system; that way, you don’t have to hand-roll the code that passes request headers around to connect spans.
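As a simplified sketch of what that can look like with the OpenCensus Python library (assuming a recent version of the opencensus package), the snippet below creates a parent span for a request and a nested child span for a call to a dependency. The span names and the fake work are made up; a real service would also configure an exporter (Zipkin, Jaeger, Stackdriver, and so on) and propagate the trace context across HTTP calls.

import time

from opencensus.trace.samplers import AlwaysOnSampler
from opencensus.trace.tracer import Tracer

# Sample every request for the sake of the example; production systems
# usually sample only a percentage.
tracer = Tracer(sampler=AlwaysOnSampler())

# Parent span: the incoming request as a whole.
with tracer.span(name="GET /checkout") as request_span:
    request_span.add_attribute("http.method", "GET")

    # Child span: one step inside the request, such as a call to a dependency.
    with tracer.span(name="redis.get_cart"):
        time.sleep(0.05)  # stand-in for the real call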
I’ve personally used traces to identify which microservice was having problems. For example, I’ve been able to find out that a service was intermittently having issues with an external dependency like Redis, and I knew it was Redis because the logs told me so.
Opinions About the Three Pillars
Even though the concept of the “pillars of observability” might have started with this post from Twitter (which also includes alerts as a fourth pillar), some people see it differently. Some say that there are no three pillars of observability. Others say that the three pillars, on their own, offer no answers. You can read their individual posts and tweets for details. But what they push back on, and I do as well, especially given how complex distributed applications have become, is the common pattern of one tool per pillar, which leaves people juggling three tools to debug their systems.
I like how Cindy Sridharan puts it in her Distributed Systems Observability report. She says, “While plainly having access to logs, metrics, and traces doesn’t necessarily make systems more observable, these are powerful tools that, if understood well, can unlock the ability to build better systems.”
Logs, metrics, and traces are the sources of information that make systems more observable. That doesn’t mean you always have to generate metrics from logs or traces and then create actionable alerts. Even when that’s possible, sometimes you’ll be working with systems you don’t own, and you’ll have to troubleshoot with whatever information they provide.
Build Better Observable Systems
The takeaway is that there are several sources of information you can get from your systems, and in some cases you might need to add instrumentation to get them. But I’d recommend staying away from using a separate tool for each one when debugging production. In particular, a centralized place to store and query all of this data will be far more effective for solving problems: when systems are down, you need to find answers by asking questions as quickly as possible.
You can build better systems when they’re observable, and the good news is that you can start with the data you already have. You’ll find better sources of information along the way as you keep observing and continuously improving.