When systems are observable, it’s easier to debug in production. But there are also systems you don’t control, where you can’t add instrumentation to increase observability. Debugging them is exhausting because you’re working only with the information you have, which is generally logs. The internals are obscured.
Examples of opaque systems are a third-party API or a web server like NGINX, where you only have the information these systems emit. You consume these systems purely as a user, and they’re like a black box. You give them input, and they process it and produce an output. What happens inside the box is unknown, which makes the system hard to monitor, sometimes to the point where monitoring seems useless.
In today’s post, I’ll continue using the black box analogy from the software industry to explain how it’s applicable in debugging production systems. I’ll also include the golden signals from the SRE model to reinforce how you can keep track of opaque systems. You don’t need to keep track of and monitor everything, so I’ll also explain where using black box monitoring makes sense.
Let’s start.
What’s Black Box Monitoring?
The first time I heard the black box analogy was when I learned about white box and black box testing. White box testing is where the tester knows the internals of the software. The types of testing that fit into this category go beyond the user interface, like unit and integration testing. Black box testing, on the other hand, is where the internals of the software are unknown to the tester. Therefore, a tester only tests software behavior. This category often includes integration, acceptance, and user interface testing.
When we apply the same analogy to running production systems, we replace the word “testing” with “monitoring.” Nowadays some say that monitoring is dead, which is part of why the term “observability” has become so popular. White box monitoring is where you know the internals of the system, and the system has instrumentation in place to emit telemetry: metrics, logs, traces, etc. You can therefore understand and debug it better, because you can ask questions from the outside to learn about its internals. This is observability.
Black box monitoring is where you don’t have control and don’t know what’s happening inside the system. You only monitor the system from the outside—its behavior. By doing this, you see ongoing problems in the system.
James Turnbull, the author of The Art of Monitoring, says, “Black box monitoring probes the outside of a service or application. You query the external characteristics of a service: does it respond to a poll on an open port and return the correct data or response code.” For example, you could perform an ICMP check or a health check to an API to confirm it’s responding correctly.
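As a rough illustration of the kind of check Turnbull describes, here is a minimal sketch of an HTTP probe in Python using only the standard library. The names `ProbeResult`, `evaluate_response`, and `http_probe`, along with the default timeout and the idea of matching a piece of text in the body, are assumptions made for this example and don’t come from any particular monitoring tool.

```python
import urllib.error
import urllib.request
from dataclasses import dataclass


@dataclass
class ProbeResult:
    healthy: bool
    detail: str


def evaluate_response(status: int, body: str, expected_status: int = 200,
                      expected_text: str = "") -> ProbeResult:
    """Judge a response purely on what an outside caller can see."""
    if status != expected_status:
        return ProbeResult(False, f"unexpected status {status}")
    if expected_text and expected_text not in body:
        return ProbeResult(False, "expected text missing from body")
    return ProbeResult(True, "ok")


def http_probe(url: str, timeout: float = 5.0, expected_status: int = 200,
               expected_text: str = "") -> ProbeResult:
    """Send one GET request and check the response code and body."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return evaluate_response(response.status, body,
                                     expected_status, expected_text)
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as exceptions; the status is still known.
        return evaluate_response(err.code, "", expected_status, expected_text)
    except (urllib.error.URLError, OSError) as err:
        # DNS failure, refused connection, or timeout: the service is unreachable.
        return ProbeResult(False, f"unreachable: {err}")
```

You would call `http_probe` with the address of the health endpoint you care about. Keeping the judgment in `evaluate_response` separate from the network call makes the rule itself easy to test without a live service.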
Let’s get into more detail on how you can track opaque systems.
Tracking Opaque Systems
The options available depend on the system you need to track. Here are a few examples of how you can debug production systems that you don’t own, as well as systems that you own but want to give self-healing capabilities and increased reliability. Before I continue, I want to include a reminder from the SRE book about monitoring: “Your monitoring system should address two questions: what’s broken, and why? The ‘what’s broken’ indicates the symptom; the ‘why’ indicates a (possibly intermediate) cause.” The following techniques can help you to know the “what” in a system.
A Few Suggestions
The first on the list is a probe. A probe can be a ping, an HTTP health check, or even an integration test that mimics a real user. Probes are commonly used in load balancers, Docker, and Kubernetes. When you use probes, you can automate remediation tasks, for example, taking a server out of rotation, restarting a service, or redeploying an app. But it’s also pretty common to configure probes that ping your applications from outside your infrastructure, as a real user would. For example, you might send an HTTP request from another region in Azure if your application is running in AWS, or vice versa. Google has an open-source tool called Cloudprober that you can use for black box monitoring in GCP and in other clouds too.
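To make the link between probes and automated remediation more concrete, here is a small sketch of the bookkeeping such a system might do: count consecutive probe failures and only act once a threshold is crossed, so a single blip doesn’t pull a server out of rotation. The class name `ProbeTracker`, the threshold defaults, and the action strings are all hypothetical. Load balancers and orchestrators expose similar ideas, but this is not their API.

```python
from dataclasses import dataclass


@dataclass
class ProbeTracker:
    failure_threshold: int = 3   # consecutive failures before removal
    success_threshold: int = 1   # consecutive successes before re-adding
    in_rotation: bool = True
    _failures: int = 0
    _successes: int = 0

    def record(self, healthy: bool) -> str:
        """Record one probe result and return the action to take."""
        if healthy:
            self._failures = 0
            self._successes += 1
            if not self.in_rotation and self._successes >= self.success_threshold:
                self.in_rotation = True
                return "add_to_rotation"
        else:
            self._successes = 0
            self._failures += 1
            if self.in_rotation and self._failures >= self.failure_threshold:
                self.in_rotation = False
                return "remove_from_rotation"
        return "none"
```

Feeding each probe result into `record` gives you a single place to hang the remediation step, whether that is removing a server, restarting a service, or triggering a redeploy.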
Another option is to use a set of integration tests that run all the time to test the system from a behavior perspective. You might decide that certain rules or scenarios should be running correctly all the time. For instance, you can have a test that creates a user, searches for a product, adds it to the cart, and makes the purchase.
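A scenario like that can be sketched as a list of named steps that run in order and stop at the first failure. The runner below is only an illustration: `run_scenario` and the step names are made up for this post, and in a real setup each step would call the system’s public API or drive its user interface.

```python
import time
from typing import Callable, Dict, List, Tuple

# A step is a name plus a function that receives a shared context dict
# and raises an exception if the behavior is not what a user would expect.
Step = Tuple[str, Callable[[dict], None]]


def run_scenario(steps: List[Step]) -> dict:
    """Run steps in order, timing each one, and stop at the first failure."""
    context: dict = {}
    timings: Dict[str, float] = {}
    for name, action in steps:
        started = time.monotonic()
        try:
            action(context)
        except Exception as err:
            return {"passed": False, "failed_step": name,
                    "error": str(err), "timings": timings}
        timings[name] = time.monotonic() - started
    return {"passed": True, "failed_step": None,
            "error": None, "timings": timings}
```

For the purchase flow, the list would hold four steps (create a user, search for a product, add it to the cart, make the purchase), and the name of the failed step tells you what is broken from the user’s point of view.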
Lastly, you can implement different deployment strategies like canary releases, A/B testing, feature flags, or traffic splitting to protect the majority of users and minimize the impact if things don’t go as expected in the release. By doing this, you’re safeguarding your service-level objectives (SLOs). Making sure that your application is running within your SLOs could have a positive impact on revenue. Sites like Skyscanner don’t include offers from companies whose APIs are slow or not responding.
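One common way to split traffic for a canary release is to hash a stable identifier into a bucket and send only the lowest buckets to the new version. The sketch below assumes you have a user ID to hash. The function names `bucket_for` and `route` and the `salt` parameter are invented for this example.

```python
import hashlib


def bucket_for(user_id: str, salt: str = "release-1") -> int:
    """Map a user to a stable bucket from 0 to 99."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % 100


def route(user_id: str, canary_percent: int, salt: str = "release-1") -> str:
    """Send roughly canary_percent of users to the canary, the rest to stable."""
    return "canary" if bucket_for(user_id, salt) < canary_percent else "stable"
```

Because the bucket depends only on the user ID and the salt, a user who lands on the canary at 5% stays on it when you raise the percentage to 10%, which keeps the experience consistent while you watch the signals.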
The Four Golden Signals
The four golden signals from the SRE book are a practical example of applying the black box analogy to running production systems. However, these signals might apply only if you’re working with a system that runs on the web, which might be everyone nowadays. The four golden signals that Google recommends tracking are latency, traffic, errors, and saturation. These metrics help us understand how users perceive the health of our systems. When users say the system isn’t working, they might be referring to one of those metrics, or to all of them.
Latency is the time it takes to fulfill a request from the user. As in the Skyscanner example I gave before, latency can be critical. You only know whether a latency number is good once you’ve defined an SLO; if latency goes above that limit, you need to react. Traffic is the demand that the system is receiving, for example, how many requests per second or how many read/write operations the system is handling. Traffic usually has an impact on the other signals. Errors are failed requests from users, whether because there’s a problem in the code or because the system ran out of capacity. Saturation is how full the system is, that is, how much capacity it has left before it starts to degrade or crash. Based on the current saturation trend, you can predict upcoming capacity problems.
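Here is one way the four signals could be computed from a window of request records. It’s a sketch under several assumptions: the `Request` shape, counting only 5xx responses as errors, using the 99th percentile for latency, and measuring saturation as in-flight requests against a known maximum are all choices made for this example, and your system may call for different ones.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    latency_ms: float
    status: int


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(requests: List[Request], window_seconds: float,
                   in_flight: int, max_in_flight: int) -> dict:
    """Summarize one observation window into the four golden signals."""
    if not requests:
        raise ValueError("no requests observed in this window")
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p99_ms": percentile([r.latency_ms for r in requests], 99),
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        # Assumption: capacity is expressed as a cap on concurrent requests.
        "saturation": in_flight / max_in_flight,
    }
```

For an opaque system, the request records would typically come from the outside, such as access logs at a proxy or the results of your own probes.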
By keeping track of these signals (remember, they might be different depending on the system type), you can later automate actions like scaling the infrastructure in or out. Speaking of automation, this brings us to the last section, which covers the circumstances under which it’s healthy to use alerts.
Actionable Alerts
Now that you know what black box monitoring is, when it’s useful, and the techniques you can use to debug opaque systems, it’s time to talk about alerts. I mean, you need to do something with the feedback you’re collecting, right? Well, historically, the problem with alerts is that they tend to be noisy. And depending on the load, you might end up ignoring alerts that aren’t noise but real problems.
As a starting point, you need to define your SLO. Make sure you use signals (as per the previous section) that reflect the user’s perspective. You can use the signals that Google recommends for web systems, or you can define your own. The point is that you don’t monitor everything, because that’s overkill. Then, you only alert if the signals go out of the SLO and you need a human to apply intelligence to solve the problem.
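That rule can be written down as a small decision function: do nothing while the signals are within the SLO, run an automated action when one exists for every violation, and page a human only otherwise. Everything here is illustrative. The `Slo` fields, the signal names `latency_p99_ms` and `error_rate`, and the `automated_fixes` mapping are assumptions for this sketch, not part of any alerting product.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Slo:
    max_latency_p99_ms: float
    max_error_rate: float


def slo_violations(signals: Dict[str, float], slo: Slo) -> List[str]:
    """Return the names of the signals that are outside the SLO."""
    violations = []
    if signals["latency_p99_ms"] > slo.max_latency_p99_ms:
        violations.append("latency")
    if signals["error_rate"] > slo.max_error_rate:
        violations.append("errors")
    return violations


def decide(signals: Dict[str, float], slo: Slo,
           automated_fixes: Dict[str, str]) -> Tuple[str, List[str]]:
    """Return ("ok" | "automate" | "page", details)."""
    violations = slo_violations(signals, slo)
    if not violations:
        return ("ok", [])
    if all(v in automated_fixes for v in violations):
        # Every violation has a known automated response; no human needed.
        return ("automate", [automated_fixes[v] for v in violations])
    # At least one violation needs a human to apply judgment.
    return ("page", violations)
```

With a mapping such as latency to a scale-out action, a latency breach is handled automatically, while an error-rate breach with no known fix results in a page.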
There might be times when the signals go out of the SLO and you can automate a solution. Most of the time, problems can be resolved by scaling out the infrastructure or restarting a service. It might not be the best solution, but you can work on improving the system later. Truth be told, you can’t improve systems if you don’t have time because you’re busy attending to noisy alerts. According to the SRE book, Google caps the time its SREs spend on operational work like being on call at 50%, and the rest goes to increasing the system’s reliability.
What’s Broken?
I like how Google puts it. They focus on the “what” first so that they can find out the “why” of a problem. Black box monitoring can help you to understand what’s broken, especially from the user’s perspective. Systems that are opaque and less observable will require you to find out their status by different methods. One way is by using probes. You use the application as a real user would, and you can introduce automation to build more reliable, self-healing systems, which in turn helps keep your customers happy.
Black box monitoring is about enabling you to make changes to the system with more confidence. There’s always going to be “something” that’s testing (or monitoring) the system from the outside. You’ll notice if there’s a problem that you can remediate or prevent.