To understand the idea of log aggregation, you need to understand the pain it alleviates. You’ve almost certainly felt this pain, even if you don’t realize it.
Let’s consider a scenario that every programmer has probably experienced. You’re staring at some gigantic, dusty log file, engaging in what I like to think of as “programming archaeology.” And you have a headache.
Log aggregation is the practice of gathering up disparate log files for the purposes of organizing the data in them and making them searchable.
In this post, I’ll walk you through what, exactly, that looks like. But I’ll also describe the backstory and motivation for doing this, both of which are essential to really understanding how it works.
So with that in mind, let’s consider a programmer’s tale of logging woe.
A Tale of Logger Woe
It started innocently enough.
A few users reported occasionally seeing junk text on the account settings screen. It’s not a regular bug, and it’s not particularly important. But it is embarrassing, and it seems like it should be easy enough to track down and fix. So you start trying to do just that.
You start by searching the database for the junk text in their screenshot. You find nothing.
Reasoning that application code must somehow have assembled the text in production, you figure you’ll head for the log files. But when you open one up, it crashes your text editor. Oops. Too big for that editor.
After using a little shell script magic to slice and dice the log file, you open it up and search for the text in question. That takes absolutely forever and yields no results.
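In case you’ve never had the pleasure, that “magic” looks something like the sketch below; the file name and search string here are invented for illustration:

```bash
# Carve the monster log into 500,000-line chunks a text editor can open.
split -l 500000 application.log chunk_

# Or search the file in place and note the matching line numbers.
grep -n "junk text fragment" application.log

# Then pull out just the neighborhood around an interesting line.
sed -n '100000,100200p' application.log
```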
So you start searching for parts of the text, and eventually you have some luck. There’s a snippet of the text on line 429,012 and then another on line 431,114, with all sorts of indecipherable debug junk in between.
But you can’t find all of the text. And you have a headache.
You then realize there’s a second log file for certain parts of the data access layer from before the Big Refactoring of ’15, and the rest of the text is probably in there. Your headache gets worse.
Log Aggregation to the Rescue
You don’t need a name for it to know that a better approach has to exist.
Of course, you’ll pull levers in your own code and configuration files first. Once you’ve tracked down that maddening issue and sorted it, you’ll parlay your lessons learned into suggestions for the team.
From now on, we should really turn off all debugging info in prod because it’s just noise. And we should really audit our codebase to get rid of spurious logging calls. Oh, and while we’re doing that, we should establish some standards for the information that goes into each entry and keep each entry to a single line. That should do the trick.
Or should it?
With these types of approaches, you’re addressing a poor signal-to-noise ratio. Frustrated by what you deem worthless information in the log file, you seek to organize and reduce the raw volume. You cut down on the total noise, hoping to leave only signal.
But here’s the trouble.
Your noise in solving the junk text problem may prove to be someone else’s signal next week when tracking down a different problem. If you’re a code archaeologist, those log entries are your fossils; you don’t want to toss them in the garbage because they’re not helping with your project right now.
You don’t want to put your logs on a diet. Rather, you want to get better at managing them. That’s where log aggregation comes in.
Aggregating Your Logs
Today you have that main log and the other leftover one from the days before the Big Refactoring of ’15. You want to consolidate those in application code to make life easier. But then again, you’ll still have the entirely separate server log to deal with in some cases.
And what about the inevitable wave of consultants that come in and tell you to break your monolith into microservices? All your log file consolidation efforts become moot as your application becomes a bunch of small applications.
The real solution to your problem lies not in an enforced standard of dumping all information into a single file.
Instead, you want to find an efficient way to gather the entries from your various log files into one single, organized place. That may seem like a lot of extra work for you, but it really isn’t because someone has solved this problem already — and solved it well.
Tools exist to handle your log aggregation.
And Parsing Them, While You’re At It
Introducing log aggregation tooling will be a game changer for you. If you think ahead to the sorts of things you’d want once your logs are aggregated, chances are the tool has already taken care of them.
For instance, you probably think, “Well, slamming the log files together is all well and good, but all the different formats would just get confusing.”
And if you roll your own solution, that’s absolutely true. At least, until you write some sort of parser to extract structured data.
But people have already written that, and it comes along for the ride with log aggregation tooling.
There’s an important concept at play here. Generally, developers treat log files as text and do simple searches. But there’s data in those files, waiting for extraction and meaningful ordering. The aggregator treats your log file as data and lets you meaningfully query that data.
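To make that concrete, here’s a toy sketch that assumes a simple, invented log line format; a real aggregator does this kind of parsing for you, across many formats at once:

```bash
# A raw log line might look like this:
#   2021-03-04 13:22:01 ERROR AccountService Failed to render settings page
# Treat the line as data: split it into timestamp, level, and logger fields
# instead of scanning it as opaque text.
awk '{ print "time=" $1 " " $2, "level=" $3, "logger=" $4 }' app.log
```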
Real-Time Monitoring
Chances are you’ve logged into a server somewhere and issued a command like “tail -f some.log.” And I’m betting that’s the sum total of the real-time log monitoring that you’ve done.
With a feature-rich log aggregator, you can achieve this same effect.
But with the aggregator, you can bring all of the structured data and gathered log files along for the ride. So instead of a scrolling wall of text, you can keep your eye on a scrolling set of organized, meaningful, color-coded entries. In this sense, it’s a lot more like looking at a dashboard than a text file dump.
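For contrast, the closest homegrown approximation looks something like this, filtering the scrolling wall of text down to just the severities you care about:

```bash
# Follow the log in real time, but show only WARN and ERROR entries.
# (--line-buffered is a GNU grep option that keeps output flowing promptly.)
tail -f some.log | grep --line-buffered -E "WARN|ERROR"
```

An aggregator gives you that kind of filtering, plus structure and color, without any pipeline plumbing.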
Intelligent, Fast Search
All of that sets the stage for truly alleviating the headache of the code archaeologist. You can get all of the log data in one place, parse it into meaningful data, and keep an eye on it. So, not surprisingly, you can get a lot more sophisticated with your querying than simple text searches.
The log aggregation tool treats your logs as data.
That means a conceptual schema and indexing. Put another way, it means you can execute semantically meaningful searches that are also fast.
Forget relying on your text editor’s wildcard/regex feature to make your search smart and then going for lunch while it cranks through a 10-gig log file. You can look for things based on the nature of the data in question, and you can do so quickly.
What Kind of Data Is Captured In Log Aggregation?
The previous sections covered what you might call the stages of log aggregation. It all starts with collecting logs. But given the sheer number of conflicting formats that different types of logs employ, you also need to parse those logs so you can extract useful data from them, regardless of their formatting choices.
After you have all of your log data centralized in a single location, you’re golden: you’re ready to do all kinds of useful things, such as real-time monitoring and fast, efficient searches.
Speaking of data: what kind of log data is captured during log aggregation, after all? The answer is all kinds.
You see, the real value of log aggregation comes when log events from all kinds of sources are centralized in a single place. Sure, that definitely includes application logs; in our example, logs from our app, from before and after the infamous Big Refactoring of ’15.
But log aggregation captures way more than just application logging. Here’s a non-exhaustive list of types of logs that might get aggregated:
- Web server logs (the hardware type).
- Database logs.
- Web server error logs (the software type).
- Web server access logs (the software type).
- Operating system logs.
Your log aggregation strategy must also take into account all of the different possible destinations for logs. While we routinely talk about log files, there’s nothing really preventing logs from being written to all kinds of different targets. Though text files on the disk are certainly one of the most popular destinations, database tables aren’t that far behind. Other common destinations include the console, NoSQL databases, or even cloud log aggregation services.
Finally, as you’re probably aware, a common and useful way of categorizing log events is through the use of logging levels. When you factor that in, it becomes clear that log aggregation should capture log events of all different severities. Sure, it’s a best practice to deactivate log levels such as TRACE and DEBUG (in other words, anything lower than INFO) in production. But apart from that, any log event that makes it into production should be captured during the log aggregation process.
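As a quick illustration of why severities matter downstream, here’s a minimal sketch that tallies aggregated events per level; it assumes, purely for illustration, that the level is the third whitespace-separated field of each line:

```bash
# Count aggregated log events per severity level.
# (Assumes the level is the third whitespace-separated field.)
awk '{ counts[$3]++ } END { for (level in counts) print level, counts[level] }' aggregated.log
```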
The Three Levels of Log Aggregation
The post up until now has followed a pretty conventional what-why-how structure. Well, sort of. I started with the “why,” employing a fictional story to paint a vivid picture of the pain that log aggregation is supposed to heal. Then I talked at great length about the “what,” covering in detail the steps that log aggregation involves. However, the “how” part is still missing: how do you actually do log aggregation?
That’s what I’ll answer now, by describing three approaches to log aggregation that differ in their “maturity level,” so to speak.
Level 1: The Homegrown Approach
The first level of log aggregation is something that you could accomplish maybe in an afternoon. You could leverage basic and universal tools, such as rsync, and come up with a simple yet effective solution for synchronizing log files to a centralized location. A proper log aggregation approach has to be automated, though. So a next step would be to use something like cron so the process can be triggered without human intervention.
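A minimal sketch of that homegrown setup, with placeholder host names and paths, might look like this:

```bash
#!/bin/bash
# sync_logs.sh: push this host's log files to a central box.
# "loghost" and the paths below are placeholders, not real defaults.
rsync -az /var/log/myapp/ loghost:/srv/logs/"$(hostname)"/

# And a crontab entry to run the sync every five minutes, no human required:
# */5 * * * * /usr/local/bin/sync_logs.sh
```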
Such an approach would have serious downsides, though. First of all, it’d only be able to centralize log files, not log entries in other formats, such as rows in relational database tables. Also, since it relies on scheduled jobs to bring all of the logs together, this approach can’t offer any real-time features, failing at what can be considered one of the killer aspects of log aggregation.
So while this approach can be considered log aggregation in the literal sense, it misses most of the valuable features people associate with the term.
Another important feature that would be missing is fast and efficient search. Sure, you can use grep along with regular expressions, but the usefulness of that approach quickly fades as your log data reaches massive volumes.
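For instance, a search like the one below works fine against a few megabytes of logs. But because it rescans every byte on every query, it slows to a crawl as the volume grows:

```bash
# Hunt for ERROR entries mentioning the settings screen across
# everything synced to the central box (placeholder path).
grep -rE "ERROR.*account settings" /srv/logs/
```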
Level 2: Leveraging On-Prem Tools
As soon as they realize the limitations of trying to come up with a homegrown log aggregation solution, most people will think of a proper log aggregation tool as the next logical step. There’s a wide array of log aggregation tools available. You have open-source tools you can download and use for free. There are also commercial tools that you must pay to use but offer additional benefits such as support.
Perhaps a more important way of categorizing log aggregation tools relates to whether they have to be downloaded and installed on the client’s own servers. These tools are often called on-prem or on-premises solutions since they’re installed on the client’s actual IT infrastructure.
I consider those to be the second level of maturity when it comes to log aggregation adoption. Though leveraging an on-prem tool is certainly way better than going with a homegrown solution, it’s still not the best available option. Why is that the case?
There are many reasons for that, probably enough to fill a post of their own. For brevity’s sake, I’ll pick a single reason, which is probably the most compelling anyway: the cost.
Implementing an on-prem log aggregation solution requires heavy investments, not only in infrastructure but also in personnel, since you’ll need to train—or hire—specialized talent.
There’s also the opportunity cost involved, since professionals tasked with running the log aggregation solution won’t be able to work on other, potentially more valuable, tasks. All of that has to be taken into account to determine the solution’s full TCO (total cost of ownership). Even if the software you choose is free and open-source, that doesn’t mean it’ll be cheap in the long run.
Level 3: Cloud-Based Log Aggregation Tools
The final level in a log aggregation journey is learning to leverage cloud-based log aggregation tools. These solutions are engineered to be scalable from the ground up. They also let organizations get started with log aggregation more quickly, since they don’t require as much preparation as their on-prem counterparts.
When it comes to performance, cloud tools are usually better than the on-prem solutions, since they’re built to be as fast and efficient as possible regardless of log data volume.
Despite all of the advantages of cloud-based aggregation tools, at the end of the day, the most compelling factor is exactly the one you’re thinking about: money.
Cloud solutions are the way to get the most bang for your buck. They help you reduce costs on a lot of fronts, since they don’t require heavy investments in either infrastructure or personnel. And since they facilitate monitoring, especially when dealing with distributed architectures, they help teams diagnose and fix problems faster, which prevents the organization from losing money down the line.
The Value Proposition of Log Aggregation
Everything that I’ve talked about so far, you can think about in terms of features. Literally speaking, log aggregation just means “gathering log files into one place.”
Log aggregation tool makers have taken that to its logical conclusion, adding things like parsing, search, and indexing. You now have sophisticated, robust options to help you keep track of the information that your applications leave behind as they run.
As I covered in the first level of log aggregation maturity, an effective yet primitive approach based on simple, universal tools fits that literal description. It might be log aggregation technically speaking, but it lacks most of the attractions that people associate with the practice.
But getting to the real core value proposition — the “what’s in it for me” angle — requires you to consider these features as a whole. It means you have to think of the code archaeologist with the headache.
That developer wades through a swamp of noise, looking for a signal. Log aggregation turns the log files into proper data, thus taking the noise and hiding it until you need it. Left with only the signal, you can now use your logs as a quick and efficient tool for chasing down production issues — without any headaches.
We’re approaching the end of the post. Hopefully, by now, I’ve convinced you of the importance of log aggregation. If I did my job well, you’ve made up your mind about getting started with log aggregation and are now wondering what the next steps might be.
Well, if you’re set on adopting log aggregation, the next logical step is picking a tool. And if tool-shopping is actually the current stage in your log aggregation journey, we invite you to take a look at Scalyr.