As an engineer, you recognize the complexity of today’s software systems. You probably don’t understand every individual part of your software stack, but you know there are many points of failure. I certainly can’t remember how every part of my company’s software stack works. That doesn’t mean I’m not responsible for fixing it when it breaks! It’s my job to drill down and determine what part of the stack has come unglued whenever there’s an outage. The same is probably true for you.
When I’m troubleshooting an outage, time is of the essence. I need to figure out what’s going on right now. What takes me the most time in troubleshooting an issue? Finding out where the problem lies. I usually have a heap of performance and log data to sift through. Most of it is entirely unrelated to the outage at hand, but I don’t know that yet. So I have to dive into all that data, make sense of it as quickly as possible, and try to pinpoint the root of the problem. But what if I could teach a computer to do that for me?
What Is AI Ops?
That question is at the root of AI Ops. Now, full disclosure: AI Ops is a term invented by Gartner to describe a trend that already existed. As an engineer, you’ll find that some of what we do with AI Ops is already second nature, especially if you use a tool like Scalyr Log Management.
But if we paraphrase the long, extremely convoluted definition provided by Gartner, we come up with something pretty simple. AI Ops is using specialized data processing tools to cut through the layers of data necessary to perform an ops role in today’s dynamic IT shops. Or, to make that definition even shorter: AI Ops is teaching a computer to recognize the same patterns you do in your logs. The benefit to you is that when there’s an outage, you’re less stressed. You’ll need to spend less time figuring out just what broke, and get right to fixing the problem.
How Does AI Ops Work in Practice?
Instead of talking about definitions and trying to imagine how those might play out in your job, let’s try a practical example. Admittedly, the example I’ve come up with is a bit contrived, but it’s easy to understand.
Let’s say that your database’s network connection has gone down. Let’s also imagine, for a moment, that you don’t work in a shop with high-quality monitoring and detection tools already in place. Yes, a dropped network connection is the kind of problem a basic health check, even one run by an automation tool like Ansible, could catch. But, for this exercise, let’s use it as a stand-in for a more complicated problem.
Fixing a Bad Connection in a Standard Shop
In a lot of manual IT shops, fixing something like a bad connection is a slow process. First, you have to wait for someone to report it: you won’t know anything is wrong until you get a breathless email that “the website is down.” That’s not a particularly useful error message, but it’s up to you to fix things. So you stop what you’re doing and type the company’s website URL into your browser. More than likely, the homepage is static or even served by a CDN. You grumble to yourself that the website isn’t down; it’s loading just fine.
Then, you try to log in. Thirty seconds later, you get a timeout page. Maybe the website is a little down.
So, you swing on over to AWS CloudWatch or Google Cloud Logging. These tools will help you diagnose what’s going on. You know there was a timeout when you tried to log in. Maybe something’s wrong with the load balancer; you dig into the logs. After about two minutes of searching, you find the entry for your failed login. Sure enough, there’s a timeout error, right there in the logs. This game of hopscotch continues right down the stack. Next up, you have to check out the web server. Then maybe an API server. Then a login service server. After checking four different levels of logs, you finally reach the source of the problem: the database.
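To make that hopscotch concrete, here’s a rough sketch of the manual triage loop as a script. The log paths and the error pattern are hypothetical stand-ins; in a real shop, each of these would be a query against CloudWatch or Cloud Logging rather than a local file:

```python
import re

# Hypothetical log files, ordered from the top of the stack to the bottom.
LOG_FILES = [
    ("load balancer", "/var/log/lb/access.log"),
    ("web server", "/var/log/nginx/error.log"),
    ("API server", "/var/log/api/app.log"),
    ("login service", "/var/log/login/app.log"),
    ("database", "/var/log/postgres/postgres.log"),
]

SUSPICIOUS = re.compile(r"timeout|connection refused", re.IGNORECASE)

# Walk down the stack, layer by layer, looking for matching errors.
for layer, path in LOG_FILES:
    with open(path) as f:
        hits = [line.strip() for line in f if SUSPICIOUS.search(line)]
    if hits:
        print(f"{layer}: {len(hits)} suspicious lines, e.g. {hits[-1]}")
```

Every iteration of that loop is a few minutes of your time during an outage, which is exactly the cost AI Ops aims to eliminate.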
You’re experienced and good with your monitoring tools, so this whole process might go pretty quickly for you. It takes perhaps twenty minutes, maybe a little more or a little less. Eventually, you determine the cause of the issue, and you fix it with a simple database server reboot. You’ve solved the problem, but it was a big disruption.
Fixing That Same Bad Connection With AI Ops
Instead of fixing all of that manually, let’s look at what happens when you have AI Ops plugged into your stack. The story for fixing this problem with AI Ops starts long before the outage. It starts by training the specialized data processing logic that drives the AI Ops platform. You and your team feed log messages into the AI Ops platform and correlate them with different kinds of issues. The platform learns to associate particular types of log messages with particular kinds of failures.
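Every platform implements this training differently, but as a minimal sketch of the idea, here’s how you might learn a mapping from log lines to issue labels using scikit-learn. The sample messages and labels are entirely invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented training data: log lines your team has labeled with the
# issue they turned out to indicate.
log_lines = [
    "ERROR connection to db-primary:5432 timed out",
    "ERROR could not connect to db-primary: connection refused",
    "WARN upstream login-service response exceeded 30000ms",
    "ERROR disk usage on /var/lib/postgresql at 98%",
    "ERROR No space left on device",
    "WARN certificate for api.example.com expires in 2 days",
]
labels = [
    "db-connection-drop",
    "db-connection-drop",
    "db-connection-drop",
    "disk-full",
    "disk-full",
    "cert-expiry",
]

# Bag-of-words features plus a simple classifier: enough to associate
# message patterns with issue types.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(log_lines, labels)

# A new, unseen message gets classified against the learned patterns.
print(model.predict(["ERROR db-primary:5432 timeout after 30s"])[0])
```

A production AI Ops platform uses far more data and far more sophisticated models, but the core idea is the same: labeled history in, pattern recognition out.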
Now, you don’t need to wait for a customer to inform someone that the website is down, or to hear it secondhand from someone in the business. Instead, your AI Ops platform is able to detect the issue after just two or three failures. Better still, the platform will correlate logs and error messages across multiple systems. Looking through logs from each part of your stack yourself might take half an hour, and that’s if the problem is simple and obvious, like a database connection drop. When the problem is more complicated, you and I both know it takes a lot longer.
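That “two or three failures” detection doesn’t have to be exotic. Here’s a minimal sketch of one way to do it, a sliding-window threshold, with the window size and threshold picked arbitrarily for illustration:

```python
from collections import deque
from datetime import datetime, timedelta

class FailureDetector:
    """Raise an alert once N matching failures land inside one time window."""

    def __init__(self, threshold=3, window=timedelta(seconds=60)):
        self.threshold = threshold
        self.window = window
        self.events = deque()

    def record_failure(self, when: datetime) -> bool:
        self.events.append(when)
        # Drop failures that have aged out of the window.
        while self.events and when - self.events[0] > self.window:
            self.events.popleft()
        return len(self.events) >= self.threshold

detector = FailureDetector(threshold=3)
now = datetime.now()
for offset in (0, 5, 9):  # three failed logins within ten seconds
    if detector.record_failure(now + timedelta(seconds=offset)):
        print("alert: repeated failures, open an incident")
```

The point isn’t the specific numbers; it’s that the platform notices the pattern after a handful of failures, while a human is still waiting for that breathless email.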
With a powerful AI Ops system, you skip all of those steps. Instead, the system recognizes that those errors on the load balancer are related to the errors in the login service. It presents all of the problems to you as one coherent issue. Instead of spending half an hour tracing the error through your logs, you get a simple, obvious message: your database connection has dropped. You can take action on the root of the problem without needing to spend time investigating. In fact, the business might not ever know there was an outage. You don’t have a significant disruption; your systems have just had a little hiccup.
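One simple way to sketch that correlation step: treat errors that cluster in time as a single incident, and surface the deepest layer involved as the probable root cause. The layer ordering and the events below are invented for illustration:

```python
from datetime import datetime

# Deeper layers are more likely to be the root cause of errors above them.
DEPTH = {"load-balancer": 0, "web": 1, "api": 2, "login-service": 3, "database": 4}

# Invented error events that all fired within the same minute.
events = [
    ("load-balancer", "upstream timeout", datetime(2020, 1, 1, 12, 0, 4)),
    ("login-service", "db connection refused", datetime(2020, 1, 1, 12, 0, 2)),
    ("api", "login-service returned 504", datetime(2020, 1, 1, 12, 0, 3)),
    ("database", "connection dropped", datetime(2020, 1, 1, 12, 0, 1)),
]

# Group the time-clustered errors into one incident, deepest layer first.
incident = sorted(events, key=lambda e: DEPTH[e[0]], reverse=True)
root = incident[0]
print(f"one incident, probable root cause: {root[0]} ({root[1]})")
print("related symptoms:", [f"{layer}: {msg}" for layer, msg, _ in incident[1:]])
```

Real platforms use far richer signals than a fixed layer ordering, but this is the shape of the output you want: one incident, one probable root, and the rest filed as symptoms.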
AI Ops Works so You Don’t Have To
Again, this is an obviously contrived example. A simple database network connection drop isn’t a difficult problem for an experienced ops team to tackle. But its simplicity helps us highlight the difference between a traditional ops team and one that’s leveraging next-generation tools. The more advanced team puts in work before something breaks, so that when something does break, their tools work for them in the crisis. They’ve already spent time training their models. When the outage comes, their tools alert them before their customers know there’s a problem. Those same tools provide high-quality insights into the entire scope of the problem and help them identify the solution in seconds instead of hours.
For customers of tools like Scalyr Log Management, this kind of functionality is second nature. They already know about the power that good AI Ops can provide. If you’re not one of them, what are you waiting for?
This post was written by Eric Boersma. Eric is a software developer and development manager who’s done everything from IT security in pharmaceuticals to writing intelligence software for the US government to building international development teams for non-profits. He loves to talk about the things he’s learned along the way, and he enjoys listening to and learning from others as well.