Today I’m going to explain why controllability is vital in your production systems and why having observable systems is critical. Controllability is a companion concept to observability, and both come from control systems theory. Nowadays, people talk a lot about observability in operations and production systems, but both terms have also been applied to software testing for a few years now.
By no means am I trying to add a new buzzword to the industry (gosh, there are already so many!). But after looking at controllability in the context of production systems and how it relates to observability, I found the practices behind it interesting enough to write about.
In this post, I’ll cover what controllability is and how it relates to production systems. Then I’ll share a few practices and recommendations you might find useful for controlling your systems.
So let’s start with the definition of controllability.
What Does Controllability Mean?
Let me use the controllability definition from Wikipedia, which is very simple and straightforward:
Controllability is an important property of a control system, and the controllability property plays a crucial role in many control problems, such as stabilization of unstable systems by feedback, or optimal control.
We all know that, most of the time, systems become unstable when a deployment happens. That’s why many people prefer to defer deployments as much as possible. But DevOps and SRE changed that mindset by embracing the risk of things going wrong. The key is how fast you can react when the system is unstable and bring it back under control.
One way for you and your team to know that something is wrong is through the feedback you receive by observing the system’s properties from the outside. But observing without taking action is worthless. Once you understand the system’s internals, you can stabilize it correctly and effectively rather than fighting in the dark. In other words, you need observability to have a perception of the system. Without it, it’s hard to know where to take action to bring the system back under control.
Controllability is about taking actions based on feedback. What feedback? The feedback you get by observing your systems.
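To make that concrete, here’s a minimal sketch of such a feedback loop in Python. The monitoring query and rollback hook (fetch_error_rate, trigger_rollback) are hypothetical placeholders for whatever your observability stack and deployment tooling actually expose:

```python
import random
import time

ERROR_RATE_THRESHOLD = 0.05   # act when more than 5% of requests fail
CHECK_INTERVAL_SECONDS = 60

def fetch_error_rate() -> float:
    # Placeholder: in practice, query your monitoring system here.
    # Randomized so the sketch runs standalone.
    return random.uniform(0.0, 0.1)

def trigger_rollback() -> None:
    # Placeholder: in practice, call your deployment tooling here.
    print("Error rate too high; rolling back the last change")

def control_loop() -> None:
    # Observe the system (observability), then act on the feedback
    # (controllability).
    while True:
        if fetch_error_rate() > ERROR_RATE_THRESHOLD:
            trigger_rollback()
        time.sleep(CHECK_INTERVAL_SECONDS)

if __name__ == "__main__":
    control_loop()
```

Real systems put a human or a much more careful policy in that loop, but the shape is the same: observe, compare against a target, act.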
With observability in place, you can better understand the system and control it, especially when a deployment happens. So I’m going to focus on how controllability applies to changing the system during deployments without affecting its stability, or at least stabilizing it quickly when it becomes unstable.
There are ways and practices you can adapt to your needs to keep a system under optimal control. Let me share a few.
An Error Budget Is Useful
Before you can start to bring systems under control, you need to know what “normal” looks like. It’s not enough anymore to care only about when the system is entirely down; those are the obvious problems you need to control. But what about the problems you might not see but a few of your customers do? If your system isn’t observable, you can’t react quickly to the unknowns that might be making it unreliable.
Making systems reliable can be hard and expensive, and beyond a certain level, users might not even notice the difference. DevOps and SRE embrace risk, and the way SREs know when to stop or continue changing the system is by using error budgets.
An error budget is the risk tolerance of a service or, so to speak, how much time the system is allowed to be down. For example, if your availability target is 99.9%, the error budget is 0.1%, or about 43.83 minutes per month.
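If you want to sanity-check that arithmetic, here’s a small Python sketch. The 30.44-day average month length is my assumption; with a flat 30-day month you’d get 43.2 minutes instead:

```python
MINUTES_PER_MONTH = 30.44 * 24 * 60  # average month ~= 43,834 minutes

def error_budget_minutes(availability_target: float) -> float:
    """Minutes of allowed downtime per month for a given availability target."""
    return (1.0 - availability_target) * MINUTES_PER_MONTH

print(round(error_budget_minutes(0.999), 2))   # 43.83 minutes for 99.9%
print(round(error_budget_minutes(0.9999), 2))  # 4.38 minutes for 99.99%
```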
Error budgets push teams to make gradual changes to the system. Otherwise, the team might spend its entire error budget on keeping the system stable every time it pushes a significant change.
Implement Near-Zero-Downtime Deployments
There are a few techniques and practices that’ll help you push gradual changes to the system. When you need to push a significant change, like changing the architecture, feature flags can help you make it gradually. Feature flags reduce the chances of downtime because you control the on/off switch for the feature, which maintains controllability.
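Here’s a minimal sketch of the idea, with flags read from environment variables purely for illustration; real setups typically use a feature-flag service or config store so the switch can be flipped at runtime without a redeploy:

```python
import os

def flag_enabled(name: str) -> bool:
    # Flags come from environment variables here for simplicity; in
    # practice you'd read them from a flag service or config store.
    return os.environ.get(f"FEATURE_{name.upper()}", "off") == "on"

def current_checkout_flow(request):
    return "known-good path"

def new_checkout_flow(request):
    return "risky new path"

def handle_request(request):
    if flag_enabled("new_checkout"):
        return new_checkout_flow(request)  # the change, behind the flag
    return current_checkout_flow(request)  # flip the flag off to fall back
```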
Another technique is to practice blue-green (B/G) deployments, where you swap a live environment with the one you’ve been testing, without affecting real users. Conceptually, the cutover is a single atomic switch of which environment receives traffic.
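A toy sketch of that switch, using a hypothetical router object; in practice the swap happens at a load balancer, router, or DNS layer:

```python
class BlueGreenRouter:
    def __init__(self):
        self.environments = {"blue": "v1.4 (live)", "green": "v1.5 (staged)"}
        self.live = "blue"  # all traffic currently goes to blue

    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def swap(self) -> None:
        # The cutover is a single pointer flip, so rolling back is just
        # swapping again if the new environment misbehaves.
        self.live = self.idle()

router = BlueGreenRouter()
router.swap()       # the tested environment goes live
print(router.live)  # -> "green"
```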
B/G can be costly, though, depending on the architecture, which is why another popular technique is canary releases. With canary releases, you push a small change to the system gradually. The new change (the canary) might be tested first with internal users, then with a small portion of real users. If you observe that everything is OK, you do a full rollout; if you see problems, you roll back.
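One common way to implement the gradual part is to hash each user into a stable bucket, so the same user keeps seeing the same version as you widen the rollout. A sketch, assuming user IDs are the routing key:

```python
import hashlib

def in_canary(user_id: str, rollout_percent: int) -> bool:
    # Hash the user into a stable bucket from 0 to 99; the same user
    # always lands in the same bucket, so raising rollout_percent only
    # ever adds users to the canary.
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent

# Widen the rollout step by step: internal users first, then 1%, 10%, 50%...
print(in_canary("user-42", 10))  # True for roughly 10% of users
```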
Being able to implement deployments while reducing downtime will help you keep the system under optimal control. You might also increase controllability with practices like infrastructure as code, configuration management, and working with production-like environments.
Build a Fault-Tolerant System
Continuing with the premise that systems will go down and it’s better to embrace risk, it’s crucial that you build fault-tolerant systems in order to keep them stable. You can start by enumerating the ways the system can go down.
As I said before, most of the time, the system becomes unstable when we push a new change. But there are also times when the system goes down for reasons you can’t control—like losing a host machine in your cloud provider or deleting a database server by mistake.
Instead of trying to avoid failures at any cost, I’d suggest you build systems that are resistant to at least the most common downtimes. Cloud providers recommend that you work with immutable infrastructure. Build self-healing systems, especially if you’re working with distributed systems, which is becoming the norm.
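Self-healing can start as simply as a supervisor that replaces unhealthy instances, which is essentially what Kubernetes liveness probes or a cloud auto-scaling group’s health checks do for you. A toy sketch of that loop, with hypothetical check_health and replace_instance hooks:

```python
import random
import time

def check_health(instance: str) -> bool:
    # Placeholder: in practice, hit the instance's health endpoint.
    return random.random() > 0.1  # pretend ~10% of checks fail

def replace_instance(instance: str) -> str:
    # With immutable infrastructure, you don't patch the unhealthy
    # instance; you destroy it and boot a fresh one from the same
    # image, so every instance stays in a known state.
    print(f"replacing unhealthy instance {instance}")
    return f"{instance}-replacement"

def supervise(instances: list, interval_seconds: int = 30) -> None:
    # The loop that makes the system "self-healing": observe, then act.
    while True:
        for i, instance in enumerate(instances):
            if not check_health(instance):
                instances[i] = replace_instance(instance)
        time.sleep(interval_seconds)
```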
For example, Netflix created the Hystrix project to implement the circuit breaker pattern with a focus on service latency. Google, in their SRE books, explains how they increase resiliency by limiting the number of requests they accept in order to handle overload. And Eventbrite has implemented a “waiting room” to handle excess traffic.
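To illustrate the first of those, here’s a minimal circuit breaker in Python. This is the general pattern rather than Hystrix’s actual API: after enough consecutive failures, the breaker opens and fails fast, then lets a trial call through after a cooldown:

```python
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 5, reset_timeout_seconds: float = 30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout_seconds
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Fail fast instead of piling load onto a struggling dependency.
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```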
When you build fault-tolerant systems, they can auto-recover when instability hits.
Increase Resiliency by Injecting Chaos
At the last KubeCon, in Barcelona, Spotify shared a story about the time they accidentally deleted all their Kubernetes clusters with no user impact. You can hear more details in the recorded talk, but they were able to recreate the clusters in 3.25 hours thanks to their infrastructure-as-code files written in Terraform. What was vital for them is that they planned for failure, made gradual changes, and created a culture of learning.
There’s also a great talk from ChaosConf about embracing chaos by doing GameDays and creating feedback loops to measure resilience.
Accidents happen, and that’s why companies like Netflix inject failure into their systems all the time: to verify that the systems are fault-tolerant, so that when something terrible happens, it’s just business as usual. Not everyone can inject failures in live environments. But we can all learn the value of observing how a system behaves when it becomes unstable.
When you’re still discovering the problems with your system, you might not have controllability at first. But you’ll gain controllability as you learn and build remedies proactively. I’d also stress that when you start injecting chaos, you should do it in an environment you can control, where real users aren’t affected. When the process is mature, you might do it in live environments as Netflix does.
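A low-risk way to start is a fault-injection wrapper that only fires when explicitly enabled, say in staging. Here’s a sketch; the CHAOS_ENABLED switch and failure rates are my assumptions, and tools like Chaos Monkey do this at the infrastructure level instead:

```python
import functools
import os
import random

def chaotic(failure_rate: float = 0.01):
    """Decorator that randomly fails the wrapped call, but only when
    chaos is explicitly enabled (e.g., in a staging environment)."""
    def wrap(func):
        @functools.wraps(func)
        def inner(*args, **kwargs):
            if os.environ.get("CHAOS_ENABLED") == "1" and random.random() < failure_rate:
                raise RuntimeError(f"chaos: injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return inner
    return wrap

@chaotic(failure_rate=0.05)
def fetch_recommendations(user_id: str) -> list:
    return ["item-1", "item-2"]  # stand-in for a real downstream call
```

Running your test suite or a GameDay with CHAOS_ENABLED=1 shows you how the system behaves when that dependency misbehaves, before a real incident does.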
Systems Under Control
In conclusion, observability is how well a production system’s states can be inferred from the outside; controllability depends on that feedback. And controllability is how easily the system can be changed with low risk.
Systems can become unstable for a lot of reasons. To regain control and make a quick fix, you need to be able to reach the problem spot in the system and change it. How easily you can reach a particular state of a system is called reachability, and high reachability supports high controllability.
As I said before, systems become unstable when we deploy a change. And the end goal is not to avoid problems completely but to have a way to stabilize the system, ideally by planning for failure and automating the healing processes. A system is under control when its state becomes deterministic (no surprises). A system is controllable if and only if its states can be changed by changing its inputs.
As the saying goes, “hope is not a strategy.” So plan for how you can keep your production systems under control. There’s no recipe you can follow because every organization has its own challenges, both technical and cultural. Start by identifying what areas you can improve, then change things little by little. At some point, you’ll be able to bring your systems back under control every time they become unstable.
If you want to start getting feedback from your systems to improve resilience, give our platform a try and discover exactly what’s causing problems in your production systems.