Ah, the software defect. It’s the bane of our collective existence, and it also seems unavoidable. Okay, frankly, it probably is unavoidable, for all intents and purposes. But that doesn’t mean we’re powerless to do anything about it. We can chip away at its impact by reducing its severity and by shortening the defect life cycle.
What Is the Defect Life Cycle?
Muting the impact of defects is a self-explanatory endeavor, but what do I mean by “defect life cycle”? Well, first, consider the word “cycle.” This borrows from the lean idea of cycle time, which is fancy Six Sigma speak for “How long does it take from start to finish?”
Okay, so why not just call it “defect lifetime”? I suppose I could have used that term. But it omits a subtle yet crucial consideration. A defect in our software moves through a series of phases and steps as people work to correct it.
“Lifetime” makes it sound like fate gives birth to the defect, and then it simply exists until it naturally expires. But that’s not at all what happens. Instead, the team collaborates methodically to track down the defect, assess it, address it, and roll out a fix of some kind. So we think about a defect life cycle rather than a defect just living its life quietly out in the country somewhere.
But terminology and philosophy aside, how do you shorten the defect life cycle?
Production defects tend to generate stress and keep people up at night. From the moment a user reports one until the moment someone resolves it, tensions run high. Let’s take a look at how to reduce the length of that tense time.
Remove Bottlenecks and Get Your Communication on Point
First up, go pick the absolute lowest-hanging fruit. You’ll be surprised how many opportunities exist.
Any significant software operation will use something to track and report on defects, such as Jira. (If you aren’t using something like this, start doing so immediately.) Typically, moving a defect through such a tracker involves communication among a number of different parties.
Audit the life cycle of your defects. Do they go into a “reported” queue and wait for two days before anyone pays attention? Do they languish in the “assigned” state for days before an engineer has time to take a look? I can’t enumerate all possible bottlenecks here, but you get the idea. There will be pockets of slack time in your process. Find them.
You can think of this activity as “a day in the life of a defect.” Follow your defects from start to finish, looking for lags, communication inefficiencies, and confusion. This activity, requiring no tech changes whatsoever, can help a lot.
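If your tracker can export a defect’s status history, a few lines of code will show where the time goes. Here’s a minimal sketch, assuming each defect comes out as a list of (timestamp, state) pairs. The state names and the `time_in_each_state` helper are made up for illustration and don’t reflect any particular tracker’s export format.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def time_in_each_state(transitions, closed_at):
    """Sum how long one defect sat in each workflow state.

    transitions: list of (timestamp, state) tuples, sorted by time.
    closed_at: timestamp when the defect left its final state.
    """
    totals = defaultdict(timedelta)
    # Each state ends when the next transition happens (or at close time).
    end_times = [ts for ts, _ in transitions[1:]] + [closed_at]
    for (start, state), end in zip(transitions, end_times):
        totals[state] += end - start
    return dict(totals)


# Hypothetical history for a single defect.
history = [
    (datetime(2024, 3, 1, 9, 0), "reported"),
    (datetime(2024, 3, 3, 9, 0), "assigned"),
    (datetime(2024, 3, 6, 15, 0), "in progress"),
]
durations = time_in_each_state(history, closed_at=datetime(2024, 3, 6, 19, 0))

# Print the slowest states first so the bottlenecks stand out.
for state, spent in sorted(durations.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{state}: {spent}")
```

In this invented example, the defect spent days waiting in “reported” and “assigned” and only a few hours actually being worked on. Run the same calculation across all of your defects and the pockets of slack time tend to show up quickly.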
Embrace DevOps Culture
In some ways, you could think of this as a more specific case of communications improvement and bottleneck removal. But it’s worth calling out in and of itself.
No doubt about it: DevOps has become an incredibly high-profile buzzword. Separating the signal from the noise around it can be difficult. I won’t wade in too deeply here except to suggest that you think of DevOps as a culture.
It’s a culture in which the development organization shares responsibility for what happens to the software in production. And it stands in contrast to bygone days when most shops had developers who worked on software until it went live and then went on to do other things. From there, operations and maintenance programmers took over.
If you’re still working this way, you’ll have a needlessly protracted defect life cycle. Defects will literally have to cross departmental boundaries in order to find resolution. By making your group responsible for what happens in production, you’ll see faster resolutions.
Make Use of Alerting
Alright, we’re through the management theory portion of the post. Let’s move on to technological solutions.
First up, think about how you can make use of alerting technology. Traditionally, organizations learned of defects via angry users hurling invective through the “report an issue” link. But before that angry user ever decided to let you have it, the underlying defect had existed in production for hours, days, or even weeks.
But you didn’t know it.
You could have surmised it, though. Maybe traffic to some endpoint dropped to zero, or transactions of some kind stopped happening. If you’d been monitoring your production application for unusual or troubling behaviors, you might have learned something was up before angry users told you.
So look to create alerting capabilities around your production code. Get out in front of things, and you’ll get a head start on your defects.
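To make the “traffic dropped to zero” idea concrete, here’s a rough sketch of the kind of check an alerting rule performs. It assumes you already collect per-minute request counts for an endpoint. The `should_alert` function and the 20 percent threshold are illustrative choices of mine, not a recommendation from any monitoring product.

```python
def should_alert(recent_counts, baseline_counts, drop_ratio=0.2):
    """Return True when recent traffic falls far below the baseline average.

    recent_counts, baseline_counts: requests per minute, as lists of numbers.
    drop_ratio: alert when the recent average is below this fraction
                of the baseline average (an assumed, tunable threshold).
    """
    if not recent_counts or not baseline_counts:
        return False  # not enough data to judge either way
    baseline = sum(baseline_counts) / len(baseline_counts)
    recent = sum(recent_counts) / len(recent_counts)
    if baseline == 0:
        return False  # endpoint is normally idle, so silence is not a signal
    return recent < baseline * drop_ratio


# Hypothetical numbers: a checkout endpoint that suddenly goes quiet.
last_week_same_hour = [120, 130, 110, 125]
last_five_minutes = [0, 0, 0, 0, 0]

if should_alert(last_five_minutes, last_week_same_hour):
    print("Checkout traffic has dropped sharply. Page someone.")
```

In practice you’d let a monitoring tool evaluate rules like this for you rather than hand-rolling them, but the underlying logic is usually no more complicated than comparing “now” against “normal.”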
Make the Language of Your Logs Understandable
Log files contain all of the mysteries of your application’s behavior as it runs. Making readable log files is critically important if your team is to have any hope of using them to track down defects.
But it’s also not nearly enough.
You need to aggregate all of your log files together into some coherent structure and then give yourself the ability to search them intelligently and quickly. (Hint: you need tooling to do this effectively.)
Doing this dramatically reduces the time required to reproduce a defect. And anyone with significant troubleshooting experience knows that reproduction is often the hardest, most time-consuming part of the defect life cycle.
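Searchable logs start with log entries that a machine can parse. As a small sketch of the idea, the snippet below writes each entry as a line of JSON using Python’s standard `logging` module and then filters entries by a request ID. The `JsonFormatter` class, the `search` helper, and the `request_id` field are my own assumptions for illustration; a real log aggregation tool would handle the storing and searching for you.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one line of JSON."""

    def format(self, record):
        entry = {
            "level": record.levelname,
            "message": record.getMessage(),
            # request_id is a hypothetical field passed in via `extra`.
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(entry)


def search(lines, **criteria):
    """Yield parsed entries whose fields match every given criterion."""
    for line in lines:
        entry = json.loads(line)
        if all(entry.get(key) == value for key, value in criteria.items()):
            yield entry


# Log to an in-memory buffer so the example is self-contained.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info("payment started", extra={"request_id": "req-42"})
logger.error("card declined", extra={"request_id": "req-42"})
logger.info("payment started", extra={"request_id": "req-43"})

# Pull back everything that happened during one user's request.
matches = list(search(stream.getvalue().splitlines(), request_id="req-42"))
for entry in matches:
    print(entry["level"], entry["message"])
```

Being able to ask “show me everything that happened during this one request” is often what turns an hours-long reproduction effort into a few minutes of reading.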
Write Clean Code
The last two suggestions rely on tooling that sits around your application in production. Now I’m going to offer some more invasive suggestions: what to do to the application itself, in terms of your code and deployment pipeline.
First of all, write clean code. There’s obviously a degree of subjectivity to this, and it may be a daunting training issue. But make sure that you’re writing code with the understanding that many more people will read it and try to understand it than will ever write it. Look to make your code maintainable, and both reproduction and fixes of defects become much easier.
Change the Way You Deploy: Granularity, Flags, and Gradual Rollouts
You have another powerful arrow in your quiver if you’re willing to change the way you build and deploy your software.
Think of how tech titans like Google and Facebook release software. They make their releases much more granular, going down to the feature level. And then they release “darkly” and without fanfare. They do this using a sophisticated implementation of the concept of feature toggles. At first, nobody sees the new features, and then they start gradually turning them on for more and more people.
As long as the experiment goes well, they keep widening the pool. But if they hit snags, they roll back quickly.
Of course, the ability to “roll back quickly” has important ramifications for the defect life cycle. But think even beyond that. With this style of deployment, you’re giving yourself much more flexibility across the board in production. You have the capability to make correcting defects in production almost as easy as correcting them in a sandbox developer environment.
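Here’s one common way to sketch a gradual rollout, assuming each user has a stable ID. This is not how any particular company implements it. The `bucket_for` and `is_enabled` functions and the feature name are hypothetical. The idea is to hash each user into one of 100 buckets and turn the feature on only for buckets below the current rollout percentage.

```python
import hashlib


def bucket_for(user_id, feature):
    """Map a user to a stable bucket from 0 to 99 for a given feature."""
    key = f"{feature}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return int(digest, 16) % 100


def is_enabled(feature, user_id, rollout_percent):
    """Return True if the feature is on for this user at this rollout level."""
    return bucket_for(user_id, feature) < rollout_percent


# Hypothetical rollout: widen the pool step by step.
users = [f"user-{n}" for n in range(1000)]
for percent in (0, 10, 50, 100):
    enabled = sum(is_enabled("new-checkout", user, percent) for user in users)
    print(f"{percent}% rollout -> {enabled} of {len(users)} users see the feature")
```

Because the bucket depends only on the user and the feature name, a user who sees the feature at 10 percent keeps seeing it at 50 percent. And rolling back is just a matter of setting the percentage to zero, with no redeployment needed.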
An Ounce of Prevention Is Worth a Pound of Cure
I’d be remiss if I didn’t pull back from the tactical and end strategically. Shortening your defect life cycle is good business. It’ll make your customers happy, which will make you more profitable, which will earn promotions for people. You should do this because it’s a win for everyone.
But as you’re doing it, you should also think about how to prevent defects in the first place. Sure, things will always find a way not to go according to plan. But that doesn’t mean that giant lists of known defects are inevitable either.
As you implement changes to reduce the defect life cycle, ask yourself how you can also apply those pieces of thinking to the cause of prevention.