Every organisation that makes software, makes mistakes. Sometimes, despite everybody’s best efforts, you end up releasing a bug into production. Customers are confused and angry; stakeholders are panicking.
Despite the pressure, you knuckle down and fix the bug.
Now it gets interesting: you have to deploy your fix to production. Depending on how your organisation works, this could take anywhere between a couple of minutes and a couple of weeks. You might simply run a single command, or you might spend hours shuffling emails around between different managers trying to get your change signed off. Meanwhile, your customers are still confused and angry. Perhaps they’ve started leaving you for your competitors. Perhaps you feel like leaving your job.
With the bug fixed, you sit down and try to learn some lessons. What can you do to prevent this from happening again?
There are two very different ways you can react to this kind of incident. The choice you make speaks volumes about your organisation.
Make it harder to make mistakes
A typical response to this kind of incident is to go on the defensive. Add more testing phases, hire more testers. Introduce mandatory code review into the development cycle. Add more bureaucracy to the release cycle, to make sure nobody could ever release buggy code into production again.
This is a reaction driven by fear.
Make it easier to fix mistakes
The alternative is to accept reality. People will always make mistakes, and mistakes will sometimes slip through. From that perspective, it’s clear that what matters is making it easier to fix your mistakes. This means cutting down on bureaucracy, and trusting developers with access to their production environments. It means investing in test automation, so that code can be tested quickly, and building continuous delivery pipelines so that releases happen at the push of a button.
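To make the “push of a button” idea concrete, here is a deliberately small sketch of such a pipeline, modelled as a sequence of steps where any failure aborts the release and triggers an automatic rollback. The step names and the in-memory log are hypothetical illustrations, not any particular tool’s API.

```python
# Minimal model of a push-button release pipeline: run each step in order;
# on the first failure, roll back automatically and report what happened.

def run_pipeline(steps, rollback):
    """Run (name, action) pairs in order; on the first failure, roll back."""
    completed = []
    for name, action in steps:
        try:
            action()
        except Exception as error:
            rollback()
            return {"status": "rolled back", "failed_step": name,
                    "completed": completed, "error": str(error)}
        completed.append(name)
    return {"status": "released", "completed": completed}

# Example run where the post-deploy smoke test fails (all names made up).
log = []

def failing_smoke_test():
    raise RuntimeError("HTTP 500 from /health")

steps = [
    ("unit tests", lambda: log.append("tests passed")),
    ("build",      lambda: log.append("artifact built")),
    ("deploy",     lambda: log.append("new version live")),
    ("smoke test", failing_smoke_test),
]
result = run_pipeline(steps, rollback=lambda: log.append("rolled back"))
# result["status"] is "rolled back"; result["failed_step"] is "smoke test"
```

The point of the sketch is the shape, not the steps: when the pipeline itself knows how to undo a bad release, a mistake in production costs minutes, not meetings.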
This is a reaction driven by courage.
I know which kind of organisation I’d like to work for.
I agree with this dichotomy, though I think it depends quite a bit on the organization. It’s easier to optimize for fixing mistakes when you’re a small company and less is on the line (which is why it kills me to see small companies taking the former approach). The consequences of mistakes for larger companies with huge customer bases are more uncertain and hence scarier.
I also think some people reading this will naturally agree while others will naturally propose counterexamples. But that’s just how people think, coming from our own experiences. Personally, if I make a really dumb mistake, chances are I have an intuitive sense of whether that was totally stupid and easily avoidable or if such mistakes are likely to happen no matter what safeguards are in place. I would rather prevent mistakes where preventing them is easy and obvious and imposes little or no friction on the organization in general. But I’d like to think that most of the mistakes I make aren’t like that; hopefully, they’re more often subtle mistakes that would have been difficult to anticipate. So I lean towards the latter approach.
I guess what I’m saying is that the way a person leans is likely to be shaped by what sort of mistakes he/she has generally experienced and observed. If most of the mistakes you observe seem to stem from stupidity, you might think the first approach actually makes sense.
Why does it have to be one response or the other? Why not both, at a measured level?
Remember the answer is always: It depends!
I understand you’re probably talking about the reaction in general here, rather than specifics, but I think it is important to realise that fear is sometimes an appropriate response (I would prefer you were fearful of releasing something that would expose my credit card details for example).
I think the interesting thing is that, whilst one view of the world may be most appropriate for a particular style of project, the real challenge is figuring out how to recognize when the opposite approach is necessary on specific occasions, and how to deal with that in the team culture.
Reminds me of discussions about MTBF (mean time between failures) vs. MTTR (mean time to recover). Turns out that things eventually do go wrong, and being able to recover quickly and gracefully (and to know that things have actually gone wrong in the first place) is pretty cool. Who’d have thought!
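The trade-off this comment alludes to can be put in numbers with the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The figures below are made up for illustration; the point is that shortening recovery time can buy more availability than doubling reliability.

```python
# Steady-state availability from mean time between failures (MTBF)
# and mean time to recover (MTTR). All numbers below are illustrative.

def availability(mtbf_hours, mttr_hours):
    return mtbf_hours / (mtbf_hours + mttr_hours)

baseline    = availability(500, 4)   # fail every ~3 weeks, 4 hours to recover
better_mtbf = availability(1000, 4)  # twice as reliable, same recovery time
better_mttr = availability(500, 1)   # same reliability, recover 4x faster
# With these numbers, cutting MTTR yields higher availability than doubling MTBF.
```

In other words, an organisation that optimises for fixing mistakes is not just braver; by this measure it can also be more available.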
With some products, customers want to test each release for a couple of months before using it in production. Then it makes sense to ramp up testing. However, the gut reaction is often to add (non-auto-tested) specifications instead of tests.