Have you ever had a near miss car accident?

Background

I definitely have. In one particular example I fell asleep on a highway and started drifting off the road. I remember it super well: the terrifying moment of waking up as the car hit the gravel and then the grass, and slamming on the brakes. It was mere chance that neither I nor anyone else was killed.

It might be bizarre then to learn that I think I'm a good driver. Aren't we all? But if I am a good driver, what led to this remarkably stupid situation, in which I only survived by sheer luck? Well, it was a culmination of factors, none of which by itself caused me significant stress:

  • I worked as a fitness trainer. This means 5am starts.
  • I worked a hundred kilometers from where I lived. This meant I would leave home at ~3:30am.
  • I was at university, and this was coming to exams. Any spare time I would have, I'd elect to study and not sleep.
  • I was working several days in a row, and using caffeine as a stimulant to keep me awake. On that day, it failed.

But all added up together? Well, I ended up having a car accident.

That's the point of this article. Bad things happen to all of us. Especially when we're juggling a tonne of different things, up to and including things that we no longer entirely understand. These days, I work as a software developer, specialising in the deployment and management of e-commerce software. Guess what? Bad things still happen.

They're different, though. Instead of having car accidents, something will break in a client system. Some clients turn over massive amounts of cash, and such a breakage will cost tens of thousands of Euros. It also costs client trust and customer trust, and massively increases developer stress as developers attempt to hastily repair the system to the best of their ability. If they're lucky or happen to have an experienced colleague, it's often not that bad: between 5 and 10 minutes of downtime, and we're able to restore the system to some normality.

However, in both cases, the car accident and the broken system, the more important thing is to decide what happens next.

It's extremely tempting to simply mark the issue resolved. I mean, the machine is back up, right? And I survived my car accident. Hell, there wasn't even any damage to the car. But such emergency response is exhausting, and the prospect of facing it again is simply more than I want to deal with. So, we undertake an exercise called a "blameless post mortem".

The name derives from the macabre process of identifying how someone died by examining their body, also known as an autopsy. In the same way, it's good to take the time to understand the circumstances that gave rise to the situation, so they can be addressed and we can live less stressful lives in future. Our current template is characterised by the following sections:

  • Abstract (short summary)
  • People (the list of people mentioned in the post mortem)
  • Timeline (a list of the events that occurred before the incident and that were taken to resolve it)
  • Impact (a list of the problems that the incident caused)
  • Contributing Factors (a list of the things that might have contributed)
  • Authors (the people who helped write the post mortem)
  • Thanks (acknowledgement of the help of others)
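
As a rough illustration, the template above can be captured as a small data structure that renders an empty plain-text skeleton. This is a minimal Python sketch, not our actual tooling: the names `SECTIONS` and `render_skeleton` are hypothetical, and only the section titles come from the list above.

```python
# Hypothetical sketch: only the section names are taken from the template above.
SECTIONS = [
    "Abstract",
    "People",
    "Timeline",
    "Impact",
    "Contributing Factors",
    "Authors",
    "Thanks",
]


def render_skeleton(title: str) -> str:
    """Return an empty post mortem document with one heading per section."""
    lines = [title, ""]
    for name in SECTIONS:
        # Each section gets its heading, a placeholder and a blank line.
        lines += [name, "", "  TODO", ""]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_skeleton("Post Mortem of Car Accident, 2010-01-01"))
```

Starting every post mortem from the same skeleton makes it harder to quietly skip an uncomfortable section.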

The goal of these is to take emergency incidents and learn from them, so they can be avoided in future. As an example, let's do a post mortem on my car accident.

Post Mortem of Car Accident, 2010-01-01

Abstract

On ${DATE}, Driver A fell asleep at the wheel. The car then drifted off the highway, at which point Driver A awoke and rapidly depressed the brake pedal. The car spun out and came to a stop. Operator error is at fault here; however, there are various causal issues that led to the degraded condition of the operator, including fatigue, unreasonable work expectations and drug (well, caffeine) abuse. Impact was minimal: Driver A was late to their employer. However, this is largely due to chance. Possible future impacts could be the death of the driver or of another person.

This post mortem requires follow up in 30 days (${DATE + 30}).

Timeline

2010-01-01 03:00:00 - Driver A awoke, feeling tired
2010-01-01 03:30:00 - Driver A felt tired, and stopped at a truck stop to order an extra large coffee
2010-01-01 04:00:00 - Driver A fell asleep at the wheel, leading the car to spin out

Impact

There was little impact from this incident. However, this is likely due to chance. Likely impacts in a more residential area would include death or property damage.

Causal Factors

There were several causal factors that contributed to operator error. Each issue is tracked and should be fixed by the review date.

  • Unreasonable work schedule
    • Resolved. Found new employment
  • Long distance between work and home
    • Resolved. Found employment closer to home
  • Too many commitments
    • Resolved. Terminated university
  • Caffeine abuse
    • Unresolved.
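
The follow-up rule and the factor list above could be tracked with something as small as the following. This is a hedged Python sketch under stated assumptions: the `Factor` class and the `unresolved` helper are hypothetical names, the data is copied from this post mortem, and the 30-day review window is the one stated in the abstract.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class Factor:
    """One causal factor; resolution is None while it is still open."""
    name: str
    resolution: Optional[str] = None


# Incident date taken from the post mortem title; review is due 30 days later.
INCIDENT_DATE = date(2010, 1, 1)
REVIEW_DATE = INCIDENT_DATE + timedelta(days=30)

FACTORS = [
    Factor("Unreasonable work schedule", "Found new employment"),
    Factor("Long distance between work and home", "Found employment closer to home"),
    Factor("Too many commitments", "Terminated university"),
    Factor("Caffeine abuse"),
]


def unresolved(factors: List[Factor]) -> List[str]:
    """Return the names of factors that have no recorded resolution."""
    return [f.name for f in factors if f.resolution is None]


if __name__ == "__main__":
    print(f"Review due: {REVIEW_DATE.isoformat()}")
    for name in unresolved(FACTORS):
        print(f"Still open: {name}")
```

Anything still listed as open on the review date is what the follow-up conversation should focus on.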

Authors

  • Andrew Howden

Thanks

  • God

Learnings

As you can see, I found a new job, moved house and ... well, I still drink too much coffee. I also no longer own a car. But by doing this after-action analysis we have dramatically reduced the likelihood of another incident, as well as my stress. The same applies to our software management process.

The process of learning is a continual one. But by carefully inspecting our failures, we can grow and learn and be safer and more secure in future.

Still reading? Awesome! I'm glad you enjoyed the post. There are also some other people who have written on this topic, whose work you might enjoy reading:

  • https://landing.google.com/sre/book/chapters/managing-incidents.html
  • https://landing.google.com/sre/book/chapters/postmortem-culture.html
  • https://codeascraft.com/2012/05/22/blameless-postmortems/
  • https://github.com/etsy/morgue

If the technology interests you and you're enjoying learning about how these things work, come join us! We're always looking for [...] We'll teach you as much as we can, and give you the tools required.

If you're not interested and just want to be a consumer of these awesome ideas, I suggest you come work for us. We're passionate about building beautiful, reliable web technology and we're given the agency to actually do it.

Thanks for your time. <3