Shit happens; it is inevitable. We work hard to keep things running with redundancy, automatic failover, and 99.999% availability, but most outages still happen because someone screwed up.
In an unhealthy organization you hang that person and move on. The organization learns nothing and is doomed to repeat the mistake.
In a healthy organization the system is at fault for allowing the person to make the mistake. The system needs to be fixed, and each outage is an excellent learning opportunity.
Having a playbook for what to do in the event of an outage is basic. You need to determine what kind of outage is considered an incident, how to discover an incident, and how to assemble the response team. One thing most teams forget is that the playbook is useless if
- Nobody knows it exists, or where to find it during an incident
This is why it’s imperative to have fire drills and to practice incidents. Some go as far as actually bringing down a system to practice a live incident.
Here’s how I would plan a fire drill
- Set a fixed time and date for the drill and inform the team so they can prepare
- Schedule a service window during the fire drill so the organization and its users can prepare
- Book a session with the team to present the incident playbook and make sure they know it
- Break the system at the start of the service window. Automatically restore the system at the end of the service window if the team has failed to find the fault
- Book a postmortem to evaluate the incident response
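The break-and-auto-restore step in the plan above can be sketched in Python. This is only a minimal illustration under assumptions: `run_fire_drill`, `break_system`, `restore_system`, and `is_healthy` are hypothetical names, not part of any real tool, and you would pass in callables that fit your own system. The clock and sleep functions are parameters so the drill logic can be rehearsed without waiting for a real service window.

```python
import time
from typing import Callable


def run_fire_drill(
    break_system: Callable[[], None],    # hypothetical: injects the fault
    restore_system: Callable[[], None],  # hypothetical: undoes the fault
    is_healthy: Callable[[], bool],      # hypothetical: health check
    window_seconds: float,
    poll_seconds: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Break the system, then wait for the team to fix it.

    Returns "fixed_by_team" if the health check passes before the
    service window ends, otherwise restores automatically and
    returns "auto_restored".
    """
    deadline = clock() + window_seconds
    break_system()
    fixed = False
    try:
        while clock() < deadline:
            if is_healthy():
                fixed = True
                return "fixed_by_team"
            sleep(poll_seconds)
        return "auto_restored"
    finally:
        # Restore when the window ran out or the drill itself crashed,
        # so the system is never left broken after the service window.
        if not fixed:
            restore_system()
```

The return value is something you could bring to the postmortem: it tells you whether the team found the fault within the window or the safety net had to kick in.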
After an incident you should always conduct a postmortem. The point is to identify the root cause of the incident and to find new systems, solutions, processes, or routines that make sure the incident doesn’t reoccur.
The purpose is to create a learning organization, where you set up safeguards against reoccurrence that will remain long after the people involved in the incident are gone.
Things to consider with a postmortem
- Putting blame on a person or a team doesn’t prevent the incident from reoccurring
- Taking responsibility for the incident also won’t prevent it from happening again
- The actions coming out of the retrospective meeting must prevent the incident from happening again, or you have failed to identify the root cause
Here’s my template for a postmortem retrospective, to help you ask the right questions and identify the root causes.