How to Run a Great Incident Post-Mortem

The key to running a successful incident post-mortem report.

Last Updated: November 30, 2022

Post-mortem meetings are a way to analyze failures and prevent them from recurring. In this article, Yoni Farin, co-founder and CTO of Coralogix, discusses what needs to be addressed during post-mortem meetings to make them the most effective.

Software failures happen in production, and no company can avoid outages altogether. Finding ways to prevent failures from recurring and, ideally, limiting the number and duration of failures will separate successful companies from the rest.

What Is an Incident Post-mortem?

An incident post-mortem is a meeting held after a software failure. A small group of directly involved individuals meets to describe the failure and its impacts. During the meeting, the team should discuss process changes that reduce the chance of the failure recurring. The post-mortem meeting should identify changes that can be implemented and later measured for effectiveness.

The outcome of a post-mortem meeting should be:

    • A detailed incident report based on a template
    • A full understanding of all contributing root causes
    • Preventive actions that reduce the likelihood of recurrence

How to run an effective post-mortem

When to hold the post-mortem

The post-mortem meeting should take place as soon as possible after the incident is over. If too much time passes, team members may forget the details necessary to dissect the failure. Aim to hold the meeting within 48 hours of the failure’s resolution; if that timeframe is not possible, the meeting should still take place.

Who should attend the post-mortem

Keep the post-mortem meeting to a small group. While every stakeholder should review the documentation, a larger group may hinder the discussion’s productivity. Attendees should be those who responded to the incident and the critical stakeholders impacted by the failure.

Document events thoroughly

Documentation taken during a post-mortem meeting should be as detailed as possible. The intention is for team members to be able to review the meeting and incident notes later, understand the context of the failure, and carry out the suggested actions properly. Following a template can help keep the meeting on track and ensure that no stage of the failure and recovery is skipped.

Keep it blameless

Post-mortems analyze why an incident occurred in order to change policy and prevent a recurrence. A blameless post-mortem does this without blaming an individual or team. This requires assuming all parties acted with good intentions. The circumstances that led to the failure are what must be changed to improve overall performance.

A blameless post-mortem removes team members’ fear of reprimand or insult. As a result, communication can proceed with honesty and objectivity, incidents are less likely to be ignored out of fear, a healthier work culture is nurtured, and teams are freed to do their best work.

Discussion Points During the Meeting

Since this meeting takes place after the issue has been resolved, the individuals in the meeting should, together, be able to give a complete account of the failure and analyze why it happened. The post-mortem meeting should consolidate this information and communicate it to other stakeholders.

Describe the incident and its resolution

The first part of the post-mortem should dissect the failure. Start by summarizing the incident in a few sentences: what happened and why, how severe it was, and how long it lasted.

The meeting should then break the incident down into discrete sections, each focusing on a different aspect of the failure. Include each of these sections in the post-mortem template so that none is skipped.

1. Leadup

Define the events that led up to the failure. Was there a new feature deployment? Did an external provider have a failure? Was there a previously undetected bug?

2. Fault

Describe how the implementation was intended to work, then compare that with how it actually behaved.

3. Impact

Describe how both internal and external users were affected by the failure. If any support tickets were created during the incident, reference them here.

4. Detection

When and how did the team detect the incident? Were they alerted by an external observability tool, or were customers the first to report the failure? If significant time passed between the failure occurring and the team becoming aware of it, discuss ways to improve detection.

5. Response

Who responded to the failure? How long after detection did the response begin, and were there any obstacles to responding? What response action was taken?

6. Recovery

Describe how the failure was fixed and how the incident was determined to be over. How did the responders know what steps to take to resolve the issue?

7. Timeline

Detail the timeline of events described above, including the time of any lead-up events, the first detection of the issue compared to the known start of the failure, and when the incident was deemed over.
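The seven sections above can be captured in a structured template so that none is skipped and the timeline yields comparable figures from one incident to the next. The following is a minimal sketch, not a prescribed format: the class names, field names, and example values are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Timeline:
    """Key timestamps of an incident (field names are illustrative)."""
    failure_started: datetime   # known start of the failure
    detected: datetime          # first detection of the issue
    response_started: datetime  # first response action
    resolved: datetime          # incident deemed over

    def time_to_detect(self) -> timedelta:
        return self.detected - self.failure_started

    def time_to_respond(self) -> timedelta:
        return self.response_started - self.detected

    def duration(self) -> timedelta:
        return self.resolved - self.failure_started


@dataclass
class PostMortemReport:
    """One text field per section of the template, plus the timeline."""
    summary: str
    leadup: str
    fault: str
    impact: str
    detection: str
    response: str
    recovery: str
    timeline: Timeline

    def missing_sections(self) -> list[str]:
        """Return the names of text sections that were left blank."""
        names = ["summary", "leadup", "fault", "impact",
                 "detection", "response", "recovery"]
        return [n for n in names if not getattr(self, n).strip()]


# Made-up example values, for illustration only.
timeline = Timeline(
    failure_started=datetime(2022, 11, 1, 9, 0),
    detected=datetime(2022, 11, 1, 9, 25),
    response_started=datetime(2022, 11, 1, 9, 30),
    resolved=datetime(2022, 11, 1, 11, 0),
)
report = PostMortemReport(
    summary="Checkout requests failed for two hours after a deployment.",
    leadup="A new feature was deployed shortly before the failure.",
    fault="",
    impact="External users could not complete purchases.",
    detection="Customers reported the failure before any alert fired.",
    response="The on-call engineer began investigating.",
    recovery="",
    timeline=timeline,
)

print(timeline.time_to_detect())   # 0:25:00
print(timeline.duration())         # 2:00:00
print(report.missing_sections())   # ['fault', 'recovery']
```

In this sketch, a long gap between the start of the failure and its detection shows up directly in `time_to_detect()`, which is the kind of figure the Detection discussion can use when deciding whether alerting needs to improve.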

Define the root cause of the incident

Defining the root cause of the failure is critical to improving processes or systems in the company to prevent recurrence. Often, there are several contributing causes for a failure. To get down to the root cause, it is helpful to ask why each decision was made, again assuming it was made in good faith.

Root cause analysis can be complex when the failure is deep in software architecture or due to an edge case in user action. To ensure the root cause of a software failure can be found, observability tools should be in place to help teams identify failures quickly.

Discuss corrective actions to prevent the issue from repeating

After determining the processes that caused the error, corrective action can be established. This may be a new training program, a change in testing processes, or the automation of a process so that human error is less likely. The corrective action should be directly linked to the incident’s root cause to prevent the incident from recurring.
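One way to keep corrective actions tied to root causes is to record the link explicitly and check it before the meeting closes. The sketch below is an assumption about how a team might do this, not a method described in this article; the `CorrectiveAction` type, the `check_links` helper, and the example entries are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class CorrectiveAction:
    description: str
    root_cause: str  # the root cause this action is meant to address


def check_links(root_causes: list[str],
                actions: list[CorrectiveAction]) -> tuple[list[str], list[str]]:
    """Return (root causes with no action, actions citing an unlisted cause)."""
    addressed = {a.root_cause for a in actions}
    uncovered = [c for c in root_causes if c not in addressed]
    orphaned = [a.description for a in actions if a.root_cause not in root_causes]
    return uncovered, orphaned


# Made-up example entries, for illustration only.
root_causes = [
    "manual configuration change skipped review",
    "no alert on elevated error rate",
]
actions = [
    CorrectiveAction("Automate configuration rollout",
                     "manual configuration change skipped review"),
    CorrectiveAction("Add a new training program", "unclear runbook"),
]

uncovered, orphaned = check_links(root_causes, actions)
print(uncovered)  # ['no alert on elevated error rate']
print(orphaned)   # ['Add a new training program']
```

Here the second root cause has no corrective action yet, and the training program cites a cause the meeting never identified, so both would be worth revisiting before the report is finalized.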

Preventing Future Failures

A successful post-mortem meeting will identify processes and policies to prevent failures from recurring and will not place blame on an individual’s actions. Identify the root cause(s) of the failure from observability data and understand what issues were seen by customers. Take corrective actions by updating processes to prevent similar failures from occurring. 

Yoni Farin
Yoni Farin is CTO and co-founder of Coralogix. He has over 15 years of experience in software development and in team and group leadership. Yoni specializes in big data and distributed systems.