Postmortem Template

Date | Version | Changes
2022-04-12 | v1.0.0 | 

Incident Summary

Write a summary of the incident in a few sentences, including:

  • What happened

  • Why it happened

  • The severity of the incident

  • How long the impact lasted

Key Performance Indicators

KPI | Time | Comment
Time to Repair | | Time from the start of the incident until normal operation is restored
Time to Recover | | Time from the start of the incident until it is resolved
Time to Respond | | Time from the discovery of the incident until normal operation is restored
Time to Acknowledge | | Time from the first indication of the incident until work to resolve it starts
Time Since Last Failure | | Time from the end of the previous incident to the start of this one
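
The KPI definitions above are all differences between two timestamps, so they can be computed directly from the incident timeline. The following is a minimal sketch in Python, not a prescribed tool: the class name, the field names, and the example timestamps are all invented for illustration, and you should map them to whatever events your own timeline records.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimes:
    """Hypothetical timestamps for one incident (field names are illustrative)."""
    previous_incident_end: datetime  # end of the previous incident
    started: datetime                # start of this incident
    first_indication: datetime       # first sign that something was wrong
    discovered: datetime             # incident recognised as an incident
    work_started: datetime           # someone began working on a fix
    restored: datetime               # normal operation restored
    resolved: datetime               # incident declared resolved


def compute_kpis(t: IncidentTimes) -> dict[str, timedelta]:
    """Apply the KPI definitions from the table above, one subtraction each."""
    return {
        "Time to Repair": t.restored - t.started,
        "Time to Recover": t.resolved - t.started,
        "Time to Respond": t.restored - t.discovered,
        "Time to Acknowledge": t.work_started - t.first_indication,
        "Time Since Last Failure": t.started - t.previous_incident_end,
    }


# Made-up example values, only to show the arithmetic.
example = IncidentTimes(
    previous_incident_end=datetime(2024, 1, 1, 12, 0),
    started=datetime(2024, 1, 15, 9, 0),
    first_indication=datetime(2024, 1, 15, 9, 5),
    discovered=datetime(2024, 1, 15, 9, 10),
    work_started=datetime(2024, 1, 15, 9, 20),
    restored=datetime(2024, 1, 15, 10, 30),
    resolved=datetime(2024, 1, 15, 11, 0),
)

kpis = compute_kpis(example)
for name, value in kpis.items():
    print(f"{name}: {value}")
```

Keeping all timestamps in one time zone (or using timezone-aware values throughout) avoids off-by-hours errors in the Time column.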

Leadup

Describe the sequence of events that led to the incident, for example:

  • Previous changes that introduced bugs that had not yet been detected

Fault

Describe how the change that was implemented didn’t work as expected. If available, attach screenshots of relevant data visualisations that illustrate the fault.

Impact

Describe how the incident affected internal and external users. Include how many support cases were raised.

  • How many users were affected

  • For how long

  • How severe the impact was

Detection

When did the team detect the incident? How did they know it was happening? How could we improve time-to-detection?

Consider: How could we cut that time by half?

Response

Who responded to the incident? When did they respond and what did they do? Note any delays or obstacles to responding.

Recovery

Describe how the service was restored and how the incident was deemed over. Explain how you knew which steps to take for recovery.

Depending on the scenario, consider these questions:

  • How could you improve time to mitigation?

  • How could you have cut that time in half?

Timeline

Detail the incident timeline.

Include any notable lead-up events, any starts of activity, the first known impact and escalations. Note any decisions or changes made, and when the incident ended, along with any post-impact events of note.

Date/Time | Incident Activity
 | 
 | 

Root Cause Analysis

The Five Whys is a root cause identification technique. Here’s how you can use it:

  • Begin with a description of the impact and ask why it occurred

  • Note the impact that it had

  • Ask why this happened, and why it had the resulting impact

  • Then continue asking “why” until you arrive at a root cause

List the “whys” in your postmortem documentation.

Backlog Check

Review your engineering backlog to find out whether it contained any unplanned work that could have prevented this incident, or at least reduced its impact.

A clear-eyed assessment of the backlog can shed light on past decisions around priority and risk.

Recurrence

Now that you know the root cause, can you look back and see any other incidents that could have the same root cause? If yes, note what mitigation was attempted in those incidents and ask why this incident occurred again.

Lessons Learned

Discuss what went well in the incident response, what could have gone better, and which opportunities for improvement follow from that.

Corrective Actions

Describe the corrective actions ordered to prevent this class of incident in the future. Note who is responsible, when the work must be completed, and where it is being tracked.

Action | Responsible | Deadline | Issue Tracking