Mastering the Art of Post-Mortems: How to Learn from Your Mistakes and Improve Your Processes for Incidents

Jeff Cechinel
Mar 14, 2023

🔥 Incidents happen, and they will happen many times over the course of an engineering career, especially in software.

A post-mortem is an important process that allows teams to learn from their mistakes and improve their processes for the future. It’s an opportunity to identify the root cause of an incident, assess the impact, and develop a plan to prevent similar incidents from happening again.

In this article, we’ll go over the steps you can take to conduct a kick-ass post-mortem.

1. Give a Summary of What Happened

Start the post-mortem with a brief summary of the incident. This should include what happened, the impact it had, and the timeline of events. Keep it short, but be precise about times and impact, so everyone has a clear understanding of what occurred.
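To make the summary concrete, here is a minimal sketch of how it could be captured as data. The class names, fields, and the example incident are all made up for illustration; use whatever shape your team's incident tooling already has.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class TimelineEvent:
    at: datetime
    description: str


@dataclass
class IncidentSummary:
    # Hypothetical structure: these field names are assumptions, not a standard.
    title: str
    impact: str
    started_at: datetime   # when the problem actually began
    detected_at: datetime  # when a person or an alert noticed it
    resolved_at: datetime  # when things were back to normal
    timeline: list[TimelineEvent]

    def time_to_detect(self) -> timedelta:
        return self.detected_at - self.started_at

    def time_to_resolve(self) -> timedelta:
        return self.resolved_at - self.started_at

    def render(self) -> str:
        lines = [
            f"Incident: {self.title}",
            f"Impact: {self.impact}",
            f"Time to detect: {self.time_to_detect()}",
            f"Time to resolve: {self.time_to_resolve()}",
            "Timeline:",
        ]
        # Sort by time so events entered out of order still read correctly.
        for event in sorted(self.timeline, key=lambda e: e.at):
            lines.append(f"  {event.at:%H:%M} {event.description}")
        return "\n".join(lines)


# Made-up example data, for illustration only.
summary = IncidentSummary(
    title="Checkout errors after release",
    impact="Some customers could not complete orders",
    started_at=datetime(2023, 3, 1, 9, 0),
    detected_at=datetime(2023, 3, 1, 9, 25),
    resolved_at=datetime(2023, 3, 1, 10, 40),
    timeline=[
        TimelineEvent(datetime(2023, 3, 1, 9, 25), "Alert fired, on-call engineer paged"),
        TimelineEvent(datetime(2023, 3, 1, 9, 0), "Release deployed to production"),
        TimelineEvent(datetime(2023, 3, 1, 10, 40), "Rollback completed, errors stopped"),
    ],
)
print(summary.render())
```

Writing down the start, detection, and resolution times separately makes it obvious how long the incident went unnoticed, which is useful later when you get to monitoring.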

2. Review (or Create) the Incident Report

The next step is to review the incident report (IR). If no report exists yet, write one before the meeting. It should be a comprehensive report that details the incident, including any actions taken to mitigate the issue. Give everyone time to read through the IR before moving on to the next step.

3. Software Development Process

In this section, you’ll want to delve into the software development process. Identify the code that malfunctioned and review the tests that were written for it. Were there any limitations to the tests? Did the team perform code reviews or pair programming? Did any other step of the development process fail, and do similar areas of the codebase carry the same risk?

4. Testing Approach

Next, review the testing approach. How did QA miss the issue? Were there any gaps in the testing process? Identify any problems that occurred during the testing phase.

5. Environments and Data

It’s essential to review the environments and data that were involved in the incident. Did any issues occur with the environments or data that caused the incident? Were there any data discrepancies?

6. Product Management

In this section, review the non-functional requirements (NFRs) that were provided. Did the product management team adequately define the NFRs? Were there any issues with reconciliation? How did the store managers approve the go-live without seeing the issue, and what could be done differently in the future?

7. Operations

Then review the operations side of the incident. What slowed down or went wrong during the investigation? Were there any issues with observability and monitoring? Were there any other factors that contributed to the incident?
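One way to answer the monitoring question is to replay the data recorded during the incident against an alert rule and see when it would have fired. The sketch below assumes you have HTTP status codes grouped per minute and uses a simple error-rate threshold; the function names, the 5% default, and the sample data are all assumptions for illustration, not a recommendation for your system.

```python
from typing import Optional


def error_rate(status_codes: list[int]) -> float:
    """Share of responses in one window that were server errors (5xx)."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if code >= 500)
    return errors / len(status_codes)


def first_alert_minute(
    per_minute_codes: list[list[int]], threshold: float = 0.05
) -> Optional[int]:
    """Index of the first minute whose error rate exceeds the threshold, if any."""
    for minute, codes in enumerate(per_minute_codes):
        if error_rate(codes) > threshold:
            return minute
    return None


# Made-up replay data: one list of status codes per minute.
replay = [
    [200] * 100,              # minute 0: healthy
    [200] * 97 + [500] * 3,   # minute 1: 3% errors
    [200] * 90 + [503] * 10,  # minute 2: 10% errors
]

print(first_alert_minute(replay))                  # with the 5% default
print(first_alert_minute(replay, threshold=0.02))  # with a stricter 2% rule
```

Comparing the two results shows how much earlier a stricter rule would have paged someone, which gives the discussion something concrete to work with.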

8. Identify Root Causes

After reviewing each aspect of the incident, identify the root causes. What was the underlying issue that led to the incident? It’s crucial to identify the root causes to prevent similar incidents from happening in the future.

9. Develop an Action Plan

The final step is to develop an action plan. Identify the steps that need to be taken to prevent similar incidents from happening again. This could include improving the testing process, implementing better monitoring, or developing new protocols.
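An action plan only helps if someone owns each item and there is a date attached. Here is a minimal sketch of tracking that, assuming a simple in-memory list; the `ActionItem` fields and the example items are hypothetical, and in practice this would probably live in your issue tracker.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    # Hypothetical fields, chosen for illustration.
    description: str
    owner: str
    due: date
    done: bool = False


def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Items that are still open and past their due date."""
    return [item for item in items if not item.done and item.due < today]


# Made-up example plan.
plan = [
    ActionItem("Add integration test for the failing flow", "dev team", date(2023, 3, 20)),
    ActionItem("Add error-rate alert for the service", "ops team", date(2023, 3, 24), done=True),
    ActionItem("Document the rollback procedure", "ops team", date(2023, 4, 5)),
]

for item in overdue(plan, today=date(2023, 3, 27)):
    print(f"OVERDUE: {item.description} (owner: {item.owner}, due {item.due})")
```

Reviewing the overdue list at a regular team meeting keeps the plan from being forgotten once the incident is no longer fresh.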

In conclusion, a post-mortem is an essential process for any team that wants to improve its processes and prevent incidents from happening in the future. By following these steps, you can conduct a kick-ass post-mortem that will provide valuable insights and help your team grow. Remember to stay focused, stay objective, and be open to feedback and criticism. With a little effort, you can turn a negative incident into a positive learning experience for your team.
