Problem reports

Production incidents

In the spirit of continuous improvement, write a post-mortem for every issue in Production, including root cause details.

At a minimum, the post-mortem needs to include:

  1. A summary of the issue
  2. The dates and times of events related to the issue (first notified of issue, root cause identified, issue mitigated, fix delivered to Production)
  3. Business impact of the issue
  4. Engineering Root Cause
  5. Identification / Issue signature (what does this look like in logs, behavior, etc.?)
  6. Resolution
  7. Next Steps
  8. Learnings and Mitigation (how have we changed our processes to prevent the issue in the future?)
  9. Link to IMOP incident
  10. Link to Jira issue for defect
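
One way to keep reports consistent is to treat the checklist above as a small data structure and check a draft for gaps before publishing. The sketch below is only an illustration: the `PostMortem` class, its field names, and the `missing_fields` and `time_to_mitigate` helpers are all hypothetical names made up for this note, not part of IMOP, Jira, or any existing tool.

```python
from dataclasses import dataclass, fields
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class PostMortem:
    """Hypothetical container mirroring the ten checklist items."""
    summary: str = ""
    # Item 2: the timeline of key events.
    first_notified: Optional[datetime] = None
    root_cause_identified: Optional[datetime] = None
    mitigated: Optional[datetime] = None
    fix_delivered: Optional[datetime] = None
    business_impact: str = ""
    engineering_root_cause: str = ""
    issue_signature: str = ""
    resolution: str = ""
    next_steps: str = ""
    learnings_and_mitigation: str = ""
    imop_incident_link: str = ""
    jira_issue_link: str = ""

    def missing_fields(self) -> List[str]:
        """Return the names of fields that are still empty or unset."""
        return [f.name for f in fields(self)
                if getattr(self, f.name) in (None, "")]

    def time_to_mitigate(self) -> Optional[timedelta]:
        """Elapsed time from first notification to mitigation, if both are known."""
        if self.first_notified is None or self.mitigated is None:
            return None
        return self.mitigated - self.first_notified


# Example draft with made-up values; most fields are still empty.
draft = PostMortem(
    summary="Example: checkout requests failing",
    first_notified=datetime(2022, 1, 4, 9, 0),
    mitigated=datetime(2022, 1, 4, 11, 0),
)
print(draft.missing_fields())
print(draft.time_to_mitigate())
```

A report would be considered complete under this sketch when `missing_fields()` returns an empty list.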

Reading material

https://microservices.io/post/microservices/2022/01/04/writing-better-problem-reports.html
