Incident Postmortem: [Incident Title]
Incident Number: INC-[YYYY-MM-DD-###]
Date: [YYYY-MM-DD]
Authors: [Names]
Status: [Draft/Final]
Severity: P[1-4]
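The incident number pattern above can be checked mechanically before a postmortem is filed. The sketch below assumes that `INC-[YYYY-MM-DD-###]` means an ISO date followed by a three-digit sequence number; `parse_incident_number` is a hypothetical helper name, not part of any existing tooling.

```python
import re
from datetime import date

# Assumed reading of the template's INC-[YYYY-MM-DD-###] pattern:
# an ISO calendar date followed by a three-digit sequence number.
INCIDENT_ID = re.compile(r"INC-(\d{4})-(\d{2})-(\d{2})-(\d{3})")


def parse_incident_number(text: str) -> tuple[date, int]:
    """Return (incident date, sequence number) or raise ValueError."""
    match = INCIDENT_ID.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid incident number: {text!r}")
    year, month, day, seq = (int(part) for part in match.groups())
    # date() itself raises ValueError for impossible dates such as month 13.
    return date(year, month, day), seq
```

Parsing into a real `date` rather than only matching digits means an identifier with an impossible date is rejected instead of being silently accepted.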
Executive Summary
[2-3 sentence summary of what happened, what the impact was, and how it was resolved]
Impact
Duration: [Start time] - [End time] ([Total duration])
Users Affected: [Number or percentage]
Services Affected:
- [ Service 1 ]
- [ Service 2 ]
Business Impact:
- [Lost revenue, if applicable]
- [SLA breaches]
- [User complaints]
Timeline
All times in UTC
Time | Event |
---|---|
HH:MM | Initial alert/report received |
HH:MM | Incident declared |
HH:MM | [Key event] |
HH:MM | Root cause identified |
HH:MM | Fix implemented |
HH:MM | Service restored |
HH:MM | Incident closed |
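Once the timeline is filled in, the elapsed times between its milestones can be derived rather than computed by hand. The following is a minimal sketch, assuming every event falls on the same UTC calendar day and that the event labels match the template rows exactly; `timeline_metrics` and the sample times are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def parse_utc(day: str, hhmm: str) -> datetime:
    """Combine a YYYY-MM-DD day and an HH:MM time into an aware UTC datetime."""
    parsed = datetime.strptime(f"{day} {hhmm}", "%Y-%m-%d %H:%M")
    return parsed.replace(tzinfo=timezone.utc)


def timeline_metrics(day: str, events: dict[str, str]) -> dict[str, int]:
    """Return minutes elapsed from the initial alert to each later milestone.

    Assumption: all events happen on the same UTC day, so an incident that
    crosses midnight would need full timestamps instead of HH:MM values.
    """
    times = {name: parse_utc(day, hhmm) for name, hhmm in events.items()}
    start = times["Initial alert/report received"]
    one_minute = timedelta(minutes=1)
    return {
        "minutes_to_declare": (times["Incident declared"] - start) // one_minute,
        "minutes_to_root_cause": (times["Root cause identified"] - start) // one_minute,
        "minutes_to_restore": (times["Service restored"] - start) // one_minute,
    }


# Hypothetical sample values, for illustration only.
example = timeline_metrics(
    "2024-03-15",
    {
        "Initial alert/report received": "14:02",
        "Incident declared": "14:10",
        "Root cause identified": "14:45",
        "Service restored": "15:20",
    },
)
print(example)
```

With the sample values, the result is 8 minutes to declare, 43 minutes to root cause, and 78 minutes to restore, which can feed the Duration field in the Impact section.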
Root Cause Analysis
What Happened?
[Detailed technical explanation of the failure]
Why Did It Happen?
[Use 5 Whys or similar technique]
- Why did the service fail?
  - [Answer]
- Why did [answer above] happen?
  - [Answer]
- Continue until root cause...
Root Cause:
[Single sentence describing the fundamental cause]
Detection & Response
How Was This Detected?
- [ ] Automated monitoring
- [ ] User report
- [ ] Engineering observation
What Went Well?
- [Positive aspect 1]
- [Positive aspect 2]
What Could Be Improved?
- [Improvement area 1]
- [Improvement area 2]
Technical Details
Configuration/Code That Failed
[Relevant code or configuration excerpt]
Error Messages/Logs
[Relevant log excerpts]
Debugging Steps Taken
- [Step 1]
- [Step 2]
Lessons Learned
What Did We Learn?
- [Learning 1]
- [Learning 2]
What Surprised Us?
- [Surprise 1]
What Still Puzzles Us?
- [Unknown 1]
Action Items
Action | Owner | Due Date | Priority |
---|---|---|---|
[Implement fix for root cause] | @owner | YYYY-MM-DD | P1 |
[Add monitoring for X] | @owner | YYYY-MM-DD | P2 |
[Update runbook] | @owner | YYYY-MM-DD | P2 |
[Add alerting for Y] | @owner | YYYY-MM-DD | P3 |
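Action items tend to slip after the postmortem meeting, so it can help to review the table programmatically. This is a minimal sketch, assuming each row is loaded into a record with the same four columns; `ActionItem` and `overdue` are hypothetical names, and the priority ordering relies on the P1 to P4 labels used in this template.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    action: str
    owner: str
    due: date
    priority: str  # "P1" (most urgent) through "P4", as in the table above


def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Return items whose due date has passed, most urgent and oldest first."""
    late = [item for item in items if item.due < today]
    # "P1" < "P2" < "P3" < "P4" as plain strings, so a string sort is enough.
    return sorted(late, key=lambda item: (item.priority, item.due))
```

An item due today is not treated as overdue; change `<` to `<=` if the team counts the due date itself as late.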
Prevention
How Do We Prevent This Specific Issue?
[Specific technical fix]
How Do We Prevent This Class of Issues?
[Broader architectural or process improvements]
How Do We Improve Detection?
[Monitoring/alerting improvements]
Appendix
Links
Supporting Documentation
- [Architecture diagrams]
- [Relevant runbooks]
Postmortem Meeting Notes
Date: [YYYY-MM-DD]
Attendees: [Names]
Discussion Points
- [Point 1]
- [Point 2]
Additional Action Items
- [Any new items from meeting]
Document ID: workflows/operations/incident-response/postmortems/postmortem-template