Postmortems
Document incidents, identify root causes, and track improvements with blameless postmortems
Postmortems (also called Post-Incident Reviews or PIRs) document what happened during an incident, why it happened, and what you'll do to prevent recurrence. OpsKnight provides a structured workflow for creating, reviewing, and tracking postmortems.
Why Postmortems Matter
| Without Postmortems | With Postmortems |
|---|---|
| Same incidents repeat | Learn and prevent recurrence |
| Tribal knowledge | Documented institutional memory |
| Blame culture | Blameless improvement culture |
| No accountability for fixes | Tracked action items with owners |
Postmortem Workflow
OpsKnight postmortems follow a structured lifecycle:
DRAFT → IN_REVIEW → PUBLISHED → ARCHIVED
Workflow States
| Status | Description | Who Can Edit |
|---|---|---|
| DRAFT | Initial creation, work in progress | Author, Editors |
| IN_REVIEW | Ready for team review and feedback | Author, Editors, Reviewers |
| PUBLISHED | Finalized, visible to organization | Admins only |
| ARCHIVED | Historical record, no longer active | Admins only |
State Transitions
                ┌─────────────┐
                │    DRAFT    │
                └──────┬──────┘
                       │ Submit for Review
                       ▼
                ┌─────────────┐
          ┌─────│  IN_REVIEW  │─────┐
          │     └──────┬──────┘     │
  Request │            │            │ Request
  Changes │            │ Approve    │ Changes
          │            ▼            │
          │     ┌─────────────┐     │
          └────►│  PUBLISHED  │◄────┘
                └──────┬──────┘
                       │ Archive
                       ▼
                ┌─────────────┐
                │  ARCHIVED   │
                └─────────────┘
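The lifecycle can be modeled as a small transition table. The following is a minimal sketch, not OpsKnight's implementation: the action names are hypothetical, and sending "Request Changes" back to DRAFT is an assumption, since the diagram does not state which status a change request leads to.

```python
from enum import Enum


class Status(Enum):
    DRAFT = "DRAFT"
    IN_REVIEW = "IN_REVIEW"
    PUBLISHED = "PUBLISHED"
    ARCHIVED = "ARCHIVED"


# (current status, action) -> next status.
# Action names are hypothetical. Returning to DRAFT on "request_changes"
# is an assumption, not something the documentation confirms.
TRANSITIONS = {
    (Status.DRAFT, "submit_for_review"): Status.IN_REVIEW,
    (Status.IN_REVIEW, "approve"): Status.PUBLISHED,
    (Status.IN_REVIEW, "request_changes"): Status.DRAFT,
    (Status.PUBLISHED, "archive"): Status.ARCHIVED,
}


def next_status(current: Status, action: str) -> Status:
    """Return the status after `action`, or raise if the move is not allowed."""
    try:
        return TRANSITIONS[(current, action)]
    except KeyError:
        raise ValueError(f"{action!r} is not allowed from {current.value}") from None


if __name__ == "__main__":
    status = Status.DRAFT
    for action in ("submit_for_review", "approve", "archive"):
        status = next_status(status, action)
        print(status.value)
```

Keeping the rules in one table makes disallowed moves, such as publishing straight from DRAFT, fail loudly instead of silently.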
When to Write a Postmortem
Required Triggers
| Condition | Rationale |
|---|---|
| Any HIGH urgency incident | Critical issues need documentation |
| Customer-impacting outage | External impact requires review |
| Data loss or security incident | Compliance and learning |
| Incident duration > 1 hour | Extended incidents have lessons |
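The required triggers above can be expressed as a simple check. This is a rough sketch for teams that want to automate the decision; the `Incident` class and all of its field names are hypothetical and do not describe OpsKnight's data model.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class Incident:
    # All field names are hypothetical, chosen for this sketch.
    urgency: str
    customer_impacting: bool
    data_loss_or_security: bool
    duration: timedelta


def required_reasons(incident: Incident) -> list[str]:
    """Return every required-trigger condition this incident meets."""
    reasons = []
    if incident.urgency == "HIGH":
        reasons.append("HIGH urgency incident")
    if incident.customer_impacting:
        reasons.append("Customer-impacting outage")
    if incident.data_loss_or_security:
        reasons.append("Data loss or security incident")
    if incident.duration > timedelta(hours=1):
        reasons.append("Incident duration > 1 hour")
    return reasons


if __name__ == "__main__":
    outage = Incident("HIGH", True, False, timedelta(minutes=45))
    print(required_reasons(outage))
```

A non-empty result means a postmortem is required; returning the reasons rather than a boolean lets the postmortem record why it was opened.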
Recommended Triggers
| Condition | Rationale |
|---|---|
| Recurring incident pattern | Break the cycle |
| Near-miss (almost critical) | Learn before it gets worse |
| Novel failure mode | Document new knowledge |
| Cross-team coordination issues | Process improvements |
Skip Postmortem When
- Incident was a false positive
- Root cause is already well-documented
- No meaningful learnings possible
- Incident was immediately auto-resolved
Creating a Postmortem
From an Incident
- Open a resolved incident
- Click Create Postmortem
- Incident data is automatically populated:
  - Title and description
  - Timeline events
  - Affected services
  - Participants
From Scratch
- Go to Postmortems in the sidebar
- Click Create Postmortem
- Link to an incident (optional)
- Fill in the template
Postmortem Fields
| Field | Required | Description |
|---|---|---|
| Title | Yes | Clear, descriptive title |
| Incident | No | Linked incident (auto-populates data) |
| Summary | Yes | Executive summary of what happened |
| Timeline | Yes | Chronological event sequence |
| Impact | Yes | Business and customer impact |
| Root Cause | Yes | Technical explanation of failure |
| Resolution | Yes | How the incident was resolved |
| Action Items | Yes | Follow-up tasks with owners |
| Lessons Learned | No | Key takeaways |
| Contributing Factors | No | Additional factors beyond root cause |
| Detection | No | How the incident was discovered |
| Response | No | Evaluation of incident response |
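A pre-publish check for the required fields in the table above might look like the sketch below. Only `title` and `summary` appear as JSON keys in the API examples on this page; the remaining key names are assumptions made for illustration.

```python
# Keys for the fields marked "Yes" in the table. Only "title" and "summary"
# are taken from the API examples; the other key names are assumed.
REQUIRED_FIELDS = (
    "title",
    "summary",
    "timeline",
    "impact",
    "rootCause",
    "resolution",
    "actionItems",
)


def missing_required(postmortem: dict) -> list[str]:
    """List required fields that are absent or empty."""
    return [field for field in REQUIRED_FIELDS if not postmortem.get(field)]


if __name__ == "__main__":
    draft = {"title": "Payment API Outage - Jan 15", "summary": ""}
    print(missing_required(draft))
```

An empty string or empty list counts as missing here, which matches the intent that a required section must have content before publishing.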
Postmortem Sections
Summary
A brief executive summary (2-3 sentences) answering:
- What happened?
- What was the impact?
- How was it resolved?
Example:
On January 15, 2024, the Payment API experienced a 45-minute outage due to database connection pool exhaustion. Approximately 2,300 transactions failed during the incident. Service was restored by increasing connection pool limits and restarting affected pods.
Timeline
Chronological sequence of events with timestamps.
| Time | Event |
|---|---|
| 14:00 | Monitoring alert triggered for elevated API latency |
| 14:03 | On-call engineer acknowledged alert |
| 14:08 | Initial investigation started, high DB connection count noted |
| 14:15 | Root cause identified: connection pool exhausted |
| 14:22 | Mitigation applied: increased pool size |
| 14:30 | Service restored, monitoring confirmed |
| 14:45 | Incident resolved, follow-up tasks created |
Timeline Best Practices:
- Use consistent timezone (UTC recommended)
- Include who performed each action
- Note key decisions and why they were made
- Include any false starts or dead ends
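Storing timeline entries as timezone-aware UTC timestamps makes response intervals easy to derive. The sketch below uses the example timeline above and assumes its times are UTC on the incident date from the example summary; the helper names are hypothetical.

```python
from datetime import datetime, timezone


def at(hhmm: str) -> datetime:
    """Build a UTC timestamp on the example incident date (2024-01-15)."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2024, 1, 15, hour, minute, tzinfo=timezone.utc)


# Key events from the example timeline (assumed to be UTC).
ALERT = at("14:00")
ACKNOWLEDGED = at("14:03")
ROOT_CAUSE = at("14:15")
RESTORED = at("14:30")
RESOLVED = at("14:45")


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed between two timestamps."""
    return int((end - start).total_seconds() // 60)


if __name__ == "__main__":
    print("Time to acknowledge:", minutes_between(ALERT, ACKNOWLEDGED), "min")
    print("Time to root cause: ", minutes_between(ALERT, ROOT_CAUSE), "min")
    print("Time to restore:    ", minutes_between(ALERT, RESTORED), "min")
    print("Time to resolve:    ", minutes_between(ALERT, RESOLVED), "min")
```

Because every timestamp carries an explicit timezone, entries contributed by responders in different regions can be compared without guessing offsets.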
Impact
Quantify the business and customer impact.
| Impact Type | Measurement |
|---|---|
| Duration | 45 minutes |
| Affected Users | ~2,300 customers |
| Failed Transactions | 2,347 |
| Revenue Impact | $12,500 estimated |
| SLA Breach | Yes, 99.9% target missed |
| Support Tickets | 47 tickets opened |
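To see why a 45-minute outage can breach a 99.9% target, it helps to work out the downtime budget. The sketch below assumes the SLA is measured over a 30-day window, which the example does not state; adjust the window to match your contract.

```python
def allowed_downtime_minutes(target_percent: float, window_days: int) -> float:
    """Downtime budget, in minutes, for an availability target over a window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - target_percent) / 100


def availability_percent(downtime_minutes: float, window_days: int) -> float:
    """Availability achieved given total downtime in the window."""
    window_minutes = window_days * 24 * 60
    return 100 * (window_minutes - downtime_minutes) / window_minutes


if __name__ == "__main__":
    # Assumption: a 30-day measurement window (43,200 minutes).
    budget = allowed_downtime_minutes(99.9, 30)
    achieved = availability_percent(45, 30)
    print(f"Budget: {budget:.1f} min, achieved availability: {achieved:.3f}%")
    print("SLA breached:", 45 > budget)
```

Under that assumption the budget is about 43.2 minutes, so a single 45-minute outage exceeds it on its own.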
Root Cause
Technical explanation of why the incident occurred.
Structure:
- What failed: The specific component or system
- Why it failed: The technical reason
- Why it wasn't caught: Detection gaps
Example:
The database connection pool was configured with a maximum of 50 connections, a value inherited from the initial deployment two years ago. Recent traffic growth increased average concurrent connections from 30 to 48. A traffic spike from a marketing campaign pushed connections over the limit, causing new requests to queue and time out.
The connection pool metrics were not monitored, so the gradual increase went unnoticed until the hard failure.
Resolution
Steps taken to restore service.
| Step | Action | Result |
|---|---|---|
| 1 | Increased connection pool to 100 | Pending connections processed |
| 2 | Restarted 3 affected API pods | Fresh connection pools |
| 3 | Verified transaction processing | Normal throughput resumed |
| 4 | Monitored for 15 minutes | No recurrence |
Action Items
Tracked tasks to prevent recurrence.
| Action | Owner | Due Date | Priority | Status |
|---|---|---|---|---|
| Add connection pool monitoring | @jane | Jan 22 | HIGH | Open |
| Set up alerts at 80% pool usage | @jane | Jan 22 | HIGH | Open |
| Review all DB connection configs | @bob | Jan 29 | MEDIUM | Open |
| Document connection pool sizing | @alice | Feb 5 | LOW | Open |
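The "alerts at 80% pool usage" action item comes down to a utilization check. The sketch below applies it to the numbers from the root cause example (48 average connections against a limit of 50); the function names are hypothetical, and a real alert would live in your monitoring system rather than in application code.

```python
def pool_utilization(active: int, max_connections: int) -> float:
    """Fraction of the connection pool currently in use."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return active / max_connections


def should_alert(active: int, max_connections: int, threshold: float = 0.80) -> bool:
    """True when utilization has reached the alert threshold."""
    return pool_utilization(active, max_connections) >= threshold


if __name__ == "__main__":
    for active, limit in ((30, 50), (48, 50), (48, 100)):
        pct = pool_utilization(active, limit) * 100
        print(f"{active}/{limit} connections = {pct:.0f}% -> alert: {should_alert(active, limit)}")
```

With the original limit of 50, an 80% threshold would have fired well before the pool was exhausted, which is the detection gap the action item closes.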
Lessons Learned
Key takeaways for the team.
What went well:
- Alert fired within 3 minutes of issue
- On-call responded quickly
- Root cause identified in 12 minutes
What could be improved:
- No monitoring on connection pool utilization
- Initial config was never revisited as traffic grew
- Runbook didn't cover connection pool issues
Where we got lucky:
- Traffic spike was moderate; a larger spike would have been worse
- Database itself remained healthy
Action Item Tracking
Action Item Fields
| Field | Required | Description |
|---|---|---|
| Description | Yes | What needs to be done |
| Owner | Yes | Person responsible |
| Due Date | Yes | Target completion date |
| Priority | Yes | HIGH, MEDIUM, LOW |
| Status | Yes | OPEN, IN_PROGRESS, COMPLETED, WONT_DO |
| Ticket Link | No | Link to issue tracker (Jira, GitHub, etc.) |
Action Item Statuses
| Status | Meaning |
|---|---|
| OPEN | Not yet started |
| IN_PROGRESS | Work has begun |
| COMPLETED | Task finished |
| WONT_DO | Decided not to pursue (with justification) |
Tracking Progress
View action item status across postmortems:
- Go to Postmortems → Action Items
- Filter by:
  - Status (open, overdue, completed)
  - Owner
  - Priority
  - Due date range
- Export for tracking meetings
Overdue Items
OpsKnight highlights overdue action items:
- Items past due date show warning indicator
- Dashboard shows overdue count
- Optional email reminders to owners
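If you pull action items through the API and want to compute overdue items yourself, the rule can be sketched as follows. It assumes that items in COMPLETED or WONT_DO never count as overdue and that an item becomes overdue the day after its due date; the `dueDate` key and format follow the API example on this page, while `status` as a JSON key is an assumption.

```python
from datetime import date

# Assumption: closed items are never reported as overdue.
CLOSED_STATUSES = {"COMPLETED", "WONT_DO"}


def is_overdue(item: dict, today: date) -> bool:
    """True when an open item's due date is strictly before `today`."""
    due = date.fromisoformat(item["dueDate"])
    return item["status"] not in CLOSED_STATUSES and due < today


if __name__ == "__main__":
    items = [
        {"description": "Add connection pool monitoring", "dueDate": "2024-01-22", "status": "OPEN"},
        {"description": "Review all DB connection configs", "dueDate": "2024-01-29", "status": "COMPLETED"},
    ]
    today = date(2024, 2, 1)
    for item in items:
        print(item["description"], "-> overdue:", is_overdue(item, today))
```

Passing `today` in explicitly, instead of calling `date.today()` inside the function, keeps the check deterministic and easy to test.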
Visibility & Sharing
Internal Visibility
| Setting | Who Can View |
|---|---|
| Private | Only participants and editors |
| Team | Members of associated team(s) |
| Organization | All organization members |
External Sharing
For customer communication:
| Option | Description |
|---|---|
| Public Summary | Sanitized version for status page |
| Customer Email | Share directly with affected customers |
| Public Link | Generate shareable read-only link |
What to Share Externally
Include:
- What happened (high level)
- Impact duration
- Resolution confirmation
- Preventive measures (general)
Exclude:
- Internal tooling details
- Specific infrastructure info
- Individual names
- Security-sensitive details
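One way to apply the include and exclude lists when preparing an external version is an allowlist: copy only the fields that are safe to share and drop everything else by default. The sketch below is illustrative only; apart from `summary`, every key name is hypothetical, and it does not describe how OpsKnight builds its Public Summary.

```python
# Fields considered safe to share externally. Key names other than
# "summary" are hypothetical and exist only for this sketch.
PUBLIC_FIELDS = ("summary", "impactDuration", "resolution", "preventiveMeasures")


def public_summary(postmortem: dict) -> dict:
    """Copy only allowlisted fields; anything not listed is left out."""
    return {key: postmortem[key] for key in PUBLIC_FIELDS if key in postmortem}


if __name__ == "__main__":
    internal = {
        "summary": "Payment API outage caused by connection pool exhaustion",
        "impactDuration": "45 minutes",
        "resolution": "Service restored and verified",
        "participants": ["@jane", "@bob"],
        "rootCause": "Pool limit of 50 connections on the primary database",
    }
    print(public_summary(internal))
```

An allowlist fails safe when new internal fields are added, but it does not sanitize the text inside the fields it keeps, so the shared wording still needs a human review.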
Collaboration Features
Editors
Add collaborators who can edit the postmortem:
- Open postmortem
- Click Editors
- Add team members
- Set permission level (Edit, Comment)
Comments & Discussion
- Add comments to specific sections
- @mention team members
- Resolve comment threads
- Track unresolved comments before publishing
Review Requests
Request formal review before publishing:
- Change status to IN_REVIEW
- Add reviewers
- Reviewers receive notification
- Reviewers can approve or request changes
- All approvals required before publishing
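The "all approvals required" rule can be sketched as a single predicate. The decision values and the function name below are hypothetical, and treating a postmortem with no reviewers as not yet publishable is an assumption for cases where review is mandatory.

```python
def all_reviewers_approved(decisions: dict[str, str]) -> bool:
    """True only when at least one reviewer exists and every one has approved.

    `decisions` maps a reviewer to a hypothetical decision value such as
    "APPROVED", "CHANGES_REQUESTED", or "PENDING".
    """
    return bool(decisions) and all(d == "APPROVED" for d in decisions.values())


if __name__ == "__main__":
    print(all_reviewers_approved({"jane": "APPROVED", "bob": "PENDING"}))
    print(all_reviewers_approved({"jane": "APPROVED", "bob": "APPROVED"}))
```

A single pending or change-requesting reviewer blocks publication, which mirrors the list above.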
Templates
Default Template
OpsKnight provides a default template with all standard sections.
Custom Templates
Create organization-specific templates:
- Go to Settings → Postmortems → Templates
- Click Create Template
- Define:
  - Template name
  - Required sections
  - Default content/prompts
  - Custom fields
- Save template
Template Sections
| Section | Customizable |
|---|---|
| Required/Optional | Yes |
| Default text | Yes |
| Helper prompts | Yes |
| Section order | Yes |
| Custom sections | Yes |
Linking to Incidents
Auto-Population
When creating a postmortem from an incident:
| Auto-Populated | Source |
|---|---|
| Title | Incident title |
| Summary | Incident description |
| Timeline | Incident timeline events |
| Affected Services | Incident services |
| Duration | Incident timestamps |
| Participants | Incident responders |
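The mapping in the table can be pictured as a plain function from an incident record to a draft postmortem. This is a sketch of the idea rather than OpsKnight's implementation: every incident key name is hypothetical, and only `title`, `summary`, `incidentId`, and `status` match the API example on this page.

```python
from datetime import datetime


def draft_from_incident(incident: dict) -> dict:
    """Map incident data onto a new DRAFT postmortem (incident keys are hypothetical)."""
    started = datetime.fromisoformat(incident["startedAt"])
    resolved = datetime.fromisoformat(incident["resolvedAt"])
    return {
        "incidentId": incident["id"],
        "status": "DRAFT",
        "title": incident["title"],
        "summary": incident["description"],
        "timeline": list(incident["timelineEvents"]),
        "affectedServices": list(incident["services"]),
        # Duration is derived from the incident timestamps.
        "durationMinutes": int((resolved - started).total_seconds() // 60),
        "participants": list(incident["responders"]),
    }


if __name__ == "__main__":
    incident = {
        "id": "inc_abc123",
        "title": "Payment API Outage - Jan 15",
        "description": "Database connection pool exhaustion caused 45-minute outage",
        "timelineEvents": ["14:00 Monitoring alert triggered"],
        "services": ["Payment API"],
        "responders": ["@jane"],
        "startedAt": "2024-01-15T14:00:00+00:00",
        "resolvedAt": "2024-01-15T14:45:00+00:00",
    }
    print(draft_from_incident(incident))
```

The lists are copied so that later edits to the postmortem do not change the incident record.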
Multiple Incidents
Link multiple related incidents to one postmortem:
- Common root cause affecting multiple services
- Cascading failures
- Related concurrent incidents
Postmortem Meetings
Scheduling
Schedule a postmortem review meeting:
- Open postmortem
- Click Schedule Meeting
- Select attendees (auto-suggests incident participants)
- Choose date/time
- Generate calendar invite
Meeting Integration
| Platform | Support |
|---|---|
| Google Calendar | Direct integration |
| Outlook/O365 | ICS file download |
| Zoom | Meeting link generation |
| Google Meet | Meeting link generation |
Meeting Agenda
Auto-generated agenda includes:
- Incident summary review
- Timeline walkthrough
- Root cause discussion
- Action item assignment
- Lessons learned
Reporting & Analytics
Postmortem Metrics
| Metric | Description |
|---|---|
| Postmortems Created | Count per period |
| Completion Rate | Draft → Published conversion |
| Average Time to Complete | Days from incident to published |
| Action Item Completion | % of items completed on time |
| Overdue Items | Count of past-due actions |
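Two of these metrics can be reproduced from exported data with a few lines of code. The definitions below are assumptions: completion rate is taken as the share of postmortems that reached PUBLISHED or ARCHIVED, and on-time completion as the share of all action items completed on or before their due date. The `completedOn` key is hypothetical.

```python
from datetime import date


def completion_rate(postmortems: list[dict]) -> float:
    """Share of postmortems that reached PUBLISHED (or were later ARCHIVED)."""
    if not postmortems:
        return 0.0
    done = sum(1 for p in postmortems if p["status"] in {"PUBLISHED", "ARCHIVED"})
    return done / len(postmortems)


def on_time_rate(items: list[dict]) -> float:
    """Share of action items completed on or before their due date."""
    if not items:
        return 0.0
    on_time = sum(
        1
        for item in items
        if item["status"] == "COMPLETED"
        and date.fromisoformat(item["completedOn"]) <= date.fromisoformat(item["dueDate"])
    )
    return on_time / len(items)


if __name__ == "__main__":
    postmortems = [{"status": s} for s in ("DRAFT", "PUBLISHED", "PUBLISHED", "ARCHIVED")]
    items = [
        {"status": "COMPLETED", "dueDate": "2024-01-22", "completedOn": "2024-01-20"},
        {"status": "COMPLETED", "dueDate": "2024-01-22", "completedOn": "2024-01-25"},
        {"status": "OPEN", "dueDate": "2024-01-29"},
    ]
    print(f"Completion rate: {completion_rate(postmortems):.0%}")
    print(f"On-time action items: {on_time_rate(items):.0%}")
```

Whether open items belong in the on-time denominator is a reporting choice; the built-in analytics may define it differently.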
Trends
Track patterns across postmortems:
- Most common root causes
- Frequently affected services
- Recurring action item types
- Team completion rates
Best Practices
Blameless Culture
| Do | Don't |
|---|---|
| Focus on systems and processes | Blame individuals |
| Ask "what" and "how" | Ask "who" |
| Assume good intentions | Assume negligence |
| Treat failures as learning | Treat failures as punishment |
Writing Quality
- Be specific: Include exact times, metrics, commands
- Be factual: Document what happened, not opinions
- Be complete: Don't skip uncomfortable details
- Be constructive: Every problem needs an action item
Timing
| Phase | Target |
|---|---|
| Draft started | Within 24 hours of resolution |
| Draft completed | Within 48 hours |
| Review completed | Within 1 week |
| Published | Within 2 weeks |
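These targets translate into concrete deadlines once the resolution time is known. The sketch below assumes every target is measured from incident resolution; the table states that explicitly only for the first phase. The phase keys are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: all offsets are counted from the incident's resolution time.
TARGETS = {
    "draft_started": timedelta(hours=24),
    "draft_completed": timedelta(hours=48),
    "review_completed": timedelta(weeks=1),
    "published": timedelta(weeks=2),
}


def deadlines(resolved_at: datetime) -> dict[str, datetime]:
    """Compute the target deadline for each phase."""
    return {phase: resolved_at + offset for phase, offset in TARGETS.items()}


if __name__ == "__main__":
    resolved = datetime(2024, 1, 15, 14, 45, tzinfo=timezone.utc)
    for phase, due in deadlines(resolved).items():
        print(f"{phase:<17} {due.isoformat()}")
```

For the example incident resolved at 14:45 on January 15, the publication target falls on January 29.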
Action Items
- Make them specific and measurable
- Assign one owner (not a team)
- Set realistic due dates
- Track to completion (don't let items rot)
- Link to tickets in your issue tracker
API Access
Endpoints
| Endpoint | Method | Description |
|---|---|---|
| `/api/postmortems` | GET | List postmortems |
| `/api/postmortems` | POST | Create postmortem |
| `/api/postmortems/:id` | GET | Get postmortem details |
| `/api/postmortems/:id` | PATCH | Update postmortem |
| `/api/postmortems/:id/action-items` | GET | List action items |
| `/api/postmortems/:id/action-items` | POST | Add action item |
Example: Create Postmortem
curl -X POST "https://your-opsknight.com/api/postmortems" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "title": "Payment API Outage - Jan 15",
    "incidentId": "inc_abc123",
    "summary": "Database connection pool exhaustion caused 45-minute outage",
    "status": "DRAFT"
  }'
Example: Add Action Item
curl -X POST "https://your-opsknight.com/api/postmortems/pm_xyz/action-items" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "description": "Add connection pool monitoring",
    "ownerId": "user_jane",
    "dueDate": "2024-01-22",
    "priority": "HIGH"
  }'
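The same calls can be made from a script. The sketch below builds the "Create Postmortem" request with Python's standard library, mirroring the curl example; the helper name `build_create_request` is hypothetical, and the response format is not documented here, so the sketch stops before parsing any reply.

```python
import json
import urllib.request


def build_create_request(base_url: str, token: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST /api/postmortems request."""
    return urllib.request.Request(
        url=f"{base_url}/api/postmortems",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    request = build_create_request(
        "https://your-opsknight.com",
        "YOUR_API_TOKEN",
        {
            "title": "Payment API Outage - Jan 15",
            "incidentId": "inc_abc123",
            "summary": "Database connection pool exhaustion caused 45-minute outage",
            "status": "DRAFT",
        },
    )
    print(request.get_method(), request.full_url)
```

To send the request, pass it to `urllib.request.urlopen(request)` once the base URL and token point at a real instance.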
Integrations
Slack
- Notify channel when postmortem is published
- Share postmortem link with formatted preview
- Receive action item reminders
Issue Trackers
| Platform | Features |
|---|---|
| Jira | Create issues from action items, sync status |
| GitHub Issues | Create issues, link PRs |
| Linear | Create issues, track status |
| Asana | Create tasks from action items |
Document Export
| Format | Use Case |
|---|---|
| PDF | Formal documentation, compliance |
| Markdown | Wiki, documentation sites |
| HTML | Email, web publishing |
| JSON | Programmatic access |
Troubleshooting
Can't Create Postmortem
- Verify incident is resolved
- Check you have permission (incident participant or team member)
- Verify postmortem feature is enabled
Can't Publish
- Check all required sections are completed
- Verify all reviewers have approved (if reviews required)
- Check you have publish permission
Action Items Not Syncing
- Verify integration is connected
- Check issue tracker permissions
- Review sync logs in integration settings
Related Topics
- Incidents — Incident lifecycle
- Analytics — Performance metrics
- Teams — Team management
- Status Page — Public communication
Last updated for v1