Post-Mortem Culture
Build blameless post-mortem practices, conduct effective root cause analysis, drive continuous improvement, and create a learning organization.
Prerequisites
- Experience with incident response
- Understanding of team dynamics
- Basic facilitation skills
- Access to incident data
Learning Objectives
- Understand blameless post-mortem principles
- Facilitate effective post-mortem meetings
- Conduct root cause analysis using proven techniques
- Create actionable improvement items
- Build a culture of continuous learning
Step-by-Step Guide
1. Blameless Post-Mortem Principles
Learn the foundation of blameless culture.
# Blameless Post-Mortem Principles
cat > /policies/blameless-culture.md << 'EOF'
# Blameless Post-Mortem Policy
## Core Principles
### 1. Focus on Systems, Not People
**Instead of:** "Who made the mistake?"
**Ask:** "What allowed this mistake to happen?"
**Instead of:** "Why did John deploy at 2 AM?"
**Ask:** "Why does our system allow deployments at any time?"
### 2. Assume Good Intent
Everyone involved was trying to do their best with the information
they had at the time. Focus on understanding their decision-making
process, not judging it.
### 3. Psychological Safety
Create an environment where people feel safe to:
- Admit mistakes without fear of punishment
- Share what they were thinking
- Discuss what they would do differently
- Ask questions without appearing ignorant
### 4. Learn, Don't Punish
The goal is learning and improvement, not finding someone to blame.
If people fear punishment, they will hide mistakes instead of learning from them.
### 5. Share Knowledge
Post-mortems should be:
- Visible to the entire organization
- Searchable for future reference
- Used to prevent similar incidents
## Language Guidelines
### Avoid:
- "John forgot to..."
- "Sarah made a mistake..."
- "The intern didn't know..."
- "Why didn't they..."
### Use Instead:
- "The process didn't catch..."
- "The system allowed..."
- "The documentation didn't specify..."
- "Our training didn't cover..."
## Example Transformations
| Blameful | Blameless |
|----------|-----------|
| "John deployed the wrong version" | "The deployment process didn't validate the version" |
| "Sarah didn't check the logs" | "The monitoring didn't alert on this condition" |
| "Mike missed the alert" | "The alert was not prominent enough" |
| "The team was tired" | "The on-call schedule needs adjustment" |
| "They didn't follow the runbook" | "The runbook wasn't clear for this scenario" |
## Facilitator Responsibilities
1. **Set the Tone:** Start by reminding everyone of blameless principles
2. **Redirect Blame:** Gently redirect when blame language appears
3. **Protect Participants:** Shield people from external blame
4. **Focus on Systems:** Keep the conversation on process and system issues
5. **Document Fairly:** Ensure the post-mortem reflects blameless principles
## Leadership Responsibilities
1. **Model Behavior:** Leaders must demonstrate blameless behavior
2. **Enforce Policy:** Protect the blameless culture from violations
3. **Reward Transparency:** Recognize people who share mistakes openly
4. **Invest in Improvements:** Fund the action items from post-mortems
5. **Share Post-Mortems:** Make them visible organization-wide
## When Blameless Doesn't Work
Blameless culture doesn't mean no accountability. Individual accountability
is still important for:
- **Repeated Negligence:** Same person, same mistake, multiple times
- **Intentional Misconduct:** Deliberate harmful actions
- **Policy Violations:** Knowingly violating security policies
Handle these cases separately from the post-mortem process.
EOF
# Pre-Mortem Exercise (Preventive)
cat > /templates/pre-mortem.md << 'EOF'
# Pre-Mortem: [Project/Change Name]
## Exercise Instructions
Imagine it's 6 months from now and this change has caused a major incident.
Write the story of what went wrong.
## Scenario Writing
**Date:** 6 months from now
**Severity:** SEV1
**Impact:** [Describe the disaster]
**What Happened:**
[Write the incident story in detail]
**Root Causes:**
1. [Cause 1]
2. [Cause 2]
3. [Cause 3]
**What Should We Have Done Differently:**
1. [Prevention 1]
2. [Prevention 2]
3. [Prevention 3]
## Action Items from Pre-Mortem
| Action | Owner | Due Date |
|--------|-------|----------|
| | | |
EOF
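The Language Guidelines in the policy above can be partly checked by a script before a draft is published. The following is a minimal sketch, not a definitive tool: the pattern list and the function name `find_blameful_phrases` are assumptions made for illustration, loosely adapted from the "Avoid" list, and a real team would need to tune them.

```python
import re

# Phrases adapted from the "Avoid" list in the policy above. The pattern set
# is illustrative (an assumption) and will need tuning for a real team.
BLAMEFUL_PATTERNS = [
    r"\bforgot to\b",
    r"\bmade a mistake\b",
    r"\bdidn't know\b",
    r"\bwhy didn't (?:he|she|they|you)\b",
    r"\bdidn't follow\b",
]


def find_blameful_phrases(text):
    """Return (line_number, matched_phrase) pairs for blame-style wording."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for pattern in BLAMEFUL_PATTERNS:
            match = re.search(pattern, line, flags=re.IGNORECASE)
            if match:
                findings.append((line_number, match.group(0)))
    return findings


# Hypothetical draft text for illustration only.
draft = """John forgot to run the migration.
The deployment process didn't validate the version.
Why didn't they check the dashboard?"""

for line_number, phrase in find_blameful_phrases(draft):
    print(f"line {line_number}: consider rephrasing '{phrase}'")
```

A check like this only flags wording for a human to review. It cannot judge intent, so it supports the facilitator's "Redirect Blame" responsibility and does not replace it.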
2. Root Cause Analysis Techniques
Master proven RCA methodologies.
# Root Cause Analysis Techniques
cat > /guides/root-cause-analysis.md << 'EOF'
# Root Cause Analysis Techniques
## 1. The 5 Whys
Ask "why" repeatedly (usually 5 times) to drill down to root cause.
**Example:**
1. Why did the service go down?
   - The database ran out of connections
2. Why did it run out of connections?
   - A connection leak in the new code
3. Why was there a connection leak?
   - Connections weren't closed in error paths
4. Why weren't they closed in error paths?
   - The error handling code didn't include cleanup
5. Why didn't it include cleanup?
   - **Root Cause:** Code review checklist doesn't include connection cleanup verification
## 2. Fishbone Diagram (Ishikawa)
Categorize potential causes:
- **People:** Training, skills, communication
- **Process:** Procedures, workflows, approvals
- **Technology:** Tools, systems, infrastructure
- **Environment:** Timing, context, external factors
- **Measurement:** Monitoring, metrics, alerts
- **Materials:** Documentation, code, configurations
## 3. Change Analysis
Identify what changed before the incident:
- Code deployments
- Configuration changes
- Infrastructure updates
- Process changes
- External dependencies
## 4. Timeline Analysis
Create detailed timeline:
- When did it start?
- What happened before?
- What happened during?
- What happened after?
- When was it detected?
- When was it resolved?
## 5. Contributing Factors
Identify all factors that contributed:
- Primary cause (the trigger)
- Secondary causes (enablers)
- Mitigating factors (what reduced impact)
- Aggravating factors (what increased impact)
## 6. Causal Factor Tree
Build a tree of causes:
```
Incident
├── Direct Cause
│   ├── Contributing Factor 1
│   │   └── Root Cause 1
│   └── Contributing Factor 2
│       └── Root Cause 2
└── Indirect Cause
    └── Root Cause 3
```
## 7. Barrier Analysis
Identify what barriers should have prevented the incident:
- Technical barriers (automated tests, validation)
- Process barriers (code review, approval)
- Physical barriers (access control, segmentation)
- Administrative barriers (training, policies)
For each failed barrier, ask:
- Why did it fail?
- How can we strengthen it?
- Should we add more barriers?
EOF
# Automated Timeline Builder
cat > /scripts/build-timeline.sh << 'EOF'
#!/bin/bash
# Build incident timeline from various sources
set -euo pipefail
INCIDENT_ID="${1:?Usage: $0 <incident-id>}"
OUTPUT_FILE="/post-mortems/${INCIDENT_ID}-timeline.md"
# Endpoints and JSON fields are illustrative; adapt them (and add auth) for your tools
echo "# Incident Timeline: $INCIDENT_ID" > "$OUTPUT_FILE"
echo "" >> "$OUTPUT_FILE"
# Get incident metadata (fetched once, reused for the time window below)
INCIDENT_JSON=$(curl -s "https://api.incident.io/v2/incidents/$INCIDENT_ID")
INCIDENT_START=$(echo "$INCIDENT_JSON" | jq -r '.started_at')
INCIDENT_END=$(echo "$INCIDENT_JSON" | jq -r '.resolved_at')
echo "## Incident Details" >> "$OUTPUT_FILE"
echo "$INCIDENT_JSON" | \
  jq -r '"**Started:** \(.started_at)\n**Resolved:** \(.resolved_at)\n**Duration:** \(.duration)\n**Severity:** \(.severity)"' >> "$OUTPUT_FILE"
echo "" >> "$OUTPUT_FILE"
# Build timeline
echo "## Timeline" >> "$OUTPUT_FILE"
echo "" >> "$OUTPUT_FILE"
# Get timeline from incident management tool
curl -s "https://api.incident.io/v2/incidents/$INCIDENT_ID/timeline" | \
  jq -r '.[] | "- **\(.timestamp)**: \(.description) [\(.actor)]"' >> "$OUTPUT_FILE"
echo "" >> "$OUTPUT_FILE"
# Add deployment context (-G sends the URL-encoded values as query parameters)
echo "## Related Deployments" >> "$OUTPUT_FILE"
echo "" >> "$OUTPUT_FILE"
curl -sG "https://api.deploy.io/v1/deployments" \
  --data-urlencode "service=affected" \
  --data-urlencode "from=$INCIDENT_START" \
  --data-urlencode "to=$INCIDENT_END" | \
  jq -r '.[] | "- \(.timestamp): \(.user) deployed \(.commit) to \(.environment)"' >> "$OUTPUT_FILE"
echo "" >> "$OUTPUT_FILE"
# Add alert history (Prometheus range queries also require a step)
echo "## Alerts" >> "$OUTPUT_FILE"
echo "" >> "$OUTPUT_FILE"
curl -sG "https://api.prometheus.io/api/v1/query_range" \
  --data-urlencode "query=alertmanager_alerts" \
  --data-urlencode "start=$INCIDENT_START" \
  --data-urlencode "end=$INCIDENT_END" \
  --data-urlencode "step=60s" | \
  jq -r '.data.result[] | "- **\(.values[0][0]|todate)**: \(.metric.alertname) = \(.values[0][1])"' >> "$OUTPUT_FILE"
echo "Timeline built: $OUTPUT_FILE"
EOF
chmod +x /scripts/build-timeline.sh
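The Timeline Analysis questions above (when did it start, when was it detected, when was it resolved) can also be answered from data once events from several sources are merged into one ordered list. The sketch below assumes a simple event shape, a dict with `time`, `source`, and `what` keys, and uses made-up sample data. None of these names come from a particular tool.

```python
from datetime import datetime


def merge_timeline(*sources):
    """Merge event lists from several sources into one list sorted by time."""
    events = [event for source in sources for event in source]
    return sorted(events, key=lambda event: event["time"])


def minutes_between(start, end):
    """Whole minutes elapsed between two datetimes."""
    return int((end - start).total_seconds() // 60)


def ts(value):
    """Parse an ISO 8601 timestamp such as 2024-01-01T02:00:00."""
    return datetime.fromisoformat(value)


# Hypothetical sample data for illustration only.
deploys = [{"time": ts("2024-01-01T02:00:00"), "source": "deploy", "what": "v2 rolled out"}]
alerts = [{"time": ts("2024-01-01T02:25:00"), "source": "alert", "what": "connection pool exhausted"}]
chat = [
    {"time": ts("2024-01-01T02:12:00"), "source": "chat", "what": "customer reports errors"},
    {"time": ts("2024-01-01T03:05:00"), "source": "chat", "what": "rollback complete"},
]

timeline = merge_timeline(deploys, alerts, chat)
for event in timeline:
    print(f"- {event['time']:%H:%M} [{event['source']}] {event['what']}")

# Here the deployment is treated as the start and the first alert as detection.
started, detected, resolved = deploys[0]["time"], alerts[0]["time"], chat[-1]["time"]
print("time to detect (min):", minutes_between(started, detected))
print("time to resolve (min):", minutes_between(started, resolved))
```

Which event counts as "start" or "detection" is a judgment the post-mortem group makes together. The code only does the arithmetic once those points are agreed.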
3. Facilitating Post-Mortem Meetings
Learn to run effective post-mortem sessions.
# Post-Mortem Meeting Guide
cat > /guides/facilitating-post-mortems.md << 'EOF'
# Facilitating Post-Mortem Meetings
## Pre-Meeting Preparation (1-2 days before)
### 1. Gather Participants
- Incident Commander (facilitator)
- Technical Lead
- People who responded
- People who were impacted
- Subject matter experts
- Scribe (to document)
**Invite Template:**
```
Subject: Post-Mortem: [Incident Title] - [Date]
Hi team,
We're scheduling a post-mortem for the [SEV1/SEV2] incident on [date].
When: [Date] at [Time] ([Duration])
Where: [Meeting link]
Prep: Please review the incident timeline: [link]
This is a blameless learning session. Our goal is to understand
what happened and how to prevent similar incidents.
Please come prepared to share:
- What you observed
- What decisions you made and why
- What you learned
- What could be improved
Looking forward to a productive session.
```
### 2. Prepare Materials
- Incident timeline
- Relevant logs and metrics
- Runbooks consulted
- Chat transcripts
- Previous related post-mortems
### 3. Set Up Environment
- Video conference link
- Shared document for notes
- Recording (with consent)
- Timer for agenda items
## Meeting Agenda (about 95 minutes)
### Welcome & Ground Rules (5 min)
- Welcome everyone
- Remind everyone of the blameless principles
- Explain the agenda
- Confirm everyone is comfortable proceeding
### Incident Overview (10 min)
- Read executive summary
- Review impact
- Confirm timeline accuracy
### What Happened (20 min)
- Walk through timeline
- Each person shares their perspective
- Ask clarifying questions
- Focus on understanding, not judging
### Root Cause Analysis (20 min)
- Use 5 Whys or other technique
- Identify all contributing factors
- Distinguish root causes from symptoms
- Document causal chain
### What Went Well (10 min)
- Celebrate successes
- Identify good practices
- Note what to keep doing
### What Could Improve (15 min)
- Identify improvement areas
- Brainstorm solutions
- Consider trade-offs
### Action Items (10 min)
- Create specific, actionable items
- Assign owners
- Set due dates
- Prioritize (P0, P1, P2)
### Closing (5 min)
- Summarize key learnings
- Thank participants
- Share next steps
- Confirm follow-up
## Facilitation Tips
### Do:
- Keep the conversation on track
- Encourage quiet participants
- Redirect blameful language
- Ask open-ended questions
- Summarize frequently
- Watch the time
### Don't:
- Don't dominate the conversation
- Don't jump to solutions
- Don't allow blame or shame
- Don't skip the "what went well"
- Don't end without action items
## Handling Difficult Situations
### Someone Starts Blaming
**Response:** "Let's focus on what in the system allowed this to happen, rather than who made the mistake."
### Someone Gets Defensive
**Response:** "We're all here to learn. What was your thinking at the time?"
### Conversation Goes Off-Track
**Response:** "That's an important topic. Let's park it and focus on today's incident."
### Someone Wants to Skip Action Items
**Response:** "The value of post-mortems is in the improvements. Let's make sure we capture action items."
## Post-Meeting Follow-Up
### Within 24 Hours
- Send meeting notes
- Share draft post-mortem
- Collect feedback
- Schedule review
### Within 1 Week
- Publish final post-mortem
- Create tracking for action items
- Share with broader organization
- Schedule follow-up check
### Ongoing
- Track action item completion
- Reference in relevant discussions
- Celebrate improvements
- Measure impact
EOF
# Post-Mortem Meeting Script
cat > /templates/post-mortem-script.md << 'EOF'
# Post-Mortem Facilitator Script
## Opening (5 min)
"Thanks everyone for being here. Let's start by reminding ourselves
of our ground rules:
1. This is blameless - we focus on systems, not people
2. Assume good intent - everyone was doing their best
3. Be honest - we can't learn if we hide things
4. One conversation - let people finish their thoughts
5. Focus on learning - our goal is improvement
Our agenda today:
- Quick incident overview (10 min)
- What happened (20 min)
- Root cause analysis (20 min)
- What went well (10 min)
- What could improve (15 min)
- Action items (10 min)
Does everyone agree to these ground rules? Any questions?"
## Incident Overview (10 min)
"Let me walk through what we know happened..."
[Read executive summary]
"Does this match everyone's understanding? Any corrections?"
## What Happened (20 min)
"Let's go through the timeline. Starting with [first responder],
can you walk us through what you experienced?"
[For each person]
"Thanks [name]. [Next person], what was your perspective?"
"Let me make sure I understand. [Paraphrase]. Is that right?"
## Root Cause Analysis (20 min)
"Now let's dig into why this happened. Let's use the 5 Whys technique."
"Why did the incident occur?"
"[Answer]"
"Why did that happen?"
"[Answer]"
...
"Let me summarize the root causes we've identified..."
## What Went Well (10 min)
"Before we talk about improvements, let's acknowledge what went well.
What should we keep doing?"
[Capture responses]
"These are great. Let's make sure we document these as best practices."
## What Could Improve (15 min)
"Now, what could we do better next time? Think about:
- Detection
- Response
- Communication
- Prevention"
[Capture ideas]
"Let's group these into themes..."
## Action Items (10 min)
"Based on our discussion, here are the action items I'm hearing:
1. [Action] - Who can own this? What's a realistic due date?
2. [Action] - Owner? Due date?
..."
"Let's prioritize these. Which are P0 (must do this sprint)?"
## Closing (5 min)
"To summarize, our key learnings are:
1. [Learning 1]
2. [Learning 2]
3. [Learning 3]
Action items will be tracked in [system]. I'll send the draft
post-mortem by [date] for review.
Thank you all for your honesty and participation. This is how we
get better together."
EOF
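A facilitator who wants to "watch the time" can turn the agenda into a running schedule. The sketch below takes the section names and durations from the meeting guide above. The helper names `build_schedule` and `total_minutes` are assumptions made for illustration. Summing the listed durations gives 95 minutes, which is worth knowing before booking the room.

```python
# Agenda items and durations (minutes) as listed in the meeting guide above.
AGENDA = [
    ("Welcome & Ground Rules", 5),
    ("Incident Overview", 10),
    ("What Happened", 20),
    ("Root Cause Analysis", 20),
    ("What Went Well", 10),
    ("What Could Improve", 15),
    ("Action Items", 10),
    ("Closing", 5),
]


def build_schedule(agenda):
    """Return (start_offset_minutes, title, duration) for each agenda item."""
    schedule, offset = [], 0
    for title, minutes in agenda:
        schedule.append((offset, title, minutes))
        offset += minutes
    return schedule


def total_minutes(agenda):
    """Total planned meeting length in minutes."""
    return sum(minutes for _, minutes in agenda)


for offset, title, minutes in build_schedule(AGENDA):
    print(f"+{offset:02d} min  {title} ({minutes} min)")
print("Total:", total_minutes(AGENDA), "minutes")
```

If the meeting has a hard stop, shortening an item in `AGENDA` and rerunning the script shows the new offsets. It is usually better to trim discussion time than to drop "What Went Well" or "Action Items".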
4. Action Item Management
Ensure post-mortem improvements are implemented.
# Action Item Tracking System
cat > /scripts/action-item-tracker.py << 'EOF'
#!/usr/bin/env python3
"""Track and manage post-mortem action items"""
from datetime import datetime, timezone

import boto3
from boto3.dynamodb.conditions import Attr


class ActionItemTracker:
    def __init__(self):
        self.dynamodb = boto3.resource('dynamodb')
        self.table = self.dynamodb.Table('post_mortem_actions')
        self.sns = boto3.client('sns')

    def _scan_all(self, **kwargs):
        """Scan the whole table; a single scan() call returns at most 1 MB"""
        items = []
        while True:
            response = self.table.scan(**kwargs)
            items.extend(response['Items'])
            if 'LastEvaluatedKey' not in response:
                return items
            kwargs['ExclusiveStartKey'] = response['LastEvaluatedKey']

    def create_action_item(self, post_mortem_id, action, owner, priority, due_date):
        """Create a new action item (due_date as an ISO string, e.g. 2025-01-31)"""
        now = datetime.now(timezone.utc)
        item = {
            'id': f"{post_mortem_id}-{now.strftime('%Y%m%d%H%M%S')}",
            'post_mortem_id': post_mortem_id,
            'action': action,
            'owner': owner,
            'priority': priority,
            'due_date': due_date,
            'status': 'open',
            'created_at': now.isoformat(),
            'completed_at': None,
            'updates': []
        }
        self.table.put_item(Item=item)
        # Notify owner
        self._notify_owner(item)
        return item['id']

    def _notify_owner(self, item):
        """Notify action item owner"""
        message = (
            "New action item assigned to you:\n"
            f"Action: {item['action']}\n"
            f"Priority: {item['priority']}\n"
            f"Due: {item['due_date']}\n"
            f"Post-Mortem: {item['post_mortem_id']}\n"
            "Please update status regularly.\n"
        )
        # Send `message` via Slack or email
        # Implementation depends on your notification system

    def get_overdue_items(self):
        """Get all overdue action items"""
        today = datetime.now(timezone.utc).date().isoformat()
        # Attr() is used because 'status' is a DynamoDB reserved word and
        # cannot appear directly in a string filter expression
        return self._scan_all(
            FilterExpression=Attr('status').eq('open') & Attr('due_date').lt(today)
        )

    def get_completion_rate(self):
        """Calculate action item completion rate"""
        items = self._scan_all()
        if not items:
            return 0
        completed = sum(1 for item in items if item['status'] == 'completed')
        return completed / len(items) * 100

    def generate_report(self):
        """Generate action item report"""
        items = self._scan_all()
        report = {
            'generated_at': datetime.now(timezone.utc).isoformat(),
            'total_items': len(items),
            'open': sum(1 for i in items if i['status'] == 'open'),
            'in_progress': sum(1 for i in items if i['status'] == 'in_progress'),
            'completed': sum(1 for i in items if i['status'] == 'completed'),
            'overdue': len(self.get_overdue_items()),
            'completion_rate': self.get_completion_rate(),
            'by_priority': {
                'P0': sum(1 for i in items if i['priority'] == 'P0'),
                'P1': sum(1 for i in items if i['priority'] == 'P1'),
                'P2': sum(1 for i in items if i['priority'] == 'P2')
            },
            'by_owner': {}
        }
        # Group by owner
        for item in items:
            owner = item['owner']
            if owner not in report['by_owner']:
                report['by_owner'][owner] = {'total': 0, 'completed': 0}
            report['by_owner'][owner]['total'] += 1
            if item['status'] == 'completed':
                report['by_owner'][owner]['completed'] += 1
        return report

    def send_weekly_digest(self):
        """Send weekly action item digest"""
        report = self.generate_report()
        overdue = self.get_overdue_items()
        message = (
            "Weekly Post-Mortem Action Items Digest\n"
            f"Completion Rate: {report['completion_rate']:.1f}%\n"
            f"Total Items: {report['total_items']}\n"
            f"Open: {report['open']}\n"
            f"Overdue: {report['overdue']}\n"
            "Overdue Items:\n"
        )
        for item in overdue[:10]:  # First 10 overdue items
            message += f"- [{item['priority']}] {item['action']} (Owner: {item['owner']}, Due: {item['due_date']})\n"
        # Send to leadership channel (placeholder ARN; replace with your topic)
        self.sns.publish(
            TopicArn='arn:aws:sns:us-east-1:123456789:post-mortem-digest',
            Message=message,
            Subject='Weekly Post-Mortem Action Items Digest'
        )


# Scheduled task (run weekly)
if __name__ == '__main__':
    tracker = ActionItemTracker()
    tracker.send_weekly_digest()
EOF
chmod +x /scripts/action-item-tracker.py
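The overdue logic can be tried without any AWS resources. The following is a minimal in-memory sketch under stated assumptions: the field names mirror the tracker script above, while the sample items, the `PRIORITY_ORDER` mapping, and the `overdue_items` helper are hypothetical. Unlike the tracker's filter, this version also counts `in_progress` items as overdue, which is one possible policy and not the only one.

```python
from datetime import date

# Lower number means more urgent; priorities follow the P0/P1/P2 scheme above.
PRIORITY_ORDER = {"P0": 0, "P1": 1, "P2": 2}


def overdue_items(items, today):
    """Return unfinished items past their due date, most urgent first."""
    overdue = [
        {**item, "days_overdue": (today - date.fromisoformat(item["due_date"])).days}
        for item in items
        if item["status"] != "completed"
        and date.fromisoformat(item["due_date"]) < today
    ]
    # Sort by priority first, then by how long the item has been overdue.
    return sorted(
        overdue,
        key=lambda item: (PRIORITY_ORDER[item["priority"]], -item["days_overdue"]),
    )


# Hypothetical sample items for illustration only.
items = [
    {"action": "Add connection cleanup check to review checklist", "owner": "team-db",
     "priority": "P1", "due_date": "2024-03-01", "status": "open"},
    {"action": "Alert on connection pool saturation", "owner": "team-sre",
     "priority": "P0", "due_date": "2024-03-10", "status": "in_progress"},
    {"action": "Update rollback runbook", "owner": "team-sre",
     "priority": "P2", "due_date": "2024-02-20", "status": "completed"},
    {"action": "Load-test the new pool settings", "owner": "team-db",
     "priority": "P1", "due_date": "2024-04-01", "status": "open"},
]

for item in overdue_items(items, today=date(2024, 3, 15)):
    print(f"[{item['priority']}] {item['action']} ({item['days_overdue']} days overdue)")
```

Passing `today` as a parameter, instead of reading the clock inside the function, keeps the logic easy to test and makes the digest reproducible for a given date.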
5. Building a Learning Organization
Create a culture of continuous improvement.
# Learning Organization Framework
cat > /policies/learning-organization.md << 'EOF'
# Building a Learning Organization
## Key Practices
### 1. Regular Post-Mortems
- **Every SEV1/SEV2:** Full post-mortem within 48 hours
- **SEV3:** Lightweight post-mortem within 1 week
- **Near Misses:** Encourage reporting even when no incident occurred
### 2. Knowledge Sharing
- **Post-Mortem Library:** Searchable, tagged, accessible
- **Weekly Learning Sessions:** Share post-mortem insights
- **Brown Bag Lunches:** Deep dive into specific topics
- **Documentation:** Update runbooks and procedures
### 3. Metrics & Measurement
Track:
- Post-mortem completion rate (target: 100%)
- Action item completion rate (target: >90%)
- Time to post-mortem (target: <48 hours for SEV1)
- Recurring incident rate (target: 0%)
- Mean time to recovery (MTTR) trend
### 4. Recognition & Rewards
- **Learning Awards:** Recognize teams that share learnings
- **Improvement Awards:** Celebrate action item completions
- **Transparency Awards:** Reward honest post-mortems
- **Mentorship:** Pair experienced with new engineers
### 5. Continuous Improvement
- **Quarterly Reviews:** Review post-mortem process
- **Annual Retrospective:** What's working, what's not
- **Benchmarking:** Compare with industry standards
- **Training:** Regular facilitation training
## Post-Mortem Quality Checklist
### Content
- [ ] Executive summary is clear and concise
- [ ] Timeline is accurate and complete
- [ ] Root causes are well-analyzed
- [ ] Contributing factors are identified
- [ ] What went well is documented
- [ ] Action items are specific and actionable
- [ ] Action items have owners and due dates
- [ ] Action items are prioritized
### Process
- [ ] Completed within 48 hours (SEV1/SEV2)
- [ ] All responders participated
- [ ] Blameless language used throughout
- [ ] Facilitator followed guidelines
- [ ] Scribe captured key points
- [ ] Review by leadership completed
- [ ] Published to organization
- [ ] Added to post-mortem library
### Follow-up
- [ ] Action items tracked in system
- [ ] Regular status updates
- [ ] Completion verified
- [ ] Impact measured
- [ ] Related documentation updated
## Learning Metrics Dashboard
```yaml
metrics:
  post_mortem_completion:
    definition: "Percentage of incidents with completed post-mortems"
    target: 100
    current: 95
  action_item_completion:
    definition: "Percentage of action items completed on time"
    target: 90
    current: 85
  recurring_incidents:
    definition: "Percentage of incidents with same root cause"
    target: 0
    current: 5
  mttr_trend:
    definition: "Mean time to recovery trend"
    target: decreasing
    current: stable
  learning_sessions:
    definition: "Number of learning sessions per month"
    target: 4
    current: 2
  documentation_updates:
    definition: "Documentation updates from post-mortems"
    target: 10
    current: 6
```
## Success Stories
### Example 1: Reduced Deployment Incidents
**Problem:** 5 deployment-related incidents in Q1
**Post-Mortem Finding:** No automated testing for deployments
**Action:** Implemented deployment pipeline with tests
**Result:** 0 deployment incidents in Q2-Q4
### Example 2: Faster Incident Detection
**Problem:** Average detection time of 45 minutes
**Post-Mortem Finding:** Monitoring gaps in key services
**Action:** Enhanced monitoring and alerting
**Result:** Average detection time of 5 minutes
### Example 3: Improved Runbook Quality
**Problem:** Runbooks incomplete or outdated
**Post-Mortem Finding:** No process for runbook maintenance
**Action:** Runbook review as part of incident follow-up
**Result:** 100% runbook coverage for common incidents
EOF
# Post-Mortem Library
cat > /scripts/build-post-mortem-library.sh << 'EOF'
#!/bin/bash
# Build searchable post-mortem library
OUTPUT_DIR="/docs/post-mortems"
mkdir -p "$OUTPUT_DIR"
# Generate index
cat > "$OUTPUT_DIR/INDEX.md" << 'INDEX'
# Post-Mortem Library
This library contains all post-mortems for organizational learning.
## Search by Tag
- [All Post-Mortems](#all)
- [Database](#database)
- [Deployment](#deployment)
- [Network](#network)
- [Security](#security)
- [Capacity](#capacity)
## By Year
- [2026](#2026)
- [2025](#2025)
## By Severity
- [SEV1](#sev1)
- [SEV2](#sev2)
- [SEV3](#sev3)
## Recently Added
INDEX
# Process all post-mortems
# Metadata lines may be written as "**Date:** value" or "**Date**: value"
for pm in /post-mortems/*.md; do
  [ -e "$pm" ] || continue  # glob matched nothing
  title=$(head -n 1 "$pm" | sed 's/^# //')
  date=$(sed -nE 's/^\*\*Date(:\*\*|\*\*:) *//p' "$pm" | head -n 1)
  severity=$(sed -nE 's/^\*\*Severity(:\*\*|\*\*:) *//p' "$pm" | head -n 1)
  tags=$(sed -nE 's/^\*\*Tags(:\*\*|\*\*:) *//p' "$pm" | head -n 1)
  echo "- [$title]($pm) - $date - $severity - $tags" >> "$OUTPUT_DIR/INDEX.md"
done
echo "Post-mortem library updated: $OUTPUT_DIR/INDEX.md"
EOF
chmod +x /scripts/build-post-mortem-library.sh
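The "recurring incident rate" metric above needs a concrete definition before it can be tracked. One possible reading, sketched below, counts an incident as recurring when its root cause already appeared in an earlier incident. The record shape, the `root_cause` tags, and the function name `recurring_incident_rate` are assumptions made for illustration, and the sample data is invented.

```python
def recurring_incident_rate(incidents):
    """Percentage of incidents whose root cause appeared in an earlier incident."""
    if not incidents:
        return 0.0
    seen, recurring = set(), 0
    # ISO date strings sort chronologically, so a plain sort is enough here.
    for incident in sorted(incidents, key=lambda i: i["date"]):
        if incident["root_cause"] in seen:
            recurring += 1
        seen.add(incident["root_cause"])
    return recurring / len(incidents) * 100


# Hypothetical incident records; "root_cause" is a normalized tag chosen
# during the post-mortem, not free text.
incidents = [
    {"id": "INC-1", "date": "2024-01-05", "root_cause": "missing-deploy-validation"},
    {"id": "INC-2", "date": "2024-02-11", "root_cause": "connection-leak"},
    {"id": "INC-3", "date": "2024-03-02", "root_cause": "missing-deploy-validation"},
    {"id": "INC-4", "date": "2024-03-20", "root_cause": "alert-gap"},
]

print(f"Recurring incident rate: {recurring_incident_rate(incidents):.1f}%")
```

The metric is only as good as the tagging. If root causes are recorded as free text, near-duplicates will not match, so a small shared vocabulary of tags in the post-mortem library helps.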
Best Practices
Learning Organization Principles:
- Blameless Always: Never punish for honest mistakes
- Share Widely: Make post-mortems visible to all
- Act on Learnings: Complete action items consistently
- Measure Improvement: Track metrics over time
- Celebrate Learning: Recognize teams that improve
- Continuous Process: Regularly review and improve the process
Assessment
1. What is the primary goal of a post-mortem?
2. How many "whys" does the 5 Whys technique typically use?
3. When should a post-mortem be completed for a SEV1 incident?
4. What should action items include?
Answer Key: 1-B