Incident Response
Work through an incident response lifecycle based on NIST guidance: detection, triage, containment, mitigation, recovery, and post-incident analysis. Learn to lead effective incident response operations.
Prerequisites
- Basic understanding of IT systems
- Familiarity with monitoring tools
- Knowledge of your organization's systems
- Access to incident management tools
Learning Objectives
- Understand the NIST incident response lifecycle
- Classify incidents by severity
- Lead incident response operations
- Implement containment and mitigation strategies
- Communicate effectively during incidents
- Conduct post-incident analysis
Step-by-Step Guide
1. Incident Detection & Classification
Learn to detect and classify incidents quickly.
# Incident Detection Sources
# 1. Automated Monitoring
# - Alerting systems (PagerDuty, Opsgenie)
# - SIEM alerts (Splunk, ELK)
# - Application monitoring (Datadog, New Relic)
# - Infrastructure monitoring (Prometheus, CloudWatch)
# 2. User Reports
# - Support tickets
# - Status page reports
# - Social media mentions
# - Customer emails
# 3. Security Alerts
# - IDS/IPS alerts
# - Threat intelligence feeds
# - Vulnerability scans
# - Anomaly detection
# Severity Classification
cat > /policies/incident-severity.md << 'EOF'
# Incident Severity Classification
## SEV1 - Critical
**Definition:** Complete service outage or data loss
**Response Time:** Immediate page
**Examples:**
- Service completely unavailable for >10% of users
- Data loss or corruption
- Security breach with data exposure
- Payment processing failure
## SEV2 - Major
**Definition:** Major degradation with significant impact
**Response Time:** <15 minutes
**Examples:**
- 10-50% error rate (anything higher is SEV1)
- Performance 5x slower than normal
- Core feature unavailable
- Affecting >10% of users
## SEV3 - Minor
**Definition:** Minor degradation with limited impact
**Response Time:** <1 hour
**Examples:**
- <10% error rate
- Non-core feature unavailable
- Cosmetic issues
- Affecting <10% of users
## SEV4 - Trivial
**Definition:** No user impact, internal issue
**Response Time:** Business hours
**Examples:**
- Internal tool issues
- Documentation errors
- Feature requests
- Single user issues
EOF
# Automated Incident Creation
cat > /scripts/create-incident.sh << 'EOF'
#!/bin/bash
# Automated incident creation from alerts
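# Fail fast with a usage message if an argument is missing
ALERT_SERVICE=${1:?Usage: create-incident.sh <service> <severity> <message>}
ALERT_SEVERITY=${2:?missing alert severity}
ALERT_MESSAGE=${3:?missing alert message}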
# Map alert severity to incident severity
case $ALERT_SEVERITY in
critical) INCIDENT_SEV="sev1" ;;
error) INCIDENT_SEV="sev2" ;;
warning) INCIDENT_SEV="sev3" ;;
*) INCIDENT_SEV="sev4" ;;
esac
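# Create incident
# Build the JSON with jq so quotes in the alert message cannot break the payload.
# Field names follow this example; check them against your incident tool's API schema.
PAYLOAD=$(jq -n --arg title "${ALERT_SERVICE}: ${ALERT_MESSAGE}" \
  --arg severity "$INCIDENT_SEV" --arg service "$ALERT_SERVICE" \
  '{title: $title, severity: $severity, service_id: $service, monitor_id: ($service + "-monitor")}')
INCIDENT_ID=$(curl -sf -X POST https://api.incident.io/v2/incidents \
  -H "Authorization: Bearer $INCIDENT_API_KEY" \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" | jq -r '.id')
# Stop if the API call failed or returned no ID
if [ -z "$INCIDENT_ID" ] || [ "$INCIDENT_ID" = "null" ]; then
  echo "Failed to create incident" >&2
  exit 1
fi
echo "Created incident: $INCIDENT_ID"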
# Notify Slack
# Pick the attachment color up front; the original inline substitution emitted
# extra quote characters and produced invalid JSON.
if [ "$INCIDENT_SEV" = "sev1" ]; then COLOR="danger"; else COLOR="warning"; fi
SLACK_PAYLOAD=$(jq -n --arg sev "$INCIDENT_SEV" --arg service "$ALERT_SERVICE" \
  --arg message "$ALERT_MESSAGE" --arg color "$COLOR" \
  '{
    text: "🚨 New \($sev) incident: \($service)",
    attachments: [{
      color: $color,
      fields: [
        {title: "Service", value: $service, short: true},
        {title: "Severity", value: $sev, short: true},
        {title: "Message", value: $message}
      ]
    }]
  }')
curl -sf -X POST "$SLACK_INCIDENT_WEBHOOK" \
  -H "Content-Type: application/json" \
  -d "$SLACK_PAYLOAD"
EOF
chmod +x /scripts/create-incident.sh
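The severity policy in this step can also be applied to raw metrics rather than to pre-labelled alerts. The sketch below is one possible mapping, not part of the policy itself: the function name `classify_severity` and the exact cut-offs (more than 50% errors for SEV1, 10% errors or more than 10% of users for SEV2) are assumptions chosen to round uncertain cases up to the higher severity.

```shell
#!/bin/bash
# classify_severity ERROR_RATE_PCT USERS_AFFECTED_PCT
# Arguments are whole-number percentages (0-100).
# The thresholds are illustrative assumptions layered on the severity policy.
classify_severity() {
  local error_rate="$1" users="$2"
  if [ "$error_rate" -gt 50 ]; then
    echo "sev1"   # most requests failing: treat as an outage
  elif [ "$error_rate" -ge 10 ] || [ "$users" -gt 10 ]; then
    echo "sev2"   # major degradation or more than 10% of users affected
  elif [ "$error_rate" -gt 0 ] || [ "$users" -gt 0 ]; then
    echo "sev3"   # limited but user-visible impact
  else
    echo "sev4"   # no user impact
  fi
}
```

For example, `classify_severity 35 20` prints `sev2`. Data loss and security breaches cannot be inferred from these two numbers and still need a human to declare SEV1.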
2. Incident Response Team Setup
Establish roles and responsibilities.
# Incident Response Team Structure
# Roles and Responsibilities
cat > /policies/incident-roles.md << 'EOF'
# Incident Response Roles
## Incident Commander (IC)
**Responsibilities:**
- Overall incident coordination
- Decision-making authority
- Resource allocation
- Communication management
- Escalation decisions
**Skills Required:**
- Calm under pressure
- Decision-making
- Communication
- System knowledge
## Technical Lead
**Responsibilities:**
- Technical diagnosis
- Mitigation strategy
- Implementation oversight
- Technical decisions
- Root cause analysis
**Skills Required:**
- Deep system knowledge
- Debugging expertise
- Architecture understanding
## Communications Lead
**Responsibilities:**
- Internal stakeholder updates
- Customer communications
- Status page updates
- Social media management
- Executive briefings
**Skills Required:**
- Clear writing
- Stakeholder management
- Empathy
## Scribe
**Responsibilities:**
- Timeline documentation
- Decision logging
- Communication capture
- Evidence collection
- Post-mortem data gathering
**Skills Required:**
- Attention to detail
- Organization
- Note-taking
## On-Call Engineer
**Responsibilities:**
- First responder
- Initial investigation
- Escalation
- Runbook execution
**Skills Required:**
- System familiarity
- Troubleshooting
- Runbook knowledge
EOF
# Escalation Matrix
cat > /policies/escalation-matrix.yaml << 'EOF'
escalation:
  sev1:
    initial:
      - on-call-engineer
      - incident-commander
    15min:
      - engineering-manager
    30min:
      - vp-engineering
    60min:
      - cto
  sev2:
    initial:
      - on-call-engineer
    30min:
      - incident-commander
    60min:
      - engineering-manager
  sev3:
    initial:
      - on-call-engineer
    2hours:
      - incident-commander
  sev4:
    initial:
      - on-call-engineer
    business_hours_only: true
contacts:
  on-call-engineer:
    pagerduty: engineering-oncall
    slack: "@oncall"
  incident-commander:
    pagerduty: incident-commander
    slack: "@ic"
  engineering-manager:
    slack: "@eng-manager"
    phone: "+1-xxx-xxx-xxxx"
  vp-engineering:
    slack: "@vp-eng"
    phone: "+1-xxx-xxx-xxxx"
  cto:
    slack: "@cto"
    phone: "+1-xxx-xxx-xxxx"
EOF
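The escalation matrix is easier to reason about when it can be queried: given a severity and the minutes elapsed since the incident was declared, who should have been paged by now? The following is a minimal sketch that hard-codes the same tiers as the matrix. The function name `escalation_targets` is hypothetical, and a real implementation would more likely read the YAML file than duplicate it.

```shell
#!/bin/bash
# escalation_targets SEVERITY MINUTES_ELAPSED
# Prints the space-separated roles that should be engaged by this point.
escalation_targets() {
  local sev="$1" minutes="$2"
  local targets="on-call-engineer"   # every severity starts with the on-call engineer
  case $sev in
    sev1)
      targets="$targets incident-commander"
      if [ "$minutes" -ge 15 ]; then targets="$targets engineering-manager"; fi
      if [ "$minutes" -ge 30 ]; then targets="$targets vp-engineering"; fi
      if [ "$minutes" -ge 60 ]; then targets="$targets cto"; fi
      ;;
    sev2)
      if [ "$minutes" -ge 30 ]; then targets="$targets incident-commander"; fi
      if [ "$minutes" -ge 60 ]; then targets="$targets engineering-manager"; fi
      ;;
    sev3)
      if [ "$minutes" -ge 120 ]; then targets="$targets incident-commander"; fi
      ;;
    sev4)
      ;;   # on-call engineer only, during business hours
    *)
      echo "Unknown severity: $sev" >&2
      return 1
      ;;
  esac
  echo "$targets"
}
```

For example, `escalation_targets sev1 45` prints `on-call-engineer incident-commander engineering-manager vp-engineering`.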
3. Containment Strategies
Implement effective containment measures.
# Containment Strategies
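# 1. Service Isolation
# The label alone changes nothing: it isolates the pod only if a NetworkPolicy
# or a Service selector keys on incident=isolated.
kubectl label pod affected-pod incident=isolated --overwrite
# Stop new pods from being scheduled onto the affected node
kubectl cordon node-affected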
# 2. Traffic Diversion
# Enable circuit breaker
kubectl set env deployment/api-service CIRCUIT_BREAKER=enabled
# Route traffic away (custom resources such as VirtualService need --type merge)
kubectl patch virtualservice api-vs --type merge -p '{
"spec": {
"http": [{
"route": [{
"destination": {"host": "api", "subset": "stable"},
"weight": 100
}]
}]
}
}'
# 3. Rate Limiting
# Emergency rate limiting
kubectl patch configmap ratelimit-config -p '{
"data": {
"requests_per_second": "100",
"burst_size": "50"
}
}'
# 4. Feature Flags
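# Disable problematic feature
# (endpoint and payload are placeholders; use your flag provider's actual API)
curl -X PATCH https://api.flagsmith.com/api/v1/features/disable-feature/ \
-H "Authorization: Bearer $FLAGS_API_KEY" \
-H "Content-Type: application/json" \
-d '{"feature_id": 123, "enabled": false}'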
# 5. Database Protection
# Read-only mode for database
mysql -e "SET GLOBAL super_read_only = ON;"
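# Block suspicious network traffic (iptables filters packets, not SQL queries)
# Written to a separate file and loaded with --noflush so existing rules are kept.
cat > /etc/iptables/incident-block.v4 << 'EOF'
*filter
# Block suspicious traffic
-A INPUT -s 192.168.1.100 -j DROP
# Allow MySQL connections only from the internal 10.0.0.0/8 range
-A INPUT -p tcp --dport 3306 ! -s 10.0.0.0/8 -j DROP
COMMIT
EOF
iptables-restore --noflush < /etc/iptables/incident-block.v4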
# 6. Automated Containment Script
cat > /scripts/emergency-containment.sh << 'EOF'
#!/bin/bash
set -e
SERVICE=${1:?Usage: emergency-containment.sh <service> [action]}
ACTION=${2:-circuit-breaker}
echo "Initiating emergency containment for $SERVICE..."
case $ACTION in
circuit-breaker)
echo "Enabling circuit breaker..."
kubectl set env deployment/$SERVICE CIRCUIT_BREAKER=enabled
;;
scale-down)
echo "Scaling down to reduce impact..."
kubectl scale deployment/$SERVICE --replicas=1
;;
rollback)
echo "Rolling back to previous version..."
kubectl rollout undo deployment/$SERVICE
;;
maintenance)
echo "Enabling maintenance mode..."
kubectl set env deployment/$SERVICE MAINTENANCE_MODE=true
;;
isolate)
echo "Isolating affected pods..."
kubectl label pods -l app=$SERVICE incident=isolated --overwrite
;;
*)
echo "Unknown action: $ACTION"
exit 1
;;
esac
# Notify team
curl -sf -X POST "$SLACK_WEBHOOK" \
-H "Content-Type: application/json" \
-d "{\"text\": \"🚨 Emergency containment initiated for $SERVICE ($ACTION)\"}"
echo "Containment complete"
EOF
chmod +x /scripts/emergency-containment.sh
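Containment commands are disruptive, so it can help to rehearse them before an incident. One common approach is a dry-run wrapper that prints each command instead of executing it. The sketch below assumes hypothetical helpers named `run` and `contain` and covers only three of the containment actions from this step; it defaults to dry-run mode so that nothing touches the cluster unless `DRY_RUN=0` is set explicitly.

```shell
#!/bin/bash
DRY_RUN=${DRY_RUN:-1}   # default to rehearsal mode; set DRY_RUN=0 to execute

# run CMD [ARGS...]: print the command in dry-run mode, otherwise execute it
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "[dry-run] $*"
  else
    "$@"
  fi
}

# contain SERVICE ACTION: dispatch one containment action through run()
contain() {
  local service="$1" action="$2"
  case $action in
    rollback)   run kubectl rollout undo "deployment/$service" ;;
    scale-down) run kubectl scale "deployment/$service" --replicas=1 ;;
    isolate)    run kubectl label pods -l "app=$service" incident=isolated --overwrite ;;
    *)          echo "Unknown action: $action" >&2; return 1 ;;
  esac
}
```

With the default setting, `contain api-service rollback` prints `[dry-run] kubectl rollout undo deployment/api-service` and changes nothing.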
4. Communication During Incidents
Communicate effectively with all stakeholders.
# Incident Communication Templates
cat > /templates/incident-communications.md << 'EOF'
# Incident Communication Templates
## Initial Notification (0-5 min)
**Channel:** Slack #incidents, #exec-alerts
**Template:**
```
🚨 INCIDENT DECLARED: [SEV1/SEV2/SEV3] - [Brief Description]
Service: [service-name]
Severity: [SEV1/SEV2/SEV3]
Started: [timestamp]
Impact: [Brief impact description]
Incident Channel: #incident-[id]
Incident Commander: [@name]
Status: Investigating
Next update in 15 minutes.
```
## Status Page Update (Every 15-30 min)
**Channel:** Status page
**Template:**
```
[Investigating/Identified/Monitoring/Resolved] - [Service Name]
We are [investigating/working on] an issue affecting [service].
[Optional: Brief technical details for technical audience]
[Optional: Estimated resolution time]
Last updated: [timestamp]
```
## Executive Briefing (SEV1 only, Every 30 min)
**Channel:** Slack #exec-alerts, Email to exec team
**Template:**
```
INCIDENT UPDATE - [SEV1] [Service Name]
Time: [timestamp]
Duration: [X minutes]
Current Status: [Investigating/Identified/Mitigating/Resolved]
Impact:
- [X]% of users affected
- [Description of impact]
- [Revenue impact if known]
Actions Taken:
- [Action 1]
- [Action 2]
Next Steps:
- [Next action]
- [ETA for resolution]
Incident Commander: [@name]
```
## Customer Communication (If significant impact)
**Channel:** Email, in-app notification
**Template:**
```
Subject: Service Update: [Issue Description]
Dear [Customer],
We're experiencing an issue affecting [service] that may impact
your ability to [action].
What we know:
- [Known information]
- [Impact description]
What we're doing:
- [Actions being taken]
- [ETA if available]
We'll update you within [timeframe] with more information.
Apologies for the inconvenience.
The [Company] Team
```
## Resolution Notification
**Channel:** All channels used during incident
**Template:**
```
✅ INCIDENT RESOLVED: [SEV1/SEV2/SEV3] - [Brief Description]
Service: [service-name]
Severity: [SEV1/SEV2/SEV3]
Duration: [X minutes]
Root Cause: [Brief description]
Resolution: [What was done to fix]
Post-Mortem: Will be available within 24 hours at [link]
Thank you to everyone who helped resolve this incident.
```
EOF
# Automated Status Page Updates
cat > /scripts/update-status-page.sh << 'EOF'
#!/bin/bash
# Automated status page updates
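INCIDENT_ID=${1:?Usage: update-status-page.sh <incident-id> <status> <message>}
STATUS=${2:?missing status} # investigating, identified, monitoring, resolved
MESSAGE=${3:?missing message}
# STATUSPAGE_PAGE_ID replaces the hard-coded page ID; set it alongside STATUSPAGE_TOKEN.
# Build the JSON with jq so quotes in the message cannot break the payload.
# Statuspage expects the update text in "body" and the subscriber flag in
# "deliver_notifications"; verify both against the current API reference.
PAYLOAD=$(jq -n --arg status "$STATUS" --arg body "$MESSAGE" \
  '{incident: {status: $status, body: $body, deliver_notifications: true}}')
curl -sf -X PATCH "https://api.statuspage.io/v1/pages/$STATUSPAGE_PAGE_ID/incidents/$INCIDENT_ID" \
  -H "Authorization: OAuth $STATUSPAGE_TOKEN" \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD"
echo "Status page updated: $STATUS - $MESSAGE"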
EOF
chmod +x /scripts/update-status-page.sh
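Filling in the initial notification by hand costs time in the first minutes of an incident. As a rough sketch, the template can be rendered by a small function; the name `initial_notification` and its argument order are assumptions made for illustration, and the output would still need to be posted to the incident channel by whatever tooling is in use.

```shell
#!/bin/bash
# initial_notification SEVERITY SERVICE DESCRIPTION IMPACT COMMANDER INCIDENT_ID
# Prints the initial notification with the start time set to the current UTC time.
initial_notification() {
  local sev service description impact commander incident_id started
  sev=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')   # sev1 -> SEV1
  service="$2"
  description="$3"
  impact="$4"
  commander="$5"
  incident_id="$6"
  started=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
  cat <<MSG
🚨 INCIDENT DECLARED: $sev - $description
Service: $service
Severity: $sev
Started: $started
Impact: $impact
Incident Channel: #incident-$incident_id
Incident Commander: @$commander
Status: Investigating
Next update in 15 minutes.
MSG
}
```

A call such as `initial_notification sev1 checkout-api "Checkout errors" "Payments failing" alice 42` produces a message ready to paste or pipe into a webhook.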
5. Recovery & Verification
Ensure complete service recovery.
# Recovery Procedures
# 1. Health Check Verification
cat > /scripts/verify-recovery.sh << 'EOF'
#!/bin/bash
set -e
SERVICE=${1:?Usage: verify-recovery.sh <service> [timeout-seconds]}
TIMEOUT=${2:-300} # 5 minutes default, shared by all checks
echo "Verifying recovery for $SERVICE..."
# Define health checks as "name:command"
checks=(
  "HTTP Health:curl -sf https://$SERVICE/health"
  "Database:mysql -e 'SELECT 1' -u app"
  "Cache:redis-cli ping"
  "Queue:rabbitmqctl status"
)
start_time=$(date +%s)
for check in "${checks[@]}"; do
  name=${check%%:*}   # text before the first colon
  cmd=${check#*:}     # everything after it (may itself contain colons)
  echo "Checking $name..."
  # Retry every 10 seconds until the check passes or the timeout expires.
  # A check can no longer fall through the retry loop and count as passed.
  until eval "$cmd" > /dev/null 2>&1; do
    elapsed=$(( $(date +%s) - start_time ))
    if [ "$elapsed" -ge "$TIMEOUT" ]; then
      echo "✗ $name failed after ${TIMEOUT}s"
      exit 1
    fi
    sleep 10
  done
  echo "✓ $name passed"
done
# Performance verification: a short load test. Output is discarded, so this only
# confirms that wrk completes against the endpoint.
echo "Running performance checks..."
wrk -t2 -c10 -d10s "https://$SERVICE/health" > /dev/null 2>&1
echo "✓ Recovery verified for $SERVICE"
EOF
chmod +x /scripts/verify-recovery.sh
# 2. Gradual Traffic Restoration
cat > /scripts/restore-traffic.sh << 'EOF'
#!/bin/bash
# Gradual traffic restoration after incident
SERVICE=${1:?Usage: restore-traffic.sh <service>}
echo "Restoring traffic for $SERVICE..."
# Phase 1: 10% traffic (5 minutes)
echo "Phase 1: 10% traffic"
kubectl patch virtualservice "$SERVICE" --type merge -p "{
\"spec\": {
\"http\": [{
\"route\": [
{\"destination\": {\"host\": \"$SERVICE\", \"subset\": \"canary\"}, \"weight\": 10},
{\"destination\": {\"host\": \"$SERVICE\", \"subset\": \"stable\"}, \"weight\": 90}
]
}]
}
}"
sleep 300
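# Check error rate: 5xx requests as a fraction of all requests over the last 5 minutes
QUERY="sum(rate(http_requests_total{service=\"$SERVICE\",status=~\"5..\"}[5m])) / sum(rate(http_requests_total{service=\"$SERVICE\"}[5m]))"
# -G with --data-urlencode keeps curl from misreading the braces and brackets in the query.
# An empty result is treated as 0; confirm the service is receiving traffic before trusting it.
ERROR_RATE=$(curl -sG "https://prometheus.internal/api/v1/query" --data-urlencode "query=$QUERY" | jq -r '.data.result[0].value[1] // "0"')
if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
  echo "✗ Error rate too high ($ERROR_RATE). Rolling back."
  # Send all traffic back to the stable subset before exiting
  kubectl patch virtualservice "$SERVICE" --type merge -p "{\"spec\": {\"http\": [{\"route\": [{\"destination\": {\"host\": \"$SERVICE\", \"subset\": \"stable\"}, \"weight\": 100}]}]}}"
  exit 1
fi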
# Phase 2: 50% traffic (5 minutes)
echo "Phase 2: 50% traffic"
kubectl patch virtualservice "$SERVICE" --type merge -p "{
\"spec\": {
\"http\": [{
\"route\": [
{\"destination\": {\"host\": \"$SERVICE\", \"subset\": \"canary\"}, \"weight\": 50},
{\"destination\": {\"host\": \"$SERVICE\", \"subset\": \"stable\"}, \"weight\": 50}
]
}]
}
}"
sleep 300
# Phase 3: 100% traffic
echo "Phase 3: 100% traffic"
kubectl patch virtualservice "$SERVICE" --type merge -p "{
\"spec\": {
\"http\": [{
\"route\": [{
\"destination\": {\"host\": \"$SERVICE\", \"subset\": \"canary\"},
\"weight\": 100
}]
}]
}
}"
echo "✓ Traffic restoration complete"
EOF
chmod +x /scripts/restore-traffic.sh
# 3. Enhanced Monitoring Post-Incident
cat > /scripts/post-incident-monitoring.sh << 'EOF'
#!/bin/bash
# Set up enhanced monitoring after incident
SERVICE=$1
DURATION=${2:-3600} # 1 hour default
echo "Setting up enhanced monitoring for $SERVICE..."
# Create temporary alerts with lower thresholds.
# Assumes the Prometheus Operator is installed: kubectl can only apply Kubernetes
# resources, so the rule group is wrapped in a PrometheusRule object. Add any
# labels your Prometheus ruleSelector requires.
cat > /tmp/enhanced-alerts.yaml << ALERT
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: post-incident-${SERVICE}
spec:
  groups:
    - name: post-incident-${SERVICE}
      rules:
        - alert: ${SERVICE^^}_ErrorRateHigh
          expr: sum(rate(http_requests_total{service="$SERVICE",status=~"5.."}[5m])) / sum(rate(http_requests_total{service="$SERVICE"}[5m])) > 0.005
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Post-incident: Elevated error rate for $SERVICE"
        - alert: ${SERVICE^^}_LatencyHigh
          expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service="$SERVICE"}[5m])) by (le)) > 0.5
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: "Post-incident: Elevated latency for $SERVICE"
ALERT
# Apply alerts
kubectl apply -f /tmp/enhanced-alerts.yaml
# Schedule removal after duration (the background job is lost if the host reboots)
(sleep "$DURATION" && kubectl delete -f /tmp/enhanced-alerts.yaml) &
echo "Enhanced monitoring active for ${DURATION}s"
EOF
chmod +x /scripts/post-incident-monitoring.sh
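The recovery checks in this step all follow the same pattern: retry a command at a fixed interval until it succeeds or a deadline passes. A minimal sketch of that pattern as a reusable function is shown below; the name `wait_for` and its argument order are assumptions, not part of any tool used in this guide.

```shell
#!/bin/bash
# wait_for TIMEOUT_SECONDS INTERVAL_SECONDS CMD [ARGS...]
# Returns 0 as soon as CMD succeeds, or 1 once the timeout has elapsed.
wait_for() {
  local timeout="$1" interval="$2" deadline
  shift 2
  deadline=$(( $(date +%s) + timeout ))
  until "$@" > /dev/null 2>&1; do
    # Give up once the deadline has passed
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep "$interval"
  done
  return 0
}
```

For example, `wait_for 300 10 curl -sf "https://$SERVICE/health"` polls the health endpoint every 10 seconds for up to 5 minutes. Because the command is passed as separate arguments rather than a string, no `eval` is needed.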
6. Post-Incident Analysis
Conduct blameless post-mortems and drive improvements.
# Post-Mortem Template
cat > /templates/post-mortem.md << 'EOF'
# Post-Mortem: [Incident Title]
## Executive Summary
**Incident ID:** INC-2026-XXXX-XXX
**Date:** [Date]
**Severity:** [SEV1/SEV2/SEV3]
**Duration:** [X minutes]
**Impact:** [Brief impact summary]
**Summary:** [2-3 sentence summary of what happened, impact, and resolution]
## Timeline
**[Date]**
- [HH:MM] UTC: [Event]
- [HH:MM] UTC: [Event]
- [HH:MM] UTC: [Event]
## Impact Assessment
**Technical Impact:**
- [Metric]: [Value] (normal: [baseline])
- [Metric]: [Value] (normal: [baseline])
**Business Impact:**
- Users affected: [X]
- Revenue impact: $[X]
- Support tickets: [X]
## Root Cause Analysis
**Primary Cause:**
[Detailed explanation of root cause]
**Contributing Factors:**
1. [Factor 1]
2. [Factor 2]
3. [Factor 3]
**Why It Happened (5 Whys):**
1. Why? [Answer]
2. Why? [Answer]
3. Why? [Answer]
4. Why? [Answer]
5. Why? [Answer - Root Cause]
## Lessons Learned
**What Went Well:**
- [Positive 1]
- [Positive 2]
**What Could Improve:**
- [Improvement 1]
- [Improvement 2]
## Action Items
| Priority | Action | Owner | Due Date | Status |
|----------|--------|-------|----------|--------|
| P0 | [Critical fix] | @owner | [Date] | [Status] |
| P1 | [Important fix] | @owner | [Date] | [Status] |
| P2 | [Nice to have] | @owner | [Date] | [Status] |
## Recommendations
**Immediate (This Sprint):**
- [Action]
- [Action]
**Short-Term (Next 30 Days):**
- [Action]
- [Action]
**Long-Term (Next Quarter):**
- [Action]
- [Action]
## Appendix
- [Incident Timeline Link]
- [Monitoring Dashboard Link]
- [Related Documentation]
---
*Post-Mortem completed by [@name] on [date]*
*Reviewed by [@name] on [date]*
EOF
# Automated Post-Mortem Generator
cat > /scripts/generate-post-mortem.sh << 'EOF'
#!/bin/bash
# Generate post-mortem from incident data
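INCIDENT_ID=${1:?Usage: generate-post-mortem.sh <incident-id>}
echo "Generating post-mortem for $INCIDENT_ID..."
# Get incident data (authenticated with the same API key used to create incidents)
INCIDENT_DATA=$(curl -sf -H "Authorization: Bearer $INCIDENT_API_KEY" \
  "https://api.incident.io/v2/incidents/$INCIDENT_ID")
# Extract key information
# (field paths are illustrative; match them to your API's response schema)
TITLE=$(echo "$INCIDENT_DATA" | jq -r '.title')
SEVERITY=$(echo "$INCIDENT_DATA" | jq -r '.severity')
STARTED=$(echo "$INCIDENT_DATA" | jq -r '.started_at')
RESOLVED=$(echo "$INCIDENT_DATA" | jq -r '.resolved_at')
DURATION=$(echo "$INCIDENT_DATA" | jq -r '.duration')
# Get timeline
TIMELINE=$(curl -sf -H "Authorization: Bearer $INCIDENT_API_KEY" \
  "https://api.incident.io/v2/incidents/$INCIDENT_ID/timeline")
# Generate post-mortem
mkdir -p /post-mortems
cat > "/post-mortems/$INCIDENT_ID.md" << POSTMORTEM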
# Post-Mortem: $TITLE
## Executive Summary
**Incident ID:** $INCIDENT_ID
**Severity:** $SEVERITY
**Started:** $STARTED
**Resolved:** $RESOLVED
**Duration:** $DURATION
**Summary:** [To be filled by Incident Commander]
## Timeline
$TIMELINE
## Impact Assessment
[To be filled]
## Root Cause Analysis
[To be filled]
## Action Items
| Priority | Action | Owner | Due Date | Status |
|----------|--------|-------|----------|--------|
| | | | | |
---
*Generated on $(date)*
POSTMORTEM
echo "Post-mortem template created: /post-mortems/$INCIDENT_ID.md"
EOF
chmod +x /scripts/generate-post-mortem.sh
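If the incident tool reports only start and resolution timestamps, the duration for the post-mortem can be derived locally. The sketch below assumes both timestamps have already been converted to epoch seconds (on GNU systems, `date -d "$STARTED" +%s` is one way to do that); the function names `format_duration` and `incident_duration` are hypothetical.

```shell
#!/bin/bash
# format_duration SECONDS: print a duration as "Xh Ym", or "Ym" when under an hour
format_duration() {
  local total="$1" hours minutes
  if [ "$total" -lt 0 ]; then
    echo "Negative duration: $total" >&2
    return 1
  fi
  hours=$(( total / 3600 ))
  minutes=$(( (total % 3600) / 60 ))
  if [ "$hours" -gt 0 ]; then
    echo "${hours}h ${minutes}m"
  else
    echo "${minutes}m"
  fi
}

# incident_duration START_EPOCH RESOLVED_EPOCH: duration between two epoch timestamps
incident_duration() {
  format_duration $(( $2 - $1 ))
}
```

For example, `format_duration 4980` prints `1h 23m`. Seconds are truncated rather than rounded, which is usually precise enough for a post-mortem summary.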
Best Practices
Incident Response Principles:
- Declare Early: Over-declare rather than under-declare
- Communicate Proactively: Update every 15-30 minutes
- Focus on Recovery: Fix first, investigate later
- Document Everything: Real-time timeline capture
- Blameless Culture: Focus on systems, not people
- Learn and Improve: Complete post-mortems with action items
Assessment
1. What is the first step in incident response?
2. What does SEV1 indicate?
3. Who is responsible for overall incident coordination?
4. What should be the focus during an active incident?
Answer Key: 1-B, 2-C, 3-B, 4-B