Incident Response

Master an incident response lifecycle based on NIST guidance: detection, triage, containment, mitigation, and recovery. Learn to lead effective incident response operations.

⏱️ 55 minutes 📊 Intermediate 📝 6 steps 🏷️ Incident Management

Prerequisites

Basic understanding of IT systems
Familiarity with monitoring tools
Knowledge of your organization's systems
Access to incident management tools

Learning Objectives

Understand the NIST incident response lifecycle
Classify incidents by severity
Lead incident response operations
Implement containment and mitigation strategies
Communicate effectively during incidents
Conduct post-incident analysis

Step-by-Step Guide

1. Incident Detection & Classification

Learn to detect and classify incidents quickly.

# Incident Detection Sources

# 1. Automated Monitoring
# - Alerting systems (PagerDuty, Opsgenie)
# - SIEM alerts (Splunk, ELK)
# - Application monitoring (Datadog, New Relic)
# - Infrastructure monitoring (Prometheus, CloudWatch)

# 2. User Reports
# - Support tickets
# - Status page reports
# - Social media mentions
# - Customer emails

# 3. Security Alerts
# - IDS/IPS alerts
# - Threat intelligence feeds
# - Vulnerability scans
# - Anomaly detection

# Severity Classification
cat > /policies/incident-severity.md << 'EOF'
# Incident Severity Classification

## SEV1 - Critical
**Definition:** Complete service outage or data loss
**Response Time:** Immediate page
**Examples:**
- Service completely unavailable for >10% of users
- Data loss or corruption
- Security breach with data exposure
- Payment processing failure

## SEV2 - Major
**Definition:** Major degradation with significant impact
**Response Time:** <15 minutes
**Examples:**
- 10-50% error rate
- Performance 5x slower than normal
- Core feature unavailable
- Affecting >10% of users

## SEV3 - Minor
**Definition:** Minor degradation with limited impact
**Response Time:** <1 hour
**Examples:**
- <10% error rate
- Non-core feature unavailable
- Cosmetic issues
- Affecting <10% of users

## SEV4 - Trivial
**Definition:** No user impact, internal issue
**Response Time:** Business hours
**Examples:**
- Internal tool issues
- Documentation errors
- Feature requests
- Single user issues
EOF

# Automated Incident Creation
cat > /scripts/create-incident.sh << 'EOF'
#!/bin/bash
# Automated incident creation from alerts

# Usage: create-incident.sh <service> <alert-severity> <message>
ALERT_SERVICE=${1:?usage: create-incident.sh <service> <severity> <message>}
ALERT_SEVERITY=${2:?missing alert severity}
ALERT_MESSAGE=${3:?missing alert message}

# Map alert severity to incident severity
case "$ALERT_SEVERITY" in
  critical) INCIDENT_SEV="sev1" ;;
  error) INCIDENT_SEV="sev2" ;;
  warning) INCIDENT_SEV="sev3" ;;
  *) INCIDENT_SEV="sev4" ;;
esac

# Build the payload with jq so quotes in the alert text cannot break the JSON.
# The field names are illustrative; check them against your incident tool's API.
PAYLOAD=$(jq -n \
  --arg title "${ALERT_SERVICE}: ${ALERT_MESSAGE}" \
  --arg severity "$INCIDENT_SEV" \
  --arg service "$ALERT_SERVICE" \
  '{title: $title, severity: $severity, service_id: $service, monitor_id: ($service + "-monitor")}')

# Create incident
INCIDENT_ID=$(curl -sf -X POST https://api.incident.io/v2/incidents \
  -H "Authorization: Bearer $INCIDENT_API_KEY" \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" | jq -r '.id')

echo "Created incident: $INCIDENT_ID"

# Pick the Slack attachment color before building the message
if [ "$INCIDENT_SEV" = "sev1" ]; then COLOR="danger"; else COLOR="warning"; fi

# Notify Slack
curl -sf -X POST "$SLACK_INCIDENT_WEBHOOK" \
  -H "Content-Type: application/json" \
  -d "$(jq -n \
    --arg text "🚨 New $INCIDENT_SEV incident: $ALERT_SERVICE" \
    --arg color "$COLOR" \
    --arg service "$ALERT_SERVICE" \
    --arg sev "$INCIDENT_SEV" \
    --arg message "$ALERT_MESSAGE" \
    '{text: $text, attachments: [{color: $color, fields: [
      {title: "Service", value: $service, short: true},
      {title: "Severity", value: $sev, short: true},
      {title: "Message", value: $message}
    ]}]}')"
EOF
chmod +x /scripts/create-incident.sh
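
The numeric thresholds in the severity policy from this step can be encoded as a first-pass classifier. The following is a minimal sketch, not a definitive rule set: the function name `classify_severity`, the integer percentage inputs, and the `yes`/`no` full-outage flag are all assumptions made for illustration. It also assumes that any error rate of 10% or more counts as at least SEV2. Criteria that cannot be derived from numbers (data loss, security breaches, payment failures) still need a human decision.

```shell
#!/bin/bash
# Hypothetical helper (not part of any tool):
#   classify_severity <error_rate_pct> <users_affected_pct> [full_outage: yes|no]
# Inputs are whole-number percentages.
classify_severity() {
  local error_rate=$1 users=$2 outage=${3:-no}

  if [ "$outage" = "yes" ] && [ "$users" -gt 10 ]; then
    # Service completely unavailable for more than 10% of users
    echo "sev1"
  elif [ "$error_rate" -ge 10 ] || [ "$users" -gt 10 ]; then
    # Assumption: anything at or above a 10% error rate is treated as major
    echo "sev2"
  elif [ "$error_rate" -gt 0 ] || [ "$users" -gt 0 ]; then
    # Limited, measurable user impact
    echo "sev3"
  else
    # No user impact
    echo "sev4"
  fi
}

classify_severity 35 20 no
# prints: sev2
```

Treat the result as a starting point that the Incident Commander can override, in line with the "declare early" principle.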

2. Incident Response Team Setup

Establish roles and responsibilities.

# Incident Response Team Structure

# Roles and Responsibilities
cat > /policies/incident-roles.md << 'EOF'
# Incident Response Roles

## Incident Commander (IC)
**Responsibilities:**
- Overall incident coordination
- Decision-making authority
- Resource allocation
- Communication management
- Escalation decisions

**Skills Required:**
- Calm under pressure
- Decision-making
- Communication
- System knowledge

## Technical Lead
**Responsibilities:**
- Technical diagnosis
- Mitigation strategy
- Implementation oversight
- Technical decisions
- Root cause analysis

**Skills Required:**
- Deep system knowledge
- Debugging expertise
- Architecture understanding

## Communications Lead
**Responsibilities:**
- Internal stakeholder updates
- Customer communications
- Status page updates
- Social media management
- Executive briefings

**Skills Required:**
- Clear writing
- Stakeholder management
- Empathy

## Scribe
**Responsibilities:**
- Timeline documentation
- Decision logging
- Communication capture
- Evidence collection
- Post-mortem data gathering

**Skills Required:**
- Attention to detail
- Organization
- Note-taking

## On-Call Engineer
**Responsibilities:**
- First responder
- Initial investigation
- Escalation
- Runbook execution

**Skills Required:**
- System familiarity
- Troubleshooting
- Runbook knowledge
EOF

# Escalation Matrix
cat > /policies/escalation-matrix.yaml << 'EOF'
escalation:
  sev1:
    initial:
      - on-call-engineer
      - incident-commander
    15min:
      - engineering-manager
    30min:
      - vp-engineering
    60min:
      - cto
      
  sev2:
    initial:
      - on-call-engineer
    30min:
      - incident-commander
    60min:
      - engineering-manager
      
  sev3:
    initial:
      - on-call-engineer
    2hours:
      - incident-commander
      
  sev4:
    initial:
      - on-call-engineer
    business_hours_only: true

contacts:
  on-call-engineer:
    pagerduty: engineering-oncall
    slack: "@oncall"
    
  incident-commander:
    pagerduty: incident-commander
    slack: "@ic"
    
  engineering-manager:
    slack: "@eng-manager"
    phone: "+1-xxx-xxx-xxxx"
    
  vp-engineering:
    slack: "@vp-eng"
    phone: "+1-xxx-xxx-xxxx"
    
  cto:
    slack: "@cto"
    phone: "+1-xxx-xxx-xxxx"
EOF
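
To see how the escalation matrix above plays out over time, the sketch below returns the roles that should have been engaged a given number of minutes into an incident. The function name `escalation_targets` is hypothetical, and the timings are hard-coded copies of the YAML values only to keep the example free of dependencies. A real implementation would more likely read the YAML file directly.

```shell
#!/bin/bash
# Hypothetical helper mirroring escalation-matrix.yaml:
#   escalation_targets <severity> <elapsed_minutes>
escalation_targets() {
  local sev=$1 elapsed=$2
  local steps step targets=""

  # Each entry is "<minutes-after-start>:<role>"
  case "$sev" in
    sev1) steps="0:on-call-engineer 0:incident-commander 15:engineering-manager 30:vp-engineering 60:cto" ;;
    sev2) steps="0:on-call-engineer 30:incident-commander 60:engineering-manager" ;;
    sev3) steps="0:on-call-engineer 120:incident-commander" ;;
    sev4) steps="0:on-call-engineer" ;;
    *) echo "unknown severity: $sev" >&2; return 1 ;;
  esac

  for step in $steps; do
    # Include the role once its escalation time has been reached
    if [ "$elapsed" -ge "${step%%:*}" ]; then
      targets="${targets:+$targets }${step#*:}"
    fi
  done

  echo "$targets"
}

escalation_targets sev1 20
# prints: on-call-engineer incident-commander engineering-manager
```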

3. Containment Strategies

Implement effective containment measures.

# Containment Strategies

# 1. Service Isolation
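# The label only isolates traffic if a NetworkPolicy or Service selector keys on it
kubectl label pod affected-pod incident=isolated
# Stop new pods from being scheduled onto the affected node
kubectl cordon node-affected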

# 2. Traffic Diversion
# Enable circuit breaker
kubectl set env deployment/api-service CIRCUIT_BREAKER=enabled

# Route traffic away (custom resources such as VirtualService need a merge patch)
kubectl patch virtualservice api-vs --type=merge -p '{
  "spec": {
    "http": [{
      "route": [{
        "destination": {"host": "api", "subset": "stable"},
        "weight": 100
      }]
    }]
  }
}'

# 3. Rate Limiting
# Emergency rate limiting
kubectl patch configmap ratelimit-config -p '{
  "data": {
    "requests_per_second": "100",
    "burst_size": "50"
  }
}'

# 4. Feature Flags
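# Disable problematic feature (endpoint and payload are placeholders; use your flag provider's API)
curl -X PATCH https://api.flagsmith.com/api/v1/features/disable-feature/ \
  -H "Authorization: Bearer $FLAGS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"feature_id": 123, "enabled": false}'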

# 5. Database Protection
# Read-only mode for database
mysql -e "SET GLOBAL super_read_only = ON;"

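# Block suspicious traffic at the network layer
# Note: iptables-restore replaces the existing filter table, so merge these
# rules with the ones already in place before applying.
cat > /etc/iptables/rules.v4 << 'EOF'
*filter
# Block suspicious traffic
-A INPUT -s 192.168.1.100 -j DROP
-A INPUT -p tcp --dport 3306 ! -s 10.0.0.0/8 -j DROP
COMMIT
EOF
iptables-restore < /etc/iptables/rules.v4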

# 6. Automated Containment Script
cat > /scripts/emergency-containment.sh << 'EOF'
#!/bin/bash
set -euo pipefail

SERVICE=${1:?usage: emergency-containment.sh <service> [action]}
ACTION=${2:-circuit-breaker}

echo "Initiating emergency containment for $SERVICE..."

case $ACTION in
  circuit-breaker)
    echo "Enabling circuit breaker..."
    kubectl set env deployment/$SERVICE CIRCUIT_BREAKER=enabled
    ;;
    
  scale-down)
    echo "Scaling down to reduce impact..."
    kubectl scale deployment/$SERVICE --replicas=1
    ;;
    
  rollback)
    echo "Rolling back to previous version..."
    kubectl rollout undo deployment/$SERVICE
    ;;
    
  maintenance)
    echo "Enabling maintenance mode..."
    kubectl set env deployment/$SERVICE MAINTENANCE_MODE=true
    ;;
    
  isolate)
    echo "Isolating affected pods..."
    kubectl label pods -l app=$SERVICE incident=isolated --overwrite
    ;;
    
  *)
    echo "Unknown action: $ACTION"
    exit 1
    ;;
esac

# Notify team
curl -sf -X POST "$SLACK_WEBHOOK" \
  -H "Content-Type: application/json" \
  -d "{\"text\": \"🚨 Emergency containment initiated for $SERVICE ($ACTION)\"}"

echo "Containment complete"
EOF
chmod +x /scripts/emergency-containment.sh
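
Before running a containment action under pressure, it can help to preview the exact command it maps to. The sketch below mirrors the action names used in the containment script above but only prints the command instead of executing it. The function name `containment_cmd` is an assumption made for illustration.

```shell
#!/bin/bash
# Hypothetical dry-run helper: prints the command for an action without running it.
#   containment_cmd <service> [action]
containment_cmd() {
  local service=$1 action=${2:-circuit-breaker}

  case "$action" in
    circuit-breaker) echo "kubectl set env deployment/$service CIRCUIT_BREAKER=enabled" ;;
    scale-down)      echo "kubectl scale deployment/$service --replicas=1" ;;
    rollback)        echo "kubectl rollout undo deployment/$service" ;;
    maintenance)     echo "kubectl set env deployment/$service MAINTENANCE_MODE=true" ;;
    isolate)         echo "kubectl label pods -l app=$service incident=isolated --overwrite" ;;
    *)               echo "Unknown action: $action" >&2; return 1 ;;
  esac
}

containment_cmd api-service rollback
# prints: kubectl rollout undo deployment/api-service
```

Pasting the printed command into the incident channel before running it also gives the Scribe an accurate record of what was changed.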

4. Communication During Incidents

Communicate effectively with all stakeholders.

# Incident Communication Templates

cat > /templates/incident-communications.md << 'EOF'
# Incident Communication Templates

## Initial Notification (0-5 min)
**Channel:** Slack #incidents, #exec-alerts
**Template:**
```
🚨 INCIDENT DECLARED: [SEV1/SEV2/SEV3] - [Brief Description]

Service: [service-name]
Severity: [SEV1/SEV2/SEV3]
Started: [timestamp]
Impact: [Brief impact description]

Incident Channel: #incident-[id]
Incident Commander: [@name]
Status: Investigating

Next update in 15 minutes.
```

## Status Page Update (Every 15-30 min)
**Channel:** Status page
**Template:**
```
[Investigating/Identified/Monitoring/Resolved] - [Service Name]

We are [investigating/working on] an issue affecting [service]. 
[Optional: Brief technical details for technical audience]

[Optional: Estimated resolution time]

Last updated: [timestamp]
```

## Executive Briefing (SEV1 only, Every 30 min)
**Channel:** Slack #exec-alerts, Email to exec team
**Template:**
```
INCIDENT UPDATE - [SEV1] [Service Name]

Time: [timestamp]
Duration: [X minutes]

Current Status: [Investigating/Identified/Mitigating/Resolved]

Impact:
- [X]% of users affected
- [Description of impact]
- [Revenue impact if known]

Actions Taken:
- [Action 1]
- [Action 2]

Next Steps:
- [Next action]
- [ETA for resolution]

Incident Commander: [@name]
```

## Customer Communication (If significant impact)
**Channel:** Email, in-app notification
**Template:**
```
Subject: Service Update: [Issue Description]

Dear [Customer],

We're experiencing an issue affecting [service] that may impact 
your ability to [action].

What we know:
- [Known information]
- [Impact description]

What we're doing:
- [Actions being taken]
- [ETA if available]

We'll update you within [timeframe] with more information.

We apologize for the inconvenience.

The [Company] Team
```

## Resolution Notification
**Channel:** All channels used during incident
**Template:**
```
✅ INCIDENT RESOLVED: [SEV1/SEV2/SEV3] - [Brief Description]

Service: [service-name]
Severity: [SEV1/SEV2/SEV3]
Duration: [X minutes]
Root Cause: [Brief description]

Resolution: [What was done to fix]

Post-Mortem: Will be available within 24 hours at [link]

Thank you to everyone who helped resolve this incident.
```
EOF

# Automated Status Page Updates
cat > /scripts/update-status-page.sh << 'EOF'
#!/bin/bash
# Automated status page updates

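INCIDENT_ID=${1:?usage: update-status-page.sh <incident-id> <status> <message>}
STATUS=${2:?missing status}  # investigating, identified, monitoring, resolved
MESSAGE=${3:?missing message}

# Build the JSON with jq so quotes in the message cannot break the payload.
# Statuspage's incident API names these fields "body" and "deliver_notifications";
# confirm against the current API reference. 12345 is a placeholder page ID.
curl -sf -X PATCH "https://api.statuspage.io/v1/pages/12345/incidents/$INCIDENT_ID" \
  -H "Authorization: OAuth $STATUSPAGE_TOKEN" \
  -H "Content-Type: application/json" \
  -d "$(jq -n --arg status "$STATUS" --arg body "$MESSAGE" \
    '{incident: {status: $status, body: $body, deliver_notifications: true}}')"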

echo "Status page updated: $STATUS - $MESSAGE"
EOF
chmod +x /scripts/update-status-page.sh
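
The "Initial Notification" template above can be filled in from a handful of arguments so the first message goes out within the 0-5 minute window. This is a minimal sketch: the function name `initial_notification` and its argument order are assumptions, and it only prints the message rather than posting it anywhere.

```shell
#!/bin/bash
# Hypothetical helper: fills the "Initial Notification" template.
#   initial_notification <sev> <description> <service> <impact> <incident_id> <commander>
initial_notification() {
  local sev=$1 description=$2 service=$3 impact=$4 incident_id=$5 commander=$6
  local started

  # Record the declaration time in UTC so all responders share one clock
  started=$(date -u +"%Y-%m-%d %H:%M UTC")

  cat <<MSG
🚨 INCIDENT DECLARED: $sev - $description

Service: $service
Severity: $sev
Started: $started
Impact: $impact

Incident Channel: #incident-$incident_id
Incident Commander: $commander
Status: Investigating

Next update in 15 minutes.
MSG
}

initial_notification SEV1 "Checkout errors" checkout-api "Payments failing" 42 "@alice"
```

The output can then be piped into whichever chat or paging tool the team uses.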

5. Recovery & Verification

Ensure complete service recovery.

# Recovery Procedures

# 1. Health Check Verification
cat > /scripts/verify-recovery.sh << 'EOF'
#!/bin/bash
set -e

SERVICE=$1
TIMEOUT=${2:-300}  # 5 minutes default

echo "Verifying recovery for $SERVICE..."

# Define health checks
checks=(
  "HTTP Health:curl -sf https://$SERVICE/health"
  "Database:mysql -e 'SELECT 1' -u app"
  "Cache:redis-cli ping"
  "Queue:rabbitmqctl status"
)

start_time=$(date +%s)

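for check in "${checks[@]}"; do
  name=${check%%:*}   # text before the first colon
  cmd=${check#*:}     # everything after the first colon

  echo "Checking $name..."
  passed=false

  for i in {1..30}; do
    if eval "$cmd" > /dev/null 2>&1; then
      echo "✓ $name passed"
      passed=true
      break
    fi

    elapsed=$(($(date +%s) - start_time))
    if [ "$elapsed" -ge "$TIMEOUT" ]; then
      break
    fi

    sleep 10
  done

  # Fail if the check never passed, whether by timeout or by exhausting retries
  if [ "$passed" != true ]; then
    echo "✗ $name did not pass within ${TIMEOUT}s"
    exit 1
  fi
done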

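# Performance smoke test: confirms the endpoint sustains light load.
# Output is discarded, so this only catches wrk failing outright; review
# latency on your dashboards as well.
echo "Running performance checks..."
wrk -t2 -c10 -d10 https://$SERVICE/health > /dev/null 2>&1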

echo "✓ Recovery verified for $SERVICE"
EOF
chmod +x /scripts/verify-recovery.sh

# 2. Gradual Traffic Restoration
cat > /scripts/restore-traffic.sh << 'EOF'
#!/bin/bash
# Gradual traffic restoration after incident

SERVICE=${1:?usage: restore-traffic.sh <service>}

# Shift traffic between the canary and stable subsets.
# VirtualService is a custom resource, so a merge patch is required.
set_weights() {
  local canary=$1 stable=$((100 - $1))
  kubectl patch virtualservice "$SERVICE" --type=merge -p "{
    \"spec\": {
      \"http\": [{
        \"route\": [
          {\"destination\": {\"host\": \"$SERVICE\", \"subset\": \"canary\"}, \"weight\": $canary},
          {\"destination\": {\"host\": \"$SERVICE\", \"subset\": \"stable\"}, \"weight\": $stable}
        ]
      }]
    }
  }"
}

# Send all traffic back to stable and abort if more than 1% of requests fail.
# The query divides 5xx requests by all requests to get a ratio, not a raw rate.
check_error_rate() {
  local query ratio
  query="sum(rate(http_requests_total{service=\"$SERVICE\",status=~\"5..\"}[5m])) / sum(rate(http_requests_total{service=\"$SERVICE\"}[5m]))"
  ratio=$(curl -sG "https://prometheus.internal/api/v1/query" --data-urlencode "query=$query" \
    | jq -r '.data.result[0].value[1] // "0"')
  if (( $(echo "$ratio > 0.01" | bc -l) )); then
    echo "✗ Error rate too high ($ratio). Rolling back."
    set_weights 0
    exit 1
  fi
}

echo "Restoring traffic for $SERVICE..."

# Phase 1: 10% traffic (5 minutes)
echo "Phase 1: 10% traffic"
set_weights 10
sleep 300
check_error_rate

# Phase 2: 50% traffic (5 minutes)
echo "Phase 2: 50% traffic"
set_weights 50
sleep 300
check_error_rate

# Phase 3: 100% traffic
echo "Phase 3: 100% traffic"
set_weights 100

echo "✓ Traffic restoration complete"
EOF
chmod +x /scripts/restore-traffic.sh

# 3. Enhanced Monitoring Post-Incident
cat > /scripts/post-incident-monitoring.sh << 'EOF'
#!/bin/bash
# Set up enhanced monitoring after incident

SERVICE=$1
DURATION=${2:-3600}  # 1 hour default

echo "Setting up enhanced monitoring for $SERVICE..."

# Create temporary alert with lower thresholds.
# The rules are wrapped in a PrometheusRule so kubectl can apply them; this
# assumes the Prometheus Operator is installed. Add any labels your
# Prometheus ruleSelector requires.
cat > /tmp/enhanced-alerts.yaml << ALERT
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: post-incident-${SERVICE}
spec:
  groups:
  - name: post-incident-${SERVICE}
    rules:
    - alert: ${SERVICE^^}_ErrorRateHigh
      expr: sum(rate(http_requests_total{service="$SERVICE",status=~"5.."}[5m])) / sum(rate(http_requests_total{service="$SERVICE"}[5m])) > 0.005
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Post-incident: Elevated error rate for $SERVICE"

    - alert: ${SERVICE^^}_LatencyHigh
      expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service="$SERVICE"}[5m])) by (le)) > 0.5
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: "Post-incident: Elevated latency for $SERVICE"
ALERT

# Apply alerts
kubectl apply -f /tmp/enhanced-alerts.yaml

# Schedule removal after duration
(sleep $DURATION && kubectl delete -f /tmp/enhanced-alerts.yaml) &

echo "Enhanced monitoring active for ${DURATION}s"
EOF
chmod +x /scripts/post-incident-monitoring.sh
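
The health checks in this step all follow the same pattern: retry a command until it succeeds or a deadline passes. That pattern can be factored into one small function. The sketch below is illustrative only: the name `retry_until` and its argument order are assumptions, and the commands shown in the usage lines are stand-ins for real health checks.

```shell
#!/bin/bash
# Hypothetical helper: retry a command until it succeeds or the deadline passes.
#   retry_until <timeout_seconds> <interval_seconds> <command...>
# Returns 0 on the first success, 1 if the deadline is reached first.
retry_until() {
  local timeout=$1 interval=$2
  shift 2
  local deadline=$(( $(date +%s) + timeout ))

  while true; do
    # Run the command quietly; only its exit status matters
    if "$@" > /dev/null 2>&1; then
      return 0
    fi
    # Give up explicitly instead of silently falling through
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep "$interval"
  done
}

# Stand-in commands: "true" always passes, "false" never does
retry_until 2 1 true && echo "check passed"
retry_until 1 1 false || echo "check failed"
```

Because the function returns a non-zero status on timeout, a caller can never mistake an exhausted retry loop for a passing check.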

6. Post-Incident Analysis

Conduct blameless post-mortems and drive improvements.

# Post-Mortem Template

cat > /templates/post-mortem.md << 'EOF'
# Post-Mortem: [Incident Title]

## Executive Summary
**Incident ID:** INC-2026-XXXX-XXX  
**Date:** [Date]  
**Severity:** [SEV1/SEV2/SEV3]  
**Duration:** [X minutes]  
**Impact:** [Brief impact summary]

**Summary:** [2-3 sentence summary of what happened, impact, and resolution]

## Timeline
**[Date]**
- [HH:MM] UTC: [Event]
- [HH:MM] UTC: [Event]
- [HH:MM] UTC: [Event]

## Impact Assessment
**Technical Impact:**
- [Metric]: [Value] (normal: [baseline])
- [Metric]: [Value] (normal: [baseline])

**Business Impact:**
- Users affected: [X]
- Revenue impact: $[X]
- Support tickets: [X]

## Root Cause Analysis
**Primary Cause:**
[Detailed explanation of root cause]

**Contributing Factors:**
1. [Factor 1]
2. [Factor 2]
3. [Factor 3]

**Why It Happened (5 Whys):**
1. Why? [Answer]
2. Why? [Answer]
3. Why? [Answer]
4. Why? [Answer]
5. Why? [Answer - Root Cause]

## Lessons Learned

**What Went Well:**
- [Positive 1]
- [Positive 2]

**What Could Improve:**
- [Improvement 1]
- [Improvement 2]

## Action Items

| Priority | Action | Owner | Due Date | Status |
|----------|--------|-------|----------|--------|
| P0 | [Critical fix] | @owner | [Date] | [Status] |
| P1 | [Important fix] | @owner | [Date] | [Status] |
| P2 | [Nice to have] | @owner | [Date] | [Status] |

## Recommendations

**Immediate (This Sprint):**
- [Action]
- [Action]

**Short-Term (Next 30 Days):**
- [Action]
- [Action]

**Long-Term (Next Quarter):**
- [Action]
- [Action]

## Appendix
- [Incident Timeline Link]
- [Monitoring Dashboard Link]
- [Related Documentation]

---
*Post-Mortem completed by [@name] on [date]*
*Reviewed by [@name] on [date]*
EOF

# Automated Post-Mortem Generator
cat > /scripts/generate-post-mortem.sh << 'EOF'
#!/bin/bash
# Generate post-mortem from incident data

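INCIDENT_ID=${1:?usage: generate-post-mortem.sh <incident-id>}

echo "Generating post-mortem for $INCIDENT_ID..."

# Get incident data (authenticated the same way as incident creation)
INCIDENT_DATA=$(curl -sf "https://api.incident.io/v2/incidents/$INCIDENT_ID" \
  -H "Authorization: Bearer $INCIDENT_API_KEY")

# Extract key information (quote the variable so whitespace in the JSON is preserved)
TITLE=$(echo "$INCIDENT_DATA" | jq -r '.title')
SEVERITY=$(echo "$INCIDENT_DATA" | jq -r '.severity')
STARTED=$(echo "$INCIDENT_DATA" | jq -r '.started_at')
RESOLVED=$(echo "$INCIDENT_DATA" | jq -r '.resolved_at')
DURATION=$(echo "$INCIDENT_DATA" | jq -r '.duration')

# Get timeline (raw API response; tidy it into readable entries before publishing)
TIMELINE=$(curl -sf "https://api.incident.io/v2/incidents/$INCIDENT_ID/timeline" \
  -H "Authorization: Bearer $INCIDENT_API_KEY")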

# Generate post-mortem
cat > /post-mortems/$INCIDENT_ID.md << POSTMORTEM
# Post-Mortem: $TITLE

## Executive Summary
**Incident ID:** $INCIDENT_ID
**Severity:** $SEVERITY
**Started:** $STARTED
**Resolved:** $RESOLVED
**Duration:** $DURATION

**Summary:** [To be filled by Incident Commander]

## Timeline
$TIMELINE

## Impact Assessment
[To be filled]

## Root Cause Analysis
[To be filled]

## Action Items
| Priority | Action | Owner | Due Date | Status |
|----------|--------|-------|----------|--------|
| | | | | |

---
*Generated on $(date)*
POSTMORTEM

echo "Post-mortem template created: /post-mortems/$INCIDENT_ID.md"
EOF
chmod +x /scripts/generate-post-mortem.sh
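
The post-mortem template asks for a duration, which is easy to miscalculate by hand after a long incident. The sketch below derives it from start and resolution times given as Unix epoch seconds. The function name `format_duration` is hypothetical, and the use of epoch seconds is an assumption chosen to keep the arithmetic portable. With GNU date, an ISO timestamp can be converted first with `date -d "$STARTED" +%s`.

```shell
#!/bin/bash
# Hypothetical helper: human-readable incident duration.
#   format_duration <started_epoch_seconds> <resolved_epoch_seconds>
format_duration() {
  local started=$1 resolved=$2
  local total=$(( resolved - started ))

  if [ "$total" -lt 0 ]; then
    echo "resolved time is before start time" >&2
    return 1
  fi

  # Seconds are truncated; post-mortems usually report whole minutes
  local minutes=$(( total / 60 ))
  local hours=$(( minutes / 60 ))

  if [ "$hours" -gt 0 ]; then
    echo "${hours}h $(( minutes % 60 ))m"
  else
    echo "${minutes}m"
  fi
}

format_duration 1000 6400
# prints: 1h 30m
```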

Best Practices

Incident Response Principles:

Declare Early: Over-declare rather than under-declare
Communicate Proactively: Update every 15-30 minutes
Focus on Recovery: Fix first, investigate later
Document Everything: Real-time timeline capture
Blameless Culture: Focus on systems, not people
Learn and Improve: Complete post-mortems with action items
