SRE Fundamentals

Master Site Reliability Engineering best practices: Define SLOs, implement error budgets, set up monitoring with Prometheus and Grafana, and build resilient systems.

⏱️ 45 minutes 📊 Intermediate 📝 5 steps 🏷️ IT Operations

Prerequisites

Before starting this tutorial, ensure you have:
  • Basic Linux command-line knowledge
  • Understanding of web services and HTTP protocols
  • Familiarity with containerization concepts (Docker)
  • Basic knowledge of monitoring concepts
  • A development environment or cloud account (AWS/GCP/Azure)

Learning Objectives

By the end of this tutorial, you will be able to:

  • Define and implement Service Level Indicators (SLIs)
  • Set appropriate Service Level Objectives (SLOs)
  • Calculate and manage error budgets
  • Set up Prometheus for metrics collection
  • Create Grafana dashboards for monitoring
  • Configure alerts based on SLO burn rates

Step-by-Step Guide

1. Define Service Level Indicators (SLIs)

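SLIs are quantitative measurements of service quality. A common starting point is Google's four golden signals (latency, traffic, errors, and saturation). The SLIs below map onto them, with availability covering errors and throughput covering traffic.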

Key SLIs to Track:

  1. Availability: Percentage of successful requests
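    # Example: calculate availability from a Prometheus text-format /metrics endpoint
    metrics=$(curl -s https://api.service.com/metrics)
    # Sum the sample values (last field) of the matching series.
    # Only status="200" counts as success here; widen the match if other codes are acceptable.
    successful_requests=$(echo "$metrics" | awk '/^http_requests_total\{.*status="200"/ {s += $NF} END {printf "%d\n", s}')
    total_requests=$(echo "$metrics" | awk '/^http_requests_total\{/ {t += $NF} END {printf "%d\n", t}')
    # Multiply before dividing so bc keeps full precision (fails if total_requests is 0)
    availability=$(echo "scale=4; $successful_requests * 100 / $total_requests" | bc)
    echo "Availability: $availability%"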
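  2. Latency: Response time percentiles (p50, p95, p99)
    # Cumulative histogram buckets that latency percentiles are computed from
    http_request_duration_seconds_bucket{le="0.1"}  # requests that took <= 100ms
    http_request_duration_seconds_bucket{le="0.5"}  # requests that took <= 500ms
    http_request_duration_seconds_bucket{le="1"}    # requests that took <= 1 second
    # Prometheus query for the p99 latency over the last 5 minutes
    histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))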
  3. Throughput: Requests per second
    # Prometheus query for throughput
    rate(http_requests_total[5m])
  4. Saturation: Resource utilization
    # CPU utilization per instance (averaged across cores)
    1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
    
    # Memory utilization
    (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes
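
The latency SLI above is stored as cumulative histogram buckets, so a percentile has to be estimated rather than read directly. The following is a minimal Python sketch of that estimate, assuming linear interpolation within the bucket that contains the target rank. The function name `estimate_quantile` and the sample bucket counts are illustrative assumptions, not part of any library.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 < q < 1) from cumulative histogram buckets.

    buckets: list of (upper_bound_seconds, cumulative_count) sorted by bound,
    ending with (math.inf, total_count), mirroring the `le` buckets above.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations, so no meaningful percentile
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical counts: 900 requests <= 100ms, 980 <= 500ms, 995 <= 1s, 1000 total.
sample_buckets = [(0.1, 900), (0.5, 980), (1.0, 995), (math.inf, 1000)]
print(f"estimated p99: {estimate_quantile(0.99, sample_buckets):.3f}s")
```

Because the result is interpolated, its accuracy depends on how closely the bucket boundaries bracket the latencies you care about, which is worth keeping in mind when choosing bucket bounds near an SLO threshold.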

2. Set Service Level Objectives (SLOs)

SLOs are target values for your SLIs. They should balance reliability with innovation velocity.

Recommended SLO Targets:

Service Tier                 Availability SLO    Latency SLO (p99)
Critical (payments, auth)    99.99%              < 500 ms
High (core features)         99.9%               < 1000 ms
Standard (other features)    99.5%               < 2000 ms

Implementing SLOs in Prometheus:

# Example: 99.9% availability SLO over 30 days
# This calculates the error rate
(
  sum(rate(http_requests_total{status=~"5.."}[30d]))
  /
  sum(rate(http_requests_total[30d]))
) < 0.001  # 0.1% error rate = 99.9% availability
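
To make the comparison between measured SLIs and SLO targets concrete, here is a small Python sketch that checks one measurement window against the tier targets listed above. The `Slo` class, the `SLO_TARGETS` mapping, and `meets_slo` are hypothetical names, and treating a window with no traffic as compliant is an assumption you may want to change.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    availability: float    # target fraction of successful requests, e.g. 0.999
    p99_latency_ms: float  # p99 latency ceiling in milliseconds


# Targets copied from the recommended SLO table.
SLO_TARGETS = {
    "critical": Slo(0.9999, 500),
    "high": Slo(0.999, 1000),
    "standard": Slo(0.995, 2000),
}


def meets_slo(tier, total_requests, failed_requests, p99_latency_ms):
    """Return True if the measured window satisfies both targets for the tier."""
    slo = SLO_TARGETS[tier]
    if total_requests == 0:
        return True  # assumption: no traffic means nothing was violated
    availability = 1 - failed_requests / total_requests
    return availability >= slo.availability and p99_latency_ms < slo.p99_latency_ms


# 900 failures in 1,000,000 requests is 99.91% availability.
print(meets_slo("high", 1_000_000, 900, 800))
print(meets_slo("critical", 1_000_000, 900, 100))
```

The same failure count passes the "high" tier but misses the "critical" one, which illustrates how much stricter each additional nine is.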

3. Implement Error Budgets

Error budgets represent the acceptable amount of failure. They drive release velocity decisions.

Error Budget Calculation:

# Error Budget = 100% - SLO Target
# For 99.9% SLO: Error Budget = 0.1%

# Monthly error budget in minutes
# 99.9% availability = 43.2 minutes of downtime allowed per 30-day month
total_minutes_in_month=43200  # 30 days * 24 hours * 60 minutes
error_budget_minutes=$(echo "scale=2; $total_minutes_in_month * 0.001" | bc)
echo "Error budget: $error_budget_minutes minutes/month"
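
The same arithmetic generalizes to other targets and window lengths. The sketch below is one way to express it in Python; `error_budget_minutes` is a hypothetical helper, and the 30-day default window is an assumption carried over from the calculation above.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Allowed downtime in minutes for an availability target over a window.

    slo_target is a fraction such as 0.999 (for 99.9%).
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_target)


for target in (0.995, 0.999, 0.9999):
    print(f"{target:.2%} over 30 days: {error_budget_minutes(target):.2f} minutes")
```

Each extra nine divides the budget by ten, so a 99.99% target leaves only a few minutes per month.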

Error Budget Burn Rate Alerting:

# Prometheus alerting rule (rules-file YAML) for error budget burn rate
# Alert if burning budget 28x faster than allowed
# (a sustained 28x burn would exhaust a 30-day budget in about 26 hours)
groups:
  - name: slo-alerts  # group name is arbitrary
    rules:
      - alert: ErrorBudgetBurnRateHigh
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > (0.001 * 28)
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error budget burning too fast"
          description: "5xx error ratio is {{ $value }}, above the 28x burn-rate threshold of 0.028"

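
The relationship between an observed error ratio, the burn rate, and the time left before the budget runs out can be worked through in a few lines. This is a rough Python sketch under the assumption of a 30-day window and a constant error ratio; `burn_rate` and `hours_to_exhaustion` are hypothetical helper names.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than allowed the error budget is being spent."""
    return error_ratio / (1 - slo_target)


def hours_to_exhaustion(rate, window_days=30):
    """Hours until a full budget is gone if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")  # no errors, so the budget is never exhausted
    return window_days * 24 / rate


rate = burn_rate(error_ratio=0.028, slo_target=0.999)
print(f"burn rate: {rate:.1f}x, budget gone in {hours_to_exhaustion(rate):.1f} hours")
```

Under these assumptions a sustained 28x burn uses up a 30-day budget in roughly 25.7 hours, and a burn rate of 30x corresponds to exhaustion in exactly 24 hours.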
Error Budget Policy:
  • 0-50% consumed: Normal operations, releases allowed
  • 50-80% consumed: Caution, require additional review
  • 80-100% consumed: Freeze, no non-critical changes
  • >100% consumed: Incident, focus on reliability only
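
A policy like this is easier to apply consistently when it is encoded rather than remembered. The following Python sketch maps budget consumption to the four policy levels. The function names are hypothetical, and because the listed ranges share their endpoints, the sketch assumes that exactly 50% and exactly 80% fall into the stricter level.

```python
def budget_consumed(failed_requests, total_requests, slo_target):
    """Fraction of the request-based error budget used so far (1.0 = all of it)."""
    allowed_failures = total_requests * (1 - slo_target)
    if allowed_failures == 0:
        return 0.0  # assumption: with no traffic, no budget has been spent
    return failed_requests / allowed_failures


def policy_action(consumed):
    """Map a consumed fraction to the policy levels in the list above."""
    if consumed > 1.0:
        return "incident"  # focus on reliability only
    if consumed >= 0.8:
        return "freeze"    # no non-critical changes
    if consumed >= 0.5:
        return "caution"   # require additional review
    return "normal"        # releases allowed


# 600 failures against a budget of about 1000 (99.9% of 1,000,000 requests).
used = budget_consumed(600, 1_000_000, 0.999)
print(f"{used:.0%} consumed -> {policy_action(used)}")
```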

4. Set Up Prometheus Monitoring

Prometheus is a widely used open-source system for collecting metrics, and it is a common choice for SRE monitoring.

Installation (Docker):

# Create prometheus.yml configuration
cat > prometheus.yml << 'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'api-service'
    static_configs:
      - targets: ['api:8000']  # port the instrumented application listens on
    metrics_path: '/metrics'

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
EOF

# Run Prometheus with Docker
docker run -d \
  -p 9090:9090 \
  -v "$(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml" \
  -v prometheus_data:/prometheus \
  --name prometheus \
  prom/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/prometheus
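
Once the container is running, one way to confirm that metrics are being collected is to call the Prometheus HTTP API's instant query endpoint (`/api/v1/query`). The sketch below uses only the Python standard library. The helper names are hypothetical, the base URL assumes the port mapping from the command above, and the parser assumes the query returns an instant vector.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Build the instant-query URL for a PromQL expression."""
    query = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def parse_instant_vector(payload):
    """Turn an instant-vector API response into (labels, value) pairs."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    # Each sample looks like {"metric": {...}, "value": [timestamp, "string value"]}.
    return [(s["metric"], float(s["value"][1])) for s in payload["data"]["result"]]


def query_prometheus(base_url, promql, timeout=10):
    """Run an instant query against a reachable Prometheus server."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return parse_instant_vector(json.load(resp))


if __name__ == "__main__":
    # `up` is 1 for every target Prometheus scraped successfully.
    for labels, value in query_prometheus("http://localhost:9090", "up"):
        print(labels.get("job"), value)
```

A value of 0 for a job usually means the target in prometheus.yml is unreachable from the Prometheus container.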

Add Instrumentation to Your Application:

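# Python example with prometheus_client
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest

# Define metrics
REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])
REQUEST_DURATION = Histogram('http_request_duration_seconds', 'HTTP request duration')

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/metrics':
            # Serve the current metric values in the Prometheus text format
            self.send_response(200)
            self.send_header('Content-Type', CONTENT_TYPE_LATEST)
            self.end_headers()
            self.wfile.write(generate_latest())
        else:
            # Your actual handler: time the request, then count it by status
            start = time.perf_counter()
            status = 200
            # ... handle request ...
            self.send_response(status)
            self.end_headers()
            REQUEST_DURATION.observe(time.perf_counter() - start)
            # Note: raw paths as label values can create many series; prefer route names
            REQUEST_COUNT.labels('GET', self.path, str(status)).inc()

# Serve the application and its /metrics endpoint on port 8000
if __name__ == '__main__':
    HTTPServer(('0.0.0.0', 8000), MetricsHandler).serve_forever()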

5. Create Grafana Dashboards

Grafana provides visualization for your Prometheus metrics.

Installation:

# Run Grafana with Docker
docker run -d \
  -p 3000:3000 \
  -v grafana_data:/var/lib/grafana \
  --name grafana \
  --link prometheus:prometheus \
  grafana/grafana

# Default login: admin / admin

Add Prometheus Data Source:

  1. Log into Grafana (http://localhost:3000)
  2. Go to Configuration → Data Sources (in newer Grafana versions: Connections → Data sources)
  3. Click "Add data source" → Prometheus
  4. URL: http://prometheus:9090
  5. Click "Save & Test"
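
The same data source can also be registered without the UI, which helps when the setup needs to be repeatable. The sketch below builds a request for Grafana's HTTP API (`POST /api/datasources`) using only the Python standard library. The function name is hypothetical, the data source name "Prometheus" is an arbitrary choice, and the credentials assume the default admin / admin login noted above.

```python
import base64
import json
import urllib.request


def datasource_request(grafana_url, user, password, prometheus_url):
    """Build (but do not send) the request that registers a Prometheus data source."""
    payload = {
        "name": "Prometheus",   # display name, an arbitrary choice
        "type": "prometheus",
        "url": prometheus_url,  # as seen from the Grafana container
        "access": "proxy",      # Grafana's backend makes the queries
        "isDefault": True,
    }
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{grafana_url.rstrip('/')}/api/datasources",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = datasource_request("http://localhost:3000", "admin", "admin", "http://prometheus:9090")
    with urllib.request.urlopen(req, timeout=10) as resp:
        print(resp.status, json.load(resp).get("message"))
```

Running it a second time is expected to fail because a data source with that name already exists, so treat this as a one-time bootstrap step rather than an idempotent one.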

Create SLO Dashboard Panels:

# Panel 1: Availability (Last 24h)
sum(rate(http_requests_total{status!~"5.."}[24h])) / sum(rate(http_requests_total[24h])) * 100

# Panel 2: Latency Percentiles
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) * 1000

# Panel 3: Error Budget Remaining (% of the 0.1% budget for a 99.9% SLO)
(1 - (sum(rate(http_requests_total{status=~"5.."}[30d])) / sum(rate(http_requests_total[30d]))) / 0.001) * 100

# Panel 4: Request Rate
sum(rate(http_requests_total[5m]))
Pro Tip: Import the official "Prometheus SLO Dashboard" from Grafana.com (ID: 11753) for a production-ready starting point.

Best Practices

Key SRE Principles:
  • Automate Toil Away: Any repetitive manual task should be automated. If you do it three times, write a script.
  • Embrace Risk: Use error budgets to make data-driven decisions about release velocity vs. reliability.
  • Blameless Culture: Focus on system failures, not human errors. Ask "what allowed this to happen?" not "who caused this?"
  • Measure Everything: You can't improve what you don't measure. Instrument everything.
  • Trade-offs: Reliability is a trade-off, not an absolute. Balance it against velocity and cost.

Google SRE Golden Rules:

  • 50% Rule: SREs should spend no more than 50% of time on operational toil
  • Two Pizza Teams: Teams should be small enough to be fed with two pizzas (a rule popularized by Amazon rather than Google SRE)
  • Automate First: Always consider automation before adding headcount
  • Monitor What Matters: Focus on user-facing metrics, not just infrastructure

Assessment

Test your understanding with these questions:

1. What is the relationship between SLO and error budget?

2. Which of these is NOT one of the four golden signals?

3. If your SLO is 99.9% availability, how many minutes of downtime are allowed per month?

4. What Prometheus query calculates the 99th percentile latency?

5. According to Google SRE, what percentage of time should engineers spend on toil?

Answer Key: 1-B, 2-E (Throughput is the same as Traffic), 3-B, 4-B, 5-B
