SRE Fundamentals

Master Site Reliability Engineering best practices: Define SLOs, implement error budgets, set up monitoring with Prometheus and Grafana, and build resilient systems.

⏱️ 45 minutes 📊 Intermediate 📝 5 steps 🏷️ IT Operations

Prerequisites

Before starting this tutorial, ensure you have:

Basic Linux command-line knowledge
Understanding of web services and HTTP protocols
Familiarity with containerization concepts (Docker)
Basic knowledge of monitoring concepts
A development environment or cloud account (AWS/GCP/Azure)

Learning Objectives

By the end of this tutorial, you will be able to:

Define and implement Service Level Indicators (SLIs)
Set appropriate Service Level Objectives (SLOs)
Calculate and manage error budgets
Set up Prometheus for metrics collection
Create Grafana dashboards for monitoring
Configure alerts based on SLO burn rates

Step-by-Step Guide

1. Define Service Level Indicators (SLIs)

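SLIs are quantitative measurements of service quality. A common starting point is the four golden signals (latency, traffic, errors, and saturation). The SLIs below cover them, with availability capturing errors and throughput capturing traffic.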

Key SLIs to Track:

Availability: Percentage of successful requests

# Example: Calculate availability from a Prometheus text-format /metrics endpoint
# Sums the counter values (last field on each sample line) in a single pass,
# assuming the endpoint emits no trailing timestamps
curl -s https://api.service.com/metrics | awk '
  /^http_requests_total[{ ]/ { total += $NF }
  /^http_requests_total[{].*status="200"/ { ok += $NF }
  END { if (total > 0) printf "Availability: %.4f%%\n", ok * 100 / total }'
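The same calculation can be sketched in Python, which makes the label handling explicit. This is a minimal sketch, not a full parser: it assumes simple sample lines with no trailing timestamps and no `}` inside label values, and it counts any 2xx status as successful. The function name `availability_percent` and the sample data are illustrative.

```python
import re

# Matches one sample line such as: http_requests_total{method="GET",status="200"} 1027
SAMPLE_RE = re.compile(r'^http_requests_total\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def availability_percent(metrics_text: str) -> float:
    """Return the share of 2xx requests, in percent, from exposition text."""
    good = 0.0
    total = 0.0
    for line in metrics_text.splitlines():
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # skip comments (# HELP / # TYPE) and other metrics
        labels = dict(LABEL_RE.findall(match.group("labels")))
        value = float(match.group("value"))
        total += value
        # Assumption: a request is "successful" when its status is 2xx
        if labels.get("status", "").startswith("2"):
            good += value
    if total == 0:
        raise ValueError("no http_requests_total samples found")
    return good / total * 100


# Illustrative scrape output, not real data
sample = """\
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/",status="200"} 9990
http_requests_total{method="GET",endpoint="/",status="500"} 10
"""
print(f"Availability: {availability_percent(sample):.2f}%")
```

With 9,990 successful requests out of 10,000, the sketch reports 99.90% availability.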

Latency: Response time percentiles (p50, p95, p99)

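# Prometheus queries for latency percentiles, estimated from histogram buckets
# (bucket bounds are in seconds: le="0.1" is 100ms, le="0.5" is 500ms, le="1" is 1 second)
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))  # p50
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))  # p95
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))  # p99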
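Percentiles are not stored directly in a Prometheus histogram; they are estimated from the cumulative bucket counts. The sketch below shows the idea with linear interpolation inside the bucket that contains the target rank. It is a simplified illustration written for this tutorial, not the Prometheus source, and the bucket counts are made up.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    The last bucket must be the +Inf bucket, as in Prometheus histograms.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least one finite bucket plus a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations, nothing to estimate
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies beyond the highest finite bound
                return prev_bound
            # Interpolate linearly inside the bucket that holds the rank
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Illustrative cumulative counts for le="0.1", "0.5", "1", "+Inf"
example = [(0.1, 900), (0.5, 980), (1.0, 995), (math.inf, 1000)]
for q in (0.50, 0.95, 0.99):
    print(f"p{int(q * 100)}: {histogram_quantile(q, example) * 1000:.0f} ms")
```

For these counts the p95 estimate is about 350 ms. The estimate can never be more precise than the bucket layout allows, which is why bucket bounds should sit close to the latency targets you care about.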

Throughput: Requests per second

# Prometheus query for throughput (requests per second, summed across all series)
sum(rate(http_requests_total[5m]))

Saturation: Resource utilization

# CPU utilization (fraction of non-idle time, averaged across cores per instance)
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Memory utilization
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes

2. Set Service Level Objectives (SLOs)

SLOs are target values for your SLIs. They should balance reliability with innovation velocity.

Recommended SLO Targets:

Service Tier	Availability SLO	Latency SLO (p99)
Critical (payments, auth)	99.99%	<500ms
High (core features)	99.9%	<1000ms
Standard (other features)	99.5%	<2000ms
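One way to make these targets actionable is to keep them as data and check measured SLIs against them. The following is a minimal sketch using the values from the table; the tier keys, the `Slo` class, and the `meets_slo` helper are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    availability_percent: float  # minimum acceptable availability
    p99_latency_ms: float        # p99 latency must stay below this


# Targets copied from the table above
SLO_TARGETS = {
    "critical": Slo(availability_percent=99.99, p99_latency_ms=500),
    "high": Slo(availability_percent=99.9, p99_latency_ms=1000),
    "standard": Slo(availability_percent=99.5, p99_latency_ms=2000),
}


def meets_slo(tier: str, availability_percent: float, p99_latency_ms: float) -> bool:
    """Return True when both measured SLIs satisfy the tier's targets."""
    target = SLO_TARGETS[tier]
    return (
        availability_percent >= target.availability_percent
        and p99_latency_ms < target.p99_latency_ms
    )


print(meets_slo("high", 99.95, 800))      # True: both targets met
print(meets_slo("critical", 99.95, 300))  # False: availability below 99.99%
```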

Implementing SLOs in Prometheus:

# Example: 99.9% availability SLO over 30 days
# This calculates the error rate
(
  sum(rate(http_requests_total{status=~"5.."}[30d]))
  /
  sum(rate(http_requests_total[30d]))
) < 0.001  # 0.1% error rate = 99.9% availability

3. Implement Error Budgets

Error budgets represent the acceptable amount of failure. They drive release velocity decisions.

Error Budget Calculation:

# Error Budget = 100% - SLO Target
# For 99.9% SLO: Error Budget = 0.1%

# Monthly error budget in minutes
# 99.9% availability = 43.2 minutes of downtime allowed per 30-day month
total_minutes_in_month=43200  # 30 days * 24 hours * 60 minutes
error_budget_minutes=$(echo "scale=2; $total_minutes_in_month * 0.001" | bc)
echo "Error budget: $error_budget_minutes minutes/month"
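The same arithmetic generalizes to any SLO target and window length. A minimal sketch, assuming a fixed 30-day window by default (the function name `error_budget_minutes` is illustrative):

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the allowed downtime in minutes for an availability SLO."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    window_minutes = window_days * 24 * 60
    # Error budget = 100% - SLO target, applied to the whole window
    return window_minutes * (100 - slo_percent) / 100


for slo in (99.5, 99.9, 99.99):
    print(f"{slo}% SLO: {error_budget_minutes(slo):.2f} minutes per 30 days")
```

Each additional nine shrinks the budget tenfold: roughly 43.2 minutes at 99.9% becomes about 4.3 minutes at 99.99%.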

Error Budget Burn Rate Alerting:

# Prometheus alerting rule for error budget burn rate (rules file, Prometheus 2.x YAML format)
# Alert if burning budget 28x faster than allowed
# (at that rate a 30-day budget would be exhausted in about 26 hours)
groups:
  - name: slo-alerts  # group name is illustrative
    rules:
      - alert: ErrorBudgetBurnRateHigh
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > (0.001 * 28)
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error budget burning too fast"
          description: "Current error ratio is {{ $value }} (threshold 0.028 for a 99.9% SLO)"
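Burn rate is the observed error ratio divided by the error ratio the SLO allows, and it determines how quickly the budget runs out. The sketch below works through that relationship, assuming a 30-day SLO window; both helper names are illustrative.

```python
def burn_rate(error_ratio: float, slo_percent: float) -> float:
    """Return how many times faster than allowed the budget is being spent."""
    allowed_error_ratio = 1 - slo_percent / 100
    if allowed_error_ratio <= 0:
        raise ValueError("a 100% SLO leaves no error budget")
    return error_ratio / allowed_error_ratio


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Return hours until a full budget is gone at a constant burn rate."""
    if rate <= 0:
        raise ValueError("burn rate must be positive")
    return window_days * 24 / rate


# A 2.8% error ratio against a 99.9% SLO is a burn rate of about 28
rate = burn_rate(0.028, 99.9)
print(f"Burn rate: {rate:.1f}x")
print(f"Budget exhausted in {hours_to_exhaustion(rate):.1f} hours")
```

At a burn rate of 28, a 30-day budget lasts about 25.7 hours. Exhausting it in exactly 24 hours would correspond to a burn rate of 30.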

Error Budget Policy:

0-50% consumed: Normal operations, releases allowed
50-80% consumed: Caution, require additional review
80-100% consumed: Freeze, no non-critical changes
>100% consumed: Incident, focus on reliability only
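The policy above can be encoded as a small lookup so that deployment tooling can consult it. This is a sketch under one assumption: the listed ranges share their boundaries (50% and 80%), so boundary values are assigned to the stricter tier here. The function name `budget_policy` is hypothetical.

```python
def budget_policy(consumed_percent: float) -> str:
    """Map error budget consumption (in percent) to the policy action."""
    if consumed_percent < 0:
        raise ValueError("consumed_percent cannot be negative")
    if consumed_percent > 100:
        return "Incident: focus on reliability only"
    if consumed_percent >= 80:
        return "Freeze: no non-critical changes"
    if consumed_percent >= 50:
        return "Caution: require additional review"
    return "Normal: releases allowed"


for consumed in (10, 50, 85, 120):
    print(f"{consumed}% consumed -> {budget_policy(consumed)}")
```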

4. Set Up Prometheus Monitoring

Prometheus is a widely used open-source system for collecting metrics and evaluating alerting rules.

Installation (Docker):

# Create prometheus.yml configuration
cat > prometheus.yml << 'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'api-service'
    static_configs:
      - targets: ['api:8000']  # port exposed by the instrumented application
    metrics_path: '/metrics'

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
EOF

# Run Prometheus with Docker
docker run -d \
  -p 9090:9090 \
  -v "$(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml" \
  -v prometheus_data:/prometheus \
  --name prometheus \
  prom/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/prometheus

Add Instrumentation to Your Application:

# Python example with prometheus_client
from http.server import BaseHTTPRequestHandler, HTTPServer

from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest

# Define metrics
REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])
REQUEST_DURATION = Histogram('http_request_duration_seconds', 'HTTP request duration')

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/metrics':
            # Serve all registered metrics in the Prometheus text format
            self.send_response(200)
            self.send_header('Content-Type', CONTENT_TYPE_LATEST)
            self.end_headers()
            self.wfile.write(generate_latest())
        else:
            # Your actual handler: time it, respond, then count the request
            with REQUEST_DURATION.time():
                # ... handle request ...
                self.send_response(200)
                self.end_headers()
                self.wfile.write(b'OK')
            REQUEST_COUNT.labels('GET', self.path, '200').inc()

if __name__ == '__main__':
    # Serve the application and its /metrics endpoint on port 8000
    HTTPServer(('0.0.0.0', 8000), MetricsHandler).serve_forever()

5. Create Grafana Dashboards

Grafana provides visualization for your Prometheus metrics.

Installation:

# Run Grafana with Docker
docker run -d \
  -p 3000:3000 \
  -v grafana_data:/var/lib/grafana \
  --name grafana \
  --link prometheus:prometheus \
  grafana/grafana

# Default login: admin / admin

Add Prometheus Data Source:

Log into Grafana (http://localhost:3000)
Go to Configuration → Data Sources (Connections → Data sources in recent Grafana versions)
Click "Add data source" → Prometheus
URL: http://prometheus:9090
Click "Save & Test"

Create SLO Dashboard Panels:

# Panel 1: Availability (Last 24h)
sum(rate(http_requests_total{status!~"5.."}[24h])) / sum(rate(http_requests_total[24h])) * 100

# Panel 2: Latency Percentiles
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) * 1000

# Panel 3: Error Budget Remaining (% of the 30-day budget for a 99.9% SLO)
(1 - (sum(rate(http_requests_total{status=~"5.."}[30d])) / sum(rate(http_requests_total[30d]))) / 0.001) * 100

# Panel 4: Request Rate
sum(rate(http_requests_total[5m]))
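The panel queries can also be evaluated outside Grafana through the Prometheus HTTP API, which is handy for scripts and reports. The sketch below uses the instant-query endpoint `/api/v1/query` and only the standard library. The helper names are illustrative, the base URL assumes the local Docker setup from this tutorial, and the canned response is made-up data used to show the response shape.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def first_value(response_body: str) -> float:
    """Extract the first sample value from an instant-query JSON response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    result = payload["data"]["result"]
    if not result:
        raise LookupError("query returned no series")
    _timestamp, value = result[0]["value"]  # value arrives as a string
    return float(value)


def instant_query(base_url: str, promql: str, timeout: float = 10.0) -> float:
    """Run a PromQL instant query against a live server (performs network I/O)."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return first_value(resp.read().decode("utf-8"))


# Illustrative response body, shaped like a vector result from /api/v1/query
canned = json.dumps({
    "status": "success",
    "data": {"resultType": "vector",
             "result": [{"metric": {}, "value": [1700000000, "99.95"]}]},
})
print(build_query_url("http://localhost:9090", "sum(rate(http_requests_total[5m]))"))
print(first_value(canned))
```

Against a running server, `instant_query("http://localhost:9090", "sum(rate(http_requests_total[5m]))")` would return the current request rate, provided the query yields at least one series.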

Pro Tip: Import the official "Prometheus SLO Dashboard" from Grafana.com (ID: 11753) for a production-ready starting point.

Best Practices

Key SRE Principles:

Automate Toil Away: Any repetitive manual task should be automated. If you do it three times, write a script.
Embrace Risk: Use error budgets to make data-driven decisions about release velocity vs. reliability.
Blameless Culture: Focus on system failures, not human errors. Ask "what allowed this to happen?" not "who caused this?"
Measure Everything: You can't improve what you don't measure. Instrument everything.
Trade-offs: Reliability is a trade-off, not an absolute. Balance it against velocity and cost.

Assessment

1. What is the relationship between SLO and error budget?

2. Which of these is NOT one of the four golden signals?

3. If your SLO is 99.9% availability, how many minutes of downtime are allowed per month?

4. What Prometheus query calculates the 99th percentile latency?

5. According to Google SRE, what percentage of time should engineers spend on toil?
