SRE Fundamentals
Master Site Reliability Engineering best practices: Define SLOs, implement error budgets, set up monitoring with Prometheus and Grafana, and build resilient systems.
Prerequisites
Before starting this tutorial, ensure you have:
- Basic Linux command-line knowledge
- Understanding of web services and HTTP protocols
- Familiarity with containerization concepts (Docker)
- Basic knowledge of monitoring concepts
- A development environment or cloud account (AWS/GCP/Azure)
Learning Objectives
By the end of this tutorial, you will be able to:
- Define and implement Service Level Indicators (SLIs)
- Set appropriate Service Level Objectives (SLOs)
- Calculate and manage error budgets
- Set up Prometheus for metrics collection
- Create Grafana dashboards for monitoring
- Configure alerts based on SLO burn rates
Step-by-Step Guide
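1. Define Service Level Indicators (SLIs)
SLIs are quantitative measurements of service quality. A common starting point is Google's four golden signals (latency, traffic, errors, and saturation). The SLIs below map onto them: availability covers errors, and throughput covers traffic.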
Key SLIs to Track:
- Availability: Percentage of successful requests
# Example: calculate availability from a Prometheus text-format metrics endpoint
metrics=$(curl -s https://api.service.com/metrics)
# Sum the sample values (last field) of each series; counting lines with wc -l would not give request counts
successful_requests=$(echo "$metrics" | awk '/^http_requests_total[{ ]/ && /status="200"/ {s += $NF} END {print s + 0}')
total_requests=$(echo "$metrics" | awk '/^http_requests_total[{ ]/ {t += $NF} END {print t + 0}')
availability=$(echo "scale=4; $successful_requests * 100 / $total_requests" | bc)
echo "Availability: $availability%"
- Latency: Response time percentiles (p50, p95, p99)
# Prometheus histogram buckets from which latency percentiles are estimated
http_request_duration_seconds_bucket{le="0.1"} # requests completed within 100ms
http_request_duration_seconds_bucket{le="0.5"} # requests completed within 500ms
http_request_duration_seconds_bucket{le="1"} # requests completed within 1 second
- Throughput: Requests per second
# Prometheus query for throughput
rate(http_requests_total[5m])
- Saturation: Resource utilization
# CPU utilization (averaged across cores, per instance)
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
# Memory utilization
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes
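To make the SLI definitions concrete, here is a minimal Python sketch that derives availability and a latency percentile from a list of request records. The `Request` record and the function names are hypothetical, "successful" is assumed to mean "not a 5xx response", and the nearest-rank percentile is just one simple method (Prometheus itself estimates percentiles from histogram buckets).

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record: HTTP status and latency in seconds."""
    status: int
    duration_s: float


def availability_pct(requests: list[Request]) -> float:
    """Percentage of requests that did not fail with a 5xx status."""
    if not requests:
        raise ValueError("no requests observed")
    good = sum(1 for r in requests if r.status < 500)
    return 100.0 * good / len(requests)


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


# 97 fast successes, 2 slow server errors, 1 very slow success
sample = [Request(200, 0.05)] * 97 + [Request(500, 0.9)] * 2 + [Request(200, 1.5)]
print(availability_pct(sample))                        # 98.0
print(percentile([r.duration_s for r in sample], 99))  # 0.9
```

In this sample the two 5xx responses put availability at 98%, and the single 1.5 s outlier sits above the p99, which is why percentiles are usually preferred over the maximum or the mean for latency SLIs.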
2. Set Service Level Objectives (SLOs)
SLOs are target values for your SLIs. They should balance reliability with innovation velocity.
Recommended SLO Targets:
| Service Tier | Availability SLO | Latency SLO (p99) |
|---|---|---|
| Critical (payments, auth) | 99.99% | <500ms |
| High (core features) | 99.9% | <1000ms |
| Standard (other features) | 99.5% | <2000ms |
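As an illustration, the targets in the table can be encoded as data and checked against measured SLIs. This is a sketch only: the tier keys, the `SLO` class, and `meets_slo` are hypothetical names, and the comparison operators assume availability is "at least" and latency is "strictly below" the target, as the table's `<` suggests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    availability_pct: float  # minimum availability, in percent
    p99_latency_ms: float    # p99 latency must stay below this, in milliseconds


# Targets copied from the table above; the tier keys are hypothetical identifiers.
SLO_TARGETS = {
    "critical": SLO(99.99, 500),
    "high": SLO(99.9, 1000),
    "standard": SLO(99.5, 2000),
}


def meets_slo(tier: str, availability_pct: float, p99_latency_ms: float) -> bool:
    """Return True when both measured SLIs satisfy the tier's targets."""
    slo = SLO_TARGETS[tier]
    return (availability_pct >= slo.availability_pct
            and p99_latency_ms < slo.p99_latency_ms)


print(meets_slo("high", 99.95, 420))      # True
print(meets_slo("critical", 99.95, 420))  # False: availability is below 99.99%
```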
Implementing SLOs in Prometheus:
# Example: 99.9% availability SLO over 30 days
# This calculates the error rate
(
sum(rate(http_requests_total{status=~"5.."}[30d]))
/
sum(rate(http_requests_total[30d]))
) < 0.001 # 0.1% error rate = 99.9% availability
3. Implement Error Budgets
Error budgets represent the acceptable amount of failure. They drive release velocity decisions.
Error Budget Calculation:
# Error Budget = 100% - SLO Target
# For 99.9% SLO: Error Budget = 0.1%
# Monthly error budget in minutes
# 99.9% availability = 43.2 minutes of downtime allowed per 30-day month
total_minutes_in_month=43200 # 30 days * 24 hours * 60 minutes
error_budget_minutes=$(echo "scale=2; $total_minutes_in_month * 0.001" | bc)
echo "Error budget: $error_budget_minutes minutes/month"
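The same arithmetic can be generalized to any SLO target. The sketch below assumes a 30-day window by default and treats the budget as minutes of complete downtime; `error_budget_minutes` is a hypothetical helper name.

```python
def error_budget_minutes(slo_pct: float, window_days: int = 30) -> float:
    """Minutes of full downtime permitted by an availability SLO over the window."""
    if not 0 < slo_pct < 100:
        raise ValueError("slo_pct must be between 0 and 100 (exclusive)")
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_pct) / 100


for target in (99.5, 99.9, 99.99):
    print(f"{target}% -> {error_budget_minutes(target):.2f} minutes per 30 days")
```

For the three tiers used in this tutorial this gives roughly 216, 43.2, and 4.32 minutes per 30 days, which shows how quickly each extra "nine" shrinks the budget.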
Error Budget Burn Rate Alerting:
# Prometheus alerting rule for error budget burn rate
# (rules-file YAML, loaded through rule_files in prometheus.yml; the group name is arbitrary)
# Alert if burning budget 28x faster than allowed (a 30-day budget would be exhausted in about 26 hours)
groups:
  - name: slo-alerts
    rules:
      - alert: ErrorBudgetBurnRateHigh
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > (0.001 * 28)
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error budget burning too fast"
          description: "Current 5xx error ratio is {{ $value }} (threshold 0.028, i.e. 28x the budgeted rate)"
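The relationship between error ratio, burn rate, and time to exhaustion can be worked through in a few lines. This sketch assumes a 30-day SLO window and a constant error ratio; `burn_rate` and `hours_to_exhaustion` are hypothetical helper names.

```python
def burn_rate(error_ratio: float, slo_pct: float) -> float:
    """How many times faster than budgeted the error budget is being consumed."""
    budget_ratio = (100 - slo_pct) / 100  # e.g. 0.001 for a 99.9% SLO
    return error_ratio / budget_ratio


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until the whole window's budget is spent if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate


rate = burn_rate(error_ratio=0.028, slo_pct=99.9)
print(f"burn rate {rate:.1f}x, budget gone in {hours_to_exhaustion(rate):.1f} h")
```

With a 99.9% SLO, a sustained 2.8% error ratio is a burn rate of about 28, which spends a 30-day budget in roughly 25.7 hours; a burn rate of 30 corresponds to exactly 24 hours.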
Error Budget Policy:
- 0-50% consumed: Normal operations, releases allowed
- 50-80% consumed: Caution, require additional review
- 80-100% consumed: Freeze, no non-critical changes
- >100% consumed: Incident, focus on reliability only
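The policy above can be expressed as a small lookup, for example to annotate a dashboard or gate a deployment pipeline. This is a sketch: `budget_policy` is a hypothetical name, and because the listed ranges share their endpoints, it assumes the boundary values 50% and 80% fall into the stricter stage.

```python
def budget_policy(consumed_pct: float) -> str:
    """Map the share of error budget consumed (in percent) to a policy stage."""
    if consumed_pct < 0:
        raise ValueError("consumed_pct cannot be negative")
    if consumed_pct > 100:
        return "incident: focus on reliability only"
    if consumed_pct >= 80:  # assumption: boundary belongs to the stricter stage
        return "freeze: no non-critical changes"
    if consumed_pct >= 50:  # assumption: boundary belongs to the stricter stage
        return "caution: require additional review"
    return "normal: releases allowed"


for consumed in (10, 65, 95, 120):
    print(consumed, "->", budget_policy(consumed))
```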
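4. Set Up Prometheus Monitoring
Prometheus is a widely used open-source metrics system for SRE work. It scrapes metrics from HTTP endpoints at a fixed interval, stores them as time series, and evaluates alerting rules against them.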
Installation (Docker):
# Create prometheus.yml configuration
cat > prometheus.yml << 'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'api-service'
    static_configs:
      - targets: ['api:8080']
    metrics_path: '/metrics'
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
EOF
# Run Prometheus with Docker
docker run -d \
  -p 9090:9090 \
  -v "$(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml" \
  -v prometheus_data:/prometheus \
  --name prometheus \
  prom/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/prometheus
Add Instrumentation to Your Application:
# Python example with prometheus_client
from http.server import BaseHTTPRequestHandler, HTTPServer

from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest

# Define metrics
REQUEST_COUNT = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])
REQUEST_DURATION = Histogram('http_request_duration_seconds', 'HTTP request duration')


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/metrics':
            # Serve all registered metrics in the Prometheus text format
            self.send_response(200)
            self.send_header('Content-Type', CONTENT_TYPE_LATEST)
            self.end_headers()
            self.wfile.write(generate_latest())
        else:
            # Your actual handler, timed so the latency histogram is populated
            with REQUEST_DURATION.time():
                # ... handle request ...
                self.send_response(200)
                self.end_headers()
            REQUEST_COUNT.labels('GET', self.path, '200').inc()


# Serve the app and its metrics on port 8080, matching the 'api-service' scrape target
HTTPServer(('0.0.0.0', 8080), MetricsHandler).serve_forever()
5. Create Grafana Dashboards
Grafana provides visualization for your Prometheus metrics.
Installation:
# Run Grafana with Docker
docker run -d \
  -p 3000:3000 \
  -v grafana_data:/var/lib/grafana \
  --name grafana \
  --link prometheus:prometheus \
  grafana/grafana
# Default login: admin / admin
Add Prometheus Data Source:
- Log into Grafana (http://localhost:3000)
- Go to Configuration → Data Sources
- Click "Add data source" → Prometheus
- URL: http://prometheus:9090
- Click "Save & Test"
Create SLO Dashboard Panels:
# Panel 1: Availability (Last 24h)
sum(rate(http_requests_total{status!~"5.."}[24h])) / sum(rate(http_requests_total[24h])) * 100
# Panel 2: Latency Percentiles (p99, converted from seconds to milliseconds)
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) * 1000
# Panel 3: Error Budget Remaining (% of the 30-day budget for a 99.9% SLO, where the budget is 0.001)
(1 - (sum(rate(http_requests_total{status=~"5.."}[30d])) / sum(rate(http_requests_total[30d]))) / 0.001) * 100
# Panel 4: Request Rate
sum(rate(http_requests_total[5m]))
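To show what `histogram_quantile` does with the bucket counters, here is a simplified Python sketch of the linear interpolation it performs. It is an approximation of the idea rather than Prometheus's exact implementation, and the bucket counts in the example are made up for illustration.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    The highest bucket must have upper bound +Inf, as in the Prometheus
    exposition format. Observations are assumed to be spread evenly
    inside each bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0 or not math.isinf(buckets[-1][0]):
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Quantile lies beyond the highest finite bucket boundary
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Hypothetical cumulative counts for le="0.1", "0.5", "1", "+Inf"
example = [(0.1, 900), (0.5, 980), (1.0, 995), (math.inf, 1000)]
print(f"p99 is about {histogram_quantile(0.99, example):.2f} s")
```

In this example the p99 falls in the 0.5 s to 1 s bucket and is estimated at about 0.83 s. The estimate can never be more precise than the bucket boundaries, so it helps to place boundaries near the latency SLO thresholds.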
Pro Tip: Import the official "Prometheus SLO Dashboard" from Grafana.com (ID: 11753) for a production-ready starting point.
Best Practices
Key SRE Principles:
- Automate Toil Away: Any repetitive manual task should be automated. If you do it three times, write a script.
- Embrace Risk: Use error budgets to make data-driven decisions about release velocity vs. reliability.
- Blameless Culture: Focus on system failures, not human errors. Ask "what allowed this to happen?" not "who caused this?"
- Measure Everything: You can't improve what you don't measure. Instrument everything.
- Trade-offs: Reliability is a trade-off, not an absolute. Balance it against velocity and cost.
Google SRE Golden Rules:
- 50% Rule: SREs should spend no more than 50% of time on operational toil
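- Two Pizza Teams: Teams should be small enough to be fed with two pizzas (a rule of thumb that originated at Amazon, not in Google's SRE practice, but often applied alongside it)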
- Automate First: Always consider automation before adding headcount
- Monitor What Matters: Focus on user-facing metrics, not just infrastructure
Assessment
Test your understanding with these questions:
1. What is the relationship between SLO and error budget?
2. Which of these is NOT one of the four golden signals?
3. If your SLO is 99.9% availability, how many minutes of downtime are allowed per month?
4. What Prometheus query calculates the 99th percentile latency?
5. According to Google SRE, what percentage of time should engineers spend on toil?
Answer Key: 1-B, 2-E (Throughput is the same as Traffic), 3-B, 4-B, 5-B
Resources
Official Documentation:
- 📖 Google SRE Book - The definitive guide to Site Reliability Engineering
- 📊 Prometheus Documentation - Complete Prometheus reference
- 📈 Grafana Documentation - Visualization and dashboards
- 🔍 OpenTelemetry - Observability standards
Further Learning:
- SRE Book - Table of Contents
- Site Reliability Engineering (O'Reilly)
- SLO Calculator
- Grafana Dashboard Library