Site Reliability Engineering (SRE) and DevOps interviews mix system design, operations knowledge, and cultural questions. Google, Netflix, Uber, and most large tech companies have dedicated SRE teams. Interviewers look for engineering rigor in reliability, deep observability knowledge, and operational maturity (how you handle production incidents).
SLOs, SLIs, and Error Budgets
Definitions
| Term | Definition | Example |
|---|---|---|
| SLI (Service Level Indicator) | A metric measuring one aspect of service quality | Request success rate = successful_requests / total_requests |
| SLO (Service Level Objective) | A target value for an SLI over a time window | Success rate ≥ 99.9% over a rolling 28-day window |
| SLA (Service Level Agreement) | Contractual commitment to customers, with consequences for violation | 99.9% uptime or 10% credit refund |
| Error Budget | The unreliability allowed before the SLO is breached (1 − SLO) | 99.9% SLO ⇒ 43.2 min of downtime per 30-day month |
# Error budget calculation
slo_target = 0.999 # 99.9% success rate
monthly_minutes = 30 * 24 * 60 # 43,200 minutes
error_budget_minutes = (1 - slo_target) * monthly_minutes
# = 0.001 * 43200 = 43.2 minutes per month
# If the service had 30 minutes of downtime so far this month:
budget_remaining = 43.2 - 30 # = 13.2 minutes remaining
budget_burned_pct = 30 / 43.2 # = 69.4% burned
# Policy: if error budget is 50%+ burned:
# - No new feature releases for the rest of the month
# - Focus engineering effort on reliability
Common SLIs
- Availability: fraction of requests served successfully (e.g. HTTP 5xx rate below a threshold)
- Latency: P99 request latency < 200ms
- Throughput: requests per second successfully served
- Error rate: fraction of requests that result in errors
- Freshness: data updated within N seconds of source change (relevant for pipelines)
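These SLIs reduce to ratios of good events over valid events. A minimal sketch in Python (function names are illustrative, not from any particular library):

```python
def availability_sli(successful: int, total: int) -> float:
    """Fraction of requests served successfully (good events / valid events)."""
    return successful / total if total else 1.0

def latency_sli(latencies_ms: list[float], threshold_ms: float = 200.0) -> float:
    """Fraction of requests faster than the latency threshold."""
    if not latencies_ms:
        return 1.0
    fast = sum(1 for d in latencies_ms if d < threshold_ms)
    return fast / len(latencies_ms)

# Example: 9,990 of 10,000 requests succeeded -> 99.9% availability SLI
print(availability_sli(9_990, 10_000))  # 0.999
```

Note that a latency SLI expressed this way ("fraction of requests under 200ms") composes directly into an SLO target, unlike a raw P99 number.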
Incident Response
Incident Lifecycle
DETECT → TRIAGE → MITIGATE → RESOLVE → POSTMORTEM
DETECT:
- Alerting fires (PagerDuty, Opsgenie)
- User reports (support tickets, social media)
- Automated synthetic monitoring
- Customer success team
TRIAGE (first 5 minutes — critical):
1. Identify: What is broken? What is the user impact?
2. Scope: How many users affected? Which regions?
3. Declare severity: P0 (all customers down) → P3 (minor issue)
4. Assemble incident commander + communicator
MITIGATE (minimize impact immediately):
- Roll back the most recent deployment first (usually the fastest, lowest-risk fix)
- Reroute traffic away from bad region
- Scale up (if capacity issue)
- Feature flag off (if specific feature is failing)
Goal: restore service FIRST, root cause SECOND
RESOLVE:
- Service back to normal
- Monitoring stable for 30 minutes
- Close incident in system
POSTMORTEM (within 48 hours):
- Blameless — focus on systems, not individuals
- Timeline of events
- Root cause analysis
- Action items with owners and due dates
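The severity call in TRIAGE can be made mechanical so on-call does not improvise under pressure. A sketch with illustrative thresholds (real cutoffs vary by company and are not standardized):

```python
def declare_severity(pct_users_affected: float, core_flow_down: bool) -> str:
    """Map incident impact to a severity level.
    Thresholds here are illustrative examples, not an industry standard."""
    if core_flow_down or pct_users_affected >= 50:
        return "P0"  # all (or most) customers down, or a core flow is broken
    if pct_users_affected >= 10:
        return "P1"  # major degradation
    if pct_users_affected >= 1:
        return "P2"  # significant but contained
    return "P3"      # minor issue

print(declare_severity(5.0, core_flow_down=False))  # P2
```

Encoding the policy this way also makes it reviewable in the postmortem: if the severity was wrong, you fix the thresholds, not the person.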
On-Call Best Practices
Alert quality rules:
1. Every alert must be actionable — if on-call cannot do anything, delete the alert
2. Alert on symptoms (high error rate, slow responses), not causes (high CPU)
3. Alert on SLO burn rate, not individual metric thresholds
4. Require a sustained breach: P99 latency > 500ms for 5 minutes, not a single spike
On-call health:
- Target: < 2 pages/shift (more = toil, unsustainable)
- On-call rotation: 1 week on, N weeks off (where N = team size - 1)
- Shadow rotations: new engineers shadow before primary on-call
- Runbooks: documented response for every alert
# SLO burn-rate alerting (multi-window, multi-burn-rate)
# Burn rate = observed error rate / (1 - SLO); 1x means the budget
# lasts exactly one SLO window.
# Fast burn: 14.4x over 1 hour AND 5 minutes
# (at 14.4x, a 30-day budget is exhausted in ~2 days; requiring the
#  short window too lets the alert clear quickly once the problem stops)
alert: error_rate > 14.4 * (1 - slo)
  over both 1 hour and 5 minutes
# Slow burn: 6x over 6 hours AND 30 minutes
# (catches gradual degradation before it quietly consumes the budget)
alert: error_rate > 6 * (1 - slo)
  over both 6 hours and 30 minutes
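The same check can be sketched in code. The per-window error rates would come from your metrics backend; function names are illustrative, and 14.4x is the commonly cited fast-burn threshold:

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than sustainable the error budget is burning.
    1.0 means the budget lasts exactly one SLO window; 14.4 means a
    30-day budget is gone in about 2 days."""
    return error_rate / (1 - slo)

def should_page(long_rate: float, short_rate: float, slo: float,
                threshold: float = 14.4) -> bool:
    """Multi-window check: BOTH the long and short windows must exceed
    the threshold, so the page resolves soon after the problem stops."""
    return (burn_rate(long_rate, slo) > threshold and
            burn_rate(short_rate, slo) > threshold)

# 1.5% errors against a 99.9% SLO burns budget at 15x -> page
print(should_page(0.015, 0.015, slo=0.999))  # True
```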
CI/CD Pipeline Design
Commit to Production Pipeline:
1. Code pushed to feature branch
2. [CI] Run: lint, unit tests, integration tests
Duration target: < 5 minutes (fast feedback)
3. Code review (1-2 approvals)
4. Merge to main
5. [CD Build] Build and push Docker image to registry
6. [Deploy to Staging]
- Run smoke tests
- Run E2E tests
Duration: ~15 minutes
7. [Deploy to Production]
- Blue/green or canary deployment
- Monitor error rate and latency for 15 minutes
- Auto-rollback if SLO burn rate exceeds threshold
Deployment strategies:
Rolling: replace pods one at a time (zero downtime, slow)
Blue/Green: two identical environments; switch traffic instantly (fast, expensive)
Canary: send 5% traffic to new version, ramp up if healthy (safest)
Feature flags: deploy code but control activation separately (most flexible)
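A canary rollout is essentially a loop that widens traffic only while health checks pass. A sketch under assumptions (the `healthy` callback and the ramp percentages are illustrative, not from a specific deployment tool):

```python
def run_canary(healthy, steps=(5, 25, 50, 100)) -> bool:
    """Ramp canary traffic through the given percentages.
    `healthy` inspects error rate / latency after each step; any
    failure aborts the rollout, which should trigger a rollback."""
    for pct in steps:
        # in a real system: update load-balancer weights to send pct% here,
        # then wait out the observation window before checking health
        if not healthy(pct):
            return False  # abort: route all traffic back to stable
    return True           # canary promoted to 100%

# Example: health check fails once traffic exceeds 25%
print(run_canary(lambda pct: pct <= 25))  # False
```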
Observability: Logs, Metrics, Traces
The Three Pillars of Observability:
METRICS (aggregated numerical data):
- Prometheus + Grafana
- RED method: Rate, Error, Duration per service
- USE method: Utilization, Saturation, Errors per resource (CPU, memory, disk)
- Alert on: error rate, P99 latency, request rate drop, memory saturation
LOGS (structured event records):
- ELK stack (Elasticsearch, Logstash, Kibana) or Loki + Grafana
- Structured logging (JSON): always include request_id, user_id, service, level
- Sampling for high-volume services (100% on errors, 1% on success)
- Correlate with trace_id for distributed tracing
TRACES (distributed request flow):
- Jaeger, Zipkin, AWS X-Ray, Datadog APM
- Spans record individual operations; a trace stitches them together across service boundaries
- Critical for diagnosing: "the request was slow, but WHICH service?"
- Sampling: head-based (decide at start) or tail-based (decide after completion)
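The sampling policy above (keep every error, keep a small fraction of successes) can be sketched as a head-based sampler; the 1% success rate mirrors the logging guidance earlier, and the `rng` parameter exists only to make the sketch testable:

```python
import random

def sample_trace(is_error: bool, success_rate: float = 0.01,
                 rng=random.random) -> bool:
    """Head-based sampling: decide at request start whether to keep
    the trace. Always keep errors; keep a fixed fraction of successes.
    Tail-based sampling would instead defer this decision until the
    request completes (e.g. to keep all slow requests)."""
    if is_error:
        return True
    return rng() < success_rate
```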
# Example structured log entry
{
"timestamp": "2026-01-15T12:34:56.789Z",
"level": "ERROR",
"service": "order-service",
"request_id": "req_abcd1234",
"trace_id": "trace_xyz789",
"user_id": "user_123",
"message": "Payment service timeout",
"duration_ms": 5023,
"endpoint": "POST /v1/orders"
}
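An entry like this can be emitted with the standard library alone. A minimal sketch (field names follow the example above; a real service would route this through its logging framework rather than `print`):

```python
import json
from datetime import datetime, timezone

def log_event(level: str, message: str, **fields) -> str:
    """Emit one structured (JSON) log line; returns the line as well."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "message": message,
        **fields,  # request_id, trace_id, user_id, duration_ms, ...
    }
    line = json.dumps(entry)
    print(line)
    return line

line = log_event("ERROR", "Payment service timeout",
                 service="order-service", request_id="req_abcd1234",
                 trace_id="trace_xyz789", duration_ms=5023)
```

Because every line is valid JSON with consistent keys, log search ("all ERROR lines for trace_xyz789") becomes a field query instead of a regex.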
Chaos Engineering
Chaos engineering: deliberately inject failures to find weaknesses before real incidents
Netflix Chaos Monkey approach:
- Randomly kill production instances (Chaos Monkey)
- Test regional failover (Chaos Kong — simulate the loss of an entire region)
- Inject latency between services (Latency Monkey)
- Simulate network packet loss
Game Day exercises:
1. Define hypothesis: "If the payments service is degraded for 5s,
orders queue and recover within 60s without data loss"
2. Define abort criteria: stop if error rate > 5% for > 2 min
3. Run experiment in production (small blast radius first)
4. Measure: did the system behave as expected?
5. Fix any surprises found
Tools: Gremlin, Chaos Toolkit, AWS Fault Injection Simulator
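The abort criterion in step 2 is just a guard evaluated while the experiment runs. A sketch under assumptions (the injection and measurement callbacks, and the one-tick-per-minute framing, are illustrative):

```python
def run_experiment(inject, observe_error_rate, ticks: int,
                   abort_threshold: float = 0.05, abort_after: int = 2) -> str:
    """Run a chaos experiment for `ticks` intervals (e.g. minutes),
    aborting if the error rate stays above the threshold for
    `abort_after` consecutive ticks (mirrors: error rate > 5% for > 2 min)."""
    inject()  # start the failure injection (kill instance, add latency, ...)
    breached = 0
    for _ in range(ticks):
        if observe_error_rate() > abort_threshold:
            breached += 1
            if breached >= abort_after:
                return "aborted"  # stop injection, restore service
        else:
            breached = 0          # breach must be consecutive
    return "completed"

# Example: error rate holds at 1% -> experiment runs to completion
print(run_experiment(lambda: None, lambda: 0.01, ticks=5))  # completed
```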
Common SRE Interview Questions
- Walk me through a major outage you handled: Use the DETECT-TRIAGE-MITIGATE-RESOLVE-POSTMORTEM framework. Focus on decision-making and what you learned
- How do you set an SLO? Start with what users care about (latency, availability). Set it slightly above your current actual reliability so it is achievable. Measure for 3 months, then tighten. Never set an SLO you cannot currently meet
- What makes a good alert? Actionable (on-call knows what to do), urgent (needs immediate attention), tested (has a runbook), correlated (does not fire on normal traffic spikes)
- How do you reduce toil? Automate any manual, repetitive, scalable-with-load operational work. Toil is bad: it does not improve the service, it drains engineers, and it scales with traffic (not engineering effort)
Frequently Asked Questions
What is the difference between an SLI, SLO, and SLA?
SLI (Service Level Indicator) is a quantitative measure of service behavior — a number you can calculate from your metrics: request success rate, p99 latency, availability percentage. SLO (Service Level Objective) is an internal target you set for an SLI: "99.9% of requests complete in under 200ms over a 30-day rolling window." The SLO is what your team commits to maintain. SLA (Service Level Agreement) is a contractual commitment to customers, usually less strict than your SLO (if your SLO is 99.9%, your SLA might be 99.5% so you have buffer). When you breach an SLA, there are financial penalties — refunds, credits. When you breach an SLO, you spend your error budget and freeze feature work to focus on reliability. The error budget is 1 − SLO: at 99.9% SLO, you have about 43 minutes of downtime per month to spend (43.2 minutes over a 30-day window).
What does a good incident response process look like?
A well-run incident follows five phases: (1) DETECT — alert fires from monitoring (latency spike, error rate increase, synthetic monitor failure). Mean Time to Detect (MTTD) should be under 5 minutes for P1 incidents. (2) TRIAGE — on-call engineer assesses severity (P1=business-critical, P2=significant degradation, P3=minor). Severity determines response team size and escalation path. (3) MITIGATE — stop the bleeding first, root cause later. Revert the recent deploy, enable a feature flag, shift traffic to a healthy region. Mean Time to Resolve (MTTR) is what you optimize. (4) RESOLVE — service is fully restored, incident is closed in your tracker. (5) POSTMORTEM — blameless write-up within 48 hours: timeline, root cause, contributing factors, action items with owners and deadlines. The postmortem is how incidents prevent future incidents.
What are the four golden signals of monitoring?
Google SRE introduced the four golden signals: (1) Latency — the time it takes to service a request. Track both successful and failed request latency separately (failed requests may be fast but mask real problems). Use percentiles (p50, p95, p99) not averages. (2) Traffic — demand on your system: requests per second, active users, database queries per second. Traffic is context for all other signals. (3) Errors — the rate of requests that fail explicitly (5xx), implicitly (200 with wrong data), or by policy (SLO violation). (4) Saturation — how full your service is: CPU %, memory %, disk I/O queue depth, connection pool utilization. Saturation predicts problems before they cause latency or errors — when CPU hits 80%, latency spikes are imminent. Instrument all four signals for every service tier (load balancer, API, database, cache).