Question 1

What is the difference between liveness and readiness probes in Kubernetes?

Accepted Answer

Liveness probe: answers "is the process alive and not permanently broken?" Failure causes the kubelet to kill and restart the container. Use it to detect deadlocks, infinite loops, or corrupted state that only a process restart can clear. It should be cheap: just check that the event loop responds. Never check external dependencies in liveness.

Readiness probe: answers "should this instance receive traffic right now?" Failure removes the pod from the Service endpoints, so new requests stop routing to it, but does NOT restart the container. Use it for dependency health: database connection, required warm-up, an open circuit breaker. When the DB recovers, readiness passes again and traffic resumes with no restart needed.

The key distinction: liveness is about the process itself; readiness is about whether the process can serve traffic right now.
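To make the split concrete, here is a minimal sketch of the two handlers as plain functions that return a status code and a body. The function names and the shape of the `checks` mapping are assumptions for illustration, not part of any framework; wire them into whatever HTTP server you use.

```python
import json
from typing import Callable, Dict, Tuple


def liveness() -> Tuple[int, str]:
    """Cheap check: if this code runs, the process can still answer a request.

    Deliberately touches no database, cache, or other dependency.
    """
    return 200, json.dumps({"status": "ok"})


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Return 200 only if every dependency check passes, otherwise 503.

    `checks` maps a dependency name to a zero-argument callable (hypothetical
    helpers such as a DB ping). A check that raises counts as a failure.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    ready = all(results.values())
    body = {"status": "ready" if ready else "not_ready", "checks": results}
    return (200 if ready else 503), json.dumps(body)
```

With this shape, a database outage turns readiness into a 503 while liveness keeps returning 200, so the pod is taken out of rotation but not restarted.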

Question 2

Why should liveness probes never check database connectivity?

Accepted Answer

If liveness checks the database and the database goes down, every pod's liveness probe fails at the same time and Kubernetes restarts all of them. The pods come back up, check the database (still down), fail liveness again, and restart again. The restart loop lasts as long as the outage, fixes nothing (restarting the app cannot repair the DB), and can delay recovery after the DB returns, because containers may be sitting in restart backoff or still starting up. This is sometimes called a liveness probe cascade failure.

The correct design: liveness checks only that the process itself is responsive (a trivial in-process check, or just HTTP 200 from the event loop). Readiness checks the DB: when the DB is down, readiness fails and traffic stops routing to the pod, but the pod stays running and recovers automatically when the DB comes back online.
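The difference between the two designs can be shown with a toy simulation of one pod over a series of probe ticks. This is a simplified model under stated assumptions: one probe per tick, a restarted container is probed again on the next tick, and restart backoff and startup time are ignored. The function name and parameters are made up for the example.

```python
from typing import Iterable, Tuple


def simulate_outage(
    db_up_by_tick: Iterable[bool],
    liveness_checks_db: bool,
    failure_threshold: int = 3,
) -> Tuple[int, int]:
    """Return (restarts, unready_ticks) for one pod.

    Readiness depends on the DB in both designs. Liveness depends on the DB
    only when `liveness_checks_db` is True (the design to avoid).
    """
    restarts = 0
    unready_ticks = 0
    consecutive_failures = 0
    for db_up in db_up_by_tick:
        if not db_up:
            unready_ticks += 1  # pod is out of rotation either way
        liveness_ok = db_up or not liveness_checks_db
        if liveness_ok:
            consecutive_failures = 0
            continue
        consecutive_failures += 1
        if consecutive_failures >= failure_threshold:
            restarts += 1
            consecutive_failures = 0  # fresh container starts counting again


    return restarts, unready_ticks


# Two healthy ticks, a nine-tick DB outage, then recovery.
timeline = [True] * 2 + [False] * 9 + [True] * 2
coupled = simulate_outage(timeline, liveness_checks_db=True)
decoupled = simulate_outage(timeline, liveness_checks_db=False)
```

In both runs the pod is unready for the nine outage ticks, but only the coupled design restarts the container, and it does so repeatedly without helping the database recover.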

Question 3

How do you implement graceful shutdown using health check probes?

Accepted Answer

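On SIGTERM:

(1) Immediately set a flag that causes readiness to return 503. Anything that polls the readiness endpoint takes the pod out of rotation after failureThreshold consecutive failures, which is about 5 seconds with periodSeconds=5 and failureThreshold=1.
(2) Wait for in-flight requests to drain (sleep 15-30 seconds, or poll the active request count until it reaches zero).
(3) Close DB connections and other resources.
(4) Exit.

The readiness probe does the signaling work: once the failing result is observed, no new requests are routed to the pod. The drain period covers requests that started before the endpoint was removed, plus any that arrive while the removal is still propagating. Keep the total drain time below terminationGracePeriodSeconds (default 30), after which Kubernetes sends SIGKILL.

Without this pattern, a process that exits as soon as it receives SIGTERM can drop in-flight requests mid-processing. Kubernetes does remove a terminating pod from the Service endpoints, but that happens asynchronously alongside SIGTERM, so some new requests can still reach the pod after the signal arrives.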
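The four steps can be sketched with the standard library alone. The `GracefulShutdown` class and its method names are hypothetical; the request counter is assumed to be updated by your server's request hooks.

```python
import signal
import threading
import time


class GracefulShutdown:
    """Tracks shutdown state and in-flight requests for a single process."""

    def __init__(self) -> None:
        self.shutting_down = threading.Event()
        self._lock = threading.Lock()
        self._in_flight = 0

    def install(self) -> None:
        # Step 1: the handler only flips the flag; the real work happens on
        # the main thread. signal.signal must be called from the main thread.
        signal.signal(signal.SIGTERM, lambda signum, frame: self.shutting_down.set())

    def request_started(self) -> None:
        with self._lock:
            self._in_flight += 1

    def request_finished(self) -> None:
        with self._lock:
            self._in_flight -= 1

    def in_flight(self) -> int:
        with self._lock:
            return self._in_flight

    def readiness_status(self) -> int:
        # Readiness flips to 503 as soon as shutdown has been requested.
        return 503 if self.shutting_down.is_set() else 200

    def drain(self, timeout_s: float = 30.0, poll_s: float = 0.1) -> bool:
        """Step 2: wait until no requests are in flight or the timeout passes.

        Returns True if the process drained cleanly, False on timeout.
        """
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            if self.in_flight() == 0:
                return True
            time.sleep(poll_s)
        return self.in_flight() == 0
```

A typical main thread would call `install()`, block on `shutting_down.wait()`, then call `drain()`, close database connections and other resources (step 3), and exit (step 4). Choosing a `timeout_s` below the pod's termination grace period leaves room for cleanup before SIGKILL.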

Question 4

What should be included in a health check response body?

Accepted Answer

For liveness: minimal — just {"status": "ok"}. Speed matters more than information. For readiness: include per-dependency status and latency. A useful format: {"status": "healthy", "checks": {"database": {"status": "healthy", "latency_ms": 4.2}, "redis": {"status": "degraded", "latency_ms": 145.0}}, "version": "2.3.1", "uptime_seconds": 3600}. Dependency latencies are more actionable than pass/fail alone — a DB check that passes in 2000ms signals an emerging problem even though it technically succeeds. Include the app version for verifying deployments. Expose uptime to detect unexpected restarts. Never include secrets, PII, or internal IP addresses in health check responses — they are typically unauthenticated.
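A readiness body in that format could be assembled roughly as follows. This is a sketch under assumptions: the 100 ms "degraded" cutoff, the helper names, and the rule that only an outright failed check makes the overall status unhealthy are all illustrative choices, and the version string is simply the one from the example above.

```python
import time
from typing import Callable, Dict

APP_VERSION = "2.3.1"          # value from the example above; normally injected at build time
START_TIME = time.monotonic()  # process start, used for uptime_seconds


def timed_check(check: Callable[[], bool], degraded_after_ms: float = 100.0) -> dict:
    """Run one dependency check and report its status and latency."""
    start = time.perf_counter()
    try:
        ok = bool(check())
    except Exception:
        ok = False
    latency_ms = round((time.perf_counter() - start) * 1000, 1)
    if not ok:
        status = "unhealthy"
    elif latency_ms > degraded_after_ms:
        status = "degraded"  # passed, but slowly: an early warning
    else:
        status = "healthy"
    return {"status": status, "latency_ms": latency_ms}


def readiness_body(checks: Dict[str, Callable[[], bool]]) -> dict:
    """Build the response body; no secrets, PII, or internal addresses."""
    results = {name: timed_check(fn) for name, fn in checks.items()}
    failed = any(r["status"] == "unhealthy" for r in results.values())
    return {
        "status": "unhealthy" if failed else "healthy",
        "checks": results,
        "version": APP_VERSION,
        "uptime_seconds": int(time.monotonic() - START_TIME),
    }
```

The handler would serialize this dictionary as JSON and return 503 when the top-level status is unhealthy. Treating a degraded dependency as still ready matches the example body above, where redis is degraded but the overall status is healthy.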

Question 5

How do you configure health check thresholds to avoid flapping?

Accepted Answer

Flapping: a probe alternates between passing and failing, so the pod is repeatedly added to and removed from the load balancer. Prevent it with failureThreshold and successThreshold. For readiness, failureThreshold=2 means the pod is removed only after 2 consecutive failures, not a single one; successThreshold=1 means it is re-added after 1 success. For liveness, typical settings are failureThreshold=3 and periodSeconds=10: the container is restarted after roughly 30 seconds of consecutive failures (successThreshold must be 1 for liveness). Never set failureThreshold=1 for liveness, since a single transient network error would cause an unnecessary restart. For readiness, shorter thresholds are acceptable since the consequence (traffic stops) is less severe than a restart.
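The effect of consecutive-count thresholds can be seen in a small model of the readiness state machine. This is an illustration of the idea, not the kubelet's implementation; the function names are made up, and the timing helper ignores probe timeouts and scheduling jitter.

```python
from typing import Iterable, List


def apply_thresholds(
    results: Iterable[bool],
    failure_threshold: int = 2,
    success_threshold: int = 1,
    initially_ready: bool = True,
) -> List[bool]:
    """Return the ready/not-ready state after each probe result."""
    ready = initially_ready
    failures = 0
    successes = 0
    states = []
    for ok in results:
        if ok:
            successes += 1
            failures = 0  # a success breaks any failure streak
            if not ready and successes >= success_threshold:
                ready = True
        else:
            failures += 1
            successes = 0  # a failure breaks any success streak
            if ready and failures >= failure_threshold:
                ready = False
        states.append(ready)
    return states


def detection_window_seconds(failure_threshold: int, period_seconds: int) -> int:
    """Approximate time of consecutive failures before the action triggers."""
    return failure_threshold * period_seconds


noisy = [True, False, True, False, True, False]
with_threshold_2 = apply_thresholds(noisy, failure_threshold=2)
with_threshold_1 = apply_thresholds(noisy, failure_threshold=1)
```

With failureThreshold=2 the alternating results never remove the pod, while failureThreshold=1 reproduces the flapping exactly. A genuine outage (two or more failures in a row) still takes the pod out of rotation.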

Health Check Endpoint Low-Level Design: Liveness, Readiness, and Kubernetes Probes

Liveness vs Readiness vs Startup Probes

Health Check Implementation

Kubernetes Probe Configuration

Avoiding Common Health Check Mistakes

Key Interview Points