System Design Interview: Kubernetes and Container Orchestration

Kubernetes is the de facto standard for container orchestration. Understanding its architecture, scheduling model, and operational patterns is increasingly expected in senior engineering interviews at companies that run microservices at scale.

Kubernetes Architecture

Control Plane (Master):
  ┌──────────────────────────────────────────────┐
  │ API Server      — central REST endpoint       │
  │ etcd            — distributed KV store        │
  │ Scheduler       — assigns pods to nodes       │
  │ Controller Mgr  — reconciliation loops        │
  │ Cloud Controller — cloud provider integration │
  └──────────────────────────────────────────────┘

Worker Nodes:
  ┌──────────────────────────────────────────────┐
  │ kubelet         — node agent, manages pods    │
  │ kube-proxy      — network rules (iptables)    │
  │ Container Runtime (containerd / CRI-O)        │
  │ Pods (1..N)                                   │
  └──────────────────────────────────────────────┘

etcd: The Source of Truth

  • Stores all cluster state: pod specs, service definitions, configmaps, secrets
  • Raft consensus — typically 3 or 5 nodes for HA (tolerates (N-1)/2 failures: 1 of 3, 2 of 5)
  • API server is the ONLY component that talks to etcd directly
  • Watch mechanism: components subscribe to key prefixes; etcd pushes changes → reactive reconciliation

Pod Lifecycle and Scheduling

Pod scheduling flow:
  1. User creates Pod spec → API Server stores in etcd (Pending)
  2. Scheduler watches for unscheduled pods
  3. Filtering: eliminate nodes that don't satisfy constraints
     - Resource requests: node has enough CPU/memory
     - Node selectors / affinity rules
     - Taints and tolerations
     - Pod topology spread constraints
  4. Scoring: rank remaining nodes
     - Least allocated (spread evenly)
     - Image locality (node already has image)
     - Inter-pod affinity scores
  5. Bind pod to highest-scoring node → API Server updates etcd
  6. kubelet on node watches → pulls image → starts container
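All of the constraints the filtering step evaluates live on the Pod spec itself. A minimal sketch (the `web` name, image, labels, and taint key are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
  labels:
    app: web
spec:
  containers:
    - name: app
      image: nginx:1.25          # illustrative image
      resources:
        requests:                # filter: node must have this much unallocated
          cpu: "500m"
          memory: "256Mi"
  nodeSelector:
    disktype: ssd                # filter: only nodes carrying this label
  tolerations:
    - key: "dedicated"           # filter: permits scheduling onto nodes with this taint
      operator: "Equal"
      value: "web"
      effect: "NoSchedule"
  topologySpreadConstraints:
    - maxSkew: 1                 # filter: keep replicas spread across zones
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: web
```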

Resource Requests vs Limits

resources:
  requests:          # scheduler uses this for placement
    cpu: "500m"      # 0.5 CPU cores
    memory: "256Mi"
  limits:            # hard cap at runtime
    cpu: "1000m"     # throttled if exceeded (not killed)
    memory: "512Mi"  # OOMKilled if exceeded

QoS Classes:
  Guaranteed: requests == limits for every container → evicted last under node pressure
  Burstable:  requests set but below limits (or only some set) → evicted before Guaranteed
  BestEffort: no requests or limits set → evicted first

Vertical Pod Autoscaler (VPA): automatically adjusts requests
Horizontal Pod Autoscaler (HPA): adjusts replica count

Deployments, ReplicaSets, and Rolling Updates

Deployment → manages → ReplicaSet → manages → Pods

Rolling update strategy:
  maxUnavailable: 25%  # how many pods can be down during update
  maxSurge: 25%        # how many extra pods can be created

Update flow:
  1. New ReplicaSet created with new pod template
  2. Scale up new RS by maxSurge pods
  3. Scale down old RS by maxUnavailable pods
  4. Repeat until new RS = desired, old RS = 0
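Putting the strategy fields in context, a sketch of a Deployment that rolls out this way (the `my-app` name, replica count, and image tag are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 8
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%    # at most 2 of 8 pods down at once
      maxSurge: 25%          # at most 2 extra pods during the rollout
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: my-app:v2   # changing this creates the new ReplicaSet
```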

Rollback:
  kubectl rollout undo deployment/my-app
  (keeps old ReplicaSet for instant rollback)

Blue-Green via labels:
  Service selector: version=blue → route to v1 pods
  Deploy v2 pods, test, switch selector: version=green
  Zero-downtime cutover
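A sketch of the cutover, assuming pods are labeled `version: blue` / `version: green` (service name and ports are placeholders):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app
    version: blue     # flip to "green" once v2 is deployed and verified
  ports:
    - port: 80
      targetPort: 8080
```

One way to flip traffic is a single strategic-merge patch on the selector, e.g. `kubectl patch service my-app -p '{"spec":{"selector":{"version":"green"}}}'`.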

Kubernetes Networking

Network model rules:
  - Every pod gets a unique cluster-wide IP
  - Pods can communicate with any other pod without NAT
  - Nodes can communicate with pods without NAT

Implementation (CNI plugins):
  Calico: eBPF or iptables, supports NetworkPolicy, BGP peering
  Flannel: simple VXLAN overlay, no NetworkPolicy
  Cilium:  eBPF-based, L7 policy, Hubble observability
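For CNIs that enforce NetworkPolicy (Calico, Cilium — not plain Flannel), a sketch of restricting ingress to one caller (namespace, labels, and port are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app: api              # the pods this policy protects
  policyTypes:
    - Ingress               # once selected, all ingress not explicitly allowed is denied
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend # only frontend pods may connect
      ports:
        - protocol: TCP
          port: 8080
```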

Services (stable VIPs for pods):
  ClusterIP:    internal VIP, kube-proxy creates iptables rules
  NodePort:     expose on every node's IP:port (30000-32767)
  LoadBalancer: cloud provider creates external LB, maps to NodePort
  Headless:     no VIP, DNS returns individual pod IPs (for StatefulSets)

DNS within cluster:
  my-service.my-namespace.svc.cluster.local → ClusterIP
  10-244-1-5.my-namespace.pod.cluster.local → pod IP (dots in the IP replaced with dashes)
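The difference between a normal Service and a headless one is a single field; a sketch (service names and port assumed):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service          # DNS resolves to one stable ClusterIP VIP
spec:
  selector:
    app: my-app
  ports:
    - port: 80
---
apiVersion: v1
kind: Service
metadata:
  name: my-service-headless
spec:
  clusterIP: None           # headless: DNS returns every pod IP instead of a VIP
  selector:
    app: my-app
  ports:
    - port: 80
```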

StatefulSets for Stateful Workloads

StatefulSet guarantees (vs Deployment):
  - Stable, unique pod names: mysql-0, mysql-1, mysql-2
  - Ordered, sequential pod creation (0 → 1 → 2) and reverse-order termination (2 → 1 → 0)
  - Stable network identity: mysql-0.mysql.default.svc.cluster.local
  - Persistent volume per pod (PVC not shared, not deleted on pod delete)

Use cases: databases and other stateful systems (MySQL, Cassandra, Kafka, ZooKeeper)

Example: Kafka StatefulSet
  kafka-0 → PVC: kafka-data-kafka-0 (broker 0)
  kafka-1 → PVC: kafka-data-kafka-1 (broker 1)
  kafka-2 → PVC: kafka-data-kafka-2 (broker 2)
  Headless service → DNS for each broker separately
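A trimmed sketch of that Kafka setup (image, volume size, and mount path are placeholders):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: kafka
spec:
  clusterIP: None              # headless: kafka-0.kafka, kafka-1.kafka, ...
  selector:
    app: kafka
  ports:
    - port: 9092
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  serviceName: kafka           # ties stable pod DNS names to the headless service
  replicas: 3
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
        - name: kafka
          image: apache/kafka:3.7.0   # placeholder image
          volumeMounts:
            - name: kafka-data
              mountPath: /var/lib/kafka
  volumeClaimTemplates:        # one PVC per pod, named kafka-data-kafka-0, -1, -2
    - metadata:
        name: kafka-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi
```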

Horizontal Pod Autoscaler (HPA)

HPA control loop (every 15s):
  desiredReplicas = ceil(currentReplicas × (currentMetric / targetMetric))

Example:
  Current: 4 replicas, CPU at 80%
  Target: CPU 50%
  desired = ceil(4 × 80/50) = ceil(6.4) = 7 replicas → scale up to 7

Metric sources:
  Built-in: CPU utilization, memory utilization
  Custom:   requests/sec, queue depth (via Prometheus + adapter)
  External: SQS queue depth, Pub/Sub undelivered messages (KEDA)

Scale-down stabilization (default 5min):
  Prevents thrashing — the controller acts on the highest replica recommendation seen
  over the past 5 minutes, so it scales down only after the lower count has held for the full window
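The pieces above fit into one autoscaling/v2 manifest; a sketch (target name and replica bounds assumed):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50          # the targetMetric in the formula above
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300     # the 5-minute default, made explicit
```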

Kubernetes Observability

The three pillars:

Metrics: Prometheus scrapes /metrics endpoints
  → Grafana dashboards
  → AlertManager → PagerDuty

Logs: stdout/stderr → node log agent (Fluentd/Fluentbit)
  → Elasticsearch or Cloud Logging
  → Structured JSON logs with pod name, namespace, trace_id

Traces: OpenTelemetry SDK in app
  → Collector sidecar or daemonset
  → Jaeger / Tempo / AWS X-Ray

Key metrics to monitor:
  Pod: CPU throttling rate, OOMKill count, restart count
  Node: allocatable vs requested CPU/memory, eviction rate
  Cluster: pending pods (scheduling backlog), API server latency

Common Interview Design Questions

How does Kubernetes handle node failure?

The Node Controller watches kubelet heartbeats. After node-monitor-grace-period (default 40s), the node is marked NotReady and tainted node.kubernetes.io/not-ready. With taint-based eviction (the default in modern Kubernetes, replacing the legacy --pod-eviction-timeout flag), pods are evicted once their toleration's tolerationSeconds expires (default 300s, so roughly 5 minutes end to end) and are rescheduled to healthy nodes. Shortening tolerationSeconds per pod brings failover down to about a minute.
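Failover speed is tunable per pod: the admission controller adds not-ready/unreachable tolerations with tolerationSeconds: 300 by default, and overriding them in the pod spec shortens eviction (the 30s value is illustrative):

```yaml
# In the pod spec: evict this pod 30s after its node goes NotReady,
# instead of the default 300s.
tolerations:
  - key: "node.kubernetes.io/not-ready"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 30
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 30
```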

How do you run a database in Kubernetes?

Use StatefulSet + PersistentVolumeClaim (StorageClass: gp3/pd-ssd). For production, use an operator (CloudNativePG, Vitess, CockroachDB operator) that handles replication, failover, and backups. Single-node databases in k8s are straightforward; multi-node clusters need an operator to coordinate membership, replication, and leader election. Alternatively, use managed cloud databases (RDS, Cloud SQL) outside k8s for simpler operations.

Kubernetes vs serverless

  Factor         Kubernetes               Serverless (Lambda)
  Cold start     Pod startup ~5-30s       ms to seconds
  Max duration   Unlimited                15 min (Lambda)
  Scaling        HPA (minutes)            Per-request (instant)
  Cost model     Reserved capacity        Per-invocation
  Debugging      Full shell access        Limited (logs only)
  Best for       Long-running services    Event-driven, bursty

