Kubernetes is the de facto standard for container orchestration. Understanding its architecture, scheduling model, and operational patterns is increasingly expected in senior engineering interviews at companies that run microservices at scale.
Kubernetes Architecture
Control Plane (formerly called the "master"):
┌──────────────────────────────────────────────┐
│ API Server — central REST endpoint │
│ etcd — distributed KV store │
│ Scheduler — assigns pods to nodes │
│ Controller Mgr — reconciliation loops │
│ Cloud Controller— cloud provider integration │
└──────────────────────────────────────────────┘
Worker Nodes:
┌──────────────────────────────────────────────┐
│ kubelet — node agent, manages pods │
│ kube-proxy — network rules (iptables) │
│ Container Runtime (containerd / CRI-O) │
│ Pods (1..N) │
└──────────────────────────────────────────────┘
etcd: The Source of Truth
- Stores all cluster state: pod specs, service definitions, configmaps, secrets
- Raft consensus — typically 3 or 5 nodes for HA; a majority quorum means a 3-node cluster tolerates 1 failure and a 5-node cluster tolerates 2 (floor((N-1)/2))
- API server is the ONLY component that talks to etcd directly
- Watch mechanism: components subscribe to key prefixes; etcd pushes changes → reactive reconciliation
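The same watch primitive is visible with etcdctl, though in practice only the API server should ever do this, and values are stored as protobufs rather than readable JSON (the /registry key layout shown here is the API server's standard storage prefix):

etcdctl watch --prefix /registry/pods/default/
# emits a PUT or DELETE event each time any pod in the default namespace changes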
Pod Lifecycle and Scheduling
Pod scheduling flow:
1. User creates Pod spec → API Server stores in etcd (Pending)
2. Scheduler watches for unscheduled pods
3. Filtering: eliminate nodes that don't satisfy constraints
- Resource requests: node has enough CPU/memory
- Node selectors / affinity rules
- Taints and tolerations
- Pod topology spread constraints
4. Scoring: rank remaining nodes
- Least allocated (spread evenly)
- Image locality (node already has image)
- Inter-pod affinity scores
5. Bind pod to highest-scoring node → API Server updates etcd
6. kubelet on node watches → pulls image → starts container
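A sketch of a pod spec that exercises each filter (the disktype label and dedicated taint are illustrative, not cluster defaults):

apiVersion: v1
kind: Pod
metadata:
  name: web
  labels:
    app: web
spec:
  containers:
    - name: web
      image: nginx:1.27
      resources:
        requests:
          cpu: "500m"        # filtering: node must have 0.5 CPU unreserved
          memory: "256Mi"
  nodeSelector:
    disktype: ssd            # filtering: only nodes labeled disktype=ssd
  tolerations:
    - key: dedicated         # filtering: allows nodes tainted dedicated=web:NoSchedule
      operator: Equal
      value: web
      effect: NoSchedule
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone   # spread app=web pods evenly across zones
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: web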
Resource Requests vs Limits
resources:
  requests:              # scheduler uses this for placement
    cpu: "500m"          # 0.5 CPU cores
    memory: "256Mi"
  limits:                # hard cap at runtime
    cpu: "1000m"         # throttled if exceeded (not killed)
    memory: "512Mi"      # OOMKilled if exceeded
QoS Classes:
Guaranteed: requests == limits for every container → evicted last under node pressure
Burstable: some requests/limits set, requests < limits → evicted before Guaranteed when the node is pressured
BestEffort: no requests/limits → evicted first
Vertical Pod Autoscaler (VPA): automatically adjusts requests
Horizontal Pod Autoscaler (HPA): adjusts replica count
Deployments, ReplicaSets, and Rolling Updates
Deployment → manages → ReplicaSet → manages → Pods
Rolling update strategy (in the Deployment spec):
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 25%   # how many pods can be down during the update
    maxSurge: 25%         # how many extra pods may be created above the desired count
Update flow:
1. New ReplicaSet created with new pod template
2. Scale up new RS by maxSurge pods
3. Scale down old RS by maxUnavailable pods
4. Repeat until new RS = desired, old RS = 0
Rollback:
kubectl rollout undo deployment/my-app
(old ReplicaSets are retained, up to revisionHistoryLimit (default 10), so rollback is instant: the previous ReplicaSet is simply scaled back up)
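The related rollout commands are worth knowing:

kubectl rollout status deployment/my-app                 # block until the rollout completes
kubectl rollout history deployment/my-app                # list recorded revisions
kubectl rollout undo deployment/my-app --to-revision=2   # roll back to a specific revision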
Blue-Green via labels:
Service selector: version=blue → route to v1 pods
Deploy v2 pods, test, switch selector: version=green
Zero-downtime cutover
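A minimal sketch of the label-based cutover (service and label names are illustrative):

apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app
    version: blue          # flip to "green" to move all traffic at once
  ports:
    - port: 80
      targetPort: 8080

The switch itself can be a single patch:
kubectl patch service my-app -p '{"spec":{"selector":{"app":"my-app","version":"green"}}}'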
Kubernetes Networking
Network model rules:
- Every pod gets a unique cluster-wide IP
- Pods can communicate with any other pod without NAT
- Nodes can communicate with pods without NAT
Implementation (CNI plugins):
Calico: eBPF or iptables, supports NetworkPolicy, BGP peering
Flannel: simple VXLAN overlay, no NetworkPolicy
Cilium: eBPF-based, L7 policy, Hubble observability
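For example, a NetworkPolicy restricting ingress to an API tier might look like this (it only takes effect on a CNI that enforces policy, such as Calico or Cilium; the labels are illustrative):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-allow-frontend
spec:
  podSelector:
    matchLabels:
      app: api              # the pods this policy protects
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend   # only frontend pods may connect
      ports:
        - protocol: TCP
          port: 8080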
Services (stable VIPs for pods):
ClusterIP: internal VIP, kube-proxy creates iptables rules
NodePort: expose on every node's IP:port (30000-32767)
LoadBalancer: cloud provider creates external LB, maps to NodePort
Headless: no VIP, DNS returns individual pod IPs (for StatefulSets)
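A headless Service differs from a normal ClusterIP Service by a single field (names illustrative):

apiVersion: v1
kind: Service
metadata:
  name: mysql
spec:
  clusterIP: None          # headless: DNS returns pod IPs instead of a VIP
  selector:
    app: mysql
  ports:
    - port: 3306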
DNS within cluster:
my-service.my-namespace.svc.cluster.local → ClusterIP
10-244-1-5.my-namespace.pod.cluster.local → pod IP 10.244.1.5 (dots replaced with dashes)
StatefulSets for Stateful Workloads
StatefulSet guarantees (vs Deployment):
- Stable, unique pod names: mysql-0, mysql-1, mysql-2
- Ordered, sequential pod creation (0 → 1 → 2)
- Stable network identity: mysql-0.mysql.default.svc.cluster.local
- Persistent volume per pod (PVC not shared, not deleted on pod delete)
Use cases: databases (MySQL, Cassandra, Kafka, ZooKeeper)
Example: Kafka StatefulSet
kafka-0 → PVC: kafka-data-kafka-0 (broker 0)
kafka-1 → PVC: kafka-data-kafka-1 (broker 1)
kafka-2 → PVC: kafka-data-kafka-2 (broker 2)
Headless service → DNS for each broker separately
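A skeleton of such a StatefulSet (image tag and storage size are placeholders):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  serviceName: kafka               # must reference a headless Service
  replicas: 3
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
        - name: kafka
          image: apache/kafka:3.7.0      # placeholder version
          volumeMounts:
            - name: kafka-data
              mountPath: /var/lib/kafka
  volumeClaimTemplates:                  # one PVC per pod, retained across pod restarts
    - metadata:
        name: kafka-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi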
Horizontal Pod Autoscaler (HPA)
HPA control loop (every 15s):
desiredReplicas = ceil(currentReplicas × (currentMetric / targetMetric))
Example:
Current: 4 replicas, CPU at 80%
Target: CPU 50%
desired = ceil(4 × 80/50) = ceil(6.4) = 7 replicas → scale up to 7
Metric sources:
Built-in: CPU utilization, memory utilization
Custom: requests/sec, queue depth (via Prometheus + adapter)
External: SQS queue depth, Pub/Sub undelivered messages (KEDA)
Scale-down stabilization (default 5 min):
Prevents thrashing: the controller acts on the highest replica recommendation seen over the window, so it scales down only after the lower count has held for 5 consecutive minutes
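The equivalent autoscaling/v2 manifest for the CPU example above (names illustrative):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50       # target: 50% of requested CPU
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # the 5-minute default, made explicit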
Kubernetes Observability
The three pillars:
Metrics: Prometheus scrapes /metrics endpoints
→ Grafana dashboards
→ AlertManager → PagerDuty
Logs: stdout/stderr → node log agent (Fluentd / Fluent Bit)
→ Elasticsearch or Cloud Logging
→ Structured JSON logs with pod name, namespace, trace_id
Traces: OpenTelemetry SDK in app
→ Collector sidecar or daemonset
→ Jaeger / Tempo / AWS X-Ray
Key metrics to monitor:
Pod: CPU throttling rate, OOMKill count, restart count
Node: allocatable vs requested CPU/memory, eviction rate
Cluster: pending pods (scheduling backlog), API server latency
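As one concrete example, an alert on sustained CPU throttling, assuming the Prometheus Operator's PrometheusRule CRD and standard cAdvisor metrics (the 25% threshold is illustrative):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-cpu-throttling
spec:
  groups:
    - name: pod-resources
      rules:
        - alert: HighCPUThrottling
          expr: |
            rate(container_cpu_cfs_throttled_periods_total[5m])
              / rate(container_cpu_cfs_periods_total[5m]) > 0.25
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Container {{ $labels.container }} is throttled in >25% of CPU periods"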
Common Interview Design Questions
How does Kubernetes handle node failure?
Node Controller detects missing heartbeats (node leases). After node-monitor-grace-period (default 40s), the node is marked NotReady and tainted with node.kubernetes.io/not-ready or node.kubernetes.io/unreachable. Pods are evicted once their toleration for that taint expires: the API server injects a default toleration with tolerationSeconds: 300, so rescheduling typically begins about 5 minutes after the failure. Latency-sensitive pods can set a shorter tolerationSeconds to bring this down to roughly 1 minute. (The older --pod-eviction-timeout flag predates taint-based eviction, which has been the default since v1.18.)
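A sketch of tightening failover for one workload (the 30s value is illustrative):

tolerations:
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 30    # evict 30s after the node becomes unreachable (default 300)
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 30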
How do you run a database in Kubernetes?
Use a StatefulSet + PersistentVolumeClaim (StorageClass: gp3/pd-ssd). For production, use an operator (CloudNativePG, Vitess, the CockroachDB operator) that handles replication, failover, and backups. A single-instance database in k8s is straightforward; multi-node replication needs an operator to coordinate leader election and failover. Alternatively, use managed cloud databases (RDS, Cloud SQL) outside k8s for simpler operations.
Kubernetes vs serverless
| Factor | Kubernetes | Serverless (Lambda) |
|---|---|---|
| Cold start | Pod startup ~5-30s | ms to seconds |
| Max duration | Unlimited | 15 min (Lambda) |
| Scaling | HPA (minutes) | Per-request (instant) |
| Cost model | Reserved capacity | Per-invocation |
| Debugging | Full shell access | Limited (logs only) |
| Best for | Long-running services | Event-driven, bursty |