System Design: Kubernetes Architecture Deep Dive — Control Plane, etcd, Scheduler, Kubelet, Pod Lifecycle, Networking

Kubernetes is the de facto standard for container orchestration, running production workloads at companies from startups to Google. Understanding its architecture is essential for system design interviews, SRE roles, and backend engineering positions. This guide covers the internal architecture of Kubernetes — how the control plane works, how pods are scheduled, how networking functions, and the design decisions that make Kubernetes scalable and resilient.

Control Plane Components

The Kubernetes control plane makes global decisions about the cluster (scheduling, detecting and responding to events). Components: (1) kube-apiserver — the REST API frontend for all cluster operations. Every kubectl command, every controller action, and every kubelet report goes through the API server. It validates requests, authenticates callers, and persists state to etcd. Horizontally scalable — run multiple instances behind a load balancer. (2) etcd — the distributed key-value store that holds all cluster state: pod definitions, service configurations, secrets, configmaps, node status. Uses the Raft consensus protocol for data replication. The single source of truth for the cluster. (3) kube-scheduler — watches for newly created pods with no assigned node and selects a node for each. Scheduling decisions consider: resource requirements (CPU, memory), node affinity/anti-affinity rules, taints and tolerations, pod topology spread constraints, and inter-pod affinity. (4) kube-controller-manager — runs controller loops that watch cluster state via the API server and make changes to move the current state toward the desired state. Examples: ReplicaSet controller ensures the correct number of pod replicas, Node controller detects and responds to node failures.
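The controller pattern described in (4) is the core design idea: observe current state, compare with desired state, act to close the gap. A minimal sketch of that loop, in the spirit of the ReplicaSet controller (the names and data structures here are illustrative, not the real client-go API):

```python
# Hypothetical sketch of a controller reconcile loop: drive observed
# state toward desired state, one pass at a time.
from dataclasses import dataclass, field

@dataclass
class ReplicaSet:
    desired_replicas: int
    pods: list = field(default_factory=list)  # names of currently running pods

def reconcile(rs: ReplicaSet) -> ReplicaSet:
    """One pass of the control loop: create or delete pods until
    the observed pod count matches the desired count."""
    diff = rs.desired_replicas - len(rs.pods)
    if diff > 0:                       # too few pods: create replacements
        for _ in range(diff):
            rs.pods.append(f"pod-{len(rs.pods)}")
    elif diff < 0:                     # too many pods: delete the excess
        rs.pods = rs.pods[:rs.desired_replicas]
    return rs

rs = reconcile(ReplicaSet(desired_replicas=3, pods=["pod-0"]))
```

In the real controller-manager this loop is triggered by watch events from the API server rather than called directly, and idempotency matters: running reconcile twice in a row must not over-correct.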

Node Components and Pod Lifecycle

Each worker node runs: (1) kubelet — the agent that ensures containers described in PodSpecs are running and healthy. It watches the API server for pods assigned to its node, pulls container images, starts containers via the container runtime (containerd), monitors container health via liveness and readiness probes, and reports pod status back to the API server. (2) kube-proxy — maintains network rules on the node for the Service abstraction. It implements iptables or IPVS rules that route traffic destined for a Service ClusterIP to the correct backend pod. (3) Container runtime — the software responsible for running containers. containerd is the default since Kubernetes 1.24, when the dockershim integration was removed and Docker Engine ceased to be directly supported as a runtime. Pod lifecycle: Pending (accepted but not yet scheduled, or images not yet pulled), Running (bound to a node with at least one container running), Succeeded (all containers terminated successfully), Failed (all containers terminated and at least one exited with a non-zero code), Unknown (pod status cannot be determined, typically due to a node communication failure).
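The phase rules above can be captured as a small decision function. This is a simplification of the kubelet's actual status logic (which also handles restart policies and init containers), intended only to make the phase definitions concrete:

```python
# Illustrative mapping from container states to a pod phase, following
# the phase definitions above (not the kubelet's real implementation).
def pod_phase(scheduled: bool, container_states: list) -> str:
    """container_states: list of ("running", None) or ("terminated", exit_code)."""
    if not scheduled or not container_states:
        return "Pending"    # not yet assigned to a node, or containers not created
    if any(state == "running" for state, _ in container_states):
        return "Running"    # at least one container is still running
    if all(state == "terminated" and code == 0 for state, code in container_states):
        return "Succeeded"  # every container exited cleanly
    return "Failed"         # all terminated, at least one with a non-zero code

assert pod_phase(True, [("running", None), ("terminated", 0)]) == "Running"
assert pod_phase(True, [("terminated", 0), ("terminated", 1)]) == "Failed"
```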

How the Kubernetes Scheduler Works

The scheduler runs a two-phase algorithm for each unscheduled pod: (1) Filtering — eliminate nodes that cannot run the pod. Filter plugins check: does the node have sufficient CPU and memory (resource fit)? Does the pod tolerate the node taints? Does the node match the pod nodeSelector or nodeAffinity? Are required volumes available on the node? Has the node reached its pod limit? The result is a list of feasible nodes. (2) Scoring — rank the feasible nodes to find the best fit. Scoring plugins assign a score (0-100) to each node based on: resource balance (prefer nodes that would result in even resource utilization across the cluster), affinity/anti-affinity preferences (soft constraints), topology spread (distribute pods evenly across failure domains), and image locality (prefer nodes that already have the container image cached). The node with the highest total score is selected. The scheduler processes approximately 100 pods per second in typical clusters. For very large clusters (5000+ nodes), scheduler throughput is improved by parallelizing the scoring phase and using scheduling profiles.
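The two phases can be sketched in a few lines. The node fields, the scoring formula, and the image-locality bonus below are simplified assumptions standing in for the real filter and score plugins:

```python
# Minimal sketch of the filter-then-score scheduling algorithm.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    cpu_free: float   # free CPU cores
    mem_free: float   # free memory, GiB
    has_image: bool   # container image already cached on this node

def filter_nodes(nodes, cpu_req, mem_req):
    """Filtering: drop nodes that cannot fit the pod's resource requests."""
    return [n for n in nodes if n.cpu_free >= cpu_req and n.mem_free >= mem_req]

def score(node, cpu_req, mem_req):
    """Scoring: least-requested resource balance plus an image-locality bonus."""
    cpu_score = 100 * (node.cpu_free - cpu_req) / node.cpu_free
    mem_score = 100 * (node.mem_free - mem_req) / node.mem_free
    locality = 10 if node.has_image else 0
    return (cpu_score + mem_score) / 2 + locality

def schedule(nodes, cpu_req, mem_req):
    feasible = filter_nodes(nodes, cpu_req, mem_req)
    if not feasible:
        return None  # no feasible node: the pod stays Pending
    return max(feasible, key=lambda n: score(n, cpu_req, mem_req)).name

nodes = [Node("a", 1.0, 2.0, False), Node("b", 4.0, 8.0, True)]
# for a pod requesting 2 CPU / 4 GiB, node "a" is filtered out; "b" is selected
```

The real scheduler runs many filter and score plugins, weights the scores per plugin, and breaks ties randomly, but the shape of the algorithm is exactly this.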

Kubernetes Networking Model

Kubernetes networking has three fundamental rules: (1) Every pod gets its own IP address. (2) Pods can communicate with any other pod without NAT (network address translation). (3) Agents on a node can communicate with all pods on that node. Implementation: a CNI (Container Network Interface) plugin provides the pod network. Popular CNIs: Calico (uses BGP for routing, supports network policies), Cilium (uses eBPF for high-performance networking and observability), Flannel (simple overlay network using VXLAN). Service networking: a Service is a stable virtual IP (ClusterIP) that load-balances traffic to a set of pods selected by a label selector. When a pod is created or destroyed, the Endpoints controller updates the Service endpoint list. kube-proxy programs iptables/IPVS rules to route Service IP traffic to backend pod IPs. DNS: CoreDNS runs as a Deployment in the cluster and provides DNS resolution. A Service named my-service in namespace default is reachable at my-service.default.svc.cluster.local. Pods resolve service names via the cluster DNS automatically.
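The Service-to-pod mapping above is just label matching. A sketch of how the Endpoints controller selects backend pods for a Service (simplified; pod IPs and labels here are made up for the example):

```python
# Sketch of Service endpoint selection by label selector, as the
# Endpoints controller does it (simplified; not the real API objects).
def select_endpoints(pods, selector):
    """pods: list of (pod_ip, labels); selector: dict of required labels.
    A pod matches only if every selector label is present with the same value."""
    return [ip for ip, labels in pods
            if all(labels.get(k) == v for k, v in selector.items())]

pods = [("10.244.1.5", {"app": "web", "tier": "frontend"}),
        ("10.244.2.7", {"app": "web"}),
        ("10.244.1.9", {"app": "db"})]
assert select_endpoints(pods, {"app": "web"}) == ["10.244.1.5", "10.244.2.7"]
```

kube-proxy then programs iptables/IPVS rules so that traffic to the Service ClusterIP is load-balanced across exactly this endpoint list, and the list is recomputed whenever a matching pod is created or destroyed.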

Services, Ingress, and External Traffic

Service types: (1) ClusterIP (default) — accessible only within the cluster. Used for internal service-to-service communication. (2) NodePort — exposes the service on a static port on each node. External traffic hits NodeIP:NodePort and is routed to the service. Port range: 30000-32767. (3) LoadBalancer — provisions an external load balancer (cloud provider specific). The load balancer routes traffic to NodePorts, which route to pods. This is how most production services are exposed. (4) ExternalName — maps a service to a DNS name (CNAME record). No proxying. Ingress: an API object that manages external HTTP(S) access to services. An Ingress controller (nginx, Traefik, AWS ALB) watches Ingress resources and configures the underlying load balancer/proxy. Ingress provides: host-based routing (api.example.com -> api-service, web.example.com -> web-service), path-based routing (/api -> api-service, / -> web-service), TLS termination, and rate limiting. Gateway API is the successor to Ingress, providing more expressive routing with HTTPRoute, GRPCRoute, and TCPRoute resources.
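Host- and path-based Ingress routing reduces to a longest-prefix match per host. A sketch of that decision, with hypothetical rules and service names (a real Ingress controller compiles equivalent rules into nginx or envoy configuration):

```python
# Illustrative host- and path-based routing as an Ingress controller
# performs it: match the host, then pick the longest matching path prefix.
def route(host, path, rules):
    """rules: {host: [(path_prefix, service)]}; returns the backend service
    for the longest matching prefix, or None if nothing matches."""
    candidates = [(p, svc) for p, svc in rules.get(host, []) if path.startswith(p)]
    if not candidates:
        return None
    return max(candidates, key=lambda c: len(c[0]))[1]

rules = {"api.example.com": [("/", "api-service")],
         "web.example.com": [("/api", "api-service"), ("/", "web-service")]}
assert route("web.example.com", "/api/v1/users", rules) == "api-service"
assert route("web.example.com", "/home", rules) == "web-service"
```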

Scaling and Resource Management

Horizontal Pod Autoscaler (HPA) adjusts the number of pod replicas based on metrics. Default metric: CPU utilization. HPA loop: every 15 seconds, query the metrics API for current CPU usage across all pods. Compute desired replicas: desiredReplicas = ceil(currentReplicas * (currentMetric / targetMetric)). If CPU target is 50% and current is 80% with 3 replicas: desired = ceil(3 * 80/50) = ceil(4.8) = 5. Scale up to 5 replicas. Custom metrics: scale on request rate, queue depth, or any Prometheus metric via the custom metrics API. Vertical Pod Autoscaler (VPA) adjusts CPU and memory requests/limits for containers. It monitors actual resource usage over time and recommends (or applies) right-sized resource requests. Cluster Autoscaler adds or removes nodes based on pending pods (pods that cannot be scheduled due to insufficient resources) and underutilized nodes (nodes with low resource utilization that can be drained). The combination: HPA scales pods horizontally, VPA right-sizes individual pods, and Cluster Autoscaler scales the infrastructure to match.
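The HPA replica formula from the paragraph above, as a small helper using the worked example (target 50% CPU, current 80%, 3 replicas); the min/max clamping mirrors the minReplicas/maxReplicas bounds on an HPA object:

```python
# The HPA scaling formula: desired = ceil(current * currentMetric / targetMetric),
# clamped to the configured replica bounds.
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_r=1, max_r=100):
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_r, min(max_r, desired))

assert desired_replicas(3, 80, 50) == 5   # ceil(3 * 80/50) = ceil(4.8) = 5
```

Note the asymmetry described earlier: this formula is applied eagerly on scale-up, but scale-down is damped by the stabilization window so a brief dip in load does not immediately shed replicas.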

etcd and Cluster State Management

etcd stores all Kubernetes state as key-value pairs under the /registry prefix. Example: /registry/pods/default/my-pod stores the full Pod object serialized as a Protocol Buffer. etcd uses Raft consensus with a 3 or 5 node cluster for high availability. Performance: etcd handles approximately 10,000 writes per second and 100,000 reads per second in a well-tuned deployment. This is sufficient for clusters up to approximately 5,000 nodes. Beyond that, etcd becomes the bottleneck and requires careful tuning: dedicated SSD storage (etcd is I/O bound — fsync latency is the critical metric), a separate etcd cluster from the control plane nodes, and regular compaction and defragmentation to manage database size. Watch mechanism: Kubernetes controllers use the etcd watch API (via the API server) to receive notifications when resources change. This event-driven architecture is more efficient than polling. The API server also provides a resource version for optimistic concurrency — clients read a resource, modify it, and send the update with the resource version. If another client modified the resource in the meantime (the version changed), the update is rejected with a 409 Conflict.
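The optimistic-concurrency scheme can be sketched with an in-memory store, modeling the 409 Conflict as an exception (this is an illustration of the mechanism, not the API server's code):

```python
# Sketch of optimistic concurrency with resource versions: an update is
# accepted only if the client's version matches the stored version.
class Conflict(Exception):
    """Stands in for an HTTP 409 Conflict response."""

class Store:
    def __init__(self):
        self.objects = {}   # key -> (resource_version, value)

    def get(self, key):
        return self.objects[key]

    def update(self, key, value, expected_version):
        version, _ = self.objects.get(key, (0, None))
        if expected_version != version:
            # another writer bumped the version since this client's read
            raise Conflict(f"409: expected {expected_version}, stored {version}")
        self.objects[key] = (version + 1, value)

store = Store()
store.objects["/registry/pods/default/my-pod"] = (1, {"phase": "Pending"})
version, _ = store.get("/registry/pods/default/my-pod")
store.update("/registry/pods/default/my-pod", {"phase": "Running"}, version)
# a second writer still holding version 1 would now raise Conflict
```

On a Conflict, clients are expected to re-read the resource, re-apply their change, and retry — the same read-modify-write retry loop that controllers built on client-go perform automatically.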

Frequently Asked Questions

How does the Kubernetes scheduler decide which node to place a pod on?

The scheduler runs a two-phase algorithm: filtering and scoring. Filtering eliminates nodes that cannot run the pod. Checks include: does the node have enough CPU and memory (resource requests fit within allocatable capacity)? Does the pod tolerate the node's taints (a tainted node repels pods without matching tolerations)? Does the node match the pod's nodeSelector or nodeAffinity (e.g., the pod requires a GPU node)? Are the required persistent volumes available on the node? Has the node reached its maximum pod count? After filtering, scoring ranks the remaining feasible nodes. Scoring plugins assign 0-100 points per criterion: LeastRequestedPriority (prefer nodes with more available resources for balanced utilization), InterPodAffinity (prefer nodes where co-located pods already run), ImageLocality (prefer nodes that already have the container image cached, avoiding a pull), and TopologySpreadConstraints (distribute pods evenly across zones). The scores are weighted and summed, and the node with the highest total score wins. If multiple nodes tie, one is chosen randomly. The entire scheduling cycle takes 5-20ms per pod.

What happens when a Kubernetes node fails?

When a node stops sending heartbeats to the API server, the node controller detects the failure. Timeline: the kubelet sends heartbeats every 10 seconds (NodeStatus updates), and the node controller checks every 5 seconds. After the node-monitor-grace-period (default 40 seconds) without a heartbeat, the node is marked Unknown. After the pod-eviction-timeout (default 5 minutes), all pods on the Unknown node are marked for eviction. The Deployment or ReplicaSet controller then detects that the replica count is below the desired count and creates replacement pods, which the scheduler assigns to healthy nodes. Total recovery time: approximately 5-7 minutes from node failure to replacement pods running. To reduce this: lower the pod-eviction-timeout (at the cost of false positives on network blips), use pod disruption budgets to ensure minimum availability during eviction, and configure liveness probes with appropriate thresholds. For stateful workloads (StatefulSets), automatic recovery is more cautious — the system waits longer to avoid split-brain scenarios where the old pod is still running but unreachable.

How does Kubernetes networking allow every pod to communicate with every other pod?

Kubernetes requires a flat network where every pod can reach every other pod by IP address without NAT. This is implemented by Container Network Interface (CNI) plugins. Three common approaches: (1) Overlay networks (Flannel with VXLAN) — each node gets a subnet (e.g., node 1: 10.244.1.0/24, node 2: 10.244.2.0/24), and pods on node 1 get IPs in 10.244.1.x. Cross-node traffic is encapsulated in VXLAN packets (UDP wrapping the original packet); the receiving node decapsulates and delivers. Simple to set up, but encapsulation adds roughly 50 bytes of overhead per packet. (2) Direct routing (Calico with BGP) — each node announces its pod subnet via BGP (Border Gateway Protocol) to the network infrastructure, so routers learn that 10.244.1.0/24 is reachable via node 1. No encapsulation overhead, but requires BGP support from the network. (3) eBPF-based (Cilium) — attaches Linux eBPF programs to network interfaces for packet routing and filtering, providing kernel-level networking with high performance and rich observability (per-pod traffic metrics). Service networking is layered on top: kube-proxy programs iptables or IPVS rules to translate Service ClusterIPs to actual pod IPs.

How does Horizontal Pod Autoscaler work in Kubernetes?

The Horizontal Pod Autoscaler (HPA) automatically adjusts the number of pod replicas based on observed metrics. Default metric: CPU utilization. The HPA controller runs a control loop every 15 seconds (configurable via --horizontal-pod-autoscaler-sync-period). Loop: (1) Query the metrics API for the current CPU usage of all pods in the target Deployment. (2) Compute the ratio currentMetricValue / targetMetricValue. Example: the target CPU is 50% and the current average across 3 pods is 75%, so the ratio is 75/50 = 1.5. (3) Compute desired replicas: ceil(currentReplicas * ratio) = ceil(3 * 1.5) = 5. (4) Scale the Deployment to 5 replicas. Scale-up can double the replica count in a single step by default; scale-down is more conservative — the HPA waits through a 5-minute stabilization window after the last scale-up before scaling down, to prevent flapping. Custom metrics: HPA can scale on any Prometheus metric via the custom metrics API (prometheus-adapter) — request rate, queue depth, or business metrics. When multiple metrics are configured, HPA evaluates all of them and uses the one that recommends the highest replica count.