Kubernetes is the de facto standard for container orchestration, running production workloads at companies from startups to Google. Understanding its architecture is essential for system design interviews, SRE roles, and backend engineering positions. This guide covers the internal architecture of Kubernetes — how the control plane works, how pods are scheduled, how networking functions, and the design decisions that make Kubernetes scalable and resilient.
Control Plane Components
The Kubernetes control plane makes global decisions about the cluster, such as scheduling, and detects and responds to cluster events. Its components:
(1) kube-apiserver: the REST API frontend for all cluster operations. Every kubectl command, every controller action, and every kubelet report goes through the API server. It authenticates callers, authorizes and validates their requests, and persists state to etcd. It is horizontally scalable: run multiple instances behind a load balancer.
(2) etcd: the distributed key-value store that holds all cluster state, including pod definitions, service configurations, secrets, configmaps, and node status. It replicates data across its members with the Raft consensus protocol and is the single source of truth for the cluster.
(3) kube-scheduler: watches for newly created pods with no assigned node and selects a node for each. Scheduling decisions consider resource requirements (CPU, memory), node affinity/anti-affinity rules, taints and tolerations, pod topology spread constraints, and inter-pod affinity.
(4) kube-controller-manager: runs controller loops that watch cluster state via the API server and make changes to move the current state toward the desired state. Examples: the ReplicaSet controller ensures the correct number of pod replicas, and the Node controller detects and responds to node failures.
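The controller loop described above (compare desired state with observed state, then act to close the gap) can be sketched in a few lines. This is a minimal illustration rather than the ReplicaSet controller's actual code; the `Action` type and the `reconcile` function are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    kind: str   # "create" or "delete" (hypothetical action names)
    count: int  # how many pods to create or delete


def reconcile(desired_replicas: int, running_pods: list[str]) -> list[Action]:
    """Compare desired and observed state and return the actions that close the gap."""
    diff = desired_replicas - len(running_pods)
    if diff > 0:
        return [Action("create", diff)]
    if diff < 0:
        return [Action("delete", -diff)]
    return []  # already converged: nothing to do


if __name__ == "__main__":
    print(reconcile(3, ["web-a"]))                           # too few pods
    print(reconcile(3, ["web-a", "web-b", "web-c", "web-d"]))  # too many pods
    print(reconcile(3, ["web-a", "web-b", "web-c"]))          # converged
```

Because the function only looks at the current gap, running it repeatedly is safe: once the observed state matches the desired state it returns no actions.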
Node Components and Pod Lifecycle
Each worker node runs:
(1) kubelet: the agent that ensures containers described in PodSpecs are running and healthy. It watches the API server for pods assigned to its node, pulls container images, starts containers via the container runtime, monitors container health via liveness and readiness probes, and reports pod status back to the API server.
(2) kube-proxy: maintains network rules on the node for the Service abstraction. It programs iptables or IPVS rules that route traffic destined for a Service ClusterIP to one of the backend pods.
(3) Container runtime: the software responsible for running containers. The kubelet talks to it through the Container Runtime Interface (CRI), and containerd is the most common choice. Kubernetes 1.24 removed dockershim, the built-in Docker Engine integration, so Docker Engine can now be used as a runtime only through an external adapter such as cri-dockerd. Images built with Docker still run unchanged.
Pod lifecycle phases: Pending (the pod has been accepted, but it is not yet scheduled or its images are not yet pulled), Running (the pod is bound to a node, all containers have been created, and at least one is running, starting, or restarting), Succeeded (all containers terminated successfully and will not be restarted), Failed (all containers have terminated and at least one terminated in failure, for example with a non-zero exit code), Unknown (pod status cannot be determined, typically because of a node communication failure).
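The phase rules above can be expressed as a small decision function. This is a simplified sketch, not kubelet logic: it assumes `restartPolicy: Never` (so terminated containers stay terminated), and the `Container` type, its state names, and the `pod_phase` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Container:
    state: str                       # "not_created", "running", or "terminated"
    exit_code: Optional[int] = None  # only meaningful when state == "terminated"


def pod_phase(scheduled: bool, node_reachable: bool, containers: list[Container]) -> str:
    """Derive a pod phase from simplified inputs (assumes restartPolicy: Never)."""
    if not scheduled:
        return "Pending"   # no node assigned yet
    if not node_reachable:
        return "Unknown"   # status cannot be obtained from the node
    if any(c.state == "not_created" for c in containers):
        return "Pending"   # for example, an image is still being pulled
    if all(c.state == "terminated" for c in containers):
        failed = any(c.exit_code != 0 for c in containers)
        return "Failed" if failed else "Succeeded"
    return "Running"       # at least one container is still running


if __name__ == "__main__":
    print(pod_phase(True, True, [Container("running"), Container("terminated", 0)]))
    print(pod_phase(True, True, [Container("terminated", 0), Container("terminated", 1)]))
```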
How the Kubernetes Scheduler Works
The scheduler runs a two-phase algorithm for each unscheduled pod:
(1) Filtering: eliminate nodes that cannot run the pod. Filter plugins check: does the node have sufficient unallocated CPU and memory for the pod's requests (resource fit)? Does the pod tolerate the node's taints? Does the node match the pod's nodeSelector or nodeAffinity? Are the required volumes available on the node? Has the node reached its pod limit? The result is a list of feasible nodes. If the list is empty, the pod stays Pending.
(2) Scoring: rank the feasible nodes to find the best fit. Each scoring plugin assigns every node a score from 0 to 100 based on criteria such as resource balance (prefer nodes whose CPU and memory usage would stay balanced after the pod is placed), affinity/anti-affinity preferences (soft constraints), topology spread (distribute pods evenly across failure domains), and image locality (prefer nodes that already have the container image cached). The plugin scores are weighted and summed, and the node with the highest total is selected.
As a rough guide, the scheduler processes on the order of 100 pods per second, though the real figure depends on cluster size and on which plugins are enabled. For very large clusters (5000+ nodes), throughput is improved by evaluating nodes in parallel, by scoring only a sample of the feasible nodes (the percentageOfNodesToScore setting) instead of all of them, and by using scheduling profiles to disable plugins a workload does not need.
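The filter-then-score flow can be sketched as follows. This is an illustration under simplifying assumptions, not the kube-scheduler's plugin framework: only resource fit and taints are filtered, the score combines a "most free resources left" heuristic with an image-locality bonus at equal weight, and all type and function names (`Node`, `Pod`, `feasible`, `score`, `schedule`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    name: str
    cpu_capacity_m: int   # CPU capacity in millicores
    cpu_used_m: int       # CPU already requested by other pods
    mem_capacity_mi: int  # memory capacity in MiB
    mem_used_mi: int      # memory already requested by other pods
    taints: set[str] = field(default_factory=set)
    cached_images: set[str] = field(default_factory=set)


@dataclass
class Pod:
    cpu_m: int
    mem_mi: int
    image: str
    tolerations: set[str] = field(default_factory=set)


def feasible(pod: Pod, node: Node) -> bool:
    """Filtering phase: resource fit plus taint toleration."""
    fits_cpu = node.cpu_capacity_m - node.cpu_used_m >= pod.cpu_m
    fits_mem = node.mem_capacity_mi - node.mem_used_mi >= pod.mem_mi
    tolerated = node.taints <= pod.tolerations  # every taint must be tolerated
    return fits_cpu and fits_mem and tolerated


def score(pod: Pod, node: Node) -> float:
    """Scoring phase: two 0-100 scores summed with equal (assumed) weights."""
    cpu_left = (node.cpu_capacity_m - node.cpu_used_m - pod.cpu_m) / node.cpu_capacity_m
    mem_left = (node.mem_capacity_mi - node.mem_used_mi - pod.mem_mi) / node.mem_capacity_mi
    free_score = 50 * (cpu_left + mem_left)                        # 0..100
    image_score = 100 if pod.image in node.cached_images else 0    # 0 or 100
    return free_score + image_score


def schedule(pod: Pod, nodes: list[Node]) -> Optional[str]:
    """Return the name of the best feasible node, or None if the pod must stay Pending."""
    candidates = [n for n in nodes if feasible(pod, n)]
    if not candidates:
        return None
    return max(candidates, key=lambda n: score(pod, n)).name


if __name__ == "__main__":
    nodes = [
        Node("n1", 4000, 3000, 8192, 4096, cached_images={"web:1"}),
        Node("n2", 4000, 1000, 8192, 2048),
        Node("n3", 4000, 0, 8192, 0, taints={"gpu"}),
    ]
    print(schedule(Pod(500, 512, "web:1"), nodes))
```

In this example n3 is filtered out because the pod does not tolerate its taint, and n1 beats the emptier n2 only because the image-locality bonus outweighs the free-resource score under the assumed equal weights.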
Kubernetes Networking Model
Kubernetes networking has three fundamental rules: (1) Every pod gets its own IP address. (2) Pods can communicate with any other pod, on any node, without NAT (network address translation). (3) Agents on a node can communicate with all pods on that node.
Implementation: a CNI (Container Network Interface) plugin provides the pod network. Popular CNIs: Calico (uses BGP for routing, supports network policies), Cilium (uses eBPF for high-performance networking and observability), Flannel (simple overlay network using VXLAN).
Service networking: a Service is a stable virtual IP (ClusterIP) that load-balances traffic to a set of pods selected by a label selector. When a matching pod is created, destroyed, or changes readiness, the endpoints controllers update the Service's endpoint list (Endpoints and EndpointSlice objects). kube-proxy watches those objects and programs iptables/IPVS rules to route Service IP traffic to the backend pod IPs.
DNS: CoreDNS runs as a Deployment in the cluster and provides DNS resolution. A Service named my-service in namespace default is reachable at my-service.default.svc.cluster.local (cluster.local is the default cluster domain). Pods resolve service names automatically because the kubelet points each pod's resolver configuration at the cluster DNS.
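Two pieces of the Service model above are easy to show in code: how a label selector picks backend pods, and how a Service's DNS name is built. This is a sketch with hypothetical helper names (`select_endpoints`, `service_fqdn`) and a hypothetical pod dictionary layout; it assumes only ready pods are used as backends and that the cluster domain is the default `cluster.local`.

```python
def service_fqdn(name: str, namespace: str, cluster_domain: str = "cluster.local") -> str:
    """Build the in-cluster DNS name for a Service."""
    return f"{name}.{namespace}.svc.{cluster_domain}"


def select_endpoints(selector: dict[str, str], pods: list[dict]) -> list[str]:
    """Return the IPs of ready pods whose labels contain every selector key/value pair."""
    return [
        pod["ip"]
        for pod in pods
        if pod["ready"]
        and all(pod["labels"].get(key) == value for key, value in selector.items())
    ]


if __name__ == "__main__":
    pods = [
        {"ip": "10.0.1.5", "labels": {"app": "web", "tier": "frontend"}, "ready": True},
        {"ip": "10.0.2.7", "labels": {"app": "web"}, "ready": False},  # failing readiness
        {"ip": "10.0.3.9", "labels": {"app": "db"}, "ready": True},    # different app
    ]
    print(service_fqdn("my-service", "default"))
    print(select_endpoints({"app": "web"}, pods))
```

Rerunning `select_endpoints` whenever a pod appears, disappears, or changes readiness mirrors what keeps the Service's endpoint list current.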
Services, Ingress, and External Traffic
Service types:
(1) ClusterIP (default): accessible only within the cluster. Used for internal service-to-service communication.
(2) NodePort: exposes the service on a static port on each node. External traffic hits NodeIP:NodePort and is routed to the service. The default port range is 30000-32767.
(3) LoadBalancer: provisions an external load balancer (cloud provider specific). The load balancer typically routes traffic to NodePorts, which route to pods. This is a common way to expose production services.
(4) ExternalName: maps a service to an external DNS name by returning a CNAME record. No proxying is involved.
Ingress: an API object that manages external HTTP(S) access to services. An Ingress controller (for example NGINX, Traefik, or the AWS Load Balancer Controller) watches Ingress resources and configures the underlying load balancer/proxy. Ingress provides host-based routing (api.example.com -> api-service, web.example.com -> web-service), path-based routing (/api -> api-service, / -> web-service), and TLS termination. Features such as rate limiting are not part of the Ingress specification; controllers add them through their own annotations or custom resources. Gateway API is the successor to Ingress, providing more expressive routing with HTTPRoute, GRPCRoute, and TCPRoute resources.
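Host-based and path-based routing can be illustrated with a small matcher. This is a sketch of the idea, not any Ingress controller's implementation: it assumes exact host matching and prefix matching on whole path segments with the longest prefix winning, and the rule table, the `example.com` host used for the path rules, and the function names are hypothetical.

```python
from typing import Optional

# (host, path prefix, backend service) tuples; values are illustrative only.
RULES = [
    ("api.example.com", "/", "api-service"),
    ("web.example.com", "/", "web-service"),
    ("example.com", "/api", "api-service"),
    ("example.com", "/", "web-service"),
]


def path_matches(prefix: str, path: str) -> bool:
    """Prefix match on whole path segments: /api matches /api and /api/v1, not /apiv2."""
    if prefix == "/":
        return True
    prefix = prefix.rstrip("/")
    return path == prefix or path.startswith(prefix + "/")


def route(rules: list[tuple[str, str, str]], host: str, path: str) -> Optional[str]:
    """Return the backend service for a request, or None if no rule matches."""
    candidates = [r for r in rules if r[0] == host and path_matches(r[1], path)]
    if not candidates:
        return None
    # The longest matching prefix is the most specific rule.
    return max(candidates, key=lambda r: len(r[1]))[2]


if __name__ == "__main__":
    print(route(RULES, "example.com", "/api/users"))
    print(route(RULES, "example.com", "/index.html"))
    print(route(RULES, "unknown.example.org", "/"))
```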
Scaling and Resource Management
Horizontal Pod Autoscaler (HPA) adjusts the number of pod replicas based on metrics. The most common metric is CPU utilization, measured as a percentage of the pods' CPU requests. HPA loop: by default every 15 seconds, query the metrics API for current CPU usage across all pods, then compute desired replicas: desiredReplicas = ceil(currentReplicas * (currentMetric / targetMetric)). If the CPU target is 50% and current utilization is 80% with 3 replicas: desired = ceil(3 * 80/50) = ceil(4.8) = 5, so the HPA scales up to 5 replicas. The result is always clamped to the configured minReplicas and maxReplicas. Custom metrics: scale on request rate, queue depth, or any Prometheus metric via the custom metrics API.
Vertical Pod Autoscaler (VPA) adjusts CPU and memory requests/limits for containers. It monitors actual resource usage over time and recommends (or applies) right-sized resource requests.
Cluster Autoscaler adds nodes when there are pending pods (pods that cannot be scheduled due to insufficient resources) and removes underutilized nodes (nodes with low resource utilization whose pods can be rescheduled elsewhere after a drain).
The combination: HPA scales pods horizontally, VPA right-sizes individual pods, and Cluster Autoscaler scales the infrastructure to match. Avoid letting HPA and VPA both act on CPU or memory for the same workload, because each one's changes shift the signal the other reacts to.
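The replica formula above can be checked with a short function. This is a sketch, not the HPA controller's code: the function name and parameter names are hypothetical, the `min_replicas`/`max_replicas` values are placeholders, and the 10% tolerance band (within which no scaling happens) reflects the documented default of the real controller.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,    # placeholder bound for illustration
    max_replicas: int = 10,   # placeholder bound for illustration
    tolerance: float = 0.1,   # skip scaling when within 10% of the target
) -> int:
    """desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric), clamped."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: avoid flapping
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # The worked example from the text: target 50%, current 80%, 3 replicas.
    print(desired_replicas(3, 80, 50))  # ceil(4.8) = 5
```

The same formula scales down: with 4 replicas at 25% utilization against a 50% target, the ratio is 0.5 and the result is 2 replicas.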
etcd and Cluster State Management
etcd stores all Kubernetes state as key-value pairs under the /registry prefix. Example: the key /registry/pods/default/my-pod holds the full Pod object (metadata, spec, and status) serialized as a Protocol Buffer. etcd uses Raft consensus and is deployed with an odd number of members, typically 3 or 5, so that a majority quorum survives the loss of 1 or 2 members respectively.
Performance: as a rough guide, a well-tuned etcd deployment handles on the order of 10,000 writes per second and 100,000 reads per second, with actual numbers depending heavily on disk latency and object size. This is sufficient for clusters up to approximately 5,000 nodes. Beyond that, etcd becomes the bottleneck and requires careful tuning: dedicated SSD storage (etcd is I/O bound, and fsync latency is the critical metric), an etcd cluster running on machines separate from the control plane nodes, and regular compaction and defragmentation to manage database size.
Watch mechanism: Kubernetes controllers do not talk to etcd directly. They open watches against the API server, which in turn uses the etcd watch API, and receive notifications when resources change. This event-driven architecture is more efficient than polling.
Optimistic concurrency: every object carries a resource version. Clients read a resource, modify it, and send the update along with the resource version they read. If another client modified the resource in the meantime (the version changed), the update is rejected with a 409 Conflict and the client must re-read the object and retry.
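The resource-version check can be demonstrated with an in-memory store. This is a toy model of the idea, not the API server or an etcd client: the `Store` class, its methods, and the `Conflict` exception are hypothetical, and the `Conflict` exception stands in for an HTTP 409 response.

```python
class Conflict(Exception):
    """Raised when an update carries a stale resource version (HTTP 409 in the real API)."""


class Store:
    def __init__(self) -> None:
        self._objects: dict[str, tuple[int, dict]] = {}  # key -> (resource_version, value)
        self._revision = 0  # store-wide counter, bumped on every write

    def create(self, key: str, value: dict) -> int:
        self._revision += 1
        self._objects[key] = (self._revision, value)
        return self._revision

    def get(self, key: str) -> tuple[int, dict]:
        """Return (resource_version, value) for the key."""
        return self._objects[key]

    def update(self, key: str, value: dict, resource_version: int) -> int:
        """Write only if the caller read the latest version; otherwise raise Conflict."""
        current_version, _ = self._objects[key]
        if resource_version != current_version:
            raise Conflict(f"{key}: have {resource_version}, current is {current_version}")
        self._revision += 1
        self._objects[key] = (self._revision, value)
        return self._revision


if __name__ == "__main__":
    store = Store()
    key = "/registry/pods/default/my-pod"
    store.create(key, {"replicas": 1})
    version_a, _ = store.get(key)  # client A reads
    version_b, _ = store.get(key)  # client B reads the same version
    store.update(key, {"replicas": 2}, version_a)      # A wins
    try:
        store.update(key, {"replicas": 3}, version_b)  # B is stale
    except Conflict as err:
        print("conflict:", err)
        fresh_version, _ = store.get(key)               # B re-reads and retries
        store.update(key, {"replicas": 3}, fresh_version)
    print(store.get(key))
```

No locks are held between the read and the write, which is why the pattern is called optimistic: conflicts are detected at write time and resolved by retrying.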