System Design: Kubernetes Architecture Deep Dive — Control Plane, etcd, Scheduler, Kubelet, Pod Lifecycle, Networking

Kubernetes is the de facto standard for container orchestration, running production workloads at companies from startups to Google. Understanding its architecture is essential for system design interviews, SRE roles, and backend engineering positions. This guide covers the internal architecture of Kubernetes — how the control plane works, how pods are scheduled, how networking functions, and the design decisions that make Kubernetes scalable and resilient.

Control Plane Components

The Kubernetes control plane makes global decisions about the cluster (scheduling, detecting and responding to events). Components: (1) kube-apiserver — the REST API frontend for all cluster operations. Every kubectl command, every controller action, and every kubelet report goes through the API server. It validates requests, authenticates callers, and persists state to etcd. Horizontally scalable — run multiple instances behind a load balancer. (2) etcd — the distributed key-value store that holds all cluster state: pod definitions, service configurations, secrets, configmaps, node status. Uses the Raft consensus protocol for data replication. The single source of truth for the cluster. (3) kube-scheduler — watches for newly created pods with no assigned node and selects a node for each. Scheduling decisions consider: resource requirements (CPU, memory), node affinity/anti-affinity rules, taints and tolerations, pod topology spread constraints, and inter-pod affinity. (4) kube-controller-manager — runs controller loops that watch cluster state via the API server and make changes to move the current state toward the desired state. Examples: ReplicaSet controller ensures the correct number of pod replicas, Node controller detects and responds to node failures.
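The controller pattern described in (4) is the core design idea: observe current state, compare with desired state, act to close the gap. A minimal sketch of that loop, in the spirit of the ReplicaSet controller (the names and data structures here are illustrative, not the real client-go API):

```python
# Hypothetical sketch of a controller reconcile loop: drive observed
# state toward desired state, one pass at a time.
from dataclasses import dataclass, field

@dataclass
class ReplicaSet:
    desired_replicas: int
    pods: list = field(default_factory=list)  # names of currently running pods

def reconcile(rs: ReplicaSet) -> ReplicaSet:
    """One pass of the control loop: create or delete pods until
    the observed pod count matches the desired count."""
    diff = rs.desired_replicas - len(rs.pods)
    if diff > 0:                       # too few pods: create replacements
        for _ in range(diff):
            rs.pods.append(f"pod-{len(rs.pods)}")
    elif diff < 0:                     # too many pods: delete the excess
        rs.pods = rs.pods[:rs.desired_replicas]
    return rs

rs = reconcile(ReplicaSet(desired_replicas=3, pods=["pod-0"]))
```

In the real controller-manager this loop is triggered by watch events from the API server rather than called directly, and idempotency matters: running reconcile twice in a row must not over-correct.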

Node Components and Pod Lifecycle

Each worker node runs: (1) kubelet — the agent that ensures containers described in PodSpecs are running and healthy. It watches the API server for pods assigned to its node, pulls container images, starts containers via the container runtime (containerd), monitors container health via liveness and readiness probes, and reports pod status back to the API server. (2) kube-proxy — maintains network rules on the node for the Service abstraction. It implements iptables or IPVS rules that route traffic destined for a Service ClusterIP to the correct backend pod. (3) Container runtime — the software responsible for running containers. containerd is the default since Kubernetes 1.24, when the dockershim integration was removed and Docker Engine ceased to be directly supported as a runtime. Pod lifecycle: Pending (accepted but not yet scheduled, or images not yet pulled), Running (bound to a node with at least one container running), Succeeded (all containers terminated successfully), Failed (all containers terminated and at least one exited with a non-zero code), Unknown (pod status cannot be determined, typically due to a node communication failure).
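The phase rules above can be captured as a small decision function. This is a simplification of the kubelet's actual status logic (which also handles restart policies and init containers), intended only to make the phase definitions concrete:

```python
# Illustrative mapping from container states to a pod phase, following
# the phase definitions above (not the kubelet's real implementation).
def pod_phase(scheduled: bool, container_states: list) -> str:
    """container_states: list of ("running", None) or ("terminated", exit_code)."""
    if not scheduled or not container_states:
        return "Pending"    # not yet assigned to a node, or containers not created
    if any(state == "running" for state, _ in container_states):
        return "Running"    # at least one container is still running
    if all(state == "terminated" and code == 0 for state, code in container_states):
        return "Succeeded"  # every container exited cleanly
    return "Failed"         # all terminated, at least one with a non-zero code

assert pod_phase(True, [("running", None), ("terminated", 0)]) == "Running"
assert pod_phase(True, [("terminated", 0), ("terminated", 1)]) == "Failed"
```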

How the Kubernetes Scheduler Works

The scheduler runs a two-phase algorithm for each unscheduled pod: (1) Filtering — eliminate nodes that cannot run the pod. Filter plugins check: does the node have sufficient CPU and memory (resource fit)? Does the pod tolerate the node taints? Does the node match the pod nodeSelector or nodeAffinity? Are required volumes available on the node? Has the node reached its pod limit? The result is a list of feasible nodes. (2) Scoring — rank the feasible nodes to find the best fit. Scoring plugins assign a score (0-100) to each node based on: resource balance (prefer nodes that would result in even resource utilization across the cluster), affinity/anti-affinity preferences (soft constraints), topology spread (distribute pods evenly across failure domains), and image locality (prefer nodes that already have the container image cached). The node with the highest total score is selected. The scheduler processes approximately 100 pods per second in typical clusters. For very large clusters (5000+ nodes), scheduler throughput is improved by parallelizing the scoring phase and using scheduling profiles.
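The two phases can be sketched in a few lines. The node fields, the scoring formula, and the image-locality bonus below are simplified assumptions standing in for the real filter and score plugins:

```python
# Minimal sketch of the filter-then-score scheduling algorithm.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    cpu_free: float   # free CPU cores
    mem_free: float   # free memory, GiB
    has_image: bool   # container image already cached on this node

def filter_nodes(nodes, cpu_req, mem_req):
    """Filtering: drop nodes that cannot fit the pod's resource requests."""
    return [n for n in nodes if n.cpu_free >= cpu_req and n.mem_free >= mem_req]

def score(node, cpu_req, mem_req):
    """Scoring: least-requested resource balance plus an image-locality bonus."""
    cpu_score = 100 * (node.cpu_free - cpu_req) / node.cpu_free
    mem_score = 100 * (node.mem_free - mem_req) / node.mem_free
    locality = 10 if node.has_image else 0
    return (cpu_score + mem_score) / 2 + locality

def schedule(nodes, cpu_req, mem_req):
    feasible = filter_nodes(nodes, cpu_req, mem_req)
    if not feasible:
        return None  # no feasible node: the pod stays Pending
    return max(feasible, key=lambda n: score(n, cpu_req, mem_req)).name

nodes = [Node("a", 1.0, 2.0, False), Node("b", 4.0, 8.0, True)]
# for a pod requesting 2 CPU / 4 GiB, node "a" is filtered out; "b" is selected
```

The real scheduler runs many filter and score plugins, weights the scores per plugin, and breaks ties randomly, but the shape of the algorithm is exactly this.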

Kubernetes Networking Model

Kubernetes networking has three fundamental rules: (1) Every pod gets its own IP address. (2) Pods can communicate with any other pod without NAT (network address translation). (3) Agents on a node can communicate with all pods on that node. Implementation: a CNI (Container Network Interface) plugin provides the pod network. Popular CNIs: Calico (uses BGP for routing, supports network policies), Cilium (uses eBPF for high-performance networking and observability), Flannel (simple overlay network using VXLAN). Service networking: a Service is a stable virtual IP (ClusterIP) that load-balances traffic to a set of pods selected by a label selector. When a pod is created or destroyed, the Endpoints controller updates the Service endpoint list. kube-proxy programs iptables/IPVS rules to route Service IP traffic to backend pod IPs. DNS: CoreDNS runs as a Deployment in the cluster and provides DNS resolution. A Service named my-service in namespace default is reachable at my-service.default.svc.cluster.local. Pods resolve service names via the cluster DNS automatically.
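The Service-to-pod mapping above is just label matching. A sketch of how the Endpoints controller selects backend pods for a Service (simplified; pod IPs and labels here are made up for the example):

```python
# Sketch of Service endpoint selection by label selector, as the
# Endpoints controller does it (simplified; not the real API objects).
def select_endpoints(pods, selector):
    """pods: list of (pod_ip, labels); selector: dict of required labels.
    A pod matches only if every selector label is present with the same value."""
    return [ip for ip, labels in pods
            if all(labels.get(k) == v for k, v in selector.items())]

pods = [("10.244.1.5", {"app": "web", "tier": "frontend"}),
        ("10.244.2.7", {"app": "web"}),
        ("10.244.1.9", {"app": "db"})]
assert select_endpoints(pods, {"app": "web"}) == ["10.244.1.5", "10.244.2.7"]
```

kube-proxy then programs iptables/IPVS rules so that traffic to the Service ClusterIP is load-balanced across exactly this endpoint list, and the list is recomputed whenever a matching pod is created or destroyed.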

Services, Ingress, and External Traffic

Service types: (1) ClusterIP (default) — accessible only within the cluster. Used for internal service-to-service communication. (2) NodePort — exposes the service on a static port on each node. External traffic hits NodeIP:NodePort and is routed to the service. Port range: 30000-32767. (3) LoadBalancer — provisions an external load balancer (cloud provider specific). The load balancer routes traffic to NodePorts, which route to pods. This is how most production services are exposed. (4) ExternalName — maps a service to a DNS name (CNAME record). No proxying. Ingress: an API object that manages external HTTP(S) access to services. An Ingress controller (nginx, Traefik, AWS ALB) watches Ingress resources and configures the underlying load balancer/proxy. Ingress provides: host-based routing (api.example.com -> api-service, web.example.com -> web-service), path-based routing (/api -> api-service, / -> web-service), TLS termination, and rate limiting. Gateway API is the successor to Ingress, providing more expressive routing with HTTPRoute, GRPCRoute, and TCPRoute resources.
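Host- and path-based Ingress routing reduces to a longest-prefix match per host. A sketch of that decision, with hypothetical rules and service names (a real Ingress controller compiles equivalent rules into nginx or envoy configuration):

```python
# Illustrative host- and path-based routing as an Ingress controller
# performs it: match the host, then pick the longest matching path prefix.
def route(host, path, rules):
    """rules: {host: [(path_prefix, service)]}; returns the backend service
    for the longest matching prefix, or None if nothing matches."""
    candidates = [(p, svc) for p, svc in rules.get(host, []) if path.startswith(p)]
    if not candidates:
        return None
    return max(candidates, key=lambda c: len(c[0]))[1]

rules = {"api.example.com": [("/", "api-service")],
         "web.example.com": [("/api", "api-service"), ("/", "web-service")]}
assert route("web.example.com", "/api/v1/users", rules) == "api-service"
assert route("web.example.com", "/home", rules) == "web-service"
```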

Scaling and Resource Management

Horizontal Pod Autoscaler (HPA) adjusts the number of pod replicas based on metrics. Default metric: CPU utilization. HPA loop: every 15 seconds, query the metrics API for current CPU usage across all pods. Compute desired replicas: desiredReplicas = ceil(currentReplicas * (currentMetric / targetMetric)). If CPU target is 50% and current is 80% with 3 replicas: desired = ceil(3 * 80/50) = ceil(4.8) = 5. Scale up to 5 replicas. Custom metrics: scale on request rate, queue depth, or any Prometheus metric via the custom metrics API. Vertical Pod Autoscaler (VPA) adjusts CPU and memory requests/limits for containers. It monitors actual resource usage over time and recommends (or applies) right-sized resource requests. Cluster Autoscaler adds or removes nodes based on pending pods (pods that cannot be scheduled due to insufficient resources) and underutilized nodes (nodes with low resource utilization that can be drained). The combination: HPA scales pods horizontally, VPA right-sizes individual pods, and Cluster Autoscaler scales the infrastructure to match.
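The HPA replica formula from the paragraph above, as a small helper using the worked example (target 50% CPU, current 80%, 3 replicas); the min/max clamping mirrors the minReplicas/maxReplicas bounds on an HPA object:

```python
# The HPA scaling formula: desired = ceil(current * currentMetric / targetMetric),
# clamped to the configured replica bounds.
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_r=1, max_r=100):
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_r, min(max_r, desired))

assert desired_replicas(3, 80, 50) == 5   # ceil(3 * 80/50) = ceil(4.8) = 5
```

Note the asymmetry described earlier: this formula is applied eagerly on scale-up, but scale-down is damped by the stabilization window so a brief dip in load does not immediately shed replicas.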

etcd and Cluster State Management

etcd stores all Kubernetes state as key-value pairs under the /registry prefix. Example: /registry/pods/default/my-pod stores the full Pod object serialized as a Protocol Buffer. etcd uses Raft consensus with a 3 or 5 node cluster for high availability. Performance: etcd handles approximately 10,000 writes per second and 100,000 reads per second in a well-tuned deployment. This is sufficient for clusters up to approximately 5,000 nodes. Beyond that, etcd becomes the bottleneck and requires careful tuning: dedicated SSD storage (etcd is I/O bound — fsync latency is the critical metric), a separate etcd cluster from the control plane nodes, and regular compaction and defragmentation to manage database size. Watch mechanism: Kubernetes controllers use the etcd watch API (via the API server) to receive notifications when resources change. This event-driven architecture is more efficient than polling. The API server also provides a resource version for optimistic concurrency — clients read a resource, modify it, and send the update with the resource version. If another client modified the resource in the meantime (the version changed), the update is rejected with a 409 Conflict.
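The optimistic-concurrency scheme can be sketched with an in-memory store, modeling the 409 Conflict as an exception (this is an illustration of the mechanism, not the API server's code):

```python
# Sketch of optimistic concurrency with resource versions: an update is
# accepted only if the client's version matches the stored version.
class Conflict(Exception):
    """Stands in for an HTTP 409 Conflict response."""

class Store:
    def __init__(self):
        self.objects = {}   # key -> (resource_version, value)

    def get(self, key):
        return self.objects[key]

    def update(self, key, value, expected_version):
        version, _ = self.objects.get(key, (0, None))
        if expected_version != version:
            # another writer bumped the version since this client's read
            raise Conflict(f"409: expected {expected_version}, stored {version}")
        self.objects[key] = (version + 1, value)

store = Store()
store.objects["/registry/pods/default/my-pod"] = (1, {"phase": "Pending"})
version, _ = store.get("/registry/pods/default/my-pod")
store.update("/registry/pods/default/my-pod", {"phase": "Running"}, version)
# a second writer still holding version 1 would now raise Conflict
```

On a Conflict, clients are expected to re-read the resource, re-apply their change, and retry — the same read-modify-write retry loop that controllers built on client-go perform automatically.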

Frequently Asked Questions

How does the Kubernetes scheduler decide which node to place a pod on?

The scheduler runs a two-phase algorithm: filtering and scoring. Filtering eliminates nodes that cannot run the pod. Checks include: does the node have enough CPU and memory (resource requests fit within allocatable capacity)? Does the pod tolerate the node's taints (a tainted node repels pods without matching tolerations)? Does the node match the pod's nodeSelector or nodeAffinity (e.g., the pod requires a GPU node)? Are the required persistent volumes available on the node? Has the node reached its maximum pod count? After filtering, scoring ranks the remaining feasible nodes. Scoring plugins assign 0-100 points per criterion: LeastRequestedPriority (prefer nodes with more available resources for balanced utilization), InterPodAffinity (prefer nodes where co-located pods already run), ImageLocality (prefer nodes that already have the container image cached, avoiding a pull), and TopologySpreadConstraints (distribute pods evenly across zones). The scores are weighted and summed, and the node with the highest total score wins. If multiple nodes tie, one is chosen randomly. The entire scheduling cycle takes 5-20ms per pod.

What happens when a Kubernetes node fails?

When a node stops sending heartbeats to the API server, the node controller detects the failure. Timeline: the kubelet sends heartbeats every 10 seconds (NodeStatus updates), and the node controller checks every 5 seconds. After the node-monitor-grace-period (default 40 seconds) without a heartbeat, the node is marked Unknown. After the pod-eviction-timeout (default 5 minutes), all pods on the Unknown node are marked for eviction. The Deployment or ReplicaSet controller then detects that the replica count is below the desired count and creates replacement pods, which the scheduler assigns to healthy nodes. Total recovery time: approximately 5-7 minutes from node failure to replacement pods running. To reduce this: lower the pod-eviction-timeout (at the cost of false positives on network blips), use pod disruption budgets to ensure minimum availability during eviction, and configure liveness probes with appropriate thresholds. For stateful workloads (StatefulSets), automatic recovery is more cautious — the system waits longer to avoid split-brain scenarios where the old pod is still running but unreachable.

How does Kubernetes networking allow every pod to communicate with every other pod?

Kubernetes requires a flat network where every pod can reach every other pod by IP address without NAT. This is implemented by Container Network Interface (CNI) plugins. Three common approaches: (1) Overlay networks (Flannel with VXLAN) — each node gets a subnet (e.g., node 1: 10.244.1.0/24, node 2: 10.244.2.0/24), and pods on node 1 get IPs in 10.244.1.x. Cross-node traffic is encapsulated in VXLAN packets (UDP wrapping the original packet); the receiving node decapsulates and delivers. Simple to set up, but encapsulation adds roughly 50 bytes of overhead per packet. (2) Direct routing (Calico with BGP) — each node announces its pod subnet via BGP (Border Gateway Protocol) to the network infrastructure, so routers learn that 10.244.1.0/24 is reachable via node 1. No encapsulation overhead, but requires BGP support from the network. (3) eBPF-based (Cilium) — attaches Linux eBPF programs to network interfaces for packet routing and filtering, providing kernel-level networking with high performance and rich observability (per-pod traffic metrics). Service networking is layered on top: kube-proxy programs iptables or IPVS rules to translate Service ClusterIPs to actual pod IPs.

How does Horizontal Pod Autoscaler work in Kubernetes?

The Horizontal Pod Autoscaler (HPA) automatically adjusts the number of pod replicas based on observed metrics. Default metric: CPU utilization. The HPA controller runs a control loop every 15 seconds (configurable via --horizontal-pod-autoscaler-sync-period). Loop: (1) Query the metrics API for the current CPU usage of all pods in the target Deployment. (2) Compute the ratio currentMetricValue / targetMetricValue. Example: the target CPU is 50% and the current average across 3 pods is 75%, so the ratio is 75/50 = 1.5. (3) Compute desired replicas: ceil(currentReplicas * ratio) = ceil(3 * 1.5) = 5. (4) Scale the Deployment to 5 replicas. Scale-up can double the replica count in a single step by default; scale-down is more conservative — the HPA waits through a 5-minute stabilization window after the last scale-up before scaling down, to prevent flapping. Custom metrics: HPA can scale on any Prometheus metric via the custom metrics API (prometheus-adapter) — request rate, queue depth, or business metrics. When multiple metrics are configured, HPA evaluates all of them and uses the one that recommends the highest replica count.