Why Autoscaling?
Manual capacity planning is expensive and error-prone. Too much capacity wastes money; too little causes outages under traffic spikes. Kubernetes provides several autoscaling mechanisms that adjust capacity automatically based on observed metrics. Understanding which to use — and when — is a key system design and platform engineering skill.
Horizontal Pod Autoscaler (HPA)
HPA adds or removes pod replicas based on observed metrics. The most common metric is CPU utilization, but HPA supports any metric exposed via the Kubernetes Metrics API.
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70      # scale out when avg CPU > 70%
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"        # 1000 RPS per pod
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0 # scale up immediately
      policies:
      - type: Pods
        value: 4
        periodSeconds: 60           # add at most 4 pods per minute
    scaleDown:
      stabilizationWindowSeconds: 300 # wait 5 min before scaling down
```
The HPA control loop runs every 15 seconds. It computes: desiredReplicas = ceil(currentReplicas × (currentMetric / targetMetric)). At 20 pods running at 90% CPU (target 70%), desiredReplicas = ceil(20 × 90/70) = 26. The scale-down stabilization window (default 5 minutes) prevents thrashing: HPA does not immediately scale down after a traffic spike subsides.
Common mistake: setting CPU requests too low. HPA computes CPU utilization as a percentage of requests, not node capacity. If your pod requests 100m CPU but actually uses 500m, Kubernetes thinks it is at 500% utilization, triggering aggressive scaling. Always set requests to match the pod’s actual steady-state usage.
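As a sketch, a container resources block where requests reflect observed steady-state usage (the values here are illustrative assumptions, not measured numbers):

```yaml
# Illustrative resources for the api-server Deployment's container.
# Requests match measured steady-state usage so HPA's 70% utilization
# target is meaningful.
resources:
  requests:
    cpu: 500m        # ~observed steady-state CPU
    memory: 512Mi
  limits:
    memory: 512Mi    # memory limit = request avoids OOM surprises
    # no CPU limit: CPU throttling hurts tail latency more than it helps
```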
Vertical Pod Autoscaler (VPA)
VPA adjusts the CPU and memory requests/limits of existing pods based on observed usage. Instead of adding more replicas, VPA right-sizes each pod. Useful when horizontal scaling is impractical (stateful applications, batch jobs) or when pods are consistently over- or under-provisioned.
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: ml-inference
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-inference
  updatePolicy:
    updateMode: "Auto"  # Off (recommend only), Initial (on pod create), Auto (evict + recreate)
  resourcePolicy:
    containerPolicies:
    - containerName: inference
      minAllowed:
        cpu: 500m
        memory: 1Gi
      maxAllowed:
        cpu: 8
        memory: 16Gi
```
VPA limitation: Auto mode evicts and recreates pods to apply new resource requests, which causes brief interruptions. In-place pod resource updates (alpha since Kubernetes 1.27) will eventually let VPA resize pods without a restart. Do NOT run HPA and VPA on CPU/memory simultaneously against the same deployment; they conflict. Use VPA for resource sizing and HPA for replica count.
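One way to combine the two safely, sketched under the assumption that your VPA version supports the controlledResources container policy: let VPA manage memory only, while HPA scales replicas on CPU.

```yaml
# Hypothetical split: VPA right-sizes memory; HPA owns CPU-driven replica count.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: ml-inference-memory
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-inference
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: inference
      controlledResources: ["memory"]  # leave CPU requests alone for HPA
```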
KEDA: Event-Driven Autoscaling
HPA based on CPU/RPS does not capture queue depth — a burst of 10,000 Kafka messages does not immediately spike CPU. KEDA (Kubernetes Event-Driven Autoscaling) scales workloads based on external event sources: Kafka lag, SQS queue depth, Redis list length, Prometheus metrics, Azure Service Bus, and 50+ others.
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor
spec:
  scaleTargetRef:
    name: order-processor-deployment
  minReplicaCount: 0   # can scale to zero!
  maxReplicaCount: 100
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka.production:9092
      consumerGroup: order-processor-group
      topic: orders
      lagThreshold: "50"  # one replica per 50 messages of lag
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: pending_jobs_total
      query: sum(pending_jobs_total{queue="critical"})
      threshold: "10"
```
KEDA can scale to zero replicas when there are no events — perfect for background workers that only need to run when there is work. It scales up instantly when events arrive. This is a significant cost saving for batch processing workloads that are idle most of the day.
Cluster Autoscaler
HPA and KEDA scale pods — but if all nodes are full, new pods sit Pending (no capacity). Cluster Autoscaler watches for Pending pods and provisions new nodes, then removes nodes whose utilization falls below a threshold (typically 50%) after a cool-down period.
Configuration in AWS EKS: Cluster Autoscaler watches Auto Scaling Groups tagged with kubernetes.io/cluster-autoscaler/enabled. When it decides to scale up, it requests a new EC2 instance from the ASG. Scale-down waits: a node must be underutilized for 10 minutes before eviction; pods with PodDisruptionBudgets are respected. Scale-up is faster (1-3 minutes for a new node to become Ready) than scale-down (10+ minutes — aggressive scale-down can cause performance degradation during traffic rebounds).
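Since Cluster Autoscaler honors PodDisruptionBudgets during scale-down, a minimal PDB sketch for the workload is worth having (name and labels here are illustrative):

```yaml
# Illustrative PDB: Cluster Autoscaler will not drain a node if doing so
# would drop api-server below 2 ready replicas.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api-server
```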
Scaling to Zero with KEDA and Spot Instances
Batch processing workloads (nightly ETL, image processing pipelines) are ideal for scale-to-zero with spot/preemptible instances:
- During the day: 0 replicas of the batch worker (KEDA minReplicaCount: 0)
- Files arrive in S3 → S3 event → SQS queue
- KEDA detects SQS queue depth > 0 → scale from 0 to N workers
- Workers use spot instances (Karpenter or Cluster Autoscaler with spot node groups — 60-80% cheaper)
- Queue drains → KEDA scales back to 0 → Cluster Autoscaler terminates spot nodes
This pattern achieves near-zero idle cost for batch workloads while handling any burst size.
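The SQS leg of this pattern might look like the following ScaledObject sketch (the queue URL, names, and the TriggerAuthentication reference are placeholders):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: etl-worker
spec:
  scaleTargetRef:
    name: etl-worker-deployment
  minReplicaCount: 0        # idle = zero pods, near-zero cost
  maxReplicaCount: 50
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/etl-input
      queueLength: "20"     # one replica per 20 visible messages
      awsRegion: us-east-1
    authenticationRef:
      name: keda-aws-credentials  # assumed TriggerAuthentication for AWS auth
```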
Karpenter: Node Provisioning Alternative
Cluster Autoscaler is tied to Auto Scaling Groups and requires pre-defined node pools. Karpenter (AWS-native) provisions exactly the right EC2 instance type for each workload — it reads pending pod resource requests and launches the most cost-efficient instance (e.g., a c6g.xlarge for CPU-bound pods, r6g.2xlarge for memory-bound). Karpenter provisions nodes in under 60 seconds (vs. 2-3 minutes for Cluster Autoscaler) and supports spot instance consolidation — replacing multiple spot instances with fewer, cheaper ones when prices change.
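A minimal Karpenter NodePool sketch, assuming the v1beta1 API; the instance constraints, limits, and the EC2NodeClass name are illustrative:

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: batch-spot
spec:
  template:
    spec:
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot"]          # spot only for batch workloads
      - key: kubernetes.io/arch
        operator: In
        values: ["arm64"]
      nodeClassRef:
        name: default             # assumed EC2NodeClass
  limits:
    cpu: "200"                    # cap total CPU this pool may provision
  disruption:
    consolidationPolicy: WhenUnderutilized  # replace nodes with cheaper ones
```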
Key Interview Points
- HPA scales replicas based on CPU, memory, or custom metrics; scale-down stabilization window prevents thrashing
- Set CPU requests accurately — HPA utilization is percent of requests, not node capacity
- VPA right-sizes resource requests; do not combine with HPA on CPU/memory
- KEDA scales on external events (Kafka lag, SQS depth) and can scale to zero for batch workloads
- Cluster Autoscaler provisions/removes nodes; Karpenter is faster and more cost-efficient for AWS
- Spot instances + KEDA scale-to-zero: optimal cost pattern for variable batch workloads
Frequently Asked Questions
How does the Horizontal Pod Autoscaler decide when to scale?
HPA runs a control loop every 15 seconds. It fetches the current metric value from the Kubernetes Metrics API (for CPU/memory) or the Custom Metrics API (for application metrics like RPS or queue depth). It then applies the formula: desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue). For example, if the target CPU utilization is 70% and 10 pods are running at 105% average utilization, desiredReplicas = ceil(10 × 105/70) = 15 — HPA creates 5 new pods. HPA never scales below minReplicas or above maxReplicas. For scale-down, the stabilization window (default 5 minutes) prevents rapid oscillation: HPA tracks the maximum recommended replica count over the window and does not scale down below that. This means after a traffic spike subsides, HPA waits 5 minutes before reducing replicas, preventing thrashing if traffic is bursty. A critical prerequisite: pods must have CPU requests set. HPA computes utilization as (actual CPU) / (requested CPU) — if requests are too low, even light loads appear as 100%+ utilization and trigger constant scaling.
What is KEDA and how does it differ from HPA?
KEDA (Kubernetes Event-Driven Autoscaling) extends HPA's capabilities to scale workloads based on external event sources beyond CPU and memory. HPA is limited to metrics from the Kubernetes Metrics API (CPU, memory) or Custom Metrics from Prometheus. KEDA provides native scalers for 50+ external systems: Kafka consumer lag, SQS queue depth, Azure Service Bus message count, Redis list length, Cron schedules, and custom Prometheus queries. KEDA's key capability that HPA cannot match: scaling to zero replicas. When no events exist (empty Kafka topic, zero SQS messages), KEDA scales the deployment to 0 pods — no idle compute cost. When events arrive, KEDA scales from 0 to N pods within seconds. HPA has a minimum of 1 replica. This makes KEDA ideal for batch processing and background worker workloads that are idle most of the time. Under the hood, KEDA creates a standard HPA object and manages it — so both can coexist with the right configuration. For event-driven workloads (queue consumers, scheduled batch jobs), KEDA is the correct tool; HPA is best for request-driven services (HTTP APIs) where CPU/RPS correlates with load.
How do you prevent autoscaling from causing a thundering herd when scaling up from zero?
Scaling from zero to many replicas can cause problems if all new pods hit downstream systems simultaneously. Mitigation strategies: (1) Pod startup time management: ensure pods start quickly (< 30 seconds) and are fully initialized before receiving traffic. Use readiness probes so pods only receive traffic when ready. Slow JVM or model-loading startups must complete before the pod is marked Ready. (2) KEDA lagThreshold tuning: configure one replica per N messages (e.g., lagThreshold: 50 means each pod handles 50 messages of lag). KEDA scales gradually — if lag is 500, it provisions 10 replicas, not 500. This controls the scale-up rate. (3) Warm pools: for critical services, keep a minimum of 1-3 pods always running (minReplicaCount: 1 in KEDA) to handle the first burst while new pods spin up. (4) Database connection pooling: new pods immediately opening database connections can exhaust the connection pool. Use PgBouncer/RDS Proxy so new pods connect to the pooler, not directly to the database. (5) Rate limiting at scale-up: HPA's scaleUp behavior policy allows limiting how many pods are added per time period (e.g., a Pods policy with value: 10 and periodSeconds: 60, adding at most 10 pods per minute), preventing a thundering herd when recovering from a scale-to-zero state.
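The scale-up rate limit in point (5) lives in the HPA v2 behavior field; a sketch:

```yaml
# Fragment of an HPA spec: cap the scale-up rate to 10 pods per minute.
behavior:
  scaleUp:
    policies:
    - type: Pods
      value: 10          # add at most 10 pods...
      periodSeconds: 60  # ...per 60-second window
```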