Autoscaler: Core Responsibility
An autoscaler dynamically adjusts the number of compute instances (replicas, pods, VMs) running a service based on observed load. The goal is to have just enough capacity to handle traffic without over-provisioning. Under-provisioning causes latency spikes and errors; over-provisioning wastes money. Getting the scaling algorithm right requires balancing responsiveness to load changes against stability (avoiding thrashing — rapidly scaling up and down).
Scaling Metrics
The autoscaler needs a signal that reliably indicates whether capacity is too low or too high:
- CPU utilization: The most common signal. Target 60–70% average CPU across all replicas — high enough to be efficient, low enough to absorb traffic spikes before new instances are ready.
- Memory utilization: Useful for memory-bound services but dangerous as a primary signal — memory rarely decreases quickly (GC behavior, caching), making it a poor scale-down trigger.
- Queue depth: For worker services consuming from a queue (Kafka, SQS), scale based on messages per consumer: target_replicas = ceil(queue_depth / desired_msgs_per_consumer).
- Requests per second per instance: For stateless HTTP services, RPS per instance is a clean, application-level signal unaffected by CPU efficiency differences between instance types.
- Custom metrics: Business-level metrics via adapter APIs (Kubernetes Custom Metrics API, KEDA). Examples: active database connections, cache hit rate, pending jobs.
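The queue-depth signal above reduces to a one-line calculation. A minimal sketch (function name and the replica bounds are illustrative, not from any specific autoscaler API):

```python
import math

def queue_based_replicas(queue_depth: int, msgs_per_consumer: int,
                         min_replicas: int = 1, max_replicas: int = 50) -> int:
    """Size the consumer fleet so each instance handles roughly
    msgs_per_consumer messages, clamped to illustrative bounds."""
    desired = math.ceil(queue_depth / msgs_per_consumer)
    return max(min_replicas, min(desired, max_replicas))
```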
Desired Replica Calculation
The Kubernetes HPA formula for metric-based scaling:
desired_replicas = ceil(current_replicas * (current_metric / target_metric))
# Example: 10 replicas at 80% CPU, target 60%
desired_replicas = ceil(10 * (80 / 60)) = ceil(13.33) = 14
Always round up (ceil): on scale-up this adds capacity one step sooner, and on scale-down it removes capacity conservatively — it is safer to have one extra replica than one too few. Apply min and max bounds: max(min_replicas, min(desired, max_replicas)).
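The formula plus bounds can be sketched as a single function (the default bounds here are illustrative):

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float,
                     min_replicas: int = 2, max_replicas: int = 100) -> int:
    """HPA-style proportional scaling: scale by the metric ratio, round up,
    then clamp to the configured floor and ceiling."""
    desired = math.ceil(current_replicas * (current_metric / target_metric))
    return max(min_replicas, min(desired, max_replicas))
```

For example, 10 replicas at 80% CPU with a 60% target yields 14, matching the worked example above.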
Scale-Up Bias
Scale up faster than scale down. Traffic spikes are dangerous; scale-down is just expensive. Typical policy: scale up immediately when the metric exceeds threshold; scale down only after the metric has been below threshold for a sustained period. This asymmetry is controlled by separate cooldown windows and the stabilization window.
Cooldown Windows
Cooldown windows prevent thrashing by enforcing a minimum time between scaling actions:
- Scale-up cooldown: 60 seconds — short, because under-capacity is urgent. Wait for new instances to start and receive traffic before deciding to scale further.
- Scale-down cooldown: 300 seconds — longer, because scale-down should only happen when load is sustainably lower, not just momentarily quiet between bursts.
During cooldown, the autoscaler continues observing metrics but does not change replica count. After cooldown expires, it recalculates and applies the new desired count.
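A minimal sketch of cooldown enforcement (class and method names are illustrative; the 60 s / 300 s defaults follow the values above):

```python
import time

class CooldownGate:
    """Enforces a minimum time between scaling actions: a short cooldown
    before scale-up, a longer one before scale-down."""

    def __init__(self, scale_up_cooldown: float = 60.0,
                 scale_down_cooldown: float = 300.0):
        self.scale_up_cooldown = scale_up_cooldown
        self.scale_down_cooldown = scale_down_cooldown
        self.last_action_time = float("-inf")  # no action taken yet

    def try_scale(self, direction_up: bool, now=None) -> bool:
        """Return True and record the action if the cooldown has expired."""
        if now is None:
            now = time.monotonic()
        cooldown = (self.scale_up_cooldown if direction_up
                    else self.scale_down_cooldown)
        if now - self.last_action_time < cooldown:
            return False  # still cooling down: keep observing, do not act
        self.last_action_time = now
        return True
```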
Stabilization Window
The stabilization window addresses oscillation around the threshold. Instead of acting on the current metric value, use the maximum observed metric value over the last N seconds for scale-up decisions (most conservative: scale to handle the peak seen recently), and the minimum observed value for scale-down (most conservative: scale down only when load has been low for the entire window).
# For scale-up: use worst-case metric in window
effective_metric = max(metrics_in_last_60s)
# For scale-down: use best-case metric in window
effective_metric = min(metrics_in_last_300s)
This prevents scaling down after a 5-second quiet period during a generally busy hour.
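A sketch of a stabilization window over timestamped samples (class name and the 60 s / 300 s windows are illustrative, following the values above):

```python
from collections import deque

class StabilizationWindow:
    """Keeps recent (timestamp, value) metric samples. Scale-up decisions
    read the window max (worst case); scale-down reads the window min."""

    def __init__(self, up_window: float = 60.0, down_window: float = 300.0):
        self.up_window = up_window
        self.down_window = down_window
        self.samples = deque()  # (timestamp, value) pairs, oldest first

    def record(self, now: float, value: float) -> None:
        self.samples.append((now, value))
        # Drop samples older than the longest window we ever consult.
        horizon = now - max(self.up_window, self.down_window)
        while self.samples and self.samples[0][0] < horizon:
            self.samples.popleft()

    def scale_up_metric(self, now: float) -> float:
        return max(v for t, v in self.samples if t >= now - self.up_window)

    def scale_down_metric(self, now: float) -> float:
        return min(v for t, v in self.samples if t >= now - self.down_window)
```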
Predictive Scaling
Reactive scaling always lags — instances take 60–300 seconds to start, meaning traffic spikes are absorbed by existing capacity at degraded performance. Predictive scaling pre-scales based on historical patterns:
- Collect historical metric data with timestamps.
- Build a time-series model (simple: same-time-last-week average; advanced: Facebook Prophet, LSTM).
- Forecast load 10 minutes ahead. If the forecast exceeds current capacity at target utilization, start new instances now.
- Combine with reactive scaling: predictive handles known patterns (daily traffic curves), reactive handles unexpected spikes.
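The simplest forecaster mentioned above — same time last week — can be sketched as follows (function name, the history format, and the per-replica capacity parameter are all illustrative):

```python
import math

WEEK_SECONDS = 7 * 24 * 3600

def predictive_replicas(history, now, rps_per_replica, horizon=600):
    """Forecast load `horizon` seconds ahead using the value observed at the
    same time last week, then size the fleet for that load.

    history: dict mapping epoch-second timestamps to observed RPS.
    Returns None when no historical sample exists (caller falls back
    to reactive scaling)."""
    forecast = history.get(now + horizon - WEEK_SECONDS)
    if forecast is None:
        return None
    return math.ceil(forecast / rps_per_replica)
```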
Vertical Scaling and Right-Sizing
Horizontal autoscaling (more instances) addresses throughput. Vertical scaling (larger instance type) addresses per-request resource needs. Vertical Pod Autoscaler (VPA) observes actual CPU and memory usage over time and recommends (or automatically applies) resource requests and limits. Right-sizing prevents scenarios where CPU-based HPA scales to 20 replicas because each replica was allocated only 0.1 CPU but actually needs 0.5 CPU.
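A VPA-style recommendation boils down to a percentile of observed usage plus headroom. A minimal sketch (the p90 percentile and 15% headroom are hypothetical choices, not VPA's actual defaults):

```python
def recommend_cpu_request(usage_samples, headroom=1.15):
    """Recommend a CPU request (in cores) from observed usage samples:
    take a high percentile of usage and add headroom for spikes."""
    ordered = sorted(usage_samples)
    p90 = ordered[min(len(ordered) - 1, int(0.9 * len(ordered)))]
    return round(p90 * headroom, 3)
```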
Scale-to-Zero
For dev/staging environments or infrequently used services, scale to zero replicas when idle (no traffic for N minutes) and scale up from zero on incoming request. KEDA and Knative support scale-to-zero. The trade-off: cold start latency (first request after idle period waits for instance startup). Mitigate with a keepalive minimum of 1 replica during business hours.
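The idle-detection decision, including the business-hours keepalive, can be sketched as (function name and the 15-minute idle threshold are illustrative):

```python
def scale_to_zero_decision(last_request_time, now,
                           idle_minutes=15.0, business_hours=False):
    """Return the minimum replica count: 0 once the service has been idle
    past the threshold, but keep 1 warm replica during business hours
    to avoid cold-start latency."""
    if business_hours:
        return 1
    idle = (now - last_request_time) >= idle_minutes * 60
    return 0 if idle else 1
```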
Safe Instance Draining Before Termination
Abruptly terminating instances mid-request causes errors. Safe scale-down draining:
- Mark the instance as draining: stop sending new requests (deregister from load balancer target group).
- Wait for in-flight requests to complete (connection draining timeout: 30–60 seconds).
- Send SIGTERM to the process; application starts graceful shutdown.
- Wait for graceful shutdown (process exits or SIGKILL timeout fires after 30 seconds).
- Terminate the instance.
In Kubernetes, this is the terminationGracePeriodSeconds and preStop hook pattern. The preStop hook runs before SIGTERM, giving the load balancer time to deregister the pod and in-flight connections time to complete before the termination signal arrives.
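The application side of draining — reject new work, track in-flight requests, wait for them to finish before exiting — can be sketched as (class and method names are illustrative; a SIGTERM handler would call start_draining):

```python
import threading

class DrainState:
    """Tracks draining status and in-flight requests during graceful shutdown.
    The request path calls try_start()/finish(); the shutdown path calls
    start_draining() and then wait_idle() before the process exits."""

    def __init__(self):
        self._draining = False
        self._in_flight = 0
        self._cond = threading.Condition()

    def start_draining(self):
        with self._cond:
            self._draining = True
            self._cond.notify_all()

    def try_start(self) -> bool:
        """Admit a request only while not draining."""
        with self._cond:
            if self._draining:
                return False
            self._in_flight += 1
            return True

    def finish(self):
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def wait_idle(self, timeout: float = 30.0) -> bool:
        """Wait for in-flight work to complete; False means the grace period
        expired and a SIGKILL would follow."""
        with self._cond:
            return self._cond.wait_for(
                lambda: self._draining and self._in_flight == 0, timeout)
```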
Trade-offs and Failure Modes
- Metric lag: CPU metrics from the cloud provider can be 30–60 seconds stale. Use application-level metrics (RPS from the load balancer) for faster signal.
- Thundering herd on scale-up: When new instances start and register with the load balancer, they may receive a surge of queued requests before their caches are warm. Ramp up traffic to new instances gradually using weighted routing.
- Min replicas floor: Always keep at least 2 replicas for high-availability (one can be upgraded or fail without downtime). Never scale to 1 in production.
- Max replicas ceiling: Set a realistic max to prevent runaway cost from a bug that generates infinite load. Monitor when the ceiling is hit — it means the service needs architectural capacity work, not just more instances.