Low Level Design: Graceful Shutdown

Graceful shutdown ensures a service stops cleanly: completing in-flight requests, draining connections, flushing buffers, and releasing resources before the process exits. Abrupt termination drops in-flight requests, abandons partially written files and unflushed buffers, and leaves downstream services holding broken connections. Graceful shutdown is a foundational pattern for zero-downtime deployments and rolling restarts.

SIGTERM and Signal Handling

The operating system sends SIGTERM to request graceful shutdown (from kill, systemd, Docker stop, Kubernetes). The application registers a SIGTERM handler that initiates the shutdown sequence. SIGKILL cannot be caught and terminates the process immediately — reserved for forced shutdown after the graceful timeout expires. In Kubernetes, the pod lifecycle is: SIGTERM → terminationGracePeriodSeconds (default 30s) → SIGKILL. The application must complete shutdown within the grace period.

HTTP Server Graceful Shutdown

On SIGTERM: (1) stop accepting new connections (remove the server from the load balancer rotation by failing health checks); (2) wait for existing connections to complete their current request; (3) close idle connections (those waiting for a new request on keep-alive); (4) close the listening socket. Go's http.Server.Shutdown() implements this: it closes the listener immediately, closes idle connections, and waits for active handlers to complete. Pass a context with a timeout to bound the wait.

Load Balancer Deregistration

Before stopping, deregister from the load balancer so new requests are not routed to the shutting-down instance. In AWS: call deregister-targets on the ALB target group. In Kubernetes: when the pod enters the Terminating state it is removed from the Service's Endpoints, and kube-proxy updates routing rules within a few seconds; failing the readiness probe on SIGTERM covers load balancers that poll health directly. Add a preStop hook sleep (5-10s) to give the deregistration time to propagate before the server stops accepting connections, preventing a brief window in which requests are still routed to a shutting-down pod.

Connection Draining

Connection draining allows existing connections to complete while rejecting new ones. AWS ALB draining: when a target is deregistered, the ALB sends no new requests but keeps existing connections alive for the deregistration_delay (default 300s, typically set to 30-60s). The application must respond to all requests that arrive during this window. Kubernetes: the terminationGracePeriodSeconds provides the equivalent window. Both mechanisms ensure in-flight requests complete before the process is killed.

Worker and Queue Draining

Background workers (job queues, Kafka consumers) have different shutdown semantics than HTTP servers. On SIGTERM: stop polling for new jobs, finish processing the current job, commit the offset or acknowledge the message, then exit. Do not abandon in-flight jobs; they will be re-queued and retried by another worker. Bound the shutdown time: if a job takes more than N seconds, mark it as stuck and exit; the job will time out on the queue and be retried. Kafka consumers must commit offsets before shutdown to avoid reprocessing.

Database Connection Cleanup

Database connection pools should be closed during shutdown after active queries complete. Call pool.Close() or pool.Drain() in the shutdown sequence. Without cleanup, the database server accumulates connections from shutting-down instances that remain in TIME_WAIT or half-open state until TCP timeouts clean them up (several minutes). This consumes connection slots and can exhaust the database connection limit during rapid rolling deployments. Explicit connection pool teardown releases connections immediately.

Shutdown Timeout and Force Kill

Set a maximum shutdown time to prevent a stuck process from blocking deployments indefinitely. If graceful shutdown has not completed within the timeout, force kill (SIGKILL or process group kill). Choose the timeout based on the p99 request duration plus connection drain window: for a service with p99 latency of 2 seconds and 30-second load balancer drain, set terminationGracePeriodSeconds to ~60s. Log and alert when SIGKILL is used — it indicates either the timeout is too short or the application has a stuck shutdown path.
