What Is Connection Draining?
Connection draining is the process of gracefully removing a server from service by allowing its in-flight requests to complete before shutting it down. Without draining, a hard shutdown drops all active connections, failing any requests still in flight.
Draining is essential for zero-downtime deployments, rolling restarts, autoscaling scale-in events, and maintenance windows. It is a first-class concern in any production service deployment system.
Drain Sequence
The canonical drain sequence has three phases:
- Signal drain: the orchestrator (Kubernetes, ECS, load balancer) marks the instance as draining. The health check endpoint immediately starts returning 503. The load balancer stops routing new connections to this instance.
- Wait for in-flight: the server continues to handle existing connections and in-flight requests. An atomic counter tracks the number of in-flight requests.
- Shutdown: when the in-flight counter reaches zero (or the drain timeout expires), the process exits cleanly.
Load Balancer Health Signal
The health check endpoint is the communication channel between the draining server and the load balancer. During normal operation, it returns 200. When draining begins, it immediately returns 503 — even before any in-flight requests finish. This tells the load balancer to stop sending new traffic without waiting for existing connections to close.
Health check interval and failure threshold determine how quickly the LB stops routing. For fast drain, use a short health check interval (e.g., 5 seconds) and a low failure threshold (e.g., 1 consecutive failure). This minimizes the window during which new requests can still be routed to the draining instance.
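The health endpoint's behavior during drain can be sketched as a small function (a minimal sketch; the `draining` event and handler names are illustrative, and the flag would be set by the drain/SIGTERM handler):

```python
import threading

draining = threading.Event()  # set by the drain/SIGTERM handler

def health_check() -> tuple[int, str]:
    """Return (status_code, body) for the LB health probe."""
    # Flip to 503 the moment drain starts -- before in-flight requests
    # finish -- so the LB stops routing within one health check interval.
    if draining.is_set():
        return 503, "draining"
    return 200, "ok"
```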
In-Flight Request Tracking
An atomic integer counter is incremented when each request starts and decremented when it completes (including error paths). The drain wait loop checks this counter:
while inflight_count > 0 and elapsed < drain_timeout:
    sleep(100ms)
Using an atomic counter (or a thread-safe semaphore) is critical — race conditions between request completion and counter decrement can cause premature shutdown or indefinite wait.
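The increment/decrement pairing is easiest to get right with a try/finally or a context manager, so the error path always decrements. A sketch (the `track_inflight` name is illustrative; `handler` stands in for the application's real request handler):

```python
import threading
from contextlib import contextmanager

inflight = 0
lock = threading.Lock()

@contextmanager
def track_inflight():
    """Increment on entry, decrement on exit -- including error paths."""
    global inflight
    with lock:
        inflight += 1
    try:
        yield
    finally:
        with lock:
            inflight -= 1

def handle_request(handler, *args):
    # The counter stays correct even if the handler raises.
    with track_inflight():
        return handler(*args)
```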
Drain Timeout
A configurable drain timeout (e.g., 30 seconds) caps how long the server waits for in-flight requests. After the timeout, remaining connections are forcefully closed. The timeout must be set with the expected p99 request duration in mind — if your slowest requests take 10 seconds, a 30-second drain timeout is reasonable. Kubernetes's terminationGracePeriodSeconds must be set longer than the application drain timeout to give the drain logic time to run before the container is SIGKILLed.
Long-Polling and WebSocket Handling
Persistent connections (long-poll, SSE, WebSocket) do not complete in the normal request cycle. Special handling is required:
- WebSocket: on drain start, send a close frame to all active WebSocket connections. Clients should reconnect to another instance. Wait for the close handshake to complete.
- Long-poll: return an empty response immediately to all pending long-poll handlers, signaling clients to re-poll against another instance.
- SSE: send a retry directive (the SSE retry field) and close the stream, directing clients to reconnect after a short delay.
These connections should be excluded from the standard in-flight counter or tracked in a separate persistent-connection counter with its own shutdown path.
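One way to structure the separate shutdown path is a registry of persistent connections with its own close-all hook (a sketch; the class name is illustrative, and `close()` stands in for whatever the transport requires: a WebSocket close frame, an empty long-poll response, or an SSE retry event):

```python
import threading

class PersistentConnRegistry:
    """Tracks long-lived connections separately from the in-flight counter."""

    def __init__(self):
        self._conns = set()
        self._lock = threading.Lock()

    def register(self, conn):
        with self._lock:
            self._conns.add(conn)

    def unregister(self, conn):
        with self._lock:
            self._conns.discard(conn)

    def close_all(self):
        """On drain start: tell every persistent client to reconnect elsewhere."""
        with self._lock:
            conns = list(self._conns)
        for conn in conns:
            # Close frame / empty poll response / SSE retry, per protocol.
            conn.close()
```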
Deployment Coordination
In a Kubernetes rolling deployment:
- Kubernetes sends SIGTERM to the container.
- The application signal handler sets is_draining = True and starts returning 503 from the health check.
- The application waits for in-flight requests to complete (up to the drain timeout).
- The process exits with code 0.
- Kubernetes confirms process exit and proceeds to terminate the pod.
The preStop hook in the pod spec can add a short sleep (e.g., 5 seconds) before SIGTERM is sent, giving the LB time to process the 503 and stop routing before the drain starts.
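These constraints translate into a pod spec along these lines (a sketch; the container name, image, port, and values are illustrative, and terminationGracePeriodSeconds is deliberately larger than the preStop sleep plus the 30-second application drain timeout):

```yaml
spec:
  terminationGracePeriodSeconds: 45   # > preStop sleep (5s) + app drain timeout (30s)
  containers:
    - name: app
      image: example/app:latest       # illustrative
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "5"]   # let the LB stop routing before SIGTERM arrives
      readinessProbe:
        httpGet:
          path: /healthz              # returns 503 once draining
          port: 8080
        periodSeconds: 5              # short interval + low threshold = fast drain signal
        failureThreshold: 1
```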
SQL Schema
CREATE TABLE ServerInstance (
    id SERIAL PRIMARY KEY,
    host TEXT NOT NULL,
    port INT NOT NULL,
    status TEXT NOT NULL DEFAULT 'active' CHECK (status IN ('active', 'draining', 'stopped')),
    inflight_count INT NOT NULL DEFAULT 0,
    drain_started_at TIMESTAMPTZ,
    drained_at TIMESTAMPTZ,
    UNIQUE (host, port)
);

CREATE TABLE DrainEvent (
    id SERIAL PRIMARY KEY,
    instance_id INT REFERENCES ServerInstance(id),
    event_type TEXT NOT NULL,
    timestamp TIMESTAMPTZ NOT NULL DEFAULT now(),
    details JSONB
);

CREATE INDEX idx_drain_instance ON DrainEvent (instance_id, timestamp);
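The schema supports operational queries during rollouts; for example (illustrative, assuming a 30-second drain budget):

```sql
-- Instances stuck in drain longer than the 30-second budget
SELECT id, host, port, inflight_count, now() - drain_started_at AS draining_for
FROM ServerInstance
WHERE status = 'draining'
  AND drain_started_at < now() - interval '30 seconds';
```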
Python Implementation Sketch
import json
import signal
import threading
import time


class DrainManager:
    def __init__(self, db, instance_id: int, drain_timeout: int = 30):
        self.db = db
        self.instance_id = instance_id
        self.drain_timeout = drain_timeout
        self._inflight = 0
        self._lock = threading.Lock()
        self._draining = False
        signal.signal(signal.SIGTERM, self._handle_sigterm)

    def is_healthy(self) -> bool:
        # Health endpoint: 200 while healthy, 503 as soon as drain begins.
        return not self._draining

    def begin_drain(self):
        self._draining = True
        self.db.execute(
            "UPDATE ServerInstance SET status = 'draining', drain_started_at = now() WHERE id = %s",
            (self.instance_id,),
        )
        self._log_event('drain_begin', {'inflight': self._inflight})

    def track_request_start(self):
        with self._lock:
            self._inflight += 1
        self.db.execute(
            "UPDATE ServerInstance SET inflight_count = inflight_count + 1 WHERE id = %s",
            (self.instance_id,),
        )

    def track_request_end(self):
        # Must also run on error paths, e.g. from a finally block in the middleware.
        with self._lock:
            self._inflight -= 1
        self.db.execute(
            "UPDATE ServerInstance SET inflight_count = inflight_count - 1 WHERE id = %s",
            (self.instance_id,),
        )

    def await_drain_complete(self, timeout: int | None = None) -> bool:
        timeout = timeout or self.drain_timeout
        # monotonic() is immune to wall-clock jumps during the wait.
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            with self._lock:
                if self._inflight <= 0:
                    break
            time.sleep(0.1)
        with self._lock:
            success = self._inflight <= 0
            remaining = self._inflight
        self.db.execute(
            "UPDATE ServerInstance SET status = 'stopped', drained_at = now() WHERE id = %s",
            (self.instance_id,),
        )
        self._log_event('drain_complete', {'success': success, 'remaining_inflight': remaining})
        return success

    def _handle_sigterm(self, signum, frame):
        self.begin_drain()
        self.await_drain_complete()
        raise SystemExit(0)

    def _log_event(self, event_type: str, details: dict):
        self.db.execute(
            "INSERT INTO DrainEvent (instance_id, event_type, details) VALUES (%s, %s, %s)",
            # Serialize the dict so it binds cleanly to the JSONB column.
            (self.instance_id, event_type, json.dumps(details)),
        )
FAQ
What happens when the drain timeout is exceeded?
When the drain timeout expires, the server forcefully closes the remaining in-flight connections and exits; clients with active connections receive a connection reset. In Kubernetes, this means the process exits before terminationGracePeriodSeconds is reached. To minimize impact, set the drain timeout above the p99 request duration, and have long-running operations implement their own cancellation logic triggered by the drain signal.
How is the drain timeout configured?
Set it slightly above the 99th-percentile latency of the slowest expected request type, so that nearly all in-flight requests complete before forced shutdown. Values typically range from a few seconds for stateless HTTP APIs to several minutes for batch-processing or long-polling workloads, and are tuned by examining request duration histograms in production.
How does the load balancer detect and coordinate with drain?
The health check endpoint is the primary signal: on drain start it immediately returns 503, before any in-flight requests complete, and the load balancer detects the failure within one health check interval and stops routing new traffic. A backend can alternatively update a service-registry entry with a draining status flag. Cloud load balancers (e.g., AWS ALB) also support a deregistration delay that enforces a wait period after an instance is deregistered before traffic is fully stopped. A short preStop sleep (e.g., 5 seconds) before SIGTERM gives the LB time to stop routing before the drain timer starts.
How are persistent connections (long-poll, SSE, WebSocket) handled during drain?
They must be explicitly closed because they do not complete in the normal request cycle. For WebSocket, the server sends a close frame and waits for the client close handshake; for long-poll, it returns an immediate empty response so clients re-poll against another instance; for SSE, a retry directive tells clients to reconnect. For HTTP/2 and gRPC, the server sends a GOAWAY frame with the last processed stream ID so clients know which requests must be retried on a different backend. These connections are tracked in a separate persistent-connection counter with its own drain path.
How does connection draining enable zero-downtime deployment?
In a rolling deployment, new instances start and pass health checks before old instances are drained. The load balancer routes new traffic to the healthy new instances while old instances finish their existing connections. Because the drain completes before the old process exits, no in-flight requests are dropped and clients see no errors, assuming the drain timeout exceeds the longest-running request.