Database Connection Pool Low-Level Design
A database connection pool maintains a set of pre-established connections to a database server, lending them to application threads on demand and returning them to the pool when work is complete. Creating a new database connection involves TCP handshake, authentication, and session setup — typically 10-100 ms. Pooling amortizes this cost across thousands of queries per second.
Pool Lifecycle and Sizing
min_size is the number of connections maintained in the idle queue even during low traffic. Keeping warm connections ready eliminates cold-start latency for burst traffic.
max_size is the hard cap on total connections (idle + active). It must not exceed the database server's max_connections parameter. Exceeding this causes connection refusal at the database level.
A practical sizing formula: max_size = (core_count * 2) + effective_spindle_count for Postgres. For most cloud databases, max_size between 10 and 100 per application instance is appropriate; horizontal scaling of application instances requires a connection pooler (PgBouncer, ProxySQL) to aggregate connection counts.
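As a sanity check, the formula is easy to compute directly. A minimal sketch — the helper name is illustrative, and effective_spindle_count is commonly taken as 1 for SSD-backed storage:

```python
def suggested_pool_size(core_count: int, effective_spindle_count: int) -> int:
    """Pool-size heuristic: (cores * 2) + spindles.

    For SSD-backed storage, effective_spindle_count is commonly taken as 1.
    """
    return core_count * 2 + effective_spindle_count

# A 4-core instance with SSD storage:
print(suggested_pool_size(4, 1))  # -> 9
```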
Checkout Flow
When a thread calls acquire():
- If the idle queue is non-empty, pop a connection and return it immediately.
- If total connections are below max_size, create a new connection, add it to the pool's tracking set, and return it.
- If at max_size, block on a semaphore with a checkout_timeout_ms deadline. If a connection is returned before the timeout, acquire it. If the timeout expires, raise a pool exhaustion exception.
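The three branches above can be sketched with a semaphore that counts free checkout slots. This is a deliberately minimal illustration (no validation, eviction, or lifetime handling), not the full implementation given later:

```python
import queue
import threading

class TinyPool:
    """Minimal sketch of the three acquire() branches; names are illustrative."""
    def __init__(self, connect_fn, max_size: int, checkout_timeout_ms: int):
        self.connect_fn = connect_fn
        self.checkout_timeout_s = checkout_timeout_ms / 1000.0
        self._idle: queue.Queue = queue.Queue()
        self._slots = threading.Semaphore(max_size)  # free checkout slots

    def acquire(self):
        # Branch 3: at max_size this blocks until a release or the deadline.
        if not self._slots.acquire(timeout=self.checkout_timeout_s):
            raise TimeoutError("pool exhausted")
        try:
            return self._idle.get_nowait()   # branch 1: reuse an idle connection
        except queue.Empty:
            return self.connect_fn()         # branch 2: below max_size, open new

    def release(self, conn):
        self._idle.put(conn)
        self._slots.release()

pool = TinyPool(connect_fn=object, max_size=1, checkout_timeout_ms=50)
first = pool.acquire()           # branch 2: idle queue empty, creates one
pool.release(first)
assert pool.acquire() is first   # branch 1: the idle connection is reused
```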
Return Flow
When a thread calls release(conn):
- Run a pre-return health check (a lightweight SELECT 1). If it fails, discard the connection and decrement the pool count.
- If the connection is healthy and waiters are queued, hand the connection directly to the next waiter.
- Otherwise, push the connection to the idle queue.
Overflow Handling
When the pool is at max_size and all connections are active, new acquire() calls can either fail-fast (raise immediately) or queue with a timeout. Queuing is preferable for bursty workloads; fail-fast is preferable when the caller has retry logic and partial failure is acceptable. A queue depth limit prevents unbounded memory growth under sustained overload.
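Both policies fall out of the same semaphore primitive: a non-blocking acquire gives fail-fast, a timed acquire gives bounded queuing. A minimal sketch:

```python
import threading

slots = threading.Semaphore(2)  # a pool with max_size = 2, all slots free

def acquire_fail_fast() -> bool:
    """Fail-fast policy: give up immediately when the pool is exhausted."""
    return slots.acquire(blocking=False)

def acquire_queued(checkout_timeout_ms: int) -> bool:
    """Queued policy: wait up to the deadline for a slot to be released."""
    return slots.acquire(timeout=checkout_timeout_ms / 1000.0)

assert acquire_fail_fast()       # slot 1 taken
assert acquire_queued(50)        # slot 2 taken
assert not acquire_fail_fast()   # exhausted: fails immediately
assert not acquire_queued(50)    # exhausted: fails after ~50 ms of waiting
```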
Idle Connection Eviction
Connections idle beyond max_idle_seconds are closed and removed from the pool, down to min_size. This prevents the database from accumulating stale idle connections from application instances that have scaled down. A background eviction thread runs periodically (e.g., every 30 seconds) to close expired idle connections.
Connection Max Lifetime
Long-lived connections accumulate server-side state (temporary tables, session variables, statistics) and may hit database-side idle timeouts. A max_lifetime_seconds limit forces connections to be closed and replaced after a maximum age, regardless of health. Replacement is staggered (randomized jitter on the lifetime) to prevent thundering herd reconnects.
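The jitter can be sketched as a small helper; the function name and the ±10% default are illustrative choices, not fixed conventions:

```python
import random

def jittered_lifetime(max_lifetime_seconds: float,
                      jitter_fraction: float = 0.1) -> float:
    """Randomize each connection's lifetime within +/- jitter_fraction so that
    connections created in the same burst do not all expire and reconnect at once."""
    low = max_lifetime_seconds * (1.0 - jitter_fraction)
    high = max_lifetime_seconds * (1.0 + jitter_fraction)
    return random.uniform(low, high)

# With max_lifetime_seconds=3600, lifetimes land roughly in [3240, 3960].
```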
Health Validation Before Checkout
A pre-checkout SELECT 1 detects connections that were silently dropped by the database or network (TCP RST, firewall idle timeout). The overhead is minimal (sub-millisecond on a local socket) compared to the cost of a failed query on a broken connection. Validation can be skipped for connections that were used within the last N seconds (validation_interval) to reduce overhead on hot paths.
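The validation round trip works against any DB-API driver; in this sketch sqlite3 stands in for a real database connection, with close() simulating a silent drop:

```python
import sqlite3

def is_alive(conn) -> bool:
    """Pre-checkout validation: one lightweight SELECT 1 round trip."""
    try:
        cur = conn.cursor()
        cur.execute("SELECT 1")
        cur.close()
        return True
    except Exception:
        return False

conn = sqlite3.connect(":memory:")
assert is_alive(conn)
conn.close()               # simulate a silently dropped connection
assert not is_alive(conn)  # cursor() raises on a closed sqlite3 connection
```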
SQL Schema
CREATE TABLE ConnectionPoolConfig (
pool_name TEXT PRIMARY KEY,
min_size INTEGER NOT NULL DEFAULT 2,
max_size INTEGER NOT NULL DEFAULT 10,
checkout_timeout_ms INTEGER NOT NULL DEFAULT 5000,
max_idle_seconds INTEGER NOT NULL DEFAULT 300,
max_lifetime_seconds INTEGER NOT NULL DEFAULT 3600
);
CREATE TABLE ConnectionPoolMetric (
id BIGSERIAL PRIMARY KEY,
pool_name TEXT NOT NULL,
idle_count INTEGER NOT NULL,
active_count INTEGER NOT NULL,
wait_count INTEGER NOT NULL,
sampled_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX ON ConnectionPoolMetric (pool_name, sampled_at DESC);
Python Implementation
import threading, time, queue
from dataclasses import dataclass, field
from typing import Any


@dataclass
class PooledConnection:
    conn: Any
    created_at: float = field(default_factory=time.monotonic)
    last_used_at: float = field(default_factory=time.monotonic)


class ConnectionPool:
    def __init__(self, connect_fn, pool_name="default", min_size=2, max_size=10,
                 checkout_timeout_ms=5000, max_idle_seconds=300,
                 max_lifetime_seconds=3600, validation_interval_s=30):
        self.connect_fn = connect_fn
        self.pool_name = pool_name
        self.min_size = min_size
        self.max_size = max_size
        self.checkout_timeout_s = checkout_timeout_ms / 1000.0
        self.max_idle_seconds = max_idle_seconds
        self.max_lifetime_seconds = max_lifetime_seconds
        self.validation_interval_s = validation_interval_s
        self._idle: queue.Queue = queue.Queue()
        self._lock = threading.Lock()
        self._total = 0
        # Counts free checkout slots; a permit is held for the duration of a checkout.
        self._semaphore = threading.Semaphore(max_size)
        self._prefill()

    def _prefill(self):
        # Warm the pool to min_size so burst traffic avoids cold connects.
        for _ in range(self.min_size):
            pc = PooledConnection(conn=self.connect_fn())
            with self._lock:
                self._total += 1
            self._idle.put(pc)

    def acquire(self) -> PooledConnection:
        acquired = self._semaphore.acquire(timeout=self.checkout_timeout_s)
        if not acquired:
            raise TimeoutError(
                f"Pool {self.pool_name} exhausted after {self.checkout_timeout_s}s")
        try:
            pc = self._idle.get_nowait()
            if not self.validate_connection(pc):
                # Broken connection: close it and swap in a fresh one.
                # Total count is unchanged (one closed, one created).
                try:
                    pc.conn.close()
                except Exception:
                    pass
                pc = PooledConnection(conn=self.connect_fn())
            pc.last_used_at = time.monotonic()
            return pc
        except queue.Empty:
            # No idle connection; the held permit guarantees we are below max_size.
            with self._lock:
                self._total += 1
            return PooledConnection(conn=self.connect_fn())

    def release(self, pc: PooledConnection):
        now = time.monotonic()
        try:
            if now - pc.created_at > self.max_lifetime_seconds:
                self._discard(pc)
                return
            if not self.validate_connection(pc):
                self._discard(pc)
                return
            pc.last_used_at = now
            self._idle.put(pc)
        finally:
            # Return the checkout permit whether the connection was
            # requeued or discarded.
            self._semaphore.release()

    def _discard(self, pc: PooledConnection):
        # Close and forget a connection. Deliberately does not touch the
        # semaphore: only callers that hold a checkout permit release one,
        # otherwise eviction of idle connections would over-release and
        # let the pool grow past max_size.
        try:
            pc.conn.close()
        except Exception:
            pass
        with self._lock:
            self._total -= 1

    def validate_connection(self, pc: PooledConnection) -> bool:
        now = time.monotonic()
        # Skip the round trip for recently used connections (hot path).
        if now - pc.last_used_at < self.validation_interval_s:
            return True
        try:
            cur = pc.conn.cursor()
            cur.execute("SELECT 1")
            cur.close()
            return True
        except Exception:
            return False

    def evict_idle_connections(self):
        now = time.monotonic()
        to_evict = []
        temp = []
        # Drain the idle queue so each connection is examined exactly once.
        while True:
            try:
                temp.append(self._idle.get_nowait())
            except queue.Empty:
                break
        for pc in temp:
            idle_seconds = now - pc.last_used_at
            with self._lock:
                below_min = self._total - len(to_evict) <= self.min_size
            if idle_seconds > self.max_idle_seconds and not below_min:
                to_evict.append(pc)
            else:
                self._idle.put(pc)
        for pc in to_evict:
            self._discard(pc)
        print(f"Evicted {len(to_evict)} idle connections from pool {self.pool_name}")
Frequently Asked Questions
How do you size a connection pool for a Postgres database?
The widely cited formula is pool_size = (number_of_cores * 2) + effective_spindle_count. For a 4-core SSD-backed instance (effective spindle count taken as 1), that is (4 * 2) + 1 = 9 connections per application instance. The formula reflects that Postgres uses one backend process per connection, and context-switching overhead dominates beyond 2x core count. With many application instances, a connection pooler (PgBouncer in transaction mode) is required to aggregate connections before they reach the database.
What happens when the pool is exhausted and checkout_timeout_ms expires?
The acquire() call raises a pool exhaustion exception. The calling thread should propagate this as an HTTP 503 Service Unavailable or retry with backoff depending on the context. Allowing acquire() to block indefinitely risks thread starvation: all threads wait on the pool, the pool can never return connections because no thread completes work, and the application deadlocks.
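One hedged sketch of the retry-with-backoff option — the helper name is illustrative, and pool is assumed to be any object whose acquire() raises TimeoutError on exhaustion:

```python
import time

def acquire_with_backoff(pool, attempts: int = 3, base_delay_s: float = 0.05):
    """Retry acquire() with exponential backoff, re-raising after the last try.

    `pool` is any object whose acquire() raises TimeoutError on exhaustion;
    the caller maps the final TimeoutError to an HTTP 503.
    """
    for attempt in range(attempts):
        try:
            return pool.acquire()
        except TimeoutError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay_s * (2 ** attempt))
```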
Why is max_lifetime important for long-running applications?
Database servers impose their own connection limits and may forcibly close idle connections (wait_timeout in MySQL, idle_in_transaction_session_timeout in Postgres). A max_lifetime ensures the pool proactively replaces connections before the server terminates them, preventing silent broken-pipe errors. It also forces periodic reconnection, which clears accumulated session state and picks up server-side configuration changes.
What is the overhead of pre-checkout health validation?
A SELECT 1 over a local Unix socket or loopback TCP connection takes under 1 ms. For queries that themselves take 10-100 ms, validation overhead is less than 1%. Validation can be skipped for recently used connections (within validation_interval_s) to eliminate overhead on hot code paths where connections are returned and reacquired rapidly.