Low Level Design: Database Connection Pool

A database connection pool is one of the most impactful infrastructure components in any application that talks to a relational database. Poor pool configuration causes connection wait latency, connection exhaustion under load, and cascading failures. This design covers the full lifecycle from pool architecture to circuit breaker integration.

Connection Pool Architecture

A connection pool maintains N pre-established, authenticated connections to the database server. When application code needs to execute a query, it borrows a connection from the pool, uses it, and returns it when done. The pool handles the underlying TCP connection and authentication handshake upfront at startup, so subsequent borrows are nearly instantaneous. Without pooling, each query incurs TCP connection setup (~1ms LAN, ~50ms WAN) plus database authentication (~5–50ms depending on auth method and server load). At 1000 requests/second, eliminating this overhead saves substantial CPU and latency. The pool is implemented as a blocking queue: borrow blocks if no connection is available (up to a configurable timeout), return adds the connection back to the queue. Thread safety is achieved via synchronized access to the queue or lock-free CAS operations on the connection state.
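The borrow/return mechanics above can be sketched with a blocking queue. This is a minimal illustration under stated assumptions, not a production pool: the generic type C stands in for a real driver connection, and validation, lifetime tracking, and metrics are omitted.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Minimal borrow/return pool over a blocking queue. borrow() blocks up to a
// timeout waiting for a free connection; release() returns one for reuse.
class SimplePool<C> {
    private final BlockingQueue<C> idle;

    SimplePool(List<C> connections) {
        // fair = true: waiting borrowers are served in FIFO order
        this.idle = new ArrayBlockingQueue<>(connections.size(), true, connections);
    }

    C borrow(long timeoutMillis) {
        try {
            C conn = idle.poll(timeoutMillis, TimeUnit.MILLISECONDS);
            if (conn == null) {
                throw new RuntimeException("pool borrow timed out after " + timeoutMillis + "ms");
            }
            return conn;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException("interrupted while waiting for a connection", e);
        }
    }

    void release(C conn) {
        idle.offer(conn);
    }
}
```

A real pool layers validation, maxLifetime checks, and metrics around the same queue discipline.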

Pool Sizing

Pool sizing has two failure modes: too small causes connection wait (threads block waiting for a connection), too large wastes database server resources (each idle connection consumes memory and a file descriptor on the DB server). The HikariCP research (based on queuing theory and PostgreSQL benchmarks) concludes that optimal pool size per application instance is: (core_count * 2) + effective_spindle_count. For a 4-core app server with SSD storage (spindle count = 1), that’s 9 connections per instance. If you run 10 application instances, the database server sees 90 connections — well within PostgreSQL’s default max_connections = 100. The formula accounts for I/O wait time: while one thread waits for disk I/O, another can use the CPU, so you can support more concurrent queries than cores. min_connections sets the pre-warmed baseline to avoid cold-start latency on the first requests; setting it equal to max_connections is common for steady-state services.
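The sizing arithmetic above is trivial, but encoding it makes the deployment math checkable; the class and method names here are illustrative, not any library's API.

```java
// Pool-size heuristic from the text: connections = cores * 2 + effective spindles.
class PoolSizing {
    static int optimalPoolSize(int coreCount, int effectiveSpindleCount) {
        return coreCount * 2 + effectiveSpindleCount;
    }

    // Total connections the database server sees across all app instances.
    static int totalDbConnections(int poolSizePerInstance, int instanceCount) {
        return poolSizePerInstance * instanceCount;
    }
}
```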

Connection Lifecycle

Connections degrade over time. Database servers close idle connections after a server-side timeout (wait_timeout in MySQL, idle_session_timeout in PostgreSQL). If the pool doesn’t know about these server-side closures, it hands out a dead connection and the application sees a broken pipe error. The pool handles this via maxLifetime: connections are proactively retired and replaced after a maximum age (e.g., 30 minutes), shorter than the server-side timeout, ensuring the pool never holds connections the server has already closed. On borrow, a validation query (e.g., SELECT 1) or JDBC isValid() call confirms the connection is alive — but this adds latency to every borrow, so it’s only used when the pool lacks a better liveness signal (like TCP keep-alive). When a query throws a SQL exception indicating connection-level failure (as opposed to a query-level error), the pool marks that connection as invalid, closes it, and opens a replacement rather than returning it to the pool.
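The maxLifetime retirement decision can be isolated as a pure function of the connection's creation time. A sketch, with the clock passed in so the policy is testable; the class name is an assumption for illustration.

```java
// maxLifetime policy: on borrow, a connection older than maxLifetimeMillis is
// discarded and replaced rather than handed to the application.
class LifetimePolicy {
    private final long maxLifetimeMillis;

    LifetimePolicy(long maxLifetimeMillis) {
        this.maxLifetimeMillis = maxLifetimeMillis;
    }

    // The clock is a parameter rather than System.currentTimeMillis()
    // so the policy can be tested deterministically.
    boolean shouldRetire(long createdAtMillis, long nowMillis) {
        return nowMillis - createdAtMillis >= maxLifetimeMillis;
    }
}
```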

Health Checking

Background health checks run independently of application borrows. A dedicated thread wakes every N seconds (configurable, typically 30–60s), iterates over idle connections, and validates each one with a lightweight ping. Connections failing validation are removed from the pool and replaced with fresh connections. This proactive approach prevents the pool from silently accumulating stale connections that would fail at borrow time and surface as errors to the application. The health check uses the most lightweight available mechanism: JDBC Connection.isValid(timeoutSeconds) for drivers that implement it efficiently, or a driver-specific ping command (/* ping */ for MySQL Connector/J). Logging health check results with connection ID and failure reason provides observability into connection churn, which can indicate network instability or database server issues before they cause visible application errors.
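A single health-check sweep can be sketched independently of the scheduling (a ScheduledExecutorService would invoke it every N seconds). The liveness predicate and replacement supplier below are injected stand-ins for Connection.isValid() and the pool's connection factory; this is a sketch, not a real pool's internals.

```java
import java.util.Queue;
import java.util.function.Predicate;
import java.util.function.Supplier;

// One sweep over the idle connections: connections failing the liveness check
// are dropped and fresh ones created in their place. Returns the number replaced,
// which is a useful observability signal for connection churn.
class HealthSweep {
    static <C> int sweep(Queue<C> idle, Predicate<C> isAlive, Supplier<C> fresh) {
        int replaced = 0;
        int n = idle.size();
        for (int i = 0; i < n; i++) {
            C conn = idle.poll();
            if (isAlive.test(conn)) {
                idle.offer(conn);        // still healthy: keep it
            } else {
                idle.offer(fresh.get()); // stale: replace with a new connection
                replaced++;
            }
        }
        return replaced;
    }
}
```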

Read Replica Routing

Separating read and write traffic is a common scaling pattern. The application maintains two pools: a primary pool pointing to the read-write primary, and a replica pool pointing to one or more read replicas. Write transactions and transactions requiring read-your-writes consistency use the primary pool. Read-only transactions use the replica pool, distributing read load horizontally. In Java Spring, @Transactional(readOnly=true) annotations can trigger automatic routing to the replica pool via a routing DataSource implementation. The replica pool is sized independently — often larger than the primary pool since read queries dominate. Replication lag monitoring is critical: if the replica is significantly behind the primary (e.g., lag > 5 seconds), a fallback mechanism routes reads to the primary temporarily to avoid serving stale data. Lag is measured by comparing primary LSN (log sequence number) to replica LSN, available via pg_stat_replication in PostgreSQL.
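The routing decision, including the lag fallback, reduces to a small pure function. A sketch under assumed names; the pool type parameter and lag threshold are illustrative, not a Spring API.

```java
// Choose the replica pool only for read-only work when replication lag is
// within bounds; writes, and reads during excessive lag, go to the primary.
class ReadWriteRouter<P> {
    private final P primaryPool;
    private final P replicaPool;
    private final long maxLagMillis;

    ReadWriteRouter(P primaryPool, P replicaPool, long maxLagMillis) {
        this.primaryPool = primaryPool;
        this.replicaPool = replicaPool;
        this.maxLagMillis = maxLagMillis;
    }

    // currentLagMillis would come from a background monitor polling
    // pg_stat_replication; here it is a parameter for testability.
    P choose(boolean readOnly, long currentLagMillis) {
        if (readOnly && currentLagMillis <= maxLagMillis) {
            return replicaPool;
        }
        return primaryPool;
    }
}
```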

Slow Query Detection

The connection pool is a natural interception point for query performance monitoring. The pool wraps returned connections with a proxy (e.g., via JDBC Connection interface) that records the start time before query execution and the end time after. Queries exceeding a configurable threshold (e.g., 500ms) are logged with the full SQL text, elapsed time, and stack trace to identify the calling code. Raw SQL logging is supplemented by query fingerprinting: parameter values are replaced with placeholders (WHERE id = ?) to group distinct queries with different parameters into the same pattern. Per-pattern statistics (count, p50/p95/p99 latency, total time) are aggregated in-process and exported to a metrics system (Prometheus, Datadog). This identifies the specific query patterns responsible for database load without requiring access to the database’s own slow query log, and works even when the database is a managed service (RDS, Cloud SQL) with restricted access to server logs.
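A rough fingerprinting pass can be done with regular expressions over literals. This is a sketch only: a production fingerprinter would use a real SQL tokenizer to handle comments, escaped quotes, and other corner cases these regexes miss.

```java
// Replace numeric and string literals with '?' so queries differing only in
// parameter values collapse to one pattern for aggregation.
class QueryFingerprint {
    static String fingerprint(String sql) {
        return sql
            .replaceAll("'(?:[^']|'')*'", "?")          // quoted string literals
            .replaceAll("\\b\\d+(?:\\.\\d+)?\\b", "?"); // standalone numeric literals
    }
}
```

Per-pattern statistics are then keyed on the fingerprint rather than the raw SQL.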

Multi-Tenant Connection Pooling

Multi-tenant systems add complexity: tenants need isolation from one another, but a dedicated pool per tenant is resource-intensive. Three patterns exist. Schema-per-tenant with a shared pool: all tenants share one pool, queries set the search path (SET search_path TO tenant_123) before execution. Simple but requires careful session state cleanup on connection return. Pool-per-tenant: each tenant gets a dedicated pool. Strong isolation, but with 1000 tenants at min_connections=2 each, that’s 2000 persistent connections — unmanageable without connection multiplexing middleware. PgBouncer as a connection multiplexer sits between application and PostgreSQL, maintaining a smaller number of server-side connections and multiplexing many client connections onto them. In transaction-mode pooling, PgBouncer can serve 10,000+ application connections through 100 server connections, reducing PostgreSQL connection overhead drastically. The tradeoff: session-level features (prepared statements, advisory locks, SET commands) don’t work in transaction mode.
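The schema-switching pattern hinges on constructing the search_path statement safely and resetting session state on return. A sketch of that helper, with the tenant_<id> naming convention taken from the example above and the identifier whitelist as an assumption.

```java
// Build the per-tenant search_path statement, allowing only a safe identifier
// character set so a tenant id can never smuggle SQL into the session.
class TenantSchema {
    static String searchPathSql(String tenantId) {
        if (!tenantId.matches("[A-Za-z0-9_]+")) {
            throw new IllegalArgumentException("unsafe tenant id: " + tenantId);
        }
        return "SET search_path TO tenant_" + tenantId;
    }

    // Run on connection return so the next borrower cannot see
    // the previous tenant's schema.
    static String resetSql() {
        return "RESET search_path";
    }
}
```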

Circuit Breaker Integration

A database outage without a circuit breaker causes thread exhaustion: all application threads pile up waiting for connections that never come, consuming memory and preventing recovery. The circuit breaker pattern wraps pool borrow operations with failure counting logic. Three states: CLOSED (normal operation), OPEN (fast-fail all requests), HALF-OPEN (probe with limited traffic). Consecutive connection failures (e.g., 5 failures in 10 seconds) trip the breaker to OPEN state. In OPEN state, borrow attempts immediately throw an exception rather than waiting for the pool timeout — this frees threads instantly and allows the application to serve degraded responses (cached data, error messages) rather than hanging. After a cooldown period (e.g., 30 seconds), the breaker transitions to HALF-OPEN and allows a single probe request. If the probe succeeds, the breaker resets to CLOSED; if it fails, it returns to OPEN. Libraries like Resilience4j implement this pattern with configurable thresholds and integrate cleanly with Spring Boot and connection pool configuration.
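The three-state machine can be sketched compactly. This simplifies the windowed threshold described above ("5 failures in 10 seconds") to consecutive failures, and takes the clock as a parameter so transitions are testable; a library like Resilience4j handles the windowing, concurrency, and metrics this sketch omits.

```java
// Minimal circuit breaker: CLOSED until failureThreshold consecutive failures,
// then OPEN (fast-fail) until cooldownMillis elapses, then HALF_OPEN for one probe.
class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long cooldownMillis;
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAtMillis = 0;

    CircuitBreaker(int failureThreshold, long cooldownMillis) {
        this.failureThreshold = failureThreshold;
        this.cooldownMillis = cooldownMillis;
    }

    // Called before borrowing: false means fast-fail without touching the pool.
    boolean allowRequest(long nowMillis) {
        if (state == State.OPEN && nowMillis - openedAtMillis >= cooldownMillis) {
            state = State.HALF_OPEN; // cooldown elapsed: allow one probe
        }
        return state != State.OPEN;
    }

    void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    void recordFailure(long nowMillis) {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN; // a failed probe, or too many failures, re-opens
            openedAtMillis = nowMillis;
        }
    }

    State state() { return state; }
}
```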

Frequently Asked Questions

What is the formula for optimal connection pool size and what are HikariCP’s defaults?

The widely cited formula from the HikariCP documentation (derived from the PostgreSQL wiki) is: pool size = (core count × 2) + effective_spindle_count. For a 4-core server with SSD storage (spindle count ≈ 1), that yields a pool of 9. This counterintuitively small number reflects the fact that more connections than CPU cores causes context-switch overhead and lock contention that reduces throughput. HikariCP defaults: maximumPoolSize=10, minimumIdle=10 (i.e., a fixed pool), connectionTimeout=30000ms, idleTimeout=600000ms, maxLifetime=1800000ms. In practice you should benchmark your specific workload: I/O-heavy queries with long wait times can justify larger pools because threads block on network/disk rather than CPU.

How do you detect and resolve connection pool exhaustion?

Detection: instrument pool metrics (active connections, pending waiters, acquisition time p99) via JMX or Micrometer and alert when pending waiters > 0 for more than a few seconds, or when connectionTimeout exceptions appear in logs. HikariCP logs a warning with a thread dump when a connection cannot be acquired within half the connectionTimeout. Resolution paths: (1) increase pool size if CPU headroom exists; (2) identify and fix slow queries holding connections too long; (3) add a read replica and route read traffic there to reduce primary pool pressure; (4) introduce a connection proxy layer (PgBouncer, ProxySQL) for transaction-mode pooling that multiplexes many app threads onto fewer backend connections; (5) implement circuit breakers so a downstream slowdown doesn’t cascade into exhaustion.

How do you route read replica traffic from within a connection pool?

The cleanest approach is to maintain two separate pool instances, one pointing at the primary and one at the replica set, and expose them as named DataSource beans (e.g., primaryDataSource and readDataSource). Application code or a routing DataSource wrapper (Spring’s AbstractRoutingDataSource) selects the appropriate pool based on the transaction read-only flag or an explicit annotation (@Transactional(readOnly=true)). Alternatively, a proxy layer like ProxySQL or AWS RDS Proxy can perform read/write splitting transparently at the network level without application changes. Key concerns: replica lag means reads may return stale data; you must decide per use-case whether eventual consistency is acceptable, and fall back to the primary for reads that require strong consistency immediately after a write.

How do you implement slow query detection and logging via a connection pool wrapper?

Wrap the JDBC Connection or DataSource with a proxy (P6Spy, datasource-proxy, or a custom InvocationHandler) that records the wall-clock time around each Statement.execute() call. If execution time exceeds a configurable threshold (e.g., 500ms), log the full SQL, bound parameters, execution time, and calling thread/stack trace. This avoids relying solely on database-side slow query logs, which may not capture client-perceived latency including network time. In HikariCP you can set leakDetectionThreshold to log connections held open longer than a threshold, which catches cases where application code forgets to close a connection or holds it across slow external calls. Combine with distributed tracing (OpenTelemetry) to correlate slow queries with upstream request traces.

What strategies handle multi-tenant connection pooling efficiently?

Three main patterns: (1) Pool-per-tenant: each tenant gets a dedicated pool pointing at their schema or database. Strong isolation and easy per-tenant limits, but memory overhead grows linearly with tenant count, so it is practical only for hundreds of tenants. (2) Shared pool with schema switching: a single pool connects to a shared database; on borrow, the connection runs SET search_path or USE to switch tenant context. Lower overhead, but schema-switch bugs can leak data between tenants, and switching adds latency per acquisition. (3) Proxy-based pooling (PgBouncer per tenant, or a smart proxy that maps tenant ID to backend): combines multiplexing efficiency with isolation. For large SaaS platforms, pattern 3 or a tiered approach (dedicated pools for large tenants, shared pool for small ones) is typical.

