Tenant Isolation System: Low-Level Design
A multi-tenant SaaS application serves many organizations (tenants) from a shared infrastructure. Tenant isolation ensures that one tenant cannot access another tenant’s data, and that one tenant’s heavy usage does not degrade performance for others. This design covers the three isolation models (shared schema, separate schema, separate database), row-level security for shared schema, connection pool isolation, and rate limiting per tenant.
Isolation Model Comparison
"""
Model 1: Shared Schema (all tenants in same tables)
- Every table has a tenant_id column
- Lowest cost: single database, single schema
- Risk: missing WHERE tenant_id=X in one query leaks all tenants' data
- Mitigation: Postgres Row-Level Security (RLS) enforces tenant_id at DB level
- Best for: SMB SaaS, cost-sensitive products, homogeneous tenant needs
Model 2: Separate Schema (one schema per tenant, same database)
- search_path set per connection: SET search_path = tenant_abc, public
- Each tenant has their own table instances (orders, users, etc.)
- Moderate isolation: still same Postgres process
- Supports tenant-specific schema customizations
- Best for: mid-market SaaS, custom fields per tenant
Model 3: Separate Database (one DB per tenant)
- Maximum isolation: separate connection pool, separate WAL, separate backups
- Highest cost: N databases to operate, patch, and back up
- Required for: compliance (HIPAA, FedRAMP), enterprise contracts with data residency
- Best for: enterprise/regulated customers
"""
Row-Level Security (Shared Schema Model)
-- Enable RLS on every tenant-scoped table
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
ALTER TABLE orders FORCE ROW LEVEL SECURITY; -- applies to table owners too
-- Policy: a session can only see rows matching its tenant_id
CREATE POLICY tenant_isolation ON orders
USING (tenant_id = current_setting('app.tenant_id')::BIGINT);
-- Same pattern for every tenant-scoped table
ALTER TABLE users ENABLE ROW LEVEL SECURITY;
ALTER TABLE users FORCE ROW LEVEL SECURITY; -- applies to table owners too
CREATE POLICY tenant_isolation ON users
USING (tenant_id = current_setting('app.tenant_id')::BIGINT);
-- Application sets tenant context at connection checkout:
-- SET LOCAL app.tenant_id = '42'; (transaction-scoped)
-- All subsequent queries on this connection only see tenant 42's rows.
Application-Level Tenant Context
from contextlib import contextmanager
import threading
_local = threading.local()
@contextmanager
def tenant_context(tenant_id: int):
    """
    Set the active tenant for all DB operations in this scope.
    Passes tenant_id to Postgres via SET LOCAL so RLS policies apply.
    """
    old = getattr(_local, 'tenant_id', None)
    _local.tenant_id = tenant_id  # ambient tenant for code without a conn argument
    conn = db_pool.getconn()      # assumes a module-level psycopg2 connection pool
    try:
        with conn.cursor() as cur:
            cur.execute("SET LOCAL app.tenant_id = %s", (str(tenant_id),))
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()  # never commit a failed transaction
        raise
    finally:
        db_pool.putconn(conn)
        _local.tenant_id = old
# Usage in a request handler:
# @app.route('/api/orders')
# def list_orders():
#     with tenant_context(request.tenant_id) as conn, conn.cursor() as cur:
#         cur.execute("SELECT * FROM orders ORDER BY created_at DESC LIMIT 20")
#         return jsonify(cur.fetchall())
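The `request.tenant_id` used in the handler above has to be derived upstream; a common source is the tenant's subdomain. A hedged sketch of that parsing step (the `BASE_DOMAIN` value and the slug pattern are assumptions):

```python
import re

BASE_DOMAIN = "example-saas.com"  # assumed base domain for tenant subdomains
_SLUG_RE = re.compile(r"^[a-z0-9][a-z0-9-]{0,62}$")

def tenant_slug_from_host(host: str) -> str:
    """Extract the tenant slug from a Host header like 'acme.example-saas.com:443'.

    Raises ValueError for any host that does not match the expected shape,
    so a malformed Host header can never silently map to another tenant.
    """
    hostname = host.split(":", 1)[0].lower()  # strip port, normalize case
    suffix = "." + BASE_DOMAIN
    if not hostname.endswith(suffix):
        raise ValueError(f"unexpected host: {hostname}")
    slug = hostname[: -len(suffix)]
    if not _SLUG_RE.match(slug):
        raise ValueError(f"invalid tenant slug: {slug}")
    return slug
```

The slug would then be resolved to a numeric tenant_id via a cached lookup table before entering `tenant_context`.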
Per-Tenant Rate Limiting
import redis, time
r = redis.Redis(decode_responses=True)
TENANT_RATE_LIMITS = {
    'free':       {'requests_per_minute': 60,    'burst': 10},
    'starter':    {'requests_per_minute': 600,   'burst': 100},
    'business':   {'requests_per_minute': 3000,  'burst': 500},
    'enterprise': {'requests_per_minute': 30000, 'burst': 5000},
}
def check_rate_limit(tenant_id: int, plan: str) -> dict:
    """
    Fixed-window rate limiter per tenant, with a burst allowance
    on top of the per-minute limit.
    Returns: {'allowed': bool, 'remaining': int, 'reset_at': float}
    """
    limits = TENANT_RATE_LIMITS.get(plan, TENANT_RATE_LIMITS['free'])
    rpm = limits['requests_per_minute']
    burst = limits['burst']
    now = time.time()
    window_key = f"ratelimit:{tenant_id}:{int(now // 60)}"  # 1-minute fixed window
    current = r.incr(window_key)
    if current == 1:
        r.expire(window_key, 120)  # 2-minute TTL for safety
    allowed = current <= rpm + burst
    return {
        'allowed': allowed,
        'remaining': max(0, rpm + burst - current),
        'reset_at': (int(now // 60) + 1) * 60.0,  # start of the next window
    }
Per-Tenant Usage Metering
def get_tenant_usage(tenant_id: int) -> dict:
    """
    Aggregate resource usage for a tenant across dimensions.
    Used for billing and quota enforcement.
    """
    return db.fetchone("""
        SELECT
          (SELECT COUNT(*) FROM users WHERE tenant_id=%s) AS user_count,
          (SELECT COUNT(*) FROM orders WHERE tenant_id=%s) AS order_count,
          (SELECT COALESCE(SUM(size_bytes),0) FROM Document WHERE tenant_id=%s) AS storage_bytes,
          (SELECT api_call_count FROM TenantUsage WHERE tenant_id=%s
             AND period=date_trunc('month',NOW())) AS api_calls_this_month
    """, (tenant_id, tenant_id, tenant_id, tenant_id))
Connection Pool Isolation for Noisy Neighbor Prevention
"""
Problem: tenant A runs 50 concurrent slow analytical queries, exhausting the connection pool
and starving tenants B, C, D of database connections.
Solution: per-tier connection pool limits.
"""
import psycopg2.pool
POOL_ALLOCATIONS = {
    'free': 2,  # max 2 concurrent DB connections
    'starter': 5,
    'business': 20,
    'enterprise': 50,
}
_tenant_pools = {}  # tenant_id -> psycopg2.pool.ThreadedConnectionPool
_global_pool = psycopg2.pool.ThreadedConnectionPool(
    minconn=10, maxconn=200, host='db-primary', dbname='appdb'
)
class TenantResourceExhaustedError(Exception):
    """Raised when a tenant has no free connections in their allocation."""

def get_tenant_conn(tenant_id: int, plan: str):
    """
    Return a connection for the tenant, respecting their pool allocation.
    A tenant on the 'free' plan can hold at most 2 connections simultaneously.
    """
    max_conns = POOL_ALLOCATIONS.get(plan, 2)
    if tenant_id not in _tenant_pools:  # guard with a lock if handlers are threaded
        _tenant_pools[tenant_id] = psycopg2.pool.ThreadedConnectionPool(
            minconn=1, maxconn=max_conns,
            host='db-primary', dbname='appdb'
        )
    try:
        conn = _tenant_pools[tenant_id].getconn()
    except psycopg2.pool.PoolError:
        raise TenantResourceExhaustedError(
            f"Connection limit reached for tenant {tenant_id} on plan {plan}"
        )
    # Set tenant context for RLS (transaction-scoped; see the SET LOCAL note below)
    with conn.cursor() as cur:
        cur.execute("SET LOCAL app.tenant_id = %s", (str(tenant_id),))
    return conn
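The same noisy-neighbor cap can be enforced without one pool object per tenant by gating checkout from the global pool with a per-tenant semaphore. A sketch under the assumption of a single application process; `PLAN_CONCURRENCY` mirrors POOL_ALLOCATIONS and the function names are hypothetical:

```python
import threading
from typing import Dict

PLAN_CONCURRENCY = {'free': 2, 'starter': 5, 'business': 20, 'enterprise': 50}
_semaphores: Dict[int, threading.BoundedSemaphore] = {}
_sem_lock = threading.Lock()

def _tenant_semaphore(tenant_id: int, plan: str) -> threading.BoundedSemaphore:
    with _sem_lock:  # guard lazy creation against concurrent first requests
        if tenant_id not in _semaphores:
            limit = PLAN_CONCURRENCY.get(plan, 2)
            _semaphores[tenant_id] = threading.BoundedSemaphore(limit)
        return _semaphores[tenant_id]

def try_acquire_slot(tenant_id: int, plan: str) -> bool:
    """Non-blocking: returns False when the tenant is at their concurrency cap."""
    return _tenant_semaphore(tenant_id, plan).acquire(blocking=False)

def release_slot(tenant_id: int, plan: str) -> None:
    """Must be called (e.g. in a finally block) after the connection is returned."""
    _tenant_semaphore(tenant_id, plan).release()
```

This keeps a single global pool (better connection reuse across tenants) while still bounding each tenant's share of it.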
Key Design Decisions
- RLS as last-resort safety net, not primary isolation: application code should always include WHERE tenant_id=%s in queries. RLS is a defense-in-depth layer that catches bugs (missing tenant filter) before they become data breaches. RLS policies have a small performance cost (typically 5–15% query overhead) from the additional predicate. Use EXPLAIN ANALYZE to verify the RLS predicate is being pushed down to index scans, not applied as a post-filter on full table scans.
- SET LOCAL vs SET: SET LOCAL applies the tenant_id setting only for the duration of the current transaction. When the connection is returned to the pool and the transaction ends, the setting is reset. SET (without LOCAL) persists for the connection session — dangerous in a connection pool where connections are reused across tenants.
- Per-tier connection pool limits prevent noisy neighbor: a free-tier tenant running 50 concurrent queries cannot exhaust the global pool because their pool is capped at 2 connections. This enforces the implicit SLA of different plan tiers without requiring application-level query throttling.
- Separate schema for enterprise tenants: for customers requiring data residency (EU data must stay in EU) or strict compliance boundaries, a separate Postgres schema (or database) allows independent backups, separate encryption keys, and schema customizations without forking the application code. The application selects the schema at connection time by setting search_path.
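On that last point: search_path cannot take bind parameters, so the schema name must never be interpolated from untrusted input. A sketch of whitelist validation before building the statement, assuming a tenant_<id> schema naming convention (the pattern and function name are illustrative):

```python
import re

_SCHEMA_RE = re.compile(r"^tenant_[0-9]+$")  # assumed naming convention

def search_path_statement(schema: str) -> str:
    """Build a safe SET search_path statement for a tenant schema.

    Because search_path does not accept bind parameters, the identifier
    is validated against a strict whitelist pattern instead of escaped.
    """
    if not _SCHEMA_RE.match(schema):
        raise ValueError(f"invalid tenant schema name: {schema!r}")
    return f'SET search_path = "{schema}", public'
```

With psycopg2 specifically, `psycopg2.sql.Identifier` provides equivalent safe identifier quoting; the explicit whitelist is still worth keeping as a second check.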
Frequently Asked Questions
How does Postgres Row-Level Security enforce tenant isolation, and what are its limitations?
RLS adds a predicate to every SELECT, INSERT, UPDATE, and DELETE on a table: WHERE tenant_id = current_setting('app.tenant_id'). The predicate fires at the storage layer, so even if application code forgets WHERE tenant_id=X, RLS prevents cross-tenant data access; enforcement is transparent to the query writer. Limitations:
- 5–15% query overhead from the additional predicate evaluation; use EXPLAIN ANALYZE to verify index usage.
- Requires FORCE ROW LEVEL SECURITY to apply to table owners; superusers and roles with BYPASSRLS always bypass RLS.
- Works cleanly only for tenant_id-based access patterns; complex multi-tenant rules (a user can access records owned by their organization) require more sophisticated policies.
- Does not protect against cross-tenant inference attacks (timing attacks, cache side channels); it only prevents direct data access.
- SET LOCAL app.tenant_id must run in every transaction; if a connection pool reuses a connection without resetting the tenant context, the wrong tenant's data could be exposed. Always use SET LOCAL (transaction-scoped), never SET (session-scoped).
How do you migrate a single-tenant database schema to multi-tenant without downtime?
Adding tenant_id to an existing single-tenant schema:
1. Add the column: ALTER TABLE orders ADD COLUMN tenant_id BIGINT; (instant, non-blocking in Postgres 11+).
2. Backfill: UPDATE orders SET tenant_id = $original_tenant_id; (the first customer becomes tenant_id=1).
3. Add the NOT NULL guarantee via the expand/contract pattern: ADD CONSTRAINT orders_tenant_not_null CHECK (tenant_id IS NOT NULL) NOT VALID; then VALIDATE CONSTRAINT.
4. Add indexes: CREATE INDEX CONCURRENTLY ON orders(tenant_id, created_at); (non-blocking).
5. Enable RLS: ALTER TABLE orders ENABLE ROW LEVEL SECURITY; (instant).
6. Add the policy: CREATE POLICY tenant_isolation ON orders USING (tenant_id = current_setting('app.tenant_id')::BIGINT); (instant).
Total: 3–6 migrations, all non-blocking; the application-side SET LOCAL change must deploy alongside or before step 6.
How do you handle tenant data that spans multiple tables (cascading deletes, cross-table queries)?
In a multi-tenant schema, deleting a tenant must remove data from dozens of tables, and manual DELETE statements across 30 tables are error-prone. Solutions:
- Foreign keys with ON DELETE CASCADE: FK chains from the tenant row down through orders, order_items, and shipments let a single DELETE FROM Tenant WHERE tenant_id=X cascade through the hierarchy. Risk: an accidental tenant deletion cascades permanently, so soft-delete first (is_active=FALSE), confirm, then physically delete.
- A dedicated tenant-deletion job: enqueue a delete_tenant job for a background worker that deletes tables in dependency order (children before parents), in batches of 1,000 rows to avoid lock storms.
- For cross-table queries, always include tenant_id in JOINs: SELECT * FROM orders o JOIN shipments s ON s.order_id=o.order_id AND s.tenant_id=o.tenant_id WHERE o.tenant_id=%s. Carrying tenant_id in the JOIN condition spares RLS from being evaluated independently on each joined table.
How do you implement per-tenant feature flags and configuration overrides?
Tenants may sit on different plan tiers with different feature access, or may have negotiated custom settings. Schema: CREATE TABLE TenantConfig (tenant_id BIGINT NOT NULL, config_key VARCHAR(100) NOT NULL, config_value TEXT NOT NULL, PRIMARY KEY (tenant_id, config_key)). Evaluation: SELECT config_value FROM TenantConfig WHERE tenant_id=%s AND config_key='feature:advanced_reporting'; if no row exists, fall back to the plan-level default. Cache tenant config in Redis (HGETALL tenant_config:{tenant_id}) with a 5-minute TTL, since config is read on every request, and invalidate on update: r.delete(f"tenant_config:{tenant_id}"). This pattern supports plan-tier features (enterprise gets SSO, free does not), beta features for specific tenants, usage limits (max_users: 50 for starter, unlimited for enterprise), and custom branding settings.
How do you handle a tenant that consumes 80% of your database's query capacity?
A noisy tenant with heavy analytical workloads (bulk exports, complex reports) degrades performance for all other tenants sharing the database. Detection: SELECT tenant_id, COUNT(*) AS queries, AVG(duration_ms) AS avg_ms, SUM(duration_ms) AS total_ms FROM SlowQueryLog WHERE logged_at > NOW()-INTERVAL '1 hour' GROUP BY tenant_id ORDER BY total_ms DESC LIMIT 10. If one tenant accounts for >50% of total query time:
1. Immediate: route their queries to a dedicated read replica (via the per-tenant connection pool described above).
2. Cap their concurrency with the per-tenant connection pool limit (max 2 connections on the free tier).
3. Move their heavy operations (bulk exports, analytics) to a background job queue that runs on a separate replica with a dedicated connection pool.
4. Long-term: offer a dedicated database tier (a separate Postgres instance for that tenant); this is the enterprise upsell path.
Communicate proactively: "Your usage exceeds shared tier limits; we recommend upgrading to a dedicated plan."
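The feature-flag lookup order described above (tenant override first, then plan default, then a global fallback) can be sketched as a pure function; the plan defaults and flag names below are illustrative assumptions:

```python
PLAN_DEFAULTS = {  # hypothetical plan-level defaults
    'free':       {'feature:advanced_reporting': 'false', 'max_users': '5'},
    'starter':    {'feature:advanced_reporting': 'false', 'max_users': '50'},
    'enterprise': {'feature:advanced_reporting': 'true',  'max_users': 'unlimited'},
}

def resolve_config(tenant_overrides: dict, plan: str,
                   key: str, default: str = 'false') -> str:
    """Tenant-specific override wins; otherwise the plan default;
    otherwise a conservative global default."""
    if key in tenant_overrides:
        return tenant_overrides[key]
    return PLAN_DEFAULTS.get(plan, {}).get(key, default)
```

In production, `tenant_overrides` would be the cached TenantConfig rows for the tenant (the Redis hash mentioned above), loaded once per request.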