In a SaaS product, onboarding a new customer involves far more than creating a database row. The tenant provisioning service orchestrates every resource allocation step — database schema creation, object storage bucket setup, compute namespace assignment, DNS configuration, and default config seeding — in a reliable, repeatable workflow that handles partial failures gracefully.
Tenant Schema
-- PostgreSQL: inline ENUM is MySQL syntax, so the enumerations are modeled
-- as CHECK constraints here (a CREATE TYPE ... AS ENUM would also work).
CREATE TABLE tenants (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  slug VARCHAR(63) NOT NULL UNIQUE, -- used as subdomain: {slug}.app.example.com
  plan TEXT NOT NULL DEFAULT 'starter'
    CHECK (plan IN ('starter', 'growth', 'enterprise')),
  status TEXT NOT NULL DEFAULT 'pending'
    CHECK (status IN ('pending', 'provisioning', 'active', 'suspended', 'deprovisioned')),
  region VARCHAR(32) NOT NULL DEFAULT 'us-east-1',
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  activated_at TIMESTAMPTZ, -- set when the provisioning workflow completes
  suspended_at TIMESTAMPTZ,
  deprovisioned_at TIMESTAMPTZ
);
Provisioning State Machine
Tenant lifecycle follows a strict state machine to make status unambiguous and to give the provisioning workflow a clear re-entry point on failure:
- pending — record created, provisioning not yet started (e.g. waiting for payment confirmation).
- provisioning — active provisioning workflow is running. If the service crashes mid-flight, the workflow resumes from the last completed step on restart.
- active — all resources provisioned, tenant can log in and use the product.
- suspended — soft block: data retained, all API requests return HTTP 402 or 403, tenant cannot log in. Triggered by payment failure or policy violation.
- deprovisioned — permanent end state after offboarding workflow completes.
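The lifecycle above can be sketched as a small transition table. This is an illustrative Python sketch (the function and dict names are my own); in practice the check would run inside the same transaction that updates the tenant row:

```python
# Allowed tenant status transitions, mirroring the lifecycle above.
ALLOWED_TRANSITIONS = {
    "pending": {"provisioning"},
    "provisioning": {"active", "pending"},     # back to pending after a rolled-back run
    "active": {"suspended"},
    "suspended": {"active", "deprovisioned"},  # reactivate, or offboard after the hold
    "deprovisioned": set(),                    # terminal state
}

def transition(current: str, target: str) -> str:
    """Validate and apply a status change; raise on an illegal transition."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {target}")
    return target

status = transition("pending", "provisioning")
status = transition(status, "active")
```

Encoding the table in one place keeps status unambiguous: any code path that tries to, say, reactivate a deprovisioned tenant fails loudly instead of silently corrupting state.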
Provisioning Workflow Steps
The provisioning workflow executes these steps in order. Each step is idempotent — re-running a completed step is a no-op — so the workflow engine can safely retry from the last failed step:
- Create database schema or namespace. In a shared-schema model: add a `tenant_id` column to every tenant-scoped table and rely on row-level filters. In a schema-per-tenant model: `CREATE SCHEMA tenant_{slug}` and run migrations against it. In a DB-per-tenant model: provision a new database instance (typically reserved for enterprise plans).
- Provision S3 bucket. Create a bucket named `tenant-{id}-{region}`, apply a bucket policy scoped to the tenant IAM role, enable server-side encryption, and configure lifecycle rules per the plan quota.
- Allocate Kubernetes namespace. Create namespace `tenant-{slug}`, apply resource quotas (CPU and memory limits from the plan), create a service account, and bind the appropriate RBAC roles.
- Configure DNS subdomain. Create a CNAME record `{slug}.app.example.com` pointing to the shared ingress. Verify propagation before marking the step complete.
- Seed default configuration. Write plan-appropriate feature flags, rate limits, and default settings to the config service under the namespace `tenant.{id}`.
- Transition to active. Update `status = 'active'`, record `activated_at`, send the welcome email, and emit a `tenant.provisioned` event to the event bus for downstream consumers (analytics, CRM sync).
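The resume-from-last-completed-step behavior can be sketched as a checkpointed runner. This is a minimal illustration (the step names follow the list above; the `completed` set stands in for durable workflow state such as a database row or a workflow-engine history):

```python
# Minimal checkpointed workflow runner: each step's completion is persisted,
# so a re-run after a crash skips finished steps and resumes at the failure.
STEPS = [
    "create_db_schema",
    "provision_s3_bucket",
    "allocate_k8s_namespace",
    "configure_dns",
    "seed_default_config",
    "mark_active",
]

def run_provisioning(tenant_id, completed, execute):
    """Run all steps in order; 'completed' is the durable set of finished steps."""
    for step in STEPS:
        if step in completed:     # idempotent: re-running a done step is a no-op
            continue
        execute(step, tenant_id)  # may raise; the next run resumes here
        completed.add(step)       # checkpoint only after success

# Simulate a crash during DNS configuration, then a successful retry.
log = []
def flaky(step, tenant_id):
    if step == "configure_dns" and "configure_dns" not in log:
        log.append(step)
        raise RuntimeError("DNS propagation timeout")
    log.append(step)

completed = set()
try:
    run_provisioning("t-123", completed, flaky)
except RuntimeError:
    pass
run_provisioning("t-123", completed, flaky)  # resumes from configure_dns
```

The key invariant is that a step is checkpointed only after it succeeds, and every step tolerates being called twice; together those two properties make crash-recovery a plain re-run.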
Quota Management
Each plan defines hard quotas stored in a plan_quotas table: storage bytes, API calls per month, and seat count. A quota service tracks consumption in real time using Redis counters (incremented on every API call, reset on the billing cycle boundary). When consumption crosses 80% of a quota, a warning notification is sent. At 100%, the relevant API surface returns HTTP 429 or rejects new uploads until the tenant upgrades or the cycle resets.
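The 80% warning / 100% block behavior reduces to a check-and-increment. A sketch under simplifying assumptions: an in-memory dict stands in for the Redis counters (in production this would be an atomic `INCRBY` with the counter keyed to the billing cycle), and `notify` stands in for the notification service:

```python
# Quota check-and-increment mirroring the 80% warning / 100% block behavior.
counters = {}

def consume(tenant_id, resource, amount, limit, notify):
    """Try to consume 'amount' of a quota; False means over limit (HTTP 429)."""
    key = (tenant_id, resource)
    used = counters.get(key, 0)
    if used + amount > limit:
        return False                  # at 100%: caller rejects the request
    counters[key] = used + amount
    if used < limit * 0.8 <= counters[key]:
        notify(tenant_id, resource)   # fired once, on crossing the 80% line
    return True

warnings = []
consume("t-1", "api_calls", 79, 100, lambda t, r: warnings.append(r))
consume("t-1", "api_calls", 2, 100, lambda t, r: warnings.append(r))   # crosses 80%
blocked = not consume("t-1", "api_calls", 20, 100, lambda t, r: warnings.append(r))
```

The chained comparison on the warning line is what makes the notification fire exactly once per cycle: only the increment that crosses the threshold triggers it.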
Tenant Suspension
Suspension is a soft block: all tenant data is retained and all infrastructure remains allocated. The suspension record sets suspended_at and the authentication middleware checks the tenant status on every login and API request. Suspension is reversible instantly — a billing event or admin action sets status = 'active' again with no reprovisioning required.
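The per-request status gate can be sketched as a small middleware helper. An illustrative sketch (the dict is a stand-in for the tenant registry lookup, and the `reason` field is an assumed attribute used to pick between 402 and 403):

```python
# Per-request tenant-status gate: suspended tenants get 402 (payment failure)
# or 403 (policy violation); active tenants pass through untouched.
tenants = {
    "acme": {"status": "active"},
    "globex": {"status": "suspended", "reason": "payment_failure"},
}

def check_tenant(slug):
    """Return an HTTP status to short-circuit with, or None to continue."""
    t = tenants.get(slug)
    if t is None or t["status"] == "deprovisioned":
        return 404
    if t["status"] == "suspended":
        return 402 if t.get("reason") == "payment_failure" else 403
    return None  # active: let the request proceed
```

Because the check reads only a status field, reactivation really is instant: flipping the row back to `active` is visible to the middleware on the next request.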
Offboarding Workflow
Offboarding follows a three-phase process to balance data safety with resource reclamation:
- Data export. Generate a full export of the tenant data (database dump, S3 contents) and make a signed download URL available to the account owner for 30 days. Set `status = 'suspended'` immediately so no new data can be written.
- 30-day hold. All resources remain allocated but inaccessible. This window allows the tenant to reconsider or retrieve their export. A scheduled job checks daily for holds that have expired.
- Permanent deletion. After the hold period: drop the database schema or terminate the database instance, delete the S3 bucket, delete the Kubernetes namespace, remove the DNS record, and purge config entries. Set `status = 'deprovisioned'` and `deprovisioned_at = now()`. The tenant row itself is retained permanently for billing and audit records.
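The daily hold-expiry sweep in phase 2 is a simple time comparison. A sketch, assuming a `holds` mapping of tenant id to export-start timestamp (in practice this is a query over suspended tenants with a pending offboarding record):

```python
from datetime import datetime, timedelta, timezone

HOLD = timedelta(days=30)

def expired_holds(holds, now):
    """Return tenant ids whose hold started at least HOLD ago (delete-eligible)."""
    return [tid for tid, started in holds.items() if now - started >= HOLD]

# Example: one tenant past the hold window, one still inside it.
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
holds = {
    "t-1": now - timedelta(days=31),  # expired: eligible for permanent deletion
    "t-2": now - timedelta(days=5),   # still in the reconsideration window
}
to_delete = expired_holds(holds, now)
```

Each id returned is then handed to the permanent-deletion workflow; the sweep itself is idempotent, so running it twice on the same day is harmless.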
Multi-Tenant Isolation Strategies
The right isolation model depends on the number of tenants and their security requirements:
- Shared schema with tenant_id filter. All tenants share the same tables, and every query must include `WHERE tenant_id = ?`. Cheapest to operate, but highest risk of a missing filter leaking cross-tenant data; mitigate with row-level security (RLS) enforced in the database, not the application layer.
- Schema-per-tenant. Each tenant gets a dedicated PostgreSQL schema. Migrations must be applied across all schemas. Provides strong isolation with moderate operational overhead. Good up to ~1,000 tenants per database instance.
- DB-per-tenant. Each tenant gets a dedicated database instance. Maximum isolation, suitable for enterprise customers with compliance requirements. High cost: typically reserved for high-ARPU tenants on enterprise plans.
Interview Talking Points
- How do you handle a provisioning step failing halfway? Use a saga pattern with compensating transactions: if step 3 (Kubernetes namespace) fails, the compensating transactions delete the S3 bucket created in step 2 and the DB schema from step 1, then set status back to pending for retry.
- How do you enforce tenant isolation at the query layer? PostgreSQL row-level security policies tied to a per-connection session variable set with `SET app.current_tenant_id = ?` at connection time. The database enforces the filter even if the application forgets to include it.
- How do you scale to 100,000 tenants? Shared schema with RLS for the long tail; offer schema-per-tenant or DB-per-tenant as paid upgrades for compliance-sensitive customers.
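The compensating-transaction answer can be made concrete with a small saga runner. An illustrative sketch (step and function names are my own; a production system would use a durable engine such as Temporal rather than in-process lambdas):

```python
# Saga with compensating transactions: each completed step records an undo
# action; on failure the undos run in reverse order, then status returns to
# 'pending' so the whole workflow can be retried from a clean slate.

def run_saga(steps, state):
    done = []  # (name, undo) pairs for completed steps
    state["status"] = "provisioning"
    try:
        for name, do, undo in steps:
            do()
            done.append((name, undo))
    except Exception:
        for name, undo in reversed(done):  # compensate in reverse order
            undo()
        state["status"] = "pending"        # clean slate for retry
        return False
    state["status"] = "active"
    return True

# Simulate the interview scenario: step 3 (Kubernetes namespace) fails.
trace = []
def fail():
    raise RuntimeError("namespace quota exceeded")

steps = [
    ("db_schema", lambda: trace.append("create_schema"), lambda: trace.append("drop_schema")),
    ("s3_bucket", lambda: trace.append("create_bucket"), lambda: trace.append("delete_bucket")),
    ("k8s_ns", fail, lambda: None),
]
state = {}
ok = run_saga(steps, state)
```

Compensations run in reverse creation order so that dependencies unwind cleanly (the bucket goes before the schema it references), which is exactly the behavior described in the talking point.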
Frequently Asked Questions
What is a tenant provisioning service in system design?
A tenant provisioning service automates the end-to-end onboarding of new customers (tenants) onto a multi-tenant SaaS platform. It orchestrates the creation and configuration of all resources a tenant needs: a unique tenant identifier, isolated or shared database schema, default roles and permissions, initial admin user account, billing subscription, feature flag defaults, and any third-party integrations. It also handles offboarding: graceful data export, resource teardown, and permanent deletion on request. Design goals include idempotent operations (safe to retry on partial failure), a durable workflow engine to track multi-step progress, clear tenant isolation boundaries, and compliance with data residency and deletion requirements.
What are the steps in a tenant onboarding workflow?
A typical onboarding workflow orchestrated as a durable saga or step function: (1) Validate input — check plan, billing info, and domain uniqueness; (2) Allocate tenant ID — generate a globally unique opaque identifier (UUID or Snowflake ID); (3) Provision data store — create schema, database, or namespace depending on isolation model; run migrations; seed default data; (4) Create IAM resources — set up tenant-scoped roles, permission groups, and an initial admin user with a temporary password or invite link; (5) Configure billing — create a subscription in Stripe/billing system tied to the tenant ID and selected plan; (6) Apply feature flags — set plan-appropriate feature flag defaults in the flag service; (7) Provision integrations — create any tenant-specific queues, buckets, or external service accounts; (8) Send welcome email — notify the admin with login instructions; (9) Mark tenant active — flip status to ACTIVE in the tenant registry. Each step is idempotent and its completion is persisted, so a crash at any step resumes from the last completed checkpoint.
What are the tradeoffs between shared schema, schema-per-tenant, and DB-per-tenant multi-tenancy?
Shared schema (all tenants in one set of tables, with a tenant_id column): lowest infrastructure cost and operational overhead; simple to migrate (one schema change affects all tenants); but highest blast-radius risk (a bug can leak cross-tenant data if tenant_id filters are omitted), noisy-neighbor query contention, and harder to meet regulatory isolation requirements. Schema-per-tenant (each tenant has their own schema within a shared database cluster): strong logical isolation, easier per-tenant migrations and backups, but connection pool pressure grows with tenant count and schema proliferation complicates DDL tooling. DB-per-tenant (each tenant has a dedicated database instance): strongest isolation (hardware, network, failure domain), easiest compliance (data residency, audit), per-tenant scaling and maintenance windows; but highest cost, most operational complexity, and slowest provisioning (spinning up a new DB instance takes minutes). Most SaaS platforms use shared schema for SMB tiers and schema-per-tenant or DB-per-tenant for enterprise customers with isolation SLAs.
How do you design tenant offboarding and data deletion?
Tenant offboarding must be safe, auditable, and compliant with GDPR’s right to erasure. The offboarding workflow: (1) Soft delete — immediately mark the tenant as SUSPENDED in the registry and block all new API requests with a 403; (2) Data export window — notify the admin and provide a self-service data export (GDPR portability) for a configurable grace period (e.g., 30 days); (3) Cancel billing — terminate the subscription and issue any prorated refunds; (4) Hard delete — after the grace period (or immediately on an erasure request), delete all tenant data: drop the schema or database, delete object storage buckets, purge CDN cache, remove from search indices, delete from all downstream systems via a tenant.deleted event; (5) Audit record — retain a minimal tombstone record (tenant ID, deletion timestamp, requestor, reason) that contains no personal data but proves deletion occurred for compliance purposes; (6) Verify deletion — a post-deletion scan confirms no tenant data remains in hot stores, with results stored in the audit log. Use a workflow engine (Temporal, AWS Step Functions) to ensure each deletion step completes or retries without manual intervention.