Tenant Onboarding Low-Level Design: Saga, Provisioning, and Rollback

Tenant onboarding provisions everything a new customer needs to use a multi-tenant SaaS product: it creates the account, workspace, and initial user; provisions infrastructure resources (database schema, S3 bucket prefix, Stripe customer); and configures defaults. The key design requirement is that onboarding behaves as all-or-nothing: a failure halfway through must leave no orphaned resources that need manual cleanup. No single transaction can span the database, S3, and Stripe, so true atomicity is not available. The saga pattern approximates it by pairing each provisioning step with a compensating action that undoes it.

Core Data Model

CREATE TABLE Tenant (
    tenant_id       UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    slug            VARCHAR(100) UNIQUE NOT NULL,   -- URL-safe identifier
    name            VARCHAR(255) NOT NULL,
    plan            VARCHAR(50) NOT NULL DEFAULT 'trial',
    status          VARCHAR(20) NOT NULL DEFAULT 'provisioning',
    -- provisioning, active, suspended, cancelled
    owner_user_id   BIGINT,
    stripe_customer_id VARCHAR(100),
    s3_prefix       VARCHAR(200),
    created_at      TIMESTAMPTZ DEFAULT NOW(),
    activated_at    TIMESTAMPTZ
);

CREATE TABLE TenantOnboardingJob (
    job_id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id       UUID NOT NULL REFERENCES Tenant(tenant_id),
    status          VARCHAR(20) NOT NULL DEFAULT 'pending',
    -- pending, running, completed, failed, compensating, rolled_back
    current_step    VARCHAR(50),
    completed_steps JSONB NOT NULL DEFAULT '[]',
    error_message   TEXT,
    started_at      TIMESTAMPTZ,
    completed_at    TIMESTAMPTZ,
    created_at      TIMESTAMPTZ DEFAULT NOW()
);
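
The status comment on TenantOnboardingJob lists six states but does not say which moves between them are legal. As a hedged sketch, the transition table below is an assumption inferred from how the orchestrator uses the column, not part of the original schema; ALLOWED_TRANSITIONS and can_transition are hypothetical names.

```python
from typing import Dict, Set

# Assumed transition table: the source lists the states but not the moves.
ALLOWED_TRANSITIONS: Dict[str, Set[str]] = {
    "pending": {"running"},
    # running -> running covers moving to the next step and resuming after a crash
    "running": {"running", "completed", "failed"},
    "failed": {"compensating"},
    # compensating -> failed is assumed to mean "rollback itself needs an operator"
    "compensating": {"rolled_back", "failed"},
    "completed": set(),
    "rolled_back": set(),
}


def can_transition(current: str, new: str) -> bool:
    """Return True if moving a job from `current` to `new` is allowed."""
    return new in ALLOWED_TRANSITIONS.get(current, set())
```

A guard like this could run before each status UPDATE so that a stale worker cannot move a completed or rolled-back job back to running.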

Onboarding Saga Orchestrator

ONBOARDING_STEPS = [
    'create_db_schema',
    'create_s3_prefix',
    'create_stripe_customer',
    'seed_default_data',
    'activate_tenant',
    'send_welcome_email',   # irreversible, so it runs after every reversible step
]

COMPENSATION = {
    'create_db_schema': 'drop_db_schema',
    'create_s3_prefix': 'delete_s3_prefix',
    'create_stripe_customer': 'archive_stripe_customer',
    'seed_default_data': 'delete_seeded_data',
    'send_welcome_email': None,        # cannot unsend an email
    'activate_tenant': 'deactivate_tenant',
}

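import json

def run_onboarding(job_id: str):
    job = db.fetchone("SELECT * FROM TenantOnboardingJob WHERE job_id=%s", [job_id])
    tenant = db.fetchone("SELECT * FROM Tenant WHERE tenant_id=%s", [job['tenant_id']])

    # Keep an ordered list rather than a set: compensation must run in reverse
    # order of execution, and a set does not preserve that order.
    completed = list(job['completed_steps'])

    for step in ONBOARDING_STEPS:
        if step in completed:
            continue  # already done, resume from the next step

        db.execute("""
            UPDATE TenantOnboardingJob SET current_step=%s, status='running'
            WHERE job_id=%s
        """, [step, job_id])

        try:
            result = execute_step(step, tenant)
            # From here on the resource exists, so a later failure must compensate it
            completed.append(step)
            # Save step output on the tenant record so compensation can locate
            # the resource (e.g. stripe_customer_id)
            apply_step_result(tenant['tenant_id'], step, result)
            # Checkpoint progress for resumability
            db.execute("""
                UPDATE TenantOnboardingJob
                SET completed_steps = completed_steps || %s::jsonb
                WHERE job_id=%s
            """, [json.dumps([step]), job_id])
        except Exception as e:
            db.execute("""
                UPDATE TenantOnboardingJob
                SET status='failed', error_message=%s
                WHERE job_id=%s
            """, [str(e), job_id])
            # Reload the tenant so compensation sees values written by earlier steps
            tenant = db.fetchone("SELECT * FROM Tenant WHERE tenant_id=%s", [job['tenant_id']])
            # Note: a step that raised partway through is not in `completed`,
            # so steps that can fail halfway need their own cleanup.
            compensate(job_id, tenant, completed)
            return

    db.execute("""
        UPDATE TenantOnboardingJob SET status='completed', completed_at=NOW()
        WHERE job_id=%s
    """, [job_id])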

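def compensate(job_id: str, tenant: dict, completed_steps: list):
    db.execute("UPDATE TenantOnboardingJob SET status='compensating' WHERE job_id=%s", [job_id])
    failures = []
    # Undo in reverse order of execution
    for step in reversed(completed_steps):
        comp = COMPENSATION.get(step)
        if not comp:
            continue  # irreversible step, nothing to undo
        try:
            execute_step(comp, tenant)
        except Exception as e:
            # Compensation failure: alert ops, do not retry automatically
            failures.append(step)
            alert_ops(f"Compensation failed for {step} on tenant {tenant['tenant_id']}: {e}")
    if failures:
        # Do not report a clean rollback while resources may still exist
        db.execute("""
            UPDATE TenantOnboardingJob SET status='failed', error_message=%s
            WHERE job_id=%s
        """, [f"compensation failed for: {', '.join(failures)}", job_id])
        return
    db.execute("UPDATE TenantOnboardingJob SET status='rolled_back' WHERE job_id=%s", [job_id])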
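
The ordering requirement in compensate can be illustrated without any external services. The following is a minimal in-memory sketch, not the orchestrator above: run_saga, the step tuples, and the log list are hypothetical names used only for illustration. It records finished steps in a list so that order is preserved, and undoes them in reverse when a later step raises.

```python
from typing import Callable, List, Optional, Tuple

# (name, action, compensation); compensation is None for irreversible steps.
Step = Tuple[str, Callable[[], None], Optional[Callable[[], None]]]


def run_saga(steps: List[Step], completed: Optional[List[str]] = None) -> Tuple[str, List[str]]:
    """Run steps in order, skipping names already in `completed`.

    Returns (status, completed_steps). On failure, compensations for the
    finished steps run in reverse order and an empty list is returned.
    """
    done = list(completed or [])
    compensation_for = {name: comp for name, _, comp in steps}
    for name, action, _ in steps:
        if name in done:
            continue  # checkpointed by an earlier run
        try:
            action()
        except Exception:
            for finished in reversed(done):
                comp = compensation_for.get(finished)
                if comp is not None:
                    comp()
            return "rolled_back", []
        done.append(name)
    return "completed", done


log: List[str] = []


def fail() -> None:
    raise RuntimeError("stripe unavailable")


steps: List[Step] = [
    ("create_db_schema", lambda: log.append("schema"), lambda: log.append("drop schema")),
    ("create_s3_prefix", lambda: log.append("prefix"), lambda: log.append("delete prefix")),
    ("create_stripe_customer", fail, None),
]

status, remaining = run_saga(steps)
```

Passing a non-empty completed list models a resumed job: those steps are skipped on the way forward but are still compensated if a later step fails.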

Individual Step Implementations

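import re

def execute_step(step: str, tenant: dict) -> dict:
    if step == 'create_db_schema':
        schema = f"tenant_{tenant['slug'].replace('-', '_')}"
        # The name is interpolated into DDL, so reject anything that is not a plain identifier
        if not re.fullmatch(r'[a-z0-9_]+', schema):
            raise ValueError(f"unsafe schema name: {schema!r}")
        db.execute(f"CREATE SCHEMA IF NOT EXISTS {schema}")
        # No session-wide SET search_path here: it would redirect the orchestrator's
        # later unqualified queries (e.g. TenantOnboardingJob) on the same connection.
        run_migrations(schema)  # apply base tables for this tenant
        return {'schema_name': schema}

    if step == 'create_s3_prefix':
        prefix = f"tenants/{tenant['tenant_id']}/"
        # Create a "folder" by uploading a zero-byte marker
        s3.put_object(Bucket=S3_BUCKET, Key=f"{prefix}.keep", Body=b'')
        return {'s3_prefix': prefix}

    if step == 'create_stripe_customer':
        customer = stripe.Customer.create(
            name=tenant['name'],
            metadata={'tenant_id': str(tenant['tenant_id'])},
            # Same key on every retry, so Stripe does not create a duplicate customer
            idempotency_key=f"onboarding-{tenant['tenant_id']}-create_stripe_customer",
        )
        return {'stripe_customer_id': customer.id}

    if step == 'activate_tenant':
        db.execute("""
            UPDATE Tenant SET status='active', activated_at=NOW()
            WHERE tenant_id=%s
        """, [tenant['tenant_id']])
        return {}

    # seed_default_data, send_welcome_email, and the compensation steps are not shown.
    # Fail loudly instead of returning None for a step with no handler.
    raise ValueError(f"unknown onboarding step: {step}")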
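
Two details of the steps above lend themselves to small pure helpers: deriving a schema name that is safe to interpolate into DDL, and deriving a Stripe idempotency key that stays the same across retries. The sketch below is illustrative only; schema_name_for, idempotency_key, and the slug rule (lowercase letters, digits, single hyphens) are assumptions, not part of the original design.

```python
import re
import uuid

# Assumed slug rule: lowercase alphanumerics separated by single hyphens.
# Underscores are rejected so "a-b" and "a_b" cannot map to the same schema.
_SLUG_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")
MAX_IDENTIFIER_LEN = 63  # PostgreSQL truncates identifiers longer than 63 bytes


def schema_name_for(slug: str) -> str:
    """Map a tenant slug to a schema name, refusing anything unsafe."""
    if not _SLUG_RE.fullmatch(slug):
        raise ValueError(f"unsafe tenant slug: {slug!r}")
    name = "tenant_" + slug.replace("-", "_")
    if len(name) > MAX_IDENTIFIER_LEN:
        raise ValueError(f"schema name too long: {name!r}")
    return name


def idempotency_key(job_id: str, step: str) -> str:
    """Derive a stable key: the same (job_id, step) always yields the same value."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"onboarding/{job_id}/{step}"))
```

Because the key depends only on the job and the step name, a worker that crashes after the Stripe call and retries on resume sends the same key again.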

Key Interview Points

The saga pattern is essential here — onboarding touches DB, S3, Stripe, and email. No single transaction can span these systems; each step must have a compensating action.
Idempotency at each step: CREATE SCHEMA IF NOT EXISTS and S3 put_object are naturally idempotent — safe to retry. Stripe customer creation requires an idempotency key header to prevent duplicate customers on retry.
Compensation cannot unsend emails or undo notifications — design the onboarding sequence so irreversible steps (email, notifications) come last, after all reversible infrastructure steps succeed.
Resumability: the completed_steps JSON array checkpoints progress. If the job worker crashes mid-onboarding, re-running the job skips already-completed steps.
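Tenant schema isolation (CREATE SCHEMA per tenant) provides schema-level isolation, which is stronger than row-level filtering on a shared table. Every query must be namespaced to the tenant's schema, and cross-tenant data leakage is prevented by schema boundaries as long as search_path and role privileges are set correctly for each request.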
Async onboarding UX: return immediately from the POST /tenants endpoint with tenant_id and status=provisioning. Poll GET /tenants/{id}/status or use a webhook to notify when activation is complete.
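
The polling side of the async flow can be sketched as a small client loop. This is a minimal sketch under stated assumptions: wait_for_activation is a hypothetical helper, and fetch_status stands in for whatever call performs GET /tenants/{id}/status and returns the status string. The delays and timeout are illustrative defaults.

```python
import time
from typing import Callable


def wait_for_activation(
    fetch_status: Callable[[str], str],
    tenant_id: str,
    timeout_s: float = 60.0,
    initial_delay_s: float = 0.5,
    max_delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the tenant leaves 'provisioning', backing off between calls.

    Returns the first non-provisioning status. Raises TimeoutError if the
    next wait would go past the deadline.
    """
    deadline = clock() + timeout_s
    delay = initial_delay_s
    while True:
        status = fetch_status(tenant_id)
        if status != "provisioning":
            return status
        if clock() + delay > deadline:
            raise TimeoutError(f"tenant {tenant_id} still provisioning after {timeout_s}s")
        sleep(delay)
        delay = min(delay * 2, max_delay_s)  # exponential backoff with a cap
```

Injecting sleep and clock keeps the loop testable without real waiting; a webhook removes the need for the loop entirely when the client can receive callbacks.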

Frequently Asked Questions

Why is the saga pattern necessary for multi-step tenant onboarding?
Tenant onboarding touches multiple systems: it creates a DB schema, provisions an S3 prefix, creates a Stripe customer, and sends a welcome email. If step 3 (Stripe) fails after steps 1 and 2 have succeeded, the schema and prefix already exist, so blindly retrying from the beginning is safe only if every step is idempotent; otherwise it would create duplicates. A distributed transaction (2PC) could roll back all steps atomically, but Stripe and S3 do not participate in distributed transactions. The saga pattern models each step as an independent operation with a compensation (undo) operation. On failure at step N, execute compensations N-1 through 1 in reverse order to restore a clean state. Each step should be idempotent, so retrying a successful step is safe. Compensation steps should also be idempotent: a delete that finds the resource already gone (for example a 404) is treated as success.

How do you checkpoint a saga so it resumes after a process crash mid-onboarding?
Persist the saga state to the TenantOnboardingJob table: (tenant_id, status, current_step, completed_steps JSONB, error_message). completed_steps is an array of step names that have been successfully executed, for example ["create_db_schema", "create_s3_prefix"]. On process restart, load the job, find the first step not in completed_steps, and continue from there. Each step must be idempotent: if create_db_schema is attempted twice (once before the crash, once after resume), the second attempt uses CREATE SCHEMA IF NOT EXISTS and is a no-op. Update completed_steps immediately after each step succeeds. This single-row checkpoint is the source of truth; the saga coordinator consults it on every resume.

How do you handle partial failures where compensation also fails?
Compensation failures (for example, an S3 delete failing with a 503) leave the system in an inconsistent state. These are called "stuck sagas." Handling: (1) Retry compensation with exponential backoff, since most 503s are transient. (2) After N retries, set the saga to a dedicated status such as "compensation_failed" and alert on-call. (3) Maintain a dead-letter queue for stuck compensations; an operator reviews and manually resolves them. (4) Design compensations to be as reliable as possible: use idempotent deletes, handle 404 as success, and avoid compensations that depend on transient external state. For Stripe, the customer can be looked up by metadata.tenant_id and cleaned up even if the response to the create call was lost.

What is the difference between a choreography-based and orchestration-based saga?
Choreography: each service listens for events and reacts. Service A publishes OrderPlaced; Service B listens, processes, and publishes OrderApproved; Service C listens to that, and so on. There is no central coordinator. Advantage: loose coupling. Disadvantage: workflow logic is distributed across services, which makes it hard to trace, debug, or change the sequence. Orchestration: a central saga orchestrator (the TenantOnboardingJob runner) calls each service step in sequence and handles failures. All workflow logic lives in one place, so it is easier to trace, test, and modify. For tenant onboarding, orchestration is better: the sequence is fixed, steps are interdependent, and a central retry/compensation controller is easier to reason about than a distributed choreography.

How do you test saga compensation logic without hitting real external services?
Use dependency injection and a test double for each external client. In tests, configure the mock to fail at a specific step, for example by giving the mocked Stripe client's create call a side effect that raises an error. Assert that after run_onboarding() returns: (1) compensation was called for all completed steps in reverse order; (2) TenantOnboardingJob.status == "rolled_back"; (3) the DB schema for the tenant was dropped; (4) the S3 prefix was deleted. Write one test per failure scenario: fail at step 1, step 2, step 3, and so on, plus a case where step N fails and compensation also fails. This table-driven test coverage prevents regressions when adding new saga steps.
