Question 1

Why is the saga pattern necessary for multi-step tenant onboarding?

Accepted Answer

Tenant onboarding touches multiple systems: create DB schema, provision S3 bucket, create Stripe subscription, send welcome email. If step 3 (Stripe) fails after steps 1 and 2 have succeeded, you cannot just rerun everything from the beginning: a naive rerun would try to create the DB schema and S3 bucket a second time. A distributed transaction (2PC) could roll back all steps atomically, but Stripe and S3 do not participate in distributed transactions. The saga pattern models each step as an independent operation with a compensation (undo) operation. On failure at step N, execute the compensations for steps N-1 through 1 in reverse order to restore a clean state. Each step must be idempotent, so that retrying a step that already succeeded is safe. Compensation steps must also be idempotent: deleting an already-deleted S3 bucket returns 404, which is treated as success.
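The forward/compensate pairing described above can be sketched in a few lines. This is a minimal in-memory illustration, not a production orchestrator: the `Step` dataclass, `run_saga`, and the lambda steps are hypothetical names introduced here, and the "Stripe" step simply raises to simulate a failure.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]      # forward operation; should be idempotent
    compensate: Callable[[], None]  # undo operation; should be idempotent


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, undo the completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            # The failed step did not complete, so only earlier steps are undone.
            for done in reversed(completed):
                done.compensate()
            return False
        completed.append(step)
    return True


# Demo: step 3 fails after steps 1 and 2 succeeded.
log: List[str] = []


def fail_stripe() -> None:
    raise RuntimeError("stripe unavailable")  # simulated failure


steps = [
    Step("create_db_schema", lambda: log.append("do:db"), lambda: log.append("undo:db")),
    Step("provision_s3_bucket", lambda: log.append("do:s3"), lambda: log.append("undo:s3")),
    Step("create_stripe_subscription", fail_stripe, lambda: log.append("undo:stripe")),
]
ok = run_saga(steps)
print(ok, log)
```

In this sketch the failed step's own compensation is not run. A real Stripe step may need one anyway if creation can partially succeed.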

Question 2

How do you checkpoint a saga so it resumes after a process crash mid-onboarding?

Accepted Answer

Persist the saga state to a TenantOnboardingJob table: (tenant_id, status, completed_steps JSONB, failed_step, current_attempt). completed_steps is an array of the step names that have executed successfully, for example ["create_db_schema", "provision_s3_bucket"]. On process restart, load the job, find the first step not in completed_steps, and continue from there. Update completed_steps in a DB transaction immediately after each step succeeds. A crash can still land between a step succeeding and its checkpoint being committed, so each step must be idempotent: if create_db_schema is attempted twice (once before the crash, once after resume), the second attempt uses CREATE SCHEMA IF NOT EXISTS and is a no-op. This single-row checkpoint is the source of truth; the saga coordinator consults it on every resume.
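A rough sketch of the checkpoint-and-resume loop follows. It makes several assumptions for the sake of being runnable: SQLite stands in for the real database, completed_steps is stored as JSON text rather than JSONB, failed_step and current_attempt are omitted, and `open_store`, `run_or_resume`, and the handler functions are hypothetical names.

```python
import json
import sqlite3
from typing import Callable, Dict, List

STEP_ORDER = ["create_db_schema", "provision_s3_bucket",
              "create_stripe_subscription", "send_welcome_email"]


def open_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tenant_onboarding_job ("
        " tenant_id TEXT PRIMARY KEY,"
        " status TEXT NOT NULL,"
        " completed_steps TEXT NOT NULL)"  # JSON array as text (assumption)
    )
    return conn


def run_or_resume(conn: sqlite3.Connection, tenant_id: str,
                  handlers: Dict[str, Callable[[str], None]]) -> None:
    with conn:  # create the job row on first run; no-op on resume
        conn.execute(
            "INSERT OR IGNORE INTO tenant_onboarding_job VALUES (?, 'running', '[]')",
            (tenant_id,),
        )
    row = conn.execute(
        "SELECT completed_steps FROM tenant_onboarding_job WHERE tenant_id = ?",
        (tenant_id,),
    ).fetchone()
    completed: List[str] = json.loads(row[0])
    for name in STEP_ORDER:
        if name in completed:
            continue  # already checkpointed; skip on resume
        handlers[name](tenant_id)  # may raise; the checkpoint is left as-is
        completed.append(name)
        with conn:  # commit the checkpoint right after the step succeeds
            conn.execute(
                "UPDATE tenant_onboarding_job SET completed_steps = ? WHERE tenant_id = ?",
                (json.dumps(completed), tenant_id),
            )
    with conn:
        conn.execute(
            "UPDATE tenant_onboarding_job SET status = 'completed' WHERE tenant_id = ?",
            (tenant_id,),
        )


# Demo: the Stripe step fails once (simulated crash), then the job is resumed.
calls: List[str] = []
crash_once = {"armed": True}


def record(name: str) -> Callable[[str], None]:
    return lambda tenant_id: calls.append(name)


def flaky_stripe(tenant_id: str) -> None:
    if crash_once["armed"]:
        crash_once["armed"] = False
        raise RuntimeError("simulated crash")
    calls.append("create_stripe_subscription")


handlers = {name: record(name) for name in STEP_ORDER}
handlers["create_stripe_subscription"] = flaky_stripe

conn = open_store()
try:
    run_or_resume(conn, "t1", handlers)
except RuntimeError:
    pass  # process "crashed" mid-onboarding
run_or_resume(conn, "t1", handlers)  # resumes at the first unfinished step
final_status = conn.execute(
    "SELECT status FROM tenant_onboarding_job WHERE tenant_id = 't1'"
).fetchone()[0]
```

On the second call the first two steps are skipped because they are already in the checkpoint, so each handler runs exactly once across both runs.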

Question 3

How do you handle partial failures where compensation also fails?

Accepted Answer

Compensation failures (e.g., S3 bucket delete fails with 503) leave the system in an inconsistent state. These are called "stuck sagas." Handling: (1) Retry compensation with exponential backoff, since most 503s are transient and clear on a later attempt. (2) Set saga status to "compensation_failed" after N retries and alert on-call. (3) Maintain a dead-letter queue for stuck compensations; an operator reviews and manually resolves them. (4) Design compensations to be as reliable as possible: use idempotent deletes, handle 404 as success, and avoid compensations that depend on transient external state. For Stripe: if subscription creation may have partially succeeded, look the subscription up by metadata.tenant_id (this assumes tenant_id was written to the subscription metadata at creation) and cancel it if it exists.
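Points (1) and (2) above can be sketched as a small retry wrapper. The function name, defaults, and injectable `sleep` parameter are assumptions made for illustration. A real implementation would likely add jitter, catch only retryable errors, and push the job to the dead-letter queue on final failure.

```python
import time
from typing import Callable, List


def compensate_with_retry(
    compensate: Callable[[], None],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,  # injectable for tests
) -> str:
    """Retry a compensation with exponential backoff; return the saga status."""
    for attempt in range(max_attempts):
        try:
            compensate()
            return "compensated"
        except Exception:
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return "compensation_failed"  # caller should alert on-call


# Demo: the delete fails twice with a simulated 503, then succeeds.
delays: List[float] = []
attempts = {"n": 0}


def flaky_delete() -> None:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("503 Service Unavailable")


status = compensate_with_retry(flaky_delete, sleep=delays.append)
```

Passing `sleep=delays.append` records the backoff delays instead of actually waiting, which keeps the example fast.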

Question 4

What is the difference between a choreography-based and orchestration-based saga?

Accepted Answer

Choreography: each service listens for events and reacts. Service A publishes OrderPlaced; Service B listens, processes, and publishes OrderApproved; Service C listens to that, etc. No central coordinator. Advantages: loose coupling. Disadvantages: workflow logic is distributed across services — hard to trace, debug, or change the sequence. Orchestration: a central saga orchestrator (the TenantOnboardingJob runner) calls each service step in sequence and handles failures. All workflow logic lives in one place. Easier to trace, test, and modify. For tenant onboarding, orchestration is better: the sequence is fixed, steps are interdependent, and a central retry/compensation controller is easier to reason about than a distributed choreography.

Question 5

How do you test saga compensation logic without hitting real external services?

Accepted Answer

Use dependency injection and a test double for each external client. In tests, configure the mock to fail at a specific step, for example with unittest.mock: mock_stripe.create_subscription.side_effect = StripeError("card_declined"). Assert that after run_onboarding() returns a failure: (1) compensation was called for all completed steps in reverse order; (2) TenantOnboardingJob.status == "compensated"; (3) DB schema for the tenant was dropped; (4) S3 bucket was deleted. Write one test per failure scenario: fail at step 1, at step 2, at step 3, and so on through step N, plus a scenario where a step fails and its compensation also fails (expect status "compensation_failed"). This table-driven test coverage prevents regressions when adding new saga steps.
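One way such a failure-injection test might look with `unittest.mock` is sketched below. The `run_onboarding` here is a simplified stand-in written for this example (it returns a status string instead of updating TenantOnboardingJob), and `StripeError` plus the client method names `create_schema`, `drop_schema`, `create_bucket`, and `delete_bucket` are hypothetical.

```python
from typing import Any, Callable, List
from unittest.mock import Mock, call


class StripeError(Exception):
    """Stand-in for the Stripe client's error type (assumption)."""


def run_onboarding(tenant_id: str, db: Any, s3: Any, stripe: Any) -> str:
    """Simplified orchestrator: external clients are injected as arguments."""
    undo: List[Callable[[], None]] = []
    try:
        db.create_schema(tenant_id)
        undo.append(lambda: db.drop_schema(tenant_id))
        s3.create_bucket(tenant_id)
        undo.append(lambda: s3.delete_bucket(tenant_id))
        stripe.create_subscription(tenant_id)
    except Exception:
        for compensate in reversed(undo):
            compensate()
        return "compensated"
    return "completed"


def test_stripe_failure_compensates_in_reverse_order() -> None:
    # One parent mock records calls to all child clients in a single timeline.
    manager = Mock()
    manager.stripe.create_subscription.side_effect = StripeError("card_declined")

    status = run_onboarding("t1", manager.db, manager.s3, manager.stripe)

    assert status == "compensated"
    assert manager.mock_calls == [
        call.db.create_schema("t1"),
        call.s3.create_bucket("t1"),
        call.stripe.create_subscription("t1"),  # raises StripeError
        call.s3.delete_bucket("t1"),            # compensation, newest first
        call.db.drop_schema("t1"),
    ]
```

Using a single parent `Mock` lets the test check ordering across clients, which separate mocks cannot do on their own.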

Tenant Onboarding Low-Level Design: Saga, Provisioning, and Rollback

Core Data Model

Onboarding Saga Orchestrator

Individual Step Implementations

Key Interview Points