System Design Interview: Design a CI/CD Deployment Pipeline

Designing a CI/CD (Continuous Integration / Continuous Deployment) system is a common interview question at infrastructure-focused companies such as Cloudflare, Stripe, and Atlassian. The challenge is to build a fast, reliable pipeline that deploys code safely to thousands of servers and can roll back when something goes wrong.

Requirements

Functional: trigger build on code push, run tests, build artifacts, deploy to staging → production, support rollback, provide deployment status and logs.

Non-functional: fast builds (target < 10 min), reliable (no partial deploys), safe (gradual rollout, auto-rollback on errors), auditable (who deployed what, when).

Pipeline Stages

Code Push (git push)
       │
       ▼ webhook
┌──────────────┐
│  CI Server   │ (GitHub Actions, Jenkins, BuildKite)
│  - clone     │
│  - install   │
│  - lint/test │
│  - build     │
└──────┬───────┘
       │ artifact (Docker image, .tar.gz)
       ▼
┌──────────────────┐
│  Artifact Store  │ (S3, ECR, Artifactory)
└──────┬───────────┘
       │
       ▼
┌──────────────────┐
│  Staging Deploy  │ (smoke tests)
└──────┬───────────┘
       │ approval gate (manual or auto)
       ▼
┌──────────────────────────────────┐
│  Production Deploy               │
│  - Blue/Green OR Canary rollout  │
│  - Health checks post-deploy     │
│  - Auto-rollback on error spike  │
└──────────────────────────────────┘
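
The stage order in the diagram can be sketched as a small runner that executes stages in sequence and stops at the first failure. This is only an illustration: `Stage`, `run_pipeline`, and the placeholder stage callables are hypothetical names, and real CI servers express the same idea in their own configuration formats.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Stage:
    name: str
    run: Callable[[], bool]  # returns True on success


def run_pipeline(stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first failure.

    Returns (succeeded, names of the stages that completed successfully).
    """
    completed: List[str] = []
    for stage in stages:
        if not stage.run():
            return False, completed
        completed.append(stage.name)
    return True, completed


# Placeholder stages standing in for real work (all hypothetical).
pipeline = [
    Stage("build", lambda: True),
    Stage("test", lambda: True),
    Stage("publish_artifact", lambda: True),
    Stage("deploy_staging", lambda: True),
    Stage("deploy_production", lambda: True),
]

ok, done = run_pipeline(pipeline)
```

Because a failed stage short-circuits the run, a broken test suite never produces an artifact and a failed staging deploy never reaches production.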

Deployment Strategies

Blue/Green Deployment

Maintain two identical environments (blue = current, green = new). Deploy to green, run smoke tests, then switch load balancer traffic to green. Rollback is a near-instant switch back to blue. The cost is 2× infrastructure for as long as both environments are running.
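
A minimal sketch of the cut-over logic, assuming a router object that tracks which environment is live. `BlueGreenRouter`, `deploy`, and the `smoke_test` callable are hypothetical names; in practice the switch would be a load balancer or DNS change.

```python
class BlueGreenRouter:
    """Tracks which of two environments receives live traffic."""

    def __init__(self) -> None:
        self.live = "blue"

    @property
    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def switch(self) -> str:
        """Point traffic at the idle environment; returns the new live one."""
        self.live = self.idle
        return self.live


def deploy(router: BlueGreenRouter, smoke_test) -> bool:
    """Deploy to the idle environment and cut over only if smoke tests pass."""
    target = router.idle
    # Deploying the new artifact to `target` would happen here.
    if not smoke_test(target):
        return False  # live traffic was never touched
    router.switch()
    return True
```

Rollback is then a second call to `router.switch()`, which is why it is fast: the old environment is still running and unchanged.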

Canary Deployment

Route 1-5% of traffic to the new version and monitor error rates and latency. If the metrics stay healthy, increase the share in steps (for example 10%, 25%, 50%, then 100%). Roll back automatically if the error rate exceeds a threshold. Large companies such as Facebook and Google use this approach for gradual releases.

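# Canary rollout progression
# weight          = percent of traffic sent to the new version
# wait_minutes    = observation time before moving to the next stage
# error_threshold = maximum tolerated error rate (fraction of requests)
stages = [
    {"weight": 1,   "wait_minutes": 5,  "error_threshold": 0.01},
    {"weight": 5,   "wait_minutes": 10, "error_threshold": 0.01},
    {"weight": 25,  "wait_minutes": 30, "error_threshold": 0.005},
    # Final stage: full rollout with no observation window
    {"weight": 100, "wait_minutes": 0,  "error_threshold": 0},
]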
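
One way a controller might walk through a stage list like the one above is sketched here. `run_canary`, `set_weight`, and `get_error_rate` are hypothetical names. The sketch also assumes that a stage with no wait time has nothing to observe and that rolling back means setting the canary weight to 0.

```python
import time
from typing import Callable, Dict, List

STAGES: List[Dict[str, float]] = [
    {"weight": 1, "wait_minutes": 5, "error_threshold": 0.01},
    {"weight": 5, "wait_minutes": 10, "error_threshold": 0.01},
    {"weight": 25, "wait_minutes": 30, "error_threshold": 0.005},
    {"weight": 100, "wait_minutes": 0, "error_threshold": 0},
]


def run_canary(
    stages: List[Dict[str, float]],
    set_weight: Callable[[float], None],
    get_error_rate: Callable[[], float],
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Advance through the stages; roll back to 0% if a threshold is exceeded.

    Returns True if the rollout reached the last stage, False if rolled back.
    """
    for stage in stages:
        set_weight(stage["weight"])
        if stage["wait_minutes"] == 0:
            continue  # assumption: no observation window means no gate
        sleep(stage["wait_minutes"] * 60)
        if get_error_rate() > stage["error_threshold"]:
            set_weight(0)  # send all traffic back to the old version
            return False
    return True
```

Passing `sleep` as a parameter keeps the controller testable: a test can substitute a no-op instead of waiting for real minutes.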

Rolling Deployment

Replace instances one batch at a time (e.g., 10% of the fleet per batch). This is slower than blue/green but needs little extra infrastructure. Old and new versions run side by side during the rollout, so each release must be backward compatible with the previous one.
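
The batching can be sketched as follows, assuming the fleet is a list of host names and that the rollout halts at the first unhealthy batch. `batches`, `rolling_deploy`, `update`, and `healthy` are hypothetical names.

```python
from typing import Callable, List


def batches(hosts: List[str], batch_percent: int) -> List[List[str]]:
    """Split hosts into batches of roughly batch_percent of the fleet each."""
    # Integer ceiling of len(hosts) * batch_percent / 100, at least 1 host.
    size = max(1, (len(hosts) * batch_percent + 99) // 100)
    return [hosts[i:i + size] for i in range(0, len(hosts), size)]


def rolling_deploy(
    hosts: List[str],
    batch_percent: int,
    update: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> List[str]:
    """Update one batch at a time; stop after the first unhealthy batch.

    Returns the hosts that were updated, which is the set to revert on failure.
    """
    updated: List[str] = []
    for batch in batches(hosts, batch_percent):
        for host in batch:
            update(host)
        updated.extend(batch)
        if not all(healthy(host) for host in batch):
            break
    return updated
```

Stopping at the first unhealthy batch limits the blast radius to one batch, at the price of leaving the fleet on mixed versions until someone reverts or resumes.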

Build System Design

  • Build workers: ephemeral containers, auto-scaled from a pool. Each build gets a fresh isolated environment.
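  • Build caching: cache Docker layers and npm/pip dependencies, keyed by content hash. Cache key = hash(package.json) or hash(requirements.txt), or preferably the hash of the lockfile (e.g., package-lock.json) so the key changes whenever resolved versions change. A cache hit can cut a 5-minute build to around 30 seconds.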
  • Parallelism: fan-out test suites across multiple workers, merge results. Large test suites (10,000+ tests) run in parallel shards.
  • Build queue: Kafka or SQS. Builds wait in the queue until a worker is free; main-branch builds get priority over feature-branch builds (e.g., via a separate high-priority queue).
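
Three of the ideas above (content-hashed cache keys, test sharding, and a prioritized build queue) can be sketched in a few lines. This is an in-memory illustration, not how Kafka or SQS work internally; `cache_key`, `shard`, and `BuildQueue` are hypothetical names, and round-robin sharding is one simple assignment strategy among several.

```python
import hashlib
import heapq
import itertools
from typing import List, Tuple


def cache_key(lockfile_bytes: bytes, prefix: str = "deps") -> str:
    """Dependency cache key derived from the lockfile contents."""
    digest = hashlib.sha256(lockfile_bytes).hexdigest()[:16]
    return f"{prefix}-{digest}"


def shard(tests: List[str], num_shards: int) -> List[List[str]]:
    """Assign tests to shards round-robin."""
    return [tests[i::num_shards] for i in range(num_shards)]


class BuildQueue:
    """Main-branch builds are served first; FIFO order within a priority."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._counter = itertools.count()  # tie-breaker that preserves FIFO

    def push(self, build_id: str, branch: str) -> None:
        priority = 0 if branch == "main" else 1
        heapq.heappush(self._heap, (priority, next(self._counter), build_id))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]
```

The cache key changes only when the lockfile bytes change, so unrelated commits reuse the same dependency cache.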

Artifact Management

  • Tag every artifact with git commit SHA, branch, and build timestamp
  • Immutable artifacts: never overwrite — create new artifact per build
  • Retention policy: keep last N successful builds per branch; keep all production deploys for 90 days
  • Artifact signing: sign Docker images or tarballs to prevent tampered deployments
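
Tagging, immutability, and a keep-last-N retention rule can be sketched with an in-memory store. `Artifact`, `ArtifactStore`, and the tag format are hypothetical. The sketch covers only the per-branch rule and leaves out the 90-day production rule and signing.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Artifact:
    sha: str       # full git commit SHA
    branch: str
    built_at: int  # Unix timestamp of the build

    @property
    def tag(self) -> str:
        # Slashes are replaced so branch names like "feature/x" stay tag-safe.
        safe_branch = self.branch.replace("/", "-")
        return f"{safe_branch}-{self.sha[:12]}-{self.built_at}"


class ArtifactStore:
    def __init__(self) -> None:
        self._items: Dict[str, Artifact] = {}

    def put(self, artifact: Artifact) -> None:
        """Store an artifact; refuse to overwrite an existing tag."""
        if artifact.tag in self._items:
            raise ValueError(f"artifact {artifact.tag} already exists")
        self._items[artifact.tag] = artifact

    def prune(self, branch: str, keep_last: int) -> List[str]:
        """Delete all but the newest keep_last artifacts of a branch."""
        newest_first = sorted(
            (a for a in self._items.values() if a.branch == branch),
            key=lambda a: a.built_at,
            reverse=True,
        )
        removed = [a.tag for a in newest_first[keep_last:]]
        for tag in removed:
            del self._items[tag]
        return removed
```

Refusing overwrites in `put` is what makes a tag trustworthy: once a SHA has been deployed, the bytes behind that tag cannot change.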

Rollback Mechanism

  • Fast rollback: keep previous artifact version ready; switch load balancer or Kubernetes deployment back in < 60 seconds
  • Automatic rollback triggers: error rate > threshold, P99 latency spike, health check failures after deploy
  • Database migration rollback: the hardest part. Keep every migration backward compatible: add new columns first, and remove old ones only in a later deploy once no running code uses them. Track the applied migration version in the database.
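
The automatic rollback triggers can be sketched as a pure decision function that compares post-deploy metrics against a policy. `HealthSnapshot`, `RollbackPolicy`, `rollback_reasons`, and the default thresholds are all hypothetical; real values depend on the service.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HealthSnapshot:
    error_rate: float          # fraction of failed requests
    p99_latency_ms: float
    failed_health_checks: int


@dataclass
class RollbackPolicy:
    max_error_rate: float = 0.01        # illustrative default
    max_p99_ratio: float = 1.5          # allowed P99 growth vs. baseline
    max_failed_health_checks: int = 0


def rollback_reasons(
    baseline: HealthSnapshot, current: HealthSnapshot, policy: RollbackPolicy
) -> List[str]:
    """Return the triggers that fired; an empty list means keep the deploy."""
    reasons: List[str] = []
    if current.error_rate > policy.max_error_rate:
        reasons.append("error_rate")
    if current.p99_latency_ms > baseline.p99_latency_ms * policy.max_p99_ratio:
        reasons.append("p99_latency")
    if current.failed_health_checks > policy.max_failed_health_checks:
        reasons.append("health_checks")
    return reasons
```

Returning the list of reasons rather than a bare boolean gives the audit log and the notification message something concrete to record.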

Observability

  • Build logs: stream to centralized log store (Elasticsearch), retained for 30 days
  • Deployment events: emit deploy start/success/failure events to an event bus; subscribers post Slack notifications and page on-call through PagerDuty on failure
  • Deploy dashboard: current version per service, recent deploy history, rollback button
  • Metrics: build duration P50/P95/P99, build success rate, deploy frequency (DORA metric)
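
The metrics above are simple to compute once build and deploy records are collected. The sketch below uses the nearest-rank percentile definition, which is one of several common definitions, and the function names are hypothetical.

```python
from typing import List


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("percentile of empty list")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, clamped to at least rank 1.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def success_rate(outcomes: List[bool]) -> float:
    """Fraction of builds that succeeded."""
    return sum(outcomes) / len(outcomes)


def deploys_per_day(deploy_count: int, window_days: int) -> float:
    """Deployment frequency over a time window."""
    return deploy_count / window_days


# Build durations in seconds (made-up sample data).
durations = [30, 45, 50, 60, 65, 70, 90, 120, 300, 600]
p50 = percentile(durations, 50)
p95 = percentile(durations, 95)
```

With a long-tailed sample like this one, the P95 is dominated by the slowest build, which is why the tail percentiles are tracked alongside the median.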

Interview Checklist

  • Draw the full pipeline: push → build → test → artifact → staging → production
  • Explain build caching and parallelism for fast builds
  • Compare blue/green vs canary vs rolling; know when to use each
  • Address rollback: both application-level and database migration rollback
  • Mention DORA metrics: deployment frequency, lead time, MTTR, change failure rate