CI/CD Pipeline System Low-Level Design

What is a CI/CD Pipeline?

CI (Continuous Integration): automatically build and test code on every commit. CD (Continuous Delivery/Deployment): automatically deploy tested code to staging and production. A pipeline automates the path from “developer pushes code” to “users see the change,” reducing manual steps and deployment risk.

Requirements

  • Trigger on push/PR to run build, lint, test, and security scan
  • Deploy to staging automatically; deploy to production on approval or automatically
  • Support parallel job execution and job dependencies (DAG)
  • Artifact storage: Docker images, compiled binaries
  • 100 engineers, 500 pipeline runs/day, each run takes 2-20 minutes

Pipeline Architecture

Git Push → Webhook → Pipeline API → Pipeline DB (run metadata)
                                   → Job Queue (Kafka / Postgres SKIP LOCKED)
                                   → Job Runner Pool (K8s pods / VMs)
                                     → Step execution (build/test/deploy)
                                     → Artifact Store (S3)
                                     → Log Streamer (WebSocket → browser)
                                   → Notification Service → Slack/Email

Data Model

Pipeline(pipeline_id, repo_id, name, config_path, trigger ENUM(PUSH,PR,SCHEDULE,MANUAL))

PipelineRun(run_id UUID, pipeline_id, commit_sha, branch, triggered_by,
            status ENUM(PENDING,RUNNING,SUCCESS,FAILURE,CANCELLED),
            started_at, finished_at, duration_s)

Job(job_id UUID, run_id, name, stage, status, runner_id, started_at, finished_at)
JobDependency(job_id, depends_on_job_id)

Step(step_id UUID, job_id, name, command, status, exit_code, started_at, finished_at)
StepLog(step_id, log_chunk_id, content TEXT, offset INT, created_at)

Artifact(artifact_id UUID, run_id, name, type ENUM(DOCKER,BINARY,TEST_REPORT),
         storage_path, size_bytes, created_at)

Job Scheduling (DAG Execution)

Jobs form a directed acyclic graph (DAG) via dependencies. Topological sort determines execution order. A job becomes runnable when all its dependencies have succeeded. Scheduler loop:

while not run.is_complete():
    runnable = [job for job in run.jobs
                if job.status == PENDING
                and all(dep.status == SUCCESS for dep in job.dependencies)]
    for job in runnable:
        job.status = QUEUED   # mark before publishing so it is not enqueued twice
        enqueue(job)          # publish to job queue
    for job in run.jobs:
        if job.status == FAILURE:
            cancel_downstream(job)  # cancel jobs that transitively depend on the failure
    wait_for_status_change(run)     # block until a runner reports a result; avoids busy-polling
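The `cancel_downstream` step can be sketched as a BFS over the reverse dependency edges. This is a minimal sketch; `dependents` (a map from each job to the jobs that depend on it, i.e. the JobDependency table inverted) is an assumed input:

```python
from collections import deque

def cancel_downstream(failed_job, dependents):
    """Return the set of jobs that transitively depend on failed_job.

    dependents: dict mapping job name -> list of jobs that depend on it
    (the reverse of the JobDependency edges). In the real scheduler these
    jobs would be marked CANCELLED.
    """
    cancelled = set()
    queue = deque([failed_job])
    while queue:
        job = queue.popleft()
        for child in dependents.get(job, []):
            if child not in cancelled:
                cancelled.add(child)
                queue.append(child)
    return cancelled

# Example DAG: lint -> build -> {deploy_staging, deploy_prod}
deps = {"lint": ["build"], "build": ["deploy_staging", "deploy_prod"]}
cancel_downstream("lint", deps)  # -> {'build', 'deploy_staging', 'deploy_prod'}
```

Because the traversal follows only downstream edges, sibling branches that do not depend on the failed job keep running.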

Runner Architecture

Runners are stateless workers that poll the job queue, execute steps, and report results. Options:

  • Kubernetes pods: each job gets a fresh, ephemeral pod, so no state leaks between runs. Scale via HPA (Horizontal Pod Autoscaler).
  • Pre-warmed VM pool: faster startup (~5s vs ~30s for a cold pod). Useful when build times are short.
  • Docker-in-Docker: for jobs that build Docker images. Security concern: requires privileged mode; use Kaniko or Buildah as rootless alternatives.

Runner registration: runners register with the pipeline API (runner_id, capabilities, concurrency). Job queue: a Postgres table with SELECT … FOR UPDATE SKIP LOCKED is simple and reliable for hundreds of concurrent runners.
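The SKIP LOCKED claim can be sketched as a single atomic statement. This is a minimal sketch under assumptions: a `jobs` table with `job_id`, `status`, `locked_by`, `started_at`, `created_at` columns (matching the data model above), and any DB-API-style connection such as psycopg2:

```python
# Each runner executes this to atomically claim one pending job.
# SKIP LOCKED makes concurrent runners skip rows locked by other
# transactions, so runners never block each other or double-claim.
CLAIM_SQL = """
    UPDATE jobs
       SET status = 'RUNNING', locked_by = %(runner_id)s, started_at = NOW()
     WHERE job_id = (
         SELECT job_id FROM jobs
          WHERE status = 'PENDING'
          ORDER BY created_at
          LIMIT 1
          FOR UPDATE SKIP LOCKED
     )
    RETURNING job_id
"""

def claim_job(conn, runner_id):
    """Claim one pending job; return its job_id, or None if the queue is empty."""
    with conn.cursor() as cur:
        cur.execute(CLAIM_SQL, {"runner_id": runner_id})
        row = cur.fetchone()
        conn.commit()
        return row[0] if row else None
```

Wrapping the SELECT inside the UPDATE keeps claim-and-mark in one statement, so a crashed runner never leaves a job half-claimed mid-transaction.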

Artifact Management

Build artifacts (Docker images, binaries, test reports) are stored in S3; Docker images are pushed to a container registry (ECR, GCR). Artifact metadata lives in the Artifact table.

  • Retention policy: keep artifacts 30 days for feature branches, 90 days for main.
  • Cleanup job: runs nightly and deletes expired artifacts from S3 and the DB.
  • Artifact reuse: if the same commit SHA was already built successfully, skip the build step and reuse the existing artifact (content-addressable cache keyed by commit SHA + Dockerfile hash).
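The content-addressable cache key can be computed as below. A minimal sketch; the exact inputs (commit SHA plus the Dockerfile bytes) follow the keying scheme described above, and the helper name is illustrative:

```python
import hashlib

def artifact_cache_key(commit_sha: str, dockerfile: bytes) -> str:
    """Content-addressable key: same commit + same Dockerfile => same key."""
    h = hashlib.sha256()
    h.update(commit_sha.encode())
    h.update(hashlib.sha256(dockerfile).digest())  # hash of Dockerfile contents
    return h.hexdigest()

key = artifact_cache_key("a1b2c3", b"FROM python:3.12\n")
# Lookup: if an Artifact row keyed by `key` exists and its run succeeded,
# reuse its storage_path instead of rebuilding.
```

Hashing the Dockerfile (not just the commit) matters when the Dockerfile lives outside the repo or when re-runs patch build configuration without a new commit.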

Log Streaming

Engineers watch live logs during a job run. Architecture: runner streams log chunks to the log API (HTTP POST) as they are produced. Log API stores chunks in StepLog table and publishes to Redis Pub/Sub channel step:{step_id}. Browser connects via WebSocket to the log streaming server, which subscribes to Redis Pub/Sub and forwards chunks in real time. On reconnect: load historical chunks from DB, then subscribe to Redis for live updates.
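The reconnect path (historical replay, then live tail, with deduplication) can be sketched as follows. In-memory stand-ins replace the DB and Redis subscription; chunk dicts carry the `offset` field from the StepLog table:

```python
def resume_log_stream(last_offset, fetch_chunks, live_chunks):
    """Replay history after last_offset, then forward live chunks,
    dropping any chunk that arrived on both paths.

    fetch_chunks(offset) -> historical chunks with offset > last_offset (from DB)
    live_chunks          -> iterable of chunks from the Pub/Sub subscription
    Each chunk is a dict with 'offset' and 'content'.
    """
    seen = last_offset
    for chunk in fetch_chunks(last_offset):   # 1. historical replay from DB
        seen = max(seen, chunk["offset"])
        yield chunk
    for chunk in live_chunks:                 # 2. live tail from Redis Pub/Sub
        if chunk["offset"] > seen:            # skip duplicates that raced in
            seen = chunk["offset"]
            yield chunk

# Example: client reconnects having already seen offsets up to 2
history = [{"offset": 3, "content": "c"}]
live = [{"offset": 3, "content": "c"}, {"offset": 4, "content": "d"}]
out = list(resume_log_stream(2, lambda o: history, live))
# chunk 3 is delivered once even though it arrived on both paths
```

In production the real race window is between the DB read and the Pub/Sub subscribe; subscribing first and buffering, then replaying history, closes it, and the offset check above handles the resulting overlap.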

Deployment Stage

Deploy steps:

  1. Push the Docker image to the registry.
  2. Update the Kubernetes deployment: kubectl set image deployment/{name} {container}={image}:{tag}.
  3. Monitor the rollout: watch for READY replicas to equal the desired count; watch for pod crash loops.
  4. Run smoke tests against the new deployment.
  5. On failure, roll back automatically: kubectl rollout undo.
  6. Notify: Slack message with commit link, deployer, and environment.

Blue-green deployment: spin up new pods alongside the old, switch the load balancer, terminate the old pods after health checks pass.
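The rollout-monitoring step can be sketched as a polling loop. A minimal sketch: `get_status` is a hypothetical stand-in for a kubectl or Kubernetes API call that reports ready replicas, desired replicas, and whether any pod is crash-looping:

```python
import time

def wait_for_rollout(get_status, timeout_s=300, poll_s=5, sleep=time.sleep):
    """Poll until ready replicas == desired replicas, or fail the deploy.

    get_status() -> (ready, desired, crash_looping). Returns True on a
    completed rollout; False tells the caller to run `kubectl rollout undo`.
    """
    elapsed = 0
    while elapsed <= timeout_s:
        ready, desired, crash_looping = get_status()
        if crash_looping:
            return False                   # pods crash-looping: roll back now
        if desired > 0 and ready == desired:
            return True                    # rollout complete: run smoke tests
        sleep(poll_s)
        elapsed += poll_s
    return False                           # timed out: roll back

# Usage with a fake status sequence (sleep stubbed out for illustration):
statuses = iter([(1, 3, False), (2, 3, False), (3, 3, False)])
wait_for_rollout(lambda: next(statuses), sleep=lambda s: None)  # -> True
```

Failing fast on crash loops matters: without that check, a broken image would burn the whole timeout before the rollback fires.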

Key Design Decisions

  • Postgres SKIP LOCKED for job queue — simple, reliable, no additional queue infrastructure
  • Kubernetes runners — stateless, ephemeral, auto-scaling
  • Artifact caching by commit SHA — avoids redundant builds on re-runs
  • Redis Pub/Sub for log streaming — decouples log storage from live delivery
  • Automatic rollback on deploy failure — reduces MTTR for bad deployments
