What is a CI/CD Pipeline?
CI (Continuous Integration): automatically build and test code on every commit. CD covers two related practices: Continuous Delivery automatically deploys tested code to staging and keeps it ready for a production release gated by approval, while Continuous Deployment also pushes it to production automatically. A pipeline automates the path from “developer pushes code” to “users see the change,” reducing manual steps and deployment risk.
Requirements
- Trigger on push/PR to run build, lint, test, and security scan
- Deploy to staging automatically; deploy to production on approval or automatically
- Support parallel job execution and job dependencies (DAG)
- Artifact storage: Docker images, compiled binaries
- 100 engineers, 500 pipeline runs/day, each run takes 2-20 minutes
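A rough capacity estimate follows from the last requirement. The sketch below assumes runs average the midpoint of the 2-20 minute range and that a hypothetical 50% of daily runs arrive within a 4-hour peak window; both figures are illustrative assumptions, not part of the requirements.

```python
import math

RUNS_PER_DAY = 500
MIN_RUN_MIN, MAX_RUN_MIN = 2, 20

# Assumption: average run duration is the midpoint of the stated range.
AVG_RUN_MIN = (MIN_RUN_MIN + MAX_RUN_MIN) / 2

# Assumption (hypothetical): half of all runs arrive in a 4-hour peak window.
PEAK_SHARE = 0.5
PEAK_WINDOW_MIN = 4 * 60


def average_concurrency(runs: float, avg_min: float, window_min: float) -> float:
    """Little's law: concurrency = arrival rate * average duration."""
    return runs / window_min * avg_min


daily_avg = average_concurrency(RUNS_PER_DAY, AVG_RUN_MIN, 24 * 60)
peak_avg = average_concurrency(RUNS_PER_DAY * PEAK_SHARE, AVG_RUN_MIN, PEAK_WINDOW_MIN)
worst_case = average_concurrency(RUNS_PER_DAY * PEAK_SHARE, MAX_RUN_MIN, PEAK_WINDOW_MIN)

print(f"avg concurrent runs over 24h: {daily_avg:.1f}")
print(f"avg concurrent runs at peak:  {peak_avg:.1f}")
print(f"peak if every run takes 20m:  {math.ceil(worst_case)}")
```

These figures count whole runs. Each run fans out into several jobs, so the runner pool would need to be sized at a multiple of them.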
Pipeline Architecture
Git Push → Webhook → Pipeline API → Pipeline DB (run metadata)
→ Job Queue (Kafka / Postgres SKIP LOCKED)
→ Job Runner Pool (K8s pods / VMs)
→ Step execution (build/test/deploy)
→ Artifact Store (S3)
→ Log Streamer (WebSocket → browser)
→ Notification Service → Slack/Email
Data Model
Pipeline(pipeline_id, repo_id, name, config_path, trigger ENUM(PUSH,PR,SCHEDULE,MANUAL))
PipelineRun(run_id UUID, pipeline_id, commit_sha, branch, triggered_by,
            status ENUM(PENDING,RUNNING,SUCCESS,FAILURE,CANCELLED),
            started_at, finished_at, duration_s)
Job(job_id UUID, run_id, name, stage, status, runner_id, started_at, finished_at)
JobDependency(job_id, depends_on_job_id)
Step(step_id UUID, job_id, name, command, status, exit_code, started_at, finished_at)
StepLog(step_id, log_chunk_id, content TEXT, offset INT, created_at)
Artifact(artifact_id UUID, run_id, name, type ENUM(DOCKER,BINARY,TEST_REPORT),
         storage_path, size_bytes, created_at)
Job Scheduling (DAG Execution)
Jobs form a directed acyclic graph (DAG) via dependencies. Topological sort determines execution order. A job becomes runnable when all its dependencies have succeeded. Scheduler loop:
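    while not run.is_complete():
        failed = [job for job in run.jobs if job.status == FAILURE]
        if failed:
            for failed_job in failed:
                cancel_downstream(failed_job)  # cancel jobs that depended on the failed job
            break
        runnable = [job for job in run.jobs
                    if job.status == PENDING
                    and all(dep.status == SUCCESS for dep in job.dependencies)]
        for job in runnable:
            job.status = QUEUED  # assumed intermediate state; prevents enqueueing the same job twice
            enqueue(job)         # publish to job queue
        wait_for_status_change(run)  # hypothetical helper: block until a runner reports a result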
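Before scheduling starts, the dependency graph should be checked for cycles, since a cycle would leave jobs waiting forever. A minimal sketch using Kahn's algorithm, which also groups jobs into waves that could run in parallel; the job names and the `execution_waves` function are hypothetical.

```python
from collections import defaultdict


def execution_waves(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group jobs into waves; every job in a wave depends only on earlier waves.

    deps maps job name -> set of job names it depends on.
    Raises ValueError if the graph has a cycle or an unknown dependency.
    """
    remaining = {job: len(d) for job, d in deps.items()}
    dependents = defaultdict(list)
    for job, d in deps.items():
        for dep in d:
            if dep not in deps:
                raise ValueError(f"{job} depends on unknown job {dep}")
            dependents[dep].append(job)

    waves = []
    ready = sorted(job for job, n in remaining.items() if n == 0)
    while ready:
        waves.append(ready)
        next_ready = []
        for job in ready:
            for child in dependents[job]:
                remaining[child] -= 1
                if remaining[child] == 0:
                    next_ready.append(child)
        ready = sorted(next_ready)

    # Jobs that never reached zero unmet dependencies are part of a cycle.
    if sum(len(w) for w in waves) != len(deps):
        raise ValueError("dependency cycle detected")
    return waves


pipeline = {
    "build": set(),
    "lint": set(),
    "test": {"build"},
    "scan": {"build"},
    "deploy": {"lint", "test", "scan"},
}
print(execution_waves(pipeline))
# [['build', 'lint'], ['scan', 'test'], ['deploy']]
```

Waves are a simplification for validation and display. The scheduler loop above starts each job as soon as its own dependencies succeed, without waiting for the rest of a wave.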
Runner Architecture
Runners are stateless workers that poll the job queue, execute steps, and report results. Options:
- Kubernetes pods: each job gets a fresh, ephemeral pod, so no state leaks between runs. Scale via HPA (Horizontal Pod Autoscaler).
- Pre-warmed VM pool: faster startup (~5s vs ~30s for a cold pod). Useful when builds are short enough that startup time dominates.
- Docker-in-Docker: for jobs that build Docker images. Security concern: it requires privileged mode. Kaniko or Buildah can build images without a privileged Docker daemon.
Runner registration: runners register with the pipeline API (runner_id, capabilities, concurrency).
Job queue: a Postgres table with SELECT … FOR UPDATE SKIP LOCKED is simple and reliable for hundreds of concurrent runners.
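The SKIP LOCKED pattern can be sketched as a single claim statement. This is a minimal illustration, not a definitive schema: the table name `job`, the `QUEUED` status, and the `queued_at` column are assumptions, and the `%(name)s` parameter style assumes a psycopg-compatible driver.

```python
CLAIM_JOB_SQL = """
UPDATE job
SET status = 'RUNNING', runner_id = %(runner_id)s, started_at = now()
WHERE job_id = (
    SELECT job_id
    FROM job
    WHERE status = 'QUEUED'      -- assumed status value
    ORDER BY queued_at           -- assumed column; oldest job first
    LIMIT 1
    FOR UPDATE SKIP LOCKED       -- rows locked by other runners are skipped
)
RETURNING job_id
"""


def claim_next_job(cursor, runner_id: str):
    """Atomically claim one queued job; return its job_id, or None if none is free.

    `cursor` is a DB-API cursor that accepts pyformat parameters.
    The caller is responsible for committing the transaction.
    """
    cursor.execute(CLAIM_JOB_SQL, {"runner_id": runner_id})
    row = cursor.fetchone()
    return row[0] if row else None
```

Because locked rows are skipped instead of waited on, two runners polling at the same moment receive different jobs without blocking each other.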
Artifact Management
Binaries and test reports are stored in S3; Docker images are pushed to a container registry (ECR, GCR). Metadata for every artifact is stored in the Artifact table. Retention policy: keep artifacts for 30 days for feature branches, 90 days for main. A nightly cleanup job deletes expired artifacts from S3 and the DB. Artifact reuse: if the same commit SHA was already built successfully, skip the build step and reuse the existing artifact (content-addressable cache keyed by commit SHA + Dockerfile hash).
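The cache key and the retention check can be sketched as follows. The function names, the key format, and the truncation of the Dockerfile hash to 16 hex characters are illustrative assumptions; only the 30/90-day windows come from the policy above.

```python
import hashlib
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = {"main": 90}   # 90 days for main
DEFAULT_RETENTION_DAYS = 30     # 30 days for feature branches


def cache_key(commit_sha: str, dockerfile: bytes) -> str:
    """Key for artifact reuse: commit SHA plus a hash of the Dockerfile contents."""
    dockerfile_hash = hashlib.sha256(dockerfile).hexdigest()
    return f"{commit_sha}-{dockerfile_hash[:16]}"


def is_expired(branch: str, created_at: datetime, now: datetime) -> bool:
    """True when the artifact is older than the retention window for its branch."""
    days = RETENTION_DAYS.get(branch, DEFAULT_RETENTION_DAYS)
    return now - created_at > timedelta(days=days)


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
built = now - timedelta(days=45)
print(is_expired("feature/login", built, now))  # True: past the 30-day window
print(is_expired("main", built, now))           # False: within the 90-day window
```

Including the Dockerfile hash means a re-run with a changed build recipe produces a new key even when the commit SHA is unchanged.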
Log Streaming
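Engineers watch live logs during a job run. Architecture: the runner streams log chunks to the log API (HTTP POST) as they are produced. The log API stores chunks in the StepLog table and publishes them to the Redis Pub/Sub channel step:{step_id}. The browser connects via WebSocket to the log streaming server, which subscribes to Redis Pub/Sub and forwards chunks in real time. On reconnect: subscribe to Redis first and buffer live chunks, load historical chunks from the DB, then replay the buffer, dropping any chunk whose offset was already delivered. Loading history before subscribing would lose chunks published in between, because Redis Pub/Sub does not retain messages.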
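One detail matters on reconnect: Redis Pub/Sub does not retain messages, so chunks published between the DB read and the subscription can be lost, and if the subscription starts first they can be delivered twice. A minimal sketch of one way to handle this, assuming the server subscribes first, buffers live chunks while history loads, and deduplicates by the StepLog offset (treated here as strictly increasing within a step); `merge_on_reconnect` is a hypothetical helper.

```python
def merge_on_reconnect(history, buffered_live):
    """Yield log chunks in offset order, without duplicates.

    history:        (offset, content) pairs loaded from the StepLog table
    buffered_live:  (offset, content) pairs received from the live
                    subscription while history was loading
    """
    last_offset = -1
    for offset, content in sorted(history):
        if offset > last_offset:
            yield offset, content
            last_offset = offset
    for offset, content in sorted(buffered_live):
        if offset > last_offset:  # skip chunks already covered by history
            yield offset, content
            last_offset = offset


history = [(0, "clone repo"), (1, "install deps"), (2, "run tests")]
live = [(2, "run tests"), (3, "tests passed")]
for offset, content in merge_on_reconnect(history, live):
    print(offset, content)
```

Chunk 2 appears in both inputs but is emitted once. After the buffer is drained, the server can forward live chunks directly, still tracking the last delivered offset.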
Deployment Stage
Deploy steps: (1) Push the Docker image to the registry. (2) Update the Kubernetes deployment: kubectl set image deployment/{name} {container}={image}:{tag}. (3) Monitor the rollout: wait for READY replicas to equal the desired count (kubectl rollout status deployment/{name}) and watch for pod crash loops. (4) Run smoke tests against the new deployment. (5) On failure: automatic rollback with kubectl rollout undo deployment/{name}. (6) Notify: Slack message with commit link, deployer, and environment.
Blue-green deployment: spin up new pods alongside the old ones, switch the load balancer once the new pods pass health checks, then terminate the old pods.
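Steps (2) through (5) can be sketched as a small wrapper around kubectl. This is an illustration under stated assumptions: the `deploy` function, its parameters, the injectable `run` callable, and the 120s timeout are hypothetical, and the image push and Slack notification are left out.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]  # runs a command, returns its exit code


def run_command(cmd: Sequence[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def deploy(name: str, container: str, image: str, tag: str,
           smoke_test: Callable[[], bool],
           run: Runner = run_command,
           timeout: str = "120s") -> bool:
    """Roll out image:tag; roll back if the rollout or the smoke test fails."""
    target = f"deployment/{name}"
    steps = [
        ["kubectl", "set", "image", target, f"{container}={image}:{tag}"],
        # Blocks until the rollout finishes; exits non-zero on failure or timeout.
        ["kubectl", "rollout", "status", target, f"--timeout={timeout}"],
    ]
    # all() short-circuits, so later steps are skipped after the first failure.
    ok = all(run(step) == 0 for step in steps) and bool(smoke_test())
    if not ok:
        run(["kubectl", "rollout", "undo", target])
    return ok
```

Passing `run` as a parameter keeps the function testable without a cluster: a test can supply a fake that records the commands and returns chosen exit codes.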
Key Design Decisions
- Postgres SKIP LOCKED for job queue — simple, reliable, no additional queue infrastructure
- Kubernetes runners — stateless, ephemeral, auto-scaling
- Artifact caching by commit SHA — avoids redundant builds on re-runs
- Redis Pub/Sub for log streaming — decouples log storage from live delivery
- Automatic rollback on deploy failure — reduces MTTR for bad deployments