Low Level Design: CI/CD Pipeline

What Is a CI/CD Pipeline?

A CI/CD pipeline automates the path from a code commit to a running production deployment. Continuous Integration (CI) validates every change — compiling, testing, linting. Continuous Delivery/Deployment (CD) packages the artifact and rolls it out to environments. The system must be fast, auditable, and safe: a bad deploy should be detectable and reversible within minutes.

Data Model


-- Pipelines (per-repo configuration)
pipelines (
  id            BIGINT PRIMARY KEY,
  repo_id       BIGINT NOT NULL,
  config_path   VARCHAR(512),   -- e.g., .ci/pipeline.yml
  created_at    TIMESTAMP
);

-- Pipeline Runs
pipeline_runs (
  id            BIGINT PRIMARY KEY,
  pipeline_id   BIGINT REFERENCES pipelines(id),
  trigger       ENUM('push','pr','schedule','manual'),
  commit_sha    CHAR(40),
  branch        VARCHAR(255),
  status        ENUM('pending','running','success','failed','cancelled'),
  started_at    TIMESTAMP,
  finished_at   TIMESTAMP
);

-- Stages (ordered groups of jobs)
stages (
  id            BIGINT PRIMARY KEY,
  run_id        BIGINT REFERENCES pipeline_runs(id),
  name          VARCHAR(255),
  order_index   INT,
  status        ENUM('waiting','running','success','failed','skipped')
);

-- Jobs (individual units of work)
jobs (
  id            BIGINT PRIMARY KEY,
  stage_id      BIGINT REFERENCES stages(id),
  name          VARCHAR(255),
  runner_id     BIGINT,
  image         VARCHAR(512),
  status        ENUM('queued','running','success','failed'),
  exit_code     INT,
  log_url       TEXT,
  started_at    TIMESTAMP,
  finished_at   TIMESTAMP
);

-- Artifacts
artifacts (
  id            BIGINT PRIMARY KEY,
  job_id        BIGINT REFERENCES jobs(id),
  name          VARCHAR(255),
  storage_key   TEXT,
  size_bytes    BIGINT,
  expires_at    TIMESTAMP
);

-- Deployments
deployments (
  id            BIGINT PRIMARY KEY,
  run_id        BIGINT REFERENCES pipeline_runs(id),
  environment   ENUM('staging','production'),
  strategy      ENUM('rolling','blue_green','canary'),
  status        ENUM('pending','running','success','rolled_back'),
  deployed_at   TIMESTAMP
);
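The status enums above imply a state machine: a run may only move forward through legal transitions, and terminal states are final. A minimal sketch of that rule (the transition table mirrors the pipeline_runs enum; the function name is illustrative, not from any particular framework):

```python
# Legal status transitions implied by the pipeline_runs enum above.
VALID_TRANSITIONS = {
    "pending":   {"running", "cancelled"},
    "running":   {"success", "failed", "cancelled"},
    "success":   set(),   # terminal
    "failed":    set(),   # terminal
    "cancelled": set(),   # terminal
}

def transition(current: str, new: str) -> str:
    """Return the new status, or raise if the transition is illegal."""
    if new not in VALID_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {new}")
    return new
```

Enforcing this in one place (the API layer or a database trigger) prevents, for example, a late runner heartbeat from flipping a cancelled run back to running.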

Core Workflow

  1. Trigger: A webhook from the VCS (push or PR event) hits the pipeline API. The API validates the payload, resolves the pipeline config (e.g., .ci/pipeline.yml) at the given commit SHA, and inserts a pipeline_run row.
  2. DAG Scheduling: The config defines stages in order; jobs within a stage run in parallel. A scheduler process polls for pending runs, builds the DAG, and enqueues job tasks onto a job queue (e.g., Redis Streams or Kafka).
  3. Runner Execution: Runner agents (ephemeral VMs or containers) pull jobs from the queue, spin up the specified Docker image, execute steps, stream logs to object storage in chunks, and report status back via gRPC heartbeats.
  4. Artifact Promotion: On job success, built artifacts (binaries, Docker images) are uploaded and registered in the artifacts table. Downstream jobs reference them by artifact name.
  5. Deployment: The deploy stage invokes the orchestration layer (Kubernetes, ECS, etc.), creates a deployments row, and monitors rollout health metrics. On success, the run is marked complete.

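The scheduling logic in steps 2–3 can be sketched as follows: stages execute sequentially by order_index, jobs within a stage fan out in parallel, and a failed stage short-circuits the run. This is a simplified in-process sketch (a real scheduler enqueues jobs to remote runners rather than running them in threads); the function names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipeline(stages, run_job):
    """Execute stages in order_index order; jobs within a stage in parallel.

    `stages` is a list of (order_index, [job_name, ...]) tuples.
    `run_job` returns True on success. A failed stage fails the run
    and skips all remaining stages.
    """
    for _, jobs in sorted(stages, key=lambda s: s[0]):
        with ThreadPoolExecutor() as pool:
            results = list(pool.map(run_job, jobs))  # parallel fan-out
        if not all(results):
            return "failed"
    return "success"
```

In the real system the "wait for all jobs in a stage" step is event-driven (runners report completion via gRPC) rather than a blocking join, but the control flow is the same.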
Failure Handling

  • Runner crash mid-job: Runners send heartbeats every 10 seconds. A watchdog marks jobs as failed if no heartbeat arrives within 30 seconds, then re-queues with a retry counter. Max 3 retries before permanent failure.
  • Flaky tests: Support a retry-on-failure count per job in config. Track flakiness rate per test case in a separate analytics table to surface chronic offenders.
  • Failed deployment: The deploy job monitors error rate and p99 latency via metrics API. If thresholds are breached within a configurable window, it triggers automatic rollback by redeploying the previous artifact SHA.
  • Queue backup: If the job queue depth exceeds a threshold, autoscale the runner fleet. Shed low-priority jobs (scheduled runs) first to protect PR and push-triggered runs.
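The heartbeat watchdog described above reduces to a periodic sweep over running jobs: anything silent past the timeout is either re-queued or permanently failed depending on its retry count. A minimal sketch, assuming an in-memory view of runner state (field names are illustrative):

```python
import time

HEARTBEAT_TIMEOUT_S = 30
MAX_RETRIES = 3

def check_jobs(running_jobs, now=None):
    """Classify running jobs by heartbeat age.

    `running_jobs` maps job_id -> {"last_heartbeat": ts, "retries": n}.
    Returns (requeue, permanent_fail) lists of job ids.
    """
    now = now if now is not None else time.time()
    requeue, permanent = [], []
    for job_id, j in running_jobs.items():
        if now - j["last_heartbeat"] <= HEARTBEAT_TIMEOUT_S:
            continue  # heartbeat is fresh; job is healthy
        if j["retries"] < MAX_RETRIES:
            requeue.append(job_id)
        else:
            permanent.append(job_id)
    return requeue, permanent
```

The sweep must be idempotent (a job already re-queued by another watchdog instance should not be re-queued twice), which in practice means the status flip is a conditional database update, not just this in-memory check.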

Scalability Considerations

  • Log streaming: Logs are write-once, read-sometimes. Stream directly to object storage in chunks; serve via pre-signed URLs. Do not store logs in the primary database.
  • Cache layers: Cache Docker layer pulls per runner host. Use a shared layer cache registry (e.g., a pull-through cache) to avoid redundant downloads across runners.
  • Multi-region runners: Place runner pools close to VCS and artifact storage regions to cut network latency for large artifact transfers.
  • Database partitioning: Partition pipeline_runs and jobs by created_at month. Purge or archive partitions older than the retention window (e.g., 90 days) without locking live tables.
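The retention logic behind monthly partitioning is simple date arithmetic: a partition is safe to purge once its entire month falls before the retention cutoff. A sketch of that check (partition naming scheme is an assumption for illustration):

```python
from datetime import date, timedelta

def expired_partitions(existing, today, retention_days=90):
    """Return partition names whose whole month predates the cutoff.

    `existing` maps a partition name (e.g. 'pipeline_runs_2025_03')
    to its (year, month). Only partitions that ended strictly before
    today - retention_days are eligible for purge/archive.
    """
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name, (year, month) in existing.items():
        # Last day of the partition's month: first of next month minus one day.
        if month == 12:
            month_end = date(year, 12, 31)
        else:
            month_end = date(year, month + 1, 1) - timedelta(days=1)
        if month_end < cutoff:
            expired.append(name)
    return sorted(expired)
```

Dropping a whole partition is a metadata operation in most databases, which is what makes this cheap compared to a DELETE over the same rows.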

Summary

A CI/CD pipeline is a DAG executor with an audit trail. The scheduling logic — resolving stage dependencies, distributing jobs, handling partial failures — is the core intellectual challenge. Everything else (log storage, artifact management, deployment strategies) is important but compositional. Invest in observability from day one: mean time to detect a broken build and mean time to deploy are your two north-star latency metrics.

