How does GitHub Actions securely execute untrusted code?

Security model: (1) Ephemeral VMs: GitHub-hosted runners are destroyed after each job, so no state leaks between jobs or repositories. (2) Secret scoping: workflows triggered by pull requests from forks do NOT receive repository secrets (this prevents malicious PRs from exfiltrating them). Secrets are masked in logs automatically. (3) GITHUB_TOKEN permissions: set per workflow. The default (read-only or read-write) is configured at the repository or organization level, and pull_request runs from forks receive a read-only token. Restrict further with the permissions: key. (4) OIDC federation: for cloud deployments, use OIDC instead of long-lived secrets. The workflow receives a short-lived JWT that the cloud provider validates, so no cloud credentials are stored in GitHub. (5) Third-party action pinning: pin to a full commit SHA (not a tag) to prevent supply chain attacks, e.g. uses: actions/checkout@abc123, where abc123 stands in for the full SHA. (6) Self-hosted runner risk: self-hosted runners persist between jobs. NEVER use them for public repos (untrusted PR code executes on your machine). Use the --ephemeral flag so runners deregister after one job.

System Design: Design GitHub Actions — CI/CD Platform, Workflow Engine, Runner Architecture, Artifact Storage, Secrets


GitHub Actions processes millions of workflow runs per day for millions of repositories. Designing a CI/CD platform tests your understanding of: workflow orchestration (DAG execution), ephemeral compute (runners), artifact and log management, secret storage, and the security model for executing untrusted code from pull requests. This guide covers the internal architecture for platform engineering interviews.

Workflow Engine

A workflow is a YAML-defined DAG of jobs. Each job contains steps (individual commands or action references), and the workflow engine orchestrates execution. Components: (1) Event listener: watches for triggering events such as push, pull_request, schedule, workflow_dispatch, and repository_dispatch. When an event matches a workflow trigger, it creates a workflow run. (2) Planner: parses the workflow YAML and builds the job dependency graph (DAG). It determines which jobs can run in parallel (no dependencies), which must wait (needs: [job_a, job_b]), and which are conditional (if: success() or if: github.event_name == 'push'). (3) Job scheduler: for each ready job, finds an available runner matching the runs-on label (ubuntu-latest, windows-latest, self-hosted), assigns the job, and dispatches it. (4) Step executor: on the runner, executes each step sequentially. Steps check out code, set up a runtime (setup-node, setup-python), run commands (npm test), or call actions (reusable steps, often third-party). (5) Status reporter: reports job and step status back to the workflow engine, updates the workflow run state, and notifies GitHub (check run status on the PR). Concurrency control: the concurrency key limits parallel runs. With concurrency: group: deploy-prod and cancel-in-progress: true, only one deployment in the group runs at a time, and a new run cancels the one in progress. Matrix strategy: run the same job with different parameter combinations (for example, 3 operating systems x 2 Node versions = 6 parallel jobs). The planner expands the matrix into individual jobs.
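
The planner's two jobs described above, resolving needs: dependencies and expanding a matrix, can be sketched in a few lines of Python. This is a minimal illustration, not GitHub's implementation: the function names plan_waves and expand_matrix, the "waves" grouping, and the generated job names are all assumptions made for the example.

```python
from itertools import product


def expand_matrix(job_name, matrix):
    """Expand a matrix mapping into one concrete job per parameter combination.

    `matrix` maps a parameter name to its list of values (hypothetical shape).
    Returns a list of (expanded_job_name, params) tuples.
    """
    keys = sorted(matrix)
    jobs = []
    for values in product(*(matrix[k] for k in keys)):
        params = dict(zip(keys, values))
        suffix = ", ".join(str(params[k]) for k in keys)
        jobs.append((f"{job_name} ({suffix})", params))
    return jobs


def plan_waves(needs):
    """Group jobs into waves that respect `needs` dependencies.

    `needs` maps job name -> list of job names it depends on. Every job in a
    wave has all of its dependencies in earlier waves, so the jobs inside one
    wave may run in parallel.
    """
    for job, deps in needs.items():
        for dep in deps:
            if dep not in needs:
                raise ValueError(f"job {job!r} needs unknown job {dep!r}")

    remaining = {job: set(deps) for job, deps in needs.items()}
    done = set()
    waves = []
    while remaining:
        # A job is ready once all of its dependencies have been scheduled.
        ready = sorted(job for job, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError(f"dependency cycle among: {sorted(remaining)}")
        waves.append(ready)
        done.update(ready)
        for job in ready:
            del remaining[job]
    return waves


if __name__ == "__main__":
    needs = {"build": [], "lint": [], "test": ["build"], "deploy": ["test", "lint"]}
    print(plan_waves(needs))
    for name, params in expand_matrix("test", {"os": ["ubuntu", "windows", "macos"], "node": [18, 20]}):
        print(name, params)
```

A real scheduler would dispatch each job as soon as its own dependencies finish rather than waiting for a whole wave, and would also evaluate if: conditions. The wave grouping is only the simplest way to show which jobs are independent.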

Runner Architecture

Runners are the compute instances that execute jobs. Two types: (1) GitHub-hosted runners: ephemeral VMs provisioned by GitHub. Each job gets a fresh VM (no state leakage between jobs or repositories). VM images: Ubuntu (latest LTS), Windows Server, and macOS. Pre-installed tools: common runtimes (Node, Python, Java, Go), Docker, and build tools. After the job completes, the VM is destroyed (no persistent state). Billing: per-minute usage. Standard runners are free for public repositories, and the Free plan includes 2,000 minutes per month for private repositories. (2) Self-hosted runners: customer-managed machines registered with GitHub. The runner agent polls GitHub for jobs matching its labels. Use for: custom hardware (GPU, ARM), private network access (deploy to on-prem), and cost optimization (large-scale usage). Security: self-hosted runners persist between jobs, so one job may leave malicious state for the next. For public repos: NEVER use self-hosted runners (untrusted PR code executes on your machine). Runner scaling: GitHub-hosted runners auto-scale to demand (thousands of concurrent VMs). Self-hosted: use autoscaling solutions (actions-runner-controller for Kubernetes) that spin up runners on demand and destroy them after each job. Ephemeral self-hosted: register the runner with the --ephemeral flag so it deregisters after one job (fresh for each job, like GitHub-hosted). This is the recommended approach for security.
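
To make label matching and ephemeral runners concrete, here is a small Python sketch of a runner pool. It assumes a runner qualifies for a job when it carries every label listed in runs-on, and that an ephemeral runner is removed from the pool after a single job. The Runner and RunnerPool classes are hypothetical names for illustration, not part of any GitHub API.

```python
from dataclasses import dataclass


@dataclass
class Runner:
    name: str
    labels: frozenset
    ephemeral: bool = True
    busy: bool = False


class RunnerPool:
    """In-memory stand-in for the set of registered runners (illustrative only)."""

    def __init__(self, runners):
        self.runners = {r.name: r for r in runners}

    def assign(self, runs_on):
        """Return the first idle runner that has every required label, or None."""
        required = frozenset(runs_on)
        for runner in self.runners.values():
            if not runner.busy and required <= runner.labels:
                runner.busy = True
                return runner
        return None

    def complete(self, runner):
        """Finish a job: ephemeral runners deregister, persistent ones go idle."""
        if runner.ephemeral:
            del self.runners[runner.name]
        else:
            runner.busy = False


if __name__ == "__main__":
    pool = RunnerPool([
        Runner("gpu-1", frozenset({"self-hosted", "linux", "gpu"})),
        Runner("box-1", frozenset({"self-hosted", "linux"}), ephemeral=False),
    ])
    r = pool.assign(["self-hosted", "gpu"])
    print(r.name)
    pool.complete(r)
    print(sorted(pool.runners))
```

The persistent runner returns to the pool with whatever the previous job left on disk, which is exactly the state-leak risk described above. The ephemeral one simply disappears and must be replaced by a fresh instance.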

Artifact and Log Storage

Jobs produce artifacts (build outputs, test results, coverage reports) and logs (step output, error messages). Artifacts: uploaded with actions/upload-artifact. Stored in object storage (an S3/GCS-style blob store). Retained for 90 days by default (configurable). Accessible via the Actions UI and API. Use for: passing data between jobs (job A builds, job B deploys the artifact), test reports, and release binaries. Log storage: each step produces stdout/stderr, which is streamed to the server in real time (the UI shows live logs) and stored compressed after completion. Logs are retained for the workflow run lifetime (90 days). Large logs are truncated (max 500 KB per step visible in UI, full log downloadable). Log masking: secrets registered in the repository settings are automatically masked in logs. If a secret value appears in the output, it is replaced with ***. This prevents accidental secret exposure in public logs, but it only matches the registered value itself, so a transformed copy (for example, base64-encoded) is not masked. Cache: actions/cache stores and restores dependency caches (node_modules, pip packages, Go modules) between runs. Caches are key-based, for example with a key derived from hashFiles('package-lock.json'). If the lock file changes: cache miss (fresh install). If unchanged: cache hit (skip install, save minutes). Cache storage: per-repository, limited to 10 GB by default, with LRU eviction. Cache hits can reduce run time by 50-80% for typical CI workflows.
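
Log masking and content-addressed cache keys are both simple enough to sketch. The Python below is an illustration under stated assumptions: mask_secrets and cache_key are hypothetical helper names, and the key uses a plain SHA-256 of the lock file as a stand-in for hashFiles rather than reproducing its exact algorithm.

```python
import hashlib


def mask_secrets(line, secrets):
    """Replace every occurrence of a registered secret with ***.

    Longer secrets are replaced first so that a secret containing another
    secret as a substring is masked as a whole.
    """
    for secret in sorted((s for s in secrets if s), key=len, reverse=True):
        line = line.replace(secret, "***")
    return line


def cache_key(prefix, lockfile_bytes):
    """Derive a cache key from the lock file contents.

    Same contents give the same key (cache hit); any change to the lock file
    gives a different key (cache miss, fresh install).
    """
    digest = hashlib.sha256(lockfile_bytes).hexdigest()
    return f"{prefix}-{digest}"


if __name__ == "__main__":
    print(mask_secrets("curl -H 'Authorization: s3cr3t' https://example.test", ["s3cr3t"]))
    print(cache_key("linux-npm", b'{"lockfileVersion": 3}'))
```

Because masking is plain substring replacement, it cannot catch a secret that a step has encoded or split before printing, which is why masking is a safety net rather than a security boundary.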

Security Model

CI/CD platforms execute arbitrary code, so security is paramount. Threat model: (1) Secret exfiltration: a malicious workflow step reads secrets and sends them to an external server. Mitigation: secrets are only available to workflows triggered by trusted events. Workflows triggered by pull_request from a fork do not receive repository secrets. (2) Supply chain attacks: a third-party action (uses: attacker/malicious-action@v1) executes malicious code. Mitigation: pin actions to a full commit SHA rather than a mutable tag (uses: actions/checkout@abc123, where abc123 stands in for the full SHA), use trusted actions only, and review third-party action code. (3) Runner escape: code in one job affects another. Mitigation: GitHub-hosted runners are ephemeral VMs (destroyed after each job). Self-hosted: use --ephemeral. (4) GITHUB_TOKEN scope: the automatic token is issued per job, and its permissions are set per workflow. The default (read-only or read-write) is configured at the repository or organization level, and pull_request runs from forks receive a read-only token. Restrict further with the permissions: key in YAML, following the principle of least privilege. OpenID Connect (OIDC): for deploying to cloud providers (AWS, GCP, Azure), use OIDC federation instead of long-lived secrets. The workflow receives a short-lived JWT from GitHub. The cloud provider validates the JWT and issues temporary credentials. No cloud credentials are stored in GitHub, because the trust is established via OIDC federation. This eliminates secret rotation and reduces the blast radius of a compromised workflow.
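
The cloud-provider side of the OIDC exchange comes down to checking a handful of claims before issuing temporary credentials. The Python sketch below shows only that claim check and assumes the JWT signature has already been verified against GitHub's published keys, a step deliberately omitted here. The function name check_claims, the audience value, and the example subject are hypothetical. The issuer URL is the one GitHub Actions uses for its OIDC tokens.

```python
import time

GITHUB_OIDC_ISSUER = "https://token.actions.githubusercontent.com"


def check_claims(claims, *, audience, allowed_subjects, now=None):
    """Return (ok, reason) for the claims of an already signature-verified JWT.

    `allowed_subjects` is the trust policy: the exact `sub` values (for
    example a specific repository and branch) allowed to obtain credentials.
    """
    now = time.time() if now is None else now

    if claims.get("iss") != GITHUB_OIDC_ISSUER:
        return False, "unexpected issuer"

    aud = claims.get("aud")
    audiences = aud if isinstance(aud, list) else [aud]
    if audience not in audiences:
        return False, "unexpected audience"

    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or exp <= now:
        return False, "token expired"

    if claims.get("sub") not in allowed_subjects:
        return False, "subject not trusted"

    return True, "ok"


if __name__ == "__main__":
    claims = {
        "iss": GITHUB_OIDC_ISSUER,
        "aud": "example-cloud",  # hypothetical audience
        "sub": "repo:octo-org/octo-repo:ref:refs/heads/main",
        "exp": time.time() + 300,
    }
    print(check_claims(claims, audience="example-cloud",
                       allowed_subjects={"repo:octo-org/octo-repo:ref:refs/heads/main"}))
```

Restricting the subject to a specific repository and branch is what limits the blast radius: a token minted for a fork or a feature branch fails the policy even though GitHub signed it.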

Scaling and Multi-Tenancy

GitHub Actions serves millions of repositories with varying load patterns. Scaling challenges: (1) Burst handling — a popular open-source project receives 100 PRs in an hour. Each PR triggers 5 workflow runs. 500 concurrent runs from one repository. The runner pool must scale to handle bursts without affecting other repositories. Per-repository concurrency limits prevent monopolization. (2) Runner provisioning latency — a job is queued. How quickly is a runner available? For GitHub-hosted: target < 30 seconds. Pre-warm a pool of ready VMs. Scale the pool based on historical demand patterns (more VMs ready during US business hours). (3) Multi-tenancy — repositories share the runner infrastructure. A cryptocurrency mining workflow (CPU-intensive) must not degrade other workflows. Mitigation: per-job time limits (6 hours max), per-repository monthly minute quotas, and abuse detection (flag abnormal CPU/network patterns). (4) Global distribution — runners in multiple regions. Jobs are assigned to the nearest region to the repository (reducing clone time for large repos). Cross-region overflow during regional peaks. (5) Queue management — when demand exceeds runner capacity: jobs queue. Priority: paid accounts over free tier. Within paid: FIFO. Show queue position and estimated wait time in the UI. During peak times (Monday morning US): queues may grow. The platform autoscales to reduce wait times within SLA (< 2 minutes for paid accounts).
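
The queueing rules above (paid before free, FIFO within a tier, and a per-repository concurrency cap) can be combined in a small priority queue. This Python sketch is one possible arrangement, not a description of GitHub's scheduler: the JobQueue class, the tier names, and the limit value are assumptions for illustration.

```python
import heapq
import itertools
from collections import Counter

TIER_PRIORITY = {"paid": 0, "free": 1}  # lower number is served first


class JobQueue:
    def __init__(self, per_repo_limit):
        self.per_repo_limit = per_repo_limit
        self._heap = []
        self._seq = itertools.count()  # arrival order, gives FIFO within a tier
        self.running = Counter()       # repo -> number of jobs currently running

    def enqueue(self, job_id, repo, tier):
        heapq.heappush(self._heap, (TIER_PRIORITY[tier], next(self._seq), job_id, repo))

    def dispatch(self):
        """Return the next job id whose repo is under its limit, or None."""
        skipped = []
        chosen = None
        while self._heap:
            entry = heapq.heappop(self._heap)
            _, _, job_id, repo = entry
            if self.running[repo] < self.per_repo_limit:
                self.running[repo] += 1
                chosen = job_id
                break
            skipped.append(entry)  # repo is at its cap, try the next job
        # Put capped jobs back with their original sequence numbers.
        for entry in skipped:
            heapq.heappush(self._heap, entry)
        return chosen

    def finish(self, repo):
        """Record that one running job of `repo` has completed."""
        self.running[repo] -= 1


if __name__ == "__main__":
    q = JobQueue(per_repo_limit=1)
    q.enqueue("f1", "oss/a", "free")
    q.enqueue("p1", "corp/b", "paid")
    q.enqueue("p2", "corp/b", "paid")
    print(q.dispatch(), q.dispatch(), q.dispatch())
```

Skipping a capped repository instead of blocking on it is what keeps one bursty project from stalling the queue for everyone else. Its jobs keep their place in line and run as soon as a slot frees up.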