Low Level Design: Container Registry

What Is a Container Registry?

A container registry is a storage and distribution system for container images. It stores named, versioned image manifests and the binary layer blobs they reference. Clients (Docker, containerd, Kubernetes) pull images before running containers and push images after building them. Designing a registry well requires solving image deduplication at the layer level, atomic manifest publishing, fine-grained access control, and safe garbage collection without interrupting live pulls.

Data Model

The registry is backed by a relational metadata store and a blob object store (S3-compatible or local filesystem).

CREATE TABLE repositories (
    id            BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    namespace     VARCHAR(128)  NOT NULL,
    name          VARCHAR(128)  NOT NULL,
    visibility    ENUM('public','private') NOT NULL DEFAULT 'private',
    created_at    DATETIME      NOT NULL DEFAULT CURRENT_TIMESTAMP,
    UNIQUE KEY uq_repo (namespace, name)
);

CREATE TABLE blobs (
    digest        CHAR(71)      NOT NULL PRIMARY KEY,  -- sha256:<hex>
    size_bytes    BIGINT UNSIGNED NOT NULL,
    storage_path  VARCHAR(512)  NOT NULL,
    uploaded_at   DATETIME      NOT NULL DEFAULT CURRENT_TIMESTAMP,
    ref_count     INT UNSIGNED  NOT NULL DEFAULT 0
);

CREATE TABLE manifests (
    id            BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    repo_id       BIGINT UNSIGNED NOT NULL,
    digest        CHAR(71)      NOT NULL,
    media_type    VARCHAR(256)  NOT NULL,
    payload       MEDIUMBLOB    NOT NULL,
    created_at    DATETIME      NOT NULL DEFAULT CURRENT_TIMESTAMP,
    UNIQUE KEY uq_manifest (repo_id, digest),
    FOREIGN KEY (repo_id) REFERENCES repositories(id)
);

CREATE TABLE tags (
    id            BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    repo_id       BIGINT UNSIGNED NOT NULL,
    name          VARCHAR(128)  NOT NULL,
    manifest_id   BIGINT UNSIGNED NOT NULL,
    updated_at    DATETIME      NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    UNIQUE KEY uq_tag (repo_id, name),
    FOREIGN KEY (repo_id)    REFERENCES repositories(id),
    FOREIGN KEY (manifest_id) REFERENCES manifests(id)
);

CREATE TABLE manifest_layers (
    manifest_id   BIGINT UNSIGNED NOT NULL,
    blob_digest   CHAR(71)      NOT NULL,
    layer_order   SMALLINT UNSIGNED NOT NULL,
    PRIMARY KEY (manifest_id, layer_order),
    FOREIGN KEY (manifest_id)  REFERENCES manifests(id),
    FOREIGN KEY (blob_digest)  REFERENCES blobs(digest)
);

CREATE TABLE upload_sessions (
    uuid          CHAR(36)      NOT NULL PRIMARY KEY,
    repo_id       BIGINT UNSIGNED NOT NULL,
    offset_bytes  BIGINT UNSIGNED NOT NULL DEFAULT 0,
    state         ENUM('active','complete','expired') NOT NULL DEFAULT 'active',
    started_at    DATETIME      NOT NULL DEFAULT CURRENT_TIMESTAMP,
    expires_at    DATETIME      NOT NULL
);

CREATE TABLE access_policies (
    id            BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    repo_id       BIGINT UNSIGNED NOT NULL,
    principal     VARCHAR(256)  NOT NULL,
    principal_type ENUM('user','service_account','team') NOT NULL,
    permission    ENUM('pull','push','admin') NOT NULL,
    UNIQUE KEY uq_policy (repo_id, principal),
    FOREIGN KEY (repo_id) REFERENCES repositories(id)
);

CREATE TABLE gc_marks (
    blob_digest   CHAR(71)      NOT NULL PRIMARY KEY,
    marked_at     DATETIME      NOT NULL DEFAULT CURRENT_TIMESTAMP
);

Core Workflows

Blob Upload (Chunked)

The OCI Distribution Spec defines a two-phase blob upload. The client first POSTs to /v2/<repo>/blobs/uploads/ to obtain a session UUID, then PATCHes one or more chunks, and finally PUTs with the expected digest to commit.

  1. POST creates an upload_sessions row with a UUID and expires_at = NOW() + 1 hour.
  2. Each PATCH appends chunk bytes to a staging file (keyed by UUID) and updates offset_bytes atomically with an optimistic-lock CAS on offset_bytes = expected. Out-of-order chunks return 416.
  3. PUT verifies the SHA-256 digest of the assembled file against the client-provided digest. On match, the blob is moved from staging to the permanent path, a blobs row is inserted (or ignored if duplicate), and the session is marked complete. On mismatch, the session is marked expired and the staging file is deleted.

Layer deduplication is free: if blobs.digest already exists, the PUT skips the move and increments ref_count. The client can also issue a HEAD against /v2/<repo>/blobs/<digest> before starting an upload; a 200 with Content-Length means the blob exists and the upload can be skipped entirely. Cross-repo blob mounting (POST /v2/<repo>/blobs/uploads/?mount=<digest>&from=<source-repo>) goes one step further and links an existing blob into the target repository without transferring any bytes.
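The offset check in step 2 of the chunked upload reduces to a single conditional UPDATE: the row advances only if the chunk starts exactly where the session left off. A minimal sketch using Python's sqlite3 and a pared-down upload_sessions table (append_chunk is a hypothetical helper, not part of the spec):

```python
import sqlite3

def append_chunk(db, session_uuid, expected_offset, chunk_len):
    """Advance the upload offset only if the chunk starts exactly where the
    session left off; a zero rowcount means out-of-order (reject with 416)."""
    cur = db.execute(
        "UPDATE upload_sessions "
        "SET offset_bytes = offset_bytes + ? "
        "WHERE uuid = ? AND offset_bytes = ? AND state = 'active'",
        (chunk_len, session_uuid, expected_offset),
    )
    return cur.rowcount == 1  # False -> reject the PATCH with 416

# Minimal in-memory demo
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE upload_sessions (uuid TEXT PRIMARY KEY, "
           "offset_bytes INTEGER NOT NULL DEFAULT 0, "
           "state TEXT NOT NULL DEFAULT 'active')")
db.execute("INSERT INTO upload_sessions (uuid) VALUES ('u1')")

print(append_chunk(db, "u1", 0, 1024))  # in-order chunk: True
print(append_chunk(db, "u1", 0, 1024))  # replayed chunk: False (offset already moved)
```

Because the compare-and-swap is a single statement, two replicas racing on the same session cannot both append the same chunk; the loser simply gets rowcount 0 and returns 416.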

Manifest Push

  1. Client PUTs to /v2/<repo>/manifests/<reference> where reference is a tag or digest.
  2. Registry parses the manifest JSON, verifies that every layer digest referenced in the manifest exists in blobs. Missing layers return 400 BLOB_UNKNOWN.
  3. Registry computes sha256(payload) as the manifest digest, inserts a manifests row, inserts manifest_layers rows, and upserts the tags row (if reference is a tag name). All three writes happen in a single transaction so partial state is never visible.
  4. Blob ref_count is incremented for each layer (or batch-incremented via a single UPDATE with WHERE IN).
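Steps 2 through 4 above can be sketched as one transaction. This is a simplified illustration using Python's sqlite3 with a pared-down column set; push_manifest and the demo schema are assumptions for the sketch, not the actual registry code:

```python
import hashlib, json, sqlite3

def push_manifest(db, repo_id, tag, payload):
    """Verify layers, insert the manifest and its layer links, upsert the
    tag, and bump ref_counts -- all inside a single transaction."""
    layers = [l["digest"] for l in json.loads(payload)["layers"]]
    digest = "sha256:" + hashlib.sha256(payload).hexdigest()
    marks = ",".join("?" * len(layers))
    with db:  # one transaction; rolls back everything if any statement fails
        (found,) = db.execute(
            f"SELECT COUNT(*) FROM blobs WHERE digest IN ({marks})", layers
        ).fetchone()
        if found != len(layers):                       # step 2
            raise ValueError("BLOB_UNKNOWN")
        cur = db.execute(                              # step 3
            "INSERT INTO manifests (repo_id, digest, payload) VALUES (?,?,?)",
            (repo_id, digest, payload))
        db.executemany(
            "INSERT INTO manifest_layers VALUES (?,?,?)",
            [(cur.lastrowid, d, i) for i, d in enumerate(layers)])
        db.execute(
            "INSERT INTO tags (repo_id, name, manifest_id) VALUES (?,?,?) "
            "ON CONFLICT (repo_id, name) DO UPDATE SET manifest_id = excluded.manifest_id",
            (repo_id, tag, cur.lastrowid))
        db.execute(                                    # step 4: batch increment
            f"UPDATE blobs SET ref_count = ref_count + 1 WHERE digest IN ({marks})",
            layers)
    return digest

db = sqlite3.connect(":memory:")
db.executescript("""
  CREATE TABLE blobs (digest TEXT PRIMARY KEY, ref_count INTEGER DEFAULT 0);
  CREATE TABLE manifests (id INTEGER PRIMARY KEY, repo_id INT, digest TEXT, payload BLOB);
  CREATE TABLE manifest_layers (manifest_id INT, blob_digest TEXT, layer_order INT);
  CREATE TABLE tags (repo_id INT, name TEXT, manifest_id INT, UNIQUE (repo_id, name));
""")
db.execute("INSERT INTO blobs (digest) VALUES ('sha256:aaa')")
payload = json.dumps({"layers": [{"digest": "sha256:aaa"}]}).encode()
print(push_manifest(db, 1, "latest", payload))  # prints the manifest digest
```

The single `with db:` block is what makes partial state invisible: a missing layer or a crashed replica leaves no manifest row, no layer links, and no moved tag.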

Image Pull

  1. GET /v2/<repo>/manifests/<tag> — registry looks up tags JOIN manifests, returns the manifest payload with Docker-Content-Digest header.
  2. Client extracts layer digests from the manifest and GETs each blob via /v2/<repo>/blobs/<digest>. Registry looks up blobs.storage_path and either streams the file or returns a 307 redirect to a pre-signed S3 URL (preferred for large blobs to avoid proxying bandwidth through the registry).
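Because blobs are content-addressed, storage_path can often be derived from the digest alone rather than looked up. A minimal sketch of one common fan-out layout (the exact layout here is a convention, not mandated by the OCI spec):

```python
def blob_storage_path(digest: str) -> str:
    """Map 'sha256:<hex>' to a deterministic object-store key. The two-char
    prefix directory keeps any single directory or S3 prefix listing small."""
    algo, _, hexpart = digest.partition(":")
    if algo != "sha256" or len(hexpart) != 64:
        raise ValueError("unsupported digest")
    return f"blobs/{algo}/{hexpart[:2]}/{hexpart}/data"

print(blob_storage_path("sha256:" + "ab" * 32))
```

A derived path also means replicas need no coordination to agree on where a blob lives, which is what makes the 307 pre-signed redirect cheap to compute.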

Access Control

The registry uses token-based auth following the Docker Auth spec. An unauthenticated request receives 401 with a WWW-Authenticate header pointing to the auth service. The client exchanges credentials for a signed JWT scoped to repository:<name>:pull,push. The registry validates the JWT (RS256, cached public key) on every request; no session state is needed on the registry side. The access_policies table is consulted by the auth service, not the registry itself, keeping the registry stateless with respect to authorization.
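After signature verification, the registry's per-request check is just matching the token's granted scopes against the requested action. A sketch of that matching (scope_allows is a hypothetical helper; the `repository:<name>:<actions>` grammar follows the Docker token auth spec, where a real registry would read grants from the JWT's access claim):

```python
def scope_allows(scope: str, repo: str, action: str) -> bool:
    """Check whether a space-separated scope string such as
    'repository:team/app:pull,push' grants `action` on `repo`."""
    for entry in scope.split():
        parts = entry.split(":")
        if len(parts) < 3:
            continue
        # Repo names may contain ':' in theory; actions never do.
        rtype, name, actions = parts[0], ":".join(parts[1:-1]), parts[-1]
        if rtype == "repository" and name == repo and action in actions.split(","):
            return True
    return False

print(scope_allows("repository:team/app:pull,push", "team/app", "push"))  # True
print(scope_allows("repository:team/app:pull", "team/app", "push"))       # False
```

Since the check is pure string matching over a signed claim, it needs no DB round-trip, which is what keeps the hot pull path stateless.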

Garbage Collection

GC is the hardest part. A naive approach — delete blobs with ref_count = 0 — races with concurrent pushes. The safe approach is a two-phase mark-and-sweep:

  1. Mark phase (online): Walk all manifests currently in the database, collect every referenced blob digest, and insert those digests into gc_marks with a timestamp. This read is done at a snapshot isolation level so it sees a consistent view.
  2. Sweep phase (offline): List all blobs in object storage. Any blob whose digest is NOT in gc_marks AND whose uploaded_at is older than a safety window (e.g., 2 hours) is a candidate for deletion. The safety window protects blobs that were just uploaded but whose manifests have not yet been committed.
  3. Sweep deletes from object storage first, then deletes the blobs row. If the process crashes between the two, the orphaned metadata row is harmless and will be swept again next cycle.

GC runs as a background job (cron or separate worker) and should be rate-limited to avoid saturating object storage I/O during peak hours.
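The sweep-candidate selection in step 2 is a single anti-join with the safety window applied. A sketch against a pared-down schema using Python's sqlite3 (column subset and the 2-hour window mirror the design above):

```python
import sqlite3

# Blobs that no mark references AND that are older than the safety window.
GC_SWEEP_SQL = """
SELECT b.digest
FROM blobs b
LEFT JOIN gc_marks g ON g.blob_digest = b.digest
WHERE g.blob_digest IS NULL
  AND b.uploaded_at < datetime('now', '-2 hours')
"""

db = sqlite3.connect(":memory:")
db.executescript("""
  CREATE TABLE blobs (digest TEXT PRIMARY KEY, uploaded_at TEXT NOT NULL);
  CREATE TABLE gc_marks (blob_digest TEXT PRIMARY KEY);
""")
db.execute("INSERT INTO blobs VALUES ('sha256:old-unmarked', datetime('now','-1 day'))")
db.execute("INSERT INTO blobs VALUES ('sha256:old-marked',   datetime('now','-1 day'))")
db.execute("INSERT INTO blobs VALUES ('sha256:fresh',        datetime('now'))")
db.execute("INSERT INTO gc_marks VALUES ('sha256:old-marked')")

candidates = [row[0] for row in db.execute(GC_SWEEP_SQL)]
print(candidates)  # only the old, unmarked blob; marked and fresh blobs survive
```

The fresh blob survives even though it is unmarked: its manifest may simply not have been pushed yet, which is exactly the race the safety window exists to absorb.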

Key Design Decisions and Trade-offs

  • Content-addressed storage: Blobs are keyed by digest, not by repository or name. This enables cross-repo deduplication — a base image layer shared by 1,000 repositories is stored once. The trade-off is that GC must walk all repositories to find live references.
  • Metadata in relational DB, blobs in object store: Relational DB gives ACID transactions for manifest commits; object store gives cheap, durable, parallelizable blob I/O. Mixing the two requires careful ordering (write blob first, then metadata) so a crash never leaves metadata pointing at a missing blob.
  • 307 redirect vs proxy: Redirecting to pre-signed S3 URLs offloads bandwidth from the registry but exposes the storage backend URL to clients. In air-gapped or high-security environments, proxying through the registry is preferred at the cost of throughput.
  • Tag mutability: Tags are mutable by default (OCI spec). Immutable tags can be enforced by checking whether a tag row already exists and rejecting the PUT if it points to a different manifest — a repository-level policy stored in repositories.
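The immutable-tag policy from the last bullet reduces to a conditional write. A minimal sketch with sqlite3 (put_tag_immutable is a hypothetical helper; under real concurrency the SELECT/INSERT pair would need a row lock, or would rely on the uq_tag unique key to break ties):

```python
import sqlite3

def put_tag_immutable(db, repo_id, name, manifest_id):
    """Create the tag if absent; allow idempotent re-push of the same
    manifest; reject retargeting an existing tag (tag is immutable)."""
    existing = db.execute(
        "SELECT manifest_id FROM tags WHERE repo_id = ? AND name = ?",
        (repo_id, name)).fetchone()
    if existing is None:
        db.execute("INSERT INTO tags (repo_id, name, manifest_id) VALUES (?,?,?)",
                   (repo_id, name, manifest_id))
        return True
    return existing[0] == manifest_id  # same manifest: no-op; different: reject

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tags (repo_id INTEGER, name TEXT, manifest_id INTEGER, "
           "UNIQUE (repo_id, name))")
print(put_tag_immutable(db, 1, "v1.0.0", 42))  # True  (created)
print(put_tag_immutable(db, 1, "v1.0.0", 42))  # True  (idempotent re-push)
print(put_tag_immutable(db, 1, "v1.0.0", 99))  # False (retarget rejected)
```

Allowing the idempotent case matters in practice: CI systems frequently retry a push of the exact same image, and that should not trip the immutability check.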

Failure Handling and Edge Cases

  • Partial uploads: The upload_sessions.expires_at field allows a cleanup job to delete staging files and mark sessions expired after the TTL. Clients that resume a failed upload issue a GET on the upload URL to retrieve the current offset before sending the next PATCH.
  • Concurrent tag updates: Two clients pushing to the same tag simultaneously will both succeed; the upsert on the uq_tag unique key means the last writer wins. The registry returns the committed manifest digest in the Docker-Content-Digest response header so callers can detect unexpected overwrites.
  • Manifest referencing deleted blob: If a blob is deleted before the manifest commit transaction completes, the existence check in step 2 of Manifest Push catches it and returns BLOB_UNKNOWN. The client must re-upload the missing blob.
  • Large layer streaming: Layers can be tens of gigabytes. The registry must not buffer the entire blob in memory; it streams chunks to the staging file using a fixed-size buffer (e.g., 32 MB) and computes the rolling SHA-256 incrementally.
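The fixed-buffer streaming in the last bullet pairs the copy with an incremental digest, so the layer is hashed as it lands in staging rather than re-read afterward. A sketch (stream_to_staging is a hypothetical helper):

```python
import hashlib
import io

def stream_to_staging(src, dst, bufsize=32 * 1024 * 1024):
    """Copy the layer stream to the staging file while folding each chunk
    into a running SHA-256, so memory use is bounded by bufsize."""
    h = hashlib.sha256()
    while True:
        buf = src.read(bufsize)
        if not buf:
            break
        h.update(buf)
        dst.write(buf)
    return "sha256:" + h.hexdigest()

# Demo with in-memory streams and a small buffer to force multiple chunks
data = b"layer-bytes " * 1000
staged = io.BytesIO()
digest = stream_to_staging(io.BytesIO(data), staged, bufsize=4096)
print(digest == "sha256:" + hashlib.sha256(data).hexdigest())  # True
```

The same running hash serves the final PUT verification: by the time the last chunk is staged, the digest to compare against the client's claim is already computed.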

Scalability Considerations

  • Horizontal registry replicas: The registry is stateless (auth via JWT, blobs in shared object store, metadata in shared DB), so multiple registry instances can run behind a load balancer. Upload sessions are keyed by UUID in the shared DB, and chunks are written to a shared staging prefix in object storage, so any replica can handle any chunk of an upload.
  • Read caching: Manifest payloads are immutable once committed (digest is the key). A CDN or Nginx proxy cache in front of the manifest and blob endpoints serves repeat pulls without hitting the registry at all. Cache-Control headers with long TTLs are safe for digest-addressed resources.
  • Database sharding: The repositories table can be sharded by namespace. The manifests, tags, and manifest_layers tables all hang off the repo_id, making per-shard queries straightforward. Blobs are a cross-shard concern; a separate global blob catalog (or content-addressed object storage with no catalog) handles deduplication across shards.
  • Rate limiting: Push operations are more expensive than pulls (SHA-256 computation, DB writes, GC impact). Apply stricter rate limits to push endpoints per repository per user using a token bucket in Redis.
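The push-side limiter in the last bullet is standard token-bucket arithmetic. An in-process sketch for illustration; a production version would run the same refill-and-take logic in a Redis Lua script so it is atomic across replicas:

```python
import time

class TokenBucket:
    """Token bucket: `rate` tokens/sec refill, capped at `burst` capacity."""
    def __init__(self, rate, burst, now=None):
        self.rate, self.capacity = rate, float(burst)
        self.tokens = float(burst)
        self.last = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# 1 push/sec with a burst of 2, driven by a simulated clock
tb = TokenBucket(rate=1.0, burst=2, now=0.0)
print([tb.allow(now=t) for t in (0.0, 0.1, 0.2, 1.3)])  # [True, True, False, True]
```

The burst parameter is what distinguishes "a CI job pushing three tags back-to-back" (fine) from "a runaway loop pushing continuously" (throttled).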

Summary

A container registry is a content-addressed blob store with a thin relational metadata layer on top. The key insight is that layer deduplication happens naturally when blobs are keyed by digest, making storage costs nearly independent of the number of repositories sharing common base images. The trickiest parts are garbage collection (requires a safe mark window to avoid racing with concurrent pushes) and chunked upload resumability (requires server-side offset tracking). Authentication is cleanly separated from the registry via JWT token delegation, keeping the registry itself stateless and horizontally scalable. Serving blobs via pre-signed redirects to object storage is the dominant scalability lever, as it removes the registry from the data path for the bulk of transferred bytes.
