System Design: Docker Containers — Namespaces, Cgroups, Image Layers, Dockerfile Best Practices, Container Security

Containers are the building blocks of modern cloud-native applications. Knowing how Docker works internally — Linux namespaces, cgroups, layered filesystems, and image construction — demystifies the infrastructure your code runs on. This guide covers container internals, Dockerfile best practices, and container security — essential knowledge for backend engineering and DevOps interviews.

How Containers Work: Namespaces and Cgroups

Containers are not virtual machines. A container is a regular Linux process with two kernel features applied:

(1) Namespaces — provide isolation. Each container gets its own view of:

- PID namespace: the container sees its processes starting from PID 1, isolated from host processes.
- Network namespace: its own network interfaces, IP address, and port space.
- Mount namespace: its own filesystem tree.
- UTS namespace: its own hostname.
- IPC namespace: isolated inter-process communication.
- User namespace: can map container root to an unprivileged host user.

The container process believes it is running in its own isolated Linux environment, but it shares the host kernel.

(2) Cgroups (control groups) — limit and account for resource usage. A container cgroup constrains:

- CPU: shares, quota, period — e.g., this container gets at most 2 CPU cores.
- Memory: maximum memory usage — exceeding the limit triggers the OOM killer.
- I/O: block device read/write rate limits.
- PIDs: maximum number of processes.

Together: namespaces provide isolation (what the process can see), cgroups provide resource limits (what the process can use). No hypervisor, no guest kernel — containers are lightweight because they share the host kernel.
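On a host with Docker installed, the cgroup limits above map directly onto `docker run` flags. A sketch — the image, container name, and limit values are illustrative:

```shell
# Run a container with explicit cgroup limits:
#   --cpus        CPU quota (at most 2 cores' worth of CPU time)
#   --memory      memory limit; exceeding it triggers the OOM killer
#   --pids-limit  maximum number of processes in the container
docker run -d --name limited-app \
  --cpus="2.0" \
  --memory="512m" \
  --pids-limit=100 \
  nginx:1.27

# Inspect what the kernel actually enforces (cgroup v2 path):
docker exec limited-app cat /sys/fs/cgroup/memory.max
```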

Docker Image Layers and the Union Filesystem

A Docker image is a stack of read-only filesystem layers. Each instruction in a Dockerfile (FROM, RUN, COPY, ADD) creates a new layer. Layers are content-addressed: the layer ID is the SHA256 hash of its contents.

Layer reuse: if two images share the same base (FROM python:3.12), that base layer is stored once on disk and shared. This saves disk space and speeds up image pulls (only the layers you do not already have are downloaded).

When a container runs, Docker adds a thin read-write layer on top of the image layers using a union filesystem (overlay2 on modern Linux). Writes go to the read-write layer; reads fall through to the image layers. When the container is deleted, the read-write layer is discarded — the image layers are unchanged. This is why containers are ephemeral: any data written inside the container is lost unless it is written to a volume (a host directory mounted into the container).

Image size optimization: minimize the number of layers and the size of each layer. Combine RUN commands with && to reduce layers, and remove temporary files in the same RUN command that creates them (apt-get clean, rm -rf /var/lib/apt/lists/*) — a file deleted in a later layer still occupies space in the earlier layer that created it.
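A sketch of the cleanup pattern (base image and package are illustrative). Because each RUN produces an immutable layer, deleting files in a later RUN does not shrink the image; cleanup must happen in the same RUN that created the files:

```dockerfile
FROM debian:bookworm-slim

# BAD: the apt package lists persist in the first layer,
# even though a later layer deletes them:
#   RUN apt-get update && apt-get install -y curl
#   RUN rm -rf /var/lib/apt/lists/*

# GOOD: install and clean up in one RUN, producing a single small layer.
RUN apt-get update \
 && apt-get install -y --no-install-recommends curl \
 && apt-get clean \
 && rm -rf /var/lib/apt/lists/*
```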

Dockerfile Best Practices

Production Dockerfile patterns:

(1) Multi-stage builds — use one stage to build the application and another to run it. The build stage includes compilers, build tools, and dependencies; the run stage contains only the compiled binary and runtime dependencies. This reduces image size from 1GB+ (build stage) to 50-100MB (run stage).

(2) Use specific base image tags — FROM python:3.12.3-slim, not FROM python:latest. The latest tag changes unexpectedly and breaks reproducibility.

(3) Order instructions by change frequency — put rarely changing instructions (installing system packages) early and frequently changing instructions (copying application code) late. Docker caches layers — changing an early instruction invalidates all subsequent layers.

(4) Run as non-root — add USER appuser after creating the user. Running as root inside a container is a security risk (container escape vulnerabilities allow root on the host).

(5) Use .dockerignore — exclude .git, node_modules, __pycache__, and other unnecessary files from the build context. This speeds up builds and reduces image size.

(6) Health checks — add HEALTHCHECK CMD curl -f http://localhost:8080/health || exit 1. Docker and orchestrators use this to detect unhealthy containers.
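Several of these patterns can be combined in one file. A minimal multi-stage sketch, assuming a Go service with its main package at the repository root (paths and image tags are illustrative):

```dockerfile
# --- Build stage: full Go toolchain (hundreds of MB) ---
FROM golang:1.22 AS build
WORKDIR /src
# Copy dependency manifests first so this layer stays cached
# until go.mod/go.sum change (practice 3: order by change frequency).
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -o /app .

# --- Run stage: minimal runtime image, only the binary ---
FROM gcr.io/distroless/static-debian12
COPY --from=build /app /app
# distroless ships a built-in unprivileged user (practice 4: non-root).
USER nonroot
ENTRYPOINT ["/app"]
```

The run stage never sees the compiler or source tree, so the final image is typically tens of MB instead of the build stage's hundreds.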

Container Networking

Docker provides several networking modes:

(1) Bridge (default) — containers connect to a virtual bridge network (docker0). Each container gets an IP on the bridge subnet, and containers on the same bridge can communicate. Port mapping (-p 8080:80) exposes container ports to the host.

(2) Host — the container shares the host network namespace. No network isolation — the container uses the host IP and ports directly. Higher performance (no NAT overhead) but no port isolation.

(3) None — no networking. The container has only a loopback interface. Use for batch processing that does not need network access.

(4) Overlay — multi-host networking for Docker Swarm or Kubernetes. VXLAN encapsulation creates a virtual network spanning multiple hosts, so containers on different hosts communicate as if on the same LAN.

In Kubernetes, the CNI plugin (Calico, Cilium, Flannel) handles container networking. Each pod gets a unique IP, and pods can communicate across nodes without NAT. Service discovery uses CoreDNS: a Service name resolves to a ClusterIP that load-balances to backend pods.
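The single-host modes can be sketched with `docker run` (image tags and ports are illustrative; requires a Docker host):

```shell
# Bridge (default): the container gets an IP on docker0,
# and -p publishes container port 80 as host port 8080.
docker run -d -p 8080:80 --name web nginx:1.27
curl http://localhost:8080

# Host: shares the host network namespace; nginx binds
# host port 80 directly, with no NAT and no port mapping.
docker run -d --network host nginx:1.27

# None: loopback only — the command below shows just the lo interface.
docker run --rm --network none alpine:3.20 ip addr
```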

Container Security

Container security layers:

(1) Image scanning — scan images for known vulnerabilities (CVEs) before deployment. Tools: Trivy (open source, fast), Snyk Container, AWS ECR scanning. Integrate scanning into CI/CD — fail the build if critical CVEs are found.

(2) Run as non-root — the container process should not run as UID 0. If an attacker escapes the container, they are an unprivileged user on the host.

(3) Read-only filesystem — mount the container filesystem as read-only (docker run --read-only). The application writes only to explicitly mounted volumes. This prevents attackers from modifying binaries or installing tools.

(4) Seccomp profiles — restrict which system calls the container can make. Docker applies a default seccomp profile that blocks dangerous syscalls (reboot, mount, kexec). Custom profiles can further restrict based on application needs.

(5) Network policies — in Kubernetes, NetworkPolicy resources restrict which pods can communicate. Default deny all ingress, then allow specific traffic (pod A can reach pod B on port 8080). This limits lateral movement after a compromise.

(6) Secrets management — never bake secrets into images. Use Kubernetes Secrets, Vault, or environment variables injected at runtime.
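The default-deny-then-allow pattern can be sketched as two Kubernetes NetworkPolicy resources (the namespace and pod labels are illustrative):

```yaml
# First: deny all ingress to every pod in the namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: prod
spec:
  podSelector: {}            # empty selector = all pods in the namespace
  policyTypes: ["Ingress"]   # no ingress rules listed => deny all ingress
---
# Then: explicitly allow pod A to reach pod B on port 8080.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-a-to-b
  namespace: prod
spec:
  podSelector:
    matchLabels: { app: pod-b }
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels: { app: pod-a }
      ports:
        - protocol: TCP
          port: 8080
```

Note that enforcement requires a CNI plugin that implements NetworkPolicy (Calico, Cilium); with a non-enforcing plugin the resources are accepted but ignored.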

Container Registry and Image Distribution

A container registry stores and distributes Docker images. Architecture: images are stored as a manifest (JSON describing layers) and blobs (the actual layer data). The Docker Registry HTTP API V2 provides push (upload image layers and manifest), pull (download layers and manifest), and tag management.

Public registries: Docker Hub (default, rate-limited for anonymous pulls), GitHub Container Registry (ghcr.io), Google Container Registry (gcr.io). Private registries: AWS ECR, Google Artifact Registry, Azure Container Registry, or self-hosted (Harbor).

Image distribution optimization:

(1) Layer deduplication — shared layers are stored once. If 50 images use the same python:3.12 base, the base layers are stored once.

(2) Pull-through cache — a local registry caches images from Docker Hub, reducing external bandwidth and pull latency.

(3) Image pre-pulling — for large images, pre-pull to cluster nodes before deploying. The Kubernetes scheduler prefers nodes that already have the image cached (ImageLocality scoring).

(4) Slim images — use minimal base images (alpine, distroless, scratch for Go binaries) to reduce pull time and attack surface.
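A pull-through cache can be sketched with the open-source `registry:2` image plus a daemon-side mirror setting (the port and host name are illustrative):

```shell
# Run a local registry configured as a pull-through cache of Docker Hub:
docker run -d -p 5000:5000 --name mirror \
  -e REGISTRY_PROXY_REMOTEURL=https://registry-1.docker.io \
  registry:2

# Point the Docker daemon at the mirror in /etc/docker/daemon.json,
# then restart the daemon:
#   { "registry-mirrors": ["http://localhost:5000"] }
#
# Subsequent pulls hit the local cache first and only fetch
# layers that are missing upstream — layer deduplication means
# shared base layers are downloaded from Docker Hub once.
```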
