Low Level Design: Distributed File System

A distributed file system (DFS) stores data across multiple machines while presenting a unified namespace to clients. Google File System (GFS) and its open-source counterpart HDFS defined the blueprint for large-scale distributed storage. Understanding their internals is essential for system design interviews and building reliable data infrastructure.

GFS Architecture: Master and Chunkservers

GFS uses a single master and multiple chunkservers. The master stores all file system metadata: the namespace (directory tree), the mapping of file names to chunk handles, and the locations of chunk replicas across chunkservers. Chunkservers store fixed-size 64 MB chunks on local disk as plain Linux files. Clients contact the master for metadata, then communicate directly with chunkservers for data — the master is never in the data path. This separation keeps the master lightweight and allows data transfers to scale horizontally.
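
The chunk-addressing arithmetic implied above can be sketched in a few lines. This is a minimal illustration, not GFS client code; the function names are mine:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the fixed GFS chunk size

def chunk_index(byte_offset: int) -> int:
    """Map a byte offset within a file to the index of the chunk holding it.
    The client computes this locally, then asks the master only for the
    chunk handle and replica locations of that index."""
    return byte_offset // CHUNK_SIZE

def chunk_range(index: int) -> tuple:
    """Byte range [start, end) covered by a given chunk index."""
    return index * CHUNK_SIZE, (index + 1) * CHUNK_SIZE
```

Because the client translates offsets itself, a single metadata round trip to the master suffices for many subsequent data operations against chunkservers.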

HDFS Namenode and Datanode Design

HDFS is the Hadoop equivalent of GFS. The namenode holds all metadata in memory: file-to-block mappings, block locations, permissions, and the directory tree. Keeping metadata in RAM allows O(1) lookups and avoids disk I/O on every metadata request. Datanodes store the actual data blocks on local disk and serve client read and write requests directly. The namenode never stores block contents; it only tracks where blocks live. Because the namenode's heap bounds the total number of files and blocks the cluster can hold, HDFS is optimized for large files rather than millions of small files.
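
The two core in-memory maps can be sketched as plain dictionaries. This is a toy model of the bookkeeping, not the real Hadoop classes; all names here are hypothetical:

```python
class Namenode:
    """Toy sketch of the namenode's in-memory metadata. Everything lives
    in RAM, so every lookup is a dictionary access."""

    def __init__(self):
        self.file_to_blocks = {}   # path -> ordered list of block IDs
        self.block_locations = {}  # block ID -> set of datanode IDs

    def add_block(self, path: str, block_id: str):
        self.file_to_blocks.setdefault(path, []).append(block_id)
        self.block_locations.setdefault(block_id, set())

    def register_replica(self, block_id: str, datanode_id: str):
        # Reported by datanodes; the namenode never sees block contents.
        self.block_locations.setdefault(block_id, set()).add(datanode_id)

    def locate(self, path: str) -> list:
        """Return (block_id, replica locations) pairs for a file --
        pure metadata, answered entirely from memory."""
        return [(b, self.block_locations.get(b, set()))
                for b in self.file_to_blocks.get(path, [])]
```

A client calls something like `locate()` once, then streams block data from the returned datanodes without involving the namenode again.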

Chunk Replication for Fault Tolerance

Each chunk (block, in HDFS terms) is replicated across three nodes by default; the replication factor is configurable per file. On a write, the client sends data to a pipeline of replica nodes, each forwarding to the next. The master/namenode monitors replica counts and triggers re-replication when a node dies or a replica is found corrupt. Three-way replication tolerates two simultaneous node failures without data loss. The master never stores replicas itself; its job is bookkeeping and orchestration.
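
The replica-count bookkeeping reduces to a scan over the block map. A minimal sketch, assuming the master tracks block locations and a set of live nodes (function and variable names are mine):

```python
REPLICATION_FACTOR = 3  # default; configurable per file

def under_replicated(block_locations: dict, live_nodes: set,
                     target: int = REPLICATION_FACTOR) -> dict:
    """Return blocks whose count of replicas on live nodes is below the
    target, mapped to the replication deficit. The master/namenode runs
    this kind of check to schedule re-replication; it never holds data."""
    deficits = {}
    for block_id, nodes in block_locations.items():
        live = {n for n in nodes if n in live_nodes}
        if len(live) < target:
            deficits[block_id] = target - len(live)
    return deficits
```

Blocks in the returned map get queued for copying onto healthy nodes until the target count is restored.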

Write Pipeline: Bandwidth-Efficient Data Path

When a client writes data, it first asks the master which chunkserver holds the lease (the primary replica). The client then pushes data to all replicas in a linear pipeline: client → replica 1 → replica 2 → replica 3. Each node forwards bytes to the next as it receives them, overlapping network transfers. This pipelining maximizes network bandwidth: instead of the client sending three separate copies, each network link carries the data once. The primary replica assigns a serial number to each mutation and forwards the operation to secondaries; all replicas apply mutations in the same order. The client receives acknowledgment only after all replicas confirm the write.
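
The ordering guarantee can be shown with a small model: the primary numbers each mutation and forwards it, so every replica's log ends up identical. This is a simplified sketch of the control flow, not GFS code; the class names are hypothetical:

```python
class Replica:
    """One chunk replica; applies mutations strictly in the order received."""

    def __init__(self):
        self.log = []  # (serial number, mutation) pairs, in applied order

    def apply(self, serial: int, mutation: bytes):
        self.log.append((serial, mutation))

class Primary(Replica):
    """Lease holder: assigns a serial number to each mutation and forwards
    it to the secondaries, so all replicas apply the same order."""

    def __init__(self, secondaries: list):
        super().__init__()
        self.secondaries = secondaries
        self.next_serial = 0

    def mutate(self, mutation: bytes) -> bool:
        serial = self.next_serial
        self.next_serial += 1
        self.apply(serial, mutation)
        for s in self.secondaries:
            s.apply(serial, mutation)
        return True  # in the real protocol, ack only after all replicas confirm
```

Note that in GFS the data bytes are pushed along the pipeline before this control message; the serial number ordering is what makes concurrent mutations from different clients consistent across replicas.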

Rack Awareness: Surviving Rack Switch Failures

A single rack failure (switch or power) can take down dozens of nodes simultaneously. Rack-aware replica placement addresses this: GFS/HDFS places one replica on the local rack, a second on a different rack, and the third on a different node within that second rack. This placement policy ensures that even a complete rack failure leaves at least one replica available. It also balances write bandwidth (one local write, one cross-rack write) against fault tolerance. The namenode maintains rack topology information configured by the cluster administrator via a topology script or network mapping file.
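
The placement policy described above can be sketched as a small chooser over a rack topology map. This is an illustrative simplification (it ignores load, disk space, and the writer-not-on-a-datanode case); all names are mine:

```python
import random

def place_replicas(local_node: str, topology: dict) -> list:
    """Pick three nodes following the HDFS-style policy: the writer's own
    node, then two distinct nodes on a single different rack.
    topology: rack name -> list of node names."""
    local_rack = next(r for r, nodes in topology.items()
                      if local_node in nodes)
    # Candidate remote racks need at least two nodes for replicas 2 and 3.
    remote_racks = [r for r, nodes in topology.items()
                    if r != local_rack and len(nodes) >= 2]
    remote = random.choice(remote_racks)
    second, third = random.sample(topology[remote], 2)
    return [local_node, second, third]
```

One cross-rack transfer per write (the hop to the remote rack) buys survival of a full rack outage, since the second and third replicas sit on different power and switch domains from the first.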

Namenode High Availability: Eliminating the Single Point of Failure

The namenode is the classic single point of failure in HDFS. HDFS HA solves this with an active-standby namenode pair managed by ZooKeeper. Both namenodes share a Quorum Journal Manager (QJM) — a cluster of journal nodes that store the edit log. The active namenode writes every metadata change to the QJM; the standby replays those edits in real time, keeping its state synchronized. ZooKeeper provides leader election: if the active namenode fails, ZooKeeper detects the lost heartbeat and promotes the standby. Fencing (STONITH or SSH kill) ensures the old active cannot corrupt shared state after failover. Clients see no interruption beyond a brief pause.
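
The fence-before-promote ordering is the crux of safe failover. A toy model (not the real ZKFC API; all names are hypothetical) makes the invariant explicit:

```python
class FailoverController:
    """Toy sketch of failover logic: the old active must be fenced before
    the standby is promoted, so two namenodes can never both write to the
    shared edit log."""

    def __init__(self, active, standby):
        self.active, self.standby = active, standby

    def on_lost_heartbeat(self, fence):
        fence(self.active)  # STONITH / SSH kill the unresponsive active
        self.active, self.standby = self.standby, self.active
        return self.active  # newly promoted namenode
```

In the real system this sequencing is driven by ZooKeeper's leader election; the sketch only shows why fencing must complete before promotion, not how the election works.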

Block Reports and Heartbeats

Datanodes send a heartbeat to the namenode every 3 seconds to signal liveness. The namenode marks a datanode dead if no heartbeat arrives within 10 minutes (configurable). On startup, and then every 6 hours, each datanode sends a full block report listing all blocks it holds. The namenode uses block reports to build and maintain its block-location map. When a datanode goes missing, the namenode identifies under-replicated blocks and schedules re-replication on surviving datanodes. This reactive approach means the cluster self-heals without manual intervention.
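
The liveness check itself is a timestamp comparison. A minimal sketch with the intervals from the text (constant and function names are mine):

```python
HEARTBEAT_INTERVAL = 3   # seconds between datanode heartbeats
DEAD_TIMEOUT = 10 * 60   # mark a datanode dead after ~10 minutes of silence

def dead_datanodes(last_heartbeat: dict, now: float,
                   timeout: float = DEAD_TIMEOUT) -> set:
    """Datanodes whose most recent heartbeat is older than the timeout.
    last_heartbeat: datanode ID -> timestamp of last heartbeat."""
    return {dn for dn, t in last_heartbeat.items() if now - t > timeout}
```

Every datanode in the returned set has its blocks fed into the under-replicated queue for re-replication elsewhere.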

Metadata Persistence: fsimage and Edit Log

The namenode persists metadata to disk in two structures: the fsimage (a complete snapshot of the namespace at a point in time) and the edit log (a journal of every subsequent change). On startup, the namenode loads the fsimage and replays the edit log to reconstruct current state. Over time the edit log grows large, so a secondary namenode (or standby in HA mode) periodically merges fsimage + edit log into a new fsimage — called a checkpoint. Without checkpointing, startup recovery becomes slow. In HA mode, the standby namenode performs checkpointing continuously, keeping the edit log short and startup fast.
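
The startup recovery and checkpoint steps can be modeled directly: replay applies the journal to the snapshot, and a checkpoint is replay plus log truncation. A sketch with two illustrative edit kinds (the operation names are mine, not HDFS edit-log opcodes):

```python
def replay(fsimage: dict, edit_log: list) -> dict:
    """Rebuild current namespace state: start from the fsimage snapshot
    and apply every logged edit in order."""
    state = dict(fsimage)
    for op, path, value in edit_log:
        if op == "create":
            state[path] = value
        elif op == "delete":
            state.pop(path, None)
    return state

def checkpoint(fsimage: dict, edit_log: list) -> tuple:
    """Merge fsimage + edit log into a new fsimage and truncate the log.
    Startup after a checkpoint only replays the (now short) remaining log."""
    return replay(fsimage, edit_log), []
```

The cost of skipping checkpoints is visible in `replay`: startup time grows linearly with the length of the un-merged edit log.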

Large File Optimization and Small File Problem

GFS and HDFS are explicitly optimized for large sequential reads and appends: multi-gigabyte files read end-to-end or appended continuously (log files, MapReduce input). The 64 MB chunk size amortizes per-chunk overhead over large transfers. Small random reads and writes are poorly served: a 1 KB file occupies only 1 KB on disk, but it still costs a full block entry and a namenode metadata object, the same metadata price as a 64 MB block. Clusters with millions of small files exhaust namenode heap memory and create excessive metadata overhead. Solutions include packing small files into sequence files or HAR archives, or using an object store (S3/GCS), which handles small objects more efficiently.
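
The heap pressure is easy to estimate. A back-of-envelope sketch using the commonly cited rule of thumb of roughly 150 bytes of namenode heap per file or block object (the exact figure varies by Hadoop version; treat it as an assumption):

```python
HEAP_PER_OBJECT = 150  # assumed: ~150 bytes of namenode heap per
                       # file or block object (rule of thumb)

def namenode_heap_bytes(num_files: int, blocks_per_file: int) -> int:
    """Rough heap cost: one object per file plus one per block."""
    return num_files * HEAP_PER_OBJECT * (1 + blocks_per_file)
```

By this estimate, 100 million single-block files cost about 30 GB of heap for metadata alone, regardless of how tiny the files are, which is why packing small files into larger containers pays off.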

Leases, Concurrent Writes, and Checksums

GFS issues a lease to one chunkserver (the primary) for each chunk being mutated. The lease grants the primary authority to define mutation order for a fixed period (typically 60 seconds, renewable). This prevents two clients from writing conflicting data to the same chunk without coordination. The master revokes leases on failure or when it needs to redirect writes. For data integrity, each chunk is divided into 64 KB blocks; each block has a 32-bit checksum stored separately on the chunkserver. On every read the chunkserver verifies the checksum before returning data. Corruption is detected, logged, and reported to the master, which then re-replicates a healthy copy and schedules removal of the corrupt replica.
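
The checksum scheme maps naturally onto a per-block CRC list. A minimal sketch using Python's CRC-32 as the 32-bit checksum (GFS does not specify the exact algorithm, so that choice is an assumption; the function names are mine):

```python
import zlib

BLOCK_SIZE = 64 * 1024  # each chunk is checksummed in 64 KB blocks

def checksum_blocks(chunk: bytes) -> list:
    """One 32-bit checksum per 64 KB block, kept separate from the data."""
    return [zlib.crc32(chunk[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk), BLOCK_SIZE)]

def verify_read(chunk: bytes, stored: list) -> bool:
    """Chunkserver-side check before returning data to a reader:
    recompute every block checksum and compare against the stored list."""
    return checksum_blocks(chunk) == stored
```

A failed `verify_read` corresponds to the corruption path in the text: the chunkserver returns an error to the reader and reports the bad replica to the master instead of serving the data.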
