System Design: Distributed File System (HDFS/GFS) — NameNode, DataNode, Block Replication, Rack Awareness, Hadoop

Distributed file systems store files across many machines, providing storage capacity and throughput far beyond a single server. HDFS (Hadoop Distributed File System) and GFS (Google File System), the design it was modeled on, are the backbone of big data processing. This guide covers the architecture of HDFS — metadata management, block replication, fault tolerance, and the design decisions that enable petabyte-scale storage — essential for system design interviews involving data infrastructure.

HDFS Architecture: NameNode and DataNode

HDFS uses a master-worker architecture:

(1) NameNode (master) — manages the filesystem namespace: the directory tree, file metadata, and block locations. It stores two mappings: file name -> list of block IDs, and block ID -> list of DataNode addresses. The entire namespace is held in memory for fast lookups; a 100-million-file cluster requires approximately 60 GB of NameNode heap. Without HA, the NameNode is a single point of failure — HDFS HA (High Availability) runs two NameNodes (active + standby) that share state via a quorum of JournalNodes.

(2) DataNodes (workers) — store the actual data blocks on local disks. A typical cluster has hundreds to thousands of DataNodes. Each DataNode sends a heartbeat to the NameNode every 3 seconds (proving it is alive) and a block report every 6 hours (listing all blocks it stores).

Write path: the client asks the NameNode for a list of DataNodes to write to, then writes the block to DataNode 1, which pipelines it to DataNode 2, which pipelines it to DataNode 3 (replication factor 3). After all replicas are written, the NameNode records the block locations. Read path: the client asks the NameNode for block locations, then reads directly from the nearest DataNode.
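The NameNode's two in-memory maps can be sketched in a few lines. This is a simplified illustration, not the real implementation (HDFS stores INode trees and DatanodeDescriptor objects); the class and method names here are hypothetical:

```python
from collections import defaultdict

class NameNodeMetadata:
    """Toy model of the NameNode's two core mappings."""
    def __init__(self):
        self.file_to_blocks = {}                    # file path -> ordered block IDs
        self.block_to_datanodes = defaultdict(set)  # block ID -> DataNode addresses

    def add_file(self, path, block_ids):
        self.file_to_blocks[path] = list(block_ids)

    def report_block(self, block_id, datanode):
        # Called when a DataNode's block report confirms it holds a replica.
        self.block_to_datanodes[block_id].add(datanode)

    def locate(self, path):
        # Read path: resolve a file to (block, replica locations) pairs,
        # which the client then uses to read directly from DataNodes.
        return [(b, sorted(self.block_to_datanodes[b]))
                for b in self.file_to_blocks[path]]

meta = NameNodeMetadata()
meta.add_file("/logs/app.log", ["blk_1", "blk_2"])
meta.report_block("blk_1", "dn1:9866")
meta.report_block("blk_1", "dn2:9866")
print(meta.locate("/logs/app.log"))
# -> [('blk_1', ['dn1:9866', 'dn2:9866']), ('blk_2', [])]
```

Note that block locations are not persisted by the NameNode: they are rebuilt from DataNode block reports on startup, which is one reason safe mode exists.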

Block Storage and Replication

Files are split into fixed-size blocks (default 128 MB in HDFS; GFS used 64 MB chunks). Why large blocks: (1) Reduced NameNode memory — fewer blocks to track: a 1 TB file is 8,192 blocks at 128 MB versus roughly 1 million blocks at 1 MB. (2) Amortized disk seek time — a sequential read of a 128 MB block spends almost all its time on transfer, not seeking. (3) Less client-NameNode communication — fewer blocks means fewer metadata lookups.

Replication: each block is replicated to 3 DataNodes by default (configurable). Replica placement with rack awareness: (1) first replica on the local DataNode (the writer's machine, if it runs a DataNode); (2) second replica on a DataNode in a different rack, protecting against rack-level failures (power supply, top-of-rack switch); (3) third replica on a different DataNode in the same rack as the second, preserving availability without spending additional cross-rack bandwidth. If a DataNode fails (no heartbeat for 10 minutes), the NameNode re-replicates its blocks to other DataNodes to restore the replication factor. This self-healing is automatic and continuous.
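The rack-aware placement rules above can be sketched as a small function. This is a hedged simplification — real HDFS uses a pluggable BlockPlacementPolicy with a topology script, and also weighs free space and node load; the `rack_of` mapping and function names here are illustrative assumptions:

```python
import random

def choose_replicas(writer, rack_of):
    """Pick 3 DataNodes per the default HDFS placement policy (simplified).
    rack_of maps DataNode name -> rack name; writer is the client's node."""
    nodes = list(rack_of)
    # (1) First replica on the writer's own DataNode, if it has one.
    first = writer if writer in rack_of else random.choice(nodes)
    # (2) Second replica on any node in a *different* rack.
    remote = [n for n in nodes if rack_of[n] != rack_of[first]]
    second = random.choice(remote)
    # (3) Third replica on the *same* rack as the second, different node.
    same_rack = [n for n in nodes
                 if rack_of[n] == rack_of[second] and n != second]
    third = random.choice(same_rack)
    return [first, second, third]

rack_of = {"dn1": "r1", "dn2": "r1", "dn3": "r2", "dn4": "r2"}
replicas = choose_replicas("dn1", rack_of)
print(replicas)
```

The result always spans exactly two racks: one cross-rack transfer at write time buys tolerance of a full rack failure.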

Consistency Model and Write Semantics

HDFS optimizes for large, sequential writes and reads. It does NOT support random writes or in-place updates: files are write-once, read-many. Once a file is closed, it can only be appended to or deleted, never modified in place. This simplifies replication and consistency — no distributed locking or conflict resolution is needed for concurrent writes.

Write pipeline: the client buffers data locally. When the buffer reaches one block size (128 MB), the client requests a block allocation from the NameNode, which returns a list of DataNodes. The client sends the block to the first DataNode, which forwards it to the second, which forwards it to the third (pipeline replication). Each DataNode acknowledges after writing to local disk; after all acknowledgments arrive, the client reports success to the NameNode.

Append semantics (HDFS 2.x+): a file can be reopened for append, and new data is added at the end. Concurrent appends by multiple writers are not supported — a lease enforces one writer at a time. Applications needing random writes use HBase (built on top of HDFS), which provides row-level read/write operations via an LSM-tree storage engine.
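The pipeline can be modeled as a recursive forward-then-ack chain. A minimal sketch, assuming in-memory dictionaries stand in for DataNode disks (real HDFS streams 64 KB packets over TCP with a separate ack stream, and handles mid-pipeline failures by rebuilding the pipeline):

```python
def pipeline_write(block_id, data, pipeline, stores):
    """Each node persists the block, forwards it downstream, and only
    acks upstream once the rest of the chain has acked.
    Returns the total number of acks (one per replica written)."""
    if not pipeline:
        return 0
    head, rest = pipeline[0], pipeline[1:]
    stores[head][block_id] = data                       # write to "local disk"
    downstream_acks = pipeline_write(block_id, data, rest, stores)
    return downstream_acks + 1                          # ack flows back upstream

stores = {"dn1": {}, "dn2": {}, "dn3": {}}
acks = pipeline_write("blk_7", b"x" * 1024, ["dn1", "dn2", "dn3"], stores)
print(acks)  # 3 -> client may now report success to the NameNode
```

The key property the sketch preserves: the client's single upstream ack implies every replica in the chain was written, so the client pushes each byte across the network only once.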

Fault Tolerance and Data Integrity

HDFS is designed for commodity hardware that fails regularly — in a 10,000-node cluster, multiple disks and nodes fail every day. Fault tolerance mechanisms: (1) Block replication — 3 replicas across racks ensure data survives single-node and single-rack failures. (2) Checksums — each block carries checksums (CRC32C by default) computed at write time and verified on every read. If corruption is detected, the client transparently reads from another replica, and the corrupted replica is deleted and re-replicated. (3) NameNode HA — two NameNodes (active + standby) share the edit log via a quorum of JournalNodes. If the active fails, the standby takes over in seconds; fencing ensures only one NameNode is active at a time, preventing split-brain. (4) Safe mode — on startup, the NameNode stays read-only until block reports from enough DataNodes confirm that a configurable fraction of blocks meets the minimum replication factor.

Erasure coding (HDFS 3.x): instead of 3x replication (200% storage overhead), use Reed-Solomon erasure coding (e.g., 6 data + 3 parity = 50% overhead). This saves significant storage at the cost of higher CPU for encoding and reconstruction, so it is best suited to cold data accessed infrequently.
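Read-time checksum fallback can be demonstrated in a few lines. A sketch, assuming zlib's CRC-32 stands in for HDFS's per-chunk checksums (HDFS actually checksums every 512-byte chunk with CRC32C, and the real client also reports the bad replica to the NameNode):

```python
import zlib

def write_block(data):
    # Checksum is computed once at write time and stored with the block.
    return {"data": bytearray(data), "crc": zlib.crc32(data)}

def read_block(replicas):
    """Verify each replica's checksum; fall back past corrupt copies."""
    for i, rep in enumerate(replicas):
        if zlib.crc32(bytes(rep["data"])) == rep["crc"]:
            return bytes(rep["data"]), i
    raise IOError("all replicas corrupt")

good = write_block(b"hello hdfs")
bad = write_block(b"hello hdfs")
bad["data"][0] ^= 0xFF                 # simulate bit rot on one replica
data, picked = read_block([bad, good])
print(picked)  # 1 -> the client silently skipped the corrupt replica
```

In the real system the skipped replica is then deleted and re-replicated from a healthy copy, so silent disk corruption is healed the same way a dead DataNode is.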

HDFS in Modern Data Architecture

HDFS was the storage layer for the Hadoop ecosystem: MapReduce, Hive, Spark, and HBase all read and write HDFS. In modern architectures, cloud object storage (S3, GCS, ABFS) has largely replaced HDFS for new deployments. Why the shift: (1) Separation of compute and storage — with HDFS, compute and storage live on the same nodes, so scaling storage requires adding compute (expensive). Cloud object storage scales independently. (2) Cost — S3 at roughly $0.023/GB/month is cheaper than provisioning DataNode servers with local disks. (3) Durability — S3 offers 11 nines of durability via erasure coding across availability zones; HDFS requires you to manage replication and rack awareness yourself. (4) Elasticity — cloud compute (EMR, Dataproc) spins up, processes data in S3, and shuts down, with no idle HDFS cluster consuming resources.

When HDFS is still relevant: on-premise deployments where cloud migration is not feasible, workloads that benefit from data locality (processing runs on the node where the data is stored), and legacy Hadoop ecosystems. In system design interviews, mention HDFS when discussing big data processing architecture, but note that modern cloud-native architectures use S3 + Spark/Flink for the same workloads.
