Distributed file systems store files across multiple machines, providing storage capacity and throughput far beyond a single server. HDFS (Hadoop Distributed File System) and GFS (Google File System), the design it was modeled on, have been the backbone of large-scale batch data processing. This guide covers the architecture of HDFS (metadata management, block replication, fault tolerance, and the design decisions that enable petabyte-scale storage), which is essential background for system design interviews involving data infrastructure.
HDFS Architecture: NameNode and DataNode
HDFS uses a master-worker architecture. (1) NameNode (master): manages the filesystem namespace (directory tree, file metadata, block locations). It keeps two core mappings: file name -> list of block IDs, and block ID -> list of DataNode addresses. The entire namespace is held in memory for fast lookups, so NameNode heap grows with the number of files and blocks; as a rough figure, a 100-million-file cluster requires approximately 60 GB of NameNode heap, depending on how many blocks each file has. A lone NameNode is a single point of failure, so HDFS HA (High Availability) runs two NameNodes (active + standby) that share the edit log through a quorum of JournalNodes. (2) DataNodes (workers): store the actual data blocks on local disks. A typical cluster has hundreds to thousands of DataNodes. By default, each DataNode sends a heartbeat to the NameNode every 3 seconds (proving it is alive) and a full block report every 6 hours (listing all blocks it stores).
Write path: the client asks the NameNode to allocate a block and return a list of DataNodes to write to. The client writes the block to DataNode 1, which pipelines it to DataNode 2, which pipelines it to DataNode 3 (replication factor 3). As the DataNodes finish storing their replicas, they report the new block to the NameNode, which records the block locations.
Read path: the client asks the NameNode for block locations, then reads each block directly from the nearest DataNode. File data never flows through the NameNode, which keeps the master off the data path.
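The two NameNode mappings and the metadata side of the read and write paths can be sketched as a toy in-memory structure. This is an illustrative sketch only, not HDFS code: the class `NameNodeSketch` and its method names are hypothetical, and real HDFS adds persistence (edit log, fsimage), leases, and permissions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class NameNodeSketch:
    """Hypothetical toy namespace holding the two mappings described above."""
    file_to_blocks: Dict[str, List[int]] = field(default_factory=dict)
    block_to_datanodes: Dict[int, List[str]] = field(default_factory=dict)
    next_block_id: int = 1

    def allocate_block(self, path: str, targets: List[str]) -> int:
        """Write path: append a new block ID to the file and record its replicas."""
        block_id = self.next_block_id
        self.next_block_id += 1
        self.file_to_blocks.setdefault(path, []).append(block_id)
        self.block_to_datanodes[block_id] = list(targets)
        return block_id

    def get_block_locations(self, path: str) -> List[Tuple[int, List[str]]]:
        """Read path: return (block ID, DataNode addresses) pairs in file order."""
        return [(b, self.block_to_datanodes[b]) for b in self.file_to_blocks[path]]


nn = NameNodeSketch()
nn.allocate_block("/logs/app.log", ["dn1", "dn2", "dn3"])
nn.allocate_block("/logs/app.log", ["dn2", "dn4", "dn5"])
locations = nn.get_block_locations("/logs/app.log")
```

The client would then contact the DataNodes in `locations` directly; the sketch only covers the metadata lookup, which is the part the NameNode serves.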
Block Storage and Replication
Files are split into fixed-size blocks (default 128 MB in HDFS, 64 MB in GFS). Why large blocks: (1) Reduces NameNode memory, because there are fewer blocks to track. A 1 TB file is 8,192 blocks at 128 MB versus about 1 million blocks (1,048,576) at 1 MB. (2) Amortizes disk seek time: a sequential read of a 128 MB block spends most of its time on transfer, not seeking. (3) Reduces client-NameNode communication: fewer blocks means fewer metadata lookups.
Replication: each block is replicated to 3 DataNodes by default (configurable per file). Replica placement with rack awareness: (1) First replica on the local DataNode if the writer runs on a DataNode (otherwise on a randomly chosen node). (2) Second replica on a DataNode in a different rack (protects against rack-level failures such as a power supply or switch outage). (3) Third replica on a different DataNode in the same rack as the second (adds a third copy without additional cross-rack bandwidth). The result is three replicas spread across two racks.
If a DataNode fails (by default, no heartbeat for about 10.5 minutes), the NameNode marks it dead and re-replicates its blocks to other DataNodes to restore the replication factor. This self-healing is automatic and continuous.
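The block-count arithmetic and the three-step placement rule above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the actual HDFS placement policy class: `blocks_needed`, `place_replicas`, and the `rack_of` mapping are hypothetical names, and the sketch assumes at least two racks with at least two nodes on the remote rack.

```python
import random
from typing import Dict, List

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the default block size quoted above


def blocks_needed(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks for a file (integer ceiling division)."""
    return -(-file_size // block_size)


def place_replicas(writer: str, rack_of: Dict[str, str], rng: random.Random) -> List[str]:
    """Pick three DataNodes: writer-local, one on a remote rack, one more on that remote rack."""
    nodes = sorted(rack_of)
    # (1) Local node if the writer is a DataNode, otherwise any node.
    first = writer if writer in rack_of else rng.choice(nodes)
    # (2) A node on a different rack than the first replica.
    remote = [n for n in nodes if rack_of[n] != rack_of[first]]
    if not remote:
        raise ValueError("sketch assumes at least two racks")
    second = rng.choice(remote)
    # (3) A different node on the same rack as the second replica.
    peers = [n for n in nodes if rack_of[n] == rack_of[second] and n != second]
    if not peers:
        raise ValueError("sketch assumes two nodes on the remote rack")
    third = rng.choice(peers)
    return [first, second, third]


rack_of = {"dn1": "r1", "dn2": "r1", "dn3": "r2", "dn4": "r2", "dn5": "r3", "dn6": "r3"}
targets = place_replicas("dn1", rack_of, random.Random(0))
one_tib = 1024 ** 4
```

With these inputs, `blocks_needed(one_tib)` gives the 8,192 blocks quoted above, and `targets` always spans exactly two racks regardless of the random seed.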
Consistency Model and Write Semantics
HDFS optimizes for large, sequential writes and reads. It does NOT support random writes or in-place updates. Files are write-once, read-many: once written, existing bytes cannot be modified (a file can only be appended to or deleted). This simplifies replication and consistency, because there is no need for distributed locking or conflict resolution on concurrent writes.
Write pipeline: the client requests a block allocation from the NameNode, which returns a list of DataNodes. The client does not need to accumulate a full 128 MB block locally; it streams the block as small packets (64 KB by default) to the first DataNode, which stores each packet and forwards it to the second, which forwards it to the third (pipeline replication). Acknowledgments travel back up the pipeline, from the last DataNode through the first to the client. When the block is full, the client asks the NameNode for the next block; when the file is closed, the client tells the NameNode the file is complete.
Append semantics (HDFS 2.x+): a file can be reopened for append, and new data is added to the end. Concurrent appends by multiple writers are not supported (one writer at a time). For applications needing random writes, use HBase (built on top of HDFS), which provides row-level read/write operations with an LSM-tree storage engine.
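Pipeline replication can be illustrated with a small in-process simulation in which each node stores a packet, forwards it downstream, and passes acknowledgments back upstream. This is a sketch under simplifying assumptions: `DataNodeSketch`, `build_pipeline`, and `write_block` are hypothetical names, there is no networking or failure handling, and forwarding is synchronous rather than overlapped as in a real pipeline.

```python
from typing import List, Optional

PACKET_SIZE = 64 * 1024  # packet size assumed for this sketch


class DataNodeSketch:
    """Hypothetical in-memory DataNode that stores packets and forwards them."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.downstream: Optional["DataNodeSketch"] = None
        self.stored = bytearray()

    def receive(self, packet: bytes) -> List[str]:
        """Store the packet, forward it, and return acks (last node first)."""
        self.stored.extend(packet)
        acks = self.downstream.receive(packet) if self.downstream else []
        return acks + [self.name]


def build_pipeline(names: List[str]) -> List[DataNodeSketch]:
    """Chain the nodes so that each one forwards to the next in the list."""
    nodes = [DataNodeSketch(n) for n in names]
    for upstream, downstream in zip(nodes, nodes[1:]):
        upstream.downstream = downstream
    return nodes


def write_block(data: bytes, pipeline: List[DataNodeSketch],
                packet_size: int = PACKET_SIZE) -> int:
    """Stream data to the head of the pipeline packet by packet; return packets sent."""
    expected_acks = [n.name for n in reversed(pipeline)]
    sent = 0
    for offset in range(0, len(data), packet_size):
        acks = pipeline[0].receive(data[offset:offset + packet_size])
        if acks != expected_acks:
            raise IOError("missing acknowledgment in pipeline")
        sent += 1
    return sent


nodes = build_pipeline(["dn1", "dn2", "dn3"])
payload = bytes(range(256)) * 1000  # 256,000 bytes of sample data
packets_sent = write_block(payload, nodes)
```

The client only ever talks to the first node, so its outbound bandwidth is spent once per packet rather than once per replica, which is the main point of pipelining.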
Fault Tolerance and Data Integrity
HDFS is designed for commodity hardware that fails regularly. In a 10,000-node cluster, multiple disks and nodes fail every day. Fault tolerance mechanisms: (1) Block replication (3 replicas across two racks) ensures data survives single-node and single-rack failures. (2) Checksums: checksums are computed at write time over small chunks of each block (CRC32C over 512-byte chunks by default) and stored alongside the block on each DataNode. The client verifies them on every read, and DataNodes also re-verify stored blocks with a periodic background scan. If corruption is detected, the client transparently reads from another replica and reports the bad replica to the NameNode, which re-replicates the block from a healthy copy and then deletes the corrupted one. (3) NameNode HA: two NameNodes (active + standby) share the edit log via JournalNodes, a quorum in which an edit is committed once a majority of JournalNodes has written it. If the active fails and automatic failover is configured, the standby can take over within seconds. Fencing ensures only one NameNode is active at a time (prevents split-brain). (4) Safemode: on startup, the NameNode waits until it has received block reports from enough DataNodes to verify that most blocks meet the minimum replication factor. Until then, the filesystem is read-only (safe mode).
Erasure coding (HDFS 3.x): instead of 3x replication (200% storage overhead), use Reed-Solomon erasure coding (e.g., 6 data + 3 parity = 50% overhead). This saves significant storage at the cost of higher CPU for encoding and decoding, and of more network traffic when a lost block must be reconstructed from the surviving blocks. It is best suited to cold data that is accessed infrequently.
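The checksum-and-fallback read described above can be sketched as follows. This is a minimal illustration, not HDFS client code: `chunk_checksums` and `read_with_fallback` are hypothetical names, replicas are modeled as plain byte strings, and `zlib.crc32` (plain CRC32) stands in for the CRC32C variant because it is available in the Python standard library.

```python
import zlib
from typing import Dict, List, Tuple

CHUNK_SIZE = 512  # bytes covered by each checksum in this sketch


def chunk_checksums(data: bytes, chunk_size: int = CHUNK_SIZE) -> List[int]:
    """Compute one CRC32 per fixed-size chunk of the block."""
    return [zlib.crc32(data[i:i + chunk_size]) for i in range(0, len(data), chunk_size)]


def read_with_fallback(replicas: Dict[str, bytes],
                       expected: List[int]) -> Tuple[bytes, List[str]]:
    """Try replicas in order; return the first one that verifies, plus the corrupt ones seen."""
    corrupt: List[str] = []
    for name, data in replicas.items():
        if chunk_checksums(data) == expected:
            return data, corrupt
        corrupt.append(name)  # a real client would report this replica to the NameNode
    raise IOError("no replica passed checksum verification")


block = b"x" * 2000
checksums = chunk_checksums(block)      # computed once, at write time

damaged = bytearray(block)
damaged[700] ^= 0xFF                    # simulate a flipped byte on disk
replicas = {"dn1": bytes(damaged), "dn2": block, "dn3": block}

data, corrupt = read_with_fallback(replicas, checksums)
```

Here the read succeeds from `dn2` and `corrupt` contains only `dn1`. Checksumming small chunks rather than whole blocks means a reader fetching part of a block only has to verify the chunks it actually reads.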
HDFS in Modern Data Architecture
HDFS was the storage layer for the Hadoop ecosystem: MapReduce, Hive, Spark, and HBase all read from and write to HDFS. In modern architectures, cloud object storage (Amazon S3, Google Cloud Storage, Azure Data Lake Storage accessed via ABFS) has largely replaced HDFS for new deployments. Why the shift: (1) Separation of compute and storage: with HDFS, compute and storage live on the same nodes, so scaling storage means adding compute as well (expensive). Cloud object storage scales independently. (2) Cost: S3 Standard, listed at about $0.023/GB/month (prices vary by region and tier), is typically cheaper than provisioning DataNode servers with local disks and 3x replication. (3) Durability: S3 is designed for 11 nines of durability by storing data redundantly across multiple availability zones. With HDFS, operators must configure and monitor replication and rack awareness themselves. (4) Elasticity: cloud compute (EMR, Dataproc) spins up, processes data in object storage, and shuts down. No idle HDFS cluster consumes resources.
When HDFS is still relevant: on-premise deployments where cloud migration is not feasible, low-latency data processing (HDFS data locality means processing happens where the data is), and legacy Hadoop ecosystems. In system design interviews: mention HDFS when discussing big data processing architecture, but note that modern cloud-native architectures use S3 + Spark/Flink for the same workloads.
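The storage arithmetic behind the cost comparison can be sketched using only figures quoted in this guide: 3x replication, Reed-Solomon 6 data + 3 parity, and the $0.023/GB/month list price. This is a back-of-the-envelope sketch; the function names are hypothetical, 1 TB is taken as 1,000 GB, and server, power, request, and data-transfer costs are deliberately left out.

```python
def raw_capacity_tb(logical_tb: float, data_units: int, redundancy_units: int) -> float:
    """Raw storage needed when every `data_units` of data carry `redundancy_units` extra."""
    return logical_tb * (data_units + redundancy_units) / data_units


def s3_monthly_cost_usd(logical_tb: float, price_per_gb_month: float = 0.023) -> float:
    """Monthly object-storage bill at a flat per-GB price (1 TB = 1,000 GB assumed)."""
    return round(logical_tb * 1000 * price_per_gb_month, 2)


logical_tb = 1000.0  # 1 PB of logical data

replicated_tb = raw_capacity_tb(logical_tb, data_units=1, redundancy_units=2)  # 3x replication
erasure_tb = raw_capacity_tb(logical_tb, data_units=6, redundancy_units=3)     # RS 6+3
s3_cost = s3_monthly_cost_usd(logical_tb)
```

For 1 PB of logical data this gives 3 PB of raw disk under 3x replication, 1.5 PB under 6+3 erasure coding, and roughly $23,000 per month at the quoted S3 price, which is the kind of rough comparison an interviewer usually expects.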