Question 1

How does HDFS store and replicate data across a cluster?

Accepted Answer

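HDFS splits each file into blocks (128 MB by default; the last block of a file may be smaller) and stores each block on 3 DataNodes by default. Placement is rack-aware: the first replica goes to the writer's local DataNode (or to a randomly chosen DataNode if the client is not running on one), the second to a DataNode on a different rack (protecting against rack failure), and the third to a different DataNode on the same rack as the second.

Write path: the client asks the NameNode for a list of target DataNodes, then streams the block to the first DataNode, which forwards it to the second, which forwards it to the third. Acknowledgements flow back along the pipeline. If a DataNode in the pipeline fails, it is dropped and the write continues on the remaining nodes; the write succeeds as long as the minimum replica count (1 by default) is met, and the missing replicas are created afterwards.

If a DataNode stops sending heartbeats for about 10 minutes (10 minutes 30 seconds with default settings), the NameNode marks it dead and automatically re-replicates its blocks to restore the replication factor. The NameNode is the metadata server: it keeps the entire namespace in memory (file -> blocks, block -> DataNode locations).

HDFS 3.x supports erasure coding for cold data, for example Reed-Solomon with 6 data + 3 parity chunks, which reduces storage overhead from 200% (3x replication) to 50%.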
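The placement rule and the overhead figures above can be sketched in a few lines of Python. This is an illustrative model only, not HDFS code: the function names, the topology dictionary, and the node names are assumptions made for the example, and the real placement policy has fallbacks (for instance when only one rack exists) that this sketch omits.

```python
import random


def choose_replica_targets(writer_node, topology, rng=None):
    """Pick three DataNodes following the default placement described above.

    topology is a hypothetical dict: rack name -> list of DataNode names.
    writer_node is the DataNode the writing client runs on.
    """
    rng = rng or random.Random()
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[writer_node]

    # Replica 1: the writer's own DataNode.
    first = writer_node

    # Replicas 2 and 3: two different DataNodes on one remote rack.
    # Simplification: require a remote rack with at least two nodes
    # instead of falling back as a real cluster would.
    remote_racks = sorted(
        rack for rack, nodes in topology.items()
        if rack != local_rack and len(nodes) >= 2
    )
    if not remote_racks:
        raise ValueError("need a second rack with at least two DataNodes")
    remote_rack = rng.choice(remote_racks)
    second, third = rng.sample(sorted(topology[remote_rack]), 2)
    return [first, second, third]


def storage_overhead(data_units, redundant_units):
    """Extra storage as a fraction of the original data size."""
    return redundant_units / data_units


if __name__ == "__main__":
    topology = {
        "rack1": ["dn1", "dn2"],
        "rack2": ["dn3", "dn4"],
        "rack3": ["dn5", "dn6"],
    }
    print(choose_replica_targets("dn1", topology))
    print(f"3x replication overhead: {storage_overhead(1, 2):.0%}")
    print(f"RS(6,3) erasure coding overhead: {storage_overhead(6, 3):.0%}")
```

With 3x replication every data unit carries two extra copies (2/1 = 200%), while a 6 + 3 erasure-coded stripe carries three parity chunks per six data chunks (3/6 = 50%).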

Question 2

Why does HDFS use 128 MB blocks instead of smaller sizes?

Accepted Answer

Large blocks provide three benefits:

(1) Reduced NameNode memory -- the NameNode keeps metadata for every block in memory. A 1 TB file is 8,192 blocks at 128 MB versus about 1 million (1,048,576) blocks at 1 MB. Fewer blocks means less memory and faster metadata operations.

(2) Amortized seek time -- each disk read pays a roughly fixed seek cost (~10 ms). For a 128 MB block read at 100 MB/s, the seek takes 10 ms and the transfer takes about 1.3 seconds, so the seek is less than 1% of total time. For a 1 MB block, the seek takes 10 ms and the transfer takes 10 ms, so the seek is 50% of total time.

(3) Fewer client-NameNode RPCs -- the client has to obtain locations for every block it reads or writes. Fewer blocks means less metadata traffic and fewer round-trips.

The tradeoff: HDFS handles small files poorly. A 1 KB file does not waste disk space (it uses about 1 KB on disk, not 128 MB), but it still costs a file entry and a block entry in NameNode memory, so millions of small files exhaust the NameNode long before the disks fill up. HDFS is optimized for large files, not millions of small files.
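The arithmetic behind points (1) and (2) can be checked with a short script. It is a back-of-the-envelope sketch that uses the same assumed figures as the answer (10 ms seek, 100 MB/s sequential throughput); the function names are made up for the example.

```python
MB = 1024 ** 2
TB = 1024 ** 4


def block_count(file_size_bytes, block_size_bytes):
    """Number of blocks needed to hold a file (ceiling division)."""
    return -(-file_size_bytes // block_size_bytes)


def seek_fraction(block_size_mb, seek_ms=10.0, throughput_mb_per_s=100.0):
    """Share of total read time spent seeking for one block."""
    transfer_ms = block_size_mb * 1000.0 / throughput_mb_per_s
    return seek_ms / (seek_ms + transfer_ms)


if __name__ == "__main__":
    for size_mb in (1, 128):
        blocks = block_count(1 * TB, size_mb * MB)
        share = seek_fraction(size_mb)
        print(f"{size_mb:>3} MB blocks: {blocks:>9,} blocks per TB, "
              f"seek is {share:.1%} of read time")
```

The 128 MB case gives 8,192 blocks per terabyte with under 1% of the read time spent seeking, while the 1 MB case gives 1,048,576 blocks and a 50% seek share.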

Question 3

How has cloud object storage replaced HDFS in modern architectures?

Accepted Answer

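Cloud object storage (S3, GCS) has largely replaced HDFS for new deployments because:

(1) Separation of compute and storage -- HDFS ties compute and storage to the same nodes, so scaling storage means adding compute as well. S3 scales storage independently of compute.

(2) Cost -- S3 Standard lists at roughly $0.023/GB/month (the exact price varies by region, tier, and volume), which is typically cheaper than provisioning and operating DataNode servers that hold three copies of the data.

(3) Durability -- S3 is designed for 11 nines (99.999999999%) of durability by storing objects redundantly across multiple Availability Zones. HDFS re-replicates lost blocks automatically, but operators still have to provision, monitor, and repair the disks and nodes behind it.

(4) Elasticity -- cloud compute (EMR, Dataproc) processes the data in object storage and then shuts down, so there is no idle cluster to pay for.

Modern architecture: store data in S3 as Parquet files, optionally managed as Delta Lake tables. Process it with Spark on ephemeral clusters. Query it with Athena, Presto, or Trino. HDFS is still relevant for on-premise deployments and for workloads that benefit from data locality (processing on the DataNode where the data is stored).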

Question 4

How does HDFS handle DataNode failures?

Accepted Answer

HDFS is designed for commodity hardware where failures are expected.

Detection: DataNodes send heartbeats to the NameNode every 3 seconds. If no heartbeat arrives for about 10 minutes (10 minutes 30 seconds with default settings), the DataNode is declared dead.

Recovery: the NameNode identifies all blocks that were on the failed DataNode (from the block-to-DataNode mapping). For each under-replicated block, it selects a healthy DataNode that holds a replica and instructs it to copy the block to a new DataNode, restoring the target replication factor. This is automatic and continuous.

Data integrity: HDFS stores CRC checksums alongside each block (one checksum per 512-byte chunk by default). Clients verify the checksums on every read, and DataNodes verify them when they receive data and during periodic background scans. If corruption is detected, the client reads from another replica and reports the bad replica to the NameNode, which schedules a new copy from a healthy replica and then deletes the corrupted one.

The NameNode itself is protected by HA: an active and a standby NameNode share the edit log through a quorum of JournalNodes. Fencing ensures that only one NameNode is active at a time, preventing split-brain.
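The detection and recovery steps above can be modelled as follows. This is a simplified sketch, not NameNode code: the data structures and function names are assumptions for illustration, destinations are picked without regard to rack placement or load, and the timeout formula (twice the 5-minute recheck interval plus ten heartbeat intervals) is how the default of 10 minutes 30 seconds arises.

```python
def dead_node_timeout_s(heartbeat_s=3, recheck_s=300):
    """Seconds of silence before a DataNode is declared dead (630 by default)."""
    return 2 * recheck_s + 10 * heartbeat_s


def find_dead_nodes(last_heartbeat, now, timeout_s):
    """Return the DataNodes whose last heartbeat is older than the timeout."""
    return {node for node, seen in last_heartbeat.items() if now - seen > timeout_s}


def plan_rereplication(block_locations, live_nodes, target_replication=3):
    """Return (block, source, destination) copy instructions.

    block_locations is a hypothetical dict: block id -> set of DataNodes
    that held a replica before the failure. live_nodes is a set.
    """
    plan = []
    for block, holders in sorted(block_locations.items()):
        live_holders = sorted(h for h in holders if h in live_nodes)
        missing = target_replication - len(live_holders)
        if missing <= 0 or not live_holders:
            continue  # fully replicated, or no healthy source replica left
        candidates = sorted(live_nodes - set(live_holders))
        for destination in candidates[:missing]:
            plan.append((block, live_holders[0], destination))
    return plan


if __name__ == "__main__":
    timeout = dead_node_timeout_s()
    heartbeats = {"dn1": 1000, "dn2": 1000, "dn3": 300, "dn4": 1000}
    dead = find_dead_nodes(heartbeats, now=1001, timeout_s=timeout)
    live = set(heartbeats) - dead
    blocks = {"blk_1": {"dn1", "dn2", "dn3"}, "blk_2": {"dn1", "dn2", "dn4"}}
    print("timeout (s):", timeout)
    print("dead:", sorted(dead))
    print("plan:", plan_rereplication(blocks, live))
```

In the example, dn3 has been silent for 701 seconds and is declared dead, so blk_1 is copied from a surviving holder to dn4, while blk_2 still has three live replicas and needs no work.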

System Design: Distributed File System (HDFS/GFS) — NameNode, DataNode, Block Replication, Rack Awareness, Hadoop

HDFS Architecture: NameNode and DataNode

Block Storage and Replication

Consistency Model and Write Semantics

Fault Tolerance and Data Integrity

HDFS in Modern Data Architecture