Question 1

How do you address the metadata server single point of failure?

Accepted Answer

The classic HDFS architecture uses a single NameNode, which is a single point of failure: if the NameNode is unavailable, the entire filesystem is inaccessible even though the data nodes are healthy. Mitigations include: (1) an Active-Standby NameNode pair sharing the edit log through a Quorum Journal Manager (QJM), where the standby replays the edit log continuously and can take over in seconds; (2) a Secondary NameNode (checkpoint node) that periodically merges the edit log into a new on-disk checkpoint of the namespace, which shortens restart time after a crash but is not a hot standby and cannot take over on its own; (3) Federation, in which multiple independent NameNodes each own a namespace subtree, which removes the single metadata bottleneck, although each NameNode still needs its own standby to be highly available. Other designs, such as CephFS, keep file metadata in the replicated RADOS object store and run active and standby metadata servers on top of it, so metadata availability does not depend on a single coordinator.
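
The quorum idea behind a shared edit log can be sketched in a few lines. This is a minimal illustration, not the QJM implementation: the JournalNode class, its `up` flag, and quorum_write are hypothetical names, and real systems add epoch numbers and fencing on top of the majority rule.

```python
from dataclasses import dataclass, field


@dataclass
class JournalNode:
    """Hypothetical in-memory journal node; `up` simulates reachability."""
    name: str
    up: bool = True
    edits: list = field(default_factory=list)

    def append(self, txid: int, op: str) -> bool:
        """Store one edit and acknowledge it, or refuse if the node is down."""
        if not self.up:
            return False
        self.edits.append((txid, op))
        return True


def quorum_write(journals: list, txid: int, op: str) -> bool:
    """Return True when a strict majority of journal nodes stored the edit."""
    acks = sum(1 for j in journals if j.append(txid, op))
    return acks > len(journals) // 2


journals = [JournalNode("jn1"), JournalNode("jn2"), JournalNode("jn3", up=False)]
committed = quorum_write(journals, txid=1, op="mkdir /logs")
```

With three journal nodes, one can be down and the edit still commits; with two down, the active metadata server must stop accepting writes.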

Question 2

How does rack-aware replication improve fault tolerance?

Accepted Answer

Without rack awareness, all three replicas of a chunk might land on data nodes within the same rack. A single top-of-rack switch failure would then make all three replicas unavailable at once, so the data is unreachable even though the replication factor is 3. Rack-aware replication spreads replicas across physical racks: the HDFS default policy places the first replica on the same node as the writer (or on a random node if the writer is outside the cluster), the second replica on a node in a different rack, and the third replica on a different node in the same rack as the second. With replicas on three distinct nodes across two racks, the chunk survives the loss of any single rack or of any two nodes; it does not survive losing the rack that holds two replicas together with the node that holds the third. Keeping two replicas in one rack is a deliberate trade-off that reduces cross-rack write traffic compared with using three racks.
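
The placement rule described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not HDFS code: place_replicas and the nodes_by_rack mapping are hypothetical, the writer is assumed to run on a data node, and load and free-space checks are ignored.

```python
import random


def place_replicas(writer, nodes_by_rack, rng=random):
    """Pick 3 nodes: the writer's node, a node on another rack,
    and a second node on that same remote rack.

    nodes_by_rack maps a rack name to the list of node names in it.
    """
    local_rack = None
    for rack, nodes in nodes_by_rack.items():
        if writer in nodes:
            local_rack = rack
            break
    if local_rack is None:
        raise ValueError("writer is not a known data node")

    # The remote rack must be able to hold two replicas on distinct nodes.
    remote_racks = [
        rack for rack, nodes in nodes_by_rack.items()
        if rack != local_rack and len(nodes) >= 2
    ]
    if not remote_racks:
        raise ValueError("need a remote rack with at least two nodes")

    remote_rack = rng.choice(remote_racks)
    second, third = rng.sample(nodes_by_rack[remote_rack], 2)
    return [writer, second, third]


topology = {
    "rack-a": ["a1", "a2"],
    "rack-b": ["b1", "b2", "b3"],
    "rack-c": ["c1"],
}
replicas = place_replicas("a1", topology, rng=random.Random(0))
```

Passing a seeded random.Random makes the choice repeatable in tests; in this topology only rack-b qualifies as the remote rack because rack-c has a single node.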

Question 3

How does the filesystem detect data node failures and trigger re-replication?

Accepted Answer

Data nodes send a heartbeat to the metadata server at a fixed interval (every 3 seconds by default in HDFS). If the metadata server does not receive a heartbeat from a data node for a configurable timeout (about 10 minutes in HDFS, long enough to avoid false positives from slow nodes), the data node is declared dead. The metadata server scans all chunks that had a replica on the dead node and identifies those that now have fewer replicas than the target replication factor. For each under-replicated chunk, the metadata server selects a healthy source data node that holds a replica and a healthy destination node, instructs the source to copy the chunk to the destination, and marks the chunk as having re-replication in progress. Once the new replica is confirmed (via block report), the chunk is marked fully replicated again.
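
A heartbeat-based failure detector can be sketched as below. The HeartbeatMonitor class is hypothetical and takes the current time as an argument so it can be tested without sleeping. The timeout helper assumes the HDFS default formula of twice the recheck interval plus ten heartbeat intervals, which works out to 630 seconds (10 minutes 30 seconds) with the default 300-second recheck interval and 3-second heartbeat.

```python
def hdfs_dead_timeout_s(heartbeat_s=3, recheck_s=300):
    """Assumed HDFS default: 2 * recheck interval + 10 * heartbeat interval."""
    return 2 * recheck_s + 10 * heartbeat_s


class HeartbeatMonitor:
    """Tracks the last heartbeat per node and reports nodes past the timeout."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}

    def heartbeat(self, node, now):
        """Record that `node` was heard from at time `now` (seconds)."""
        self.last_seen[node] = now

    def dead_nodes(self, now):
        """Return nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


monitor = HeartbeatMonitor(timeout_s=hdfs_dead_timeout_s())
monitor.heartbeat("dn1", now=0)
monitor.heartbeat("dn2", now=0)
monitor.heartbeat("dn2", now=600)   # dn2 keeps reporting, dn1 goes silent
dead = monitor.dead_nodes(now=631)
```

A long timeout trades slower detection for fewer needless re-replication storms when a node is merely slow or briefly partitioned.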

Question 4

What is the small file problem in distributed file systems and how is it mitigated?

Accepted Answer

In a DFS with large chunk sizes (64 MB or 128 MB), storing millions of small files (a few KB to a few MB each) is inefficient. In designs that allocate a full chunk per file the unused space is wasted; HDFS does not pad blocks on disk, so there the cost is mostly metadata and per-file overhead rather than raw storage. More critically, each small file requires at least one inode entry and one block entry in the metadata server's in-memory inode tree. With millions of files, the metadata server's memory footprint becomes the bottleneck: a commonly cited rule of thumb is roughly 150 bytes of NameNode heap per file, directory, or block object, which puts a NameNode with 64 GB of RAM in the range of 150 million files. Mitigations include: (1) HAR (Hadoop Archive) files, which pack many small files into a large archive with its own index; (2) SequenceFiles, which store many small files as key-value records (for example file name as key, contents as value) in a single large file; (3) using an object storage system (designed for arbitrary file sizes) for small-file workloads instead of a block-oriented DFS.
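
The memory pressure can be estimated with simple arithmetic. The sketch below assumes the rule of thumb of about 150 bytes of metadata-server heap per namespace object (one inode per file plus one entry per block); the constant and the function name are assumptions for illustration, not measured values.

```python
import math

BYTES_PER_OBJECT = 150  # assumed rule of thumb, not a measured value


def namenode_heap_bytes(num_files, avg_file_bytes, block_bytes=128 * 1024**2):
    """Estimate heap used by file and block metadata.

    Each file costs one inode object plus one object per block it spans.
    """
    blocks_per_file = max(1, math.ceil(avg_file_bytes / block_bytes))
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


# 150 million files of 1 MiB each: one inode and one block per file.
small_files = namenode_heap_bytes(150_000_000, 1024**2)

# The same data packed into 128 MiB files needs 128x fewer files.
packed_files = namenode_heap_bytes(150_000_000 // 128, 128 * 1024**2)
```

Under these assumptions the small-file layout needs about 45 GB of heap for metadata alone, while the packed layout needs well under 1 GB for the same amount of data.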

Question 5

How does the metadata server avoid becoming a single point of failure?

Accepted Answer

The metadata server can be made highly available with primary-standby replication and automatic failover; a distributed alternative replicates metadata with a consensus protocol such as Raft across 3 or more metadata nodes, so the service keeps accepting requests as long as a majority of those nodes is reachable.
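
Automatic failover is often built on a lease plus an epoch number that fences the old primary. The sketch below is one hypothetical way to model that, not a specific system's protocol: FailoverController, try_acquire, and accept_write are illustrative names, and the lease state is held in one object here although in practice it would live in a replicated coordination service.

```python
class FailoverController:
    """Hypothetical lease-based election with epoch fencing."""

    def __init__(self, lease_s):
        self.lease_s = lease_s
        self.active = None
        self.epoch = 0
        self.expires_at = 0.0

    def try_acquire(self, node, now):
        """Renew the lease for the current active node, or take over
        if no node is active or the lease has expired."""
        if self.active == node and now < self.expires_at:
            self.expires_at = now + self.lease_s
            return True
        if self.active is None or now >= self.expires_at:
            self.active = node
            self.epoch += 1          # a new epoch fences the previous primary
            self.expires_at = now + self.lease_s
            return True
        return False

    def accept_write(self, epoch):
        """Reject writes stamped with a stale epoch."""
        return epoch == self.epoch


fc = FailoverController(lease_s=10)
fc.try_acquire("mds-a", now=0)                  # mds-a becomes active, epoch 1
standby_early = fc.try_acquire("mds-b", now=5)  # lease still held by mds-a
fc.try_acquire("mds-a", now=8)                  # renewal, lease now ends at 18
standby_late = fc.try_acquire("mds-b", now=18)  # lease expired, mds-b takes over
```

The epoch check matters because a paused primary may wake up still believing it holds the lease; its writes carry the old epoch and are refused.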

Question 6

How does rack-aware replication protect against rack-level failures?

Accepted Answer

When placing 3 replicas of a chunk, a strict policy puts at most one replica on any rack (the HDFS default relaxes this to two racks to save cross-rack bandwidth); in either case a single rack power or network failure cannot remove every replica, even if all machines in that rack go offline simultaneously.
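
A strict one-replica-per-rack policy can be sketched like this. The function names and the nodes_by_rack mapping are hypothetical, and node selection within a rack is random here, whereas a real placer would also weigh free space and load.

```python
import random


def place_one_per_rack(nodes_by_rack, replication=3, rng=random):
    """Choose `replication` distinct racks and one node in each.

    Returns a dict mapping rack name to the chosen node.
    """
    racks = [rack for rack, nodes in nodes_by_rack.items() if nodes]
    if len(racks) < replication:
        raise ValueError("not enough racks for one replica per rack")
    chosen = rng.sample(racks, replication)
    return {rack: rng.choice(nodes_by_rack[rack]) for rack in chosen}


def surviving_replicas(placement, failed_rack):
    """Nodes that still hold a replica after `failed_rack` goes offline."""
    return [node for rack, node in placement.items() if rack != failed_rack]


cluster = {
    "rack-1": ["n11", "n12"],
    "rack-2": ["n21", "n22"],
    "rack-3": ["n31"],
    "rack-4": ["n41", "n42"],
}
placement = place_one_per_rack(cluster, replication=3, rng=random.Random(1))
worst_case = min(len(surviving_replicas(placement, rack)) for rack in cluster)
```

With three replicas on three racks, any single rack failure leaves at least two copies, so the chunk stays readable and can be re-replicated from two sources.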

Question 7

How is re-replication triggered after a data node failure?

Accepted Answer

The metadata server detects a data node as failed when its heartbeat has not been received for a configurable timeout; it then identifies all chunks whose replica count dropped below the target and schedules re-replication to healthy nodes.
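
The scan-and-schedule step can be sketched as a pure function over the chunk map. This is a simplified illustration with hypothetical names: it ignores rack placement, node load, and throttling, and simply picks the first surviving replica as the copy source.

```python
def plan_rereplication(chunk_locations, dead_node, live_nodes, target=3):
    """Return (chunk_id, source, destination) copy tasks for chunks that
    dropped below `target` replicas after `dead_node` failed.

    chunk_locations maps chunk id to the list of nodes holding a replica.
    """
    tasks = []
    for chunk_id, holders in sorted(chunk_locations.items()):
        survivors = [n for n in holders if n != dead_node and n in live_nodes]
        missing = target - len(survivors)
        if missing <= 0 or not survivors:
            continue  # fully replicated, or no healthy source remains
        candidates = [n for n in sorted(live_nodes) if n not in survivors]
        for destination in candidates[:missing]:
            tasks.append((chunk_id, survivors[0], destination))
    return tasks


chunk_map = {
    "c1": ["n1", "n2", "n3"],
    "c2": ["n2", "n3", "n4"],
}
tasks = plan_rereplication(
    chunk_map, dead_node="n1", live_nodes={"n2", "n3", "n4", "n5"}
)
```

Only c1 lost a replica, so the plan contains a single copy from a surviving holder to a node that does not yet store the chunk.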

Question 8

What is the small file problem and how is it mitigated?

Accepted Answer

Storing many small files (each far smaller than the chunk size) creates heavy metadata overhead and, in designs that allocate whole chunks, wastes storage; mitigations include file bundling (packing many small files into one chunk with an index), or using a separate fast metadata store optimized for small objects.
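
File bundling can be illustrated with an in-memory pack and lookup. The pack and read_file functions are hypothetical, and the index is a plain dict of (offset, length) pairs; an on-disk format would also persist the index and add checksums.

```python
def pack(files):
    """Concatenate small files into one blob and build an index.

    files maps a file name to its bytes. Returns (blob, index), where
    index maps each name to its (offset, length) inside the blob.
    """
    index = {}
    parts = []
    offset = 0
    for name, data in files.items():
        index[name] = (offset, len(data))
        parts.append(data)
        offset += len(data)
    return b"".join(parts), index


def read_file(blob, index, name):
    """Return the bytes of one bundled file using the index."""
    offset, length = index[name]
    return blob[offset:offset + length]


small = {"a.txt": b"hello", "b.txt": b"", "c.txt": b"world!"}
blob, index = pack(small)
content = read_file(blob, index, "c.txt")
```

The metadata server then tracks one large file instead of three, and the per-file lookup cost moves into the bundle's own index.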

Distributed File System Low-Level Design: Metadata Server, Data Nodes, Replication, and Fault Tolerance

Architecture: Metadata Server and Data Nodes

File and Chunk Model

Write Path: Client → Primary → Secondaries

Read Path: Client → Nearest Data Node

Heartbeat and Failure Detection

Rack-Aware Replication

SQL DDL

Python: Core Operations

Design Considerations Summary