A file system organizes data on storage media into files and directories, providing naming, access control, and efficient storage allocation. Understanding file system internals (inodes, directory structure, journaling, copy-on-write) explains the behavior of databases, Docker layers, and distributed storage systems. ext4, XFS, btrfs, and ZFS each implement different tradeoffs between reliability, performance, and features. Distributed file systems such as HDFS and GFS, and object stores such as S3, extend these concepts to clusters of machines. File system internals frequently appear in system design interviews for storage-intensive systems.
Inode Structure and File Layout
In Unix file systems, every file and directory is represented by an inode (index node). An inode stores file metadata: file type, permissions, owner, timestamps (atime, mtime, ctime), size, and pointers to data blocks. The inode does not store the file name — names are stored in directory entries that map names to inode numbers. A directory is a special file containing a list of (name, inode_number) pairs. Hard links: multiple directory entries pointing to the same inode. The inode reference count tracks how many links exist; the file is deleted only when the count reaches zero. Symbolic links: a file containing a path string pointing to another file. Data block pointers: ext2/ext3 use a hybrid tree (direct blocks, single/double/triple indirect blocks for large files). ext4 uses extents (contiguous block ranges) for efficiency.
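The name-to-inode split is easy to observe from a program. A minimal sketch in Python (file names here are illustrative): two hard links are two directory entries for one inode, and the data survives until the link count drops to zero.

```python
import os
import tempfile

# Two directory entries (hard links) pointing at the same inode.
d = tempfile.mkdtemp()
original = os.path.join(d, "original.txt")
with open(original, "w") as f:
    f.write("hello")

alias = os.path.join(d, "alias.txt")
os.link(original, alias)            # second name, same inode

st1, st2 = os.stat(original), os.stat(alias)
print(st1.st_ino == st2.st_ino)     # True: both names map to one inode
print(st1.st_nlink)                 # 2: the inode's link count

os.unlink(original)                 # removes one name, not the data
print(open(alias).read())           # still "hello" via the remaining link
```

Deleting `original` only decrements the link count; the inode and its blocks are freed when the last link (and last open file descriptor) goes away.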
# Linux inode information
stat /etc/passwd
# File: /etc/passwd
# Size: 2847 Blocks: 8 IO Block: 4096 regular file
# Device: 803h Inode: 1048577 Links: 1
# Access: (0644/-rw-r--r--) Uid: (0/root) Gid: (0/root)
# Access: 2024-01-15 10:23:11 Modify: 2024-01-10 09:15:44
# Change: 2024-01-10 09:15:44 Birth: 2024-01-01 00:00:00
# Inode usage: a disk can run out of inodes before space
df -i / # show inode usage
# Page cache: Linux caches file data in RAM
# read() returns from page cache if the page is present (no disk I/O)
# write() goes to page cache first, kernel flushes dirty pages to disk periodically
# fsync() forces flush of a specific file to disk (WAL pattern)
# O_DIRECT: bypass page cache, read/write directly to disk (used by databases)
# ext4 journaling modes:
# writeback: journal metadata only, data may be out of order (fastest, less safe)
# ordered: journal metadata, write data to disk first (default, good balance)
# journal: journal both data and metadata (safest, slowest)
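The write()-then-fsync() sequence described above is the heart of the WAL pattern: data lands in the page cache first and is only guaranteed durable after fsync() returns. A minimal sketch (the log path is illustrative):

```python
import os

def durable_append(path: str, record: bytes) -> None:
    """Append a record and force it to stable storage before returning."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, record)   # lands in the page cache, not yet on disk
        os.fsync(fd)           # block until data (and metadata) hit the disk
    finally:
        os.close(fd)

durable_append("/tmp/wal.log", b"record-1\n")
```

Without the fsync(), a power loss could discard the record even though write() succeeded; with it, a database can safely acknowledge the transaction.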
Journaling and Write-Ahead Logging
Without journaling, a crash during a multi-step operation (creating a file requires writing inode, directory entry, and data blocks) leaves the file system in an inconsistent state. fsck (file system check) repairs this but takes hours on large disks. Journaling writes a record of pending operations to a journal (circular log) before applying them to the main file system. On crash recovery, the journal is replayed — any incomplete operation is either completed or rolled back. This brings recovery time from hours (fsck) to seconds (replay journal). ext4, XFS, and NTFS all use journaling. Copy-on-write (CoW) file systems (btrfs, ZFS): instead of modifying blocks in place, write new versions to free space, then atomically update the tree root pointer. No journal needed — the file system is always consistent at any snapshot point. CoW enables efficient snapshots and checksumming.
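The replay step can be sketched with a toy journal (this illustrates the recovery idea, not ext4's on-disk format): operations are logged first, a commit marker makes a transaction valid, and replay applies only committed transactions.

```python
import json

def replay(journal_lines, state):
    """Apply fully committed transactions; discard incomplete ones."""
    pending = []
    for line in journal_lines:
        rec = json.loads(line)
        if rec["type"] == "op":
            pending.append(rec)          # staged, not yet applied
        elif rec["type"] == "commit":
            for op in pending:           # transaction complete: apply it
                state[op["key"]] = op["value"]
            pending = []
    # Ops left in `pending` have no commit marker: the crash hit
    # mid-transaction, so they are rolled back by simply being dropped.
    return state

journal = [
    '{"type": "op", "key": "inode_7", "value": "allocated"}',
    '{"type": "commit"}',
    '{"type": "op", "key": "inode_9", "value": "allocated"}',  # no commit
]
print(replay(journal, {}))  # {'inode_7': 'allocated'} — inode_9 rolled back
```

The key property is that each transaction's effect is all-or-nothing: replay after a crash leaves the state as if the incomplete operation never started.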
Distributed File Systems: HDFS and GFS
HDFS (Hadoop Distributed File System) and GFS (Google File System) store large files across commodity machines. Architecture: one NameNode (master) holds the file namespace and the file-to-block mapping; multiple DataNodes store the actual data blocks (typically 128MB each) and report their block locations to the NameNode. File write: the client splits the file into blocks and uploads each block to 3 DataNodes (replication factor 3 for fault tolerance), which acknowledge the write. File read: the client queries the NameNode for block locations, then reads from the closest DataNode. Design assumptions: files are large (GB to TB), written once and read many times, and sequential access dominates (not random access). The NameNode is a single point of failure; HDFS HA mode runs two NameNodes with a shared edit log (ZooKeeper coordinates failover). Object storage (S3, GCS) replaces HDFS in modern data lakes: cheaper, effectively unlimited in scale, HTTP API, no NameNode bottleneck.
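The block-placement logic can be sketched as follows. This is a simplified round-robin placement (real HDFS also considers rack topology); the node names and file size are illustrative.

```python
BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, the typical HDFS block size
REPLICATION = 3                  # each block stored on 3 DataNodes

def place_blocks(file_size: int, datanodes: list) -> list:
    """Return (block_index, [datanode, ...]) placements, round-robin."""
    n_blocks = (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE  # ceiling division
    placements = []
    for i in range(n_blocks):
        replicas = [datanodes[(i + r) % len(datanodes)]
                    for r in range(REPLICATION)]
        placements.append((i, replicas))
    return placements

nodes = ["dn1", "dn2", "dn3", "dn4"]
for block, replicas in place_blocks(300 * 1024 * 1024, nodes):
    print(block, replicas)
# A 300 MB file becomes 3 blocks, each replicated on 3 of the 4 DataNodes
```

A read then only needs the NameNode once (to fetch this placement table); all data transfer happens directly between the client and DataNodes.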
Key Interview Discussion Points
- Page cache: Linux caches file data in the page cache (RAM); read() checks the cache before going to disk; write() goes to cache first and is asynchronously flushed to disk; databases use O_DIRECT to bypass the page cache (they manage their own buffer pool)
- mmap: memory-mapped files map file contents into the process address space; accessing the memory triggers page faults that load from disk on demand; used by SQLite, LMDB, and memory-mapped B-tree indexes for zero-copy reads
- Copy-on-write for containers: Docker layers use overlayfs (copy-on-write); each layer contains only the changes from the layer below; a new container starts from the read-only layer stack with a thin read-write layer on top
- Sparse files: files with large gaps of zeros do not allocate disk blocks for the zero regions; useful for virtual disk images, database pre-allocated files; reported size (stat) differs from actual disk usage (du)
- ZFS features: end-to-end checksums on all data (detects silent disk corruption), transparent compression, integrated RAID (RAID-Z), near-zero-cost copy-on-write snapshots and clones, self-healing with redundant copies; used by FreeBSD, Solaris, and increasingly Linux NAS systems
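The mmap point above is easy to demonstrate: map a file into the address space and read it by slicing memory, with pages faulted in from the page cache on access. A minimal sketch (the file name and contents are illustrative):

```python
import mmap
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    f.write(b"header" + b"\x00" * 1024)   # small file with a known prefix

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        # No explicit read(): slicing the map triggers page faults that
        # pull the pages from the page cache (or disk) on demand.
        print(m[:6])   # b'header'
```

Because the mapping is backed directly by the page cache, multiple processes mapping the same file share the same physical pages, which is how zero-copy reads work in SQLite and LMDB.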