How is a hierarchical namespace tree stored and queried efficiently?

Each node (file or directory) is a row with a parent_id foreign key, forming an adjacency-list tree. Path lookups are accelerated by storing the materialized path string (e.g., /a/b/c) as an indexed column, enabling prefix-range scans for subtree listings. Alternatively, a closure table stores all ancestor-descendant pairs for O(1) ancestry checks. Rename operations update the materialized path of the node and all descendants in a single batch.

How does a Unix-style permission model apply to a file metadata service?

Each inode stores owner UID, group GID, and a 9-bit mode (rwx for owner, group, others). Permission checks traverse the caller's effective UID/GID against the inode's fields in order: owner bits first, then group, then other. Setuid/setgid and sticky bits are supported for directories. For cloud services this is augmented with ACL entries (additional named users or groups) stored in a separate acl table joined on inode ID.

How is quota enforcement implemented without becoming a write bottleneck?

Quota usage is tracked per-user and per-project in a dedicated quota table updated transactionally alongside metadata writes. To avoid lock contention, usage deltas are accumulated in a write-ahead buffer and flushed periodically, with a small over-quota tolerance to absorb bursts. Hard limits check the current usage at write time; soft limits trigger warnings. Background reconciliation jobs periodically recompute actual usage from the namespace tree to correct any drift from failed transactions.

How do you paginate bulk directory listings efficiently?

Listings use keyset (cursor) pagination rather than OFFSET to avoid full scans. The response includes a continuation token encoding the last-seen (parent_id, name) pair. The next page query adds a WHERE clause filtering on that composite key, using the directory's B-tree index to resume from exactly where the previous page ended. Page size is capped (e.g., 1000 entries) and results are sorted by name to ensure a stable, reproducible order across pages.

File Metadata Service Low-Level Design: Namespace Tree, Permission Model, and Quota Enforcement

⏱ 6 min read

What Is a File Metadata Service?

A file metadata service manages the namespace, directory hierarchy, permissions, and quota accounting for a distributed file system or cloud storage product. It does not store file data — that is delegated to a blob or chunk store — but it answers questions like: does this path exist, who can access it, what are its attributes, and how much quota has been consumed? Getting this service right is critical because every file operation (read, write, list, stat) touches metadata, making it a central latency and scalability bottleneck.

Requirements

Functional Requirements

Hierarchical namespace: directories contain files and subdirectories; paths are resolved by traversing the tree.
Unix-style permission model: owner, group, and other with read/write/execute bits; optional ACLs for fine-grained sharing.
Quota enforcement: track storage used per user and per namespace; reject writes that would exceed quota.
Bulk listing: list directory contents efficiently, supporting pagination for directories with millions of entries.
Atomic rename: move a file or directory atomically, including cross-directory moves.

Non-Functional Requirements

Stat and permission check latency under 5ms at the 99th percentile.
List operations must paginate to avoid unbounded response sizes.
Quota updates must be consistent: no double-counting or missed accounting on concurrent writes.
Scalable to billions of inodes across millions of namespaces.

Data Model

Inode — inode_id (PK, UUID), namespace_id, type (FILE/DIR), name, parent_inode_id, owner_uid, owner_gid, mode (Unix permission bits), size_bytes, block_count, created_at, modified_at, accessed_at, blob_key (for FILE type).
DirEntry — parent_inode_id + name (composite PK), child_inode_id — supports efficient directory listing by parent.
ACLEntry — inode_id, principal_type (USER/GROUP), principal_id, permissions (bitmask).
QuotaLedger — namespace_id (PK), owner_uid (PK), bytes_used, inode_count, quota_bytes, quota_inodes, updated_at.

Inode and DirEntry records live in a distributed database partitioned by namespace_id. QuotaLedger uses serializable transactions to prevent quota overrun under concurrent writes.

Core Algorithms

Path Resolution

Path resolution walks the directory tree component by component starting from the root inode of the namespace. For /a/b/c.txt the resolver fetches the root inode, then looks up DirEntry(root, “a”) to get the inode for directory a, then DirEntry(a, “b”), then DirEntry(b, “c.txt”). Each lookup is a point read by composite primary key. Resolution depth is bounded by the maximum path depth (configurable, e.g., 255 components). Path components are cached in a local LRU cache to amortize repeated resolutions of common prefixes.

Unix Permission Check

For each operation the service evaluates permissions in order: if the requester is the inode owner, apply the owner bits; if the requester belongs to the inode group, apply the group bits; otherwise apply the other bits. If ACLs are present they are checked after the standard Unix bits and may grant or restrict access beyond the mode bits. Permission checks for a full path require checking execute (x) permission on every directory in the path, not just the target inode.

Quota Accounting

On every write operation (file create, file grow, directory create) the service debits the QuotaLedger within the same database transaction as the Inode creation or update. If the debit would push bytes_used above quota_bytes the transaction is rolled back and the write is rejected with a quota exceeded error. On delete, the quota is credited. Using the same transaction for metadata change and quota update ensures no bytes are ever counted without a corresponding inode or charged without storage being consumed.

API Design

POST /namespaces/{ns}/inodes — Create a file or directory inode at a given parent path.
GET /namespaces/{ns}/inodes?path=/a/b — Resolve a path and return the inode attributes.
GET /namespaces/{ns}/inodes/{id}/children?page_token=X&limit=1000 — Paginated directory listing.
PATCH /namespaces/{ns}/inodes/{id}/permissions — Update mode bits or ACL entries.
POST /namespaces/{ns}/inodes/{id}/rename — Atomic rename; body specifies new parent and new name.
GET /namespaces/{ns}/quota/{uid} — Query current quota usage for a user.

Scalability and Reliability

Horizontal Partitioning

Inodes and DirEntries are partitioned by namespace_id. Each namespace maps to a shard group in the distributed database. Large namespaces with billions of inodes may be further sub-sharded by a hash of the parent_inode_id. Cross-shard atomic renames (moving a file between two shards) use a two-phase commit coordinated by the metadata service, which holds a distributed lock on the source and destination during the rename to prevent concurrent modifications.

Bulk Listing Scalability

DirEntry records are stored in sorted order by (parent_inode_id, name) so listing is a range scan rather than a full-table scan. Pagination uses a cursor encoding the last seen name. For directories with millions of entries the scanner returns batches of 1000 entries per page. The caller repeats the request with the cursor until an empty page is returned. Listing is eventually consistent in the presence of concurrent creates and deletes — the page cursor may miss or duplicate entries added during iteration, which is acceptable for most use cases.

Quota Ledger Consistency

QuotaLedger updates use serializable isolation to prevent two concurrent writes from both reading the same bytes_used, both concluding they are under quota, and both succeeding when their combined write would exceed quota. For high-concurrency namespaces, quota is tracked with a distributed counter using CRDTs or a reservations system: writers reserve a quota allocation in advance and release unused reservations periodically, reducing contention on the central ledger row.

Interview Tips

Interviewers probe: how do you handle deep directory trees efficiently? (cache resolved path prefixes; limit max depth). How do you implement shared directories without duplicating inodes? (hard links: multiple DirEntry records point to the same inode_id; link_count on the inode tracks how many DirEntries reference it). How do you prevent quota evasion through hard links? (count bytes against the owner of the inode, not the link, so creating a hard link does not create additional quota charge).