Merkle Tree: Design and Applications – Tech Interview Dot Org

A Merkle tree is a hash tree where every leaf node contains the hash of a data block, and every non-leaf node contains the hash of its children’s hashes. The root hash (Merkle root) represents the entire dataset — if any data changes, the root hash changes. This property enables efficient verification of data integrity and efficient identification of differences between two datasets without comparing all the data.

Structure and Construction

Given n data blocks: hash each block to produce n leaf hashes. Pair adjacent leaf hashes and hash each pair to produce n/2 parent hashes. Repeat until one root hash remains. If n is odd at any level, duplicate the last hash to make the level even (or handle odd levels explicitly). The result is a complete binary tree where the root summarizes all the data. Rebuilding the root after changing one block requires recomputing only O(log n) hashes — the path from the changed leaf to the root.

Proof of Membership

To prove that block B is in a dataset without revealing the other blocks: provide B, its hash, and the sibling hashes along the path from B’s leaf to the root (the Merkle proof). The verifier hashes B, then combines with each sibling hash up the tree, and checks that the computed root matches the known Merkle root. This requires only O(log n) hashes to prove membership in a dataset of n blocks — no need to share all n blocks. Certificate transparency logs use Merkle proofs to prove that a certificate is in the log without revealing all certificates.

Anti-Entropy: Finding Differences Between Replicas

Cassandra and DynamoDB use Merkle trees to efficiently find data that differs between replicas without transferring all data. Process: each replica builds a Merkle tree over its data (partitioned into hash buckets). Replicas exchange root hashes. If roots match, data is identical — no sync needed. If roots differ: compare child hashes at each level, drilling down to the differing leaves. Only the differing subtrees are transferred and compared. This reduces anti-entropy repair traffic from O(n) (comparing every row) to O(k log n) where k is the number of differing rows.

Blockchain Applications

Bitcoin and Ethereum use Merkle trees to summarize all transactions in a block. The block header contains only the Merkle root — not all transactions. Nodes can verify that a specific transaction is in a block by downloading only the Merkle proof (O(log n) hashes, ~1KB) rather than all transactions in the block (~1MB). This enables lightweight clients (SPV nodes) to verify transaction inclusion without downloading the full blockchain. The Merkle root in the block header is part of the hash chain — tampering with any transaction changes the Merkle root, invalidating the block hash.

Git Object Model

Git uses a Merkle DAG (directed acyclic graph) rather than a pure tree. Each commit object hashes its tree (directory structure), which hashes its blobs (file contents), which hash file data. The commit hash uniquely identifies the entire repository state at that point. Changing any file changes its blob hash, which changes its tree hash, which changes the commit hash. Two commits with identical hashes are guaranteed to have identical contents. This property makes git’s integrity guarantees cryptographic — git push –force cannot silently corrupt history if you verify commit hashes.

{ “@context”: “https://schema.org”, “@type”: “FAQPage”, “mainEntity”: [ { “@type”: “Question”, “name”: “How does a Merkle tree enable efficient integrity verification?”, “acceptedAnswer”: {“@type”: “Answer”, “text”: “A Merkle tree hashes each data block into a leaf, then hashes pairs of leaf hashes into parent nodes, up to a single root hash. The root hash cryptographically summarizes the entire dataset — changing any single bit in any block changes the root hash. To verify a specific block: provide the block’s hash and the sibling hashes along the path from that leaf to the root (the Merkle proof). The verifier recomputes the root from bottom up using these O(log n) hashes and checks it matches the trusted root. This proves the block is in the dataset with only O(log n) hashes — not the full n blocks. Changing any block in the dataset produces a different root, making tampering detectable.”} }, { “@type”: “Question”, “name”: “How does Cassandra use Merkle trees for anti-entropy repair?”, “acceptedAnswer”: {“@type”: “Answer”, “text”: “Cassandra periodically runs anti-entropy repair to synchronize data between replicas. Each replica builds a Merkle tree over its partition range: the key space is divided into hash buckets, each bucket hashed, buckets hashed pairwise up to a root. Replicas exchange root hashes first. If roots match, the replicas are identical — no data transfer needed. If roots differ: the nodes compare child hashes level by level, drilling down only into the subtrees that differ. Only the divergent leaf ranges are examined and their data synchronized. This reduces repair network traffic from O(n) — transferring all data to compare — to O(k log n) where k is the number of differing rows, making repair practical even for large datasets.”} }, { “@type”: “Question”, “name”: “How do blockchains use Merkle trees?”, “acceptedAnswer”: {“@type”: “Answer”, “text”: “Bitcoin and Ethereum place all transactions in a block into a Merkle tree. The block header stores only the Merkle root — a 32-byte hash. To prove that a specific transaction is in a block (SPV proof), provide the transaction and O(log n) sibling hashes; the verifier recomputes the root and checks it matches the block header’s Merkle root. This lets lightweight clients (mobile wallets, SPV nodes) verify transaction inclusion by downloading ~1KB of proof data rather than the full block (~1MB). The Merkle root in the block header is part of the block hash chain — any transaction tampering changes the Merkle root, invalidating the block hash and breaking the blockchain.”} }, { “@type”: “Question”, “name”: “How does Git use a Merkle DAG?”, “acceptedAnswer”: {“@type”: “Answer”, “text”: “Git stores objects in a Merkle DAG (directed acyclic graph): a blob object hashes file content; a tree object hashes its entries (file names + blob hashes + subtree hashes); a commit object hashes its tree (root directory), parent commit hashes, and metadata. Changing any file changes its blob hash, which changes its parent tree hash, which changes the commit hash. Two commits with the same hash are guaranteed to have identical content — the hash is a cryptographic commitment to the entire repository state. This makes git’s object model tamper-evident: you cannot silently change history without changing commit hashes, which is why git log –verify and signed commits can detect corruption. It also enables efficient storage: identical files across branches share one blob object.”} } ] }