Coding Interview: Hash Table Internals — Hash Functions, Collision Resolution, Chaining, Open Addressing, Rehashing

Hash tables are among the most widely used data structures in computer science — they power dictionaries, sets, caches, database indexes, and countless other systems. Understanding how hash tables work internally gives you an edge in coding interviews, especially when you need to reason about time complexity, collision handling, and when hash tables are not the right choice. This guide covers hash functions, collision resolution, load factors, and the implementation details that interviewers test.

How Hash Tables Work

A hash table stores key-value pairs with O(1) average-case lookup, insert, and delete. The implementation is an array of N slots (buckets).

To store a key-value pair: (1) Compute the hash of the key: h = hash(key). (2) Map the hash to an array index: index = h % N. (3) Store the key-value pair at that index. To look up a key: compute the hash, find the index, and retrieve the value.

The hash function must be deterministic (the same key always produces the same hash), uniform (keys are spread evenly across the array), and fast to compute. When two keys map to the same index (a collision), the hash table must resolve it using chaining or open addressing.

Performance depends on the load factor (alpha = number_of_entries / array_size). As the load factor increases, collisions become more frequent and performance degrades. When the load factor exceeds a threshold (typically 0.75), the array is resized (usually doubled) and all entries are rehashed.
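The hash-to-index mapping above can be sketched in a few lines. This is an illustrative sketch (the names `bucket_index` and `N` are made up for the example, not from any real library):

```python
# Core hash-table mapping: hash -> modulo -> bucket index.

N = 8  # number of buckets in the backing array

def bucket_index(key, n=N):
    """Map a key to a bucket index: hash the key, then take it modulo n."""
    return hash(key) % n

# Deterministic within one process: the same key always maps to the same bucket.
i1 = bucket_index("apple")
i2 = bucket_index("apple")
assert i1 == i2
assert 0 <= i1 < N

# Load factor: entries / buckets. At 6 entries in 8 buckets we are
# exactly at the typical 0.75 resize threshold.
load_factor = 6 / N
```

Note that Python randomizes `hash()` for strings between runs, so the bucket index for "apple" changes from run to run but is stable within a single process — exactly the determinism property described above.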

Hash Functions

Properties of a good hash function: (1) Uniform distribution — keys should be spread evenly across the hash space. A hash function that maps all strings starting with "A" to the same bucket is terrible. (2) Deterministic — the same input always produces the same output. (3) Fast — hashing should be O(1) for fixed-size keys, or O(L) where L is the key length. (4) Avalanche effect — a small change in input (a single bit flip) should produce a drastically different hash.

Common hash functions for hash tables: (1) Division method: h(k) = k mod m. Choose m as a prime number not close to a power of 2. (2) Multiplication method: h(k) = floor(m * (k * A mod 1)) where A is a constant (Knuth suggests A = (sqrt(5) - 1) / 2). (3) MurmurHash3 and xxHash — fast, non-cryptographic hash functions with excellent distribution, widely used in production systems. (Java's HashMap does not use these; it takes the key's hashCode() and mixes the high bits into the low bits before computing the index.)

For string keys, the standard approach is a polynomial hash that processes each character: h = h * 31 + char. Java's String.hashCode() uses exactly this. Python randomizes the hash seed for strings on each run to prevent hash-flooding DoS attacks.
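The polynomial string hash above is easy to write down concretely. A minimal sketch (the 32-bit masking mimics Java's int overflow, though Java's int is signed, so exact values can differ in sign):

```python
# Polynomial rolling hash for strings: h = h * 31 + char,
# the same scheme as Java's String.hashCode().

def poly_hash(s, multiplier=31, mask=0xFFFFFFFF):
    """Hash a string by folding each character into an accumulator."""
    h = 0
    for ch in s:
        h = (h * multiplier + ord(ch)) & mask  # truncate to 32 bits
    return h

# Deterministic: same input, same output.
assert poly_hash("cat") == poly_hash("cat")

# A small input change produces a different hash.
assert poly_hash("cat") != poly_hash("cbt")

# But collisions exist: "Aa" and "BB" are a classic colliding pair
# under the multiplier-31 scheme (both hash to 2112).
assert poly_hash("Aa") == poly_hash("BB")
```

The "Aa"/"BB" collision shows why collision resolution is unavoidable: even a reasonable hash function maps some distinct keys to the same value.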

Collision Resolution: Chaining

Separate chaining: each bucket contains a linked list (or other collection) of entries that hash to the same index. On insert: add the entry to the list at the computed index. On lookup: traverse the list at the computed index, comparing keys until the target is found. On delete: find and remove the entry from the list.

Average case: if the load factor is alpha, the average chain length is alpha. With alpha = 0.75 and a good hash function, most chains have 0-1 entries, and lookup is O(1 + alpha) on average. Worst case: if all N keys hash to the same bucket (pathological hash function or adversarial input), lookup degrades to O(N) — a scan of one long chain.

Java HashMap uses chaining. When a chain reaches 8 entries and the table has at least 64 buckets, the chain is converted from a linked list to a red-black tree, improving worst-case lookup within a bucket from O(N) to O(log N).

Chaining pros: simple implementation, graceful degradation (no clustering), and support for load factors above 1 (chains can grow arbitrarily). Cons: pointer overhead (each entry carries a next pointer) and poor cache locality (linked-list nodes are scattered in memory).
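The chaining operations above fit in a short class. A minimal sketch, using Python lists as the chains (a real implementation would also track the load factor and resize):

```python
# Separate chaining: each bucket holds a list of (key, value) pairs.

class ChainedHashTable:
    def __init__(self, capacity=8):
        self.buckets = [[] for _ in range(capacity)]  # one chain per bucket
        self.size = 0

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        chain = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                   # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))         # new key: append to the chain
        self.size += 1

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:  # scan the chain
            if k == key:
                return v
        raise KeyError(key)

    def delete(self, key):
        chain = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain.pop(i)
                self.size -= 1
                return
        raise KeyError(key)

t = ChainedHashTable()
t.put("a", 1)
t.put("b", 2)
t.put("a", 3)          # overwrite keeps size at 2
assert t.get("a") == 3
```

Note how every operation first narrows to one bucket in O(1), then scans only that bucket's chain — this is where the O(1 + alpha) average cost comes from.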

Collision Resolution: Open Addressing

Open addressing: all entries are stored in the array itself (no linked lists). On collision, probe for the next available slot. Probing strategies: (1) Linear probing: check index, index+1, index+2, ... Simple but causes clustering (a block of filled slots grows and captures more collisions). (2) Quadratic probing: check index, index+1, index+4, index+9, ... Reduces clustering but may not visit all slots. (3) Double hashing: use a second hash function to determine the probe step: check index, index+h2(key), index+2*h2(key), ... Best distribution, minimal clustering.

On lookup: compute the index and check the slot. If the key matches, return the value. If the slot has never been occupied, the key does not exist. If the slot contains a different key (collision), continue probing. On delete: you cannot simply empty the slot, because that breaks the probe chain for other entries. Instead, use a tombstone marker: "this slot is free but was previously occupied — continue probing past it."

Open addressing pros: cache friendly (entries are contiguous in memory) and no pointer overhead. Cons: the load factor must stay below roughly 0.7 (performance degrades rapidly above that), deletes are complicated (tombstones), and the table cannot hold more entries than slots. CPython's dict uses open addressing, but with a perturbation-based probe sequence derived from the full hash value rather than plain linear or quadratic probing.
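The probe-and-tombstone logic above is where most interview implementations go wrong, so it is worth sketching. A minimal fixed-capacity sketch using linear probing (a real table would resize well before filling up; the sentinel objects and class name are made up for this example):

```python
# Open addressing with linear probing and tombstone deletion.

_EMPTY = object()      # slot never used: probing may stop here
_TOMBSTONE = object()  # slot deleted: probing must continue past it

class OpenAddressTable:
    def __init__(self, capacity=8):
        self.keys = [_EMPTY] * capacity
        self.values = [None] * capacity

    def _probe(self, key):
        """Yield slot indices in linear-probe order: i, i+1, i+2, ..."""
        i = hash(key) % len(self.keys)
        for step in range(len(self.keys)):
            yield (i + step) % len(self.keys)

    def put(self, key, value):
        target = None                      # first reusable slot seen
        for i in self._probe(key):
            slot = self.keys[i]
            if slot is _TOMBSTONE:
                if target is None:
                    target = i             # remember, but keep probing:
            elif slot is _EMPTY:           # the key may still appear later
                if target is None:
                    target = i
                break                      # key cannot be past an empty slot
            elif slot == key:
                self.values[i] = value     # overwrite existing key
                return
        if target is None:
            raise RuntimeError("table full")
        self.keys[target] = key
        self.values[target] = value

    def get(self, key):
        for i in self._probe(key):
            slot = self.keys[i]
            if slot is _EMPTY:             # never-used slot: key is absent
                raise KeyError(key)
            if slot is not _TOMBSTONE and slot == key:
                return self.values[i]
        raise KeyError(key)

    def delete(self, key):
        for i in self._probe(key):
            slot = self.keys[i]
            if slot is _EMPTY:
                raise KeyError(key)
            if slot is not _TOMBSTONE and slot == key:
                self.keys[i] = _TOMBSTONE  # keep probe chains intact
                self.values[i] = None
                return
        raise KeyError(key)
```

The key subtlety: lookup stops at a never-used slot but continues past a tombstone, and insert can reuse the first tombstone it saw, but only after confirming the key is not further along the probe sequence.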

Rehashing and Dynamic Resizing

When the load factor exceeds a threshold (0.75 for Java HashMap, 2/3 for Python dict), the hash table resizes to maintain O(1) performance. Resizing process: (1) Allocate a new array with double the capacity. (2) Rehash every entry: compute its new index in the larger array (h % new_capacity) and insert it. (3) Replace the old array with the new one.

Rehashing is O(N) — every entry must be moved. However, because the capacity doubles each time, the total rehashing work across all resizes forms a geometric series (N + N/2 + N/4 + ... < 2N), so the amortized cost per insert is O(1).

Shrinking: some implementations shrink the array when the load factor drops below a threshold (e.g., 0.25) to save memory. Java HashMap never shrinks. CPython's dict does not shrink on delete either; it can compact to a smaller table only when a later insert triggers a resize.

Interview implications: hash table insert is O(1) amortized but O(N) worst case (during a rehash). If asked about worst-case guarantees, this matters. For real-time systems where a single O(N) pause is unacceptable, consider incremental rehashing (Redis does this — it migrates a few entries per operation instead of all at once) or a fixed-size hash table with a known maximum capacity.
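The resize step above can be sketched on top of a simple chained table (represented here as a bare list of chains; `resize` and `insert` are illustrative helper names, not from any real library):

```python
# Dynamic resizing: double the bucket array and rehash every entry.

def resize(buckets):
    """Allocate a table of double capacity and rehash all entries into it."""
    new_buckets = [[] for _ in range(2 * len(buckets))]
    for chain in buckets:
        for key, value in chain:
            # New index uses the NEW capacity — entries move buckets.
            new_buckets[hash(key) % len(new_buckets)].append((key, value))
    return new_buckets

def insert(buckets, key, value, max_load=0.75):
    """Insert, then resize if the load factor exceeds the threshold."""
    buckets[hash(key) % len(buckets)].append((key, value))
    entries = sum(len(chain) for chain in buckets)
    if entries / len(buckets) > max_load:   # O(N) rehash, but amortized O(1)
        buckets = resize(buckets)
    return buckets

buckets = [[] for _ in range(4)]
for i in range(20):
    buckets = insert(buckets, f"k{i}", i)

# All 20 entries survive the rehashes, and capacity has doubled repeatedly.
assert sum(len(chain) for chain in buckets) == 20
assert len(buckets) >= 16
```

Starting from 4 buckets, the table doubles to 8, 16, and 32 as entries are added — each doubling moves every entry once, which is exactly the geometric series that keeps the amortized insert cost at O(1).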
