Hash tables are the most important data structure in computer science — they power dictionaries, sets, caches, database indexes, and countless other systems. Understanding how hash tables work internally gives you an edge in coding interviews, especially when you need to reason about time complexity, collision handling, and when hash tables are not the right choice. This guide covers hash functions, collision resolution, load factors, and the implementation details that interviewers test.
How Hash Tables Work
A hash table stores key-value pairs with O(1) average-case lookup, insert, and delete. Implementation: an array of N slots (buckets). To store a key-value pair: (1) Compute the hash of the key: h = hash(key). (2) Map the hash to an array index: index = h % N. (3) Store the key-value pair at that index. To look up a key: compute the hash, find the index, and retrieve the value. The hash function must be deterministic (same key always produces the same hash), uniform (distribute keys evenly across the array), and fast to compute. When two keys map to the same index (a collision), the hash table must resolve it using chaining or open addressing. Performance depends on the load factor (alpha = number_of_entries / array_size). As the load factor increases, collisions become more frequent and performance degrades. When the load factor exceeds a threshold (typically 0.75), the array is resized (doubled) and all entries are rehashed.
Hash Functions
Properties of a good hash function: (1) Uniform distribution — keys should be spread evenly across the hash space. A hash function that maps all strings starting with “A” to the same bucket is terrible. (2) Deterministic — the same input always produces the same output. (3) Fast — hashing should be O(1) or O(L) where L is the key length. (4) Avalanche effect — a small change in input (one bit flip) should produce a drastically different hash. Common hash functions for hash tables: (1) Division method: h(k) = k mod m. Choose m as a prime number not close to a power of 2. (2) Multiplication method: h(k) = floor(m * (k * A mod 1)) where A is a constant (Knuth suggests A = (sqrt(5) – 1) / 2). (3) MurmurHash3 and xxHash — fast, non-cryptographic hash functions with excellent distribution. Used in production hash tables (HashMap in Java uses a variant). For string keys: process each character (polynomial hash: h = h * 31 + char). Java String.hashCode() uses this approach. Python randomizes the hash seed on each run to prevent hash flooding DoS attacks.
Collision Resolution: Chaining
Separate chaining: each bucket contains a linked list (or other collection) of entries that hash to the same index. On insert: add the entry to the list at the computed index. On lookup: traverse the list at the computed index, comparing keys until the target is found. On delete: find and remove the entry from the list. Average case: if the load factor is alpha, the average chain length is alpha. With alpha = 0.75 and a good hash function, most chains have 0-1 entries. Lookup is O(1 + alpha) on average. Worst case: if all N keys hash to the same bucket (pathological hash function or adversarial input), lookup degrades to O(N) (scanning a single long chain). Java HashMap: uses chaining. When a chain exceeds 8 entries and the table has >= 64 buckets, the chain is converted from a linked list to a red-black tree. This improves worst-case lookup from O(N) to O(log N) per bucket. Chaining pros: simple implementation, degrades gracefully (no clustering), and supports load factors > 1 (chains grow arbitrarily). Cons: pointer overhead (each entry has a next pointer), and cache unfriendly (linked list nodes are scattered in memory).
Collision Resolution: Open Addressing
Open addressing: all entries are stored in the array itself (no linked lists). On collision, probe for the next available slot. Probing strategies: (1) Linear probing: check index, index+1, index+2, … Simple but causes clustering (a block of filled slots grows and captures more collisions). (2) Quadratic probing: check index, index+1, index+4, index+9, … Reduces clustering but may not visit all slots. (3) Double hashing: use a second hash function to determine the probe step: check index, index+h2(key), index+2*h2(key), … Best distribution, minimal clustering. On lookup: compute the index, check the entry. If the key matches, return the value. If the slot is empty, the key does not exist. If the slot contains a different key (collision), continue probing. On delete: cannot simply empty the slot (it breaks the probe chain for other entries). Use a tombstone marker: “this slot is empty but was previously occupied — continue probing past it.” Open addressing pros: cache friendly (entries are contiguous in memory), no pointer overhead. Cons: load factor must stay below ~0.7 (performance degrades rapidly above), deletes are complicated (tombstones), and the table cannot store more entries than slots. Python dict uses open addressing with quadratic probing.
Rehashing and Dynamic Resizing
When the load factor exceeds a threshold (0.75 for Java HashMap, 2/3 for Python dict), the hash table resizes to maintain O(1) performance. Resizing process: (1) Allocate a new array with double the capacity. (2) Rehash every entry: compute its new index in the larger array (h % new_capacity) and insert it. (3) Replace the old array with the new one. Rehashing is O(N) — every entry must be moved. However, since rehashing happens only when the table doubles, and each entry is rehashed at most O(log N) times across all doublings, the amortized cost per insert is O(1). Shrinking: some implementations shrink the array when the load factor drops below a threshold (e.g., 0.25) to save memory. Java HashMap does not shrink. Python dict shrinks. Interview implications: hash table insert is O(1) amortized but O(N) worst case (during rehash). If asked about worst-case guarantees, this matters. For real-time systems where a single O(N) operation is unacceptable, consider incremental rehashing (Redis does this — move a few entries per operation instead of all at once) or fixed-size hash tables with a known maximum capacity.