Why String Hashing?
Naive string comparison is O(L) per comparison where L is the string length. String hashing converts a string to an integer in O(L) preprocessing, then compares in O(1). This enables: finding duplicate substrings in O(n) average, pattern matching (Rabin-Karp) in O(n+m), comparing all n² substrings in O(n²) instead of O(n³).
Polynomial Rolling Hash
# hash(s) = s[0]*p^(L-1) + s[1]*p^(L-2) + ... + s[L-1]*p^0 (mod M)
# p = 31 (for lowercase letters), M = large prime (10^9+7)
def build_hash(s, p=31, M=10**9+7):
h = 0
p_pow = 1
for c in s:
h = (h + (ord(c) - ord('a') + 1) * p_pow) % M
p_pow = p_pow * p % M
return h
For substring hashing with O(1) queries, use prefix hashes:
def build_prefix_hash(s, p=31, M=10**9+7):
n = len(s)
h = [0] * (n + 1)
p_pow = [1] * (n + 1)
for i in range(n):
h[i+1] = (h[i] + (ord(s[i]) - ord('a') + 1) * p_pow[i]) % M
p_pow[i+1] = p_pow[i] * p % M
return h, p_pow
def get_hash(h, p_pow, l, r, M=10**9+7):
# hash of s[l..r] (inclusive)
return (h[r+1] - h[l] * p_pow[r-l+1]) % M
Rabin-Karp Pattern Matching
Find all occurrences of pattern P in text T. Compute hash(P) and rolling hash of each window of length |P| in T. On hash match, verify with string comparison (to handle collisions). Average O(n+m), worst case O(nm) if many hash collisions.
def rabin_karp(text, pattern, p=31, M=10**9+7):
n, m = len(text), len(pattern)
if m > n: return []
h_t, p_pow = build_prefix_hash(text, p, M)
h_p, _ = build_prefix_hash(pattern, p, M)
target = get_hash(h_p, p_pow, 0, m-1, M)
result = []
for i in range(n - m + 1):
if get_hash(h_t, p_pow, i, i+m-1, M) == target:
if text[i:i+m] == pattern: # verify to avoid false positive
result.append(i)
return result
Longest Duplicate Substring (LC 1044)
Binary search on the length L. For each candidate length, check if any substring of length L appears twice using a hash set. O(n log n) with hashing, O(n log² n) with suffix array.
def longestDupSubstring(s):
def has_dup(L):
seen = set()
h, p_pow = build_prefix_hash(s)
for i in range(len(s) - L + 1):
hv = get_hash(h, p_pow, i, i+L-1)
if hv in seen: return i # found duplicate at position i
seen.add(hv)
return -1
lo, hi = 1, len(s) - 1
start = 0
while lo <= hi:
mid = (lo + hi) // 2
pos = has_dup(mid)
if pos != -1:
start = pos
lo = mid + 1
else:
hi = mid - 1
return s[start:start+hi]
Double Hashing (Reduce Collision Probability)
Use two independent hash functions and store (hash1, hash2) pairs. Collision probability: 1/(M1*M2) ≈ 10^-18 for M1=M2=10^9+7. In competitive programming, double hashing makes hash collision attacks infeasible.
def double_hash(s):
M1, M2 = 10**9+7, 10**9+9
P1, P2 = 31, 37
h1 = build_hash(s, P1, M1)
h2 = build_hash(s, P2, M2)
return (h1, h2) # collision probability ~10^-18
String Hashing vs Suffix Array vs Z-Algorithm
- Rolling hash: simple to implement, O(n) average for pattern matching, handles arbitrary substring queries in O(1) after O(n) preprocessing. Risk of collisions.
- KMP/Z-algorithm: exact O(n+m) pattern matching, no collision risk. But handles only one pattern at a time.
- Suffix array: O(n log n) build, O(m log n) search, handles all pattern matching queries. Used for longest common substring of multiple strings.
- For interviews: rolling hash is usually sufficient. Mention collision handling (double hash or verify on match).
When to Use String Hashing
- Detect duplicate substrings (with binary search on length)
- Multi-pattern matching (hash all patterns, check each window)
- Comparing all substrings of two strings (anagram detection, LCS)
- Rolling window equivalence checks
Google coding interviews test string hashing and pattern matching. See common questions for Google interview: string hashing and pattern matching problems.
Meta coding interviews test string hashing and substring matching. See patterns for Meta interview: string hashing and substring problems.
Atlassian coding rounds include string hashing and pattern matching. See patterns for Atlassian interview: string hashing and pattern problems.