String Hashing Interview Patterns

Why String Hashing?

Naive string comparison is O(L) per comparison, where L is the string length. String hashing converts a string to an integer in O(L) preprocessing, after which comparisons take O(1). This enables: finding duplicate substrings in O(n) average time, Rabin-Karp pattern matching in O(n+m), and working with all O(n²) substrings in O(n²) instead of O(n³), since each comparison drops from O(L) to O(1).

Polynomial Rolling Hash

# hash(s) = s[0]*p^(L-1) + s[1]*p^(L-2) + ... + s[L-1]*p^0  (mod M)
# p = 31 (for lowercase letters), M = large prime (10^9+7)

def build_hash(s, p=31, M=10**9+7):
    # Horner evaluation of the polynomial above; O(L) time
    h = 0
    for c in s:
        h = (h * p + ord(c) - ord('a') + 1) % M
    return h
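A standalone sanity check of the polynomial hash (Horner evaluation of the formula above; `poly_hash` and the example strings are illustrative, not from the original):

```python
def poly_hash(s, p=31, M=10**9+7):
    # hash(s) = s[0]*p^(L-1) + ... + s[L-1]*p^0  (mod M), via Horner's rule
    h = 0
    for c in s:
        h = (h * p + ord(c) - ord('a') + 1) % M
    return h

assert poly_hash("needle") == poly_hash("needle")  # equal strings hash equally
assert poly_hash("abc") != poly_hash("acb")        # character order matters
```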

For substring hashing with O(1) queries, use prefix hashes:

def build_prefix_hash(s, p=31, M=10**9+7):
    # h[i] = hash of prefix s[:i] in the Horner form above; p_pow[i] = p^i mod M
    n = len(s)
    h = [0] * (n + 1)
    p_pow = [1] * (n + 1)
    for i in range(n):
        h[i+1] = (h[i] * p + ord(s[i]) - ord('a') + 1) % M
        p_pow[i+1] = p_pow[i] * p % M
    return h, p_pow

def get_hash(h, p_pow, l, r, M=10**9+7):
    # hash of s[l..r] (inclusive); Python's % keeps the result non-negative
    return (h[r+1] - h[l] * p_pow[r-l+1]) % M
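A self-contained sketch of the O(1) substring comparison, using Horner-style prefix hashes as above (`prefix_hash`/`substr_hash` are illustrative names, `abracadabra` an arbitrary example):

```python
def prefix_hash(s, p=31, M=10**9+7):
    # h[i] = hash of prefix s[:i] (Horner form); p_pow[i] = p^i mod M
    n = len(s)
    h, p_pow = [0] * (n + 1), [1] * (n + 1)
    for i in range(n):
        h[i+1] = (h[i] * p + ord(s[i]) - ord('a') + 1) % M
        p_pow[i+1] = p_pow[i] * p % M
    return h, p_pow

def substr_hash(h, p_pow, l, r, M=10**9+7):
    # hash of s[l..r] inclusive, in O(1)
    return (h[r+1] - h[l] * p_pow[r-l+1]) % M

s = "abracadabra"          # "abra" occurs at indices 0 and 7
h, pw = prefix_hash(s)
assert substr_hash(h, pw, 0, 3) == substr_hash(h, pw, 7, 10)   # "abra" == "abra"
assert substr_hash(h, pw, 0, 3) != substr_hash(h, pw, 1, 4)    # "abra" != "brac"
```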

Rabin-Karp Pattern Matching

Find all occurrences of pattern P in text T. Compute hash(P) and rolling hash of each window of length |P| in T. On hash match, verify with string comparison (to handle collisions). Average O(n+m), worst case O(nm) if many hash collisions.

def rabin_karp(text, pattern, p=31, M=10**9+7):
    n, m = len(text), len(pattern)
    if m > n: return []

    h_t, p_pow = build_prefix_hash(text, p, M)
    h_p, _ = build_prefix_hash(pattern, p, M)
    target = get_hash(h_p, p_pow, 0, m-1, M)  # text's p_pow works here: same p and M

    result = []
    for i in range(n - m + 1):
        if get_hash(h_t, p_pow, i, i+m-1, M) == target:
            if text[i:i+m] == pattern:  # verify to avoid false positive
                result.append(i)
    return result
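The implementation above reuses prefix hashes; the classic Rabin-Karp formulation instead rolls a single window hash in O(1) per step, dropping the leftmost character's contribution, shifting, and appending the new character. A self-contained sketch (`rabin_karp_rolling` is an illustrative name):

```python
def rabin_karp_rolling(text, pattern, p=31, M=10**9+7):
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    val = lambda c: ord(c) - ord('a') + 1   # a=1 ... z=26, as above
    pm = pow(p, m - 1, M)                   # weight of the window's leftmost char
    hp = ht = 0
    for i in range(m):                      # Horner hashes of pattern and first window
        hp = (hp * p + val(pattern[i])) % M
        ht = (ht * p + val(text[i])) % M
    out = []
    for i in range(n - m + 1):
        if ht == hp and text[i:i+m] == pattern:  # verify on hash match
            out.append(i)
        if i + m < n:                       # roll: drop text[i], append text[i+m]
            ht = ((ht - val(text[i]) * pm) * p + val(text[i + m])) % M
    return out
```

The roll costs O(1), so the scan is O(n+m) overall rather than O(nm).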

Longest Duplicate Substring (LC 1044)

Binary search on the length L: if a duplicate substring of length L exists, so does one of every shorter length, so the predicate is monotone. For each candidate length, check whether any substring of length L appears twice using a hash set. O(n log n) with hashing; a suffix array with LCP is also O(n log n) and collision-free (O(n log² n) with a simple doubling construction).

def longestDupSubstring(s):
    h, p_pow = build_prefix_hash(s)  # build once, reuse for every length

    def has_dup(L):
        # return the start of a repeated substring of length L, or -1
        seen = {}  # hash -> start index of first occurrence
        for i in range(len(s) - L + 1):
            hv = get_hash(h, p_pow, i, i+L-1)
            j = seen.get(hv)
            if j is not None and s[j:j+L] == s[i:i+L]:  # verify on hash match
                return i
            seen[hv] = i
        return -1

    lo, hi = 1, len(s) - 1
    start = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        pos = has_dup(mid)
        if pos != -1:
            start = pos
            lo = mid + 1
        else:
            hi = mid - 1
    return s[start:start+hi]  # hi ends at the longest duplicated length
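A condensed, self-contained check of the whole pipeline (binary search + prefix hashes); `longest_dup` is an illustrative name, and `banana` an arbitrary example whose longest repeated substring is `ana`:

```python
def longest_dup(s, p=31, M=10**9+7):
    # condensed version of the approach above, for a quick end-to-end check
    n = len(s)
    h, pw = [0] * (n + 1), [1] * (n + 1)
    for i in range(n):
        h[i+1] = (h[i] * p + ord(s[i]) - ord('a') + 1) % M
        pw[i+1] = pw[i] * p % M

    def sub(l, r):                     # hash of s[l..r] in O(1)
        return (h[r+1] - h[l] * pw[r-l+1]) % M

    def dup(L):                        # start of a repeat of length L, or -1
        seen = {}
        for i in range(n - L + 1):
            hv = sub(i, i+L-1)
            j = seen.get(hv)
            if j is not None and s[j:j+L] == s[i:i+L]:
                return i
            seen[hv] = i
        return -1

    lo, hi, start = 1, n - 1, 0
    while lo <= hi:                    # binary search on the length
        mid = (lo + hi) // 2
        pos = dup(mid)
        if pos != -1:
            start, lo = pos, mid + 1
        else:
            hi = mid - 1
    return s[start:start+hi]

assert longest_dup("banana") == "ana"
assert longest_dup("abcd") == ""       # no repeat -> empty string
```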

Double Hashing (Reduce Collision Probability)

Use two independent hash functions and store (hash1, hash2) pairs. Collision probability per comparison: ~1/(M1*M2) ≈ 10^-18 for M1 ≈ M2 ≈ 10^9. In competitive programming, double hashing makes anti-hash attacks (inputs crafted to force single-hash collisions) infeasible.

def double_hash(s):
    M1, M2 = 10**9+7, 10**9+9
    P1, P2 = 31, 37
    h1 = build_hash(s, P1, M1)
    h2 = build_hash(s, P2, M2)
    return (h1, h2)  # collision probability ~10^-18
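The snippet above double-hashes whole strings; the same idea extends to substring queries with two prefix-hash tables. A sketch, with `build_double` and `substr_key` as illustrative names:

```python
def build_double(s, params=((31, 10**9 + 7), (37, 10**9 + 9))):
    # one (prefix-hash, power, modulus) table per (p, M) pair
    tables = []
    for p, M in params:
        n = len(s)
        h, pw = [0] * (n + 1), [1] * (n + 1)
        for i in range(n):
            h[i + 1] = (h[i] * p + ord(s[i]) - ord('a') + 1) % M
            pw[i + 1] = pw[i] * p % M
        tables.append((h, pw, M))
    return tables

def substr_key(tables, l, r):
    # (hash1, hash2) pair identifying s[l..r] under both moduli
    return tuple((h[r + 1] - h[l] * pw[r - l + 1]) % M for h, pw, M in tables)

tables = build_double("abracadabra")   # "abra" at indices 0 and 7
assert substr_key(tables, 0, 3) == substr_key(tables, 7, 10)
```

Compare the tuples wherever a single hash would be compared; both components must collide for a false positive.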

String Hashing vs Suffix Array vs Z-Algorithm

  • Rolling hash: simple to implement, O(n) average for pattern matching, handles arbitrary substring queries in O(1) after O(n) preprocessing. Risk of collisions.
  • KMP/Z-algorithm: exact O(n+m) pattern matching, no collision risk. But handles only one pattern at a time.
  • Suffix array: O(n log n) build, O(m log n) search, handles all pattern matching queries. Used for longest common substring of multiple strings.
  • For interviews: rolling hash is usually sufficient. Mention collision handling (double hash or verify on match).

When to Use String Hashing

  • Detect duplicate substrings (with binary search on length)
  • Multi-pattern matching (hash all patterns, check each window)
  • Comparing substrings across two strings (e.g., longest common substring via binary search + hashing)
  • Rolling window equivalence checks
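The multi-pattern bullet above can be sketched for equal-length patterns: hash every pattern into a set, then roll one window hash over the text and check each window against the set (`multi_pattern_search` is an illustrative name):

```python
def multi_pattern_search(text, patterns, p=31, M=10**9+7):
    m = len(patterns[0])               # all patterns assumed the same length
    def horner(s):
        h = 0
        for c in s:
            h = (h * p + ord(c) - ord('a') + 1) % M
        return h

    targets = {}                       # hash -> patterns with that hash
    for pat in patterns:
        targets.setdefault(horner(pat), []).append(pat)

    hits = []
    pm = pow(p, m - 1, M)              # weight of the window's leftmost char
    ht = horner(text[:m])
    for i in range(len(text) - m + 1):
        for pat in targets.get(ht, []):
            if text[i:i+m] == pat:     # verify on hash match
                hits.append((i, pat))
        if i + m < len(text):          # roll the window by one character
            ht = ((ht - (ord(text[i]) - ord('a') + 1) * pm) * p
                  + ord(text[i + m]) - ord('a') + 1) % M
    return hits
```

Each window is checked against all k patterns in O(1) expected time, versus O(k) runs of KMP.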


