String pattern matching — finding a pattern within a larger text — is a fundamental algorithm that powers text editors, search engines, DNA sequence analysis, and plagiarism detection. While brute force is O(N*M), several algorithms achieve O(N+M). This guide covers KMP, Rabin-Karp, and the Z-algorithm with their interview applications and implementation details.
Brute Force and Its Limitations
Naive approach: for each position i in the text, check if the pattern matches starting at i. Compare text[i..i+M-1] with pattern[0..M-1]. If all characters match, found. If any mismatch, move to i+1 and restart. Time: O(N*M) worst case (e.g., text = “AAAAAAB”, pattern = “AAAB” — at each position, you compare M characters before finding the mismatch). For most practical inputs (random text), the average case is closer to O(N) because mismatches happen early. But for structured or adversarial inputs, the worst case is real. The key inefficiency: when a mismatch occurs at position j in the pattern, brute force discards all information from the partial match and restarts. KMP and Z-algorithm exploit the structure of the pattern to avoid redundant comparisons.
KMP (Knuth-Morris-Pratt) Algorithm
KMP preprocesses the pattern to build a failure function (also called the prefix function or partial match table). The failure function tells you: if a mismatch occurs at position j in the pattern, what is the longest proper prefix of pattern[0..j-1] that is also a suffix? This prefix has already been matched — no need to re-check it. Failure function construction: for pattern “ABCABD”: failure[0] = 0 (no proper prefix), failure[1] = 0 (“AB” has no matching prefix-suffix), failure[2] = 0, failure[3] = 1 (“ABCA” — “A” is both prefix and suffix), failure[4] = 2 (“ABCAB” — “AB” is both prefix and suffix), failure[5] = 0. Matching: compare text[i] with pattern[j]. If match, advance both. If mismatch: if j > 0, set j = failure[j-1] (jump back to the longest matching prefix). If j == 0, advance i. The text pointer i never moves backward — it only advances. Time: O(N+M) — N for matching, M for building the failure function. Space: O(M) for the failure function. KMP is the standard algorithm for single-pattern exact matching.
Rabin-Karp Algorithm
Rabin-Karp uses hashing to compare the pattern with substrings of the text. Compute hash(pattern). Slide a window of length M over the text. At each position, compute hash(text[i..i+M-1]). If the hash matches hash(pattern), verify with a character-by-character comparison (to handle hash collisions). Rolling hash: computing the hash from scratch at each position is O(M), making the total O(N*M). A rolling hash updates the hash in O(1) by removing the contribution of the leftmost character and adding the rightmost. Polynomial rolling hash: H = (c0 * p^(M-1) + c1 * p^(M-2) + … + c(M-1)) mod q. When sliding: H_new = (H – c_old * p^(M-1)) * p + c_new. Time: O(N+M) expected. Worst case O(N*M) if many hash collisions occur (adversarial input). Advantages over KMP: extends easily to multi-pattern matching (check if the hash matches any of K patterns), 2D pattern matching (search for a matrix pattern in a larger matrix), and approximate matching (find substrings with similar hashes). Rabin-Karp is preferred when you need to search for multiple patterns simultaneously or when the pattern comparison is expensive.
Z-Algorithm
The Z-algorithm computes the Z-array: Z[i] = the length of the longest substring starting at position i that matches a prefix of the string. For string “aabxaab”: Z = [-, 1, 0, 0, 3, 1, 0] (Z[0] is undefined or equals the string length). Pattern matching with Z: concatenate pattern + “$” + text (the “$” is a sentinel not in either string). Compute the Z-array. Any position i where Z[i] == len(pattern) is a match starting at position i – len(pattern) – 1 in the text. Time: O(N+M). The Z-algorithm is simpler to implement than KMP and solves the same problems. It also solves additional problems: find the longest prefix of the string that appears as a substring elsewhere, count the number of distinct substrings, and string compression (find the shortest period). Interview applications: (1) Pattern matching (as above). (2) String compression: the shortest period of “abcabcabc” is “abc” (length 3). If N % period == 0, the string is periodic. (3) Longest palindromic prefix: reverse the string, concatenate original + “$” + reversed, compute Z-array on the concatenated string. The largest Z[i] in the reversed part that equals N-i gives the palindromic prefix.
When to Use Each Algorithm
Decision guide: (1) Single pattern, exact match — KMP or Z-algorithm. Both O(N+M). KMP is more commonly asked in interviews. Z-algorithm is simpler to implement. (2) Multiple patterns simultaneously — Rabin-Karp (hash each pattern, check rolling hash against all) or Aho-Corasick (automaton-based, O(N + M + Z) where Z is matches). (3) Approximate matching (allow K mismatches or edits) — dynamic programming (edit distance), not KMP. (4) Repeated substring queries on the same text — suffix array + LCP array. Preprocessing O(N log N), each query O(M log N). (5) Simple interview problems (“implement strStr()”) — KMP or even brute force with explanation of the optimization. In interviews, the most commonly asked pattern matching problems are: implement strStr (KMP), repeated substring pattern (Z-array or KMP failure function), shortest palindrome (KMP or Z on reversed + original), and longest happy prefix (KMP failure function on the string itself). Know KMP well — it appears in ~5% of string problems at top companies.