Introduction
Every time you hit Ctrl+F in your browser, search for a specific log line in a massive 50GB file, or rely on a DNA sequencer to find exact gene sequences, string-matching algorithms are working silently behind the scenes.
The String Matching Problem answers a fundamental computer science question: given a large text T of length n and a shorter pattern P of length m, where exactly does P occur in T?
While a naive brute-force approach works for small strings, it degrades drastically on large inputs or when performing millions of searches. Over the decades, computer scientists have developed sophisticated mathematical approaches that avoid redundant comparisons. This article explores the four most important string search algorithms you need to know in 2026.
1. The Naive Approach (Brute Force)
The most intuitive way to solve this problem is to align the pattern at the very first character of the text, check if it matches, and if not, shift the pattern to the right by exactly one character and try again.
Python Implementation
```python
def naive_search(text: str, pattern: str) -> list[int]:
    n, m = len(text), len(pattern)
    matches = []
    # We only need to slide across n - m + 1 alignments
    for i in range(n - m + 1):
        match = True
        for j in range(m):
            if text[i + j] != pattern[j]:
                match = False
                break
        if match:
            matches.append(i)
    return matches
```
Complexity
- Time Complexity: $O(n \times m)$ in the worst case (e.g., searching for `AAAAAB` in `AAAAAAAAAAAAA`).
- Space Complexity: $O(1)$.
- Verdict: Too slow for production systems parsing large payloads.
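To see the quadratic behavior concretely, here is a small self-contained sketch (it mirrors the inner loop of `naive_search` above, with a counter added) that tallies character comparisons on a worst-case input:

```python
# Counts the character comparisons made by the naive approach; mirrors
# naive_search above but tallies every text/pattern comparison.
def naive_comparisons(text: str, pattern: str) -> int:
    n, m = len(text), len(pattern)
    comparisons = 0
    for i in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if text[i + j] != pattern[j]:
                break
    return comparisons

# Worst case: every alignment matches m - 1 characters before failing.
print(naive_comparisons("A" * 20, "AAAAAB"))  # 15 alignments x 6 comparisons = 90
```

Every one of the 15 possible alignments burns through all 6 pattern characters before discovering the mismatch, which is exactly the $O(n \times m)$ blow-up.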
2. Knuth-Morris-Pratt (KMP) Algorithm
The KMP algorithm achieves a linear $O(n + m)$ time complexity by mathematically pre-processing the pattern to understand its own repetitive internal structure.
The genius insight of KMP is simple: When a mismatch occurs, the characters you just successfully matched contain structural information that dictates exactly how far you can jump forward safely, bypassing redundant comparisons.
The “LPS” Array (Longest Proper Prefix which is also Suffix)
To know how far to jump, KMP calculates an LPS array of size m (the length of the pattern).
For example, if the pattern is `ABABCABABA`:

- The prefix `ABAB` has a proper prefix `AB` that also appears as its suffix `AB`, so its LPS value is `2`.
```python
def compute_lps(pattern: str) -> list[int]:
    m = len(pattern)
    lps = [0] * m
    length = 0  # length of the previous longest prefix-suffix
    i = 1
    while i < m:
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        else:
            if length != 0:
                # Tricky part: we do not increment i here
                length = lps[length - 1]
            else:
                lps[i] = 0
                i += 1
    return lps
```
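A quick runnable check of `compute_lps` on the example pattern above (the function body is repeated here so the snippet runs on its own):

```python
def compute_lps(pattern: str) -> list[int]:
    m = len(pattern)
    lps = [0] * m
    length = 0  # length of the previous longest prefix-suffix
    i = 1
    while i < m:
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        elif length != 0:
            length = lps[length - 1]  # fall back; do not advance i
        else:
            lps[i] = 0
            i += 1
    return lps

print(compute_lps("ABABCABABA"))  # [0, 0, 1, 2, 0, 1, 2, 3, 4, 3]
```

Note how the final entry is `3`: the longest proper prefix of `ABABCABABA` that is also a suffix is `ABA`.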
Searching with KMP
Once the LPS array is calculated, we traverse the text. When a mismatch occurs at `text[i]` and `pattern[j]`, instead of resetting `j` to 0 and `i` to `i - j + 1` (as the naive approach does), we simply set `j = lps[j - 1]` and leave `i` exactly where it is!
```python
def kmp_search(text: str, pattern: str) -> list[int]:
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    lps = compute_lps(pattern)
    matches = []
    i = 0  # index for text
    j = 0  # index for pattern
    while i < n:
        if pattern[j] == text[i]:
            i += 1
            j += 1
        if j == m:
            matches.append(i - j)
            j = lps[j - 1]
        elif i < n and pattern[j] != text[i]:
            if j != 0:
                j = lps[j - 1]
            else:
                i += 1
    return matches
```
Complexity
- Time Complexity: $O(n + m)$. The text index `i` never moves backwards.
- Space Complexity: $O(m)$ to store the LPS array.
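For a self-contained demonstration, here are the two functions above combined into one runnable snippet, applied to a classic example input:

```python
def compute_lps(pattern: str) -> list[int]:
    m = len(pattern)
    lps = [0] * m
    length = 0
    i = 1
    while i < m:
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        elif length != 0:
            length = lps[length - 1]
        else:
            lps[i] = 0
            i += 1
    return lps

def kmp_search(text: str, pattern: str) -> list[int]:
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    lps = compute_lps(pattern)
    matches = []
    i = j = 0
    while i < n:
        if pattern[j] == text[i]:
            i += 1
            j += 1
        if j == m:
            matches.append(i - j)
            j = lps[j - 1]
        elif i < n and pattern[j] != text[i]:
            if j != 0:
                j = lps[j - 1]
            else:
                i += 1
    return matches

# Overlapping occurrences are found in a single left-to-right pass.
print(kmp_search("AABAACAADAABAABA", "AABA"))  # [0, 9, 12]
```

Note that the text index only ever moves forward; all the backtracking happens inside the pattern via the LPS array.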
3. Rabin-Karp Algorithm (Rolling Hash)
Rabin-Karp takes a numerical (hashing) approach rather than a structural one. It calculates a hash for the target pattern, then slides a “window” across the text, calculating the hash of the text inside the window. If the hashes match, it verifies the candidate character by character (to rule out hash collisions).
The Rolling Hash Magic
Recalculating a hash for every window from scratch would make the algorithm $O(n \times m)$. The trick is a Rolling Hash. When the window slides one character to the right, we subtract the contribution of the character falling off the left side, multiply by the base, and add the new character entering on the right: an $O(1)$ update!
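To make the arithmetic concrete before the full implementation, here is a toy rolling hash (the base `d = 256` and the small modulus `q = 101` are arbitrary choices for the sketch):

```python
d, q = 256, 101  # base and modulus (q is kept tiny here only for illustration)

def window_hash(s: str) -> int:
    """Hash a string as a base-d number, reduced mod q."""
    h = 0
    for ch in s:
        h = (d * h + ord(ch)) % q
    return h

text = "abcd"
m = 3
weight = pow(d, m - 1, q)  # weight of the leftmost character in a window

old = window_hash(text[0:3])  # hash of "abc"
# Roll: drop 'a' from the left, shift everything up by the base, add 'd'.
new = (d * (old - ord(text[0]) * weight) + ord(text[3])) % q

assert new == window_hash(text[1:4])  # identical to hashing "bcd" from scratch
print(old, new)  # 90 31
```

The rolled value agrees with a from-scratch hash of the next window, but costs only a handful of arithmetic operations regardless of `m`.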
```python
def rabin_karp_search(text: str, pattern: str) -> list[int]:
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    d = 256        # number of characters in the input alphabet
    q = 10**9 + 7  # a large prime to keep collisions rare
    p_hash = 0     # hash value for pattern
    t_hash = 0     # hash value for current text window
    h = pow(d, m - 1, q)  # weight of the leftmost character (modular pow)
    matches = []
    # 1. Calculate initial hash values for pattern and first window
    for i in range(m):
        p_hash = (d * p_hash + ord(pattern[i])) % q
        t_hash = (d * t_hash + ord(text[i])) % q
    # 2. Slide the window over the text one position at a time
    for i in range(n - m + 1):
        # 3. If hashes match, verify character by character
        if p_hash == t_hash:
            if text[i:i+m] == pattern:
                matches.append(i)
        # 4. Roll the hash to the next window
        if i < n - m:
            t_hash = (d * (t_hash - ord(text[i]) * h) + ord(text[i + m])) % q
            # Python's % always returns a non-negative result, so no
            # extra correction step is needed here.
    return matches
```
Complexity
- Time Complexity: $O(n + m)$ on average, but $O(n \times m)$ in the worst case if hash collisions occur repeatedly.
- Space Complexity: $O(1)$.
- Best Use Case: Rabin-Karp shines when searching for multiple patterns simultaneously. You can pre-hash 100 patterns and check the rolling text hash against a hash set in $O(1)$ time per window.
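That multi-pattern idea can be sketched as follows. This is an illustrative simplification that assumes all patterns share the same length `m`; patterns of mixed lengths would be grouped by length and scanned per group:

```python
def multi_rabin_karp(text: str, patterns: set[str]) -> dict[str, list[int]]:
    """Find every occurrence of equal-length patterns in one pass over text."""
    d, q = 256, 10**9 + 7
    m = len(next(iter(patterns)))  # assumption: all patterns have length m
    n = len(text)
    found: dict[str, list[int]] = {p: [] for p in patterns}
    if m > n:
        return found

    def h(s: str) -> int:
        v = 0
        for ch in s:
            v = (d * v + ord(ch)) % q
        return v

    targets = {h(p) for p in patterns}  # pre-hash every pattern once
    weight = pow(d, m - 1, q)
    t_hash = h(text[:m])
    for i in range(n - m + 1):
        if t_hash in targets:           # O(1) set lookup per window
            window = text[i:i + m]
            if window in found:         # verify to rule out collisions
                found[window].append(i)
        if i < n - m:
            t_hash = (d * (t_hash - ord(text[i]) * weight) + ord(text[i + m])) % q
    return found

hits = multi_rabin_karp("the cat sat on the mat", {"cat", "mat", "bat"})
print(hits["cat"], hits["mat"], hits["bat"])  # [4] [19] []
```

The text is scanned once regardless of how many patterns are in the set, which is exactly where this beats running a single-pattern search per pattern.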
4. Boyer-Moore Algorithm (The Industry Standard)
If you use grep or your code editor’s Cmd+F, you are almost certainly using a variation of Boyer-Moore.
Unlike KMP or the naive approach, Boyer-Moore flips the strategy entirely: it aligns the pattern with the text, but compares the characters backwards, from right to left!
The Bad Character Heuristic
If we match backwards and hit a mismatch on a text character that does not exist anywhere in the pattern, we don't just shift by 1. We can jump the pattern entirely past that foreign character, shifting by up to m positions instantly!
By combining the “Bad Character Heuristic” and the “Good Suffix Heuristic”, Boyer-Moore skips massive chunks of the text. Because it skips so aggressively, it actually becomes faster as the pattern gets longer.
```python
# Simplified Boyer-Moore using only the Bad Character heuristic
def boyer_moore_search(text: str, pattern: str) -> list[int]:
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Preprocess: record the right-most occurrence of every character
    bad_char = {}
    for i in range(m):
        bad_char[pattern[i]] = i
    matches = []
    s = 0  # shift of the pattern with respect to the text
    while s <= n - m:
        j = m - 1
        # Keep reducing index j while characters match (right to left)
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1
        if j < 0:
            # Pattern found!
            matches.append(s)
            # Shift so the next text character aligns with its last occurrence
            s += (m - bad_char.get(text[s + m], -1)) if s + m < n else 1
        else:
            # Mismatch: shift pattern to align the bad character
            # max() ensures we never shift backwards
            s += max(1, j - bad_char.get(text[s + j], -1))
    return matches
```
Complexity
- Time Complexity: Best case reaches an astonishing $O(n / m)$: longer patterns allow bigger jumps, so searches can actually get faster as the pattern grows. Note that the bad-character-only variant shown above degrades to $O(n \times m)$ in the worst case.
- Space Complexity: $O(\Sigma)$ where $\Sigma$ is the size of the alphabet.
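A quick runnable check of the bad-character version above (the function is repeated so the snippet stands alone):

```python
def boyer_moore_search(text: str, pattern: str) -> list[int]:
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    bad_char = {}
    for i in range(m):
        bad_char[pattern[i]] = i  # right-most occurrence of each character
    matches = []
    s = 0
    while s <= n - m:
        j = m - 1
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1
        if j < 0:
            matches.append(s)
            s += (m - bad_char.get(text[s + m], -1)) if s + m < n else 1
        else:
            s += max(1, j - bad_char.get(text[s + j], -1))
    return matches

print(boyer_moore_search("ABAAABCD", "ABC"))  # [4]
```

On this input the very first comparison (`C` vs `A`) already triggers a jump of 2, illustrating how mismatching early, near the right end of the pattern, is what makes Boyer-Moore fast.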
Algorithm Selection Guide
| Algorithm | Average Time | Space | Best Use Case |
|---|---|---|---|
| Naive | $O(n \times m)$ | $O(1)$ | Tiny strings, simple scripts. |
| KMP | $O(n+m)$ | $O(m)$ | Guaranteeing worst-case linear time. DNA sequences with small alphabets (A,C,T,G). |
| Rabin-Karp | $O(n+m)$ | $O(1)$ | Searching for multiple patterns concurrently (Plagiarism matching). |
| Boyer-Moore | $O(n/m)$ | $O(\Sigma)$ | Most real-world production text engines, editors, and grep. |
Summary
String matching forms the backbone of digital search.
- Use KMP when you absolutely cannot afford worst-case performance degradation.
- Rely on Rabin-Karp’s rolling hash when you need to match 10,000 different banned words against a document in one pass.
- Use Boyer-Moore for almost all standard day-to-day text searching; its right-to-left scanning lets it skip large portions of the text.
Advanced Multiple-Pattern Matching: Aho-Corasick
When you need to search a text against a massive dictionary of patterns (e.g., thousands of malicious virus signatures, or a list of banned words for an online forum), running KMP or Boyer-Moore thousands of times sequentially becomes an $O(\text{patterns} \times n)$ bottleneck.
Automaton Machines
The Aho-Corasick algorithm elegantly solves this by constructing a massive Trie (Prefix Tree) out of the entire dictionary of patterns, and then converting that Trie into a specialized Finite State Machine.
As you stream the enormous text exactly once, you trace down the nodes of the Trie. If a mismatch occurs, Aho-Corasick relies on pre-computed “Failure Links”—extremely similar to KMP’s LPS array—that safely redirect your state to the next longest possible matching prefix without ever reversing the text index.
This results in a breathtaking $O(n + m + z)$ time complexity, where:
- $n$ is the length of the text.
- $m$ is the combined length of every single pattern in the dictionary.
- $z$ is the total count of matches found.
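The mechanism described above can be sketched in a compact, illustrative (not production-grade) implementation. The trie is stored as a list of per-state dictionaries, and the failure links are built with a breadth-first pass, exactly the KMP-style fallback generalized to a tree:

```python
from collections import deque

def aho_corasick(text: str, patterns: list[str]) -> list[tuple[int, str]]:
    """Return (start_index, pattern) pairs for every match, scanning text once."""
    # 1. Build the trie of all patterns.
    goto: list[dict[str, int]] = [{}]  # goto[state][char] -> next state
    out: list[list[str]] = [[]]        # patterns recognized at each state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)

    # 2. Build failure links breadth-first (root's children fail to the root).
    fail = [0] * len(goto)
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] += out[fail[nxt]]  # inherit matches reachable via failure

    # 3. Stream the text exactly once, following goto/failure transitions.
    matches = []
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            matches.append((i - len(pat) + 1, pat))
    return matches

print(aho_corasick("ushers", ["he", "she", "his", "hers"]))
# [(1, 'she'), (2, 'he'), (2, 'hers')]
```

Note how scanning `ushers` reports the overlapping hits for `she`, `he`, and `hers` in one pass: the failure links let the automaton recognize `he` as a suffix of `she` without ever moving the text index backwards.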
Most modern Intrusion Detection Systems (IDS/IPS) rely heavily on Aho-Corasick, and the original Unix fgrep was built on it.
Resources
- MIT OpenCourseWare: String Matching (KMP & Rabin-Karp)
- GeeksForGeeks: KMP Algorithm
- Wikipedia: Boyer-Moore Algorithm