Introduction
Every time you hit Ctrl+F in your browser, search for a specific log line in a massive 50GB file, or rely on a DNA sequencer to find exact gene sequences, string-matching algorithms are working silently behind the scenes.
The String Matching Problem answers a fundamental computer science question: given a large text T of length n and a shorter pattern P of length m, where exactly does P occur in T?
While a naive brute-force approach works for small strings, it degrades drastically on large inputs or when performing millions of searches. Over the decades, computer scientists have developed sophisticated mathematical approaches that avoid redundant comparisons. This article explores the four most important string search algorithms you need to know in 2026.
1. The Naive Approach (Brute Force)
The most intuitive way to solve this problem is to align the pattern at the very first character of the text, check if it matches, and if not, shift the pattern to the right by exactly one character and try again.
Python Implementation
```python
def naive_search(text: str, pattern: str) -> list[int]:
    n, m = len(text), len(pattern)
    matches = []
    # We only need to slide across n - m + 1 alignments
    for i in range(n - m + 1):
        match = True
        for j in range(m):
            if text[i + j] != pattern[j]:
                match = False
                break
        if match:
            matches.append(i)
    return matches
```
Complexity
- Time Complexity: $O(n \times m)$ in the worst case (e.g., searching for `AAAAAB` in `AAAAAAAAAAAAA`).
- Space Complexity: $O(1)$.
- Verdict: Too slow for production systems parsing large payloads.
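To see the quadratic behavior concretely, here is a small self-contained sketch (it mirrors the inner loop of `naive_search` above, with a counter added) that tallies character comparisons on a worst-case input:

```python
# Counts the character comparisons made by the naive approach; mirrors
# naive_search above but tallies every text/pattern comparison.
def naive_comparisons(text: str, pattern: str) -> int:
    n, m = len(text), len(pattern)
    comparisons = 0
    for i in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if text[i + j] != pattern[j]:
                break
    return comparisons

# Worst case: every alignment matches m - 1 characters before failing.
print(naive_comparisons("A" * 20, "AAAAAB"))  # 15 alignments x 6 comparisons = 90
```

Every one of the 15 possible alignments burns through all 6 pattern characters before discovering the mismatch, which is exactly the $O(n \times m)$ blow-up.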
2. Knuth-Morris-Pratt (KMP) Algorithm
The KMP algorithm achieves a linear $O(n + m)$ time complexity by mathematically pre-processing the pattern to understand its own repetitive internal structure.
The genius insight of KMP is simple: When a mismatch occurs, the characters you just successfully matched contain structural information that dictates exactly how far you can jump forward safely, bypassing redundant comparisons.
The “LPS” Array (Longest Proper Prefix which is also Suffix)
To know how far to jump, KMP calculates an LPS array of size m (the length of the pattern).
For example, if the pattern is `ABABCABABA`:

- The prefix `ABAB` has a proper prefix `AB` that also appears as its suffix `AB`, so its LPS value is `2`.
```python
def compute_lps(pattern: str) -> list[int]:
    m = len(pattern)
    lps = [0] * m
    length = 0  # length of the previous longest prefix-suffix
    i = 1
    while i < m:
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        else:
            if length != 0:
                # Tricky part: we do not increment i here
                length = lps[length - 1]
            else:
                lps[i] = 0
                i += 1
    return lps
```
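A quick runnable check of `compute_lps` on the example pattern above (the function body is repeated here so the snippet runs on its own):

```python
def compute_lps(pattern: str) -> list[int]:
    m = len(pattern)
    lps = [0] * m
    length = 0  # length of the previous longest prefix-suffix
    i = 1
    while i < m:
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        elif length != 0:
            length = lps[length - 1]  # fall back; do not advance i
        else:
            lps[i] = 0
            i += 1
    return lps

print(compute_lps("ABABCABABA"))  # [0, 0, 1, 2, 0, 1, 2, 3, 4, 3]
```

Note how the final entry is `3`: the longest proper prefix of `ABABCABABA` that is also a suffix is `ABA`.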
Searching with KMP
Once the LPS array is calculated, we traverse the text. When a mismatch occurs at `text[i]` and `pattern[j]`, instead of resetting `j` to 0 and `i` to `i - j + 1` (as the naive approach does), we simply set `j = lps[j - 1]` and leave `i` exactly where it is!
```python
def kmp_search(text: str, pattern: str) -> list[int]:
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    lps = compute_lps(pattern)
    matches = []
    i = 0  # index for text
    j = 0  # index for pattern
    while i < n:
        if pattern[j] == text[i]:
            i += 1
            j += 1
        if j == m:
            matches.append(i - j)
            j = lps[j - 1]
        elif i < n and pattern[j] != text[i]:
            if j != 0:
                j = lps[j - 1]
            else:
                i += 1
    return matches
```
Complexity
- Time Complexity: $O(n + m)$. The text index `i` never moves backwards.
- Space Complexity: $O(m)$ to store the LPS array.
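For a self-contained demonstration, here are the two functions above combined into one runnable snippet, applied to a classic example input:

```python
def compute_lps(pattern: str) -> list[int]:
    m = len(pattern)
    lps = [0] * m
    length = 0
    i = 1
    while i < m:
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        elif length != 0:
            length = lps[length - 1]
        else:
            lps[i] = 0
            i += 1
    return lps

def kmp_search(text: str, pattern: str) -> list[int]:
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    lps = compute_lps(pattern)
    matches = []
    i = j = 0
    while i < n:
        if pattern[j] == text[i]:
            i += 1
            j += 1
        if j == m:
            matches.append(i - j)
            j = lps[j - 1]
        elif i < n and pattern[j] != text[i]:
            if j != 0:
                j = lps[j - 1]
            else:
                i += 1
    return matches

# Overlapping occurrences are found in a single left-to-right pass.
print(kmp_search("AABAACAADAABAABA", "AABA"))  # [0, 9, 12]
```

Note that the text index only ever moves forward; all the backtracking happens inside the pattern via the LPS array.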
3. Rabin-Karp Algorithm (Rolling Hash)
Rabin-Karp takes a numerical (hashing) approach rather than a structural one. It calculates a hash for the target pattern, then slides a “window” across the text, calculating the hash of the text inside the window. If the hashes match, it verifies the candidate character by character (to rule out hash collisions).
The Rolling Hash Magic
Recalculating a hash for every window from scratch would make the algorithm $O(n \times m)$. The trick is a Rolling Hash. When the window slides one character to the right, we subtract the contribution of the character falling off the left side, multiply by the base, and add the new character entering on the right: an $O(1)$ update!
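To make the arithmetic concrete before the full implementation, here is a toy rolling hash (the base `d = 256` and the small modulus `q = 101` are arbitrary choices for the sketch):

```python
d, q = 256, 101  # base and modulus (q is kept tiny here only for illustration)

def window_hash(s: str) -> int:
    """Hash a string as a base-d number, reduced mod q."""
    h = 0
    for ch in s:
        h = (d * h + ord(ch)) % q
    return h

text = "abcd"
m = 3
weight = pow(d, m - 1, q)  # weight of the leftmost character in a window

old = window_hash(text[0:3])  # hash of "abc"
# Roll: drop 'a' from the left, shift everything up by the base, add 'd'.
new = (d * (old - ord(text[0]) * weight) + ord(text[3])) % q

assert new == window_hash(text[1:4])  # identical to hashing "bcd" from scratch
print(old, new)  # 90 31
```

The rolled value agrees with a from-scratch hash of the next window, but costs only a handful of arithmetic operations regardless of `m`.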
```python
def rabin_karp_search(text: str, pattern: str) -> list[int]:
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    d = 256        # number of characters in the input alphabet
    q = 10**9 + 7  # a large prime to keep collisions rare
    p_hash = 0     # hash value for pattern
    t_hash = 0     # hash value for current text window
    h = pow(d, m - 1, q)  # weight of the leftmost character (modular pow)
    matches = []
    # 1. Calculate initial hash values for pattern and first window
    for i in range(m):
        p_hash = (d * p_hash + ord(pattern[i])) % q
        t_hash = (d * t_hash + ord(text[i])) % q
    # 2. Slide the window over the text one position at a time
    for i in range(n - m + 1):
        # 3. If hashes match, verify character by character
        if p_hash == t_hash:
            if text[i:i+m] == pattern:
                matches.append(i)
        # 4. Roll the hash to the next window
        if i < n - m:
            t_hash = (d * (t_hash - ord(text[i]) * h) + ord(text[i + m])) % q
            # Python's % always returns a non-negative result, so no
            # extra correction step is needed here.
    return matches
```
Complexity
- Time Complexity: $O(n + m)$ on average, but $O(n \times m)$ in the worst case if hash collisions occur repeatedly.
- Space Complexity: $O(1)$.
- Best Use Case: Rabin-Karp shines when searching for multiple patterns simultaneously. You can pre-hash 100 patterns and check the rolling text hash against a hash set in $O(1)$ time per window.
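That multi-pattern idea can be sketched as follows. This is an illustrative simplification that assumes all patterns share the same length `m`; patterns of mixed lengths would be grouped by length and scanned per group:

```python
def multi_rabin_karp(text: str, patterns: set[str]) -> dict[str, list[int]]:
    """Find every occurrence of equal-length patterns in one pass over text."""
    d, q = 256, 10**9 + 7
    m = len(next(iter(patterns)))  # assumption: all patterns have length m
    n = len(text)
    found: dict[str, list[int]] = {p: [] for p in patterns}
    if m > n:
        return found

    def h(s: str) -> int:
        v = 0
        for ch in s:
            v = (d * v + ord(ch)) % q
        return v

    targets = {h(p) for p in patterns}  # pre-hash every pattern once
    weight = pow(d, m - 1, q)
    t_hash = h(text[:m])
    for i in range(n - m + 1):
        if t_hash in targets:           # O(1) set lookup per window
            window = text[i:i + m]
            if window in found:         # verify to rule out collisions
                found[window].append(i)
        if i < n - m:
            t_hash = (d * (t_hash - ord(text[i]) * weight) + ord(text[i + m])) % q
    return found

hits = multi_rabin_karp("the cat sat on the mat", {"cat", "mat", "bat"})
print(hits["cat"], hits["mat"], hits["bat"])  # [4] [19] []
```

The text is scanned once regardless of how many patterns are in the set, which is exactly where this beats running a single-pattern search per pattern.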
4. Boyer-Moore Algorithm (The Industry Standard)
If you use grep or your code editor’s Cmd+F, you are almost certainly using a variation of Boyer-Moore.
Unlike KMP or the naive approach, Boyer-Moore flips the strategy entirely: it aligns the pattern with the text, but compares the characters backwards, from right to left!
The Bad Character Heuristic
If we match backwards and hit a mismatch on a text character that does not exist anywhere in the pattern, we don't just shift by 1. We can jump the pattern entirely past that foreign character, shifting by up to m positions instantly!
By combining the “Bad Character Heuristic” and the “Good Suffix Heuristic”, Boyer-Moore skips massive chunks of the text. Because it skips so aggressively, it actually becomes faster as the pattern gets longer.
```python
# Simplified Boyer-Moore using only the Bad Character heuristic
def boyer_moore_search(text: str, pattern: str) -> list[int]:
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Preprocess: record the right-most occurrence of every character
    bad_char = {}
    for i in range(m):
        bad_char[pattern[i]] = i
    matches = []
    s = 0  # shift of the pattern with respect to the text
    while s <= n - m:
        j = m - 1
        # Keep reducing index j while characters match (right to left)
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1
        if j < 0:
            # Pattern found!
            matches.append(s)
            # Shift so the next text character aligns with its last occurrence
            s += (m - bad_char.get(text[s + m], -1)) if s + m < n else 1
        else:
            # Mismatch: shift pattern to align the bad character
            # max() ensures we never shift backwards
            s += max(1, j - bad_char.get(text[s + j], -1))
    return matches
```
Complexity
- Time Complexity: Best case reaches an astonishing $O(n / m)$: longer patterns allow bigger jumps, so searches can actually get faster as the pattern grows. Note that the bad-character-only variant shown above degrades to $O(n \times m)$ in the worst case.
- Space Complexity: $O(\Sigma)$ where $\Sigma$ is the size of the alphabet.
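A quick runnable check of the bad-character version above (the function is repeated so the snippet stands alone):

```python
def boyer_moore_search(text: str, pattern: str) -> list[int]:
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    bad_char = {}
    for i in range(m):
        bad_char[pattern[i]] = i  # right-most occurrence of each character
    matches = []
    s = 0
    while s <= n - m:
        j = m - 1
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1
        if j < 0:
            matches.append(s)
            s += (m - bad_char.get(text[s + m], -1)) if s + m < n else 1
        else:
            s += max(1, j - bad_char.get(text[s + j], -1))
    return matches

print(boyer_moore_search("ABAAABCD", "ABC"))  # [4]
```

On this input the very first comparison (`C` vs `A`) already triggers a jump of 2, illustrating how mismatching early, near the right end of the pattern, is what makes Boyer-Moore fast.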
Algorithm Selection Guide
| Algorithm | Average Time | Space | Best Use Case |
|---|---|---|---|
| Naive | $O(n \times m)$ | $O(1)$ | Tiny strings, simple scripts. |
| KMP | $O(n+m)$ | $O(m)$ | Guaranteeing worst-case linear time. DNA sequences with small alphabets (A,C,T,G). |
| Rabin-Karp | $O(n+m)$ | $O(1)$ | Searching for multiple patterns concurrently (Plagiarism matching). |
| Boyer-Moore | $O(n/m)$ | $O(\Sigma)$ | Most real-world production text engines, editors, and grep. |
Summary
String matching forms the backbone of digital search.
- Use KMP when you absolutely cannot afford worst-case performance degradation.
- Rely on Rabin-Karp’s rolling hash when you need to match 10,000 different banned words against a document in one pass.
- Use Boyer-Moore for almost all standard day-to-day text searching; its right-to-left scanning lets it skip large portions of the text.
Advanced Multiple-Pattern Matching: Aho-Corasick
When you need to search a text against a massive dictionary of patterns (e.g., thousands of malicious virus signatures, or a list of banned words for an online forum), running KMP or Boyer-Moore thousands of times sequentially becomes an $O(\text{patterns} \times n)$ bottleneck.
Automaton Machines
The Aho-Corasick algorithm elegantly solves this by constructing a massive Trie (Prefix Tree) out of the entire dictionary of patterns, and then converting that Trie into a specialized Finite State Machine.
As you stream the enormous text exactly once, you trace down the nodes of the Trie. If a mismatch occurs, Aho-Corasick relies on pre-computed “Failure Links”—extremely similar to KMP’s LPS array—that safely redirect your state to the next longest possible matching prefix without ever reversing the text index.
This results in a breathtaking $O(n + m + z)$ time complexity, where:
- $n$ is the length of the text.
- $m$ is the combined length of every single pattern in the dictionary.
- $z$ is the total count of matches found.
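The mechanism described above can be sketched in a compact, illustrative (not production-grade) implementation. The trie is stored as a list of per-state dictionaries, and the failure links are built with a breadth-first pass, exactly the KMP-style fallback generalized to a tree:

```python
from collections import deque

def aho_corasick(text: str, patterns: list[str]) -> list[tuple[int, str]]:
    """Return (start_index, pattern) pairs for every match, scanning text once."""
    # 1. Build the trie of all patterns.
    goto: list[dict[str, int]] = [{}]  # goto[state][char] -> next state
    out: list[list[str]] = [[]]        # patterns recognized at each state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)

    # 2. Build failure links breadth-first (root's children fail to the root).
    fail = [0] * len(goto)
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] += out[fail[nxt]]  # inherit matches reachable via failure

    # 3. Stream the text exactly once, following goto/failure transitions.
    matches = []
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            matches.append((i - len(pat) + 1, pat))
    return matches

print(aho_corasick("ushers", ["he", "she", "his", "hers"]))
# [(1, 'she'), (2, 'he'), (2, 'hers')]
```

Note how scanning `ushers` reports the overlapping hits for `she`, `he`, and `hers` in one pass: the failure links let the automaton recognize `he` as a suffix of `she` without ever moving the text index backwards.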
Most modern Intrusion Detection Systems (IDS/IPS) rely heavily on Aho-Corasick, and the original Unix fgrep was built on it.
Resources
- MIT OpenCourseWare: String Matching (KMP & Rabin-Karp)
- GeeksForGeeks: KMP Algorithm
- Wikipedia: Boyer-Moore Algorithm