Introduction
Every time you press Ctrl+F in your browser, search for a file on your computer, or run a DNA sequencer that finds gene matches, efficient string matching algorithms are working behind the scenes. These algorithms answer a fundamental question: Does a pattern appear in a text, and if so, where?
While the naive approach works (check the pattern at every position), it can be too slow for large texts or repeated searches. Computer scientists have developed sophisticated algorithms that achieve much better performance. This article explores the most important ones.
The String Matching Problem
Given a text T of length n and a pattern P of length m (where m ≤ n), find all occurrences of P in T. Each match is reported at a position i where T[i…i+m-1] = P.
This seemingly simple problem appears everywhere:
- Text editors finding words
- Plagiarism detection
- Bioinformatics finding DNA patterns
- Network security scanning for malicious signatures
- Log analysis finding error patterns
The Naive Approach
The straightforward algorithm checks the pattern at each position:
```python
def naive_search(text, pattern):
    n, m = len(text), len(pattern)
    matches = []
    for i in range(n - m + 1):
        if text[i:i+m] == pattern:
            matches.append(i)
    return matches
```
This takes O(n × m) time: acceptable for small texts but painfully slow when searching large documents or performing many searches.
Knuth-Morris-Pratt (KMP) Algorithm
KMP achieves O(n + m) time by preprocessing the pattern to understand its own repetitive structure. The key insight is that when a mismatch occurs, we can use information about the pattern itself to avoid redundant comparisons.
The Failure Function:
First, compute the longest proper prefix that is also a suffix for each prefix of the pattern. This tells us how far we can shift the pattern when a mismatch occurs.
```python
def compute_lps(pattern):
    m = len(pattern)
    lps = [0] * m
    length = 0  # length of the previous longest prefix-suffix
    i = 1
    while i < m:
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        else:
            if length != 0:
                length = lps[length - 1]  # fall back to a shorter prefix-suffix
            else:
                lps[i] = 0
                i += 1
    return lps
```
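To see what the table encodes, here is a brute-force cross-check that computes the same values straight from the definition (the helper name `lps_bruteforce` is hypothetical, introduced only for this illustration):

```python
def lps_bruteforce(p):
    # hypothetical helper: for each prefix of p, find the longest
    # proper prefix that is also a suffix, by direct comparison
    out = []
    for i in range(len(p)):
        s = p[:i+1]
        best = 0
        for k in range(1, len(s)):  # candidate proper-prefix lengths
            if s[:k] == s[-k:]:
                best = k
        out.append(best)
    return out

print(lps_bruteforce("ABABCABAB"))  # [0, 0, 1, 2, 0, 1, 2, 3, 4]
```

The last entry, 4, says that the prefix "ABAB" is also a suffix of the whole pattern, so after a full match the search can resume with four characters already matched.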
Searching with KMP:
```python
def kmp_search(text, pattern):
    n, m = len(text), len(pattern)
    lps = compute_lps(pattern)
    matches = []
    i = j = 0  # i indexes the text, j indexes the pattern
    while i < n:
        if text[i] == pattern[j]:
            i += 1
            j += 1
        if j == m:
            matches.append(i - j)
            j = lps[j - 1]  # keep going to find overlapping matches
        elif i < n and text[i] != pattern[j]:
            if j != 0:
                j = lps[j - 1]  # reuse the matched prefix; never move i back
            else:
                i += 1  # no prefix to reuse; advance in the text
    return matches
```
The genius of KMP is that the text index never moves backward: each text character takes part in at most two comparisons, giving O(n + m) complexity.
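Because j falls back to lps[j-1] after every full match, KMP also reports overlapping occurrences. As an independent cross-check, the standard library's re module can list the same positions using a zero-width lookahead (the helper name `overlapping_positions` is an invention of this snippet):

```python
import re

def overlapping_positions(text, pattern):
    # a zero-width lookahead consumes no characters, so the scan
    # resumes one position later and overlapping hits are reported
    return [m.start() for m in re.finditer(f"(?={re.escape(pattern)})", text)]

print(overlapping_positions("ABABABA", "ABA"))  # [0, 2, 4]
```

Any correct matcher for this problem should produce exactly these positions, overlaps included.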
Rabin-Karp Algorithm
Rabin-Karp uses hashing to find patterns. The idea is simple: compute a hash value for the pattern, then slide through the text computing hash values for each substring of length m. When hashes match, verify with a direct string comparison.
Rolling Hash:
The key is efficiently computing hash values as we slide. Using a rolling hash, we can update the hash in O(1) time when shifting one position:
```python
def rabin_karp(text, pattern, base=256, mod=10**9 + 9):
    n, m = len(text), len(pattern)
    if m > n:
        return []
    pattern_hash = text_hash = 0
    for k in range(m):  # polynomial hashes of the pattern and the first window
        pattern_hash = (pattern_hash * base + ord(pattern[k])) % mod
        text_hash = (text_hash * base + ord(text[k])) % mod
    high = pow(base, m - 1, mod)  # weight of the window's leading character
    matches = []
    for i in range(n - m + 1):
        if text_hash == pattern_hash and text[i:i+m] == pattern:
            matches.append(i)
        if i < n - m:
            # roll the hash: drop text[i], append text[i+m]
            text_hash = ((text_hash - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return matches
```

Note that Python's builtin hash() cannot be used here: the rolling update only works if the window hash is the polynomial hash it assumes.
The expected time is O(n + m), though the worst case can be O(n × m) when hash collisions force repeated verifications. Using a good hash function and a large modulus makes collisions extremely rare.
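To see why the O(1) rolling update is correct, here is a standalone arithmetic check; the helper `poly_hash` is hypothetical and simply computes the polynomial hash directly:

```python
base, mod = 256, 10**9 + 9

def poly_hash(s):
    # hypothetical helper: direct polynomial hash of the whole string
    h = 0
    for ch in s:
        h = (h * base + ord(ch)) % mod
    return h

text, m = "abcde", 3
h = poly_hash(text[:m])  # hash of the window "abc"
# roll from "abc" to "bcd": remove 'a' from the high end, append 'd'
h = ((h - ord(text[0]) * pow(base, m - 1, mod)) * base + ord(text[m])) % mod
print(h == poly_hash("bcd"))  # True
```

Subtracting the leading character's weighted contribution, multiplying by the base, and adding the new character reproduces exactly the direct hash of the shifted window.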
When to Use Each Algorithm
KMP is best when:
- You need guaranteed O(n + m) worst-case performance
- The alphabet is small
- You need to find the first occurrence efficiently
Rabin-Karp is best when:
- You’re searching for multiple patterns at once (compare the window hash against a set of pattern hashes)
- The alphabet is large
- You want a simpler implementation than KMP
Other algorithms to know:
- Boyer-Moore: Compares from the end of the pattern and skips ahead aggressively; very fast in practice, especially for large alphabets
- Z-algorithm: Similar in spirit to KMP; computes prefix-match lengths on the concatenation of pattern, a separator, and text
- Aho-Corasick: Matches many patterns simultaneously in O(n + m + z) time, where z is the number of matches reported
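To make the Boyer-Moore idea concrete, here is a hedged sketch of its simplified Horspool variant (the function name `horspool_search` and the slice-based window check are choices of this sketch; full Boyer-Moore compares right to left and adds a good-suffix rule):

```python
def horspool_search(text, pattern):
    # simplified Boyer-Moore (Horspool variant) sketch:
    # on each alignment, check the window, then shift by the
    # bad-character distance of the text character aligned with
    # the pattern's last position
    n, m = len(text), len(pattern)
    # distance from each character's last occurrence (excluding the
    # final position) to the end of the pattern
    shift = {ch: m - 1 - idx for idx, ch in enumerate(pattern[:-1])}
    matches = []
    i = 0
    while i <= n - m:
        if text[i:i+m] == pattern:
            matches.append(i)
        i += shift.get(text[i + m - 1], m)  # unseen characters allow a full shift
    return matches

print(horspool_search("here is a simple example", "example"))  # [17]
```

The skip table is why Boyer-Moore variants are often sublinear in practice: a window whose last character does not occur in the pattern is skipped in a single m-character jump.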
Practical Applications
These algorithms power real systems:
DNA sequencing: Finding gene sequences in massive DNA databases.
Search engines: Indexing and querying text efficiently.
Virus scanning: Matching file contents against virus signatures.
Log analysis: Finding error patterns in system logs.
Spelling checkers: Dictionary lookup and correction suggestions.
Optimizations in Practice
Real-world implementations include additional optimizations:
Suffix arrays and suffix trees: For many searches on the same text, build an index first.
Bitset parallelism: Use CPU word operations to check multiple positions at once.
SIMD instructions: Process multiple characters simultaneously.
Approximate matching: Allow mismatches for fuzzy searching.
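Bitset parallelism can be made concrete with the classic Shift-And (Bitap) scheme; a minimal sketch, assuming the pattern fits in a machine word (Python integers are unbounded, so here it is purely illustrative):

```python
def bitap_search(text, pattern):
    # Shift-And (Bitap) sketch: bit i of `state` is set exactly when
    # the last i+1 text characters match the first i+1 pattern characters
    m = len(pattern)
    masks = {}
    for idx, ch in enumerate(pattern):  # per-character position masks
        masks[ch] = masks.get(ch, 0) | (1 << idx)
    state = 0
    matches = []
    for pos, ch in enumerate(text):
        # extend every partial match by one character in a single bit operation
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & (1 << (m - 1)):  # full match ends at this position
            matches.append(pos - m + 1)
    return matches

print(bitap_search("abracadabra", "abra"))  # [0, 7]
```

Every candidate alignment advances in one shift-and-mask step, which is the word-level parallelism the optimization above refers to.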
External Resources
- GeeksforGeeks - KMP Algorithm
- GeeksforGeeks - Rabin-Karp
- Boyer-Moore Algorithm
- CP-Algorithms - String Algorithms
Conclusion
String matching algorithms are fundamental tools in any programmer’s arsenal. KMP and Rabin-Karp represent different trade-offs: KMP provides guaranteed performance, while Rabin-Karp is often simpler to implement and works well in practice.
Understanding these algorithms gives you insight into how everyday tools work and prepares you for problems involving text processing, which appear frequently in both interviews and real-world applications.