Knuth-Morris-Pratt Algorithm - CSU083 | Shoolini University

Knuth-Morris-Pratt Algorithm (KMP)

1. Prerequisites: Foundational Concepts

Before understanding the Knuth-Morris-Pratt (KMP) algorithm, you need to be familiar with the following:

1.1 String Matching Basics

Understanding how to find a substring within a larger string using brute-force comparison.

1.2 Prefix and Suffix Concepts

A prefix of a string is a substring starting from the first character, and a suffix is a substring ending at the last character.

1.3 Time Complexity Analysis

Concepts of Big-O notation, particularly understanding how $O(n)$, $O(mn)$, and $O(log n)$ affect algorithm performance.

1.4 Data Structures

Basic understanding of arrays and pattern tables (partial match tables).

2. Core Concept: What is the Knuth-Morris-Pratt Algorithm?

The Knuth-Morris-Pratt (KMP) algorithm is an efficient pattern matching algorithm that finds occurrences of a pattern in a text in $O(n + m)$ time, where:

Unlike the brute-force approach ($O(nm)$), KMP avoids redundant comparisons by using a preprocessing step to build a partial match table (LPS - Longest Prefix Suffix).

2.1 How Does KMP Work?

KMP consists of two main phases:

2.1.1 LPS Array Example

For the pattern ABABCABAB, the LPS array is:

A  B  A  B  C  A  B  A  B
0  0  1  2  0  1  2  3  4

The LPS array tells us the longest prefix that is also a suffix, helping us skip unnecessary comparisons.

3. Why Does This Algorithm Exist?

The KMP algorithm exists to solve the inefficiencies of brute-force string searching by reducing redundant comparisons. It is widely used in:

3.1 Search Engines

Efficient keyword searching in large documents.

3.2 DNA Sequence Matching

Finding genetic patterns in biological research.

3.3 Spam Detection

Identifying spam patterns in emails.

3.4 Plagiarism Detection

Finding exact matches of text across large datasets.

4. When Should You Use It?

KMP is most effective when:

4.1 When Not to Use KMP?

5. How Does It Compare to Alternatives?

5.1 Strengths

5.2 Weaknesses

5.3 Comparison with Other String Matching Algorithms

"
Algorithm Time Complexity Best Use Case
Brute Force $O(nm)$ Small datasets
KMP $O(n + m)$ Long texts with repetitive patterns
Rabin-Karp $O(n + m)$ (average), $O(nm)$ (worst) Multiple pattern searches
Boyer-Moore $O(n/m)$ (best), $O(nm)$ (worst) Searching in large texts with distinct patterns

6. Basic Implementation

The following is the basic implementation of the Knuth-Morris-Pratt (KMP) algorithm in Python. The implementation consists of two parts:

6.1 Python Implementation

def compute_lps(pattern):
    m = len(pattern)
    lps = [0] * m
    length = 0  # Length of the previous longest prefix suffix
    i = 1

    while i < m:
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        else:
            if length != 0:
                length = lps[length - 1]
            else:
                lps[i] = 0
                i += 1
    return lps

def kmp_search(text, pattern):
    n, m = len(text), len(pattern)
    lps = compute_lps(pattern)

    i = j = 0  # i -> text index, j -> pattern index
    while i < n:
        if pattern[j] == text[i]:
            i += 1
            j += 1

        if j == m:
            print(f"Pattern found at index {i - j}")
            j = lps[j - 1]  # Move j back to check for further occurrences
        elif i < n and pattern[j] != text[i]:  
            if j != 0:
                j = lps[j - 1]
            else:
                i += 1  # Move to next character in text

# Example usage:
text = "ABABDABACDABABCABAB"
pattern = "ABABCABAB"
kmp_search(text, pattern)

This implementation efficiently searches for the pattern in the text using the precomputed LPS array.

7. Dry Run of the Algorithm

Let’s dry run the algorithm step by step with the input:

7.1 Step 1: Compute LPS Array for Pattern

For ABABCABAB, the LPS array is computed as follows:

A  B  A  B  C  A  B  A  B
0  0  1  2  0  1  2  3  4

This array helps us skip redundant comparisons.

7.2 Step 2: Start Matching in Text

We begin comparing the pattern with the text at each position:

"
Step i (Text Index) j (Pattern Index) Action
1 0 0 Match: A == A
2 1 1 Match: B == B
3 2 2 Match: A == A
4 3 3 Match: B == B
5 4 4 Mismatch: C != D, Use LPS to shift
6 10 8 Pattern found at index 10

7.3 Explanation of Shifting Using LPS

Whenever a mismatch occurs at a position j, instead of restarting, we use the LPS array to move the pattern to the next possible matching position.

7.4 Final Output

Pattern found at index 10

The pattern ABABCABAB is found in the text starting at index 10.

8. Time & Space Complexity Analysis

The Knuth-Morris-Pratt (KMP) algorithm operates efficiently by reducing redundant comparisons using the LPS (Longest Prefix Suffix) array. Below is a detailed complexity analysis.

8.1 Time Complexity Breakdown

The KMP algorithm consists of two major phases:

8.1.1 Preprocessing the Pattern (Building LPS Array)
8.1.2 Searching the Pattern in Text

8.2 Worst-Case Time Complexity

Occurs when the pattern has a repetitive structure, causing frequent mismatches.

Time Complexity: $$O(n + m)$$

8.3 Best-Case Time Complexity

Occurs when there are no mismatches or when the pattern doesn't appear in the text.

Time Complexity: $$O(n)$$

8.4 Average-Case Time Complexity

Generally, the algorithm performs close to its best-case due to efficient skipping.

Time Complexity: $$O(n + m)$$

9. Space Complexity Analysis

9.1 Space Consumption

The KMP algorithm mainly consumes space for the LPS array:

9.2 Space Complexity Formula

Overall Space Complexity: $$O(m)$$

9.3 How Space Grows with Input Size

Since only the pattern length m affects memory usage, the space requirement remains minimal and grows linearly with the pattern size.

10. Trade-offs in Using KMP

10.1 Strengths

10.2 Weaknesses

10.3 Comparison with Alternatives

"
Algorithm Time Complexity Space Complexity Best Use Case
Brute Force O(nm) O(1) Small datasets
KMP O(n + m) O(m) Exact matching in large texts
Rabin-Karp O(n + m) (avg), O(nm) (worst) O(1) Multiple pattern searches
Boyer-Moore O(n/m) (best), O(nm) (worst) O(m) Fast search in distinct patterns

11. Optimizations & Variants

The Knuth-Morris-Pratt (KMP) algorithm is already optimized for linear-time pattern searching, but additional improvements can further enhance its efficiency in practical applications.

11.1 Common Optimizations

11.1.1 Optimized LPS Computation

The preprocessing step for the LPS array can be fine-tuned by minimizing unnecessary comparisons:

11.1.2 Avoiding Unnecessary Comparisons
11.1.3 Early Termination

If the remaining characters in the text are fewer than the unmatched part of the pattern, terminate early.

11.2 Variants of KMP

11.2.1 KMP for Multiple Patterns

Instead of using KMP separately for each pattern, preprocessing multiple patterns into a single LPS table is useful in:

11.2.2 Approximate Matching

Traditional KMP works for exact matches, but can be extended using:

11.2.3 Parallelized KMP

Modern hardware supports running different parts of the search on multiple cores, improving performance on large texts.

11.2.4 KMP with Compressed LPS

Instead of storing the entire LPS array, only necessary transition points can be stored to save memory.

12. Iterative vs. Recursive Implementations

12.1 Iterative KMP Implementation

The standard KMP implementation is iterative, using loops for LPS computation and pattern matching.

def kmp_search(text, pattern):
    n, m = len(text), len(pattern)
    lps = compute_lps(pattern)
    
    i = j = 0  # Pointers for text and pattern
    while i < n:
        if pattern[j] == text[i]:
            i += 1
            j += 1

        if j == m:
            print(f"Pattern found at index {i - j}")
            j = lps[j - 1]  # Move j back using LPS
        elif i < n and pattern[j] != text[i]:
            if j != 0:
                j = lps[j - 1]
            else:
                i += 1  # Move forward in text

12.2 Recursive KMP Implementation

A recursive version eliminates loops by using function calls:

def recursive_kmp(text, pattern, i=0, j=0, lps=None):
    if lps is None:
        lps = compute_lps(pattern)
    
    if i >= len(text):
        return
    
    if j == len(pattern):
        print(f"Pattern found at index {i - j}")
        return recursive_kmp(text, pattern, i, lps[j - 1], lps)
    
    if text[i] == pattern[j]:
        return recursive_kmp(text, pattern, i + 1, j + 1, lps)
    
    if j != 0:
        return recursive_kmp(text, pattern, i, lps[j - 1], lps)
    else:
        return recursive_kmp(text, pattern, i + 1, j, lps)

12.3 Efficiency Comparison

"
Feature Iterative Recursive
Time Complexity O(n + m) O(n + m)
Space Complexity O(m) (only LPS stored) O(m + d) (LPS + recursion depth)
Stack Usage Minimal May cause stack overflow for large inputs
Performance Faster due to loop optimization Slower due to recursive calls

12.4 When to Use Each

13. Edge Cases & Failure Handling

The Knuth-Morris-Pratt (KMP) algorithm generally performs efficiently, but certain edge cases can cause unexpected behavior. Below are some of the most critical pitfalls and how to handle them.

13.1 Edge Cases

13.1.1 Pattern Longer than Text

If the pattern length is greater than the text length, the search should terminate immediately.

if len(pattern) > len(text):
    return "Pattern length exceeds text length. No match possible."
13.1.2 Pattern is Empty

If the pattern is an empty string, it should match at every index of the text.

if len(pattern) == 0:
    return "Empty pattern matches at all positions."
13.1.3 Text is Empty

If the text is empty, no matches should be found.

if len(text) == 0:
    return "No match in an empty text."
13.1.4 Repeating Characters in Pattern

Patterns with many repeating characters (e.g., "aaaaa") should not cause excessive LPS recalculations.

13.1.5 No Match Exists

If no part of the text matches the pattern, the algorithm should terminate gracefully.

13.1.6 Multiple Matches

The algorithm should handle multiple occurrences of the pattern within the text.

13.1.7 Case Sensitivity

By default, KMP is case-sensitive, meaning "ABC" and "abc" are treated as different. To handle case-insensitive searches, convert both text and pattern to lowercase.

text = text.lower()
pattern = pattern.lower()

14. Test Cases to Verify Correctness

To ensure the correctness of the KMP algorithm, we need to validate various scenarios using test cases.

14.1 Basic Test Cases

"
Test Case Text Pattern Expected Output
Pattern found once "abcdeabcabc" "abc" Found at index 0, 5, 8
Pattern not found "abcdefgh" "xyz" No match found
Pattern longer than text "abc" "abcd" No match found
Pattern is empty "abcdef" "" Matches at all positions
Text is empty "" "abc" No match found
Multiple occurrences "ababababc" "ab" Found at index 0, 2, 4, 6

14.2 Python Unit Tests

import unittest

class TestKMP(unittest.TestCase):
    def test_single_match(self):
        self.assertEqual(kmp_search("abcdeabcabc", "abc"), [0, 5, 8])

    def test_no_match(self):
        self.assertEqual(kmp_search("abcdefgh", "xyz"), [])

    def test_pattern_longer_than_text(self):
        self.assertEqual(kmp_search("abc", "abcd"), [])

    def test_empty_pattern(self):
        self.assertEqual(kmp_search("abcdef", ""), list(range(len("abcdef"))))

    def test_empty_text(self):
        self.assertEqual(kmp_search("", "abc"), [])

    def test_multiple_occurrences(self):
        self.assertEqual(kmp_search("ababababc", "ab"), [0, 2, 4, 6])

if __name__ == "__main__":
    unittest.main()

15. Real-World Failure Scenarios

15.1 Handling Large Text Inputs

When dealing with very large texts (e.g., genome sequences, logs, or books), KMP’s efficiency remains linear, but:

15.2 Handling Unicode & Special Characters

KMP must handle non-ASCII characters properly, such as:

import unicodedata
text = unicodedata.normalize('NFC', text)
pattern = unicodedata.normalize('NFC', pattern)

15.3 Case-Insensitive Matching

By default, "abc" ≠ "ABC". To perform case-insensitive matching:

text = text.lower()
pattern = pattern.lower()

15.4 Real-Time Streaming Search

For real-time applications (e.g., autocomplete, network packet inspection), KMP needs to work on streaming data:

15.5 Pattern Changing Dynamically

If the pattern changes frequently, recomputing the LPS array can be expensive. Instead, a hybrid approach using:

16. Real-World Applications & Industry Use Cases

The Knuth-Morris-Pratt (KMP) algorithm is widely used in various real-world applications where efficient substring searching is required.

16.1 Search Engines & Text Processing

16.2 Bioinformatics

16.3 Network Security & Cybersecurity

16.4 E-Commerce & Digital Marketing

16.5 Software Development & Compilers

17. Open-Source Implementations

Several open-source projects implement KMP for different applications.

17.1 Notable Repositories

17.2 How to Contribute

18. Practical Project: Log File Keyword Search

This project implements a simple log file scanner that detects specific keywords in server logs using the KMP algorithm.

18.1 Use Case

Detect security incidents by scanning server logs for patterns like "unauthorized access", "SQL injection", or "error 500".

18.2 Python Implementation

import os

def compute_lps(pattern):
    m = len(pattern)
    lps = [0] * m
    length = 0  
    i = 1

    while i < m:
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        else:
            if length != 0:
                length = lps[length - 1]
            else:
                lps[i] = 0
                i += 1
    return lps

def kmp_search(text, pattern):
    n, m = len(text), len(pattern)
    lps = compute_lps(pattern)

    i = j = 0  
    occurrences = []
    while i < n:
        if pattern[j] == text[i]:
            i += 1
            j += 1

        if j == m:
            occurrences.append(i - j)
            j = lps[j - 1]  
        elif i < n and pattern[j] != text[i]:  
            if j != 0:
                j = lps[j - 1]
            else:
                i += 1  
    return occurrences

def scan_log_file(log_file, keywords):
    with open(log_file, "r", encoding="utf-8") as file:
        lines = file.readlines()
    
    for i, line in enumerate(lines):
        for keyword in keywords:
            matches = kmp_search(line, keyword)
            if matches:
                print(f"Match for '{keyword}' found in line {i+1}: {line.strip()}")

# Example usage:
log_file_path = "server_logs.txt"
search_patterns = ["unauthorized access", "error 500", "SQL injection"]
scan_log_file(log_file_path, search_patterns)

18.3 How It Works

18.4 Future Enhancements

19. Competitive Programming & System Design Integration

19.1 KMP in Competitive Programming

The Knuth-Morris-Pratt (KMP) algorithm is frequently used in competitive programming for string-matching problems. Given its efficiency ($O(n + m)$ complexity), it is ideal for handling large inputs.

19.1.1 Common Problem Patterns
19.1.2 Problem-Solving Strategy
  1. Identify if the problem requires substring searching.
  2. Precompute the LPS array if necessary.
  3. Apply KMP instead of naive or brute-force approaches.
  4. Optimize for edge cases (large inputs, multiple matches).

19.2 KMP in System Design

In large-scale applications, KMP is used in various system design scenarios:

19.2.1 Distributed Search Engines
19.2.2 Real-Time Log Analysis
19.2.3 Bioinformatics Systems
19.2.4 Spam Detection & Filtering

20. Assignments

20.1 Solve at Least 10 Problems Using KMP

Practice problems to reinforce your understanding of KMP:

  1. NHAY - A Needle in the Haystack (SPOJ)
  2. Pattern Searching (GFG)
  3. Implement strStr() (Leetcode 28)
  4. TASHIFT - The Shift (CodeChef)
  5. The Power Sum (HackerRank)
  6. Find all occurrences of a pattern in a string with overlapping matches.
  7. Find the longest border of a string using KMP’s LPS array.
  8. Check if one string is a rotation of another using KMP.
  9. Find the longest repeating substring in a given text using KMP.
  10. Count the number of times a pattern appears in a large document efficiently.

20.2 Use KMP in a System Design Problem

Design a scalable log processing system where multiple servers generate logs, and a monitoring system scans for predefined error messages using KMP.

Steps:

20.3 Implement KMP Under Time Constraints

To improve your ability to implement KMP efficiently in a competitive environment, follow these constraints:

Suggested platforms for timed practice:

By solving these assignments, you'll gain confidence in implementing KMP efficiently and applying it to real-world problems.