Rabin-Karp Algorithm - CSU083 | Shoolini University

Rabin-Karp Algorithm

1. Prerequisites

Before understanding the Rabin-Karp algorithm, you need to be familiar with:

2. What is the Rabin-Karp Algorithm?

The Rabin-Karp algorithm is a string searching algorithm that finds occurrences of a pattern within a text using hashing. Instead of directly comparing characters, it converts substrings into hash values and compares these values to detect matches efficiently.

Steps:

  1. Compute the hash of the pattern.
  2. Compute the hash of the first substring in the text (same length as the pattern).
  3. Slide over the text, updating the hash using a rolling hash function.
  4. If the hash matches the pattern’s hash, do a character-by-character check to confirm.

3. Why Does This Algorithm Exist?

The Rabin-Karp algorithm was developed to efficiently find patterns in large bodies of text. Its real-world applications include:

4. When Should You Use It?

Rabin-Karp is ideal when:

It is less useful when:

5. How Does It Compare to Alternatives?

Strengths:

Weaknesses:

Comparison with Other Algorithms:

"
Algorithm Best Case Worst Case Use Case
Rabin-Karp \(O(n)\) \(O(nm)\) (with collisions) Multiple pattern matching
KMP (Knuth-Morris-Pratt) \(O(n)\) \(O(n)\) Single pattern, long text
Boyer-Moore \(O(n/m)\) \(O(nm)\) Long text, short patterns

6. Basic Implementation (Code & Dry Run)

6.1 Python Implementation

Below is the basic implementation of the Rabin-Karp algorithm in Python:

def rabin_karp(text, pattern, prime=101):
    d = 256  # Number of characters in the input alphabet
    m = len(pattern)
    n = len(text)
    h = 1  # Hash multiplier
    pattern_hash = 0
    text_hash = 0
    results = []
    
    # Compute the value of h = d^(m-1) % prime
    for i in range(m - 1):
        h = (h * d) % prime

    # Compute initial hash values for the pattern and the first window of text
    for i in range(m):
        pattern_hash = (d * pattern_hash + ord(pattern[i])) % prime
        text_hash = (d * text_hash + ord(text[i])) % prime

    # Slide the pattern over the text
    for i in range(n - m + 1):
        # Check hash values
        if pattern_hash == text_hash:
            # Confirm match by checking characters
            if text[i:i + m] == pattern:
                results.append(i)

        # Compute hash of next window by removing first character and adding next
        if i < n - m:
            text_hash = (d * (text_hash - ord(text[i]) * h) + ord(text[i + m])) % prime
            if text_hash < 0:
                text_hash += prime  # Ensure positive hash values

    return results

# Example usage
text = "ABABDABACDABABCABAB"
pattern = "ABAB"
print(rabin_karp(text, pattern))  # Expected Output: [0, 10, 15]

7. Dry Run of Rabin-Karp Algorithm

7.1 Example Input

7.2 Step-by-Step Execution

"
Step Action Pattern Hash Text Hash Match?
1 Compute pattern hash ("ABAB") 88 - -
2 Compute initial text hash ("ABAB") - 88 ✅ (Character check confirms match at index 0)
3 Slide window to "BABD" 88 31
4 Slide window to "ABDA" 88 54
5 Slide window to "BDAB" 88 79
6 Slide window to "DABA" 88 23
7 Slide window to "ABAC" 88 94
8 Slide window to "BACD" 88 78
9 Slide window to "ACDA" 88 3
10 Slide window to "CDAB" 88 41
11 Slide window to "DABA" 88 23
12 Slide window to "ABAB" 88 88 ✅ (Character check confirms match at index 10)
13 Slide window to "BABC" 88 36
14 Slide window to "ABCA" 88 25
15 Slide window to "BCAB" 88 46
16 Slide window to "CABA" 88 67
17 Slide window to "ABAB" 88 88 ✅ (Character check confirms match at index 15)

7.3 Final Matches

The pattern "ABAB" is found at indices: [0, 10, 15]

8. Time & Space Complexity Analysis

8.1 Best-Case Complexity

When there are no hash collisions, the algorithm only computes hashes and slides the window:

8.2 Worst-Case Complexity

If hash collisions occur frequently, every match requires a character-by-character verification:

8.3 Average-Case Complexity

When hash collisions are rare, verification is not needed, leading to:

9. Space Complexity Analysis

9.1 Space Consumption Breakdown

9.2 How Space Scales with Input

10. Understanding Trade-offs

10.1 Strengths

10.2 Weaknesses

10.3 When to Choose Rabin-Karp

11. Optimizations & Variants

11.1 Common Optimizations

11.2 Variants of Rabin-Karp

12. Iterative vs. Recursive Implementations

12.1 Iterative Implementation

def rabin_karp_iterative(text, pattern, prime=101):
    d = 256  
    m, n = len(pattern), len(text)
    h, pattern_hash, text_hash = 1, 0, 0
    results = []

    for i in range(m - 1):
        h = (h * d) % prime

    for i in range(m):
        pattern_hash = (d * pattern_hash + ord(pattern[i])) % prime
        text_hash = (d * text_hash + ord(text[i])) % prime

    for i in range(n - m + 1):
        if pattern_hash == text_hash and text[i:i + m] == pattern:
            results.append(i)

        if i < n - m:
            text_hash = (d * (text_hash - ord(text[i]) * h) + ord(text[i + m])) % prime
            if text_hash < 0:
                text_hash += prime

    return results

12.2 Recursive Implementation

def rabin_karp_recursive(text, pattern, prime=101, i=0, text_hash=None, pattern_hash=None, h=None, results=None):
    if results is None:
        results = []
        d = 256  
        m, n = len(pattern), len(text)
        h = 1  

        for _ in range(m - 1):
            h = (h * d) % prime

        pattern_hash, text_hash = 0, 0
        for j in range(m):
            pattern_hash = (d * pattern_hash + ord(pattern[j])) % prime
            text_hash = (d * text_hash + ord(text[j])) % prime

    if i > len(text) - len(pattern):
        return results

    if pattern_hash == text_hash and text[i:i + len(pattern)] == pattern:
        results.append(i)

    if i < len(text) - len(pattern):
        text_hash = (d * (text_hash - ord(text[i]) * h) + ord(text[i + len(pattern)])) % prime
        if text_hash < 0:
            text_hash += prime

    return rabin_karp_recursive(text, pattern, prime, i + 1, text_hash, pattern_hash, h, results)

12.3 Efficiency Comparison

"
Approach Time Complexity Space Complexity Best Use Case
Iterative \(O(n)\) average, \(O(nm)\) worst \(O(1)\) Standard applications, better for large inputs
Recursive \(O(n)\) average, \(O(nm)\) worst \(O(n)\) (due to recursion stack) Conceptual understanding, small input sizes

12.4 Conclusion

13. Edge Cases & Failure Handling

13.1 Common Pitfalls & Edge Cases

14. Test Cases to Verify Correctness

14.1 Basic Test Cases

def test_rabin_karp():
    text1 = "ABABDABACDABABCABAB"
    pattern1 = "ABAB"
    assert rabin_karp(text1, pattern1) == [0, 10, 15], "Test Case 1 Failed"

    text2 = "HELLO WORLD"
    pattern2 = "WORLD"
    assert rabin_karp(text2, pattern2) == [6], "Test Case 2 Failed"

    text3 = "AAAAAAAAAA"
    pattern3 = "AAA"
    assert rabin_karp(text3, pattern3) == [0, 1, 2, 3, 4, 5, 6, 7], "Test Case 3 Failed"

    text4 = "abcdef"
    pattern4 = "gh"
    assert rabin_karp(text4, pattern4) == [], "Test Case 4 Failed"

    text5 = ""
    pattern5 = "A"
    assert rabin_karp(text5, pattern5) == [], "Test Case 5 Failed"

    text6 = "A"
    pattern6 = ""
    assert rabin_karp(text6, pattern6) == [], "Test Case 6 Failed"

    text7 = "ABCD"
    pattern7 = "ABCDEFG"
    assert rabin_karp(text7, pattern7) == [], "Test Case 7 Failed"

    print("All test cases passed!")

# Run the tests
test_rabin_karp()

14.2 Explanation of Test Cases

15. Real-World Failure Scenarios

15.1 Hash Collisions in Large Text Data

Scenario: In large text datasets (e.g., DNA sequencing), hash collisions can trigger many false positives, reducing efficiency.

Solution: Use a stronger hash function or double hashing.

15.2 Unicode & Multilingual Text Issues

Scenario: In languages like Chinese, Arabic, or Japanese, character encoding variations can alter hash values.

Solution: Normalize text encoding before hashing.

15.3 Scaling for Massive Inputs

Scenario: Searching for a pattern in terabytes of text (e.g., search engines) requires significant optimizations.

Solution: Use distributed processing (e.g., MapReduce) to parallelize computation.

15.4 Network Intrusion Detection

Scenario: Detecting malicious patterns in network packets using Rabin-Karp.

Problem: High hash collision rate leads to excessive verification.

Solution: Implement rolling hash with adaptive filtering.

15.5 Case Sensitivity & Special Characters

Scenario: Searching for "hello" should match "HELLO" in case-insensitive searches.

Solution: Convert all characters to lowercase before hashing.

16. Real-World Applications & Industry Use Cases

16.1 Plagiarism Detection

Use Case: Many plagiarism detection tools use Rabin-Karp to quickly locate matching phrases across documents.

How it works: Convert documents into hash fingerprints and compare them across a large database.

16.2 Spam Filtering

Use Case: Email filtering systems use Rabin-Karp to detect patterns commonly found in spam emails.

How it works: Predefined spam phrases are hashed and compared with incoming emails to flag spam messages.

16.3 Digital Forensics & Cybersecurity

Use Case: Cybersecurity tools use Rabin-Karp to identify malware signatures in files and network packets.

How it works: Hash known malicious patterns and scan system logs for matches.

16.4 DNA Sequence Matching

Use Case: Bioinformatics applications use Rabin-Karp for fast pattern matching in genetic data.

How it works: Convert DNA sequences into hash values and scan large datasets for known patterns.

16.5 Search Engines & Text Indexing

Use Case: Search engines use Rabin-Karp to locate exact phrase matches efficiently.

How it works: Hash indexed text segments and compare them against user queries.

17. Open-Source Implementations

Several open-source libraries implement Rabin-Karp in optimized ways:

17.1 Python - Scikit-learn's String Matching

Scikit-learn includes fast implementations for approximate string matching, using variations of Rabin-Karp.

Repo: GitHub - scikit-learn

17.2 Java - Apache Commons StringUtils

The Apache Commons library provides efficient substring search implementations, including Rabin-Karp.

Repo: GitHub - Apache Commons

17.3 C++ - GNU Grep (Pattern Matching in Linux)

GNU Grep, a high-performance text search tool, uses Rabin-Karp for certain types of pattern searches.

Repo: GNU Grep Source Code

18. Practical Project: File Similarity Detector

18.1 Objective

We will create a Python script that checks the similarity between two text files using the Rabin-Karp algorithm.

18.2 Implementation

import os

def compute_hash(text, prime=101):
    """Computes hash for a given text"""
    d = 256
    hash_value = 0
    for char in text:
        hash_value = (d * hash_value + ord(char)) % prime
    return hash_value

def rabin_karp_similarity(file1, file2, window_size=10):
    """Compares similarity between two text files using Rabin-Karp"""
    with open(file1, "r", encoding="utf-8") as f1, open(file2, "r", encoding="utf-8") as f2:
        text1, text2 = f1.read(), f2.read()

    hashes1 = {compute_hash(text1[i:i+window_size]) for i in range(len(text1) - window_size + 1)}
    hashes2 = {compute_hash(text2[i:i+window_size]) for i in range(len(text2) - window_size + 1)}

    intersection = hashes1.intersection(hashes2)
    similarity = len(intersection) / max(len(hashes1), len(hashes2))

    return similarity

# Example usage
file1, file2 = "document1.txt", "document2.txt"
if os.path.exists(file1) and os.path.exists(file2):
    similarity_score = rabin_karp_similarity(file1, file2)
    print(f"File Similarity Score: {similarity_score * 100:.2f}%")
else:
    print("Ensure both files exist before running the script.")

18.3 How It Works

18.4 Applications

18.5 Future Enhancements

19. Competitive Programming & System Design Integration

19.1 Competitive Programming Use Cases

Rabin-Karp is useful in competitive programming where pattern matching needs to be done efficiently.

19.2 Types of Problems Where Rabin-Karp Excels

19.3 System Design Integration

Rabin-Karp can be integrated into large-scale systems for efficient pattern matching.

19.4 Examples of System Design Use Cases

20. Assignments & Practice Problems

20.1 Solve at Least 10 Problems Using Rabin-Karp

Try implementing Rabin-Karp for the following problems:

  1. Find all occurrences of a pattern in a given text.
  2. Find repeated substrings of length \(k\) in a large text.
  3. Check if a given string contains a rotated version of another.
  4. Find the longest repeating substring in a given text.
  5. Detect anagrams in a text using Rabin-Karp hashing.
  6. Find the first unique substring of length \(k\) in a text.
  7. Detect plagiarism between two documents.
  8. Find common substrings between two texts.
  9. Search for DNA sequence patterns efficiently.
  10. Implement a spam filter to detect predefined phrases.

20.2 Use Rabin-Karp in a System Design Problem

Design a system that efficiently searches for specific keywords in a large-scale database using Rabin-Karp.

20.3 Practice Implementing Under Time Constraints

Time yourself while implementing Rabin-Karp from scratch. Try optimizing: