Aho-Corasick Algorithm - CSU083 | Shoolini University

Aho-Corasick Algorithm

1. Prerequisites

Before understanding the Aho-Corasick algorithm, you need familiarity with:

2. What is the Aho-Corasick Algorithm?

The Aho-Corasick algorithm is an efficient string-searching algorithm that finds multiple patterns in a given text simultaneously.

The algorithm processes the text in O(N + M + Z) time complexity, where:

3. Why Does This Algorithm Exist?

The Aho-Corasick algorithm is designed to perform fast multi-pattern searching, useful in:

4. When Should You Use It?

The Aho-Corasick algorithm is best used when:

5. How Does It Compare to Alternatives?

5.1 Strengths

5.2 Weaknesses

6. Basic Implementation

6.1 Python Implementation

The following Python implementation demonstrates the Aho-Corasick algorithm for multi-pattern searching.


from collections import deque

class TrieNode:
    def __init__(self):
        self.children = {}
        self.fail_link = None
        self.output = []

class AhoCorasick:
    def __init__(self, patterns):
        self.root = TrieNode()
        self.build_trie(patterns)
        self.build_failure_links()

    def build_trie(self, patterns):
        for pattern in patterns:
            node = self.root
            for char in pattern:
                if char not in node.children:
                    node.children[char] = TrieNode()
                node = node.children[char]
            node.output.append(pattern)

    def build_failure_links(self):
        queue = deque()
        for key, node in self.root.children.items():
            node.fail_link = self.root
            queue.append(node)
        
        while queue:
            current = queue.popleft()
            for key, child in current.children.items():
                fail = current.fail_link
                while fail and key not in fail.children:
                    fail = fail.fail_link
                child.fail_link = fail.children[key] if fail else self.root
                child.output += child.fail_link.output
                queue.append(child)

    def search(self, text):
        node = self.root
        results = []
        for i, char in enumerate(text):
            while node and char not in node.children:
                node = node.fail_link
            if not node:
                node = self.root
                continue
            node = node.children[char]
            for match in node.output:
                results.append((i - len(match) + 1, match))
        return results

patterns = ["he", "she", "his", "hers"]
aho = AhoCorasick(patterns)
text = "ahishers"
matches = aho.search(text)
print("Matches found:", matches)

Explanation:

7. Dry Run

7.1 Input:

7.2 Step-by-Step Execution:

"
Step Character Current Node Action Output
1 'a' Root No match, stay at root. []
2 'h' h Move to node 'h'. []
3 'i' hi Move to node 'i'. []
4 's' his Pattern "his" matched. [(1, "his")]
5 'h' h Move to node 'h'. []
6 'e' he Pattern "he" matched. [(1, "his"), (4, "he")]
7 'r' her Move to node 'r'. []
8 's' hers Pattern "hers" matched. [(1, "his"), (4, "he"), (5, "hers")]

7.3 Final Matches:

Matches found: [(1, "his"), (4, "he"), (5, "hers")]

These matches indicate the starting index where each pattern occurs in the text.

8. Time & Space Complexity Analysis

8.1 Time Complexity Analysis

The Aho-Corasick algorithm operates in three main phases:

8.2 Worst-Case Complexity

The worst-case complexity occurs when all characters mismatch and failure links traverse back to the root frequently.

Total Complexity (Worst-Case): O(M × L + T + Z)

8.3 Best-Case Complexity

In the best case, failure links allow direct transitions, and patterns match early.

Total Complexity (Best-Case): O(M × L + T)

8.4 Average-Case Complexity

On average, failure links reduce unnecessary backtracking, and patterns are distributed evenly.

Total Complexity (Average-Case): O(M × L + T + Z)

9. Space Complexity Analysis

The Aho-Corasick algorithm requires additional space for:

Total Space Complexity: O(M × L)

9.1 How Space Consumption Changes With Input Size

10. Trade-Offs in Aho-Corasick Algorithm

10.1 Strengths

10.2 Weaknesses

10.3 When to Use Aho-Corasick vs. Other Algorithms

"
Algorithm Best Use Case Time Complexity Space Complexity
Aho-Corasick Multi-pattern searching in large text O(T + M × L + Z) O(M × L)
KMP Single pattern searching O(T + L) O(L)
Rabin-Karp Multiple pattern searching (small set) O(T + M) O(1)

11. Optimizations & Variants

11.1 Common Optimizations

To improve the performance of the Aho-Corasick algorithm, several optimizations can be applied:

11.2 Variants of Aho-Corasick

Different versions of the algorithm exist for specialized use cases:

12. Iterative vs. Recursive Implementations

12.1 Iterative Implementation

The most common implementation of Aho-Corasick is iterative, using a queue for BFS in failure link construction:


from collections import deque

def build_failure_links(root):
    queue = deque()
    for key, node in root.children.items():
        node.fail_link = root
        queue.append(node)
    
    while queue:
        current = queue.popleft()
        for key, child in current.children.items():
            fail = current.fail_link
            while fail and key not in fail.children:
                fail = fail.fail_link
            child.fail_link = fail.children[key] if fail else root
            queue.append(child)

12.2 Recursive Implementation

A recursive version can be used for failure link construction but suffers from stack depth issues:


def build_failure_links_recursive(node, root):
    for key, child in node.children.items():
        fail = node.fail_link
        while fail and key not in fail.children:
            fail = fail.fail_link
        child.fail_link = fail.children[key] if fail else root
        build_failure_links_recursive(child, root)

12.3 Efficiency Comparison

"
Implementation Time Complexity Space Complexity Best Use Case
Iterative (BFS-based) O(M × L) O(M × L) Large datasets, practical use
Recursive (DFS-based) O(M × L) O(M × L) + O(D) (stack depth D) Small datasets, theoretical use

13. Edge Cases & Failure Handling

13.1 Common Pitfalls and Edge Cases

While implementing Aho-Corasick, it's crucial to handle edge cases correctly:

14. Test Cases to Verify Correctness

14.1 Sample Test Cases

"
Test Case Input Patterns Input Text Expected Output
Basic Match ["he", "she", "his", "hers"] "ahishers" [(1, "his"), (4, "he"), (5, "hers")]
Empty Patterns [] "some random text" []
Pattern Larger than Text ["longpattern"] "short" []
Overlapping Matches ["ab", "abc", "bc"] "abc" [(0, "ab"), (0, "abc"), (1, "bc")]
Case Sensitivity ["hello"] "Hello world" [] (if case-sensitive), [(0, "hello")] (if case-insensitive)
Special Characters ["$money$", "#tag"] "earn $money$ fast! Use #tag" [(5, "$money$"), (20, "#tag")]
Multiple Occurrences ["abc"] "abc abc abc" [(0, "abc"), (4, "abc"), (8, "abc")]
Streaming Input ["data"] "incoming datastream containing data" [(9, "data"), (28, "data")]

14.2 Python Code for Testing

To ensure correctness, write test functions:


def run_tests():
    patterns = [
        (["he", "she", "his", "hers"], "ahishers", [(1, "his"), (4, "he"), (5, "hers")]),
        ([], "some random text", []),
        (["longpattern"], "short", []),
        (["ab", "abc", "bc"], "abc", [(0, "ab"), (0, "abc"), (1, "bc")]),
        (["hello"], "Hello world", []),  # Case-sensitive check
        (["$money$", "#tag"], "earn $money$ fast! Use #tag", [(5, "$money$"), (20, "#tag")]),
        (["abc"], "abc abc abc", [(0, "abc"), (4, "abc"), (8, "abc")]),
        (["data"], "incoming datastream containing data", [(9, "data"), (28, "data")])
    ]

    aho = AhoCorasick([])
    for pattern_set, text, expected in patterns:
        aho.__init__(pattern_set)
        result = aho.search(text)
        assert result == expected, f"Test failed for input: {text}"

    print("All tests passed!")

run_tests()

15. Real-World Failure Scenarios

15.1 Failure Scenarios in Practical Applications

Even well-designed Aho-Corasick implementations can fail in certain real-world situations:

15.2 Mitigation Strategies

16. Real-World Applications & Industry Use Cases

16.1 How is Aho-Corasick Used in Real-World Applications?

The Aho-Corasick algorithm is widely used across industries due to its ability to efficiently search for multiple patterns. Some key applications include:

17. Open-Source Implementations

Several open-source implementations of the Aho-Corasick algorithm are available for different programming languages:

These implementations allow developers to integrate Aho-Corasick into their applications efficiently.

18. Practical Project: Implementing Aho-Corasick for Text Filtering

18.1 Project Overview

We will create a script that uses the Aho-Corasick algorithm to scan text for prohibited words and replace them with asterisks (***) for chat moderation.

18.2 Python Implementation


from collections import deque

class TrieNode:
    def __init__(self):
        self.children = {}
        self.fail_link = None
        self.output = []

class AhoCorasickFilter:
    def __init__(self, banned_words):
        self.root = TrieNode()
        self.build_trie(banned_words)
        self.build_failure_links()

    def build_trie(self, banned_words):
        for word in banned_words:
            node = self.root
            for char in word:
                if char not in node.children:
                    node.children[char] = TrieNode()
                node = node.children[char]
            node.output.append(word)

    def build_failure_links(self):
        queue = deque()
        for key, node in self.root.children.items():
            node.fail_link = self.root
            queue.append(node)
        
        while queue:
            current = queue.popleft()
            for key, child in current.children.items():
                fail = current.fail_link
                while fail and key not in fail.children:
                    fail = fail.fail_link
                child.fail_link = fail.children[key] if fail else self.root
                child.output += child.fail_link.output
                queue.append(child)

    def censor_text(self, text):
        node = self.root
        output_text = list(text)

        i = 0
        while i < len(text):
            char = text[i]
            while node and char not in node.children:
                node = node.fail_link
            if not node:
                node = self.root
                i += 1
                continue
            node = node.children[char]
            for match in node.output:
                start = i - len(match) + 1
                output_text[start:i+1] = '*' * len(match)
            i += 1

        return ''.join(output_text)

# Example usage
banned_words = ["bad", "ugly", "offensive"]
filter_system = AhoCorasickFilter(banned_words)
text = "This is a bad and ugly statement!"
filtered_text = filter_system.censor_text(text)
print("Filtered Output:", filtered_text)

18.3 Expected Output


Filtered Output: This is a *** and **** statement!

18.4 Key Features

19. Competitive Programming & System Design Integration

19.1 Competitive Programming Use Cases

The Aho-Corasick algorithm is frequently used in coding competitions due to its efficiency in handling multi-pattern searches. It is ideal for problems that involve:

Key platforms where Aho-Corasick problems appear:

19.2 System Design Integration

In real-world systems, Aho-Corasick is used in:

19.3 Scaling Aho-Corasick for Large Systems

20. Assignments & Practice Problems

20.1 Solve at Least 10 Problems Using Aho-Corasick

Practice problems to master the algorithm:

  1. Multi-pattern search in text (Find multiple words in a given paragraph).
  2. Spam word detection (Filter messages with banned words).
  3. Plagiarism checker (Detect copied sentences in an essay).
  4. DNA sequence search (Find patterns in genome data).
  5. Autocomplete system (Suggest words while typing).
  6. Intrusion detection system (IDS) (Find malicious keywords in network logs).
  7. Keyword-based sentiment analysis (Detect positive/negative phrases in reviews).
  8. Hashtag monitoring system (Track trending hashtags in tweets).
  9. Log error detection (Find specific error codes in server logs).
  10. Profanity filter (Censor inappropriate words in chat applications).

20.2 Implement Aho-Corasick in a System Design Problem

Use Aho-Corasick in a large-scale application:

20.3 Practice Implementing Aho-Corasick Under Time Constraints

To improve coding speed, try: