Suffix Array & Suffix Tree - CSU083 | Shoolini University

Suffix Array & Suffix Tree

1. Prerequisites

To understand Suffix Arrays and Suffix Trees, you must be familiar with the following concepts:

2. Core Concepts

2.1 Suffix Array

A Suffix Array is an array of integers that represents the starting indices of sorted suffixes of a given string.

Example: For the string "banana", the suffixes are:


banana
anana
nana
ana
na
a

The sorted order of these suffixes and their starting indices gives the Suffix Array:


Suffix Array: [5, 3, 1, 0, 4, 2]

2.2 Suffix Tree

A Suffix Tree is a compressed trie containing all suffixes of a given string.

It provides linear-time operations for:

It is a tree where each edge represents a substring, and each path from the root to a leaf represents a suffix.

3. Why Does This Algorithm Exist?

Suffix Arrays and Suffix Trees are designed for efficient string processing. Their key applications include:

4. When Should You Use It?

Suffix Arrays and Suffix Trees should be used when:

5. How Does It Compare to Alternatives?

5.1 Suffix Array vs. Suffix Tree

Feature Suffix Array Suffix Tree
Memory Usage Lower Higher
Construction Time O(n log n) (Radix Sort O(n)) O(n)
Pattern Search O(m log n) O(m)
Space Efficiency Better Worse

5.2 Alternative Approaches

6. Basic Implementation

6.1 Suffix Array Implementation in Python

The Suffix Array is built by sorting all suffixes of a string and storing their starting indices.


def build_suffix_array(s):
    suffixes = [(s[i:], i) for i in range(len(s))]
    suffixes.sort()  # Sort lexicographically
    suffix_array = [suffix[1] for suffix in suffixes]
    return suffix_array

# Example usage:
text = "banana"
suffix_array = build_suffix_array(text)
print("Suffix Array:", suffix_array)

6.2 Suffix Tree Implementation in Python

Suffix Trees use a tree-like structure for efficient string operations.


class SuffixTreeNode:
    def __init__(self):
        self.children = {}
        self.index = []

class SuffixTree:
    def __init__(self, text):
        self.root = SuffixTreeNode()
        self.text = text
        for i in range(len(text)):
            self._insert_suffix(text[i:], i)

    def _insert_suffix(self, suffix, index):
        node = self.root
        for char in suffix:
            if char not in node.children:
                node.children[char] = SuffixTreeNode()
            node = node.children[char]
            node.index.append(index)

    def search(self, pattern):
        node = self.root
        for char in pattern:
            if char not in node.children:
                return []
            node = node.children[char]
        return node.index

# Example usage:
text = "banana"
tree = SuffixTree(text)
print("Pattern 'ana' found at indices:", tree.search("ana"))

7. Dry Run: Step-by-Step Execution

7.1 Dry Run of Suffix Array Construction

Given text = "banana", we generate suffixes:

Suffix Index
"banana" 0
"anana" 1
"nana" 2
"ana" 3
"na" 4
"a" 5

Sorting the suffixes lexicographically:

Sorted Suffix Original Index
"a" 5
"ana" 3
"anana" 1
"banana" 0
"na" 4
"nana" 2

Final Suffix Array: [5, 3, 1, 0, 4, 2]

7.2 Dry Run of Suffix Tree for Search Query "ana"

For text = "banana", the suffix tree is constructed by inserting:

Searching for "ana":

Pattern "ana" found at positions: [1, 3]

8. Time & Space Complexity Analysis

8.1 Suffix Array Complexity

8.2 Suffix Tree Complexity

9. Space Consumption Analysis

9.1 Suffix Array Space Growth

9.2 Suffix Tree Space Growth

10. Trade-offs Between Suffix Array & Suffix Tree

Factor Suffix Array Suffix Tree
Construction Time O(n log n) O(n)
Search Time O(m log n) O(m)
Space Complexity O(n) O(n^2) (worst-case)
Memory Efficiency Better Higher memory usage
Best Use Case Space-efficient search & compression (e.g., BWT) Fast substring operations (e.g., bioinformatics, search engines)

Key Takeaways:

11. Optimizations & Variants

11.1 Optimizations for Suffix Array

11.2 Optimizations for Suffix Tree

11.3 Variants of Suffix Array

11.4 Variants of Suffix Tree

12. Iterative vs. Recursive Implementations

12.1 Iterative Suffix Array Construction

Iterative algorithms are space-efficient as they avoid function call overhead. Radix sort-based suffix array construction is often implemented iteratively.


def build_suffix_array_iterative(s):
    n = len(s)
    suffixes = sorted([(s[i:], i) for i in range(n)])
    return [suffix[1] for suffix in suffixes]

text = "banana"
print("Iterative Suffix Array:", build_suffix_array_iterative(text))

12.2 Recursive Suffix Tree Construction

Recursive approaches provide a clearer logical structure but can lead to stack overflow for large inputs.


class SuffixTreeNode:
    def __init__(self):
        self.children = {}

class SuffixTree:
    def __init__(self, text):
        self.root = SuffixTreeNode()
        self.text = text
        self.build_recursive(0, self.root)

    def build_recursive(self, index, node):
        if index >= len(self.text):
            return
        char = self.text[index]
        if char not in node.children:
            node.children[char] = SuffixTreeNode()
        self.build_recursive(index + 1, node.children[char])

text = "banana"
tree = SuffixTree(text)

12.3 Comparison of Iterative vs. Recursive Approaches

Factor Iterative Recursive
Memory Usage Lower (no stack overhead) Higher (recursion stack used)
Performance Faster in practice Slower due to function calls
Code Simplicity More complex but efficient Readable but memory-heavy
Best Use Case Large-scale text processing Conceptual understanding, debugging

Key Takeaways:

13. Edge Cases & Failure Handling

13.1 Common Edge Cases

13.2 Failure Handling Strategies

14. Test Cases to Verify Correctness

14.1 Test Cases for Suffix Array


def test_suffix_array():
    assert build_suffix_array("") == []
    assert build_suffix_array("a") == [0]
    assert build_suffix_array("banana") == [5, 3, 1, 0, 4, 2]
    assert build_suffix_array("aaaa") == [3, 2, 1, 0]
    assert build_suffix_array("abc") == [2, 1, 0]
    assert build_suffix_array("12345") == [4, 3, 2, 1, 0]
    print("All test cases passed for Suffix Array.")

test_suffix_array()

14.2 Test Cases for Suffix Tree


def test_suffix_tree():
    tree = SuffixTree("banana")
    assert tree.search("ana") == [1, 3]
    assert tree.search("nana") == [2]
    assert tree.search("x") == []
    assert tree.search("") == []
    assert tree.search("banana") == [0]
    print("All test cases passed for Suffix Tree.")

test_suffix_tree()

15. Real-World Failure Scenarios

15.1 Bioinformatics (Genome Sequences)

15.2 Search Engines (Fast Query Matching)

15.3 Plagiarism Detection

15.4 Data Compression (BWT in bzip2)

15.5 Cybersecurity (Detecting Substring Patterns in Malware Analysis)

16. Real-World Applications & Industry Use Cases

16.1 Search Engines

16.2 Bioinformatics

16.3 Data Compression (bzip2, Burrows-Wheeler Transform)

16.4 Plagiarism Detection

16.5 Cybersecurity

16.6 Natural Language Processing (NLP)

17. Open-Source Implementations

17.1 Suffix Array Implementations

17.2 Suffix Tree Implementations

18. Practical Project: Keyword Search in Large Texts

18.1 Problem Statement

Build a simple tool that finds occurrences of a keyword in a large text file using a suffix array.

18.2 Implementation in Python


import bisect

def build_suffix_array(text):
    suffixes = sorted([(text[i:], i) for i in range(len(text))])
    return [suffix[1] for suffix in suffixes]

def binary_search(text, suffix_array, pattern):
    left, right = 0, len(suffix_array)
    while left < right:
        mid = (left + right) // 2
        suffix = text[suffix_array[mid]:]
        if suffix.startswith(pattern):
            return suffix_array[mid]  # Found
        elif suffix < pattern:
            left = mid + 1
        else:
            right = mid
    return -1  # Not found

# Example usage
text = "this is a sample text where we search for sample"
suffix_array = build_suffix_array(text)

pattern = "sample"
index = binary_search(text, suffix_array, pattern)

if index != -1:
    print(f"Pattern '{pattern}' found at index: {index}")
else:
    print("Pattern not found")

18.3 Expected Output


Pattern 'sample' found at index: 10

18.4 Extensions & Future Work

19. Competitive Programming & System Design Integration

19.1 Competitive Programming

Suffix Arrays and Suffix Trees are widely used in coding competitions for solving complex string problems efficiently.

Why are they useful?
Common Problem Categories

19.2 System Design Use Cases

Suffix Arrays and Suffix Trees are integrated into scalable systems where fast substring search is required.

Examples
Challenges in System Design

20. Assignments

20.1 Solve at Least 10 Problems Using This Algorithm

Practice these problems on coding platforms like Codeforces, Leetcode, or AtCoder.

  1. Find all occurrences of a pattern in a text using a suffix array.
  2. Find the longest repeated substring in a string.
  3. Find the longest common substring between two given strings.
  4. Lexicographically sort all suffixes of a string.
  5. Find the shortest unique substring in a given text.
  6. Count distinct substrings of a given string.
  7. Implement the Burrows-Wheeler Transform using a suffix array.
  8. Detect plagiarism by comparing multiple documents using suffix trees.
  9. Use a suffix array to search for multiple patterns in a text efficiently.
  10. Optimize text autocomplete using suffix trees.

20.2 Use It in a System Design Problem

Design a scalable text search system for a large dataset (e.g., a search engine for academic papers).

Requirements

20.3 Practice Implementing It Under Time Constraints

Set a strict 1-hour timer and try implementing:

Track your time and improve efficiency with repeated practice.