Levenshtein Distance (Edit Distance) - CSU083 | Shoolini University

Levenshtein Distance (Edit Distance, DP-based)

1. Prerequisites

Before understanding Levenshtein Distance, you should be familiar with the following concepts:

2. What is Levenshtein Distance?

Levenshtein Distance, also known as Edit Distance, measures the minimum number of operations required to transform one string into another. The allowed operations are:

It is commonly implemented using Dynamic Programming by constructing a matrix where each cell represents the edit distance between substrings.

3. Why Does This Algorithm Exist?

Levenshtein Distance has many real-world applications, including:

4. When Should You Use It?

Levenshtein Distance is useful when:

However, for large datasets, it can be computationally expensive, so optimizations like trie-based approaches or approximate string matching may be needed.

5. How Does It Compare to Alternatives?

"
Algorithm Operations Allowed Time Complexity Best Use Case
Levenshtein Distance Insertion, Deletion, Substitution O(m×n) General-purpose string similarity
Damerau-Levenshtein Distance Insertion, Deletion, Substitution, Transposition O(m×n) Handling typos where letters are swapped
Hamming Distance Substitution only O(n) Equal-length strings (e.g., DNA comparison)
Jaro-Winkler Distance Transpositions & prefix weighting O(n²) Short strings, name matching

Strengths of Levenshtein Distance:

Weaknesses:

6. Basic Implementation

Below is the basic implementation of the Levenshtein Distance algorithm using Dynamic Programming in Python.

def levenshtein_distance(s1, s2):
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    # Initialize the base cases
    for i in range(m + 1):
        dp[i][0] = i  # Cost of deleting characters
    for j in range(n + 1):
        dp[0][j] = j  # Cost of inserting characters

    # Fill the DP table
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:  # Characters match
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(
                    dp[i - 1][j],    # Deletion
                    dp[i][j - 1],    # Insertion
                    dp[i - 1][j - 1] # Substitution
                )

    return dp[m][n]

# Example usage:
word1 = "kitten"
word2 = "sitting"
print(levenshtein_distance(word1, word2))  # Output: 3

7. Dry Run (Step-by-Step Execution)

We will dry-run the algorithm with the input strings s1 = "kitten" and s2 = "sitting".

7.1 Initial DP Table

We initialize a table dp[m+1][n+1] where each cell represents the edit distance between substrings.


     ""  s  i  t  t  i  n  g
  ---------------------------
"" | 0  1  2  3  4  5  6  7
k  | 1  
i  | 2  
t  | 3  
t  | 4  
e  | 5  
n  | 6  

7.2 Filling the Table Step by Step

7.3 Final DP Table After Computation


     ""  s  i  t  t  i  n  g
  ---------------------------
"" | 0  1  2  3  4  5  6  7
k  | 1  1  2  3  4  5  6  7
i  | 2  2  1  2  3  4  5  6
t  | 3  3  2  1  2  3  4  5
t  | 4  4  3  2  1  2  3  4
e  | 5  5  4  3  2  2  3  4
n  | 6  6  5  4  3  3  2  3

7.4 Extracting the Answer

The bottom-right cell dp[6][7] = 3 represents the minimum edit distance between "kitten" and "sitting".

7.5 Breakdown of Operations

To transform "kitten" to "sitting", we perform:

8. Time & Space Complexity Analysis

8.1 Worst-Case Time Complexity

In the worst case, every character in both strings is different, requiring us to compute the full DP table.

$$ O(m \times n) $$

8.2 Best-Case Time Complexity

If the two strings are identical, we only traverse the diagonal without performing insertions, deletions, or substitutions.

$$ O(\min(m, n)) $$

8.3 Average-Case Time Complexity

In real-world scenarios, most strings have minor differences, requiring computation for most dp table cells. This makes the expected complexity:

$$ O(m \times n) $$

9. Space Complexity Analysis

9.1 Naïve DP Approach (2D Table)

9.2 Optimized Space Approach (1D Table)

9.3 Recursive Approach (Inefficient)

10. Trade-Offs

10.1 Time vs. Space

10.2 Recursive vs. Iterative Approach

10.3 Alternatives with Lower Complexity

11. Optimizations & Variants

11.1 Common Optimizations

11.2 Variants of Levenshtein Distance

"
Variant Operations Allowed Use Case
Damerau-Levenshtein Distance Insertion, Deletion, Substitution, Transposition Handling common typos (e.g., swapping adjacent letters)
Restricted Edit Distance Insertion, Deletion, Transposition Applications where substitutions are not allowed
Hamming Distance Substitution only Binary strings or fixed-length sequences
Jaro-Winkler Distance Weighted similarity based on common prefixes Name matching, search queries

11.3 Optimized Implementation (Space-Efficient)

def levenshtein_optimized(s1, s2):
    m, n = len(s1), len(s2)
    if m < n:  # Ensure s1 is the longer string
        s1, s2 = s2, s1
        m, n = n, m
    
    prev = list(range(n + 1))
    curr = [0] * (n + 1)

    for i in range(1, m + 1):
        curr[0] = i
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                curr[j] = prev[j - 1]
            else:
                curr[j] = 1 + min(prev[j], curr[j - 1], prev[j - 1])
        prev, curr = curr, prev  # Swap rows to save space

    return prev[n]

print(levenshtein_optimized("kitten", "sitting"))  # Output: 3

12. Iterative vs. Recursive Implementations

12.1 Recursive Approach

Recursive implementation follows the direct mathematical definition of Levenshtein Distance:

def levenshtein_recursive(s1, s2, m, n):
    if m == 0:
        return n
    if n == 0:
        return m
    if s1[m - 1] == s2[n - 1]:
        return levenshtein_recursive(s1, s2, m - 1, n - 1)
    return 1 + min(
        levenshtein_recursive(s1, s2, m - 1, n),   # Deletion
        levenshtein_recursive(s1, s2, m, n - 1),   # Insertion
        levenshtein_recursive(s1, s2, m - 1, n - 1) # Substitution
    )

# Usage:
print(levenshtein_recursive("kitten", "sitting", 6, 7))  # Output: 3

Time Complexity: Exponential O(3^min(m, n)) (very inefficient without memoization).

Space Complexity: O(max(m, n)) due to recursive stack depth.

12.2 Iterative (DP) Approach

Iterative DP is significantly faster because it avoids redundant calculations.

12.3 Comparing Both Approaches

"
Implementation Time Complexity Space Complexity Pros Cons
Recursive O(3^min(m, n)) O(max(m, n)) Simple, intuitive Slow, redundant calculations, high memory use
Iterative DP O(m × n) O(m × n) or O(min(m, n)) Efficient, avoids redundant work More complex to implement

Conclusion: The recursive approach is only useful for understanding the problem, while the iterative DP method is optimal for practical use.

13. Edge Cases & Failure Handling

13.1 Common Pitfalls

13.2 Edge Cases

"
Input Strings Expected Output Reason
"", "" 0 Both strings are empty, no edits needed.
"abc", "" 3 All characters must be deleted.
"", "abc" 3 All characters must be inserted.
"kitten", "kitten" 0 Strings are identical, so no edits are required.
"kitten", "sitting" 3 Standard case: substitution + insertion + insertion.
"abc", "cba" 2 Two substitutions required.
"abcd", "abdc" 1 Single transposition needed, but Levenshtein does not account for transpositions.
"longstring", "short" 6 Must delete extra characters from "longstring."
"same", "same " 1 One insertion needed due to extra space.

14. Test Cases

14.1 Python Unit Tests

To verify correctness, we write test cases using Python's unittest framework.

import unittest

def levenshtein_distance(s1, s2):
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])

    return dp[m][n]

class TestLevenshteinDistance(unittest.TestCase):
    def test_empty_strings(self):
        self.assertEqual(levenshtein_distance("", ""), 0)

    def test_one_empty_string(self):
        self.assertEqual(levenshtein_distance("abc", ""), 3)
        self.assertEqual(levenshtein_distance("", "abc"), 3)

    def test_identical_strings(self):
        self.assertEqual(levenshtein_distance("kitten", "kitten"), 0)

    def test_typical_cases(self):
        self.assertEqual(levenshtein_distance("kitten", "sitting"), 3)
        self.assertEqual(levenshtein_distance("abc", "cba"), 2)

    def test_substitutions(self):
        self.assertEqual(levenshtein_distance("abcd", "abdc"), 1)

    def test_large_strings(self):
        self.assertEqual(levenshtein_distance("longstring", "short"), 6)

if __name__ == "__main__":
    unittest.main()

14.2 Expected Output

Running the above test cases should output:


.
----------------------------------------------------------------------
Ran 6 tests in 0.001s

OK

15. Real-World Failure Scenarios

15.1 Large Input Size Bottleneck

15.2 Unicode and Special Characters

15.3 Case Sensitivity

15.4 Transpositions Not Handled

15.5 Real-Time Processing Issues

16. Real-World Applications & Industry Use Cases

16.1 Common Use Cases

16.2 Industry Use Cases

"
Industry Application Example
Search Engines Spelling correction, query expansion Google "Did you mean?" feature
Healthcare Genomic research, medical transcription DNA sequence alignment
E-commerce Product search, customer name matching Amazon's search recommendations
Cybersecurity Phishing detection, anomaly detection Identifying lookalike domain names
Finance Fraud detection, account deduplication Detecting similar customer records

17. Open-Source Implementations

17.1 Python Libraries

17.2 Example Usage of Open-Source Libraries

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

# Simple comparison
print(fuzz.ratio("kitten", "sitting"))  # Output: 72

# Find best match from a list
choices = ["sitting", "bitten", "mitten", "written"]
print(process.extractOne("kitten", choices))  # Output: ('mitten', 80)

17.3 Command-Line Tool (Linux)

The Levenshtein algorithm is implemented in Unix tools like diff for text comparison.

$ diff file1.txt file2.txt

17.4 GitHub Projects Using Levenshtein Distance

18. Practical Project: Fuzzy Name Matching

18.1 Project Overview

This script finds the closest match for a given name from a database using Levenshtein distance.

18.2 Python Implementation

from fuzzywuzzy import process

def find_closest_name(query_name, name_list):
    match, score = process.extractOne(query_name, name_list)
    return match, score

# Sample database of names
names = ["John Doe", "Jane Doe", "Johnny Depp", "Jonathan", "Joanna"]

# User input
input_name = "Jhon Deo"
best_match, confidence = find_closest_name(input_name, names)

print(f"Did you mean: {best_match}? (Confidence: {confidence}%)")

18.3 Expected Output


Did you mean: John Doe? (Confidence: 90%)

18.4 Potential Use Cases

18.5 Further Enhancements

19. Competitive Programming & System Design Integration

19.1 Competitive Programming Considerations

Levenshtein Distance is frequently tested in coding competitions under constraints like:

19.2 Example Competitive Programming Problem

Problem Statement: Given two words, find the minimum number of edits required to convert one into the other.

import sys

def levenshtein_distance(s1, s2):
    m, n = len(s1), len(s2)
    if m < n:
        s1, s2 = s2, s1
        m, n = n, m
    
    prev = list(range(n + 1))
    curr = [0] * (n + 1)

    for i in range(1, m + 1):
        curr[0] = i
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                curr[j] = prev[j - 1]
            else:
                curr[j] = 1 + min(prev[j], curr[j - 1], prev[j - 1])
        prev, curr = curr, prev  # Swap rows to save space

    return prev[n]

# Fast I/O for large test cases
input = sys.stdin.read
data = input().splitlines()
word1, word2 = data[0], data[1]
print(levenshtein_distance(word1, word2))

19.3 System Design Integration

Levenshtein Distance is useful in large-scale systems:

Example: Design a scalable typo-tolerant search system.

20. Assignments

20.1 Solve at least 10 problems using this algorithm

Practice Levenshtein Distance in various problem scenarios:

  1. Basic Edit Distance: Implement the DP solution.
  2. Space-Optimized Edit Distance: Use O(min(m, n)) space.
  3. Damerau-Levenshtein Distance: Handle transpositions.
  4. Batch Processing: Compute distances for multiple pairs.
  5. Fuzzy Search: Find closest words in a dictionary.
  6. DNA Sequence Matching: Compare genomic data.
  7. Speech Recognition: Use it for transcript correction.
  8. Plagiarism Detection: Compare document similarities.
  9. Competitive Coding: Solve a Levenshtein problem under time constraints.
  10. Real-time Spell Checking: Implement a fast spell corrector.

20.2 Use Levenshtein Distance in a System Design Problem

Design a large-scale system using this algorithm. Choose one:

20.3 Practice Implementing Under Time Constraints

Time yourself while implementing:

Track progress and aim for efficiency improvements.