Universal Levenshtein Automata

Part 1: Levenshtein Distance and Delete Edits

November 25, 2021,
keywords: automata, fuzzy string matching, levenshtein

My recent discovery of (Non-deterministic) Universal Levenshtein Automata (NULA) have been really engaging creatively. In this series of posts, I am going to collect together some definitions, theorems, and proofs about Levenshtein Distances and NULAs, and then provide a second set of definitions so that we can use bitwise operations for the implementation of NULAs. The following list of posts in this series will be updated as I add more posts.

Levenshtein Distance

Definition: The Levenshtein Distance between two strings is the minimal number of character edits required to change one string into the other. A character edit is inserting a character, deleting a character, or substituting a character for another.

Example: The Levenshtein distance between "abc" and "abd" is 1. The character edit substitutes 'c' with 'd'.

Example: The Levenshtein distance between "abc" and "abdc" is 1. The character edit inserts 'd' before 'c'.

Example: The Levenshtein distance between "abc" and "ac" is 1. The character edit delete the 'b'.

Example: The Levenshtein distance between "abc" and "acd" is 2. There are a few ways of doing two edits to "abc" to get "acd". One way is to substitute the 'b' for 'c' and the 'c' for 'd'. Another way is to delete the 'b' and insert the 'd'. This shows that the character edits involved need not be unique.

In some areas of applications, it is useful to consider more kinds of edits. For example, in spell checking inputs from a keyboard, you might consider swapping two adjacent characters (transpositions) as an edit, and in optical character recognition you might consider splitting one character into two and merging two characters into one as edits.

Some Terminology

Let $t$ be a string of length $n$. We will represents the $i$-th character of $t$ as $t_i$. So $t = t_1 t_2 \cdots t_n$.

We will represent character edits as follows.

1. Substituting the $i$-th character will be represented as $s_i$.
2. Deleting the $i$-th character will be represented as $\varepsilon_i$.
3. An insert edit will be represented as $i_k$, where $k$ is the $k$-th insert edit.

We will not use $s$ or $i$ for strings, and it will be clear from context whether we are using $i$ as an index or as an insert edit.

We will use $\$ as a sentinel character when needed, often to represent the end of strings. Replacting $\$ will not be allowed for substitution edits.

Example: Let $t$ be the string "abc". Then $t_1$ = 'a', $t_2$ = 'b', and $t_3$ = 'c'. One way of minimally editing $t$ to get "acd" can be represented as $t_1 s_2 s_3$, where $s_2$ = 'c' and $s_3$ = 'd'. Another way of minimally editing $t$ to get "acd" can be represented as $t_1 \varepsilon_2 t_3 i_1$, where $i_1$ = 'd'.

Lemma: Let $p$ and $t$ be two strings. Given a set of minimal character edits to transform $p$ to $t$, we can rearrange the edits so that every maximal sequence of delete edits are followed by a character match or are at end of the string.
Suppose we have the sequence $\varepsilon_k \varepsilon_{k+1} \cdots \varepsilon_{k + j} i_n$ in $t$, i.e. we are deleting $j$ characters starting at character $p_k$ and then inserting a character. This can be rewritten as $i_n \varepsilon_k \varepsilon_{k+1} \cdots \varepsilon_{k + j}$. This changes the edits so that we are inserting a character first, and then deleting $j$ characters starting at $p_k$.
Suppose we have the sequence $\varepsilon_k \varepsilon_{k+1} \cdots \varepsilon_{k + j} s_{k + j + 1}$ in $t$ with $s_{k + j + 1} \neq p_{k + j + 1}$. Here we are deleting $j$ characters starting at character $p_k$ and then substituting the $k+j+1$-th character. This can be rewritten as $s_{k} \varepsilon_{k+1} \varepsilon_{k+2} \cdots \varepsilon_{k + j + 1}$. This changes the edits so that we are substituting the $k$-th character, and then deleting $j$ characters starting at $p_{k+1}$. Note that $s_k$ cannot equal $p_k$ since that would mean we found a smaller set of edits than the minimal set of edits we started with.
The above two steps can be repeated for maximal sequences of delete edits in $t$ until all maximal sequences of delete edits are followed by a match $p_k$ or are at the end of the string. $\square$