Edit Distance Algorithm Is Optimal
Written by Mike James   
Wednesday, 17 June 2015

The edit distance is a measure of how close two strings are and it is used in a lot of important applications including spell checkers and genome analysis. Currently the best known algorithm takes O(n^2) operations and this is often too slow. Now we have a proof that you can't do any better. 

theory

The edit distance, also called the Levenshtein distance, between two strings is easy to define. It is simply the smallest number of deletes, inserts or substitutes that you need to turn one string into another.

Another way to say it is that it is the smallest number of point mutations needed to convert one string to another. This terminology comes from genetics where the edit distance is used to measure how far one DNA strand is from another.

You can also guess that it is useful in spelling correctors measuring how close a misspelled word is to possible correct candidates. In the spelling application the edit distance is often modified to allow transposition between two adjacent characters because this is a common error. 

The edit distance is easy to define but difficult to compute. For example what is the edit distance between abc and bca? You might well think that its three - substitute a for b, b for c and c for a but it isn't. In fact you can change the second string into the first by deleting the a and inserting an a at the start - giving an edit distance of 2. 

It almost seems that you have to have some intelligence to work out the edit distance because of the need to find the smallest number of point edits. In fact there is a recursive definition that leads to a recursive algorithm.

Clearly if one of the two strings S1, S2 is null the distance is just the length of the non-null string or zero if both are null. 

If both are non-null and we know the edit distance of the S1-1 and S2-1 i.e. both strings with the final character removed then the edit distance is the minimum of three cases:

Edit(S1-1,S2-1)+1 if the final characters are the different and Edit(S1-1,S2-1) if they are the same.

Edit(S1-1,S2)+1
and
Edit(S1,S2-1)+1

Let's try this on abc and bca. Both strings are non-null so we need to work out:

Edit(ab,bc)=2   delete c insert a

Edit(ab,bca)=3  delete c delete b insert b

Edit(abc,bc)=1 insert a

Using these values we get 

Edit(abc,bca) = min (2+1,3,1+1)=2

Notice that in working out the values of Edit(ab,bc) and the rest we would have used the recursion until we reached Edit(a,null), Edit(null,b) and Edit(null,null) and then worked forward. 

This recursive definition is exact but slow. The usual way of working things out it to give up on the recursion and simply work forward from Edit(a,null) Edit(null,b) and Edit(null,null). This can be organized in a table that can be filled in a row at a time. All of the entries with null are trivial and are just the length of the string:

         null   b      bc     bca

null    0      1      2       3

a       1    

ab     2

abc   3

The rest of the next row can be computed using only the values we already have.

As a!=b

Edit(a,b)= min(Edit(null,b),Edit(a,null),Edit(null,null))+1=1

As a!=c

Edit(a,bc)=min(Edit(null,bc),Edit(a,b),Edit(null,b))+1=2

As a==a

Edit(a,bca)=Edit(null,bc)=2

Now we have the next row of the table:

         null    b     bc     bca

null    0       1     2       3

a       1        1    2        2     

ab     2

abc   3

To get the third row we use the same method:

As b=b

Edit(ab,b)= Edit(a,null)=1

As b!=c

Edit(ab,bc)=min(Edit(a,bc),Edit(ab,b),Edit(a,b))+1=2

As b!=a

Edit(ab,bca)=min(Edit(a,bca),Edit(ab,bc),Edit(a,bc))+1=3

This makes the table:

         null  b    bc     bca

null    0     1    2     3

a       1      1    2     2     

ab     2      1    2     3

abc   3

The final row is computed in the same way:

As c!=b

Edit(abc,b)= min(Edit(ab,b),Edit(abc,null),Edit(ab,null))+1=2

As c==c

Edit(abc,bc)=Edit(ab,b))=1

As c!=a

Edit(abc,bca)=min(Edit(ab,bca),Edit(abc,bc),Edit(ab,bc))+1=2

The complete table is:

         null     b     bc    bca

null    0       1      2      3

a       1       1      2       2     

ab     2       1      2       3

abc   3       2      1       2

and we can read off the edit distance between abc and bca as 2.

You can see that this algorithm, a slight modification of the Wagner-Fischer algorithm works in roughly n-squared operations for strings of length n. For finding the edit distance of DNA sequences with millions of base pairs this is too slow. even though it isn't exponential. Because of its practical importance a lot of time has been devoted to searching for a better algorithm that would be sub-quadratic. 

Now we know that all that work has been wasted because two MIT researchers have proved that quadratic is almost certainly optimal. 

The proof is interesting in itself and has resulted in the technique being applied to other outstanding problems. The idea is that it is demonstrated that the existence of a sub-quadratic algorithm for the edit distance would imply the existence of a sub-exponential algorithm for the 3-SAT problem. The 3-SAT problem is simply finding a set of values for three variables that makes a logical expression true - and in general this is NP complete. The strong exponential time hypothesis states that this is impossible and all NP complete problems need exponential time to solve.   

So as long as the strong exponential time hypothesis holds, there is no magic sub-quadratic algorithm for the edit distance. The majority of computer scientists, but not all of them, seem to believe that the strong exponential time hypothesis does hold. It is true that it if did hold then P!=NP, but if it is not true it has no implication for the P!=NP question, i.e. it can be false and P!=NP. So it is not "protected" by the likelihood that P does indeed not equal NP. 

What this means is that some might still try to find a faster edit distance algorithm. if only to refute the strong exponential time hypothesis.

theory

Banner


Getting Going With RAG
20/01/2025

IBM has produced a cookbook of tips and methodologies on how to use RAG to power up any kind of business applications. Microsoft and Docling both provide tools for data ingestion from a range of  [ ... ]



Demystifying GPU Terminology
17/01/2025

The developers at Modal have created the GPU Glossary to help themselves and others get to grips with termionology related to NVIDIA GPU hardware and software. They have managed to collect,  [ ... ]


More News

 

espbook

 

Comments




or email your comment to: comments@i-programmer.info

Last Updated ( Wednesday, 17 June 2015 )