Edit Distance Algorithm Is Optimal

Written by Mike James

Wednesday, 17 June 2015

The edit distance is a measure of how close two strings are and it is used in a lot of important applications including spell checkers and genome analysis. Currently the best known algorithm takes O(n^2) operations and this is often too slow. Now we have a proof that you can't do any better.

theory

The edit distance, also called the Levenshtein distance, between two strings is easy to define. It is simply the smallest number of deletes, inserts or substitutes that you need to turn one string into another.

Another way to say it is that it is the smallest number of point mutations needed to convert one string to another. This terminology comes from genetics where the edit distance is used to measure how far one DNA strand is from another.

You can also guess that it is useful in spelling correctors measuring how close a misspelled word is to possible correct candidates. In the spelling application the edit distance is often modified to allow transposition between two adjacent characters because this is a common error.

The edit distance is easy to define but difficult to compute. For example what is the edit distance between abc and bca? You might well think that its three - substitute a for b, b for c and c for a but it isn't. In fact you can change the second string into the first by deleting the a and inserting an a at the start - giving an edit distance of 2.

It almost seems that you have to have some intelligence to work out the edit distance because of the need to find the smallest number of point edits. In fact there is a recursive definition that leads to a recursive algorithm.

Clearly if one of the two strings S1, S2 is null the distance is just the length of the non-null string or zero if both are null.

If both are non-null and we know the edit distance of the S1-1 and S2-1 i.e. both strings with the final character removed then the edit distance is the minimum of three cases:

Edit(S1-1,S2-1)+1 if the final characters are the different and Edit(S1-1,S2-1) if they are the same.

Edit(S1-1,S2)+1
and
Edit(S1,S2-1)+1

Let's try this on abc and bca. Both strings are non-null so we need to work out:

Edit(ab,bc)=2 delete c insert a

Edit(ab,bca)=3 delete c delete b insert b

Edit(abc,bc)=1 insert a

Using these values we get

Edit(abc,bca) = min (2+1,3,1+1)=2

Notice that in working out the values of Edit(ab,bc) and the rest we would have used the recursion until we reached Edit(a,null), Edit(null,b) and Edit(null,null) and then worked forward.

This recursive definition is exact but slow. The usual way of working things out it to give up on the recursion and simply work forward from Edit(a,null) Edit(null,b) and Edit(null,null). This can be organized in a table that can be filled in a row at a time. All of the entries with null are trivial and are just the length of the string:

null b bc bca

null 0 1 2 3

a 1

ab 2

abc 3

The rest of the next row can be computed using only the values we already have.

As a!=b

Edit(a,b)= min(Edit(null,b),Edit(a,null),Edit(null,null))+1=1

As a!=c

Edit(a,bc)=min(Edit(null,bc),Edit(a,b),Edit(null,b))+1=2

As a==a

Edit(a,bca)=Edit(null,bc)=2

Now we have the next row of the table:

null b bc bca

null 0 1 2 3

a 1 1 2 2

ab 2

abc 3

To get the third row we use the same method:

As b=b

Edit(ab,b)= Edit(a,null)=1

As b!=c

Edit(ab,bc)=min(Edit(a,bc),Edit(ab,b),Edit(a,b))+1=2

As b!=a

Edit(ab,bca)=min(Edit(a,bca),Edit(ab,bc),Edit(a,bc))+1=3

This makes the table:

null b bc bca

null 0 1 2 3

a 1 1 2 2

ab 2 1 2 3

abc 3

The final row is computed in the same way:

As c!=b

Edit(abc,b)= min(Edit(ab,b),Edit(abc,null),Edit(ab,null))+1=2

As c==c

Edit(abc,bc)=Edit(ab,b))=1

As c!=a

Edit(abc,bca)=min(Edit(ab,bca),Edit(abc,bc),Edit(ab,bc))+1=2

The complete table is:

null b bc bca

null 0 1 2 3

a 1 1 2 2

ab 2 1 2 3

abc 3 2 1 2

and we can read off the edit distance between abc and bca as 2.

You can see that this algorithm, a slight modification of the Wagner-Fischer algorithm works in roughly n-squared operations for strings of length n. For finding the edit distance of DNA sequences with millions of base pairs this is too slow. even though it isn't exponential. Because of its practical importance a lot of time has been devoted to searching for a better algorithm that would be sub-quadratic.

Now we know that all that work has been wasted because two MIT researchers have proved that quadratic is almost certainly optimal.

The proof is interesting in itself and has resulted in the technique being applied to other outstanding problems. The idea is that it is demonstrated that the existence of a sub-quadratic algorithm for the edit distance would imply the existence of a sub-exponential algorithm for the 3-SAT problem. The 3-SAT problem is simply finding a set of values for three variables that makes a logical expression true - and in general this is NP complete. The strong exponential time hypothesis states that this is impossible and all NP complete problems need exponential time to solve.

So as long as the strong exponential time hypothesis holds, there is no magic sub-quadratic algorithm for the edit distance. The majority of computer scientists, but not all of them, seem to believe that the strong exponential time hypothesis does hold. It is true that it if did hold then P!=NP, but if it is not true it has no implication for the P!=NP question, i.e. it can be false and P!=NP. So it is not "protected" by the likelihood that P does indeed not equal NP.

What this means is that some might still try to find a faster edit distance algorithm. if only to refute the strong exponential time hypothesis.

theory

More Information

Edit Distance Cannot Be Computed in Strongly Subquadratic Time (unless SETH is false) Arturs Backurs, Piotr Indyk

Longstanding problem put to rest

Coding Contest Outperforms Megablast

A New DNA Sequence Search - Compressive Genomics

Candy Crush Is Harder Than It Sounds - NP Hard

The Machine In The Ghost

Complexity Theorist Gets Abel Prize

Tensor Operations Are NP Hard

Vertex Coverings And The Cool Kids Problem

Finding Solutions To Diophantine Equations By Smell

Travelling Salesman - A Movie About P=NP

Physics Is NP Hard

Computational Complexity

Computability

Pancake flipping is hard - NP hard

To be informed about new articles on I Programmer, install the I Programmer Toolbar, subscribe to the RSS feed, follow us on, Twitter, Facebook, Google+ or Linkedin, or sign up for our weekly newsletter.

Google Introduces Gemini Canvas
31/03/2025

Google has added a new tool to Gemini. Canvas is a new interactive space within Gemini that can be used to create and refine documents or code.

+ Full Story

Kafka 4 Adds Queue Semantics Support
17/04/2025

Kafka 4.0 has been released, with major changes. This is the first version to operate entirely in KRaft mode by default. It also adds a new consumer group protocol designed to dramatically improve reb [ ... ]

+ Full Story

More News

Comments

or email your comment to: comments@i-programmer.info

Last Updated ( Wednesday, 17 June 2015 )

More Information

Related Articles

Comments