TDSM 10.5

From The Data Science Design Manual Wikia
Jump to: navigation, search

Definition

The definition of edit distance: Given sequences x, y, and functions insert, delete, and substitute. Associate cost with each insert, delete and substitute operations. Edit distance is the minimum cost of any edit sequences that transforms x into y. 

Note that the cost is usually unit one. If two strings are similar, then their edit distance is small. The edit distance problem can be solved by dynamic programming in O( |x|•|y| ).

Examples:

edit(“kit”, “kite") = 1. “kit” —> insert one char ‘e’ —> “kite”

edit(“kite”, “kit”) = 1. “kite” —> delete one char ‘e’ —> “kit”

edit(“kite”, “cite”) = 1. “kite” —> substitute ‘k’ with ‘c’ —>”cite”

edit(“bob”, “bob”) = 0. “bob” —> no change needed —> “bob”

edit(“caitao”, “zhan”) = 5. “caitao” —> “aitao” —> “itao” —> “ztao” —> “zhao” —> “zhan”


Proof

To prove edit distance being a metric, it has to meet the four following requirements:

1. Positivity: d(x, y) ≥ 0 for all x and y.


2. Identity: d(x, y) = 0 if and only if x = y.


3. Symmetry: d(x, y) = d(y, x) for all x and y.


4. Triangle inequality: d(x, y) ≤ d(x, z) + d(z, y) for all x, y, and z.


From the definition of edit distance, 1 and 2 are obviously true. Now consider 3 and 4.


Prove 3:

Assume we transform x to y using an edit sequences of insert_xy, delete_xy, and substitute_xy. Then we can transform y to x by reversing insert_xy into delete_yx, delete_xy into insert_yx, substitute_xy into substitute_yx. The number of operations is the same. If transforming x to y is the minimum cost, then transforming y to x will also be the minimum cost. The edit distance is the same.

For example,

edit(“caitao”, “zhan”) = 5. “caitao” —> “aitao” —> “itao” —> “ztao” —> “zhao” —> “zhan”

edit(“zhan”, “caitao”) = 5. “zhan” —> “zhao” —> “ztao” —> “itao” —> “aitao” —> “caitao”


Prove 4:

According to the definition of edit distance, the meaning of d(x, y) is the minimum cost of transforming x to y. Note that it starts from x and ends at y.

According to the definition of edit distance, the meaning of d(x, z) + d(z, y) is the minimum cost of transforming x to z, then transform z to y. Note that it also starts from x and ends at y. We can compose the sequences of operations from x to z (oper_1 … oper_m) and z to y (oper_1 … oper_n ) together into one sequence (oper_1 … oper_m … oper_1 … oper_n ). This composition of x to z and z to y is a valid edit from x to y. The number of this composed sequence’s operation must ≥ edit(x, y) according to the definition of edit distance.

Therefore, Triangle inequality: d(x, y) ≤ d(x, z) + d(z, y) is proved.