// Copyright 2022 The Go Authors. All rights reserved. | |

// Use of this source code is governed by a BSD-style | |

// license that can be found in the LICENSE file. | |

// package lcs contains code to find longest-common-subsequences | |

// (and diffs) | |

package lcs | |

/* | |

Compute longest-common-subsequences of two slices A, B using | |

algorithms from Myers' paper. A longest-common-subsequence | |

(LCS from now on) of A and B is a maximal set of lexically increasing | |

pairs of subscripts (x,y) with A[x]==B[y]. There may be many LCS, but | |

they all have the same length. An LCS determines a sequence of edits | |

that changes A into B. | |

The key concept is the edit graph of A and B. | |

If A has length N and B has length M, then the edit graph has | |

vertices v[i][j] for 0 <= i <= N, 0 <= j <= M. There is a | |

horizontal edge from v[i][j] to v[i+1][j] whenever both are in | |

the graph, and a vertical edge from v[i][j] to f[i][j+1] similarly. | |

When A[i] == B[j] there is a diagonal edge from v[i][j] to v[i+1][j+1]. | |

A path between in the graph between (0,0) and (N,M) determines a sequence | |

of edits converting A into B: each horizontal edge corresponds to removing | |

an element of A, and each vertical edge corresponds to inserting an | |

element of B. | |

A vertex (x,y) is on (forward) diagonal k if x-y=k. A path in the graph | |

is of length D if it has D non-diagonal edges. The algorithms generate | |

forward paths (in which at least one of x,y increases at each edge), | |

or backward paths (in which at least one of x,y decreases at each edge), | |

or a combination. (Note that the orientation is the traditional mathematical one, | |

with the origin in the lower-left corner.) | |

Here is the edit graph for A:"aabbaa", B:"aacaba". (I know the diagonals look weird.) | |

⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ | |

a | ___/‾‾‾ | ___/‾‾‾ | | | ___/‾‾‾ | ___/‾‾‾ | | |

⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ | |

b | | | ___/‾‾‾ | ___/‾‾‾ | | | | |

⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ | |

a | ___/‾‾‾ | ___/‾‾‾ | | | ___/‾‾‾ | ___/‾‾‾ | | |

⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ | |

c | | | | | | | | |

⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ | |

a | ___/‾‾‾ | ___/‾‾‾ | | | ___/‾‾‾ | ___/‾‾‾ | | |

⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ | |

a | ___/‾‾‾ | ___/‾‾‾ | | | ___/‾‾‾ | ___/‾‾‾ | | |

⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ | |

a a b b a a | |

The algorithm labels a vertex (x,y) with D,k if it is on diagonal k and at | |

the end of a maximal path of length D. (Because x-y=k it suffices to remember | |

only the x coordinate of the vertex.) | |

The forward algorithm: Find the longest diagonal starting at (0,0) and | |

label its end with D=0,k=0. From that vertex take a vertical step and | |

then follow the longest diagonal (up and to the right), and label that vertex | |

with D=1,k=-1. From the D=0,k=0 point take a horizontal step and the follow | |

the longest diagonal (up and to the right) and label that vertex | |

D=1,k=1. In the same way, having labelled all the D vertices, | |

from a vertex labelled D,k find two vertices | |

tentatively labelled D+1,k-1 and D+1,k+1. There may be two on the same | |

diagonal, in which case take the one with the larger x. | |

Eventually the path gets to (N,M), and the diagonals on it are the LCS. | |

Here is the edit graph with the ends of D-paths labelled. (So, for instance, | |

0/2,2 indicates that x=2,y=2 is labelled with 0, as it should be, since the first | |

step is to go up the longest diagonal from (0,0).) | |

A:"aabbaa", B:"aacaba" | |

⊙ ------- ⊙ ------- ⊙ -------(3/3,6)------- ⊙ -------(3/5,6)-------(4/6,6) | |

a | ___/‾‾‾ | ___/‾‾‾ | | | ___/‾‾‾ | ___/‾‾‾ | | |

⊙ ------- ⊙ ------- ⊙ -------(2/3,5)------- ⊙ ------- ⊙ ------- ⊙ | |

b | | | ___/‾‾‾ | ___/‾‾‾ | | | | |

⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ -------(3/5,4)------- ⊙ | |

a | ___/‾‾‾ | ___/‾‾‾ | | | ___/‾‾‾ | ___/‾‾‾ | | |

⊙ ------- ⊙ -------(1/2,3)-------(2/3,3)------- ⊙ ------- ⊙ ------- ⊙ | |

c | | | | | | | | |

⊙ ------- ⊙ -------(0/2,2)-------(1/3,2)-------(2/4,2)-------(3/5,2)-------(4/6,2) | |

a | ___/‾‾‾ | ___/‾‾‾ | | | ___/‾‾‾ | ___/‾‾‾ | | |

⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ | |

a | ___/‾‾‾ | ___/‾‾‾ | | | ___/‾‾‾ | ___/‾‾‾ | | |

⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ | |

a a b b a a | |

The 4-path is reconstructed starting at (4/6,6), horizontal to (3/5,6), diagonal to (3,4), vertical | |

to (2/3,3), horizontal to (1/2,3), vertical to (0/2,2), and diagonal to (0,0). As expected, | |

there are 4 non-diagonal steps, and the diagonals form an LCS. | |

There is a symmetric backward algorithm, which gives (backwards labels are prefixed with a colon): | |

A:"aabbaa", B:"aacaba" | |

⊙ -------- ⊙ -------- ⊙ -------- ⊙ -------- ⊙ -------- ⊙ -------- ⊙ | |

a | ____/‾‾‾ | ____/‾‾‾ | | | ____/‾‾‾ | ____/‾‾‾ | | |

⊙ -------- ⊙ -------- ⊙ -------- ⊙ -------- ⊙ --------(:0/5,5)-------- ⊙ | |

b | | | ____/‾‾‾ | ____/‾‾‾ | | | | |

⊙ -------- ⊙ -------- ⊙ --------(:1/3,4)-------- ⊙ -------- ⊙ -------- ⊙ | |

a | ____/‾‾‾ | ____/‾‾‾ | | | ____/‾‾‾ | ____/‾‾‾ | | |

(:3/0,3)--------(:2/1,3)-------- ⊙ --------(:2/3,3)--------(:1/4,3)-------- ⊙ -------- ⊙ | |

c | | | | | | | | |

⊙ -------- ⊙ -------- ⊙ --------(:3/3,2)--------(:2/4,2)-------- ⊙ -------- ⊙ | |

a | ____/‾‾‾ | ____/‾‾‾ | | | ____/‾‾‾ | ____/‾‾‾ | | |

(:3/0,1)-------- ⊙ -------- ⊙ -------- ⊙ --------(:3/4,1)-------- ⊙ -------- ⊙ | |

a | ____/‾‾‾ | ____/‾‾‾ | | | ____/‾‾‾ | ____/‾‾‾ | | |

(:4/0,0)-------- ⊙ -------- ⊙ -------- ⊙ --------(:4/4,0)-------- ⊙ -------- ⊙ | |

a a b b a a | |

Neither of these is ideal for use in an editor, where it is undesirable to send very long diffs to the | |

front end. It's tricky to decide exactly what 'very long diffs' means, as "replace A by B" is very short. | |

We want to control how big D can be, by stopping when it gets too large. The forward algorithm then | |

privileges common prefixes, and the backward algorithm privileges common suffixes. Either is an undesirable | |

asymmetry. | |

Fortunately there is a two-sided algorithm, implied by results in Myers' paper. Here's what the labels in | |

the edit graph look like. | |

A:"aabbaa", B:"aacaba" | |

⊙ --------- ⊙ --------- ⊙ --------- ⊙ --------- ⊙ --------- ⊙ --------- ⊙ | |

a | ____/‾‾‾‾ | ____/‾‾‾‾ | | | ____/‾‾‾‾ | ____/‾‾‾‾ | | |

⊙ --------- ⊙ --------- ⊙ --------- (2/3,5) --------- ⊙ --------- (:0/5,5)--------- ⊙ | |

b | | | ____/‾‾‾‾ | ____/‾‾‾‾ | | | | |

⊙ --------- ⊙ --------- ⊙ --------- (:1/3,4)--------- ⊙ --------- ⊙ --------- ⊙ | |

a | ____/‾‾‾‾ | ____/‾‾‾‾ | | | ____/‾‾‾‾ | ____/‾‾‾‾ | | |

⊙ --------- (:2/1,3)--------- (1/2,3) ---------(2:2/3,3)--------- (:1/4,3)--------- ⊙ --------- ⊙ | |

c | | | | | | | | |

⊙ --------- ⊙ --------- (0/2,2) --------- (1/3,2) ---------(2:2/4,2)--------- ⊙ --------- ⊙ | |

a | ____/‾‾‾‾ | ____/‾‾‾‾ | | | ____/‾‾‾‾ | ____/‾‾‾‾ | | |

⊙ --------- ⊙ --------- ⊙ --------- ⊙ --------- ⊙ --------- ⊙ --------- ⊙ | |

a | ____/‾‾‾‾ | ____/‾‾‾‾ | | | ____/‾‾‾‾ | ____/‾‾‾‾ | | |

⊙ --------- ⊙ --------- ⊙ --------- ⊙ --------- ⊙ --------- ⊙ --------- ⊙ | |

a a b b a a | |

The algorithm stopped when it saw the backwards 2-path ending at (1,3) and the forwards 2-path ending at (3,5). The criterion | |

is a backwards path ending at (u,v) and a forward path ending at (x,y), where u <= x and the two points are on the same | |

diagonal. (Here the edgegraph has a diagonal, but the criterion is x-y=u-v.) Myers proves there is a forward | |

2-path from (0,0) to (1,3), and that together with the backwards 2-path ending at (1,3) gives the expected 4-path. | |

Unfortunately the forward path has to be constructed by another run of the forward algorithm; it can't be found from the | |

computed labels. That is the worst case. Had the code noticed (x,y)=(u,v)=(3,3) the whole path could be reconstructed | |

from the edgegraph. The implementation looks for a number of special cases to try to avoid computing an extra forward path. | |

If the two-sided algorithm has stop early (because D has become too large) it will have found a forward LCS and a | |

backwards LCS. Ideally these go with disjoint prefixes and suffixes of A and B, but disjointness may fail and the two | |

computed LCS may conflict. (An easy example is where A is a suffix of B, and shares a short prefix. The backwards LCS | |

is all of A, and the forward LCS is a prefix of A.) The algorithm combines the two | |

to form a best-effort LCS. In the worst case the forward partial LCS may have to | |

be recomputed. | |

*/ | |

/* Eugene Myers paper is titled | |

"An O(ND) Difference Algorithm and Its Variations" | |

and can be found at | |

http://www.xmailserver.org/diff2.pdf | |

(There is a generic implementation of the algorithm the the repository with git hash | |

b9ad7e4ade3a686d608e44475390ad428e60e7fc) | |

*/ |