Context sensitive diff implementation

It is clear that reporting a difference in terms of the smallest change to the structures you have isn't what you want; you want to report some context.

To do that, you have to identify what context there is to report, so you can decide what part of that is interesting. You sketched an idea where you fused certain elements of your structure together (e.g., 'SOP' '1' '2' into 'SOP 1 2'), but that seems to me to be going the wrong way. What it is doing is changing the size of the smallest structure elements, not reporting better context.

While I'm not sure this is the right approach, I'd try to characterize your structures using a grammar, e.g., a BNF. For instance, some grammar rules you might have would be:

 action = 'SOP' sop_item_list ;
 sop_item_list = NATURAL ;
 sop_item_list = sop_item_list ',' NATURAL ;

Now an actual SOP item can be characterized an an abstract syntax tree (show nested children, indexable by constants to get to subtrees):

 t1: action('SOP',sop_item_list(sop_item_list(NATURAL[1]),',',NATURAL[2]))

You still want to compute a difference using something like the dynamic programming algorithm you have suggested, but now you want a minimal tree delta. Done right (my company makes tools that do this for grammars for conventional computer languages, and you can find publicly available tree diff algorithms), you can get a delta like:

  replace t1[2] with op_item_list(sop_item_list(sop_item_list(NATURAL[1]),',',NATURAL[2]),',',NATURAL[3]

which in essence is what you got by gluing (SOP,1,2) into a single item, but without the external adhocery of you personally deciding to do that.

The real value in this, I think, it that you can use the tree t1 to report the context. In particular, you start at the root of the tree, and print summaries of the subtrees (clearly you don't want to print full subtrees, as that would just give you back the full original text). By printing subtrees down to a depth of 1 or two levels and eliding anything deep (e.g, representing a list as "..." and single subtree as "_"), you could print something like:

replace 'SOP' ... with 'SOP',...,3

which I think is what you are looking for in your specific example.

No, this isn't an algorithm; it is a sketch of an idea. The fact we have tree-delta algorithms that compute useful deltas, and the summarizing idea (taken from LISP debuggers, frankly) suggests this will probably generalize to something useful, or a least take you in a new direction.

Having your answer in terms of ASTs, should also make it relatively easy to produce HTML as you desire. (People work with XML all the time, and XML is basically a tree representation).


In case you would decide to proceed with what you already achieved (IMO you got pretty far already), you might consider doing two steps of diff.

Group words together, e.g. treat text such as "SOP 1, 2, 3" as one token to compare.

That's a good start; you already managed to make the context clear to the user.

This generates the text "Change 'SOP 1, 2' to 'SOP 1, 2, 3'". This get's cluttered when there is a large list and you are attempting to determine what actually changed.

How about doing another diff pass across the tokens found (i.e. compare 'SOP 1, 2' with 'SOP 1, 2, 3'), this time without the word grouping, to generate additional info? That would make the full output something like this:

Change 'SOP 1, 2' to 'SOP 1, 2, 3'

Change details: Add '3'

The text is a bit cryptic, so you may want to do some refining there. I would also suggest to truncate lengthy tokens in the first line ('SOP 1, 2, 3, ...'), since the second line should already provide sufficient detail.

I am not sure about the performance impact of this second pass; in a big text with many changes, you may experience many roundtrips to the diff functionality. You might optimize by accumulating changes from the first pass into one 'change document', run the second pass across it, then stitch the results together.

HTH.


Try thinking in terms of Prolog-style rewrite rules that transform your instructions into a canonical form that will cause the diff algorithm to produce what you need. The problem you specified would be solved by this rule:

SOP i1, i2, ... iN  ->  SOP j1, SOP j2, ... SOP jN  where j = sorted(i)

In other words, "distribute" SOP over a sorted list of the following integers. This will trick the diff algorithm into giving a fully qualified change report "Add SOP 3."

Rules are applied by searching the input for matches of the left hand side and replacing with the corresponding right.

You are probably already doing this, but you will get more commonsense analysis if the input is tokenized: "SOP" should considered a single "character" for the diff algorithm. Whitespace may be reduced to tokens for space and line break if they are significant or else ignored.

You can do another kind of diff at a the character level to test "fuzzy" equality of tokens to account for typographical errors when matching left-hand-sides. "SIP" and "SOP" would be counted a "match" because their edit distance is only 1 (and I and O are only one key apart on a QUERTY keyboard!).

If you consider all the quirks in the output you're getting now and can rectify each one as a rewrite rule that takes the input to a form where the diff algorithm produces what you need, then what's left is to implement the rewrite system. Doing this in a general way that is efficient so that changing and adding rules does not involve a great deal of ad hoc coding is a difficult problem, but one that has been studied. It's interesting that @Ira Baxter mentioned lisp, as it excels as a tool for this sort of thing.

Ira's suggestion of syntax trees falls naturally into the rewrite rule method. For example, suppose SOPs have sections and paragraphs:

SOP 1 section 3, paras 2,1,3

is a hierarchy that should be rewritten as

SOP 1 section 3 para 1, SOP 1 section 3 para 2, SOP 1 section 3 para 3

The rewrite rules

paras i1, i2, ... iN  ->  para j1, para j2, ... para jN  where j = sorted(i)
section K para i1, ... para iN  ->s section K para j1, ... section K para j1
SOP section K para i1, ... section K para i1 -> SOP section K para j1, ... SOP section K para j1 

when applied in three passes will produce a result like "SOP 1 section 3, para 4 was added."

While there are many strategies to implementing the rules and rewrites, including coding each one as a procedure in VB (argh...), there are other ways. Prolog is a grand attempt to do this as generally as possible. There is a free implementation. There are others. I have used TXL to get useful rewriting done. The only problem with TXL is that it assumes you have a rather strict grammar for inputs, which it doesn't sound like the case in your problem.

If you post more examples of the quirks in your current outputs, I can follow this up with more detail on rewrite rules.