compare text and get differences

Google has something similar, and it is available in C#, but I have not looked at it any deeper. The demo looks pretty cool, though.

http://code.google.com/p/google-diff-match-patch/


I have a class library that does this, I'll post a link below, but I'll also post how it does its job so that you can evaluate whether it will be fitting for your content.

Note that for everything I say below, if you think of each character as an element of a collection, you can implement the algorithm described below for any type of content. Be it characters of a string, lines of text, collections of ORM-objects.

The whole algorithm revolves around the longest common substring (LCS), and takes a recursive approach.

First the algorithm tries to find the LCS between the two. This will be the longest section that is unchanged/identical between the two versions. The algorithm then considers these two parts to be "aligned".

For instance, here's how two example strings would be aligned:

      This long text has some text in the middle that will be found by LCS
This extra long text has some text in the middle that should be found by LCS
          ^-------- longest common substring --------^
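To make the first step concrete, here is a minimal sketch in Python of finding the longest common substring with a simple dynamic-programming table. This is not the code from my library; the function name `longest_common_substring` and the `(start_a, start_b, length)` return shape are just choices made for this illustration. It only compares elements with `==`, so it works on strings, lists of lines, or any other sequence.

```python
def longest_common_substring(a, b):
    """Return (start_a, start_b, length) of the longest run shared by a and b.

    Returns (0, 0, 0) if the two sequences have nothing in common.
    """
    best = (0, 0, 0)
    # prev[j] = length of the common run ending at a[i-2] and b[j-1]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended one element earlier in both.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best[2]:
                    best = (i - curr[j], j - curr[j], curr[j])
        prev = curr
    return best


old = "This long text has some text in the middle that will be found by LCS"
new = "This extra long text has some text in the middle that should be found by LCS"
print(longest_common_substring(old, new))  # (4, 10, 44)
```

The match starts at the space before "long" in both strings, which is the section marked in the diagram above. This version takes time proportional to `len(a) * len(b)`, which is fine for short inputs but slow for large ones.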

Then it recursively applies itself to the portions before the aligned section, and to the portions after it.

The final "result" could look like this (I'm using the underscore to indicate portions "not there" in one of the strings):

This ______long text has some text in the middle that ______will be found by LCS
This extra long text has some text in the middle that should____ be found by LCS

Then, as part of the recursive approach, each level of recursion returns a collection of "operations". Depending on whether there is an LCS, or a portion missing from either string, the operation is one of the following:

  • If LCS, then it is a "copy" operation
  • If missing from first, then it is an "insert" operation
  • If missing from second, then it is a "delete" operation

So the above text would be:

  1. Copy 5 characters (This_)
  2. Insert extra_ (the underscore stands for a space)
  3. Copy 43 characters (long text has some text in the middle that_)
  4. Insert should
  5. Delete 4 characters (will)
  6. Copy 16 characters (_be found by LCS)
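The whole recursion can be sketched in a few lines of Python. This is an illustration under assumptions, not the code from my library: the function name `diff` and the `(operation, items)` tuples are made up for the example, and the LCS step is delegated to `find_longest_match` from Python's standard `difflib` module (which is unrelated to the DiffLib library mentioned in this answer, despite the name).

```python
from difflib import SequenceMatcher


def diff(a, b):
    """Return a list of (operation, items) tuples that turn sequence a into b."""
    if not a and not b:
        return []
    # Longest run of equal elements; autojunk=False disables difflib's
    # heuristic that ignores very frequent elements in long sequences.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    if m.size == 0:
        # Nothing in common: what is only in b was inserted,
        # what is only in a was deleted.
        ops = []
        if b:
            ops.append(("insert", b))
        if a:
            ops.append(("delete", a))
        return ops
    # Recurse on the portions before and after the aligned section.
    before = diff(a[:m.a], b[:m.b])
    after = diff(a[m.a + m.size:], b[m.b + m.size:])
    return before + [("copy", a[m.a:m.a + m.size])] + after


old = "This long text has some text in the middle that will be found by LCS".split()
new = "This extra long text has some text in the middle that should be found by LCS".split()
for op, items in diff(old, new):
    print(op, " ".join(items))
```

Here the elements are words rather than characters, which gives the same six operations as the list above (copy, insert, copy, insert, delete, copy). If you pass the raw strings instead, the function works character by character and will also align single shared characters, such as the "l" in "will" and "should", so the operation list comes out finer-grained than the hand-written one.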

The core of the algorithm is quite simple, and with the above text, you should be able to implement it yourself, if you want to.

There are some extra features in my class library, in particular for handling content that is similar to the changed text, so that you get not just delete or insert operations but also modify operations. This will mostly be important if you're comparing a list of something, like lines from text files.
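To give a rough idea of what a "modify" operation could look like, here is one possible post-processing step in Python. This is purely an assumption for illustration and not how my library does it: the function name `merge_similar`, the `threshold` parameter and the `(operation, items)` tuples are all hypothetical, and similarity is measured with `SequenceMatcher.ratio()` from Python's standard `difflib` module.

```python
from difflib import SequenceMatcher


def merge_similar(ops, threshold=0.6):
    """Replace an insert directly followed by a delete with one 'modify'
    operation when the lines involved are pairwise similar enough."""
    out = []
    i = 0
    while i < len(ops):
        is_pair = (
            i + 1 < len(ops)
            and ops[i][0] == "insert"
            and ops[i + 1][0] == "delete"
            and len(ops[i][1]) == len(ops[i + 1][1])
        )
        if is_pair:
            new_items, old_items = ops[i][1], ops[i + 1][1]
            # ratio() is 1.0 for identical strings and 0.0 for strings
            # with no characters in common.
            if all(SequenceMatcher(None, old, new).ratio() >= threshold
                   for old, new in zip(old_items, new_items)):
                out.append(("modify", list(zip(old_items, new_items))))
                i += 2
                continue
        out.append(ops[i])
        i += 1
    return out


ops = [
    ("copy", ["price = 10"]),
    ("insert", ["total = price * qty"]),
    ("delete", ["total = price * quantity"]),
]
print(merge_similar(ops))
```

With line-based input like this, the changed line is reported once as a modification from the old text to the new text, instead of as an unrelated insert and delete.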

The class library can be found here: DiffLib on GitHub, and you will also find it on NuGet for easy installation in Visual Studio 2010. It is written in C# for .NET 3.5 and up, so it will work for .NET 3.5 and 4.0, and since it is a binary release (all source code is on GitHub though), you can use it from VB.NET as well.