How do I diff UTF-16 files with GNU diff?

vimdiff works quite nicely for this purpose.

I found it while reading this StackOverflow answer.


Install the ripgrep utility, which supports UTF-16, then run:

diff <(rg -N . file1.txt) <(rg -N . file2.txt)

ripgrep supports searching files in text encodings other than UTF-8, such as UTF-16, latin-1, GBK, EUC-JP, Shift_JIS and more. (Some support for automatically detecting UTF-16 is provided. Other text encodings must be specifically specified with the -E/--encoding flag.)
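
If ripgrep is not available, a similar effect can probably be had with iconv, which is a different technique from the rg command above: convert both files to UTF-8 first and let diff compare the converted text. One difference worth noting is that the pattern . in the rg command only matches lines containing at least one character, so blank lines are left out of that comparison, whereas converting the whole file keeps them. The following is a minimal sketch, not a tested recipe for every platform. The function name diff_utf16 is made up for illustration, and it assumes both inputs are UTF-16 with a byte order mark (without a BOM, iconv implementations differ on which byte order they assume, so -f UTF-16LE or -f UTF-16BE may be needed instead).

```shell
# Hypothetical helper (not part of diff or iconv): compare two UTF-16 files
# by converting each one to UTF-8 in a temporary file and running diff on those.
# Assumption: both inputs are UTF-16 with a BOM so that iconv can detect the byte order.
diff_utf16() {
    tmp1=$(mktemp) || return 2
    tmp2=$(mktemp) || { rm -f "$tmp1"; return 2; }
    status=0
    {
        iconv -f UTF-16 -t UTF-8 "$1" > "$tmp1" &&
        iconv -f UTF-16 -t UTF-8 "$2" > "$tmp2" &&
        diff "$tmp1" "$tmp2"
    } || status=$?
    # Clean up the temporary files, then report the status of the first failing step
    # (for diff: 0 = identical, 1 = different, 2 = trouble).
    rm -f "$tmp1" "$tmp2"
    return "$status"
}
```

It would be called as diff_utf16 file1.txt file2.txt. Temporary files are used instead of <( ... ) so the sketch also runs in shells without process substitution; in bash, diff <(iconv -f UTF-16 -t UTF-8 file1.txt) <(iconv -f UTF-16 -t UTF-8 file2.txt) should be equivalent.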


From the GNU diff documentation:

Handling Multibyte and Varying-Width Characters

diff, diff3 and sdiff treat each line of input as a string of unibyte characters. This can mishandle multibyte characters in some cases. For example, when asked to ignore spaces, diff does not properly ignore a multibyte space character.

Also, diff currently assumes that each byte is one column wide, and this assumption is incorrect in some locales, e.g., locales that use UTF-8 encoding. This causes problems with the -y or --side-by-side option of diff.

These problems need to be fixed without unduly affecting the performance of the utilities in unibyte environments.

The IBM GNU/Linux Technology Center Internationalization Team has proposed some patches to support internationalized diff http://oss.software.ibm.com/developer/opensource/linux/patches/i18n/diffutils-2.7.2-i18n-0.1.patch.gz. Unfortunately, these patches are incomplete and are to an older version of diff, so more work needs to be done in this area.

I never realized that myself.

It looks like Guiffy could do the job if a non-free, non-command-line tool is acceptable. I am still looking for a freeware command-line tool:

http://www.guiffy.com/Diff-Tool.html


GNU diff produces malformed patches when accent marks or special characters are used:

 diff --version
 diff (GNU diffutils) 3.6
 diff -Naur old_foo new_foo > foo.patch

git diff correctly handles accent marks and special characters, regardless of whether the compared files or directories are inside a git repository:

 git --version
 git version 2.17.1
 git diff --no-index old_foo new_foo > foo.patch
