How to diff files ignoring comments (lines starting with #)?

According to Gilles, the -I option only ignores a change if every changed line in that hunk matches the -I pattern. I didn't fully get it until I tested it.

The Test

Three files are involved in my test:
File test1:

    text

File test2:

    text
    #comment

File test3:

    changed text
    #comment

The commands:

$ # comparing files with comment-only changes
$ diff -u -I '#.*' test{1,2}
$ # comparing files with both comment and regular changes
$ diff -u -I '#.*' test{2,3}
--- test2       2011-07-20 16:38:59.717701430 +0200
+++ test3       2011-07-20 16:39:10.187701435 +0200
@@ -1,2 +1,2 @@
-text
+changed text
 #comment
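To reproduce the test without typing the files by hand, here is a minimal sketch. It assumes GNU diff (for -I) and uses a scratch directory from mktemp, which is my own choice and not part of the original test. It captures the exit status as well, since that is what a script would normally check.

```shell
#!/bin/sh
# Recreate the three test files in a scratch directory (assumption: mktemp is available).
dir=$(mktemp -d)
cd "$dir" || exit 1
printf 'text\n' > test1
printf 'text\n#comment\n' > test2
printf 'changed text\n#comment\n' > test3

# Only a comment line was added: every changed line matches the pattern,
# so diff treats the files as identical.
out12=$(diff -I '#.*' test1 test2) && rc12=0 || rc12=$?
echo "test1 vs test2: exit status $rc12"

# A regular line changed: the change contains a non-matching line, so it is reported.
out23=$(diff -I '#.*' test2 test3) && rc23=0 || rc23=$?
echo "test2 vs test3: exit status $rc23"
printf '%s\n' "$out23"
```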

The alternative way

Since there is no answer so far explaining how to use the -I option correctly, I'll provide an alternative which works in bash shells:

diff -u -B <(grep -vE '^\s*(#|$)' test1)  <(grep -vE '^\s*(#|$)' test2)

diff -u - unified diff
- -B - ignore blank lines
<(command) - a bash feature called process substitution; it exposes the command's output through a file descriptor, which removes the need for a temporary file
grep - command for printing lines (not) matching a pattern
- -v - show non-matching lines
- -E - use extended regular expressions
- '^\s*(#|$)' - a regular expression matching comments and empty lines
  - ^ - match the beginning of a line
  - \s* - match whitespace (tabs and spaces), if any
  - (#|$) - match a hash mark or, alternatively, the end of a line
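For shells without process substitution, the same filtering can be sketched with temporary files. The function name diff_nocomments and the sample files are hypothetical, and [[:space:]] stands in for \s, which is a GNU extension.

```shell
#!/bin/sh
# diff_nocomments is a hypothetical helper name, not part of diffutils.
# It strips comment and blank lines into temporary files, then diffs those.
diff_nocomments() {
    a=$(mktemp) || return 2
    b=$(mktemp) || return 2
    # [[:space:]] is the POSIX spelling of the GNU-only \s
    grep -vE '^[[:space:]]*(#|$)' "$1" > "$a"
    grep -vE '^[[:space:]]*(#|$)' "$2" > "$b"
    status=0
    diff -u "$a" "$b" || status=$?
    rm -f "$a" "$b"
    return "$status"
}

# Sample input (assumed for illustration): the files differ only in
# a blank line and an indented comment.
work=$(mktemp -d)
printf 'text\n' > "$work/old.conf"
printf 'text\n\n  # a comment\n' > "$work/new.conf"

if diff_nocomments "$work/old.conf" "$work/new.conf"; then
    echo "no differences outside comments"
fi
```

Because the lines are removed before diff sees them, the line numbers in the output refer to the filtered copies, not to the original files.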

Try:

diff -b -I '^#' -I '^ #' file1 file2

Note that this only works if the regex matches every changed line in the hunk, in both files; otherwise diff still shows the difference.

Use single quotes to protect the pattern from shell expansion, and escape any regex-reserved characters (e.g. brackets) that should match literally.

The diffutils manual says:

However, -I only ignores the insertion or deletion of lines that contain the regular expression if every changed line in the hunk (every insertion and every deletion) matches the regular expression.

In other words, for each non-ignorable change, diff prints the complete set of changes in its vicinity, including the ignorable ones. You can specify more than one regular expression for lines to ignore by using more than one -I option. diff tries to match each line against each regular expression, starting with the last one given.
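The rule quoted above can be sketched with two small files. The file contents are made up for illustration and GNU diff is assumed. A single pattern with [[:blank:]]* covers comments with any amount of leading whitespace, so separate -I options for each indentation level are not needed.

```shell
#!/bin/sh
# Sample files are hypothetical; GNU diff is assumed for -I.
work=$(mktemp -d)
printf 'alpha\n# old note\nbeta\ngamma\n' > "$work/a"
printf 'alpha\n    # new note\nbeta\ngamma changed\n' > "$work/b"

# One pattern covers comments with any amount of leading blanks.
# The comment change and the gamma change are separated by an unchanged
# line, so in the default output format diff judges them independently.
result=$(diff -I '^[[:blank:]]*#' "$work/a" "$work/b") || true
printf '%s\n' "$result"
```

In the default (non-unified) format this should report only the gamma change. With -u, the ignorable comment change sits in the vicinity of the real change and is printed along with it.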

This behavior is also well explained by armel here.

Related: How can I perform a diff that ignores all comments?

After searching around the web, I found a method similar to Lekensteyn's.

However, I want to use the diff output as input to patch, and grep -v removes lines, so the line numbers in the diff no longer match the original files.

Here's a possible improvement:

diff -u -B <(sed 's/^[[:blank:]]*#.*$/ /' file1)  <(sed 's/^[[:blank:]]*#.*$/ /' file2)

It's not perfect, but line numbers are kept in the patch file.
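A quick way to see why the line numbers survive is to count lines after each filter. This is a small sketch with a made-up sample file: the sed expression turns each comment into a one-space line, while grep -v drops it.

```shell
#!/bin/sh
# The sample file is an assumption for illustration.
sample=$(mktemp)
printf 'text\n  # indented comment\n#comment\nother text\n' > "$sample"

orig_lines=$(wc -l < "$sample")
# sed replaces each comment line with a single space: the line count is unchanged.
sed_lines=$(sed 's/^[[:blank:]]*#.*$/ /' "$sample" | wc -l)
# grep -v removes the comment lines: everything after them shifts up.
grep_lines=$(grep -vE '^[[:space:]]*(#|$)' "$sample" | wc -l)

echo "original: $orig_lines, after sed: $sed_lines, after grep -v: $grep_lines"
```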

However, if a line is added next to a comment line, the blanked-out comment ends up as a context line in the hunk and causes it to fail while patching:

File file1:

    text
    #comment
    other text

File file2:

    text
    new line
    #comment changed
    other text changed

Testing that data with our command:

$ printf '%s\n' '#!/usr/bin/sed -f' 's/^[[:blank:]]*#.*$/ /' > outcom.sed
$ echo "diff -u -B <(./outcom.sed \$1)  <(./outcom.sed \$2)" > mydiff.sh
$ chmod +x mydiff.sh outcom.sed
$ ./mydiff.sh file1 file2 > file.dif
$ cat file.dif
--- /dev/fd/63  2014-08-23 10:05:08.000000000 +0200
+++ /dev/fd/62  2014-08-23 10:05:08.000000000 +0200
@@ -1,2 +1,3 @@
 text
+new line
  
-other text
+other text changed

/dev/fd/62 and /dev/fd/63 are the files produced by process substitution. The line between "+new line" and "-other text" is the single space character that our sed expression substitutes for comments.

Applying that patch gives an error:

$ patch -p0 file1 < file.dif 
patching file file1
Hunk #1 FAILED at 1.
1 out of 1 hunk FAILED -- saving rejects to file file1.rej

One solution is to avoid the unified diff format, i.e. drop -u. The normal format carries no context lines, so the blanked-out comment never appears in the patch:

$ echo "diff -B <(./outcom.sed \$1)  <(./outcom.sed \$2)" > mydiff.sh
$ ./mydiff.sh file1 file2 > file.dif
$ cat file.dif
1a2
> new line
3c4
< other text
---
> other text changed
$ patch -p0 file1 < file.dif 
patching file file1
$ cat file1
text
new line
#comment
other text changed

Now the patch file works (no guarantees for anything more complex, though).
