Count lines wider than 80 columns, taking tabs correctly into account

Preprocess the files by piping them through expand. The expand utility will expand tabs appropriately (using the standard tab stops at every 8th character).

find . -type f \( -name '*.[ch]' -o -name '*.p[ly]' \) -exec expand {} + |
awk 'length > 80 { n++ } END { print n + 0 }'
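
As a quick sanity check of the expand-then-count approach, here is a minimal sketch. count_wide is a hypothetical helper name, and the three sample lines are made up for illustration. A line with one leading tab and 73 further characters occupies 8 + 73 = 81 columns, so it should be counted, while the same line with 72 characters should not.

```shell
# count_wide [MAX]: count stdin lines wider than MAX columns (default 80)
# after expanding tabs to the default 8-column tab stops.
count_wide() {
    expand | awk -v max="${1:-80}" 'length($0) > max { n++ } END { print n + 0 }'
}

{
    printf '\t%073d\n' 0   # tab + 73 digits = 81 columns: too wide
    printf '\t%072d\n' 0   # tab + 72 digits = 80 columns: fits
    printf '%081d\n' 0     # 81 digits, no tab: too wide
} | count_wide 80          # prints 2
```

As with the one-liner, awk's length counts characters (or bytes in some awk implementations), so this is only accurate for text made of single-width characters.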

GNU wc -L doesn't treat a TAB as 8 characters; it treats it as it would be displayed in a terminal with TAB stops every 8 columns, so a TAB has a "width" ranging from 1 to 8 columns depending on where it falls on the line. wc -L also takes into account the display width of other characters (whether they're 0, 1 or 2 columns wide) and processes \f and \r "correctly".

$ printf 'abcde\t\n' | wc -L
8

Here, you could use expand (which by default also assumes tab stops every 8 columns though you can change it with options) to expand those TABs to spaces:

git grep -h '' ./**/*.{c,h,p{l,y}} | expand | tr '\f\r' '\n\n' | grep -cE '.{81}'

That converts the CRs (which, when sent to a terminal, move the cursor back to the beginning of the line) and FFs (which some display devices understand as a page break) to LF to get the same behaviour as wc -L. The other control characters are ignored, as we can't tell what influence they would have on the display width anyway.
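
To illustrate the effect of that conversion, the following sketch approximates wc -L for plain ASCII input using only expand, tr and awk. max_width is a hypothetical helper name, and the sample inputs are made up. A CR or FF starts a new display segment, so only the longest segment counts.

```shell
# max_width: print the widest display segment on stdin, assuming
# single-width characters and tab stops every 8 columns.
max_width() {
    expand | tr '\f\r' '\n\n' |
        awk 'length($0) > m { m = length($0) } END { print m + 0 }'
}

printf 'abcde\t\n' | max_width     # prints 8: the TAB pads up to column 8
printf 'aaaa\rbb\n' | max_width    # prints 4: CR splits it into aaaa and bb
```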

That covers TABs but not zero-width or double-width characters. Note that the GNU implementation of expand currently doesn't expand TABs properly if there are multi-byte characters (let alone zero-width or double-width ones), because it counts bytes rather than characters:

$ printf 'ééééé\t\n' | wc -L
8
$ printf 'ééééé\t\n' | expand | wc -L
11

Also note that ./**/*.{c,h,p{l,y}} would by default skip hidden files and files in hidden directories. As the brace expansion expands to several globs, you would also get errors (fatal with zsh or bash -O failglob) if any one of those globs doesn't match.

With zsh, you'd use ./**/*.(c|h|p[ly])(D.) which is one glob, and where D includes hidden files and . restricts to regular files.

For a solution that takes into account the actual width of characters (assuming all the text files are encoded in the locale's character encoding) you could use:

git grep -h '' ./**/*.(c|h|p[ly])(.) | tr '\r\f' '\n\n' |
  perl -Mopen=locale -MText::Tabs -MText::CharWidth=mbswidth -lne '
    $n++ if mbswidth(expand($_)) > 80;
    END{print 0+$n}'

Note that, at least on GNU systems, mbswidth() considers control characters as having a width of -1, while expand() counts them as having a width of 1. We assume that no control characters other than CR, NL, TAB and FF are found in the files.


If we can assume, per your comment, that tab characters appear only at the beginning of lines, then we can enumerate the alternatives that add up to more than 80 columns, with each leading tab counting for 8 columns:

  • No tabs, at least 81 characters
  • One tab, at least 73 characters
  • Two tabs, at least 65 characters
  • Etc.

The resulting mess is as follows, with your awk statement summing the per-file counts to provide a grand total:

git grep -hcP '^(.{81,}|\t.{73,}|\t{2}.{65,}|\t{3}.{57,}|\t{4}.{49,}|\t{5}.{41,}|\t{6}.{33,}|\t{7}.{25,}|\t{8}.{17,}|\t{9}.{9,}|\t{10}.)' **/*.{c,h,p{l,y}} |
    awk '{ i+=$1 } END { printf ("%d\n", i) }'
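
The alternation above follows a simple rule: a line starting with k tabs is too wide once it has 81 - 8k further characters. As a sketch, it can be generated rather than typed by hand. build_pattern is a hypothetical helper name, and the result is only valid under the same assumption that tabs appear solely at the beginning of lines.

```shell
# build_pattern [LIMIT]: print a PCRE matching lines wider than LIMIT columns
# (default 80), assuming tabs occur only at line start and are 8 columns wide.
build_pattern() {
    limit=${1:-80}
    pattern=''
    k=0
    while [ $((k * 8)) -le "$limit" ]; do
        # characters still needed after k leading tabs to exceed the limit
        need=$((limit + 1 - k * 8))
        if [ "$k" -eq 0 ]; then
            alt=".{$need,}"
        else
            alt="\\t{$k}.{$need,}"
        fi
        pattern="${pattern:+$pattern|}$alt"
        k=$((k + 1))
    done
    printf '^(%s)\n' "$pattern"
}

build_pattern 80
```

The generated pattern writes \t{1} and .{1,} where the hand-written one uses \t and a bare dot, which is equivalent. Its output could be passed to git grep -hcP in place of the literal pattern.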
