Find any lines exceeding a certain length

In order of decreasing speed (on a GNU system in a UTF-8 locale and on ASCII input) according to my tests:

grep '.\{80\}' file

perl -nle 'print if length$_>79' file

awk 'length>79' file

sed -n '/.\{80\}/p' file
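As a quick sketch of the commands above (file name and contents are illustrative): build a two-line sample where only the second line reaches 80 characters, and check that it is the only one reported.

```shell
# 79 x's on the first line, 80 x's on the second
printf '%079d\n%080d\n' 0 0 | tr 0 x > sample.txt

grep '.\{80\}' sample.txt      # prints only the 80-character line
awk 'length > 79' sample.txt   # same output
```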

Except for the perl¹ one (and for awk/grep/sed implementations, like mawk or busybox, that don't support multi-byte characters), those count the length in number of characters (according to the LC_CTYPE setting of the locale), not in bytes.

If the input contains bytes that don't form part of valid characters (which can happen when the locale's character set is UTF-8 but the input is in a different encoding), then depending on the solution and the tool implementation, each such byte will count as 1 character, count as 0, or fail to match `.`.

For instance, in a UTF-8 locale, a line that consists of 30 a's, a 0x80 byte, 30 b's, a 0x81 byte and 30 UTF-8 é's (each encoded as 0xc3 0xa9) would not match .\{80\} with GNU grep/sed (as that standalone 0x80 byte doesn't match .), would have a length of 30+1+30+1+2*30 = 122 with perl or mawk, and 3*30 = 90 with gawk.

If you want to count in terms of bytes instead, fix the locale to C, as in LC_ALL=C grep ... (and likewise for awk and sed).

All 4 solutions would then consider that the line above contains 122 characters. Except with perl and the GNU tools, you'd still have potential issues for lines that contain NUL characters (0x0 bytes).
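To illustrate the byte-vs-character difference (file name is illustrative): "café" is 4 characters but 5 bytes in UTF-8, so with LC_ALL=C it matches a 5-atom pattern that a 4-character line otherwise wouldn't need.

```shell
printf 'caf\303\251\n' > bytes.txt     # "café": 4 characters, 5 bytes

LC_ALL=C grep -c '.\{5\}' bytes.txt    # 1: the line has 5 bytes
LC_ALL=C awk 'length > 4' bytes.txt    # prints the line
```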


¹ The perl behaviour can be affected by the PERL_UNICODE environment variable, though.