Keep only the lines containing an exact number of delimiters

Another POSIX one:

awk -F , 'NF == 11' <file

If a line has 10 commas, it has 11 fields. So we simply make awk use , as the field delimiter. If the number of fields is 11, the condition NF == 11 is true and awk performs the default action, print $0.
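As a quick check of the fields-versus-commas reasoning, here is a minimal sketch on made-up sample data. The variable names sample, awk_out and n are only illustrative; passing the comma count with -v n=10 is one possible way to avoid hard-coding 11.

```shell
# Hypothetical sample data: only the second line has exactly 10 commas.
sample='a,b,c
1,2,3,4,5,6,7,8,9,10,11
1,2,3,4,5,6,7,8,9,10,11,12'

# n is an illustrative variable: the number of delimiters required.
# A line with n commas has n + 1 fields, so compare NF against n + 1.
awk_out=$(printf '%s\n' "$sample" | awk -F , -v n=10 'NF == n + 1')
printf '%s\n' "$awk_out"
```

Only the 11-field line should be printed; the 3-field and 12-field lines are dropped.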


Using egrep (or grep -E in POSIX):

egrep "^([^,]*,){10}[^,]*$" file.csv

This filters out anything not containing exactly 10 commas: it matches full lines (^ at the start and $ at the end) consisting of exactly ten repetitions ({10}) of the sequence "any number of characters except ',', followed by a single ','" (([^,]*,)), followed by any number of characters except ',' ([^,]*).

You can also use the -x option, which requires the pattern to match the whole line, and drop the anchors:

grep -xE "([^,]*,){10}[^,]*" file.csv

This is less efficient than cuonglm's awk solution though; the latter is typically six times faster on my system for lines with around 10 commas. Longer lines will cause huge slowdowns.
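To see that the anchored form and the -x form select the same lines, a small sketch on made-up data follows (sample, anchored and whole_line are illustrative names). It says nothing about speed, only about which lines are kept.

```shell
# Hypothetical sample data: only the second line has exactly 10 commas.
sample='a,b,c
1,2,3,4,5,6,7,8,9,10,11
1,2,3,4,5,6,7,8,9,10,11,12'

# Explicit ^ and $ anchors.
anchored=$(printf '%s\n' "$sample" | grep -E '^([^,]*,){10}[^,]*$')

# -x makes the whole line match the pattern, so the anchors are not needed.
whole_line=$(printf '%s\n' "$sample" | grep -xE '([^,]*,){10}[^,]*')

if [ "$anchored" = "$whole_line" ]; then
  printf 'both forms keep: %s\n' "$whole_line"
fi
```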


The simplest grep code that will work:

grep -xE '([^,]*,){10}[^,]*'

Explanation:

-x ensures that the pattern must match the entire line, rather than just part of it. This is important so you don't match lines with more than 10 commas.

-E means "extended regex", which makes for less backslash-escaping in your regex.

Parentheses are used for grouping, and the {10} afterwards means there must be exactly ten matches in a row of the pattern within the parentheses.

[^,] is a character class. For instance, [c-f] would match any single character that is a c, a d, an e or an f, and [^A-Z] would match any single character that is NOT an uppercase letter. So [^,] matches any single character except a comma.

The * after the character class means "zero or more of these."

So the regex part ([^,]*,) means "any character except a comma, any number of times (including zero times), followed by a comma", and the {10} specifies 10 of these. The trailing [^,]* then matches the remaining non-comma characters up to the end of the line.
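The pieces above can be tried out on a shortened version of the pattern. This sketch assumes {2} in place of {10} purely so that the made-up sample lines stay readable; the variable names are illustrative.

```shell
# Hypothetical sample data with 1, 2, 2 and 3 commas respectively.
# The third line consists of two commas only, so every field is empty.
sample='a,b
a,b,c
,,
a,b,c,d'

# With -x the pattern must cover the whole line: exactly 2 commas.
with_x=$(printf '%s\n' "$sample" | grep -xE '([^,]*,){2}[^,]*')
printf '%s\n' "$with_x"

# Without -x a match anywhere in the line is enough, so any line with
# at least 2 commas is kept. Count those lines with -c.
without_x_count=$(printf '%s\n' "$sample" | grep -cE '([^,]*,){2}[^,]*')
printf 'lines matched without -x: %s\n' "$without_x_count"
```

The line made only of commas is kept because * allows zero non-comma characters, and the line with 3 commas slips through only when -x is left out.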
