How to count the lines containing one of two words but not both

perl -nE 'END {say $c+0} ++$c if /\bthe\b/i xor /\ban\b/i' file
gawk 'END {print c+0} /\<the\>/ != /\<an\>/ {++c}' IGNORECASE=1 file

Comparing the results of the two matches gives the outcome you want: a line is counted when exactly one of the two expressions matches.

For example, the result of matching \<the\> may be either 0 or 1. If the result of the other match is the same, then both regexps were either found or not found, and the line should not be counted. If they differ it means that one match was found and the other was not, so the counter is incremented.

gawk has a built-in xor() function:

gawk 'END {print c+0} xor(/\<the\>/,/\<an\>/) {++c}' IGNORECASE=1 file

With grep:

cat poem.txt \
  | grep -Evi -e '\<an\>.*\<the\>' -e '\<the\>.*\<an\>' \
  | grep -Eci -e '\<(an|the)\>'

This counts the matched lines. An alternative that counts the total number of matches is given in EDIT 2 below.

Breakdown:

The first grep command filters out all lines containing both 'an' and 'the'. The second grep command counts the remaining lines that contain either 'an' or 'the'.

If you remove the c from the second grep's -Eci, you will see the matching lines themselves instead of the count (with the matches highlighted if your grep has color output enabled).
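
The two stages can be looked at separately. The following is only a sketch on a made-up sample file (the contents and the variable names poem, stage1 and count are assumptions), using the same grep options as above:

```shell
# Hypothetical sample input; contents are made up for this demo.
poem=$(mktemp)
printf '%s\n' \
  'The cat sat on an old mat.' \
  'An apple a day.' \
  'Nothing to see here.' \
  'Over the hills.' \
  'Then a pan fell.' > "$poem"

# Stage 1: drop every line that contains both words, in either order.
stage1=$(grep -Evi -e '\<an\>.*\<the\>' -e '\<the\>.*\<an\>' "$poem")
printf '%s\n' "$stage1"

# Stage 2: of the surviving lines, count those containing one of the words.
count=$(printf '%s\n' "$stage1" | grep -Eci -e '\<(an|the)\>')
printf 'count: %s\n' "$count"
rm -f "$poem"
```

Stage 1 still lets through lines that contain neither word; it is stage 2 that discards them.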

Details:

  • The -E option enables extended expression syntax (ERE) for grep.

  • The -i option tells grep to match case-insensitively

  • The -v option tells grep to invert the result (i.e. match lines not containing the pattern)

  • The -c option tells grep to output the number of matched lines instead of the lines themselves

  • The patterns:

    1. \< matches the beginning of a word (thanks @glenn-jackman)
    2. \> matches the end of a word (thanks @glenn-jackman)

    --> That way we can make sure to not match words containing 'the' or 'an' (like 'pan')

    3. grep -Evi -e '\<an\>.*\<the\>' thus matches all lines not containing 'an ... the'

    4. Similarly, grep -Evi -e '\<the\>.*\<an\>' matches all lines not containing 'the ... an'

    5. grep -Evi -e '\<an\>.*\<the\>' -e '\<the\>.*\<an\>' combines 3. and 4.: it matches only lines that contain neither 'an ... the' nor 'the ... an', i.e. lines that do not contain both words

    6. grep -Eci -e '\<(an|the)\>' matches all lines containing 'an' or 'the' as a whole word and prints the number of matched lines
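
The effect of the \< and \> anchors can be checked with a short sketch. The word list below is made up, and the variable names (words, loose, strict) are assumptions for illustration:

```shell
# Hypothetical word list; contents are made up for this demo.
words=$(mktemp)
printf '%s\n' 'a pan' 'an owl' 'Anna' 'the an' > "$words"

# Without word anchors, every line containing the letters "an" is counted.
loose=$(grep -ci 'an' "$words")

# With \< and \>, only lines where "an" stands alone as a word are counted.
strict=$(grep -ci '\<an\>' "$words")

printf 'loose: %s, strict: %s\n' "$loose" "$strict"
rm -f "$words"
```

Here 'a pan' and 'Anna' are counted by the loose pattern but not by the anchored one.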

EDIT 1: Use \< and \> instead of ( |^) and ( |$), as suggested by @glenn-jackman

EDIT 2: In order to count the number of matches instead of the number of matched lines, use the following expression:

cat poem.txt \
  | grep -Evi -e '\<an\>.*\<the\>' -e '\<the\>.*\<an\>' \
  | grep -Eio -e '\<(an|the)\>' \
  | wc -l

This uses grep's -o option, which prints every match on a separate line (and nothing else), and then wc -l to count those lines.
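
The two counts differ as soon as a line contains the same word more than once. A minimal sketch, assuming a made-up three-line input (the variable names poem, lines and matches are hypothetical):

```shell
# Hypothetical sample input; the first line contains "the" twice.
poem=$(mktemp)
printf '%s\n' \
  'the cat and the dog' \
  'an owl' \
  'the end of an era' > "$poem"

# Count matching lines (-c), after dropping lines that contain both words.
lines=$(grep -Evi -e '\<an\>.*\<the\>' -e '\<the\>.*\<an\>' "$poem" \
  | grep -Eci -e '\<(an|the)\>')

# Count individual matches (-o prints one match per line, wc -l counts them).
# tr strips the padding that some wc implementations add.
matches=$(grep -Evi -e '\<an\>.*\<the\>' -e '\<the\>.*\<an\>' "$poem" \
  | grep -Eio -e '\<(an|the)\>' \
  | wc -l | tr -d ' ')

printf 'lines: %s, matches: %s\n' "$lines" "$matches"
rm -f "$poem"
```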


The following GNU awk program should do the trick:

awk '(/(^|\W)[Tt]he(\W|$)/ && !/(^|\W)[Aa]n(\W|$)/) || (/(^|\W)[Aa]n(\W|$)/ && !/(^|\W)[Tt]he(\W|$)/) {c++} END{print c+0}' poem.txt

This increments the counter c if either

  • the line matches (^|\W)[Tt]he(\W|$), i.e. "the" or "The" preceded by a non-word character (\W) or the beginning of the line (^) and followed by a non-word character (\W) or the end of the line ($), but does not match (^|\W)[Aa]n(\W|$), the same test for an isolated "an" or "An" - OR -
  • the line matches (^|\W)[Aa]n(\W|$) but not (^|\W)[Tt]he(\W|$)

In the end, print the value of c.

It can be written slightly more compactly using \< and \> for "beginning-of-word" and "end-of-word":

awk '(/\<[Tt]he\>/ && !/\<[Aa]n\>/) || (/\<[Aa]n\>/ && !/\<[Tt]he\>/) {c++} END{print c+0}' poem.txt

Even shorter would be:

awk '/\<[Tt]he\>/ != /\<[Aa]n\>/ {c++} END{print c+0}' poem.txt

as the inequality is true only if exactly one of an and the (not both, and not neither) is present on a line.

This approach requires GNU awk because the \W and \< / \> constructs are GNU extensions to the extended regular expression syntax (but \< / \> are also understood by BSD regexes).
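
Where only a POSIX awk is available, the same comparison can probably be sketched without the GNU extensions. The version below is an approximation under stated assumptions: it lowercases each line with tolower() and uses the bracket expression [^a-z0-9_] in place of \W, which is only equivalent for ASCII text. The sample file and the variable names are made up:

```shell
# Hypothetical sample input; contents are made up for this demo.
poem=$(mktemp)
printf '%s\n' \
  'The cat sat on an old mat.' \
  'An apple a day.' \
  'Nothing to see here.' \
  'Over the hills.' \
  'Then a pan fell.' > "$poem"

count=$(awk '
  {
    line = tolower($0)   # fold case so one pattern covers the, The, THE
    # [^a-z0-9_] stands in for \W (assumption: ASCII input only)
    has_the = (line ~ /(^|[^a-z0-9_])the([^a-z0-9_]|$)/)
    has_an  = (line ~ /(^|[^a-z0-9_])an([^a-z0-9_]|$)/)
    if (has_the != has_an) c++   # exactly one of the two words is present
  }
  END { print c + 0 }            # c + 0 prints 0 instead of an empty line
' "$poem")

printf 'count: %s\n' "$count"
rm -f "$poem"
```

Unlike the [Tt]he form, lowercasing the whole line also matches spellings such as THE, so the counts can differ on input with fully capitalised words.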

Notice that the pipeline construct you showed in your own attempted solution won't work: calling grep with a file as an input argument takes precedence over reading from stdin, so the first part of the pipeline is silently ignored and the output is entirely due to the last part (which looks for occurrences of an, even those embedded in other words).
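
This behaviour is easy to reproduce. In the sketch below the file contents, the piped lines and the variable names (file, from_file, from_stdin) are all made up for illustration:

```shell
# Hypothetical file with two matching lines.
file=$(mktemp)
printf '%s\n' 'an owl' 'an era' > "$file"

# stdin carries three matching lines, but grep is also given a file operand,
# so it reads only the file and never looks at stdin.
from_file=$(printf '%s\n' 'an a' 'an b' 'an c' | grep -c 'an' "$file")

# Without a file operand, grep reads stdin.
from_stdin=$(printf '%s\n' 'an a' 'an b' 'an c' | grep -c 'an')

printf 'with file operand: %s, stdin only: %s\n' "$from_file" "$from_stdin"
rm -f "$file"
```

The first count reflects the file alone, which is why an earlier stage of such a pipeline has no effect on the result.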