How do I use grep to find lines in which any word occurs 3 times?

Using the standard word definition,

  • GNU Grep, 3 or more occurrences of any word.

    grep -E '(\W|^)(\w+)\W(.*\<\2\>){2}' file

  • GNU Grep, only 3 occurrences of any word.

    grep -E '(\W|^)(\w+)\W(.*\<\2\>){2}' file | grep -Ev '(\W|^)(\w+)\W(.*\<\2\>){3}'

  • POSIX Awk, only 3 occurrences of any word.

    awk -F '[^_[:alnum:]]+' '{           # Field separator is non-word sequences
        split("", cnt)                   # Delete array cnt
        for (i=1; i<=NF; i++) cnt[$i]++  # Count number of occurrences of each word
        for (i in cnt) {
            if (cnt[i]==3) {             # If a word appears exactly 3 times
                print                    # Print the line
                next                     # Go to the next line, so it is printed only once
            }
        }
    }' file

    For 3 or more occurrences, simply change == to >=.

    Equivalent golfed one-liner:

    awk -F '[^_[:alnum:]]+' '{split("",c);for(i=1;i<=NF;i++)c[$i]++;for(i in c)if(c[i]==3){print;next;}}' file

  • GNU Awk, only 3 occurrences of the word ab.

    gawk 'gsub(/\<ab\>/,"&")==3' file

    For 3 or more occurrences, simply change == to >=.
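
As a quick sanity check of the POSIX Awk variant, here is a minimal sketch that runs both the "only 3" and the "3 or more" forms against a few made-up lines. The sample data and the function names sample, exactly_three and at_least_three are hypothetical, chosen only for this illustration.

```shell
# Hypothetical sample input; each line exercises one case.
sample() {
    printf '%s\n' \
        'foo bar foo baz foo' \
        'foo foo foo foo' \
        'foo bar baz' \
        'ab, ab; ab.'
}

# Print stdin lines in which some word occurs exactly 3 times.
exactly_three() {
    awk -F '[^_[:alnum:]]+' '{
        split("", cnt)                        # empty the per-line counter
        for (i = 1; i <= NF; i++) cnt[$i]++   # count each field
        for (w in cnt) if (cnt[w] == 3) { print; next }
    }'
}

# Print stdin lines in which some word occurs 3 or more times.
at_least_three() {
    awk -F '[^_[:alnum:]]+' '{
        split("", cnt)
        for (i = 1; i <= NF; i++) cnt[$i]++
        for (w in cnt) if (cnt[w] >= 3) { print; next }
    }'
}

sample | exactly_three     # skips the line with four "foo"
sample | at_least_three    # includes the line with four "foo"
```

The only difference between the two functions is the comparison operator, which is the == versus >= change mentioned above.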

Reading material

  • \2 is a back-reference.
  • \w \W \< \> special expressions in GNU Grep.
  • The [:alnum:] POSIX character class.

Like this?

egrep '(\<.+\>).+\<\1\>.+\<\1\>'
  • egrep (or grep -E) enables extended regexes, so ( ) and + can be written without backslashes; back-references in extended regexes are a GNU extension
  • \<.+\> will match any word of at least 1 character
    • \< and \> match the start and the end of a word, respectively (in your attempt you didn't take word boundaries into account at all)
    • .+ matches a sequence of one or more characters (in your attempt you used .* which matches a sequence of zero or more characters!)
  • use back-references, to check whether the matched sequence occurs a 2nd time (\1) and a 3rd time (\1 again).
    • we allow any sequence of one or more characters (.+) between the matches, so "foo bar foo dorbs foo godly" will match (there are 3 occurrences of the word "foo").
    • if you only want to match adjacent words (e.g. "foo foo foo"), use something like [[:space:]]+ instead.
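
To make the difference between the two variants concrete, here is a small sketch assuming GNU grep. The function names any_gap and adjacent and the sample lines are made up for the illustration; the second pattern is simply the first one with .+ swapped for [[:space:]]+ between the matches.

```shell
# Some word three times, with anything in between (assumes GNU grep).
any_gap() {
    grep -E '(\<.+\>).+\<\1\>.+\<\1\>'
}

# Same, but the repeats must be separated by whitespace only.
adjacent() {
    grep -E '(\<.+\>)[[:space:]]+\<\1\>[[:space:]]+\<\1\>'
}

# Hypothetical test lines.
lines() {
    printf '%s\n' \
        'foo bar foo dorbs foo godly' \
        'foo foo foo' \
        'foo bar baz'
}

lines | any_gap     # prints the first two lines
lines | adjacent    # prints only "foo foo foo"
```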

I assume your question means: if any word in the line occurs at least 3 times, print the line, otherwise discard it. I would use awk for a more readable and customizable solution:

awk -F '\\W+' '{
    delete c; for (i=1;i<=NF;i++) if (length($i) && ++c[$i]==3) {print; next}
}' file

It loops over all fields of a line, counting the occurrences of each one. As soon as any word reaches 3 occurrences, it prints the line and goes to the next line, where the array is deleted before counting starts again. The length test skips empty fields, which appear when a line starts or ends with a separator.

Here we can easily customize the meaning of "word" by choosing a different field separator with -F, whose argument is treated as an extended regular expression. In the above, word separators are all characters except _ and [:alnum:]: awk -F '\\W+' (\W is a GNU awk extension) or, portably, awk -F '[^_[:alnum:]]+', similar to matching word boundaries with grep.

For a human language, we may need different word boundaries, such as everything except letters: awk -F '[^[:alpha:]]+', everything except letters and digits: awk -F '[^[:alnum:]]+', or, to allow the dash as well as the underscore inside words: awk -F '[^-_[:alnum:]]+'.

Without setting -F, fields are split on whitespace only, so punctuation stays attached to the words.
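
To see how the separator changes what counts as a word, here is a rough sketch that wraps the same loop in a function taking the separator as its argument. The name thrice and the sample inputs are hypothetical, and split("", c) is used in place of delete c only because it works in older awks too.

```shell
# thrice SEP: print stdin lines in which some non-empty field occurs 3 times,
# using SEP as the awk field separator.
thrice() {
    awk -F "$1" '{
        split("", c)                          # empty the per-line counter
        for (i = 1; i <= NF; i++)
            if (length($i) && ++c[$i] == 3) { print; next }
    }'
}

echo 'x-1 x-2 x-3'       | thrice '[^_[:alnum:]]+'    # fields x,1,x,2,x,3: printed
echo 'x-1 x-2 x-3'       | thrice '[^-_[:alnum:]]+'   # fields x-1,x-2,x-3: not printed
echo 'stop, stop. stop!' | thrice '[^[:alpha:]]+'     # fields stop,stop,stop: printed
echo 'stop, stop. stop!' | thrice ' '                 # whitespace splitting: not printed
```

Passing a single space as the separator gives the same splitting as leaving -F out, so the last call shows the default behavior.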