Command line method to find repeat-word typos, with line numbers

You need to handle at least a few edge cases:

  • Repeated words can straddle a line break: one copy at the end of a line and the other at the beginning of the next.
  • The search should be case-insensitive, because of frequent errors like "The the apple".
  • You probably want to restrict the search to word-constituent characters, so that something like ( ( a + b) + c ) (repeated opening parentheses) does not match.
  • Only full words should match, so that "the thesis" is not reported.
  • For human language, Unicode characters inside words should be interpreted properly.
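To check a candidate command against these cases, a small test file helps. The following is only a sketch: make_dup_fixture is a name made up for this example, and the sample sentences are invented, not taken from any real document.

```shell
# make_dup_fixture (hypothetical name) writes a five-line test file to the
# path given as its first argument. Each line exercises one edge case.
make_dup_fixture() {
    cat > "$1" <<'EOF'
The the apple fell at the end of the
the line.
This sentence mentions the thesis only once.
Nested parentheses: ( ( a + b) + c )
A naïve naïve approach fails here.
EOF
}
```

A good solution should report "The the" on line 1, the "the" pair split across lines 1 and 2, and "naïve naïve" on line 5, and should stay silent about lines 3 and 4.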

All in all, I recommend the pcregrep solution:

pcregrep -Min --color=auto '\b([^[:space:]]+)[[:space:]]+\1\b' file

The -M option lets a match span more than one line, and -i makes it case-insensitive. Color and line numbers (the -n option) are optional, but usually nice to have.

Install

On Debian-based distributions you can install via:

$ sudo apt-get install pcregrep

Example

Run the command on jefferson_typo.txt to see:

$ pcregrep -Min --color=auto '\b([^[:space:]]+)[[:space:]]+\1\b' jefferson_typo.txt
1:He has has refused his Assent to Laws, the most wholesome and necessary
3:He has forbidden his Governors to pass Laws of immediate and
and pressing importance, unless suspended in their operation till his
5:Assent should be be obtained; and when so suspended, he has utterly

The above is just a text capture, but on a terminal that supports color, the matches are highlighted:

  • has has
  • and
  • and
  • be be

You should take a peek at the venerable diction(1) and style(1) commands. They catch a variety of boo-boos. There are newish versions (GPLv3 here on Fedora 23).

Install

For example, on Debian-based distributions, install the package diction, which includes style:

$ sudo apt-get install diction

At least in Fedora it is:

$ sudo dnf install diction

Red Hat Enterprise (and clones) probably need:

$ sudo yum install diction

In any case, this comes from an upstream GNU package called diction, so it should be called the same almost everywhere.

Example

$ diction jefferson_typo.txt
jefferson_typo.txt:1: He has [has] refused his Assent to Laws, the [most] wholesome and necessary for the public good.

jefferson_typo.txt:3: He has forbidden his Governors to pass Laws of immediate and [and] pressing importance, unless suspended in their operation till his Assent should be [be] obtained; and when [so] suspended, he has utterly neglected to attend to them.

2 phrases in 2 sentences found.

Pros

  • catches the repeated words, amongst other things

Cons

  • It adds [] markings for items unrelated to repeated words. For example, [so] is probably marked because it can be considered extraneous per The Elements of Style by Strunk. See man diction.
  • The number shown is not always the line number in the original input; it is the line on which the sentence starts. For example, [be] is on line 5 of the input, but diction reports 3 because [be] is part of the sentence that begins on line 3. This is slightly different from what you asked for.

This will print each line that contains a repeated word, prefixed with the filename and line number:

for f in *.txt; do
    perl -ne 'print "$ARGV: $.: $_" if /\b(\w+)\W+\1\b/i' "$f"
done

For multi-line matching there's this, but you lose the line numbers because it's slurping in the file by paragraphs (that's the effect of the -00 option). The \W+ between the two words means any "non-word" characters, including newlines.

perl -00 -nE '
    @matches = /\b((\w+)\W+\2\b)/gi;
    while (@matches) {
        ($match,$word) = splice @matches, 0, 2;
        say "dup: $match";
    }
' jefferson_typo.txt 
dup: has has
dup: and
and
dup: be be
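
If you want both multi-line matching and line numbers, one option is to compare each word with the previous one in awk. This is only a sketch under some assumptions: find_dups is a name made up for this example, it recognizes ASCII letters and digits only, it strips just a few trailing punctuation marks, and it never pairs words across a blank line or across files. It reports the line that holds the second copy of the word.

```shell
# find_dups (hypothetical name) prints "file:line: word" for each repeated word.
# ASCII-only sketch: words containing other characters are skipped.
find_dups() {
    awk '
    FNR == 1 { prev = "" }          # do not pair words across files
    NF == 0  { prev = ""; next }    # do not pair words across blank lines
    {
        for (i = 1; i <= NF; i++) {
            tok = tolower($i)       # case-insensitive comparison
            cur = tok
            sub(/[.,;:!?]+$/, "", cur)      # so that "be be;" still matches
            # Only whole words made of letters and digits count as repeats.
            if (cur == prev && cur ~ /^[a-z0-9]+$/)
                printf "%s:%d: %s\n", FILENAME, FNR, cur
            prev = tok              # keep punctuation: "good. Good" is not a repeat
        }
    }' "$@"
}
```

Call it as find_dups *.txt. Because the previous word is carried over from one line to the next, a pair split by a line break is reported on the second line.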