Search For Three Consecutive Words

The following awk program counts how many times each sequence of three consecutive words occurs (after removing punctuation characters), and at the end prints each count together with its three words whenever the count is larger than 1:

{
        # strip punctuation so that e.g. "Player:" and "Player" compare equal
        gsub("[[:punct:]]", "")

        # count every triplet of consecutive words on this line
        for (i = 3; i <= NF; ++i)
                w[$(i-2),$(i-1),$i]++
}
END {
        for (key in w) {
                count = w[key]
                if (count > 1) {
                        # the three words in the key are separated by SUBSEP
                        gsub(SUBSEP, " ", key)
                        print count, key
                }
        }
}
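
Assuming the program is saved as, e.g., count.awk (the name is arbitrary), it would be run as

$ awk -f count.awk file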

Given the text in your question, this produces

2 Search Inside Yourself
2 Cultivate The Three
2 The Three Essential
2 Joy on Demand
2 Recognize and Cultivate
2 Three Essential Virtues
2 and Cultivate The
2 The Ideal Team
3 Ideal Team Player

As you can see, this may not be so useful.

Instead, we can collect the same count information on a first pass over the file and then make a second pass, printing each line that contains a word triplet whose count is larger than one. The NR == FNR pattern below is true only while the first file operand is being read, which is why the file is given twice on the command line in the test further down:

# first pass: count the word triplets, exactly as before
NR == FNR {
        gsub("[[:punct:]]", "")

        for (i = 3; i <= NF; ++i)
                w[$(i-2),$(i-1),$i]++

        next
}

# second pass: print the unmodified line if any of its triplets
# occurred more than once overall
{
        orig = $0
        gsub("[[:punct:]]", "")

        for (i = 3; i <= NF; ++i)
                if (w[$(i-2),$(i-1),$i] > 1) {
                        print orig
                        next
                }
}

Testing on your file:

$ cat file
The Ideal Team Player
The Ideal Team Player: How to Recognize and Cultivate The Three Essential Virtues
Ideal Team Player: Recognize and Cultivate The Three Essential Virtues

Joy on Demand: The Art of Discovering the Happiness Within
Crucial Conversations Tools for Talking When Stakes Are High
Joy on Demand

Search Inside Yourself: The Unexpected Path to Achieving Success, Happiness
Search Inside Yourself
$ awk -f script.awk file file
The Ideal Team Player
The Ideal Team Player: How to Recognize and Cultivate The Three Essential Virtues
Ideal Team Player: Recognize and Cultivate The Three Essential Virtues
Joy on Demand: The Art of Discovering the Happiness Within
Joy on Demand
Search Inside Yourself: The Unexpected Path to Achieving Success, Happiness
Search Inside Yourself

Caveat: This awk program needs enough memory to store the text of your file about three times over (each word is stored in up to three triplets), and it may flag lines that merely share a common phrase rather than being true duplicates (e.g. "how to cook" may be part of the titles of several unrelated books).


IMO, this task is better solved using the intersections of sets of words, rather than looking for 3 consecutive words.

Accordingly, the following Perl script does not look for 3 consecutive words. Instead, it first reads in the entire input (from stdin and/or one or more files) and, using the Set::Tiny module, creates a set of words for each input line.

Then it makes a second pass over the stored titles and, for each one, prints any exact duplicates as well as any other titles whose word sets have an intersection of 3 or more elements.

It uses a hash called %sets to store the word set for each title, and another hash called %titles to count how many times each title has been seen - that count is used in the output phase to ensure that no title is ever printed more often than it appeared in the input.

In short, it prints duplicate lines and similar lines (i.e. those which have at least 3 of the same words in them) next to each other - the 3 words do not have to be consecutive.

The script ignores several very common small words when constructing the sets, but this can be disabled by commenting out or deleting the line marked optional in the script. Or you can edit the common word list to suit your needs.

One thing worth mentioning is that the small words list in the script includes the word by. You can delete it from the list if you like, but the reason it's there is to stop the script from matching on by plus any two other words - e.g. Aardvark Taxidermy for Personal Wealth by Peter Smith would match The Wealth of Nations by Adam Smith (matching on by, Wealth, and Smith). The first book is (I hope) entirely non-existent, but if it did exist it would not be at all related to an economics text.
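
As a minimal sketch of what that means in terms of the word sets (using the two titles from the example above):

#!/usr/bin/perl -w
use strict;
use Set::Tiny;

# word sets for the two titles, lowercased and with punctuation removed,
# but with no small words stripped
my $t1 = Set::Tiny->new(qw(aardvark taxidermy for personal wealth by peter smith));
my $t2 = Set::Tiny->new(qw(the wealth of nations by adam smith));

my $common = $t1->intersection($t2);
print $common->size, ": ", join(" ", sort $common->members), "\n";
# prints "3: by smith wealth" - a 3-element intersection, so the two
# unrelated titles would be treated as similar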

Note: this script stores the entire input, and the associated word sets for each input line, in memory. This is unlikely to be a problem for modern systems with a few GiB of free RAM unless the input is extremely large.

Note2: Set::Tiny is packaged for Debian as libset-tiny-perl. It may be available pre-packaged for other distributions too. Otherwise, you can get it from CPAN.
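
For example, one of the following should install it (the first on Debian/Ubuntu, the second anywhere with a working CPAN client):

$ sudo apt-get install libset-tiny-perl
$ cpan Set::Tiny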

#!/usr/bin/perl -w

use strict;
use Set::Tiny;

# a partial list of common articles, prepositions and small words joined into
# a regex.
my $sw = join("|", qw(
  a about after against all among an and around as at be before between both
  but by can do down during first for from go have he her him how
  I if in into is it its last like me my new of off old
  on or out over she so such that the their there they this through to
  too under up we what when where with without you your)
);

my %sets=();    # word sets for each title.
my %titles=();  # count of how many times we see the same title.

while(<>) {
  chomp;
  # take a copy of the original input line, so we can use it as
  # a key for the hashes later.
  my $orig = $_;

  # "simplify" the input line
  s/[[:punct:]]//g;  # strip punctuation characters
  s/^\s*|\s*$//g;    # strip leading and trailing spaces
  $_=lc;             # lowercase everything, case is not important
  s/\b($sw)\b//iog;  # optional. strip small words
  next if (/^$/);

  $sets{$orig} = Set::Tiny->new(split);
  $titles{$orig}++;
};

my @keys = (sort keys %sets);

foreach my $title (@keys) {
  next unless ($titles{$title} > 0);

  # if we have any exact dupes, print them. and make sure they won't
  # be printed again.
  if ($titles{$title} > 1) {
    print "$title\n" x $titles{$title};
    $titles{$title}  = 0;
  };

  foreach my $key (@keys) {
    next unless ($titles{$key} > 0);
    next if ($key eq $title);

    my $intersect = $sets{$key}->intersection($sets{$title});
    my $k=scalar keys %{ $intersect };

    #print STDERR "====>$k(" . join(",",sort keys %{ $intersect }) . "):$title:$key\n" if ($k > 1);

    if ($k >= 3) {
      print "$title\n" if ($titles{$title} > 0);
      print "$key\n" x $titles{$key};
      $titles{$key}   = 0;
      $titles{$title} = 0;
    };
  };
};

Save it as, e.g. blueray.pl, and make it executable with chmod +x:
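
$ chmod +x blueray.pl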

Given the new sample input, it produces the following output:

$ ./blueray.pl TestData.txt 
7L: The Seven Levels of Communication
The Seven Levels of Communication: Go From Relationships to Referrals by Michael J. Maher
A History of Money and Banking in the United States: The Colonial Era to World War II
The History of Banking: The History of Banking and How the World of Finance Became What it is Today
America's Bank: The Epic Struggle to Create the Federal Reserve
America's Money Machine: The Story of the Federal Reserve
Freakonomics: A Rogue Economist
Freakonomics: A Rogue Economist Explores the Hidden
Freakonomics: A Rogue Economist Explores the Hidden
Freakonomics: A Rogue Economist Explores the Hidden Side of Everything by Steven Levitt
Money Master the Game by Tony Robbinson
Money Master the Game by Tony Robbinson
Money Master the Game by Tony Robbinson
The Federal Reserve and its Founders: Money, Politics, and Power
The Power and Independence of the Federal Reserve
Venture Deals by Brad Feld
Venture Deals by Brad Feld & Jason Mendelson
Zen and the Art of Motorcycle Maintenance: An Inquiry into Values
Zen and the Art of Motorcycle Maintenance: An Inquiry into Values

This is not exactly the same as your example output. Because it checks for the presence of common words in titles while ignoring their order, it is more likely to find false positives, and less likely to produce false negatives (i.e. to miss matches it should have made).

If you want to experiment with this, or just want to see which words it is matching (or almost matching) on, you can uncomment the #print STDERR line.