Twitter Trending Topics: Combine different spellings

What you basically want is to find the similarity between two strings.

I think the Soundex algorithm is what you're looking for. It compares strings based on how they sound. As Wikipedia describes it:

Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling.

And:

Using this algorithm [EDIT: that is, "rating" words by a letter and three digits], both "Robert" and "Rupert" return the same string "R163" while "Rubin" yields "R150". "Ashcraft" yields "A261".
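The codes quoted above can be reproduced with PHP's built-in soundex() function. This is only a minimal sketch: the helper name soundsAlike() is made up for illustration, and whether Soundex is a good fit for multi-word trending phrases is something you would have to test on your own data.

```php
<?php
// Hypothetical helper: true when PHP's built-in soundex() gives both words the same code.
function soundsAlike(string $a, string $b): bool
{
    return soundex($a) === soundex($b);
}

// Print the Soundex code for each of the names from the quote.
foreach (['Robert', 'Rupert', 'Rubin'] as $name) {
    echo $name, ' => ', soundex($name), PHP_EOL;
}

var_dump(soundsAlike('Robert', 'Rupert')); // same code, so this is true
var_dump(soundsAlike('Robert', 'Rubin'));  // different codes, so this is false
```

Note that Soundex only produces a letter and three digits, so long phrases that merely start alike will also collide.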

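There's also the Levenshtein distance, which counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other.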
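PHP ships a levenshtein() function, so a quick experiment is easy. The sketch below is an assumption about how you might use it: the helper phraseDistance() is hypothetical, and lowercasing before comparing is my choice, not part of the algorithm.

```php
<?php
// Hypothetical helper: case-insensitive edit distance between two phrases.
function phraseDistance(string $a, string $b): int
{
    return levenshtein(strtolower($a), strtolower($b));
}

$variants = ['Half Blood Prince', 'Half-Blood Prince', 'Halfblood Prince'];
foreach ($variants as $variant) {
    // Distance of each spelling from the first one.
    echo $variant, ': ', phraseDistance($variants[0], $variant), PHP_EOL;
}
```

You would then treat two phrases as the same topic when the distance is below some threshold, which has to be tuned by hand.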

Good luck.


I'll try to answer my own question based on Broken Link's comment (thank you for this):


You've extracted phrases consisting of 1 to 3 words from your database of documents. Among these extracted phrases are the following:

  • Half Blood Prince
  • Half-Blood Prince
  • Halfblood Prince

For each phrase, you strip all special characters and blank spaces and make the string lowercase:

$phrase = 'Half Blood Prince';
$phrase = preg_replace('/[^a-z]/i', '', $phrase);
$phrase = strtolower($phrase);
// result is "halfbloodprince"

When you've done this, all 3 phrases (see above) have one spelling in common:

  • Half Blood Prince => halfbloodprince
  • Half-Blood Prince => halfbloodprince
  • Halfblood Prince => halfbloodprince

So "halfbloodprince" is the parent phrase. You insert both the original phrase and its parent phrase into your database.

To show a "Trending Topics Admin" like Twitter's you do the following:

// first select the top 10 parent phrases
$sql1 = "SELECT parentPhrase, COUNT(*) as cnt FROM phrases GROUP BY parentPhrase ORDER BY cnt DESC LIMIT 0, 10";
$sql2 = mysql_query($sql1);
while ($sql3 = mysql_fetch_assoc($sql2)) {
    $parentPhrase = $sql3['parentPhrase'];
    $childPhrases = array(); // set up an array for the child phrases
    $fifthPart = ceil($sql3['cnt'] * 0.2); // round up so "20% or more" holds exactly
    // now select all child phrases which make 20% of the parent phrase or more
    $sql4 = "SELECT phrase FROM phrases WHERE parentPhrase = '".mysql_real_escape_string($parentPhrase)."' GROUP BY phrase HAVING COUNT(*) >= ".$fifthPart;
    $sql5 = mysql_query($sql4);
    while ($sql6 = mysql_fetch_assoc($sql5)) {
        $childPhrases[] = $sql6['phrase']; // read from the inner result row, not the outer one
    }
    // now you have the parent phrase which is on the left side of the arrow in $parentPhrase
    // and all child phrases which are on the right side of the arrow in $childPhrases
}
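To try the logic without a database, here is a rough in-memory sketch of the same idea. The function names parentPhrase() and trendingTopics() and the parameters $limit and $share are assumptions for illustration. The input array holds one entry per occurrence, mirroring one row per phrase in the table.

```php
<?php
// Normalization from the answer: keep letters only, then lowercase.
function parentPhrase(string $phrase): string
{
    return strtolower(preg_replace('/[^a-z]/i', '', $phrase));
}

// Hypothetical in-memory equivalent of the two queries.
function trendingTopics(array $phrases, int $limit = 10, float $share = 0.2): array
{
    $counts = []; // parent phrase => [spelling => occurrences]
    foreach ($phrases as $phrase) {
        $parent = parentPhrase($phrase);
        $counts[$parent][$phrase] = ($counts[$parent][$phrase] ?? 0) + 1;
    }

    // Order parents by total occurrences, highest first, and keep the top $limit.
    uasort($counts, function (array $a, array $b): int {
        return array_sum($b) <=> array_sum($a);
    });
    $counts = array_slice($counts, 0, $limit, true);

    $topics = [];
    foreach ($counts as $parent => $spellings) {
        $minimum = array_sum($spellings) * $share;
        // Keep only spellings that make up at least $share of the parent's total.
        $topics[$parent] = array_keys(array_filter(
            $spellings,
            function (int $n) use ($minimum): bool {
                return $n >= $minimum;
            }
        ));
    }
    return $topics;
}

$sample = array_merge(
    array_fill(0, 5, 'Half-Blood Prince'),
    array_fill(0, 4, 'Half Blood Prince'),
    ['Halfblood Prince'],
    array_fill(0, 2, 'Twilight')
);
print_r(trendingTopics($sample));
```

In this sample "Halfblood Prince" occurs once out of ten, which is below the 20% share, so it is left out of the child phrases.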

Is this what you thought of, Broken Link? Would this work?


There are many ways to do this. Peter Norvig, Director of Research at Google, wrote a straightforward article on Google-style "did you mean" spelling correction that is a good source of ideas on how to achieve this:

http://norvig.com/spell-correct.html
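As a rough illustration of the core idea in that article, the sketch below generates every string one edit away from the input and picks the candidate seen most often. It is a simplified PHP adaptation, not Norvig's code: it only looks one edit away, it handles lowercase a-z only, and the names edits1() and correct() and the $counts array (known phrase => frequency) are assumptions.

```php
<?php
// All strings one edit away from $word: deletes, transposes, replaces, inserts.
function edits1(string $word): array
{
    $letters = range('a', 'z');
    $edits = [];
    $n = strlen($word);
    for ($i = 0; $i <= $n; $i++) {
        $left = substr($word, 0, $i);
        $right = substr($word, $i);
        if ($right !== '') {
            $edits[$left . substr($right, 1)] = true; // delete one character
        }
        if (strlen($right) > 1) {
            $edits[$left . $right[1] . $right[0] . substr($right, 2)] = true; // swap neighbours
        }
        foreach ($letters as $c) {
            if ($right !== '') {
                $edits[$left . $c . substr($right, 1)] = true; // replace one character
            }
            $edits[$left . $c . $right] = true; // insert one character
        }
    }
    return array_keys($edits);
}

// Return the most frequent known phrase within one edit, or $word itself if none is known.
function correct(string $word, array $counts): string
{
    if (isset($counts[$word])) {
        return $word;
    }
    $best = $word;
    $bestCount = 0;
    foreach (edits1($word) as $candidate) {
        if (($counts[$candidate] ?? 0) > $bestCount) {
            $best = $candidate;
            $bestCount = $counts[$candidate];
        }
    }
    return $best;
}

$counts = ['halfbloodprince' => 10, 'twilight' => 3];
echo correct('halfbloodprinse', $counts), PHP_EOL;
```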