Best way to detect similar email addresses?

If you can define a suitable mapping to some k-dimensional space, and a suitable norm on that space, this reduces to the All Nearest Neighbours Problem, which can be solved in O(n log n) time for a fixed k.

Finding such a mapping, however, might be difficult. Maybe someone will take this partial answer and run with it.


Well, you can make some optimizations, assuming that the Levenshtein distance calculation is your bottleneck.

1) With a Levenshtein distance of at most 2, the lengths of the two emails can differ by at most 2 characters, so don't bother doing the distance calculation unless abs(length(email1)-length(email2)) <= 2

2) Again, with a distance of 2, at most 2 edits separate the two emails, so you can make HashSets of the characters in each email and take the size of the union minus the size of the intersection of the two. (I believe this is a SymmetricExceptWith) Each edit can change that count by at most 2, because a substitution drops one character and adds another, so if the result is > 4, skip to the next comparison.
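
The two filters above can be sketched as one cheap pre-check. This is only an illustration in Python (the answer talks about C# HashSets); the function name may_be_within and the parameter k are made up for the sketch. It uses 2*k as the cut-off for the symmetric difference, since a single substitution can remove one character from the set and add another.

```python
def may_be_within(a: str, b: str, k: int = 2) -> bool:
    """Cheap necessary conditions for Levenshtein(a, b) <= k.

    False means the pair certainly needs more than k edits.
    True means the full distance still has to be computed.
    """
    # Filter 1: every insertion or deletion changes the length by one,
    # so a length gap larger than k rules the pair out.
    if abs(len(a) - len(b)) > k:
        return False
    # Filter 2: one edit changes the symmetric difference of the two
    # character sets by at most 2 (a substitution drops one, adds one).
    if len(set(a) ^ set(b)) > 2 * k:
        return False
    return True


print(may_be_within("john.smith@example.com", "jon.smith@example.com"))
print(may_be_within("abcde", "vwxyz"))
```

Only pairs that pass this check would go on to the real distance calculation.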

OR

Code your own Levenshtein distance algorithm. If you are only interested in distances < k, you can optimize the run time. See "Possible Improvements" on the Wikipedia page: http://en.wikipedia.org/wiki/Levenshtein_distance.
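
As a rough sketch of what a threshold-aware version could look like, here is a Python function that only answers "is the distance at most k?". It uses a plain row-by-row early exit rather than the diagonal-stripe variant, so treat it as one possible shape, not the definitive optimization. The name levenshtein_at_most is made up for the example.

```python
def levenshtein_at_most(a: str, b: str, k: int) -> bool:
    """Return True if the Levenshtein distance between a and b is <= k."""
    # The length gap is a lower bound on the distance.
    if abs(len(a) - len(b)) > k:
        return False
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        # The row minimum never decreases from one row to the next, so
        # once every cell exceeds k the final distance must exceed k too.
        if min(current) > k:
            return False
        previous = current
    return previous[-1] <= k


print(levenshtein_at_most("kitten", "sitting", 3))
print(levenshtein_at_most("kitten", "sitting", 2))
```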


You could start by applying some prioritization to which emails to compare to one another.

A key reason for the performance limitations is the O(n^2) cost of comparing each email address to every other email address. Prioritization is the key to improving the performance of this kind of search algorithm.

For instance, you could bucket all emails that have a similar length (+/- some amount) and compare within that subset first. You could also strip all special characters (numbers, symbols) from emails and find those that are identical after that reduction.
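
The stripping idea might look something like this in Python. It is a sketch under assumptions: the helper names are made up, and "special characters" is taken to mean everything except the letters a-z after lower-casing, which you may want to loosen.

```python
import re
from collections import defaultdict


def normalize(email: str) -> str:
    """Lower-case the address and keep only the letters a-z (assumed rule)."""
    return re.sub(r"[^a-z]", "", email.lower())


def groups_after_stripping(emails):
    """Group addresses that become identical once digits and symbols are removed."""
    buckets = defaultdict(list)
    for email in emails:
        buckets[normalize(email)].append(email)
    # Only buckets with more than one address are interesting.
    return [group for group in buckets.values() if len(group) > 1]


sample = ["john.smith@example.com", "johnsmith1@example.com", "jane@example.com"]
print(groups_after_stripping(sample))
```

This runs in a single pass over the list, so it is a cheap first step before any pairwise comparison.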

You may also want to create a trie from the data rather than processing it line by line, and use that to find all emails that share a common set of suffixes/prefixes and drive your comparison logic from that reduction. From the examples you provided, it looks like you are looking for addresses where a part of one address could appear as a substring within another. Tries (and suffix trees) are an efficient data structure for performing these types of searches.
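
To make the trie suggestion a bit more concrete, here is a minimal prefix-only sketch in Python, built from nested dicts. It does not cover suffix trees or general substring search, and the names build_trie, with_prefix and the END marker are invented for the example.

```python
END = "$end"  # marker key; cannot collide with the single-character keys


def build_trie(emails):
    """Build a character trie (nested dicts) from the addresses."""
    root = {}
    for email in emails:
        node = root
        for ch in email:
            node = node.setdefault(ch, {})
        node[END] = email
    return root


def with_prefix(root, prefix):
    """Return all stored addresses that start with prefix, sorted."""
    node = root
    for ch in prefix:
        if ch not in node:
            return []
        node = node[ch]
    # Walk the subtree below the prefix and collect complete addresses.
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == END:
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


trie = build_trie(["johnsmith@example.com", "johnsmith1@example.com", "jane@example.com"])
print(with_prefix(trie, "johnsmith"))
```

Addresses that end up under the same deep node share a long prefix, so they are natural candidates for the more expensive comparison.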

Another possible way to optimize this algorithm would be to use the date when each email account was created (assuming you know it). Duplicate accounts are likely to be created within a short period of time of one another, so this may help you reduce the number of comparisons to perform when looking for duplicates.
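
If creation times are available, one way to use them is a sliding window over the accounts sorted by date. This Python sketch assumes the data comes as (email, created_at) tuples and uses a one-day window. Both are arbitrary choices for illustration, as is the function name.

```python
from datetime import datetime, timedelta


def pairs_within_window(accounts, window=timedelta(days=1)):
    """Yield pairs of addresses created at most `window` apart.

    `accounts` is an iterable of (email, created_at) tuples.
    """
    ordered = sorted(accounts, key=lambda account: account[1])
    start = 0
    for end, (email, created) in enumerate(ordered):
        # Drop accounts from the front that are too old for this one.
        while created - ordered[start][1] > window:
            start += 1
        for other_email, _ in ordered[start:end]:
            yield other_email, email


accounts = [
    ("a@example.com", datetime(2020, 1, 1, 10)),
    ("b@example.com", datetime(2020, 1, 1, 12)),
    ("c@example.com", datetime(2020, 3, 1)),
]
print(list(pairs_within_window(accounts)))
```

Only the pairs this yields would then be handed to the similarity check.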


You could add a few optimizations:

1) Keep a list of known frauds and compare to that first. After you get going in your algorithm, you might be able to hit against this list faster than you hit the main list.

2) Sort the list first. It won't take too long (in comparison) and will increase the chance of matching the front of the string first. Have it sort by domain name first, then by username. Perhaps put each domain in its own bucket, then sort each bucket and compare addresses within that domain.

3) Consider stripping the domain in general. [email protected] and [email protected] will never trigger your flag.
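
The domain bucketing and stripping from points 2 and 3 could be sketched roughly like this in Python. The function name is made up, and lower-casing the address before splitting is an assumption of the sketch.

```python
from collections import defaultdict


def local_parts_by_domain(emails):
    """Map each domain to the sorted list of usernames seen for it."""
    buckets = defaultdict(list)
    for email in emails:
        # Split on the last "@": username on the left, domain on the right.
        local, _, domain = email.lower().rpartition("@")
        buckets[domain].append(local)
    return {domain: sorted(names) for domain, names in buckets.items()}


sample = ["bob@b.example", "alice@a.example", "Al@a.example"]
print(local_parts_by_domain(sample))
```

Each bucket can then be compared on its own, with the domain already stripped from the strings being compared.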