find duplicate addresses in database, stop users entering them early?

You could use the Google Geocoding API.

It does in fact return results for both of your examples; I just tried it. That way you get structured results that you can save in your database, and two spellings of the same address should resolve to the same structured result. If the lookup fails, ask the user to write the address in another way.
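As a rough sketch of what "structured results" could look like in practice, the Python below builds a Geocoding request URL and pulls the address parts out of a decoded JSON response. The function names and the `api_key` parameter are made up for illustration; the endpoint and the `status`, `results`, `address_components` and `place_id` fields are the ones the Geocoding web service documents. Fetching the URL and the error handling around it are left out.

```python
from urllib.parse import urlencode

GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(address, api_key):
    """Build the request URL for a free-text address (api_key is your own key)."""
    return GEOCODE_ENDPOINT + "?" + urlencode({"address": address, "key": api_key})


def extract_address(response):
    """Return (place_id, parts) from a decoded response, or None if the lookup failed.

    `parts` maps each component type (e.g. "route", "postal_code") to its long name.
    Only the first (best) result is used, which is an assumption of this sketch.
    """
    if response.get("status") != "OK" or not response.get("results"):
        return None
    best = response["results"][0]
    parts = {}
    for component in best["address_components"]:
        for kind in component["types"]:
            # Keep the first value seen for each component type.
            parts.setdefault(kind, component["long_name"])
    return best["place_id"], parts
```

If `extract_address` returns `None`, that is the point where you ask the user to rephrase. Otherwise you could store the parts and treat a `place_id` that is already in your table as a probable duplicate.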


The earlier you can stop people, the easier it'll be in the long run!

Not being too familiar with your db schema or data entry form, I'd suggest a route something like the following:

  • have distinct fields in your db for each address "part", e.g. street, city, postal code, Länder, etc.

  • have your data entry form broken down similarly, e.g. street, city, etc.

The reasoning behind the above is that each part will likely have its own particular "rules" for spotting slightly changed addresses ("Quellenstrasse" -> "Quellenstr.", "66/11" -> "66a-11" above). Your validation code first checks whether the value entered for each field already exists in the corresponding db field. If it doesn't, a class that applies the transformation rules for that field (e.g. "strasse" stemmed to "str") normalizes the value, and you check for duplicates again.
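A minimal sketch of such per-field rules, in Python. The function names and the specific rules are my own assumptions, not a complete set for German addresses, and the tuple returned by `address_key` stands in for whatever normalized columns you would actually compare against in the db.

```python
import re


def normalize_street(street):
    """Lowercase, fold "ß" to "ss", and stem "strasse" / "str." to "str"."""
    s = street.strip().lower().replace("ß", "ss")
    s = re.sub(r"strasse\b", "str", s)
    s = re.sub(r"str\.", "str", s)
    return re.sub(r"\s+", " ", s)


def normalize_house_number(number):
    """Treat any run of separators ("/", "-", spaces) as a single "/"."""
    return re.sub(r"[^0-9a-z]+", "/", number.strip().lower()).strip("/")


def address_key(street, house_number, postal_code, city):
    """Build a comparable key from the normalized parts of one address."""
    return (
        normalize_street(street),
        normalize_house_number(house_number),
        postal_code.replace(" ", "").upper(),
        " ".join(city.lower().split()),
    )


def is_duplicate(existing_keys, street, house_number, postal_code, city):
    """Check a new address against a set of keys for already-stored addresses."""
    return address_key(street, house_number, postal_code, city) in existing_keys
```

With these rules, "Quellenstrasse 66/11" and "Quellenstr. 66-11" produce the same key. Whether "66a" should count as the same building as "66" is a judgment call for your data, so the sketch leaves letter suffixes alone.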

Obviously the above method has its drawbacks:

  • it can be slow, depending on your data set, leaving the user waiting

  • users may try to get around it by putting address "parts" in the wrong fields (appending the post code to the city, etc.). From experience, though, we've found that introducing even simple checking like the above will prevent a large percentage of users from entering pre-existing addresses.

Once you have the basic checking in place, you can look at optimising the db accesses required, refining the rules, etc. to suit your particular schema. You might also take a look at MySQL's full-text search (MATCH() ... AGAINST, which needs a FULLTEXT index on the columns involved) for finding similar text.