Disambiguate messy place names in Python (preferably on local machine)

You could try the Python library geodict. It comes with datasets that you can download and import into a database, so you can check the lists first to see whether they would work well with your data. It works in two steps:

  1. Extracting names
  2. Matching names to a location in the lists

More details (and another online option in the comments) here.


I think your best bet is to use a fuzzy matching algorithm.

Take your local dictionary of place names and administrative units, and compare each word and each comma-separated block of text against it. Assign a score to each match. You might want to use a normalized search to account for spelling mistakes, and keep an "ignore list" for words like "live", "work" and "in". Then add the score of each matched administrative unit to the score of any smaller unit or place name among your matches that lies within that administrative unit.
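As a rough sketch of the normalization and ignore-list part, using only the standard library: the function names (`normalize`, `candidates`, `similarity`) and the contents of `IGNORE` are my own assumptions for illustration, and `difflib.SequenceMatcher` is just one possible way to get a fuzzy similarity.

```python
import unicodedata
from difflib import SequenceMatcher

# Assumed ignore list; extend it with whatever noise words your data contains.
IGNORE = {"live", "work", "in"}

def normalize(text):
    """Lowercase, strip accents and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(no_accents.lower().split())

def candidates(raw):
    """Return each comma-separated block and each single word, minus ignored words."""
    seen = []
    for block in raw.split(","):
        words = [w for w in normalize(block).split() if w not in IGNORE]
        for cand in [" ".join(words)] + words:
            if cand and cand not in seen:
                seen.append(cand)
    return seen

def similarity(a, b):
    """Similarity in [0, 1]; 1.0 means identical after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()
```

You would then compare every entry of `candidates(...)` against your dictionary with `similarity` and keep the matches above some cutoff that you tune on your own data.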

Tune the scoring function with your results until you are happy. Take the best scoring match.

Example: Roma, Italy
Roma matches 8 places (score according to size)
Roma matches 23 more places with normalization (lower score according to size)
Italy matches 4 places + 2 administrative units (COUNTRY, DISTRICT) (score according to size)
Italy matches 14 more places and units with normalization (lower score according to size)
One of the Romas lies in one of your units. -> combine scores

If your tuning is good, you will have given the most points to the capital of Italy.
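Here is a minimal sketch of the score-combining step for the example above. Everything in it is made up for illustration: the tiny `GAZETTEER`, its `size` weights, the `0.8` threshold and the function names are assumptions, not real data or a real library API. It also only ranks places, so if nothing but an administrative unit matches, it returns `None`.

```python
from difflib import SequenceMatcher

# Hypothetical gazetteer; "size" is an assumed weight, "parent" the containing unit.
GAZETTEER = [
    {"name": "Roma", "kind": "place", "size": 10, "parent": "Italy"},
    {"name": "Roma", "kind": "place", "size": 2, "parent": "Australia"},
    {"name": "Romans", "kind": "place", "size": 3, "parent": "France"},
    {"name": "Italy", "kind": "unit", "size": 10, "parent": None},
    {"name": "Italy", "kind": "place", "size": 1, "parent": "United States"},
]

def match_score(token, entry, threshold=0.8):
    """Score by size; inexact matches get a proportionally lower score."""
    ratio = SequenceMatcher(None, token.lower(), entry["name"].lower()).ratio()
    return ratio * entry["size"] if ratio >= threshold else 0.0

def best_match(text):
    """Return (total_score, place_entry) for the best-scoring place, or None."""
    tokens = [t.strip() for t in text.split(",") if t.strip()]
    unit_scores = {}   # unit name -> best score seen for that unit
    place_scores = []  # (score, entry) for every matching place
    for token in tokens:
        for entry in GAZETTEER:
            score = match_score(token, entry)
            if score == 0.0:
                continue
            if entry["kind"] == "unit":
                name = entry["name"]
                unit_scores[name] = max(unit_scores.get(name, 0.0), score)
            else:
                place_scores.append((score, entry))
    best = None
    for score, entry in place_scores:
        # Add the score of the matched unit this place lies within, if any.
        total = score + unit_scores.get(entry["parent"], 0.0)
        if best is None or total > best[0]:
            best = (total, entry)
    return best

result = best_match("Roma, Italy")
```

With this toy data, the Roma that lies in Italy gets its own score plus the score of the matched unit, so it beats the other Romas and the fuzzy "Romans" match. With a real dictionary you would have to tune the weights and the threshold until the ranking looks right.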