Bulk geocoding 20 million US addresses

For that many records, don't even consider a web service. It will throttle or cut you off before you can finish your task.

That leaves running it locally, and for that you have several commercial and free options.

The free options use the Census TIGER dataset, which you will need to load into a spatial database. There are libraries that geocode against TIGER for PostGIS, or even SQLite. Heck, you can even use ArcGIS to geocode against TIGER. Of course, ArcGIS is not free, which brings me to the commercial options. If you have an ArcGIS license, chances are you have a StreetMap DVD with a Tele Atlas (I mean TomTom) or NAVTEQ dataset; that depends on whether StreetMap Premium was bundled with your license. Either of those two datasets will probably give you more consistent results than TIGER.
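For example, once the PostGIS TIGER geocoder functions are installed and the TIGER data is loaded, a quick sanity check from the shell looks like this; a minimal sketch, assuming the standard geocode() and pprint_addy() functions from the TIGER geocoder, with the database name and address just examples:

    psql -d geocoder -c "
        SELECT g.rating,                         -- lower rating = better match
               pprint_addy(g.addy) AS matched_address,
               ST_X(g.geomout) AS lon, ST_Y(g.geomout) AS lat
          FROM geocode('400 Main St, Springfield, IL', 1) AS g;"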

Do yourself a favor: once your data is loaded, make several copies of the street database and run the geocoding process on several machines, each with a subset of the input data. Don't try to run it on just one machine, or you will be waiting days for it to finish; on top of that, whatever process you run will most likely leak memory and crash several times before it completes. This means you want checkpoints in your process, as sketched below.
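A minimal checkpointing sketch in shell, assuming the input is one flat file and using a made-up your_geocoder command as a stand-in for whatever local geocoding process you run. Each chunk is marked done only after it succeeds, so a crashed run restarts where it left off and skips finished chunks:

    # split 20M rows into numbered 200k-row chunks (GNU split)
    split -l 200000 -a 3 -d addresses.csv chunk_
    for f in chunk_???; do
        [ -e "$f.done" ] && continue               # checkpoint: chunk already finished
        your_geocoder < "$f" > "$f.out" && mv "$f.out" "$f.done"
    done

Hand each machine its own range of chunks and the crashes stop mattering; you just restart the loop.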


I work at SmartyStreets (an address verification company). Our service is free for everyone up to the basic level, and startups can also request to use our service completely free for the first year. So if you fit that classification, there's no charge for our unlimited service for a year.

Ragi recommends against a web service; however, our API can easily clean, standardize, and geocode 20 million addresses for you in about 5 hours (roughly 1,000 per second). Some of that time will depend on the speed of your machine (how many cores you have) and your network connection (don't try it over 3G, but a standard broadband connection will do just fine).
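To give a sense of what driving that kind of throughput looks like from the client side, here is a hedged sketch. The endpoint URL, auth query parameters, and pre-built batches/*.json payload files are placeholders rather than our actual interface, so check the API documentation for the real details; the point is simply to keep many requests in flight across your cores:

    # everything here is illustrative: the endpoint, auth parameters,
    # and batches/*.json payload files are assumptions, not the real API
    ls batches/*.json | xargs -P 16 -I{} \
        curl -s "https://api.example.com/street-address?auth-id=$AUTH_ID&auth-token=$AUTH_TOKEN" \
             -H "Content-Type: application/json" \
             --data-binary @{} -o {}.out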

Just wanted to point out that it is certainly possible with a web service.


As of 1 Aug 2017, I have remotely tested our web service and gotten a sustained 70,000 lookups per second using only a single 2015 MacBook Pro on a wireless network. Yeah, it's pretty fast. That means a small list like 20 million addresses would only take about 5 minutes.


I used this walkthrough describing how to build a PostGIS geocoder using 2010 TIGER/Line data. I'm running it right now, and it's not fast: at the current pace it will take three weeks to geocode 2 million addresses.

However, it's free and unthrottled, and it took someone with minimal coding and Postgres skills less than two days to set up and load with one (large) state's data to begin geocoding. I've also done absolutely no Postgres tuning, and the system is running over NFS mounts, so I suspect there are one or two orders of magnitude of performance gains I could get out of it if I needed to.
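For anyone chasing those gains, the usual first steps would be moving the data off NFS onto local disk and raising a few basic postgresql.conf settings. The values below are illustrative starting points for a dedicated machine with around 16 GB of RAM, not tuned recommendations:

    # postgresql.conf -- hedged starting points, scale to your RAM
    shared_buffers = 4GB            # the default is tiny; give Postgres real memory
    work_mem = 128MB                # per-sort/hash memory used by the geocode joins
    maintenance_work_mem = 1GB      # speeds up index builds while loading TIGER data
    effective_cache_size = 12GB     # tells the planner how much OS cache to expect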

Rather than using web services, I loaded all my addresses into the Postgres database, and now I'm running a quick-and-dirty Perl one-liner to geocode them one at a time:

perl -e 'for ($i = 1; $i < [max_key_value]; $i++) {
   print "UPDATE source_addresses
             SET (rating, new_address, lon, lat)
                   = (g.rating, pprint_addy(g.addy),
                      ST_X(g.geomout), ST_Y(g.geomout))
            FROM (SELECT DISTINCT ON (address_id) address_id, (g1.geo).*
                    FROM (SELECT address_id, (geocode(address)) As geo
                            FROM source_addresses As ag
                           WHERE ag.rating IS NULL AND address_id = $i
                         ) As g1
                   ORDER BY address_id, rating LIMIT 1
                 ) As g
           WHERE g.address_id = source_addresses.address_id;\n";
}' | psql -d geocoder

(line breaks solely for readability)

So that generates one "geocode the address with this ID value and use the best match" UPDATE statement per ID and pipes the stream to psql to execute. It only attempts to geocode addresses with no rating, i.e., ones it hasn't already geocoded, so the whole thing is restartable and each address is handled independently.
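Because each statement touches a single address_id and skips rows that already have a rating, the same loop can also be split across parallel psql sessions over disjoint ID ranges. A hedged sketch, where the range boundaries are made up and the loop body is the same print as above:

    # illustrative only: two parallel sessions over disjoint ID ranges
    perl -e 'for ($i = 1;         $i < 1_000_000; $i++) { ... }' | psql -d geocoder &
    perl -e 'for ($i = 1_000_000; $i < 2_000_000; $i++) { ... }' | psql -d geocoder &
    wait    # both halves run concurrently; add one range per core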