How to Geocode 300,000 addresses on the fly?

Mehul, I used to work in the address verification industry with a company called SmartyStreets. There are lots of geocoding services out there, but only a few support batch processing at the volume you require. (Google and others don't permit bulk use of their APIs or storing/caching of results.)

Export the table that contains the addresses from your MySQL database and save it as, for example, a CSV file. You can then process it using the Bulk Address Validation Tool for Lists or the Command Line Tool. As I said, there are several services out there, but I presume you'll want one that also verifies that the addresses exist: if an address is wrong or incomplete, so are its geocoding results. Only a few services do this.
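As a rough illustration of the export step, the sketch below writes an address table to a CSV file through Python's DB-API. The table name `customers`, its columns, the sample rows and the output path are all hypothetical, and `sqlite3` stands in for MySQL only so the example runs on its own. With a MySQL driver the connection line would differ, but the cursor and `csv` calls should look much the same.

```python
import csv
import os
import sqlite3
import tempfile


def export_addresses(conn, query, csv_path):
    """Write the rows returned by `query` to `csv_path`, header row included."""
    cur = conn.cursor()
    cur.execute(query)
    header = [col[0] for col in cur.description]
    count = 0
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for row in cur:
            writer.writerow(row)
            count += 1
    return count


# Stand-in database with made-up rows (assumption: the real data is in MySQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, street TEXT, city TEXT, zip TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [(1, "123 Main St", "Springfield", "00001"),
     (2, "456 Oak Ave", "Springfield", "00002")],
)

csv_path = os.path.join(tempfile.gettempdir(), "addresses.csv")
exported = export_addresses(conn, "SELECT id, street, city, zip FROM customers", csv_path)
print("exported %d rows to %s" % (exported, csv_path))
```

The resulting file can then be uploaded to whichever bulk tool you choose.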

LiveAddress is a service that is CASS-Certified by the USPS. There are a few such services, so do your research, but since you want something "on-the-fly", quick and inexpensive, I again recommend LiveAddress. It not only verifies the address but also supplies what you need: lat/lon information and the precision of the geocoding results. It's all web-based and will process tens of millions of records in no time (see this question as a reference).

If you also need to geocode addresses as users interact with your application, US Street Address has an API version that can plug into just about anything. It also supports batch processing on the fly, but it is paid as a subscription rather than a one-time payment.
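If you call a geocoding API from your own code, one common pattern is to send addresses in fixed-size batches rather than one request per address. The helper below is a minimal sketch of just the batching step. The batch size of 100 is an arbitrary assumption (check your provider's documented limit), the name `batches` is hypothetical, and the actual API call is deliberately left out.

```python
from itertools import islice


def batches(iterable, size=100):
    """Yield successive lists of at most `size` items from `iterable`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


addresses = ["address %d" % n for n in range(250)]  # placeholder input
sizes = [len(chunk) for chunk in batches(addresses, size=100)]
print(sizes)  # [100, 100, 50]
```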


If you like Python, you could use GeoPy, combined with the GDAL Python bindings or Fiona, to write a very basic script like this one, which converts the addresses to a point shapefile.

This will geocode the addresses in a file named 'addresses_to_geocode' and create an output shapefile named 'my_output.shp' in the my_output folder:

import os
from geopy import geocoders
from osgeo import ogr, osr

def geocode(address):
    # Recent geopy releases require a key: geocoders.GoogleV3(api_key='YOUR_KEY')
    g = geocoders.GoogleV3()
    # geocode() returns None when nothing matches; the unpacking below then
    # raises and the caller skips the address
    place, (lat, lng) = g.geocode(address)
    print('%s: %.5f, %.5f' % (place, lat, lng))
    return place, lat, lng

def parse_file(filepath, output_shape):
    # create the shapefile (replacing any previous output)
    drv = ogr.GetDriverByName("ESRI Shapefile")
    if os.path.exists(output_shape):
        drv.DeleteDataSource(output_shape)
    ds = drv.CreateDataSource(output_shape)
    # spatial reference
    sr = osr.SpatialReference()
    sr.ImportFromProj4('+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs')
    lyr = ds.CreateLayer(output_shape, sr, ogr.wkbPoint)
    # fields
    fld_id = ogr.FieldDefn('id', ogr.OFTInteger)
    fld_address = ogr.FieldDefn('ADDRESS', ogr.OFTString)
    fld_address.SetWidth(255)
    lyr.CreateField(fld_id)
    lyr.CreateField(fld_address)
    featDefn = lyr.GetLayerDefn()
    print('Shapefile %s created...' % output_shape)
    # read the text file, one address per line
    i = 0
    with open(filepath, 'r') as f:
        for line in f:
            address = line.strip()  # drop the trailing newline
            if not address:
                continue  # skip blank lines
            try:
                print('Geocoding %s' % address)
                place, lat, lng = geocode(address)
                point = ogr.Geometry(ogr.wkbPoint)
                point.SetPoint(0, lng, lat)  # x = longitude, y = latitude
                feat = ogr.Feature(featDefn)
                feat.SetGeometry(point)
                feat.SetField('id', i)
                feat.SetField('ADDRESS', address)
                lyr.CreateFeature(feat)
                feat = None  # release the feature
                i = i + 1
            except Exception as e:
                print('Error, skipping address %s: %s' % (address, e))
    # dereference the layer and data source so GDAL flushes everything to disk
    lyr = None
    ds = None

parse_file('addresses_to_geocode', 'my_output')

The input file should contain one address per line, for example:

Via Benedetto Croce 112, Rome, Italy
Via Aristide Leonori 46, Rome, Italy
Viale Marconi 197, Rome, Italy

Here I am using the Google API, but with GeoPy it is very easy to switch to a different API, such as Yahoo!, GeoNames, or MapPoint.
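One way to make that switch painless is to pass the geocoder object in rather than constructing it inside `geocode`. The sketch below assumes a geopy version (1.0 or later) whose `geocode()` returns a location object with `address`, `latitude` and `longitude` attributes, or `None` when nothing matches. `make_geocode` and `FakeGeocoder` are hypothetical names, the coordinates are made-up placeholders, and the fake exists only so the example runs without network access.

```python
from collections import namedtuple

# Minimal stand-in for the location object a geopy geocoder returns.
Location = namedtuple("Location", "address latitude longitude")


class FakeGeocoder(object):
    """Stand-in exposing the same geocode(address) method as a geopy geocoder."""

    def __init__(self, known):
        self.known = known

    def geocode(self, address):
        # Returns None for unknown addresses, as geopy does for no match.
        return self.known.get(address)


def make_geocode(geocoder):
    """Return a geocode(address) function bound to the given geocoder."""
    def geocode(address):
        location = geocoder.geocode(address)
        if location is None:
            return None
        return location.address, location.latitude, location.longitude
    return geocode


sample = "Viale Marconi 197, Rome, Italy"
fake = FakeGeocoder({sample: Location(sample, 41.86, 12.47)})  # placeholder coordinates
geocode = make_geocode(fake)
found = geocode(sample)
missing = geocode("an address the fake does not know")
print(found)
print(missing)
```

With geopy installed, something like `make_geocode(geocoders.Nominatim(user_agent="my-app"))` or `make_geocode(geocoders.GoogleV3(api_key="YOUR_KEY"))` could be swapped in without touching the rest of the script.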