Bulk lookup of address census tract and block

Ok Ben, here are my assumptions:

1) You've already got your data (I had some address points in a shapefile, and I downloaded census tract and census block shapefiles for Missouri).

2) You've already geocoded your address points and you're comfortable projecting the data.

3) You're comfortable with an OGR/PostGIS solution (both free).

Here are some install notes if you don't already have this software: How to install Postgres with PostGIS support. (By BostonGIS. Please don't take offense at their title; I just think it's the best how-to out there.) Also, here are one, two, and three sites describing how to install GDAL/OGR with Python bindings.

Caveat: Before performing the actual analysis (i.e. the ST_Contains stuff, below), ensure all your layers are in the same projection! If you have shapefiles, it's easy to translate from one projection to another using either Quantum GIS (QGIS) or OGR (or ArcGIS if you have it). Alternatively, you could perform the projection transformation in the database using PostGIS functions. Basically, pick your poison, or let us know if this is a stumbling block.
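If you go the OGR route, here's a sketch of reprojecting a shapefile with ogr2ogr's -t_srs flag. The target SRS shown (ESRI:102697, Missouri State Plane West, feet) and the paths are assumptions; substitute whatever projection your data actually needs:

```shell
# write a reprojected copy of the tracts shapefile
# (output path comes first, then input, in ogr2ogr syntax)
ogr2ogr -t_srs ESRI:102697 "E:\path_to\st_tract10_spMoWest.shp" "E:\path_to\st_tract10.shp"
```

If OGR can't read the input's projection from its .prj file, you can state it explicitly with -s_srs.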

With those givens, this is how I appended tract and block attributes to some address points data using PostGIS:

First I used ogr2ogr to import the three shapefiles into PostGIS:

Import addresses using ogr2ogr:

ogr2ogr -f "PostgreSQL" PG:"host=127.0.0.1 user=youruser dbname=yourdb password=yourpass" "E:\path_to\addresses.shp" -nln mcdon_addresses -nlt geometry

Import census tracts (Missouri) using ogr2ogr: The spMoWest suffix implies I have already translated my data to Missouri State Plane West Feet.

ogr2ogr -f "PostgreSQL" PG:"host=127.0.0.1 user=youruser dbname=yourdb password=yourpass" "E:\path_to\st_tract10_spMoWest.shp" -nln mo_tracts_2010 -nlt geometry

Import blocks data (Missouri): This one took a while. In fact, my computer kept crashing and I had to put a fan on it! Also, note that ogr2ogr won't give any feedback by default (though the -progress flag will show a progress bar), so don't get punchy; wait on it and it'll eventually finish.

ogr2ogr -f "PostgreSQL" PG:"host=127.0.0.1 user=youruser dbname=yourdb password=yourpass" "E:\path_to\st_block10_spMoWest.shp" -nln mo_blocks_2010 -nlt geometry

Once the data imports are done, launch pgAdmin III (the Postgres GUI), browse into your database, and run some quick maintenance commands so that Postgres will run faster with these new tables:

vacuum mcdon_addresses;
vacuum mo_tracts_2010;
vacuum mo_blocks_2010;
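On that note, before running the ST_Contains queries below, it's also worth building spatial (GiST) indexes on the geometry columns. A sketch, assuming the geometry column ogr2ogr created is named wkb_geometry as in the queries below:

```sql
-- GiST indexes let ST_Contains use an index scan instead of
-- comparing every point against every polygon
CREATE INDEX mo_tracts_2010_geom_idx ON mo_tracts_2010 USING GIST (wkb_geometry);
CREATE INDEX mo_blocks_2010_geom_idx ON mo_blocks_2010 USING GIST (wkb_geometry);
```

This matters most for the blocks table, which has by far the most polygons.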

Next, I was curious how many raw address points I imported, so I did a quick COUNT(*). I usually do a count at the start of a task like this to give me a foothold for "sanity checks" later on.

SELECT COUNT(*) FROM mcdon_addresses;
-- 11979

In the next phase, I created two new tables, gradually adding the tract attributes, and then the block attributes, to my original address points table. As you'll see, the PostGIS ST_Contains function did the heavy lifting, in each case creating a new table of points, each point gaining the attributes of the tract and block polygons it fell inside of.

Note! For brevity, I'm only taking a handful of fields from each table. You'll probably want almost everything. I say almost because you'll need to omit the ogr_fid field (maybe others too?) from the tables you're combining; otherwise Postgres will complain about both fields having the same name.

(P.S. I did some snooping around here while figuring this out: http://postgis.net/docs/manual-1.4/ch04.html)
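If you're unsure which fields a table actually has (so you know what to keep and what to omit), you can ask Postgres directly. A quick sketch:

```sql
-- list every column ogr2ogr created in the tracts table
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'mo_tracts_2010';
```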

Create a new table of address points with tracts attributes: Note I'm prefixing each output column with a hint disclosing which table it started in (I'll explain why below).

CREATE TABLE mcdon_addresses_wtract AS
SELECT 
  a.wkb_geometry,
  a.route AS addr_route, 
  a.box AS addr_box, 
  a.new_add AS addr_new_add, 
  a.prefix AS addr_prefix, 
  a.rdname AS addr_rdname, 
  a.road_name AS addr_road_name, 
  a.city AS addr_city, 
  a.state AS addr_state, 
  a.zip AS addr_zip,
  t.statefp10 AS tr_statefp10, 
  t.countyfp10 AS tr_countyfp10, 
  t.tractce10 AS tr_tractce10,  
  t.name10 AS tr_name10, 
  t.pop90 AS tr_pop90, 
  t.white90 AS tr_white90, 
  t.black90 AS tr_black90, 
  t.asian90 AS tr_asian90, 
  t.amind90 AS tr_amind90, 
  t.other90 AS tr_other90, 
  t.hisp90 AS tr_hisp90
FROM
  mcdon_addresses AS a,
  mo_tracts_2010 AS t
WHERE 
  ST_Contains(t.wkb_geometry, a.wkb_geometry);

Maintain the table so Postgres continues to run smoothly:

vacuum mcdon_addresses_wtract;

Now I had two questions:

Did ST_Contains actually work? And does the number of addresses returned make sense given the data inputs I used?

I was able to answer both using the same query:

select count(*) from mcdon_addresses_wtract;
-- returns 11848

A quick reflection on the losses: first, I repeated the overlay in ArcGIS (you could also do this in QGIS) and it returned the same count, so the difference isn't a PostGIS quirk. Why the difference, then? First, some addresses fell outside Missouri, and I only compared against Missouri tract polygons. Second, on closer inspection, there were some examples of bad digitizing in the addresses data. Specifically, many of the points not caught by ST_Contains had empty attribute fields, which is a good sign something went foul during digitizing; it also means they weren't usable data anyway. At this point I'm comfortable with the difference, as I could reasonably go back and improve the data for a cleaner analysis.
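If you want to inspect the points that fell through, a query along these lines should surface them (a sketch using NOT EXISTS; it simply inverts the ST_Contains test used above):

```sql
-- addresses that landed in no Missouri tract polygon
SELECT a.*
FROM mcdon_addresses AS a
WHERE NOT EXISTS (
  SELECT 1
  FROM mo_tracts_2010 AS t
  WHERE ST_Contains(t.wkb_geometry, a.wkb_geometry)
);
```

Scanning the attribute fields of those rows is how I spotted the bad digitizing mentioned above.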

Moving on, the next step was appending block attributes to the address/tract table. Similarly, I did this by creating a new table, once again prefixing each output field to indicate the table it came from (the prefixing is quite important, as you'll see):

CREATE TABLE mcdon_addr_trct_and_blk AS
SELECT 
  a.*,
  b.pop90 AS blk_pop90, 
  b.white90 AS blk_white90, 
  b.black90 AS blk_black90, 
  b.asian90 AS blk_asian90, 
  b.amind90 AS blk_amind90, 
  b.other90 AS blk_other90, 
  b.hisp90 AS blk_hisp90
FROM 
  mcdon_addresses_wtract AS a,
  mo_blocks_2010 AS b
WHERE
  ST_Contains(b.wkb_geometry, a.wkb_geometry);

Of course, maintain the table:

vacuum mcdon_addr_trct_and_blk;

The reason I prefixed each output field is that if I hadn't, some fields would have had the same names, making it impossible to distinguish them from one another in the final product (also, Postgres may have complained midway through, but since I was renaming, I didn't give it the chance). Consider, for instance, the following two fields from the steps above. You can see why I renamed them:

t.pop90 AS tr_pop90   -- would have been simply pop90
b.pop90 AS blk_pop90  -- also would have been pop90!

Now that we have an addresses-with-tracts-and-blocks dataset, do we still have the same number of points?

select count(*) from mcdon_addr_trct_and_blk;
-- 11848 (thumbs up!)

Yes, we do! If you want, you can go ahead and delete the first table we created, mcdon_addresses_wtract. We no longer need it for the analysis.
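Dropping it is a one-liner:

```sql
-- remove the intermediate table; mcdon_addr_trct_and_blk has everything we need
DROP TABLE mcdon_addresses_wtract;
```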

As a last action, you may want to export your data from Postgres into an ESRI shapefile so that you can view it in other programs, like ArcGIS (of note, QGIS can read the PostGIS data directly without issue). If you're interested, here's how you could perform the conversion using ogr2ogr:

ogr2ogr -f "ESRI Shapefile" "E:\path_to\addr_trct_blk.shp" PG:"host=127.0.0.1 user=youruser dbname=yourdb password=yourpass" "mcdon_addr_trct_and_blk"

Finally, when you run this command, you'll likely get some warnings like this:

Warning 6: Normalized/laundered field name: 'tr_statefp10' to 'tr_statefp'

This just means OGR had to shorten that field name, because shapefile field names are limited to 10 characters (a limit of the DBF format).

Of course, this is only one of many ways to accomplish this job.


P.S. If you'd rather skip the local setup entirely, the FCC has a Census Block Conversions API: http://www.fcc.gov/developer/census-block-conversions-api
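For a small number of points, that API returns the FIPS codes (state, county, tract, block) for a single latitude/longitude per request. A sketch from memory of that API; the exact endpoint and parameters are assumptions, so verify them against the linked docs, and the coordinates here are made-up sample values:

```shell
# look up the census block containing one point (JSON response)
curl "http://data.fcc.gov/api/block/find?format=json&latitude=38.26&longitude=-92.43"
```

For ~12,000 points, though, the bulk PostGIS approach above will be much faster than 12,000 HTTP requests.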