How to scrape data from a Google map (in Flash)?

Glancing at the page in Firebug and looking at the network calls, you can see where they are pulling the data from. It seems to be a couple of XML files, namely:

http://graphics8.nytimes.com/packages/xml/map_feed_victims.txt?c=2182

and

http://graphics8.nytimes.com/packages/xml/map_feed_incidents.txt?c=2182
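
If you'd rather pull the files from a script than from the browser, a minimal sketch might look like the one below. The helper names `feed_url` and `fetch_feed` are made up for illustration, I'm assuming the `c=2182` parameter is just a cache-buster, and the URLs may well not be live any more.

```python
from urllib.request import urlopen

# Base path taken from the two URLs seen in the network calls.
BASE = "http://graphics8.nytimes.com/packages/xml/"


def feed_url(name, cache_buster="2182"):
    """Build the feed URL for a name such as 'victims' or 'incidents'.

    Treating 'c' as a cache-buster is an assumption, not something
    documented by the site.
    """
    return f"{BASE}map_feed_{name}.txt?c={cache_buster}"


def fetch_feed(name, timeout=10):
    """Download one feed and return its body as text (hypothetical helper)."""
    with urlopen(feed_url(name), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Calling `fetch_feed("incidents")` would then return the raw text of the incidents file, provided the server still answers.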


+1 to @ericoneal's answer, but for the sake of noting an alternative approach, you could also download and install Fiddler. Fiddler routes your browser's HTTP traffic through a local proxy and gives you an interface for poking around in the HTTP responses that follow your web request.

I'll describe the usage. In the screenshot, I had just launched Fiddler and then opened your link in IE. All the data starts streaming in without my doing anything else. Once it settled, I clicked on one of the responses at left (map_feed_incidents.txt, as noted by Eric), then selected Inspectors at top right. The pane at bottom right provides several inspection formats. I tried a few, and the screenshot shows the TextView.

At a glance, the content appears to be delimited by line breaks and tabs (it's definitely not real XML). The top line specifies the column names and types, and each subsequent line is an incident record. Here's the top line and first record from the _incidents file (scroll right and note the id field):

LAT:DOUBLE  LONG:DOUBLE incident_date:STRING    incident_time:STRING    boro:STRING num_victims:INTEGER primary_motive:STRING   id:INTEGER  weapon:STRING   light_dark:STRING   year:INTEGER
40.665626   -73.909699  01/01/08    02:09   Brooklyn    1       7   gun D   2008

The lat/long is obvious. The other two files (_victims and _perpetrators) use the same approach. Here's the top line and first record from the _perps table:

incident_id:INTEGER sex:STRING  race:STRING age:INTEGER
7   M   B   20
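
Since both samples follow the same "name:TYPE header, then tab-separated rows" layout, a small parser can handle all of the files. This is only a sketch under that assumption: `parse_feed` is a name I made up, the three type names are the only ones visible in the samples, and empty cells (like primary_motive in the first incident) are turned into None.

```python
# Map the type names seen in the header lines to Python converters.
CASTS = {"DOUBLE": float, "INTEGER": int, "STRING": str}


def parse_feed(text):
    """Parse a tab-delimited feed whose first line is 'name:TYPE' specs."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []

    # Header: split each 'name:TYPE' spec; unknown types fall back to str.
    columns = []
    for spec in lines[0].split("\t"):
        name, _, type_name = spec.strip().partition(":")
        columns.append((name, CASTS.get(type_name.strip().upper(), str)))

    records = []
    for line in lines[1:]:
        record = {}
        for (name, cast), raw in zip(columns, line.split("\t")):
            raw = raw.strip()
            # Empty cells become None instead of failing int()/float().
            record[name] = cast(raw) if raw else None
        records.append(record)
    return records


perps = parse_feed("incident_id:INTEGER\tsex:STRING\trace:STRING\tage:INTEGER\n7\tM\tB\t20\n")
print(perps)  # [{'incident_id': 7, 'sex': 'M', 'race': 'B', 'age': 20}]
```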

The presence of incident_id is useful. Both _victims and _perps have this column, and it relates their data back to the geo-tagged _incidents table using that table's id column.
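
To illustrate that relationship, here is a rough sketch of attaching the related rows to each incident. It assumes the records are already plain dicts keyed by the column names from the header lines; `attach_by_incident` and the `key` argument are hypothetical names, not anything from the feed itself.

```python
from collections import defaultdict


def attach_by_incident(incidents, related, key="perps"):
    """Return copies of the incident dicts with matching related rows attached.

    Rows in 'related' are matched on incident_id == incident['id'].
    """
    by_incident = defaultdict(list)
    for row in related:
        by_incident[row["incident_id"]].append(row)

    # Incidents with no related rows get an empty list.
    return [{**inc, key: by_incident.get(inc["id"], [])} for inc in incidents]


incidents = [{"id": 7, "LAT": 40.665626, "LONG": -73.909699, "boro": "Brooklyn"}]
perps = [{"incident_id": 7, "sex": "M", "race": "B", "age": 20}]
joined = attach_by_incident(incidents, perps)
```

Each entry in `joined` then carries its coordinates plus a list of perpetrator rows, and the same call with `key="victims"` would work for the _victims data.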


As an aside, I have to agree with George and wonder why they included the victim's name. That seems like a major ethical oversight. While it's meaningless as a mapped attribute, I would not be surprised to see the perpetrator's name. But the victim's? At first I thought this might be an unused element in the data payload, but it's really in the map?! That's a very questionable decision, and it leads me to believe nobody is using that map. Otherwise I think some criticism would have emerged from the general public.


I don't know if you can get the exact same data from the NYC Open Data repository, but here is a link.


A slightly different approach could be to try to gather the data using the New York Times API: http://prototype.nytimes.com/gst/apitool/index.html
