Would you consider online geocoding a breach of privacy?

There is definitely a privacy implication here, particularly if you are working with small batches of data. Anyone mining the data stream can infer that all requests in the same batch have something in common, even if the medical condition or personal information is never disclosed over the wire.

A better technique is to batch up lots of unrelated data / patients for bulk geocoding.

For example, combine the data you need geocoded with data from other researchers; the more unrelated the subjects, the better. Randomize the order of the requests, then process the whole queue in one go, once per day.

Now it becomes vastly harder to mine the data, even if an attacker is able to overhear the geocoding requests.
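
As a rough illustration of the pooling and shuffling described above, here is a minimal sketch. The function names and the shape of the `batches` dictionary are my own assumptions, not part of any geocoding service; the point is only that the order sent over the wire carries no hint of which study an address belongs to, while you can still route results back locally.

```python
import random
from typing import Dict, List, Optional, Tuple


def build_shuffled_queue(
    batches: Dict[str, List[str]], seed: Optional[int] = None
) -> List[Tuple[str, str]]:
    """Pool addresses from several unrelated sources and shuffle them.

    `batches` maps a (hypothetical) source label to its addresses.
    Pass `seed` only for reproducible testing; by default the OS
    random source is used.
    """
    queue = [
        (source, address)
        for source, addresses in batches.items()
        for address in addresses
    ]
    rng = random.Random(seed) if seed is not None else random.SystemRandom()
    rng.shuffle(queue)
    return queue


def split_results(
    queue: List[Tuple[str, str]], results: List[object]
) -> Dict[str, Dict[str, object]]:
    """Route results (in queue order) back to the source that owns each address."""
    by_source: Dict[str, Dict[str, object]] = {}
    for (source, address), result in zip(queue, results):
        by_source.setdefault(source, {})[address] = result
    return by_source
```

Only the address half of each queue entry should ever be sent to the geocoder; the source labels stay on your machine and are used solely by `split_results` afterwards.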


Geocoding locally with encrypted files on a secure server would definitely be the gold standard for privacy. Using Tor would be the next best thing, if geocoding using a remote API is needed.

Tor protects you by bouncing your communications around a distributed network of relays run by volunteers all around the world: it prevents ... the sites you visit from learning your physical location.

Along with injecting random addresses (as others here recommend) and using SSL/TLS (HTTPS) to encrypt communications with the service's endpoints (make sure you're also doing this), I can't think of a more secure way to geocode remotely than via the Tor Project. Whatever geocoding service you're using won't ever be able to identify where the requests ultimately came from, and with HTTPS no one else will, either. Note: don't use a geocoding service that requires an API key for this, or you'll no longer be anonymous. (At the time of writing, Google doesn't require an API key.)
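
To make the combination of decoy addresses and Tor concrete, here is a minimal sketch. It assumes a Tor client is listening on its default local SOCKS port (9050) and that the third-party `requests` library is installed with SOCKS support (`pip install requests[socks]`). The `endpoint` argument and the `q` query parameter are placeholders I made up; substitute whatever your geocoding service actually expects.

```python
import random
from typing import List, Optional, Sequence

# Tor's default local SOCKS port. The "socks5h" scheme makes DNS lookups
# go through Tor as well, so hostnames are not resolved locally.
TOR_SOCKS_PROXY = "socks5h://127.0.0.1:9050"


def mix_with_decoys(
    real: Sequence[str],
    decoys: Sequence[str],
    n_decoys: int,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Return the real addresses plus `n_decoys` random decoys, shuffled."""
    if rng is None:
        rng = random.SystemRandom()
    chosen = rng.sample(list(decoys), n_decoys)
    mixed = list(real) + chosen
    rng.shuffle(mixed)
    return mixed


def geocode_via_tor(address: str, endpoint: str) -> dict:
    """Send one request through the local Tor SOCKS proxy over HTTPS.

    `endpoint` and the "q" parameter are hypothetical placeholders.
    """
    import requests  # third-party; imported lazily so the helper above works without it

    proxies = {"http": TOR_SOCKS_PROXY, "https": TOR_SOCKS_PROXY}
    response = requests.get(
        endpoint, params={"q": address}, proxies=proxies, timeout=60
    )
    response.raise_for_status()
    return response.json()
```

Keep your own record of which addresses were real and simply discard the results for the decoys once the batch comes back.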

More details about using Tor are in my answer to a related question here.


This is an excellent question that I have been asked a number of times lately since I work for an address verification company called SmartyStreets.

First off, a postal address represents a single locatable point on the map. An address by itself is inherently benign because it doesn't have any additional information. Drawing a point on a map doesn't do anything. It is only when you begin to assign CONTEXT to that point (address) that it starts to mean something.

With that in mind, a postal address can represent a person, an organization, a building, a car, whatever. Once you start gathering multiple postal addresses, you increase the context that can be derived from that grouping. Similarities can be determined to see what the addresses have in common. Still, a grouping of addresses in the same area doesn't by itself denote much context. I can look at a Google map and see all the houses in a certain area. That's not a breach of privacy unless I have unauthorized access to privileged information.

Other points of context must be combined in order to actually give away any kind of private data. For example, a group of postal addresses that are submitted to an online service for address verification and/or geocoding doesn't give away information unless you know who submitted the list for processing. Once the list owner is known certain inferences can be made about the intended use of the list. Knowing this additional context, such as list owner and intended use, would certainly qualify as privileged information and can be a source of privacy breach.

Bringing the processing "in-house" so no external data service is involved is an option. It certainly excludes any type of unauthorized access to privileged information. Address verification and geocoding are not tasks for the uninitiated and certainly require advanced skills (meaning experience gained over time) in order to process very large lists without consuming inordinate amounts of time and resources. So bringing it in house is certainly an option, but does every company that has sensitive address information have the resources to do their own "secure" address processing (including geocoding) in house? No. (Although it would certainly mean job security for the readers of this website.)

There are ways to maintain the requisite privacy and still use online services. One method would be to create an account, get everything tested and figured out and then, using a temporary email address, set up a new account with an unrelated billing address associated with a credit card that can't be traced back to you. Processing the addresses on this account would theoretically not give away any valuable context and thus would maintain the privacy of the individuals on the list. (This is starting to sound like the movie Enemy Of The State.)

If that sounds complex and unnecessary, I agree. A simpler method would be to take advantage of an API that uses HTTPS and POST and that doesn't store or log any of the data that you process. The use of HTTPS means that the only record would be a timestamp and the IP address that you call from. The underlying URL would not be known. Of course, the account that you use would lead back to you, but that's not a problem: a POST request lets you attach a payload (in this case a batch of addresses), and the contents of the payload are not logged. Thus, the addresses that you submit are not in any server log. And because memory is wiped between each process, those addresses are never stored or logged, and their transmission back to you is done over a secure connection. The end result is a log like this:

13Mar2012 06:31(-6) IP:12.134.223.12 UserID: 875564 -- POST QTY:3439942 -- [Processed]

Anyone who looks at the logs would see only that you processed some addresses; they would have no idea which addresses were processed. This satisfies even the strictest privacy policy requirements. It wouldn't make sense for me to point out that this type of service is available (and super fast) without mentioning where to find it. It's already built into the LiveAddress API service from SmartyStreets. Other services such as Cdyne, QAS, and ServiceObjects may also offer something similar, but I haven't heard of any yet.
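
To show what "addresses in the POST payload, not in the URL" means in practice, here is a minimal sketch using only the Python standard library. The endpoint URL and the JSON-list payload format are assumptions for illustration, not the actual LiveAddress API; check your provider's documentation for the real request format.

```python
import json
import urllib.request
from typing import List


def build_batch_request(endpoint: str, addresses: List[str]) -> urllib.request.Request:
    """Build an HTTPS POST whose body carries the addresses.

    Nothing sensitive is placed in the URL or query string, so a server
    that logs only the request line never sees the addresses themselves.
    The JSON-list body is a hypothetical format.
    """
    if not endpoint.startswith("https://"):
        raise ValueError("use an HTTPS endpoint so the payload is encrypted in transit")
    body = json.dumps(addresses).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical endpoint; the request is only built here, not sent.
request = build_batch_request(
    "https://api.example.com/verify",
    ["1 Main St, Springfield", "22 Oak Ave, Shelbyville"],
)
```

Sending it would be a matter of passing `request` to `urllib.request.urlopen`; the addresses travel only inside the encrypted body.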