What problems are solved by splitting street addresses into individual columns?

I spent 7 years developing software for a publishing company and one of the hardest problems we ever tackled was parsing street addresses in subscription lists. It is useful to split up addresses into distinct fields, but you can never, EVER design for every possible pathological aberration of address formats and components the human brain can devise.

Every locality can have its quirks, and that's just in the US. Throw in other countries and things get unmanageable very quickly for any approach that wants to parse every address. Just two examples:

In Spain, the street number always comes after the street name and a comma, and many addresses contain a floor number ordinal, such as 1º or 3ª, along with abbreviations for "left" ("Izda", meaning the left-hand door once you get up the stairs), "right" ("Dcha") or other possibilities. Now multiply that quirkiness by the number of different countries and areas with different historical customs for addresses... (Japan? Rural England? Korea? China?)

In Portland, OR, there are N-S and E-W axes that divide the city into NW, NE, SW and SE quadrants (as well as a N "quadrant", but I digress). N-S streets are numbered incrementally east and west of the N-S axis, and the number of the nearest N-S street sets the "hundred block" for addresses on E-W streets (e.g. a house on an E-W street between 11th and 12th Avenues would have a number like 1123). Pretty standard stuff for US addresses.

Every so often you run into a Portland address like 0205 SW Nebraska St. A leading zero? WTF? There goes my integer column for house "number".

When the grid was set up, the N-S axis was defined by the Willamette River. Everything to the east of the river was NE or SE, and everything west of the river NW or SW. As the city grew south they ran into the inconvenient fact that the river meanders to the east, so projecting the axis south leaves a problematic area that's on the "west" side of the river but east of the axis. The solution was to add a leading zero, in effect a minus sign, with the numbers incrementing towards the east from the axis line.
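To make the leading-zero problem concrete, here is a tiny Python sketch. The variable names are made up for illustration; the only real input is the house number from the Portland address above.

```python
# House number taken from "0205 SW Nebraska St".
raw_house_number = "0205"

# Stored as an integer, the leading zero (the "minus sign") is silently lost.
as_integer = int(raw_house_number)

# Stored as text, the value survives a round trip unchanged.
as_text = raw_house_number

print(as_integer)  # 205
print(as_text)     # 0205

# Under the scheme described above, 205 and 0205 sit on opposite sides of
# the axis, so the integer version can no longer tell the two apart.
print(str(as_integer) == as_text)  # False
```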

If I were you I'd give up hope of designing the ultimate system. You cannot cover all possibilities, and new ones will be created as humanity pushes into previously undeveloped land.

For US addresses, take a look at what the USPS has already done in address standardization, and remember to make the house_number column a varchar. While you're at it, figure out how you're going to parse 1634 E N Fort Lane Ave.

For the rest of the world, I'd probably try to abstract additional fields to cover 80-90% of what is likely to come up, and provide a set of uninterpreted fields that can handle everything else when necessary. That is, if your parser fails to handle an address, save it unparsed and flagged as such. If you do manage to parse an address, make sure you remember the order in which you found the various fields so you can reassemble it into something deliverable.
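A rough Python sketch of that "parse what you can, flag what you can't, remember the order" idea follows. Everything in it is an assumption made for illustration: the two regular expressions are deliberately naive, the names StoredAddress, store_address and reassemble are invented, and the Spanish-style sample addresses are made up. A real parser would need far more than this.

```python
import re
from dataclasses import dataclass, field
from typing import List, Tuple

# Two deliberately naive patterns, each paired with the separator used to
# rebuild the address. Both are illustrative assumptions, not real rules.
PATTERNS = [
    # Number first, e.g. "0205 SW Nebraska St"
    (re.compile(r"^(?P<house_number>\d+)\s+(?P<street>.+)$"), " "),
    # Street first, then a comma and the number, e.g. "Calle Mayor, 12"
    (re.compile(r"^(?P<street>[^,]+),\s*(?P<house_number>\d+)$"), ", "),
]

@dataclass
class StoredAddress:
    raw: str                     # original input, always kept as backup
    parsed: bool = False         # False means "saved unparsed and flagged"
    parts: List[Tuple[str, str]] = field(default_factory=list)  # in input order
    separator: str = " "

def store_address(raw: str) -> StoredAddress:
    text = raw.strip()
    for pattern, separator in PATTERNS:
        match = pattern.match(text)
        if match:
            # Order the fields by where each one started in the input.
            ordered = sorted(match.groupdict().items(),
                             key=lambda item: match.start(item[0]))
            return StoredAddress(raw, True, ordered, separator)
    return StoredAddress(raw)    # nothing matched: keep it raw and flagged

def reassemble(addr: StoredAddress) -> str:
    if not addr.parsed:
        return addr.raw
    return addr.separator.join(value for _, value in addr.parts)

for sample in ["0205 SW Nebraska St", "Calle Mayor, 12", "Calle Mayor, 12, 3º Izda"]:
    stored = store_address(sample)
    print(stored.parsed, stored.parts, "->", reassemble(stored))
```

The third sample falls through both patterns, so it is stored untouched with parsed set to False, which is exactly the "know when to quit" case.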

I was going to say that the most important field is going to be post code, but even that is not a given in many places.

Good luck. This can be a fun and extremely frustrating endeavor but the key to sanity is to know when to quit trying and just store the input unparsed, or partially parsed with the original input as backup.


Problems that can be solved by splitting include

Validation: Any one part of the address can be compared to a master list. Those which do not match can be rejected. Postcode / zipcode is an obvious example. These are issued and maintained by an independent authority. The only valid ones are those issued by that authority.

Sorting and Selection: I have seen cases where postal charges are reduced if mail is handed to the delivery service already organised to some extent. Having the corresponding columns produces tangible business value.

Analysis: It can be useful to know where your orders are going, in a geographically hierarchical way. This may drive sales initiatives, product development or commission payments etc.

Code Duplication: By having all applications in an organisation adopt the same data model (that of the most complex consumer), a single code base can be adopted enterprise-wide and maintained consistently. Endlessly duplicated hair splitting can be avoided, or at least delegated to the propellerheads. Addresses held by different parts of the organisation can be updated consistently. Customer service and satisfaction can be increased. Development effort can concentrate on the unique, high value parts of a system.

Legal Issues: Laws and taxes vary by jurisdiction. By capturing the detailed address values separately it is easier to cross-reference transactional data to compliance requirements.

Duplication: It is simple to spoof addresses held as text by moving one element to the next line or resequencing some parts. Fully parsed addresses are easier to compare. This may be a simple data quality issue, or may have compliance or credit implications if, say, multiple shell companies make large orders to the same delivery address, or a credit card is used to deliver to many dispersed locations in a short period.

Formatting: Parts held separately can be combined in whatever fashion suits the current need. If, say, long thin print labels become cheap you can reformat to use them.
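As a small illustration of the validation and duplication points, here is a Python sketch that works on addresses already split into fields. The field names, the helper names and the tiny postcode list are all invented for the example; a real master list would come from the issuing authority.

```python
from typing import Dict, Set, Tuple

# Made-up stand-in for a master list of valid postcodes.
VALID_POSTCODES: Set[str] = {"97201", "97219", "97232"}

def postcode_is_valid(address: Dict[str, str]) -> bool:
    """Check a single field against the master list."""
    return address.get("postcode", "").strip() in VALID_POSTCODES

def comparison_key(address: Dict[str, str]) -> Tuple[str, ...]:
    """Build a key that ignores case, extra whitespace and field order."""
    fields = ("house_number", "street", "city", "postcode")
    return tuple(" ".join(address.get(name, "").upper().split()) for name in fields)

a = {"house_number": "0205", "street": "SW Nebraska St",
     "city": "Portland", "postcode": "97219"}
# Same address, resequenced and with different case and spacing.
b = {"postcode": "97219", "city": "portland",
     "street": "sw  nebraska st", "house_number": "0205"}
# Postcode that is not on the master list.
c = {"house_number": "1", "street": "Main St",
     "city": "Portland", "postcode": "00000"}

print(postcode_is_valid(a))                     # True
print(postcode_is_valid(c))                     # False
print(comparison_key(a) == comparison_key(b))   # True
```

The same comparison on two free-text blobs would need fuzzy matching; with separate fields it reduces to normalising each part and comparing tuples.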

Of course, none of these may apply to any specific application. Data of this type is much easier to parse and validate at source, when collected, than it ever will be in post analysis. So even if YAGNI seems to apply, it may be better to put in the extra effort up front: the cost is small and the potential future saving is large.

Finally, I wouldn't dismiss the human factor. The data model is produced by data modellers. It's what they do. That's their profession. They're not going to tell you to just dump it in a BLOB, are they?


Like all design questions, there's a hugely qualified "it depends". It depends on your data story - how the data is collected, how it is used, how it gets updated, etc. All my comments should be taken as discussion points, not how-to answers.

It sounds like* you could benefit more from using an address validation service than trying to build one for yourself. While they are costly, many such services come with significant mailing discounts.

Of course, there is a compromise here, for certain data stories. You can persist the parsed out address pieces and create a computed column (set of columns, likely) for the combined address. This is an implementation answer, with all the normal caveats implied.

I have implemented the parsed out address design. We absolutely needed this for data quality AND data processing needs. But that was a business that had physical addresses, postal addresses, virtual addresses, etc.

The other issue that can come up is that different postal services require the same information to be presented in different formats/orders/etc. So having the parts modeled out supports presenting the same info in a variety of formats and layouts.
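A minimal Python sketch of that idea: one set of stored parts, rendered in two different layouts. The function names, field names and sample address are invented, and the two layouts are only loosely modeled on number-first and street-first conventions rather than on any postal service's official rules.

```python
from typing import Dict

def layout_number_first(p: Dict[str, str]) -> str:
    """House number before street; city, region and postcode on line two."""
    return f"{p['house_number']} {p['street']}\n{p['city']}, {p['region']} {p['postcode']}"

def layout_street_first(p: Dict[str, str]) -> str:
    """Street, comma, house number; postcode before city on line two."""
    return f"{p['street']}, {p['house_number']}\n{p['postcode']} {p['city']}"

# Made-up sample address held as separate parts.
parts = {
    "house_number": "12",
    "street": "Calle Mayor",
    "city": "Madrid",
    "region": "Madrid",
    "postcode": "28013",
}

print(layout_number_first(parts))
print(layout_street_first(parts))
```

Adding another layout is just another small function over the same stored parts, with no re-parsing of free text.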

Finally, you don't need to have international business operations to have to support international data. Even US-based businesses need to support international addresses. It's a huge data mistake to assume you will never have that. Customers move, vendors change HQs, vendor contact info can be international even if they have a US HQ. Even if your current systems made that mistake, you don't want to carry this one forward.

I highly recommend the writing and blog posts of Graham Rhind. He's the expert in the data field on addresses of all kinds and the trade-offs associated with them.


*All I've said here is a gross generalization. There are so many questions I'd need to ask before arriving at a design solution that it might take a few hours of chatting. Likely some pictures and some data profiling, too. And then a lot of really quirky data stories about addresses.