(Mis)adventures in Open Data Land

(or, A Cautionary Tale of Geocoding)

So in this new era of open data, data is free, right? If you’ve never tried using open data, it’s harder than you might think.

For the Tableau Customer Conference last month, we thought it would be fun to show off some data that was relevant to the conference location: Washington, D.C.

Our original idea was to associate D.C. restaurant health code violation scores with buildings, to give a simple and reasonable sense of which buildings might not have the best food facilities. Building outlines, or “footprints,” are available from the D.C. government and OpenStreetMap (OSM), tagged with restaurant names. Footprints should not be confused with parcel boundaries, which are tied to property lines. OSM data can be obtained in many ways; the easiest might be Metro Extracts.

Enter data quality challenges. Geocoding is complicated and “free” data isn’t ever really free.

When we downloaded the restaurant health code violation data, it quickly became apparent that its geographic component had a fundamental problem that kept us from spatially linking each restaurant’s latitude and longitude to the building outline data. The geocoder used to obtain the coordinates placed about half of the restaurants on the street and the rest somewhere within the property (i.e., parcel) boundary. This was a challenge because we wanted to tie each restaurant’s location to a building polygon, and a point on the street or elsewhere on the parcel does not necessarily fall inside one.

Figure: Fictitious street grid illustrating geocoding and the distinction between property parcels and building footprints.
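
To make the point-versus-polygon problem concrete, here is a minimal sketch of the containment test that a spatial link between restaurants and footprints relies on. It is not the workflow we used. The footprint names and coordinates are hypothetical, in arbitrary planar units rather than real D.C. coordinates, and the test is a standard ray-casting check written in plain Python.

```python
from typing import Optional

Point = tuple[float, float]   # (x, y), e.g. (longitude, latitude)
Polygon = list[Point]         # vertices in order; the ring closes implicitly


def point_in_polygon(point: Point, polygon: Polygon) -> bool:
    """Ray casting: count the edges crossed by a ray heading in +x."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge meets that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def find_footprint(point: Point, footprints: dict[str, Polygon]) -> Optional[str]:
    """Return the id of the first footprint containing the point, else None."""
    for footprint_id, polygon in footprints.items():
        if point_in_polygon(point, polygon):
            return footprint_id
    return None


# Hypothetical footprints separated by a street.
footprints = {
    "building_a": [(0, 0), (4, 0), (4, 3), (0, 3)],
    "building_b": [(6, 0), (9, 0), (9, 3), (6, 3)],
}
on_building = (2.0, 1.5)   # geocoded onto building_a
on_street = (5.0, 1.5)     # geocoded to the street between the buildings

print(find_footprint(on_building, footprints))
print(find_footprint(on_street, footprints))
```

The point on the building resolves to a footprint, while the point on the street matches nothing, which is the situation about half of the restaurant records were in.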

Before I explain further, I need to provide some terminology disambiguation. Within Tableau, geocoding means displaying data geographically. However, in common geospatial parlance, geocoding has a very specific meaning: a geocoder is a tool that is used to derive latitude and longitude from a human-readable street address. In short, geocoding makes address data geographic, but unless you understand the assumptions made in the geocoding process, your geocoding may not be useful.

While we are talking terminology, a composite geocoder is a geocoder that will try to find a point to assign to an address, but if it fails, will fall back on whatever part of the address it understands. For example, if you provide an invalid address in Bismarck, North Dakota, the composite geocoder will return a point in the center of Bismarck, North Dakota. If you give it a nonsensical street name and spell Bismarck like “Bismark”, it may return a point in the center of North Dakota. Rather than raising a flag, it gives you an answer, albeit a less accurate one. Failing all else, the geocoder will return a value of NULL, which a mapping client that treats missing values as zero will plot at the latitude and longitude coordinates (0,0), aka Null Island.

Figure: Null Island.
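
The fallback behavior described above can be sketched as follows. This is an illustration under stated assumptions, not how any particular geocoder is implemented: the lookup tables, the street address, and the `precision` field are all hypothetical, and the coordinates are rough illustrative values.

```python
from typing import NamedTuple, Optional


class Geocode(NamedTuple):
    lat: float
    lon: float
    precision: str   # "address", "city", or "state" (hypothetical labels)


# Hypothetical lookup tables with rough, illustrative coordinates.
ADDRESSES = {("100 main st", "bismarck", "nd"): (46.8060, -100.7900)}
CITIES = {("bismarck", "nd"): (46.8083, -100.7837)}
STATES = {"nd": (47.5, -100.5)}


def composite_geocode(street: str, city: str, state: str) -> Optional[Geocode]:
    """Try the most precise match first, then fall back level by level."""
    street_key = street.strip().lower()
    city_key = city.strip().lower()
    state_key = state.strip().lower()

    hit = ADDRESSES.get((street_key, city_key, state_key))
    if hit is not None:
        return Geocode(hit[0], hit[1], "address")

    hit = CITIES.get((city_key, state_key))
    if hit is not None:
        return Geocode(hit[0], hit[1], "city")

    hit = STATES.get(state_key)
    if hit is not None:
        return Geocode(hit[0], hit[1], "state")

    # Nothing understood: return None (NULL) rather than a made-up point.
    return None


print(composite_geocode("100 Main St", "Bismarck", "ND"))
print(composite_geocode("999 Nowhere Blvd", "Bismarck", "ND"))
print(composite_geocode("Qwerty Rd", "Bismark", "ND"))
print(composite_geocode("Qwerty Rd", "Nowhere", "ZZ"))
```

Carrying a precision label alongside the coordinates is one way to raise the flag that a bare latitude and longitude cannot. Checking for `None` before plotting is what keeps a failed lookup off Null Island.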

Finally, rooftop geocoding returns a latitude and longitude that will land on the building that matches the address. Not all geocoders are designed to do this – for example, it is unnecessary or even disadvantageous for a geocoder built for navigation and routing to resolve addresses to rooftops. It just matters that you get there, not whether it’s a valid address. Most so-called rooftop geocoders match the parcel, but may drop the point in the parking lot rather than on the building. With a little extra geospatial wizardry, you can assign a parcel-level geocode to a building footprint. However, a geocode to the middle of the street tells you nothing about which building or parcel the point belongs to.

Figure: The same fictitious street grid, illustrating different levels of geocoding accuracy.
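
One possible version of that “geospatial wizardry” is sketched below, with simplifying assumptions labeled: find the parcel that contains the geocoded point, then accept a footprint only if it is the single building on that parcel. The parcel and footprint names and coordinates are hypothetical, and the average of a footprint’s vertices stands in for a proper centroid.

```python
from typing import Optional

Point = tuple[float, float]
Polygon = list[Point]


def point_in_polygon(point: Point, polygon: Polygon) -> bool:
    """Ray-casting containment test (the ring closes implicitly)."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def vertex_mean(polygon: Polygon) -> Point:
    """Average of the vertices: a simplification, not a true area centroid."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def footprint_for_parcel_geocode(
    point: Point,
    parcels: dict[str, Polygon],
    footprints: dict[str, Polygon],
) -> Optional[str]:
    """Map a parcel-level geocode to the only footprint on that parcel."""
    parcel = next(
        (poly for poly in parcels.values() if point_in_polygon(point, poly)),
        None,
    )
    if parcel is None:
        return None   # e.g. a street geocode: there is no parcel to work from
    candidates = [
        footprint_id
        for footprint_id, poly in footprints.items()
        if point_in_polygon(vertex_mean(poly), parcel)
    ]
    # With zero or several buildings on the parcel the match is ambiguous.
    return candidates[0] if len(candidates) == 1 else None


# Hypothetical parcel with one building and a parking lot.
parcels = {"parcel_1": [(0, 0), (10, 0), (10, 6), (0, 6)]}
footprints = {"bldg_1": [(1, 1), (5, 1), (5, 5), (1, 5)]}

parking_lot_point = (8.0, 3.0)   # inside the parcel, outside the building
street_point = (5.0, -2.0)       # outside every parcel

print(footprint_for_parcel_geocode(parking_lot_point, parcels, footprints))
print(footprint_for_parcel_geocode(street_point, parcels, footprints))
```

The parking-lot point still resolves to the building because the parcel ties them together, while the street point returns nothing. Parcels with several buildings would need another rule, or manual placement.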

Returning to our restaurant health code violation scores, it appears the data was geocoded with a composite geocoder that took a first pass through a rooftop geocoder and, failing that, assigned coordinates according to the address range of a street, and so on. We would have had to invest significant time in data cleaning, re-geocoding, and manually placing points to assign all restaurant locations to a building footprint.

Instead, we scrapped our efforts and found other data. Sometimes life is too short for data cleaning. (Ed. note: please read our companion post about the viz we did build.)

The question of how to build the most precise geocoder isn’t an easy one, but it is one we think about a lot. Any solution would require an enormous amount of information about address and building configurations on the ground, yet it clearly doesn’t make sense to drop one point at a time to generate hundreds of thousands of locations for civic engagement or business intelligence. In sum, the details of data matter, and dealing with them is less glamorous, and more important, than most of what else you will do when creating a (geographic) data visualization.