Datalinting

Data collection and curation is an imprecise process that can lead to the unintentional introduction of errors at any stage. Since we take data quality seriously at Urban Mapping, we needed an automated, repeatable method for identifying these errors in our vast data warehouse. Bell Labs faced a similar problem in the 1970s with its library of C code. This prompted Stephen C. Johnson to coin the term lint to describe these unintentional defects, and he was subsequently compelled to create a program that examined source code and reported probable issues and inconsistencies. This launched a bit of a trend, and since then just about any programming language that achieves a certain level of popularity grows a corresponding lint program. Here at Urban Mapping we started growing a datalint program.

How do we introduce data lint? Well, we occasionally experiment with changes to our schemas, and it may take time for automated data loading processes to be taught to respect the new schema. Depending on the nature of the schema change, these processes may deposit some lint in the database. Sometimes we manually edit data and metadata, and humans are, well, human, and may miss things like orphan data categories, duplicate records, or invalid date ranges. Sometimes we encounter undesirable but tolerable inconsistencies across related data from different vendors. For example, statistical data for the fifty United States may or may not contain records for Puerto Rico.

Our first impulse when we identify some lint lying around is to just clean it up and move on. But a nagging question emerges: is there other data present that has this same problem? This occasionally motivates a deeper, perhaps automated, inspection of the database. But after the next schema migration, data load, or manual edit session, we have to ask the nagging question all over again. To prevent this, we’ve added a “data linting” step to our database release procedure. The datalint tests are implemented as plain old unit test cases, but instead of testing code, they walk the database and test the data. Because they use the same Python/SQLAlchemy environment that we use for batch data loading and the Mapfluence API, the tests can be implemented on the various layers of abstraction that we are already maintaining. This prevents the phenomenon of having a bunch of different copy-paste scripts lying around and rotting. Our process is captured by the graphic below: our data team is notified to run the datalint, new tests are written if necessary (and fail), the issue is fixed, datalinting is repeated (this time successfully), and stakeholders are notified of the correction.

[Figure: High-level datalinting workflow]
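
To make this concrete, here is a minimal sketch of what one of these datalint tests might look like. The connection string and the table and column names (versions, dataset_id, version_id, valid_start, valid_end) are hypothetical placeholders rather than our actual schema; the point is simply that a lint check is an ordinary unit test that walks the database instead of exercising application code, shaped like the version-overlap check reported in the log below.

import unittest
from sqlalchemy import create_engine, text

class VersionLintTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Hypothetical connection string; in practice this would come from
        # the same configuration our batch loaders and the API already use.
        cls.engine = create_engine('postgresql:///mapfluence')

    def test_dataset_version_overlap(self):
        """Check a set of versions based on their dataset id"""
        # Two versions of the same dataset should never have overlapping
        # validity intervals; an open-ended interval is treated as infinite.
        query = text("""
            SELECT a.dataset_id, a.version_id, b.version_id
            FROM versions a
            JOIN versions b
              ON a.dataset_id = b.dataset_id
             AND a.version_id < b.version_id
             AND a.valid_start < COALESCE(b.valid_end, TIMESTAMP 'infinity')
             AND b.valid_start < COALESCE(a.valid_end, TIMESTAMP 'infinity')
        """)
        with self.engine.connect() as conn:
            flagged = conn.execute(query).fetchall()
        self.assertEqual([], flagged,
                         '%d overlapping versions found.' % len(flagged))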

The datalint output provides productive feedback to our data wizards. At the low level, the comprehensive log file pinpoints the potential issues within the data. At a higher level, the datalint test suite reports the percentage of lint tests that are passing, and each datalint test reports the percentage of passing groups of data. This offers quantitative insight into the question: “how healthy is the data?” Below is a sample output log. Datalinting has proven to be an invaluable practice for the Mapfluence team. Less lint, more awesome.

Check a set of versions based on their dataset id ... FAIL
Check if geometrytable_set is over-inclusive ... ok
Check all attribute records for a valid geometry version id ... FAIL
Check all attribute records for a version id ... ok
Check attribute records with geometry tables not in geometrytable_set ... ok
Check all versions are being used in a geometry or attribute table ... ok

======================================================================
FAIL: Check a set of versions based on their dataset id
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/mapfluence-server/datalint/test_version.py", line 37, in test_dataset_version_overlap
    ['   %s' % x for x in v]) for k, v in flagged.iteritems()]))
AssertionError:
2 overlapping versions found.
umi.us_census_cd has overlapping versions.
   None:1244 => 2003-01-03 05:00:00 to 2013-01-01 00:00:00
   None:1247 => 2009-01-01 05:00:00 to 2013-01-03 00:00:00
   None:1245 => 2013-01-01 00:00:00 to None
   None:1246 => 2013-01-03 00:00:00 to None
umi.us_coli has overlapping versions.
   None:1778 => 2009-04-01 05:00:00 to 2009-07-01 05:00:00
   None:1777 => 2009-07-01 05:00:00 to 2010-04-01 05:00:00
   None:1781 => 2010-04-01 04:00:00 to 2010-07-01 04:00:00

======================================================================
FAIL: Check all attribute records for a valid geometry version id
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/mapfluence-server/datalint/test_version.py", line 100, in test_null_geometry_version
    for k, v in flagged.iteritems()]))
AssertionError:
1 attribute tables with null geometry version ids found

Null geometry version ids found for umi.unemployment.attributes. Available versions are:
   None:1687 => 1976-01-01 00:00:00 to 1976-02-01 00:00:00
   None:1582 => 1976-02-01 00:00:00 to 1976-03-01 00:00:00
   ...
   ...
   ...
   None:1429 => 2013-11-01 00:00:00 to 2013-12-01 00:00:00
   None:1371 => 2013-12-01 00:00:00 to 2014-01-01 00:00:00

----------------------------------------------------------------------
Ran 6 tests in 837.382s

FAILED (failures=2)
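
For what it’s worth, the run above is nothing more exotic than nose pointed at the datalint directory in verbose mode, so the first line of each test’s docstring shows up as its description. Something along these lines (assuming nose is installed and the database connection is configured in the environment):

nosetests -v /opt/mapfluence-server/datalint/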

As datalinting grows into a mature part of our development cycle, potential improvements will surface. One key enhancement we’ve identified is adding varying severity levels, rather than a binary pass/fail rating, within our Python testing suite, nose. Not every issue can be fixed with a cursory glance at the datalinting log: there may be customers relying on a data defect, or additional thought may be needed before applying a data patch, so it may not be practical to remove all lint from the system in a single dev cycle. One of our main priorities is to allow our tests to give warnings as well as skip, pass, and fail. An example would be to use a warning option for each unit test specifying when to warn on failure:

def test_orphan_records(self):
    # tolerable for now: report a warning instead of failing the run
    self.warnOn(...)

def test_null_geometry(self):
    # must be fixed before release: fail as usual
    self.assertEqual(...)
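
One way warnOn could work (a sketch only, under our own assumptions; nose does not provide this out of the box) is a small mixin that runs an ordinary assertion, catches the AssertionError, and records it as a warning so the test stays green. Surfacing a WARNING status in the summary, as shown below, would additionally require a small nose plugin, which is beyond this sketch.

import warnings

class WarnOnMixin(object):
    # Hypothetical helper: downgrade selected assertions to warnings.
    def warnOn(self, assertion, *args, **kwargs):
        # Run an ordinary assertion method such as self.assertEqual;
        # on failure, emit a warning rather than failing the test.
        try:
            assertion(*args, **kwargs)
        except AssertionError as e:
            warnings.warn('datalint warning: %s' % e)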

With more tolerant log output:

Check a set of versions based on their dataset id ... WARNING
Check if geometrytable_set is over-inclusive ... ok
Check all attribute records for a valid geometry version id ... FAIL
Check all attribute records for a version id ... ok
Check attribute records with geometry tables not in geometrytable_set ... ok
Check all versions are being used in a geometry or attribute table ... ok

======================================================================
WARNING: Check a set of versions based on their dataset id
----------------------------------------------------------------------
AssertionError:
2 overlapping versions found.
...
...
...

======================================================================
FAIL: Check all attribute records for a valid geometry version id
----------------------------------------------------------------------
AssertionError:
1 attribute tables with null geometry version ids found
...
...
...

----------------------------------------------------------------------
Ran 6 tests in 837.382s

PASSED (passing=4)  67%
WARN   (warning=1)  17%
FAILED (failures=1) 17%

Check it out! Shiny New Website

Check out our new website!

We’ve recently updated the look of our corporate and developer websites. Oooh la la, right?

Custom styling and graphics were done in house by our front-end wizard Adam Van Lente. We’ve also achieved our stretch goal of making all of the content on our corporate site fully compatible with tablet and mobile devices.

Our new corporate website highlights our new offerings right alongside the achievements that have brought Urban Mapping ten years of success delivering geospatial content, performance, and analysis.

  • Mapfluence is the engine that drives our data catalog, spatial querying and mapping platforms, providing sophisticated map overlays, targeted spatial queries, geocoding, and more.
  • Business Intelligence is what we deliver through solutions that integrate Mapfluence into existing software produced by companies like Tableau and CoStar.
  • Marketing Automation is a new offering that targets the unique needs of marketers and others who benefit from the ability to append information about leads based on their demographic profile and geographic location.
  • Adtech & Web Publishing allows advertisers and web application developers to easily integrate hyperlocal context into the user experience.
  • Data Sourcing continues to be among our strongest offerings, as we continually turn our researchers toward new challenges in developing custom geographic data sets.
  • Neighborhoods is the data for which we are most famous; we continue to update our worldwide database with new neighborhoods every quarter, and we offer both licensing options and on-demand access through our APIs.

Take a look and tell us what you think!

Urban Mapping Q1 2014 Neighborhood Boundary Update

We’re pleased to announce our latest update to neighborhood boundaries, the product that put us on the map (!) back in 2006. This release features increased coverage in over 370 cities across 7 new countries, bringing global coverage to more than 127,000 neighborhoods across 40 countries. Our sights are now set on our Q2 update, which will include new attributes and expanded coverage.

More, better neighborhoods

Good things come during the holidays, like updated neighborhood data from Urban Mapping! This quarter we continued to expand coverage to include over 120,000 neighborhoods in 34 countries. Equally important is responding to evolving user needs: we’ve expanded our Local Relevance attribute from a boolean to an indexed value, and new this quarter is an Editorial Sensitivity attribute. These new attributes mean developers have even more flexibility in creating applications tailored to customer needs.

Local Relevance is an indexed value based on a variety of metrics of social/cultural significance and popularity. Using a mix of log data and social media, Urban Mapping created an attribute that measures importance, allowing developers to whittle down a comprehensive set of neighborhoods to something more manageable. For example, in the five boroughs of New York City our database counts over 300 neighborhoods. For one kind of application, cultural and historical detail (think Alphabet City and Hell’s Kitchen) will be critical. For a mobile-based social networking application, dividing NYC into (say) 30 of the most important neighborhoods could be sufficient.

Editorial Sensitivity can be used to address scenarios where a publisher might want to “tone down” their property to cater to a given audience. As great as we think the “Funk Zone” in Santa Barbara is, some publishers might feel differently!

If you are interested in learning more about our neighborhood boundary database, help yourself to an evaluation!

[Figure: Neighborhood coverage, Q4 2013]

Urban Mapping Adtech and Web Publishing Solutions

Holidays bring good things of all kinds, including product announcements! This week we are live at The Kelsey Group’s Leading in Local conference in San Francisco, announcing several online advertising products that provide increased geographic context, a better user experience, and increased monetization. What are these, you may ask?

  • GeoMods – a geotargeting tool that helps online advertisers increase the effectiveness of their campaigns. With access to Urban Mapping’s geographic warehouse of on-demand data, GeoMods provides a long tail of geo-expansion terms, so advertisers and agencies can perform truly hyperlocal geotargeting with accuracy far greater than traditional IP geotargeting. Try the demo!

  • Neighborhoods & Transit – Leverage Mapfluence to incorporate neighborhoods and transit data, a key one-two punch of local!  The usage based model allows publishers to tag business listings, housing or other content to provide increased user relevance. Developers can show neighborhoods and transit information on a map or index their content to create an enriched hyperlocal experience.

  • GeoLookup – given the location of a mobile device, how can a user understand where they are? Maps are one good way, but often context is sufficient. Through reverse geocoding, developers can drill down and return place names associated with a location. This is especially valuable for social and mobile applications, where GeoLookup can serve as a filter to help protect user privacy.

To learn more, check out urbanmapping.com/adtech and register as a developer today!