It’s Release Day! Urban Mapping Q3 Neighborhood Boundary Update

We’re excited to announce our Q3 neighborhood boundaries update. Our focus remains on expanding international coverage, and our boundaries dataset now covers approximately 129,000 neighborhoods in 49 countries. Our roadmap for the next four releases will continue this theme of expanded international coverage as research is completed in 10 countries.

A Bit of git

To merge or to rebase, that is the question. I prefer rebase. The linear history preserves the utility of git bisect. Also, rebase forces developers to resolve their conflicts where they are introduced instead of in one possibly big and unreadable merge commit at the end. Of course, rebase can be tricky and cause some headaches. But this is not the next installment of rebase vs. merge. Here I describe the simple rebase-friendly branching model that we use at Urban Mapping. The process is inspired in part by gitflow, which is a worthy read and beautifully presented. And of course gitflow can work with rebase to some extent. But that beautiful gitflow documentation is written with merge in mind. So to illustrate when to rebase, I enlisted the usual help of Bob and Alice, and their release manager Reggie.

Step 1: Release 1.0

release version 1.0

The master branch is where the latest developments go.  It’s the end of the 1.0 development cycle. A code freeze is declared and intensive testing begins. Once the code stabilizes, Reggie tags the release and pushes it:

git tag 1.0
git push origin 1.0

The release automation scripts can now deploy version 1.0 with something like this:

git clone git@github.com:urbanmapping/myproject.git
cd myproject
git checkout -b release-1.0 1.0

Hooray.  1.0 is out the door. Now on to the 2.0 development cycle.

Step 2: Create new feature branches

create feature branches

Alice is working on a new green feature, and Bob is working on a new purple feature. They start by creating appropriate feature branches locally from the master branch.  Here’s how Alice does it for the green feature:

cd /path/to/myproject
git fetch origin
git checkout -b green origin/master
[work, git commit, git rebase -i, git commit --amend, etc.]
git push origin green:green

That fetch grabs anything new from the upstream repository, including the latest master branch. Then checkout creates a new green branch in Alice’s local tree and checks it out. Now actual work can ensue. Finally, the push creates the new green feature branch in the origin tree. The point of this last step is to back up the branch, and possibly to share the green branch with other developers. Bob uses the same commands to create his purple feature branch.

Step 3: Push features to master

push green feature to master

After the developers are satisfied with their features, they push them to master. Alice goes first:

git fetch origin
git rebase origin/master
git push origin green:master

By the way, this all happens with the green branch checked out. As in the step above, fetch ensures that origin/master is up to date with the upstream. rebase sets aside the green patches, brings in any new patches from the master branch, then puts the green patches back on top. Of course, at this moment master does not have any new patches, so rebase simply reports "Current branch green is up to date." Finally, push pushes the green patches to master.

Bob goes next. He executes the same commands as Alice. However, because the green patches are already in master, the fetch grabs them and stores them locally. The rebase then replays Bob’s purple patches on top of the green ones, resulting in the state shown below:

rebase feature and push to master

If any of Bob’s purple patches conflict with any of Alice’s green patches, Bob will be prompted to resolve the conflicts before he can push to master. If Bob attempted to push without first rebasing, his push would fail with a non-fast-forward error. He could, of course, push with --force, but that would blow away Alice’s green patches. So he shouldn’t do that.
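
Concretely, Bob’s session in that conflict case might look something like this sketch:

git fetch origin
git rebase origin/master
[resolve the conflicts rebase reports, then git add the fixed files]
git rebase --continue
git push origin purple:master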

Step 4: Release 2.0

tag and release 2.0

Now let’s suppose that this represents all the work for the next release. Reggie tags the release and deploys it as described above.  Hooray.  Another step forward.  The green and purple feature branches can now be deleted.
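
Cleanup is a couple of commands; each developer deletes their own local branch, and anyone can remove the backup copies from the origin tree. For the green branch it looks like this (purple is the same):

git checkout master
git pull origin master              # local master now contains the merged features
git branch -d green                 # delete the local branch
git push origin --delete green      # delete the backup branch from the origin tree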

Step 5: Create next round of features

push orange feature to master

On to the next release cycle!  Alice starts up an orange feature branch as described above, and pushes to master to make the next step forward.  Meanwhile, a critical bug is discovered in production and Bob is tasked to fix it.

Step 6: Prepare release 2.1, a hotfix on 2.0

prepare hotfix

Bob must fix the production bug. But where should he push his code? He could fix the bug at the end of master and then push, but Alice’s orange feature has not really been tested and reviewed yet. And Reggie will not accept that in production! Instead, Bob creates a local release branch based on the 2.0 tag and prepares the hotfix:

git checkout -b release-2.1 2.0
[work, git commit, git rebase -i, git commit --amend, etc.]

The checkout creates a local branch called release-2.1 where the hotfix patch will be created. After getting the red patch in place just as he likes it, Bob tags the hotfix appropriately and pushes it:

git tag 2.1
git push origin 2.1

The tag command creates the tag for the new version, and the push makes the fix available in the origin tree. Note that Bob never actually pushes this release branch; he only pushes the tag. Now Reggie’s release scripts can grab the new version in the same way as before:

git clone git@github.com:urbanmapping/myproject.git
cd myproject
git checkout -b release-2.1 2.1

Step 7: Push hotfix patches to master

rebase and push hotfix to master

Finally, Bob’s 2.1 patch must make it to master somehow. To achieve this, he rebases the release branch onto master and pushes it, just like any other feature branch.

git fetch origin
git rebase origin/master
git push origin release-2.1:master

Discussion

When first discussing this step with colleagues, I was skeptical about having hotfix commits floating around in the origin tree that are not reachable by a branch. We considered some alternatives. One would be to tolerate git merge for hotfixes. After all, hotfixes are generally made up of a minimal number of small patches. If the patches are many and/or large, then the bug they are meant to fix is probably best addressed by a downgrade instead of a hotfix. This mitigates most of my resistance to merge: if the hotfix patches cause a merge conflict, that conflict is probably small and easy to ascribe to a specific patch. So this is one alternative.
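
As a sketch of that merge-tolerant alternative, Bob would merge the hotfix tag into master instead of rebasing; these commands are illustrative rather than part of our actual process:

git checkout master
git pull origin master
git merge 2.1                  # merge the hotfix tag; any conflict should be small
git push origin master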

Another alternative that we considered was to host permanent release stabilization branches in the origin tree. Under this model, hotfixes would go on the release stabilization branch and be cherry-picked over to the tip of master. For a project that delivers supported software as opposed to a service, this may be a sensible alternative because it provides a clean place to succinctly reproduce and resolve customer issues.
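
Under that model the flow might look something like the sketch below; the stabilization branch name is hypothetical:

git checkout -b release-2.0-stable origin/release-2.0-stable
[fix the bug, git commit, tag and release as usual]
git push origin release-2.0-stable
git checkout master
git pull origin master
git cherry-pick [sha of the hotfix commit]    # copy the hotfix onto the tip of master
git push origin master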

Datalinting

Data collection and curation is an imprecise process that can lead to the unintentional introduction of errors at any stage. Since we take data quality seriously at Urban Mapping, we needed an automated, repeatable method for identifying these errors in our vast data warehouse. Bell Labs had similar issues in the 1970s with its library of C code. This prompted Stephen C. Johnson to coin the term lint to describe these unintentional defects, and he was subsequently compelled to create a program that examined source code and reported probable issues and inconsistencies. This launched a bit of a trend: any programming language achieving a certain level of popularity now grows a corresponding lint program. Here at Urban Mapping we started growing a datalint program.

How do we introduce data lint? Well, we occasionally experiment with changes to our schemas, and it may take time for automated data loading processes to be trained to respect the new schema. Depending on the nature of the schema change, these processes may deposit some lint in the database. Sometimes we manually edit data and metadata. And humans are, well, human and may miss things like orphan data categories, duplicate records, or invalid date ranges. Sometimes we encounter undesirable but tolerable inconsistencies across related data from different vendors. For example, statistical data for the fifty United States may or may not contain records for Puerto Rico.

Our first impulse when we identify some lint lying around is to just clean it up and move on. But a nagging question emerges: is there other data present with this same problem? This occasionally motivates a deeper, perhaps automated inspection of the database. But after the next schema migration, data load, or manual edit session we have to ask the nagging question all over again. To prevent this, we’ve added a “data linting” step to our database release procedure. The datalint tests are implemented as plain old unit test cases, but instead of testing code, they walk the database and test the data; a sketch of one such test appears after the workflow graphic below. Because they use the same Python/SQLAlchemy environment that we use for batch data loading and the Mapfluence API, the tests can be implemented on the various layers of abstraction that we already maintain. This prevents the phenomenon of having a bunch of different copy-paste scripts lying around and rotting. Our process is captured by the graphic below: the data team runs the datalint, new tests are written if necessary (and fail), the issue is fixed, datalinting is repeated (successfully), and stakeholders are notified of the correction.

High-level datalinting workflow
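
To make the idea concrete, here is a minimal sketch of what one of these datalint cases might look like, in the spirit of the overlapping-versions check in the sample log below. The connection string, table, and column names are illustrative placeholders, not our actual schema:

import unittest
from collections import defaultdict

from sqlalchemy import create_engine, text


class TestVersionLint(unittest.TestCase):

    def setUp(self):
        # Placeholder warehouse URL; in practice this comes from the same
        # configuration used by the batch loaders and the Mapfluence API.
        self.conn = create_engine("postgresql:///warehouse").connect()

    def tearDown(self):
        self.conn.close()

    def test_no_overlapping_versions(self):
        # Walk each dataset's version ranges and flag any that overlap.
        rows = self.conn.execute(text(
            "SELECT dataset_id, id, valid_from, valid_to FROM versions"))
        spans = defaultdict(list)
        for dataset_id, version_id, start, end in rows:
            spans[dataset_id].append((start, end, version_id))
        flagged = defaultdict(list)
        for dataset_id, ranges in spans.items():
            ranges.sort(key=lambda r: r[0])
            for (s1, e1, v1), (s2, e2, v2) in zip(ranges, ranges[1:]):
                # A later range starting before an earlier one ends (or an
                # earlier range that is open-ended) means the versions overlap.
                if e1 is None or s2 < e1:
                    flagged[dataset_id].append((v1, v2))
        self.assertFalse(flagged,
                         "%d datasets have overlapping versions: %s"
                         % (len(flagged), dict(flagged)))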

The datalint output provides productive feedback to our data wizards. At the low level, the comprehensive log file pinpoints the potential issues within the data. At the high level, the datalint test suite reports the percentage of lint tests that are passing, and each datalint test reports the percentage of passing groups of data. Below is a sample output log. This offers quantitative insight into the question: “how healthy is the data?” Datalinting has proven to be an invaluable practice for the Mapfluence team. Less lint, more awesome.

Check a set of versions based on their dataset id ... FAIL
Check if geometrytable_set is over-inclusive ... ok
Check all attribute records for a valid geometry version id ... FAIL
Check all attribute records for a version id ... ok
Check attribute records with geometry tables not in geometrytable_set ... ok
Check all versions are being used in a geometry or attribute table ... ok

======================================================================
FAIL: Check a set of versions based on their dataset id
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/mapfluence-server/datalint/test_version.py", line 37, in test_dataset_version_overlap
    ['   %s' % x for x in v]) for k, v in flagged.iteritems()]))
AssertionError:
2 overlapping versions found.
umi.us_census_cd has overlapping versions.
   None:1244 => 2003-01-03 05:00:00 to 2013-01-01 00:00:00
   None:1247 => 2009-01-01 05:00:00 to 2013-01-03 00:00:00
   None:1245 => 2013-01-01 00:00:00 to None
   None:1246 => 2013-01-03 00:00:00 to None
umi.us_coli has overlapping versions.
   None:1778 => 2009-04-01 05:00:00 to 2009-07-01 05:00:00
   None:1777 => 2009-07-01 05:00:00 to 2010-04-01 05:00:00
   None:1781 => 2010-04-01 04:00:00 to 2010-07-01 04:00:00

======================================================================
FAIL: Check all attribute records for a valid geometry version id
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/mapfluence-server/datalint/test_version.py", line 100, in test_null_geometry_version
    for k, v in flagged.iteritems()]))
AssertionError:
1 attribute tables with null geometry version ids found

Null geometry version ids found for umi.unemployment.attributes. Available versions are:
   None:1687 => 1976-01-01 00:00:00 to 1976-02-01 00:00:00
   None:1582 => 1976-02-01 00:00:00 to 1976-03-01 00:00:00
   ...
   ...
   ...
   None:1429 => 2013-11-01 00:00:00 to 2013-12-01 00:00:00
   None:1371 => 2013-12-01 00:00:00 to 2014-01-01 00:00:00

----------------------------------------------------------------------
Ran 6 tests in 837.382s

FAILED (failures=2)

As datalinting grows into a mature part of our development cycle, potential improvements will surface. One key enhancement we’ve identified is adding varying severity levels rather than a binary pass/fail rating within our Python testing suite, nose. Not every issue can be fixed with a cursory glance at the datalinting log: there may be customers relying on a data defect, or additional thought may be needed before applying a data patch, so it may not be practical to remove all lint from the system in a single dev cycle. One of our main priorities is to allow our tests to give warnings as well as skip, pass, and fail. An example would be a warning option for each unit test specifying when to warn on failure:

def test_orphan_records(self):
    self.warnOn(...)

def test_null_geometry(self):
    self.assertEqual(...)

With more tolerant log output:

Check a set of versions based on their dataset id ... WARNING
Check if geometrytable_set is over-inclusive ... ok
Check all attribute records for a valid geometry version id ... FAIL
Check all attribute records for a version id ... ok
Check attribute records with geometry tables not in geometrytable_set ... ok
Check all versions are being used in a geometry or attribute table ... ok

======================================================================
WARNING: Check a set of versions based on their dataset id
----------------------------------------------------------------------
AssertionError:
2 overlapping versions found.
...
...
...

======================================================================
FAIL: Check all attribute records for a valid geometry version id
----------------------------------------------------------------------
AssertionError:
1 attribute tables with null geometry version ids found
...
...
...

----------------------------------------------------------------------
Ran 6 tests in 837.382s

PASSED (passing=4)  67%
WARN   (warning=1)  17%
FAILED (failures=1) 17%
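
As a sketch of how such a helper might hang together: the warnOn signature and the use of Python’s warnings module are assumptions, and a nose plugin (not shown) would still be needed to surface a separate WARNING count in the summary.

import unittest
import warnings


class DatalintTestCase(unittest.TestCase):

    def warnOn(self, condition, message):
        # Hypothetical helper: when the condition flags a tolerable defect,
        # record a warning instead of failing the test outright.
        if condition:
            warnings.warn("datalint warning: %s" % message)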

Check it out! Shiny New Website

Check out our new website!

We’ve recently updated the look of our corporate and developer websites. Oooh la la, right?

Custom styling and graphics were done in-house by our front-end wizard Adam Van Lente. We’ve also achieved our stretch goal of making all of the content on our corporate site fully compatible with tablet and mobile devices.

Our new corporate website highlights our new offerings right alongside the achievements that have brought Urban Mapping ten years of success delivering geospatial content, performance, and analysis.

  • Mapfluence is the engine that drives our data catalog, spatial querying and mapping platforms, providing sophisticated map overlays, targeted spatial queries, geocoding, and more.
  • Business Intelligence is what we deliver through solutions that integrate Mapfluence into existing software produced by companies like Tableau and CoStar.
  • Marketing Automation is a new offering that targets the unique needs of marketers and others who benefit from the ability to append information about leads based on their demographic profile and geographic location.
  • Adtech & Web Publishing allows advertisers and web application developers to easily integrate hyperlocal context into the user experience.
  • Data Sourcing continues to be among our strong offerings, as we continually turn our researchers toward new challenges in developing custom geographic data sets.
  • Neighborhoods is the data for which we are most famous. We continue to update our worldwide database with new neighborhoods every quarter, and offer both licensing options and on-demand access through our APIs.

Take a look and tell us what you think!

Urban Mapping Q1 2014 Neighborhood Boundary Update

We’re pleased to announce our latest update to neighborhood boundaries, the product that put us on the map (!) in 2006. This release features increased coverage in over 370 cities across 7 new countries. This brings global coverage to more than 127,000 neighborhoods across 40 countries. Our sights are now set on our Q2 update, which will include new attributes and expanded coverage.

More, better neighborhoods

Good things come during the holidays, like updated neighborhood data from Urban Mapping! This quarter we continued to expand coverage to include over 120,000 neighborhoods in 34 countries. Equally important is responding to evolving user needs. We’ve expanded our Local Relevance attribute from a boolean to an indexed value, and new this quarter is an Editorial Sensitivity attribute. These new attributes mean developers have even more flexibility in creating applications tailored to customer needs.

Local Relevance is an indexed value based on a variety of metrics of social/cultural significance and popularity. Using a mix of log data and social media, Urban Mapping created an attribute that measures importance, allowing developers to whittle down a comprehensive set of neighborhoods to something more manageable. For example, in the five boroughs of New York City our database counts over 300 neighborhoods. For one kind of application, cultural and historical detail (think Alphabet City and Hell’s Kitchen) will be critical. For a mobile-based social networking application, dividing NYC into (say) 30 of the most important neighborhoods could be sufficient.

Editorial Sensitivity can be used to address scenarios where a publisher might want to “tone down” their property to cater to a given audience. As great as we think the “Funk Zone” in Santa Monica is, some publishers might feel differently!

If you are interested in learning more about our neighborhood boundary database, help yourself to an evaluation!

Neighborhood coverage Q4Y13

Urban Mapping Adtech and Web Publishing Solutions

Holidays bring good things of all kinds, including product announcements! This week we are live at The Kelsey Group’s Leading in Local conference in San Francisco. We’re announcing several online advertising products that provide increased geographic context, a better user experience, and increased monetization. What are these, you may ask?

  • GeoMods – a geotargeting tool that helps online advertisers increase effectiveness of campaigns. With access to Urban Mapping’s geographic warehouse of on-demand data, GeoMods provides a long tail of geo-expansion terms. With GeoMods, advertisers and agencies can perform truly hyperlocal geotargeting with accuracy far greater than traditional IP geotargeting. Try the demo!

  • Neighborhoods & Transit – Leverage Mapfluence to incorporate neighborhoods and transit data, a key one-two punch of local! The usage-based model allows publishers to tag business listings, housing, or other content to provide increased user relevance. Developers can show neighborhoods and transit information on a map or index their content to create an enriched hyperlocal experience.

  • GeoLookup – Given the location of a mobile device, how can a user understand that location? Maps are a good way, but often context is sufficient. Through reverse geocoding, developers can drill down and return the place names associated with a location. This is especially valuable for social/mobile applications, where GeoLookup can serve as a filter to enforce user privacy.

To learn more, check out urbanmapping.com/adtech and register as a developer today!