GitHub

We haven’t quite figured out the best way of both distributing the Who’s On First data and accepting corrections or suggestions from the community. Even though the nice people at GitHub continue to do excellent work at making Git easier for a broader population to use, the reality remains that Git is a significant barrier to participation for many people.

Absent a more formal decision about an alternative, GitHub at least allows us to point in the general direction of:

  • An open and readily distributed dataset that people can download and work with.
  • A way for people to contribute corrections (and general nuance) about a place.
  • A way for us to be able to do everything above while still assuring us a measure of authority around the assertions we make about the data.
  • Also a way for us to think about how and where we store an audit trail (of sorts) for updates to a place.

There are still some very real problems working with Who's On First data in Git repositories, so it's possible that we will stop using Git entirely.

Repo Naming Conventions

whosonfirst-data versus whosonfirst-data-SOMETHING repositories

There is a lot of data in Who's On First, more than can practically fit in a single GitHub repository. Someday it may be possible but today it is not. To account for this fact Who's On First data has been separated into a number of different GitHub repositories organized by placetype and region.

The naming convention for repositories, at its most granular, is as follows:


whosonfirst-data + "-" + WHOSONFIRST_PLACETYPE + "-" + WHOSONFIRST_COUNTRY + "-" + WHOSONFIRST_SUBDIVISION

For example:

The first thing to note is that not all repositories are as granular as the rules described above. Wherever feasible we try to bundle records with as little granularity as possible. For example, postalcodes are grouped by country, as are venues, unless there are so many of them (like in the USA) that it is not practical to keep them in a single parent repository.
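
The naming rule and its less granular variants can be sketched in a few lines of Python. This is only an illustration, not part of any Who's On First tooling: the function name `repo_name` is hypothetical, and lowercasing the country and subdivision codes is an assumption rather than something stated above.

```python
def repo_name(placetype=None, country=None, subdivision=None):
    """Join the parts of a repository name, from least to most granular.

    Hypothetical helper: each part is optional, but a more granular
    part may only be given when the less granular ones are present.
    """
    if subdivision and not country:
        raise ValueError("a subdivision requires a country")
    if country and not placetype:
        raise ValueError("a country requires a placetype")

    parts = ["whosonfirst-data"]
    for part in (placetype, country, subdivision):
        if part:
            # Assumption: name parts are written in lowercase.
            parts.append(part.lower())
    return "-".join(parts)


# Illustrative values only; check the wof:repo property of a record
# for the repository it actually lives in.
print(repo_name())                        # administrative placetypes
print(repo_name("postalcode", "US"))      # grouped by country
print(repo_name("venue", "US", "CA"))     # grouped by country and subdivision
```

Leaving out the trailing arguments yields the less granular names, which mirrors the way records are bundled at the coarsest level that is still practical.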

If a repository accumulates so much data that it is no longer practical to keep everything in one place then it may be subdivided into a number of child repositories. Venues are a good example of this.

We try to maintain a separate parent repository for things that have been broken out into multiple child repositories. For example, there is a whosonfirst-data-postalcode repository that contains no data but instead points to all the repositories that do have postalcode data. We do the same for venues in the USA. This practice is still in its early stages so we apologize in advance if it's bumpy or incomplete.

The whosonfirst-data repository is the obvious exception (or perfect example, depending on how you look at it) to the scenario described above. This repository contains all administrative placetypes (everything from continents down to microhoods, inclusive) for the entire world. It is possible to imagine that the sum total of all the neighbourhoods in the world will eventually require putting them in a separate repository, but we are going to hold off doing that for as long as we can.

Who's On First records should always have a wof:repo property indicating the repository to which they belong. If they don't that's a bug.
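
A quick way to look for that bug is to check each record's properties for a non-empty wof:repo value. The following is a minimal sketch, assuming records are GeoJSON features with a "properties" dictionary as in the example record shown in this document; the helper name `is_missing_repo` and the sample records are hypothetical.

```python
import json


def is_missing_repo(feature):
    """Return True if a record lacks a usable wof:repo property."""
    props = feature.get("properties") or {}
    repo = props.get("wof:repo")
    # Treat absent, non-string and blank values as missing.
    return not (isinstance(repo, str) and repo.strip())


# Hypothetical records, trimmed down to the properties that matter here.
good = json.loads('{"id": 1, "properties": {"wof:repo": "whosonfirst-data"}}')
bad = json.loads('{"id": 2, "properties": {"edtf:inception": "u"}}')

for record in (good, bad):
    if is_missing_repo(record):
        print("record", record["id"], "has no wof:repo property")
```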

Git and Large Files

We have started using git-lfs for managing large files in the whosonfirst-data repository. For example, the record for New Zealand which contains a very very very very very detailed coastline geometry exceeds the 100MB filesize limit for any individual file on GitHub.

A full discussion of how to use git-lfs is outside the scope of this document but you can see the current list of files being managed by invoking the git lfs ls-files command, like this:


$> cd /usr/local/mapzen/whosonfirst-data
$> git lfs ls-files
65ccc4825e * data/856/333/45/85633345.geojson

When you clone the whosonfirst-data repo the files (managed by git-lfs) only contain metadata, like this:


$> cat data/856/333/45/85633345.geojson
version https://git-lfs.github.com/spec/v1
oid sha256:65ccc4825e65c30f00fcebf1f3d57f4385f18a47e3c5e524114a67050186ae48
size 71879893
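
Code that reads records straight from a clone may want to notice when it has been handed one of these pointer files rather than GeoJSON. The sketch below relies only on the three "key value" lines shown above; the function name `parse_lfs_pointer` is hypothetical and this is not a complete implementation of the git-lfs pointer specification.

```python
def parse_lfs_pointer(text):
    """Return the pointer fields as a dict, or None if text is not a pointer."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("version https://git-lfs.github.com/spec/"):
        return None

    fields = {}
    for line in lines:
        # Each pointer line is a key, a single space, then the value.
        key, sep, value = line.partition(" ")
        if sep:
            fields[key] = value

    if "oid" not in fields or "size" not in fields:
        return None
    return {
        "version": fields["version"],
        "oid": fields["oid"],
        "size": int(fields["size"]),  # size of the real file, in bytes
    }


pointer = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:65ccc4825e65c30f00fcebf1f3d57f4385f18a47e3c5e524114a67050186ae48\n"
    "size 71879893\n"
)
info = parse_lfs_pointer(pointer)
print(info["size"])
```

A return value of None means the text does not look like a pointer, in which case it can be handed to a JSON parser as usual.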

In order to retrieve the contents of the file itself you will need to run git lfs pull, like this:


$> git lfs pull
Fetching master
(1 of 1 files) 68.54 MB / 68.55 MB
$> cat data/856/333/45/85633345.geojson
{
"id": 85633345,
"type": "Feature",
"properties": {
"edtf:cessation":"u",
"edtf:inception":"u",
"geom:area":29.187792061074827,
"geom:bbox":"166.426148,-47.289992,178.577244,-33

Depending on when you read this we may have already pre-emptively moved all the records for countries into git-lfs. If the goal is to have detailed ground-truth geometries for every place in the Who's On First gazetteer, it stands to reason that most if not all countries will bump up against GitHub's existing file size limits.