State of the Gazetteer in 2023 This is a blog post by nvkelso that was published on Jun 07, 2023 and tagged whosonfirst, wof, data and analysis
The Who’s On First (WOF) gazetteer is a big list of places, each with a stable identifier, and a set of common and optional descriptive properties about that location. Like any product, Who’s On First is made by people, and we’ve collectively spent about “20 years” building WOF as an original work and as an aggregation of many other open data projects.
Since we launched in 2015, the project has grown in coverage, complexity, and supported applications. In this this post I will summarize Who’s On First’s key advantages, offer a comparative analysis of WOF and other open gazetteers, quantify our global coverage by placetype, offer score cards by country, dive into name localization, look at internationalization through the lens of disputed territories, and quantify geometry types and sources of those polygon and points, hold hands with and thank our sources, and invite collaboration.
WOF gazetteer’s focus on localities and unique identifiers sets us apart from many other projects. We choose to approach geography at a personal level – starting with the locality
(or populated place) where people play, study, work, and sleep – and country
for when people need a passport to cross a frontier and experience something new and exciting.
Between those “top” and “bottom” placetypes in the administrative hierarchy, countries choose to order their geography in many varied and nested subdivisions. Who’s On First doesn’t judge, but we do align them into common region
(1st order subdivisions, tracked by the ISO), “county” (2nd order subdivisions), and localadmin
(municipalities) placetypes. Most localities aren’t legally incorporated, and may or may not be coincidental with a parent localadmin. Many urban localities are further subdivided into borough
and neighbourhood
areas. There are even more optional placetypes to allow for intermediate levels and other edge cases.
This common structure allows us to track all the places in the world. Or at least those places we’ve learned about thus far. We hope it provides you an easier mental model to integrate the data into your own application.
Eager to play with the data yourself? Jump to the downloads section below…
Who’s On First gazetteer at a glance
- A single source of truth across a minimalist common set of placetypes, spanning the entire globe, and supplemented as-needed on a regional level
- Global coverage for
5M
(million) administrative places, including4.5M
localities - More than
25M
additional places are available, including:3.9M
postal codes,7K
(thousand) constituencies, and21M
venues 452K
features have detailed polygons, with complete coverage at country, region, and county placetypes and exhaustive coverage for localadmin and large localities in major industrialized economies, with exhaustive coverage as point geometries for remaining features. We also store447K
alternate geometries.- Names are localized into hundreds of languages, using internet standard language codes
- Population values,
min_zoom
ranking, and polygon label centroids enable sophisticated map designs with beautiful and informative cartography - Internationalized with disputed territory boundaries for global audiences and compliance with local regulations
- Open license (CC-BY attribution) means you’re free to use the data for commercial purposes, including derivative rights, as long as you credit the project and our sources
- Includes original work and data aggregated from 360+ authoritative sources. Our source data licenses have been vetted by multiple corporate attorneys at large, publicly traded companies.
- Holds hands with 60+ datasets via linked unique identifiers to allow crosswalk across
6.5M
feature-level concordances. - Spelunker web app for viewing formatted data with a map (browse New York city record)
- Write Field web editor for quick property edits (edit New York city record) with human review
- View raw data on Github (view New York city example)
- Downloadable SQLite and Shapefile distributions from geocode.earth (thanks!)
- Collaborative project started in 2015 and remains active, including contributions from Snapchat
- The Who’s On First project is hosted by the Linux Foundation
Comparative Analysis of Open Gazetteers
We recognize our WOF gazetteer is one of many open gazetteers, let’s look at five, including Who’s On First:
- Who’s On First: Open data project with both administrative and non-administrative data like postal codes, constituencies, and venues. The CC-BY license allows commercial use. Includes polygons, fully localized names, disputed territory handling for reverse geocoding, navigation centroids, and unique IDs to capture value from metrics. The data is used in applications with over 300 million monthly active users.
- GeoNames: Commercial project across a range of administrative and non-administrative placetypes including postal codes and venues. Premium polygons can be licensed for a fee. Great locality coverage!
- geoBoundaries: Academic project focused exclusively on administrative hierarchy. A complicated license that mixes in ODbL and other non-commercial data sources with CC-BY data. Names are not localized, no disputed territories, no label centroids, and no unique IDs.
- GADM: Academic project focused exclusively on administrative hierarchy. Non-commercial use license although a custom commercial license is available on request. Names include latin, variants, and local script but not in a machine readable way. No disputed territories, no label centroids, and no unique IDs.
- All The Places: Open data project with CC-0 license focused exclusively on venues.
LEGEND for tables below
- ✅ Great coverage and usability
- ☑️ Basic coverage or usability
- ❌ No coverage or usability (unless noted)
- Unique IDs must also be stable
- L10n = Localized names
- i18n = Internationalized features
- Links = Concordance 🔗 with other gazetteer projects
Gazetteer data competitive analysis table
Project | Started | License | Admin records | Locality subtotal | Others | Unique IDs | L10n | i18n | Admin polygons | Admin points | Links |
---|---|---|---|---|---|---|---|---|---|---|---|
Who’s On First | 2015 | CC-BY | 5.0M | 4.5M | 25.0M | ✅ | ✅ | ️️️☑️ | ✅ 0.4M | ✅ 4.6M | ✅ |
GeoNames | 2005 | CC-BY | 5.1M | 4.6M | 5.8M | ✅ | ✅ | ❌ | ❌ $$$ 💰 0.5M | ✅ 4.6M | ✅ |
geoBoundaries | 2017 | * ODbL |
1.0M | 0.0M | 0.0M | ❌ | ❌ | ❌ | ✅ 1.0M | ❌ | ❌ |
GADM | 2009 | CC-BY-NC | 0.4M | 0.0M | 0.0M | ❌ | ☑️ | ❌ | ✅ 0.4M | ❌ | ❌ |
All The Places | 2017 | CC-0 | 0.0M | 0.0M | 2.5M | ☑️ | ❌ | ❌ | ❌ | ❌ | ☑️ |
*
WARNING: geoBoundaries contains multiple sources, many of which are ODbL and some of which are CC-BY. Discretion is advised!
DISCLAIMER: The above table represents a good faith effort to compare open gazetteer projects with nominally CC-BY and more permissive licenses in May 2023. Each project has its own motivations, use cases, and update frequencies. Major respect for everyone involved (and to OpenStreetMap and their ODbL effort).
Gazetteer applications competitive analysis table
Project | Search | Reverse Geocoding | Map Display | Routing | Metrics logging | Venues | Postcodes | Constituencies | Tooling |
---|---|---|---|---|---|---|---|---|---|
Who’s On First | ✅ | ✅ | ✅ | ✅ | ✅ | ☑ ️21M | ✅ 3.9M | ☑ ️7K | Web, CLI & API * |
GeoNames | ✅ | ❌ $$$ | ☑️ | ✅ | ✅ | ☑️ 1.1M | ✅ 1.5M | ❌ | Web & API |
geoBoundaries | ☑️ | ☑️ | ☑️ | ❌ | ❌ | ❌ | ❌ | ❌ | Web & API |
GADM | ☑️ | ☑️ | ☑️ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
All The Places | ✅ | ❌ | ✅ | ✅ | ☑️ | ✅ 2.5M | ❌ | ❌ | Web |
*
While Who’s On First does includes an optional API it is not currently hosted online.
NOTE: geoBoundaries has a great web viewer to compare data for countries and their subdivisions across projects – including those here in WOF and from the lower geometry precision Natural Earth.
Who’s On First global coverage deep dive
Who’s On First has 5M
administrative places, including 4.5M
localities. Detailed breakdowns by placetype, geometry type, country & etc are available farther down in the “By Geometry Type” section.
placetype | count |
---|---|
locality | 4,498,136 |
neighbourhood | 233,712 |
localadmin | 203,513 |
county | 47,431 |
region | 5,139 |
country | 232 |
disputed | 104 |
dependency | 43 |
(other admin) | 29,934 |
TOTAL | 5,018,244 |
Additional coverage is available in the WOF gazetteer for 3.9M
postal codes, 7K
constituencies, and 21M
venues, but they are largely excluded from the “administrative” tables below, although descriptions and coverage maps are provided.
Coverage by placetype
Locality coverage
Localities form the bulk of place records in the WOF gazetteer. Of those 4.5
million records, 95% have point geometries and 5% have polygons. Distribution of our localities follows population density. See the “Scorecards” section below for a description of which countries have locality polygon coverage.
Primary administrative hierarchy coverage
Gazetteer projects must decide how to set up 1st order administrative divisions around the world, and this decision has repercussions in calculating the depth of administrative levels in a country.
Take for example the United Kingdom of Great Britain and Northern Ireland. Are the 4 “constituent countries” of the U.K. the 1st order subdivisions or are their 334 subdivisions? Are the 101 provinces of France the 1st order subdivisions, or the 29 regions? Are the “overseas” regions and provinces of France part of the main country or different countries? What about the United Kingdom, France, and other country’s dependencies? What about Belgium and its federal system? Or the United States and it’s dependencies?
Who’s On First tracks countries
and dependencies
as equal top-level placetypes and group them into empires
as appropriate. WOF chooses to represent all ISO 1st order subdivisions as “region” placetype features in a continuous fabric of global coverage. When the ISO also provides groupings of those, as reported by the country itself for administrative (versus statistical) use, we import those as an optional macroregion
with complete in-country coverage but sparse global coverage. This situation (and others like in Belgium) impact over a dozen countries globally.
“What are…” in WOF: the United Kingdom has 4 macroregion
and 334 region
records; France has 29 macroregion
and 101 region
records; the United States has 51 region
records (including the District of Columbia which is marked in label:eng_x_preferred_placetype
as a “federal district”). Puerto Rico is a dependency of the United States empire. Naming placetypes in inherently controversial so we also provide an optional and localized “label:{lang}_x_preferred_placetype” property on records that carries the specific word used for that place in its country and translated to English (including the region of Riyadh in Arabic script, latinized Arabic, and English).
(below) Feature counts for each placetype are included in the legend. Every feature on the map below has a polygon geometry, but is represented as a point for visualization purposes.
Localadmin coverage
Most of the 230K
localadmin placetype features have polygon geometries. See the “Depth of administrative levels”, “By geometry type”, and “Improving WOF data coverage and quality (a more detailed look)” sections below for more information.
Campus coverage
Most of the 24K
campus placetype features are airports, but we include national park polygons in select countries, and in the USA there are ~ 20K
mobile home parks.
Other core “administrative” placetypes coverage
Additional placetypes in the “admin” repos are included for convenience to name and/or reverse geocode for continents, oceans, marine areas, time zones, and more exotic features like empires and market areas. We’re also wondering what those 2 unknown features are!
Postalcode coverage
Who’s On First has 3.4M
postalcodes with a range spatial aggregations, geometry types, and polygon accuracies:
- Polygons for fully qualified postalcodes (e.g. “95501” in the United States sourced from the US Census Zip code tabulation areas, and also available in Australia, Australia, Finland, France, and Switzerland)
- Polygons for postal aggregation areas (Canada, Netherlands). Think “955” in the United States which is useful for calculating regional statistics.
- Point locations for specific delivery routes, like “SW1A 1AA” in the United Kingdom (think ZIP+4 in the United States) which have a precise location (the same situation is true for Japan) or an approximate location mapped to the nearest available polygon centroid (like in Canada which has a combination of precise and official 3-character “forward sorting area” aggregation polygons and approximate centroids for the 6-character postal codes).
- In the Netherlands we also have a 4th type of polygon created from the alphashape of the postalcode attributes on WOF venue point locations.
The majority of WOF postalcode records exist in other countries – but their locations are approximate (or are visiting Null Island) and are not shown in the map below based on the shapefile distribution – but they are included in the SQLite distribution. If you’re interested in mapping postalcodes to the post office / locality names, we provide that for the United States in the mz:postal_locality
property. If you need global mapping, see Pelias’ postal cities project.
NOTE: The United Kingdom and Australia receive regular updates, the rest are an older vintage and could use a refresh.
Several detailed map views are provided farther below for the more unusual cases.
Vancouver, Canada postalcodes
When precise locations for postalcodes (like the “V5K 1A2” delivery route) are not available they are sometimes mapped to the polygon centroids of their parent postal area (e.g. “V5K” postal aggregation area) in Vancouver, Canada shown below.
London postalcodes
Detailed point locations for specific delivery routes like “SW1A 1AA” are available in the United Kingdom, updated approximately quarterly, with the London metro area shown below.
Tokyo, Japan postalcodes
Similar point geometries are available in Japan, with the Tokyo metro area shown below.
Amsterdam, Netherlands postalcodes
Several experiments have created alphashapes from WOF venues address data, including in the Netherlands. These approximate polygons are marked “mz:is_approximate” and capture the bounding box extent of the zipcode and the label centroid in search and even reverse geocoding, but are not suitable for display in a thematic map.
Constituency coverage
For national and regional legislative bodies, including upper and lower chamber indication via “wof:association”. These will need a refresh for the 2023 redistricting in the United States. Canada includes data in British Columbia only.
Venue coverage
Back in 2015 and 2016 we imported 21 million venues in 63 countries from the SimpleGeo CC0 venues dataset. We added place hierarchy by reverse geocoding these points against the WOF administrative polygons, and otherwise normalized them into the WOF schema. But venues come and go – they are more ephemeral than administrative data – especially the last few years during the COVID pandemic. So the WOF venue coverage will lack newer businesses and still include businesses that have closed IRL. A separate analysis of WOF venues is available in an earlier blog post.
By Geometry Type
Nearly 10% of Who’s On First’s 5M
administrative records have polygons, with complete global coverage at the country
and dependency
(collectively the “top administrative level”), and region
(1st order administrative subdivision) levels. Nearly continuous global coverage is provided at the county
(2nd order administrative subdivision) level. Many countries also include complete polygon coverage at the localadmin
level.
Countries with any coloring in the “Country scorecards” map in the next section include complete locality polygon coverage. Other “gray” colored countries in that map will only include approximate locality polygons for the country capital and a handful of other regionally important localities.
The WOF gazetteer also stores nearly 0.5M
alternate geometries. This can occur when we upgrade point geometries to polygons, have multiple polygons to choose from but we designate just one the default and store the others as alternates, deal with coastline clipping for display versus territorial waters for reverse geocoding, or need to manually curate a custom polygon (and store the original as an alternate geometry).
Every default polygon feature includes a label centroid geometry suitable for beautiful cartographic labeling derived either from MapShaper or manual curation. Sometimes navigation, reverse geocoding, and other specialized centroids are also provided.
(below) The count of WOF features grouped by placetype, how many of those have default point or polygon geometry, how many have alternate geometries (and the ratio of alternate geometries to the count of features in that placetype), a point area indicates an WOF error mostly from us add or remove neighbourhood polygon geometries, the polygon area in equal area square kilometers, and the ratio of that area compared with the scope of the globe the placetype should cover.
Order | Placetype | Total Features | Points | Multi + Polygons | Alt geoms | Alt % | Point Areas | Poly Area SQ KM | Area % | Area Denom | Area Scope |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | country | 232 | 5 | 227 | 670 | 288.80% | 0.0K | 147.0M | 99.90% | 147.1M | all country and dependency |
2 | dependency | 43 | 43 | 52 | 120.90% | 83.3K | 0.10% | 147.1M | all country and dependency | ||
3 | disputed | 104 | 104 | 110 | 105.80% | 1.2M | 0.80% | 147.1M | all country and dependency | ||
4 | macroregion | 117 | 117 | 1 | 0.90% | 3.3M | 29.80% | 10.9M | countries in (‘AU’, ‘DE’, ‘FR’, ‘GF’, ‘TW’) | ||
5 | region | 5,139 | 169 | 4,970 | 2,404 | 46.80% | 0.0K | 150.0M | 102.00% | 147.1M | all country and dependency |
6 | macrocounty | 482 | 482 | 21 | 4.40% | 882.5K | 10.00% | 8.8M | countries in (‘AZ’, ‘BE’, ‘BR’, ‘ES’, ‘FI’, ‘FR’, ‘GB’, ‘HK’, ‘HU’, ‘IT’, ‘RS’, ‘TW’) | ||
7 | county | 47,431 | 2,412 | 45,019 | 7,292 | 15.40% | 0.0K | 125.7M | 85.50% | 147.1M | all country and dependency |
8 | localadmin | 203,513 | 73,198 | 130,315 | 107,619 | 52.90% | 0.0K | 18.1M | 12.30% | 147.1M | all country and dependency |
9 | locality | 4,498,136 | 4,274,056 | 224,080 | 200,106 | 4.40% | 0.5K | 7.1M | 4.80% | 147.1M | all country and dependency |
10 | borough | 467 | 18 | 449 | 9 | 1.90% | 0.0K | 16.0K | 0.20% | 7.1M | locality |
11 | macrohood | 1,272 | 4 | 1,268 | 2 | 0.20% | 0.0K | 15.9K | 0.20% | 7.1M | locality |
12 | neighbourhood | 233,712 | 194,713 | 38,999 | 128,733 | 55.10% | 94.8K | 144.3K | 2.00% | 7.1M | locality |
13 | microhood | 2,127 | 165 | 1,962 | 2 | 0.10% | 0.0K | 1.9K | 0.00% | 7.1M | locality |
14 | planet | 1 | 1 | 510.1M | 100.00% | 510.1M | planet | ||||
16 | ocean | 7 | 7 | 293.1M | 57.50% | 510.1M | planet | ||||
15 | continent | 8 | 8 | 5 | 62.50% | 146.7M | 28.80% | 510.1M | planet | ||
17 | marinearea | 402 | 402 | 3 | 0.70% | 72.7M | 24.80% | 293.1M | ocean | ||
18 | empire | 12 | 12 | 47.2M | 32.10% | 147.1M | all country and dependency | ||||
19 | timezone | 376 | 376 | 134.4M | 91.40% | 147.1M | all country and dependency | ||||
20 | marketarea | 210 | 210 | 8.5M | 86.30% | 9.8M | countries in (‘US’) | ||||
21 | campus | 24,452 | 21,916 | 2,536 | 33 | 0.10% | 0.0K | 346.8K | 0.20% | 147.1M | all country and dependency |
22 | unknown | 1 | 1 | 1 | 100.00% | 0.00% | 510.1M | planet | |||
23 | TOTALS | 5,018,244 | 4,566,656 | 451,588 | 447,063 | 8.90% | 95.3K | 1.7B |
NOTE: Postalcodes are not tracked in this table. See Postalcode coverage section above.
By Geometry Source
Geometries directly imported by Who’s On First from source
Who’s On First both makes our own data and imports data directly from governments and other reputable sources. WOF tracks sources for each geometry in the “src:geom” property, with values being a source’s “name” in the sources repo, while imported property’s from that source will have their name prepended with the “prefix” in the source’s JSON file. Hyperlinks are provided in the table below directly to that source’s JSON file (while the sources link above is human readable formatted text).
directly by | source | count |
whosonfirst | geonames | 4192805 |
whosonfirst | qs_pg | 213815 |
whosonfirst | quattroshapes | 202739 |
whosonfirst | frgov | 70551 |
whosonfirst | unknown | 70371 |
whosonfirst | mz | 57396 |
whosonfirst | whosonfirst | 47435 |
whosonfirst | uscensus | 36473 |
whosonfirst | meso | 27675 |
whosonfirst | de-bkg | 16290 |
whosonfirst | cbsnl | 14243 |
whosonfirst | ro-ancpi | 13753 |
whosonfirst | ch-cadastre | 6397 |
whosonfirst | nz-linz | 5942 |
whosonfirst | ee-mamt | 5045 |
whosonfirst | pl-gugik | 4658 |
whosonfirst | zetashapes | 4372 |
whosonfirst | ourairports | 3326 |
whosonfirst | pt-dgt | 3092 |
whosonfirst | se-lant | 2813 |
whosonfirst | austriaod | 2194 |
whosonfirst | naturalearth | 1858 |
whosonfirst | wikidata | 1742 |
whosonfirst | woedb | 1472 |
whosonfirst | esp-cicgc | 950 |
whosonfirst | ie-gov | 816 |
whosonfirst | au-capad | 716 |
whosonfirst | amsgis | 695 |
whosonfirst | no-geonorge | 493 |
whosonfirst | us-nps | 419 |
whosonfirst | hkigis | 409 |
whosonfirst | sg-sggov | 396 |
whosonfirst | can-edmdsd | 333 |
whosonfirst | figov | 323 |
whosonfirst | begov | 308 |
whosonfirst | baltomoit | 288 |
whosonfirst | statcan | 272 |
whosonfirst | can-abog | 264 |
whosonfirst | pedia | 254 |
whosonfirst | atldpcd | 235 |
whosonfirst | svn-sma | 224 |
whosonfirst | lu-act | 216 |
whosonfirst | za-mdb | 213 |
whosonfirst | wapo | 213 |
whosonfirst | zolk | 207 |
whosonfirst | aus-psma | 193 |
whosonfirst | mx-conanp | 182 |
whosonfirst | dk-geodk | 179 |
whosonfirst | azavea | 158 |
whosonfirst | can-calcai | 152 |
whosonfirst | esp-aytomad | 151 |
whosonfirst | can-wpgppd | 148 |
whosonfirst | sdgis | 136 |
whosonfirst | tkugov | 129 |
whosonfirst | oakced | 124 |
whosonfirst | ssuberlin | 108 |
whosonfirst | tmpgov | 106 |
whosonfirst | oulugov | 106 |
whosonfirst | porbps | 102 |
whosonfirst | sfgov | 96 |
whosonfirst | os | 91 |
whosonfirst | seagv | 88 |
whosonfirst | denvercpd | 87 |
whosonfirst | torsdfa | 85 |
whosonfirst | nolagis | 83 |
whosonfirst | esp-cartobcn | 83 |
whosonfirst | can-gatsudd | 74 |
whosonfirst | fivgov | 70 |
whosonfirst | arg-caba | 65 |
whosonfirst | can-mtlsmvt | 61 |
whosonfirst | can-saskodp | 55 |
whosonfirst | kuogov | 47 |
whosonfirst | can-clab | 47 |
whosonfirst | lacity | 40 |
whosonfirst | can-dnvgov | 37 |
whosonfirst | can-ons | 36 |
whosonfirst | minitenders | 35 |
whosonfirst | gbr-datalondon | 33 |
whosonfirst | santabar | 32 |
whosonfirst | can-bbygov | 32 |
whosonfirst | eu-cdda | 27 |
whosonfirst | can-qcodp | 27 |
whosonfirst | bra | 27 |
whosonfirst | hk-gov | 21 |
whosonfirst | burritojustice | 19 |
whosonfirst | wikipedia | 16 |
whosonfirst | stpaulgov | 16 |
whosonfirst | wof | 15 |
whosonfirst | SIJ | 15 |
whosonfirst | ni-os | 12 |
whosonfirst | can-rodca | 12 |
whosonfirst | mapzen | 11 |
whosonfirst | ausstat | 11 |
whosonfirst | uk-datagov | 10 |
whosonfirst | canvec-hydro | 9 |
whosonfirst | can-nwds | 9 |
whosonfirst | vanpds | 7 |
whosonfirst | camgov | 7 |
whosonfirst | can-surgis | 5 |
whosonfirst | us-dshiu | 3 |
whosonfirst | sjp | 3 |
whosonfirst | sfomuseum | 3 |
whosonfirst | can-lvlsu | 3 |
whosonfirst | quattroshapes_pg | 2 |
whosonfirst | nullisland | 2 |
whosonfirst | simplegeo | 1 |
whosonfirst | ccsf-sfo | 1 |
Quattroshapes geometries imported to WOF from aggregator
These are the OG polygons that WOF launched with in 2015 thanks to Foursquare’s open data Quattroshapes gazetteer. They range across all administrative placetypes from country to locality to neighbourhood.
Who’s On First indicates if a geometry is from an aggregator in the main “src:geom” property, and lists out those sources the sources repo. Per record fidelity is available by dipping into the “qs:source” property; we plan to promote that information into a new and generic top-level WOF property.
source | src_via | count |
---|---|---|
quattroshapes | Australia Census | 558 |
quattroshapes | Australian Bureau Of Statistics | 8253 |
quattroshapes | Australian Bureau Of Statistics: state suburbs file selection | 4175 |
quattroshapes | Australian Bureau Of Statistics: state suburbs file with customization | 232 |
quattroshapes | Australian Bureau Of Statistics: urban areas and localities selection | 925 |
quattroshapes | Brasil IBGE | 35559 |
quattroshapes | Canada Census | 5771 |
quattroshapes | Custom EuroGlobalMap + UMZ Urban Polygons + Geonames + GeoPlanet | 30639 |
quattroshapes | EuroGlobalMap | 2000 |
quattroshapes | France IGN | 5502 |
quattroshapes | GlobalMap | 4970 |
quattroshapes | Italy IGN | 8223 |
quattroshapes | Mexico IGN | 7028 |
quattroshapes | Natural Earth | 1741 |
quattroshapes | NBC Gov | 152 |
quattroshapes | New Zealand LINZ | 19 |
quattroshapes | NLD Kadaster | 119 |
quattroshapes | quattroshapes | 4824 |
quattroshapes | Spain IGN | 7318 |
quattroshapes | SwissTopo | 604 |
quattroshapes | UK OS | 14550 |
quattroshapes | Unknown (multiple sources) | 8378 |
quattroshapes | US Census | 20089 |
quattroshapes | US State Department, with Natural Earth mods | 171 |
Mesoshapes geometries imported to WOF from aggregator
These are mostly county
placetype default geometries.
Who’s On First indicates if a geometry is from an aggregator in the main “src:geom” property, and lists out those sources the sources repo. Per record fidelity is available by dipping into the “meso:source” property; we plan to promote that information into a new and generic top-level WOF property.
source | src_via | via_name | count |
---|---|---|---|
meso | ACE | Armenia: Acopian Center for the Environment | 913 |
meso | AOTM | Global: Art of the Mappable (AOTM) for Mapzen | 14785 |
meso | CLNC | Chilean National Library of Congress | 51 |
meso | DATAMEET | India: DataMeet | 607 |
meso | DGEEC | Paraguay: DGEEC | 246 |
meso | EGM | Europe: Euro Global Map (EGM) | 155 |
meso | ESOC | Global: Empirical Study of Conflict (ESOC) | 109 |
meso | ESRI_OPEN | Global and Costa Rica: ESRI Open Data (Daticos) | 81 |
meso | FIGN | France: French IGN (FIGN) | 22 |
meso | GSIJ | Japan: Geospatial Information Authority of Japan (GSI) | 3063 |
meso | IDE_EFP | Bolivia: IDE-EPB GeoBolivia data portal (IDE_EPB) | 198 |
meso | ISCGM | Global, Azerbaijan, Georgia, Honduras, Indonesia, Mauritius, Mozambique, Nicaragua, Papua, Togo: International Steering Committee for Global Mapping (ISCGM) | 854 |
meso | meso | Mapzen | 6 |
meso | mz | Mapzen | 22 |
meso | NISR | Rwanda: National Institute of Statistics of Rwanda (NISR) | 30 |
meso | NREL | Bangladesh: National Renewable Energy Laboratory (NREL) | 64 |
meso | OCHA_OPEN | Global: OCHA_OPEN | 1608 |
meso | SAMDB | South Africa: South Africa Municipal Demarcation Board (SAMDB) | 52 |
meso | SH | Colombia: Sala Humanitaria (SH) | 1122 |
meso | SNZ | New Zealand: Statistics New Zealand (SNZ) | 67 |
meso | STATOIDS | Global: Statoids | 1 |
meso | TNBS | Tanzania: Tanzania National Bureau of Statistics (TNBS) | 169 |
meso | TNLSMC | Asia: National Land Surveying and Mapping Center (TNLSMC) | 30 |
meso | USAID | Global: USAID - GIST with USAID as originator, Egypt: USAID via FOIA, Ethiopia, Liberia: USAID | 445 |
meso | WRI | Global: World Resource Institute (WRI), Cameroon, Ethiopia: WRI | 103 |
meso | ZIMSTAT | Zimbabwe: Zimbabwe Central Statistics Office (ZimStat) | 44 |
meso | Multiple sources | 2828 |
WOF Gazetteer Data Scorecards
Country scorecards
We rate 87 countries with good to great polygon coverage at both locality and parent administrative subdivisions, including country, region (state/province), and county, in the A-AA-AAA-AAAA range. Our disputed
territory polygon coverage also falls in this “great” data grouping.
We rate 177 countries that are mostly limited to point locality coverage and basic polygon administrative subdivisions in the B-BB-BBB-BBBB range. We almost always have “complete” coverage in these countries, however, often tens of thousands and sometimes with hundreds of thousands of features are point geometries only.
Breakdowns of each of these classes are available below the map…
A range coverage by country
AAAA: Starting in 2017, we have rebuilt priority countries by importing authoritative, polygon-based records at all placetypes from region down to locality to provide AAAA data quality and coverage in: Australia, Austria, Canada, Estonia, France, Germany, Hong Kong, Italy, Netherlands, New Zealand, Poland, Romania, Singapore, Spain, Switzerland, United Kingdom, and United States.
AAA: Other countries in North America, Europe, Middle East, Asia, and elsewhere have had selective rebuilds at a mix of placetypes to provide AAA data quality and coverage using national mapping agency data – though possibly of an older vintage that may not reflect all recent administrative updates. These include: Akrotiri Sovereign Base Area, Aland, Albania, Bahrain, Belgium, Brazil, Bulgaria, Croatia, Cyprus, Czechia, Denmark, Dhekelia Sovereign Base Area, Faroe Islands, Finland, Greece, Greenland, Guernsey, Hungary, Iceland, Ireland, Isle of Man, Kuwait, Latvia, Liechtenstein, Lithuania, Luxembourg, Moldova, North Macedonia, Northern Cyprus, Norway, Portugal, Puerto Rico, Qatar, San Marino, Saudi Arabia, Slovakia, Slovenia, South Africa, Sweden, Turkey, Ukraine, United Arab Emirates, and Vatican.
AA: WOF provides AA data quality with extensive (but not universal) polygon coverage at locality level in Chile, India, Indonesia, Iraq, Israel, Mexico, Morocco, Nigeria, Pakistan, Palestine, Philipeans, South Korea, and Taiwan. These polygons often date back to 2013 and represent the largest urban settlements.
A: WOF provides A data quality with approximate locality polygons in China, Japan, Malaysia, Russia, and Thailand. Several countries with extensive and higher quality point locality coverage, including: Columbia and Honduras.
B range coverage by country
BBBB: WOF provides BBBB quality data with dense point locality coverage in Afghanistan, Argentina, Bangladesh, Baykonur, Bolivia, Iran, Kazakhstan, Mozambique, Myanmar, Nepal, Paraguay, Peru, Tanzania, Venezuela, and Vietnam.
BBB: WOF provides BBB quality data with dense point locality coverage in Angola, Belarus, Bosnia and Herzegovina, Congo (DRC), Costa Rica, Gabon, Madagascar, Mongolia, North Korea, Rwanda, Uganda, and Yemen.
BB: WOF provides BB quality date with medium density point locality coverage in Armenia, Burkina Faso, Burundi, Cambodia, Cameroon, Central African Republic, Congo (ROC), Cuba, Dominican Republic, Ecuador, Egypt, Ethiopia, Ghana, Guatemala, Haiti, Ivory Coast, Kosovo, Kyrgyzstan, Laos, Mali, Montenegro, Namibia, Panama, Papua New Guinea, Serbia, Sierra Leone, South Sudan, Sri Lanka, Sudan, Syria, Tunisia, Uruguay, and Zambia.
B: WOF provides B quality date with basic density point locality coverage in_ Algeria, Azerbaijan, Belize, Benin, Bhutan, Botswana, Chad, Djibouti, El Salvador, Equatorial Guinea, Eritrea, eSwatini, Gambia, Georgia, Guinea, Guinea-Bissau, Guyana, Jamaica, Jordan, Kenya, Lesotho, Liberia, Libya, Malawi, Mauritania, Nicaragua, Niger, Oman, Senegal, Somalia, Somaliland, Suriname, Tajikistan, Togo, Turkmenistan, Uzbekistan, and Zimbabwe. This category also includes several smaller island countries._
NOTE: We still provide detailed region and county polygon coverage for all these to generate full administrative hierarchies, except counties in Namibia, Botswana, Belarus, Bulgaria, Turkmenistan, and Uzbekistan. Greenland, Iceland, and Western Sahara do not have county subdivisions IRL.
Data size by country
NOTE: Methodology: https://data.geocode.earth/wof/dist/shapefile/inventory.json to get sizes per country.
Even though most countries are limited to locality points past the county polygon level, those features (and their properties including name localizations) can add up to sizable downloads, especially when mixed in with the polygon features.
Features by country
Looking at “administrative” placetype features by country (country code), the trend mostly follows population density.
rank | country | count | rank | country | count | rank | country | count | ||
---|---|---|---|---|---|---|---|---|---|---|
1 | China (CN) | 679,207 | 21 | Bangladesh (BD) | 48,696 | 41 | Austria (AT) | 23,458 | ||
2 | India (IN) | 509,567 | 22 | Philippines (PH) | 45,819 | 42 | Iraq (IQ) | 22,900 | ||
3 | United States (US) | 293,571 | 23 | Morocco (MA) | 45,577 | 43 | Mozambique (MZ) | 22,782 | ||
4 | Indonesia (ID) | 256,068 | 24 | Viet Nam (VN) | 41,889 | 44 | Bosnia & Herz. (BA) | 22,147 | ||
5 | Mexico (MX) | 232,389 | 25 | Spain (ES) | 41,443 | 45 | Netherlands (NL) | 21,912 | ||
6 | Russia (RU) | 203,058 | 26 | Peru (PE) | 40,653 | 46 | Malaysia (MY) | 21,102 | ||
7 | Germany (DE) | 167,516 | 27 | Myanmar (MM) | 39,287 | 47 | Portugal (PT) | 20,519 | ||
8 | Pakistan (PK) | 135,126 | 28 | Yemen (YE) | 39,204 | 48 | Lithuania (LT) | 19,988 | ||
9 | France (FR) | 122,484 | 29 | Congo (CD) | 36,677 | 49 | Taiwan (TW) | 18,687 | ||
10 | Thailand (TH) | 87,191 | 30 | Ukraine (UA) | 35,145 | 50 | Sri Lanka (LK) | 17,956 | ||
11 | Nepal (NP) | 80,713 | 31 | Colombia (CO) | 33,723 | 51 | Czechia (CZ) | 17,341 | ||
12 | Iran (IR) | 78,355 | 32 | Afghanistan (AF) | 32,419 | 52 | Romania (RO) | 16,666 | ||
13 | Brazil (BR) | 76,335 | 33 | Sweden (SE) | 31,859 | 53 | Angola (AO) | 16,585 | ||
14 | Italy (IT) | 73,273 | 34 | Korea (DPR) (KP) | 31,755 | 54 | Greece (GR) | 15,855 | ||
15 | Japan (JP) | 56,979 | 35 | Australia (AU) | 28,204 | 55 | Bolivia (BO) | 15,495 | ||
16 | Nigeria (NG) | 55,893 | 36 | Canada (CA) | 28,097 | 56 | South Africa (ZA) | 15,402 | ||
17 | Türkiye (TR) | 54,420 | 37 | Switzerland (CH) | 25,691 | 57 | Laos (LA) | 15,262 | ||
18 | Korea (KR) | 53,030 | 38 | Belarus (BY) | 25,276 | 58 | Finland (FI) | 15,143 | ||
19 | United Kingdom (GB) | 52,950 | 39 | Madagascar (MG) | 24,260 | Others | 659,904 | |||
20 | Poland (PL) | 51,522 | 40 | Venezuela (VE) | 23,819 | TOTAL | 5,018,244 |
Depth of administrative levels by country
Who’s On First tracks 1st and 2nd order administrative divisions for most countries and dependencies (so three levels deep including country
(and/or dependency
and empire
), region
, and county
). A handful only have 1st order divisions (2 levels). Around 39 countries include 4 to 7 administrative levels. Around 24 of those include features at the municipality (localadmin
) level. We have point records only for localadmin
in another 19 countries, and anticipate polygon features for this placetype can be imported for another 53 countries (see “Future Directions” section below).
NOTE: All the countries in the map above also include locality
points and polygons but because those don’t form a continuous in-country or global fabric this level is excluded from the counts above.
Vintage of data by country
While 2013-era Quattroshapes is the skeleton for Who’s On First, we’ve substantially updated the gazetteer in the last decade. In 2015, we imported the original Quattroshapes polygon features to normalize them and otherwise clean them up. In 2016, we added polygons for 2nd order administrative divisions (“county
”) for almost every country and dependency in the world. In 2017, we imported Quattroshapes point gazetteer and GeoNames “locality
” points to provide global coverage at that placetype in all countries. Most country and dependency capitals and major metropolitan centers were updated with approximate locality
polygons in 2021. Select countries have been rebuilt in whole or partially since 2017, including 42 countries in North America, Europe, and elsewhere.
Most of the AAAA and AAA countries (see section above) have a 5- or 10-year census, and we attempt to incorporate polygon data from their most recent census &/or INSPIRE-related open data efforts. Some eastern European countries have earlier 2013 vintage locality polygons. Generally region and county level changes less per year, with the localadmin level experiencing more consolidation and occasional splits on an annual basis (~1% change per year).
When Who’s On First observes a change in a country’s administrative subdivisions we add the new or changed features with an inception date, and mark the old ones with a cessation date using the expressive U.S. Library of Congress’ Extended Date/Time Format (EDTF) syntax. When possible, we also point back and forth between new and old with supersedes and supersedes_by properties. This allows older administrative units to be queried within a date range in the SQLite distribution (while the Shapefile distribution excludes non-current records). We’ve also experimented with the same for countries, like the former Yugoslavia, which we discussed in an earlier blog post, and for the 2016 French “region” level consolidation (for example the WOF macroregion
of Rhone-Alpes was consolidated into the new Auvergne-Rhone-Alpes WOF macroregion
).
We catalog changes to our place data in the CHANGELOG by year, month, and country.
Select data properties
Population by placetype
By tracking population (and the related population_rank), similarly named places can be ranked and disambiguated in search and visually distinguished in map display. For example, there are many places named San Francisco in the world, but only one with an exact name match also with the largest population located in the California region of the United States.
For global coverage placetypes like continent, country (and dependency), region, county, and locality the population should eventually sum to the planet’s population – modulo import vintage and population growth and the “per global” provides an coverage &/or accuracy scoring. For other placetypes like macroregion and localadmin they should sum to the population in the covered countries (not included in the table). Other placetypes are not available globally so the “per global” indicates the global significance of that placetype when viewed by population instead of feature count.
(below) This chart sums available population values on WOF records grouped by placetype. Some placetypes that have global coverage should sum to the current “per global” population estimate of 7.888 billion people but currently may not because (a) the vintage of per feature population estimates either lags behind actual population growth or may over-estimate population e.g. because of disputed territories and (b) some features lack population entirely.
placetype | population | per global |
---|---|---|
borough | 29,474,816 | 0.40% |
campus | 133,650 | 0.00% |
continent | 7,627,184,440 | 96.70% |
country | 7,175,089,564 | 91.00% |
county | 4,934,055,579 | 62.60% |
dependency | 16,724,919 | 0.20% |
disputed | 461,098,876 | 5.80% |
empire | 2,009,661,518 | 25.50% |
localadmin | 963,706,749 | 12.20% |
locality | 5,717,437,257 | 72.50% |
macrocounty | 231,167 | 0.00% |
macrohood | 4,034,954 | 0.10% |
macroregion | 94,515,886 | 1.20% |
marinearea | 0.00% | |
marketarea | 0.00% | |
microhood | 51,741 | 0.00% |
neighbourhood | 411,640,484 | 5.20% |
ocean | 0.00% | |
planet | 0.00% | |
region | 9,300,891,860 | 117.90% |
timezone | 0.00% | |
unknown | 0.00% |
Zoom ranges by placetype
Some records should be displayed earlier on a map and some later, and not all features within the same placetype are equal. For example, region subdivisions in the United States are large geographically compared regions in Albania so the former should be shown earlier and the latter shown later. This is especially true of localities where most localities can be labeled on a zoom 12 map – but only a few shown on a zoom 2 map (especially considering name translations within a limited vector tile budget).
A few areas stand out for future work: county zooms should be spread out (mostly increasing them from zoom 6 to later zooms and as an offset for their parent region’s zoom), some localadmin are visible too early, some macrohood are visible too late, and some neighbourhood and microhood are visible too early – this can be mitigated in application logic when using the data but we should clean it up at the source.
(below) This chart counts available min_zoom
values on WOF records grouped by placetype. Course placetypes group features that are suitable for labeling on a low-zoom world map as “countries” or their 1st and 2nd order subdivisions, generally from zooms 0 thru 8 (though some counties should label later). Settlement placetypes include both locality and localadmin and are rarely labeled at low zooms, typically coming in the mid-zooms of 9, 10, 11, 12, and 13. Neighbourhood placetypes are subdivisions of settlements and generally should only be displayed at high-zooms of 11 thru 17.
Course placetypes | min_zoom |
count | . | Settlement placetypes | min_zoom |
count | . | Neighbourhood placetypes | min_zoom |
count |
---|---|---|---|---|---|---|---|---|---|---|
continent | 0 | 7 | localadmin | 3 | 5 | borough | 9 | 2 | ||
continent | 1 | 1 | localadmin | 4 | 50 | borough | 10 | 204 | ||
country | 0 | 4 | localadmin | 4.7 | 1 | borough | 11 | 63 | ||
country | 1.7 | 19 | localadmin | 5 | 60 | borough | 12 | 60 | ||
country | 2 | 26 | localadmin | 5.1 | 1 | borough | 13 | 39 | ||
country | 2.5 | 5 | localadmin | 5.6 | 3 | borough | 14 | 20 | ||
country | 3 | 54 | localadmin | 6 | 182 | borough | 15 | 3 | ||
country | 4 | 60 | localadmin | 6.7 | 13 | borough | 18 | 4 | ||
country | 4.5 | 11 | localadmin | 7 | 297 | borough | 0 | |||
country | 5 | 34 | localadmin | 8 | 681 | campus | 0 | 1 | ||
country | 6 | 1 | localadmin | 9 | 2054 | campus | 12 | 165 | ||
country | 0 | localadmin | 10 | 2302 | campus | 13 | 22705 | |||
county | 5.6 | 1 | localadmin | 11 | 26558 | campus | 14 | 1 | ||
county | 6 | 37206 | localadmin | 12 | 68889 | campus | 0 | |||
county | 6.1 | 1 | localadmin | 0 | macrohood | 11 | 9 | |||
county | 6.7 | 6 | locality | 1.7 | 8 | macrohood | 12 | 78 | ||
county | 7 | 9 | locality | 2 | 8 | macrohood | 13 | 970 | ||
county | 8 | 33 | locality | 2.1 | 14 | macrohood | 14 | 89 | ||
county | 8.7 | 5 | locality | 2.5 | 3 | macrohood | 15 | 46 | ||
county | 9 | 76 | locality | 2.7 | 3 | macrohood | 16 | 9 | ||
county | 10 | 112 | locality | 3 | 34 | macrohood | 17 | 1 | ||
county | 0 | locality | 3.7 | 30 | macrohood | 21 | 3 | |||
dependency | 0 | 4 | locality | 4 | 312 | macrohood | 0 | |||
dependency | 3 | 3 | locality | 4.7 | 27 | microhood | 11 | 1 | ||
dependency | 3.5 | 1 | locality | 5 | 450 | microhood | 12 | 14 | ||
dependency | 4 | 10 | locality | 5.1 | 165 | microhood | 13 | 81 | ||
dependency | 4.5 | 10 | locality | 5.6 | 400 | microhood | 14 | 255 | ||
dependency | 5 | 10 | locality | 5.7 | 2 | microhood | 15 | 604 | ||
dependency | 6.5 | 2 | locality | 6 | 1787 | microhood | 16 | 102 | ||
dependency | 7 | 2 | locality | 6.1 | 563 | microhood | 17 | 33 | ||
dependency | 0 | locality | 6.7 | 2044 | microhood | 18 | 25 | |||
disputed | 0 | 35 | locality | 7 | 2512 | microhood | 19 | 89 | ||
disputed | 3.7 | 7 | locality | 8 | 3294 | microhood | 0 | |||
disputed | 4 | 3 | locality | 9 | 12118 | neighbourhood | 0 | 70 | ||
disputed | 4.7 | 2 | locality | 10 | 10441 | neighbourhood | 2 | 2 | ||
disputed | 5 | 16 | locality | 11 | 295462 | neighbourhood | 2.5 | 1 | ||
disputed | 6 | 5 | locality | 12 | 3845767 | neighbourhood | 3 | 1 | ||
disputed | 6.7 | 1 | locality | 13 | 73875 | neighbourhood | 4 | 4 | ||
disputed | 7 | 34 | locality | 14 | 84 | neighbourhood | 5 | 3 | ||
disputed | 0 | locality | 15 | 110 | neighbourhood | 5.1 | 4 | |||
empire | 0 | locality | 16 | 9 | neighbourhood | 5.6 | 4 | |||
macrocounty | 5 | 148 | locality | 17 | 1 | neighbourhood | 6 | 6 | ||
macrocounty | 0 | locality | 20 | 792 | neighbourhood | 6.1 | 1 | |||
macroregion | 3 | 78 | locality | 30 | 18702 | neighbourhood | 6.7 | 11 | ||
macroregion | 11 | 3 | locality | 0 | neighbourhood | 7 | 15 | |||
macroregion | 0 | neighbourhood | 8 | 16 | ||||||
marinearea | 2 | 24 | neighbourhood | 9 | 79 | |||||
marinearea | 4 | 32 | neighbourhood | 10 | 43 | |||||
marinearea | 5 | 13 | neighbourhood | 11 | 741 | |||||
marinearea | 5.7 | 44 | neighbourhood | 12 | 14937 | |||||
marinearea | 6 | 74 | neighbourhood | 13 | 11515 | |||||
marinearea | 7 | 13 | neighbourhood | 14 | 21666 | |||||
marinearea | 7.6 | 59 | neighbourhood | 14.5 | 189 | |||||
marinearea | 7.7 | 36 | neighbourhood | 15 | 162012 | |||||
marinearea | 7.8 | 23 | neighbourhood | 15.2 | 776 | |||||
marinearea | 0 | neighbourhood | 15.5 | 350 | ||||||
marketarea | 0 | neighbourhood | 16 | 6556 | ||||||
ocean | 1 | 7 | neighbourhood | 16.5 | 761 | |||||
planet | 0 | neighbourhood | 17 | 1499 | ||||||
region | 2 | 72 | neighbourhood | 18 | 372 | |||||
region | 3 | 28 | neighbourhood | 19 | 279 | |||||
region | 4 | 121 | neighbourhood | 20 | 71 | |||||
region | 4.5 | 1 | neighbourhood | 21 | 17 | |||||
region | 4.6 | 112 | neighbourhood | 26 | 1 | |||||
region | 4.7 | 83 | neighbourhood | 0 | ||||||
region | 5 | 10 | ||||||||
region | 6 | 130 | ||||||||
region | 6.6 | 335 | ||||||||
region | 6.7 | 301 | ||||||||
region | 7 | 272 | ||||||||
region | 7.7 | 516 | ||||||||
region | 8 | 821 | ||||||||
region | 8.7 | 323 | ||||||||
region | 9 | 503 | ||||||||
region | 10 | 715 | ||||||||
region | 11 | 647 | ||||||||
region | 18 | 113 | ||||||||
region | 0 | |||||||||
timezone | 0 | |||||||||
unknown | 0 |
NOTE: The zoom ranges in the table above are for standard 256 px sized tiles, which map to the following natural scales and match Leaflet and Google Maps. When viewed in the Mapbox or MapLibre rendering library the zooms will be offset by 1 due to their assumption about 512 px size.
Improving WOF data coverage and quality
As human population grows (and declines) so the world’s heart beats, and so our work as gazetteer maintainers carries on – by both repainting and “touching up” the data.
Our top two priorities in 2023 are:
- Improve locality coverage in India, and
- expand localadmin coverage globally
We expect to land the India locality coverage by June this year with polygons for most localities and points for smaller rural villages.
Our localadmin efforts are largely focused on achieving complete coverage in Europe, introducing localadmin coverage in Africa, and expand localadmin coverage in Asia and South America.
Priority countries (blue outlines below) include: Indonesia, Mexico, Ireland, Portugal, Belgium, and Luxembourg. An additional 26 countries (black outlines below) have been prioritized for minor edits or more complete imports &/or rebuilds. Please reach out to help :)
Detailed plan for 2023
We have sourced for import in 2023 new data from national mapping agencies in 12 countries, potential NMA data in another 7 countries, 39 countries at the localadmin level via the geoBoundaries project, and need to review licenses on another 17 localadmin geoBoundaries countries. We’re still searching for data in several priority countries, including Costa Rica, Egypt, Peru, and Thailand.
For densely populated countries with complex administrative subdivisions (like in China and India), we may consider adding an optional microcounty
placetype between county and localadmin.
While we don’t have concrete plans for expanding postalcode
or constituency
coverage, if those are important to you please reach out. Two tractable problems are importing newer open polygon data for postalcode
features in Europe and rescuing point geometry postalcode
records that are visiting Null Island.
Localization
Who’s On First includes name localizations (L10n) into 494 languages. Individual language coverage ranges from rich to sparse. Major cities like New York are translated into most of the languages, but a rural locality is probably only provided in the local language and translated into English and/or the latin character set.
While the source GeoJSON and SQLite distribution include all languages, for Shapefiles we provide a more ergonomic experience by pre-joining the SPR to the names table to include the default and 25 more localized names, when available, for:
Arabic, Bengali, Chinese (simplified and/or traditional), Dutch, English, Farsi, French, German, Greek, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish, Swedish, Turkish, Ukrainian, Urdu, and Vietnamese
Who’s On First uses the RFC 5646 / BCP-47 language indications for names to specify a 3-character code for the name translations, like name:{language}_x_preferred
in the GeoJSON and name_{locale}
in the Shapefiles. For example, English is stored as name:eng_x_preferred
in the GeoJSON and name_eng
in the Shapefiles.
The list of supported Shapefile languages is adapted from Natural Earth and Tilezen’s list of core languages. Arabic, Chinese, English, French, Russian and Spanish are used by the United Nations for meetings and official documents. The other languages listed are either proposed as an official language of the United Nations (Bengali, Hindi, Portuguese, and Turkish) or frequently used in OpenStreetMap, Who’s On First, or Wikipedia.
If you need to discern the “local” name for a given place record, the parent country
record indicates official and/or spoken language codes. In some countries the parent region
(or macroregion
) record will also indicate language preferences (like Catalonia in Spain), which can allow for local variations.
By current WOF convention, a feature’s default name is stored in ASCII-7 English. For latin script based languages you can maximize localized name coverage by coalescing name:{lang}_x_preferred
, name:eng_x_preferred
, and wof:name
. Specific coalesce logic depends on your application and the locale(s) used by your audience.
Localized name coverage by placetype
By convention, Who’s On First tracks a single “preferred” name spelling per language for a place. Because of the many different ways to translate and/or transliterate a place’s name from one language into another WOF also allows multiple “variant” names per language. Some languages have formalized rules for transliterating into another, but those rules can evolve over longer spans of time (e.g. romanization of Chinese) and generate time and system variants. Languages without formalized transliteration rules can organically generate even more variants.
The large variation in names is one of the reasons gazetteers like Who’s On First assign unique IDs for our features and, when we discern a match in another gazetteer, provide ID crosswalk between the “this is the same as that” features.
placetype | feature count | Preferred names in all locales | Preferred in Shapefile locales | Alternate names in all locales | Alternate names in Shapefile locales |
---|---|---|---|---|---|
borough | 467 | 5,897 | 3,225 | 881 | 847 |
campus | 24,452 | 106,282 | 75,138 | 39,862 | 39,066 |
continent | 8 | 1,887 | 270 | 91 | 22 |
country | 232 | 53,977 | 7,756 | 13,493 | 4,218 |
county | 47,431 | 1,046,479 | 481,357 | 125,178 | 108,200 |
dependency | 43 | 5,228 | 1,214 | 1,702 | 908 |
disputed | 104 | 6,088 | 2,379 | 914 | 604 |
empire | 12 | 2,370 | 237 | 12 | 12 |
localadmin | 203,513 | 2,170,698 | 1,187,815 | 259,808 | 243,505 |
locality | 4,498,136 | 22,396,047 | 12,908,288 | 6,869,767 | 5,000,119 |
macrocounty | 482 | 13,908 | 6,965 | 1,281 | 1,251 |
macrohood | 1,272 | 10,067 | 5,882 | 1,641 | 1,616 |
macroregion | 117 | 9,227 | 2,600 | 789 | 418 |
marinearea | 402 | 16,438 | 7,264 | 1,609 | 1,063 |
marketarea | 210 | 4,771 | 1,577 | 433 | 433 |
microhood | 2,127 | 9,316 | 6,669 | 2,910 | 2,859 |
neighbourhood | 233,712 | 1,482,405 | 852,647 | 318,119 | 288,618 |
ocean | 7 | 1,504 | 238 | 107 | 37 |
planet | 1 | 258 | 26 | 29 | 17 |
region | 5,139 | 269,007 | 109,135 | 33,508 | 23,478 |
timezone | 376 | 753 | 752 | 380 | 380 |
TOTALS | 5,018,243 | 27,612,607 | 15,661,434 | 7,672,514 | 5,717,671 |
Localized name coverage by language
The most common language in Who’s On First is English, followed by Chinese, and French. Rounding out the top 10 in ranked order are Russian, German, Dutch, Swedish, Spanish, Italian, and Polish.
The table below only includes the top 37 languages. A total of 63 languages have more than 100,000 preferred names. Another 45 languages have more than 50,000 preferred names.
NOTE: The default language for wof:name
is English, so almost all records include an implicit default English name and many also include an explicit localized English name. The Cebuano ceb
localizations are largely from a Wikipedia bot and are suspect so not included in the stats above. The Norwegian language uses multiple codes, including: nno
, nob
, and nor
.
Internationalization
For internationalization (i18n), we track 104 “disputed” territories (polygons between countries for contested areas) around the world, and Who’s On First hold hands with Natural Earth allowing you to make use of their extensive catalog of pairwise points-of-view.
Disputed placetype records include an optional mz:hieararchy_label
property (and related properties for each ancestor level in the placetype hierarchy) to indicate if place records that reverse geocode within the disputed area should include full, any, or partial text strings for that record’s ancestors. For example, Western Sahara which, assuming a reverse geocoder like Pelias has implemented the flag, won’t include text for country or region ancestors but will allow showing locality names. We recommend using the Natural Earth point-of-views to localize the hierarchy label properties (e.g. to allow viewers within Morocco to also see country and region ancestor details).
NOTE: The Hans Island disputed between Canada and Denmark has been resolved and needs to be updated. Huzzah diplomacy!
Open Data
Who’s On First’s permissive, attribution required (CC-BY equivalent) open license means you can use the data however you want (including for commercial purposes), as long as you provide credit somewhere in your map or app.
Crediting the Who’s On First project is required because some of the open data projects that WOF aggregates require attribution. Linking back to the License is required in hypermedia projects.
For example, on a website or in an app’s about page, you should also include the License hyperlink:
Data from Who’s On First. License.
For example, in a web map’s sources (displayed on the map), hyperlinked to the License:
Where the “License” text (or project name) hyperlinks either to the whosonfirst.org hosted license page, or to a detailed data licenses page (or section of your terms of service) on your web site or in your app. At that location, you must credit the Who’s On First sources that require attribution. It may be more practical to credit both Who’s On First and all of our sources – it’s the kind thing to do.
For stand alone print maps, a simple text credit to the project will suffice:
Who’s On First
If the map is included in a book or atlas publication then the data credits section should also repeat the Who’s On First credit and the web link (https://whosonfirst.org/docs/licenses/).
Sources
The Who’s On First dataset is both an original work and a modification of existing open data. The WOF gazetteer includes data from 360 sources. There are 139 primary sources, 188 additional sources via Quattroshapes, Mesoshapes, and other aggregators, and 33 concordance-only sources.
Sources include national mapping agencies in: United States (US Census), Australia (PSMA and ABS), Austria, Belgium (Geopunt), Brazil (IBGE via Quattroshapes), Canada (multiple, including Census Canada and Statistics Canada), Denmark, Estonia (Land Board), Finland (National Land Survey), France (IGN), Germany (BKG), Ireland (Ordnance Survey), Italy (IGN via Quattroshapes), Japan (GSI), Mexico (INEGI and IGN), Netherlands (CBS and NLD Kadaster), New Zealand (LINZ), Norway (Geonorge), Poland (GUGIK), Portugal (DG Territorio), Romania (ANCPI), Spain (IGN via Quattroshapes), Sweden (Lantmateriet), Singapore, Slovenia (EPG), South Africa (NBC and Municipal Demarcation Board), Switzerland (SwissTopo), and United Kingdom (Ordnance Survey).
Other global sources include: GeoNames, Yahoo’s GeoPlanet, Foursquare’s Quattroshapes, and Mapzen’s Mesoshapes. Several of our founders and core contributors were instrumental in these other projects.
We detail all sources and their specific license, usage, and vintage metadata in the full sources list.
Concordances
WOF includes a large number of concordances to allow our project to “hold hands” with other datasets – individual WOF records link to specific features in other open and closed gazetteer projects.
Not all sources include concordances, but when they do we indicate in the condances section the dataset prefix (like gn
for GeoNames and wd
for Wikidata) and the name of the unique identifier that project uses (mostly id
, so put together gn:id
and wd:id
fully qualified property names).
Who’s On First tracks our sources in the sources repo. Besides the concordances, WOF also notes the source for each geometry in the “src:geom” property (using the JSON “name” of the source) and imported properties from that source will have their name prepended with the “prefix” in the source’s JSON file. This is mostly machine readable today and we’re working to make it fully machine readable soon.
(below) Use the sources list to decode the source project prefixes used in the following table.
Applications
When the Who’s On First (WOF) gazetteer started in 2015 we partnered closely with Pelias, another Mapzen project, to ensure our big list of places could power:
- Search for forward geocoding based on full name string input and type-ahead “autocomplete” based on partial string matching per keystroke
- Hierarchy lookup allowing a place’s parent and other ancestors to be reverse geocoding from a given latitude, longitude input. Specific attention was paid to disputed territories and which hierarchy components should be masked in a reverse geocoding lookup.
- Single feature display with label centroids for polygons, label bounding boxes, detailed polygons.
- Metrics logging is possible because of WOF’s stable unique IDs – to turn data exhaust into insights about customer usage patterns.
The results of a search can then be pipped to Valhalla, another Mapzen project for:
- Routing with polygon navigation and label “centroids” that are close to a place’s urban center rather than the math centroid of the polygon, and guaranteed to be on land.
Places can also be added to vector tiles Tilezen, another Mapzen project for:
- Basemap with label “centroids” for neighbourhood polygons.
All these Mapzen projects (Who’s On First, Pelias, Valhalla, and Tilezen) are now part of the Linux Foundation and are free for anyone to use and adapt.
Future applications looking for collaboration and funding
Map display for label points
- Label min zoom and label max zooms exist but can to be fine tuned for global, multi-zoom basemaps (especially for county and locality placetypes)
- Population estimate coverage expansion to determine relative label zoom and townspot size grading (for latest census counts)
Map display for thematic polygons
- Min zoom and max zoom variable per placetype and per country with area grading for a composite global, multi-zoom basemap
- Vector tilesets (global and per placetype) with newer tools like Tippicanoe and Planetiler delivered as serverless PMTiles archives
- Edge matchingof boundaries between countries
- Land clipping of boundaries inside countries for terrestrial land mass. This is a partially addressed problem in the USA for data from the US Census, and sometimes occurs in Europe
- Quality checks to ensure no feature dupes between placetypes (e.g.
localadmin
paired withlocality,
andneighbourhood
paired withlocaladmin
) and that historical features are marked not current
Single feature display
- Preview polygons with pre-cached, small “200 kb” display ready alternate geometries as existing default polygon geometries vary widely in size from a few KB to more than 50 MB. These are useful for displaying search results with a polygon (not just with a centroid zoomed to the bounding box).
Hierarchy
- Disputed point-of-view: For administrative hierarchy, further work to power “point-of-view” on the hierarchy masks, can be adapted from Natural Earth’s efforts for automated reverse geocoding so users in various countries get locally adjusted polygon data instead of globally masked data
Names and Concordances
- U.S. Board on Geographic Names: Adding concordance IDs and toponymic information from the Geographic Names Database, containing official standard names approved by the United States Board on Geographic Names (BGN) and maintained by the National Geospatial-Intelligence Agency.
Routing
- Intersect with urbanized landcover to ensure feature point geometries are inside urban areas and polygon feature’s navigation points are in the place’s urban center (ideally downtown)
Placetypes
- Expand localadmin placetypes globally (see section above)
- Update United States placetypes for Census 2020 data releases
- Update European placetypes for newer data releases, including in Germany, France, and Netherlands.
- Add Indonesia placetypes for newer data releases
- Add Vietnam placetypes for newer data releases
- Update Brazil placetypes for newer data releases
- Update Mexico placetypes for newer data releases
- Expand postalcode polygon coverage using latest open map data, especially in Europe
- Expand constituency coverage outside of the USA using latest open map data
- Add statistical aggregations like NUTS and US Census
- Add metropolitan/micropolitan/rural area globally using original research
- Add colloquial areas (like Outer Banks) via crowdsourcing
Example applications by placetype
When decomposing address data into placetypes (using libPostal, an OpenAI prompt, or another tool), it can be helpful to think how those strings can geo match to Who’s On First records.
Decomposing Data… |
---|
…into Data Sources |
---|
Who’s On First is a “coarse” geocoder meaning it doesn’t provide street level features. But WOF data can be complemented (as Pelias does) with data from other open data projects, including: OpenAddresses, All The Places, Natural Earth, and even the ODbL licensed OpenStreetMap for a complete geocoding solution.
If you need to process and deduplicate venue data using latitude, longitude, name, and address information we also recommend Lieu (code repo and blog post).
Sources | Precision | Component | Example |
---|---|---|---|
Who’s On First and All The Places and OpenStreetMap | Exact | Venue | Some Business |
OpenAddresses and All The Places and OpenStreetMap | Exact | Street number | 155 |
OpenStreetMap | Fine | Street name | 9th Street |
Who’s On First | Course | Locality | San Francisco |
Who’s On First and Natural Earth | Course | Region | CA (short code for California) |
Who’s On First | Course | Postcode | 94131 |
OpenAddresses | Fine | Postcode+4 | 1234 |
Who’s On First and Natural Earth | Course | Country | USA (short code for United States of America) |
_NOTE: While our WOF gazetteer does include 21M venue records they are of an older vintage and do not reflect closed venues or recently opened venues. Depending on your needs, it may be better to use All The Places (scraper project for international, national, and regional chains), OpenStreetMap (caveat ODbL license restrictions), or license data from proprietary sources including Foursquare_, SafeGraph, TripAdvisor or others. Postalcode polygon country coverage can be augmented with proprietary sources like MBI.
Collaboration
Most people engage with the project by downloading and “reading” the data. In the course of using the data, you might find a few tools helpful both for finding specific records and for editing them to fix small problems in the data (like a misspelled name).
View Who’s On First records
The Spelunker is a web tool for browsing Who’s On First (shown in the screenshot immediately above) and includes search functionality. For example: view New York city.
You can also View raw data on Github. Because navigating the many whosonfirst-data
repos can be overwhelming, each individual WOF record self describes in which repo it can be found. For example, view New York city as raw GeoJSON.
Editing of Who’s On First records
We use Write Field for quick property data edits. For example, load New York city for editing. This requires a Github account to create a pull request which can be reviewed by the WOF team.
For more complex and bulk imports we use QGIS and scripts to iterate over files managed in git. This workflow is described in more detail in our recent Shapefiles blog post.
Downloads
Accessibility facilitates access for all and Who’s On First data is available to download in several formats, including:
- Shapefiles - for GIS mappers in applications like QGIS and ArcGIS
- SQLite - for database driven applications
Thanks to Geocode Earth for sponsoring processing and hosting of the downloads.
While we manage Who’s On First place records as individual text files in git repos hosted on Github, from the early days we’ve made “bundled” distributions available. Those distributions were more geared for software engineers than for a more general audience of map makers and cartographers.
Shapefile downloads for the Who’s On First gazetteer were added in April 2023 as per-country ZIP archives including admin (country, region, county, locality, neighbourhood & more), postalcode, and constituency placetypes. They only contain basic properties, names, and geometries.
SQLite databases contain the full firehose of Who’s On First data, including mixed geometry types and full set of nested GeoJSON properties. Unlike the raw GeoJSON files in the git repos, data in the SQLite databases are organized into several tables, including: spr, names, concordances, ancestors, and geojson. Geometries (default and alternate) are stored as properties in the geojson table. The field layout of the spr table is explained in the Standard Place Response section of the Shapefiles blog post.
Sponsor Who’s On First
A large project like Who’s On First relies on good will to function, some amount of organization to sustain it year to year, and funding to pay data curation, engineering development, and internet server bills.
If you or your organization relies on Who’s On First please considering sponsoring us with a recurring or one-time donation. Please reach out to Nathaniel at wofgazetteer@gmail.com to discuss options.
Contact Who’s On First
If you have feedback, please submit a new issue or discussion topic via Github.
Subscribe to WOF Announcements for “big news”, via Mailchimp. Frequency is a few times per year. Subscribe now and you’ll be first to know about our big India locality data drop!
Join our new WOF Gazetteer discussion group on Google to ask questions and share updates on your country’s latest changes to its internal administrative subdivisions. If you’re a geography geek, this one’s for you.
Need to reach out privately? Email wofgazetteer@gmail.com.
Conclusion
Open gazetteers, including Who’s On First, have come a long way in the last 10 years – largely because of the success of the global open data movement. To the many people who have advocated with your local or national government to “free” this bounty of data, thank you!
Administrative borders are rarely “visible on the ground” and gazetteers that aggregate this data remain relevant by providing important and broad access to this type of map data. Each open gazetteer project has its particular license that reflects origin stories and day-to-day workflows.
At Who’s On First we’re proud to offer our over 5M
administrative places and another 25M
supplemental places – all with CC-BY
attribution – allowing you to both read access and commercial use – with polygons, localization, and internationalization.
Who’s On First holds hands with 360 sources, has great name translations, with unique IDs, and has been confirmed to work in a wide range of applications, several of which have more than 300M
monthly active users globally.
While we can and will add even more data, we think there’s enough in Who’s On First for you to get started making a map or building an app today. We’re excited to see what you build!