whosonfirst-names
What things are called in Who's On First
Names and aliases (and languages) – aka RFC 5646
Who's On First uses a variation of RFC 5646 for identifying names. The W3C's Language tags in HTML and XML page describes RFC 5646 like this:
RFC 5646 caters for more types of subtag, and allows you to combine them in various ways. While this may appear to make life much more complicated, generally speaking choosing language tags will continue to be a simple matter - however, where you need additional power it will be available to you. In fact, for most people, RFC 5646 should actually make life simpler in a number of ways – for one thing, there is only one place you need to look now for valid subtags.
Although it provides some additional options for identifying common language variations, RFC 5646 includes all of the tags that were previously valid. If you have been using RFC 1766, RFC 3066, or RFC 4646 you do not need to make any changes to your tags.
The list below shows the various types of subtag that are available. We will work our way through these and how they are used in the sections that follow.
language-extlang-script-region-variant-extension-privateuse
RFC 5646 (BCP 47) conformance
Sometimes RFC 5646 is referred to as BCP (Best Current Practice) 47.
- We follow the same structure outlined in RFC 5646 but use _ (underbar) characters instead of - (dash) characters for delimiting individual properties of a language identifier.
- We use three-letter language codes (rather than two-letter codes) to identify the primary language.
- The use of either the "script" or "region" subtags is allowed, although neither is required.
- While not explicitly forbidden, neither the "extlang" nor the "variant" subtags are commonly used, and Who's On First tools for parsing name labels may not support them.
- We use private extensions, specifically a -x-[NAME_TYPE] label.
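The rules above can be sketched as a small parser. This is only an illustration under stated assumptions, not the code used by any Who's On First tool: the function name `parse_label` and the regular expression are hypothetical, and the sketch handles just the primary language, an optional script, an optional region and the private-use name type.

```python
import re

# Hypothetical pattern for labels such as "eng_x_preferred" or
# "fre_ca_x_variant". "extlang" and "variant" subtags are not handled.
LABEL_RE = re.compile(
    r"^(?P<language>[a-z]{3})"               # three-letter primary language
    r"(?:_(?P<script>[a-z]{4}))?"            # optional four-letter script
    r"(?:_(?P<region>[a-z]{2}|[0-9]{3}))?"   # optional region
    r"_x_(?P<name_type>[a-z0-9]+)$"          # private-use name type
)


def parse_label(label):
    """Split a name label into its parts; absent parts are None."""
    match = LABEL_RE.match(label.lower())
    if match is None:
        raise ValueError("not a recognised name label: %r" % label)
    return match.groupdict()


print(parse_label("fre_ca_x_variant"))
```

Labels that carry subtags outside this pattern raise a ValueError rather than being guessed at, which mirrors the caveat that parsing tools may not support them.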
RFC 5646 (BCP 47) tag conversion
- Replace the "_" separators with "-"
- You probably want to replace 3-letter language codes with 2-letter language codes
For example
    eng_x_preferred → en-x-preferred
    fre_ca_x_variant → fr-ca-x-variant
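The two steps can be sketched in a few lines of Python. The function name `wof_to_bcp47` and the `THREE_TO_TWO` table are assumptions made for this example; the table is deliberately tiny and hand-written, whereas a real converter would need the full ISO 639 registry.

```python
# Hypothetical, partial lookup table. Codes missing from it (for example
# "und") are passed through unchanged.
THREE_TO_TWO = {
    "eng": "en",
    "fre": "fr",
    "fin": "fi",
    "kor": "ko",
    "jpn": "ja",
}


def wof_to_bcp47(label, shorten=True):
    """Convert a name label to a dash-delimited language tag."""
    parts = label.split("_")
    if shorten:
        # Only the first part is the primary language code.
        parts[0] = THREE_TO_TWO.get(parts[0], parts[0])
    return "-".join(parts)


for label in ("eng_x_preferred", "fre_ca_x_variant"):
    print(label, "->", wof_to_bcp47(label))
```

Passing `shorten=False` keeps the three-letter code and only swaps the separators.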
Code-y bits (and backwards compatibility)
The mapzen.whosonfirst.names Python library provides functions for converting between Who's On First, Geoplanet and RFC 5646 subtags.
Note: When converting to subtags the library does not convert three-letter language codes to two-letter language codes.
For example:
    import mapzen.whosonfirst.names

    lbl = mapzen.whosonfirst.names.labels()
    names = ("fin_p", "eng_s", "unk_v")

    for n in names:
        print(n)

        # Geoplanet label -> Who's On First label
        n2 = lbl.convert(n, 'geoplanet', 'wof')
        print(n2)

        # Who's On First label -> RFC 5646 subtags
        n3 = lbl.convert(n2, 'wof', 'subtags')
        print(n3)

        # ...and back again, to check the round trip
        n4 = lbl.convert(n3, 'subtags', 'wof')
        print(n4)

        n5 = lbl.convert(n4, 'wof', 'geoplanet')
        print(n5)
Would yield:
    fin_p
    fin_x_prefered
    fin-x-prefered
    fin_x_prefered
    fin_p
    eng_s
    eng_x_colloquial
    eng-x-colloquial
    eng_x_colloquial
    eng_s
    unk_v
    und_x_variant
    und-x-variant
    und_x_variant
    unk_v
A short history
In the beginning:
- We had names according to QuattroShapes.
- We re-indexed all the names, aliases and translations from WOE (7.10) and had concordances between WOE and the Gazetteer for many of them. Those we don't have concordances for will simply be imported into the Gazetteer as new records, complete with their names and aliases.
- We had concordances for many places in Geonames, which also has many of its own aliases and translations, sometimes exceeding those of WOE.
WOE defines two properties for a name:
- an ISO 639-3 language code
- a name "type", which is a canned list as defined by the WOE folks:
The Name_Type field is a single letter code that describes the alias as follows:

* P is a preferred English name
* Q is a preferred name (in other languages)
* V is a well-known (but unofficial) variant for the place (e.g. "New York City" for New York)
* S is either a synonym or a colloquial name for the place (e.g. "Big Apple" for New York), or a version of the name which is stripped of accent characters.
* A is an abbreviation or code for the place (e.g. "NYC" for New York)
WOE also distinguishes between a name and an alias, so in their world you end up with something like:
    Name: Montréal
    Language: FRE
    Alias (ENG_P): Montreal
    Alias (KOR_Q): 몬트리올
WOE does not, however, account for the fact that some countries have multiple languages.
With all that in mind, we decided that:
- We should support multiple languages for a place and label placement
- We should just use p for a preferred name, regardless of language
- We should use a name namespace for names because it is explicit (likewise for fullname)
- wof:name is English by default and is a string (rather than a list of strings) that can be used for labels
For example:
    {
        "wof:name": "Montreal",
        "wof:lang": [ "eng", "fre" ],
        "name:eng_p": "Montreal",
        "name:eng_a": "YMQ",
        "name:fre_p": "Montréal",
        "name:kor_p": "몬트리올"
    }
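As a rough illustration of how a consumer might read properties shaped like the example above, here is a minimal sketch that groups the name: keys by language and falls back to wof:name when a language has no preferred name. The helper names `names_by_language` and `label_for` are hypothetical and are not part of any Who's On First library.

```python
# -*- coding: utf-8 -*-

properties = {
    "wof:name": "Montreal",
    "wof:lang": ["eng", "fre"],
    "name:eng_p": "Montreal",
    "name:eng_a": "YMQ",
    "name:fre_p": "Montréal",
    "name:kor_p": "몬트리올",
}


def names_by_language(props):
    """Group every "name:" property by its language code."""
    grouped = {}
    for key, value in props.items():
        if not key.startswith("name:"):
            continue
        label = key[len("name:"):]                     # e.g. "eng_p"
        language, _, name_type = label.partition("_")  # "eng", "p"
        grouped.setdefault(language, {})[name_type] = value
    return grouped


def label_for(props, language):
    """Preferred name in the given language, else the default wof:name."""
    preferred = names_by_language(props).get(language, {}).get("p")
    return preferred if preferred is not None else props["wof:name"]


print(label_for(properties, "fre"))
print(label_for(properties, "deu"))
```

Because wof:name is a single string, it works as a fallback label for any language that has no name:[LANG]_p property.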
But wait, there's more!
One day we met @nyampire, who told us that he had a gazetteer of places published by the Japanese government that contained place names in Kanji, Kana and English. Since Kanji is a script rather than a language, the solution described above doesn't work. So now we're using RFC 5646 and subtags.