The Why of the How

This is a blog post by thisisaaronland that was published on Feb 27, 2018 and tagged elasticsearch, go, python, spelunker, whosonfirst and why-of-the-how.

This is a bit of an accidental blog post.

I mentioned Gary Gale briefly in the last blog post. Gary and I go back to Flickr / Y!Geo / GeoPlanet (née Where On Earth) days and lately he’s been spending some time poking at all the tools and data that make up Who’s On First.

What I love about Gary is that he has the diligence and the patience to find all the outstanding questions and gotchas in a project. Gary has been setting up a local instance of the Spelunker and he sent me a long list of questions about how the Spelunker does and doesn’t work with newer versions of Elasticsearch. The gist of his email being:

The WOF Spelunker is currently based on ES 2.4.6 and there’s been a lot of changes that I’ve had to accommodate

I still can’t get the punctuation filter working properly. Creating a new index with this in the schema dies with this …

One thing that’s caused significant headaches is that there’s a lot of type inconsistencies in the WOF GeoJSON, with fields that /should/ be numeric being specified as strings. As I’m bulk indexing, 1000 at a time, there’s no guaranteed order (or shard guarantee) for the documents to be processed in so I was getting a lot of mapping mismatches.

For now I’ve fixed this by coercing the problematic fields into their integer values so the mappings are dynamically created consistently. Specifically this bites when processing the properties.wk:count, properties.qs:id and properties.wof:concordances.qs:id fields. Also oddly, properties.qs:id always seems to be zero whereas properties.wof:concordances.qs:id isn’t?

I did check in but it looks like these documents only cover the minimum viable WOF schema and not all the fields available. Either that or it’s just out of date.

I have included an abbreviated version of my reply below because, aside from answering some of Gary’s immediate questions, it ended up touching on a lot of other interrelated decisions and trade-offs in both the design and the implementation of Who’s On First that probably haven’t been discussed or documented as well as they should be.

One of the things I’ve taken to saying in recent years is that: “Sometimes we make mistakes because of circumstance and sometimes we make bad decisions because of reasons… so please just write those reasons down somewhere.”

I’d like to believe we’ve done a pretty good job of that in the Who’s On First codebase itself but inasmuch as a large number of people are never going to read those comments it seems like a useful practice to pursue here on the blog as well. If other people have similar questions we’d be happy to answer them and maybe we will start a general “why of the how” series of posts.

In the meantime, here’s what I said to Gary:

Did I mention I have #feelings about Elasticsearch (ES) ?

Everything you’ve just described is one reason we’re still using 2.4. It was enough of a time-sink just migrating from 1.x to 2.x that there’s never been a compelling reason to upgrade again beyond “3 (or 5 in the case of Elasticsearch, which skipped versions 3 and 4) is bigger than 2”.

Worse, I began to have little confidence that updating from version x to version y wasn’t just going to be time wasted because as soon as I was done I would have to upgrade to version z.

I realize this is pretty much the defining characteristic of “modern software development” but that doesn’t make it right.

I suppose if Elastic are already talking about “version 8” it might be worth the effort but it’s hard not to feel grumpy about it all. Anyway, all the ES stuff is kept here:

The first thing I would suggest are branches and PRs. Even just creating a milestone and outlining all the migration issues would be useful. All of the code to index ES is kept here:

And here, which is just a thin non-WOF/Spelunker-specific base class which wraps the handful of HTTP requests we make to ES, because every ES wrapper library out there tries to do all the things and quickly becomes more trouble than it’s worth:

In the beginning we relied on the supposed “schemaless magic” of ES to index heterogeneous documents but very quickly bumped in to the thing where ES tries to be clever and then shoots itself in the face when it comes to strings versus ints and of course EDTF date strings. Because there was never time to go on a prolonged ES vision quest I just handled it all in code.
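For illustration, here is a minimal sketch of what "handling it in code" can look like for the strings-versus-ints problem. It is not the actual Who’s On First indexing code: the function names and the two tuples of field names are hypothetical, and the field names themselves are just the ones mentioned in Gary’s email. Dropping values that cannot be parsed is also an assumption, made here so that a stray string never decides the type of a dynamically created mapping.

```python
# Hypothetical lists of fields that should always be indexed as integers.
# The names are taken from the email above; the lists themselves are assumptions.
INTEGER_PROPERTIES = ("wk:count", "qs:id")
CONCORDANCE_INTEGER_PROPERTIES = ("qs:id",)


def coerce_int(value):
    """Return value as an int, or None if it is not an int or an integer string."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python; do not treat True/False as numbers
        return None
    if isinstance(value, int):
        return value
    if isinstance(value, str):
        try:
            return int(value.strip())
        except ValueError:
            return None
    return None


def coerce_fields(props, keys):
    """Return a shallow copy of props with the named keys coerced to int.

    Values that cannot be coerced are dropped from the copy.
    """
    out = dict(props)
    for key in keys:
        if key not in out:
            continue
        coerced = coerce_int(out[key])
        if coerced is None:
            del out[key]
        else:
            out[key] = coerced
    return out


def prepare_properties(props):
    """Coerce known-numeric top-level and wof:concordances fields before indexing."""
    prepared = coerce_fields(props, INTEGER_PROPERTIES)
    concordances = prepared.get("wof:concordances")
    if isinstance(concordances, dict):
        prepared["wof:concordances"] = coerce_fields(
            concordances, CONCORDANCE_INTEGER_PROPERTIES
        )
    return prepared
```

Calling prepare_properties on a record's properties dictionary before it is sent to the index returns a new dictionary and leaves the original untouched, so the same record can still be written back to disk as-is.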

Eventually we made some of that code redundant with better schema definitions:

The fact that all the indexing code is in Python is largely an historical artifact of having started in Python (the correct choice). I do not love having to “prepare” documents and in some ideal world we could just throw the properties dictionary at an ES index and get on with more exciting things.

I would also like to be able to use Go to index ES because it is simply faster and we could distribute pre-compiled binary tools, because Python’s dependency hoohah is reliably sad-making. The combination of constructing free-form JSON documents in Go, writing all the scaffolding code to deal with ES requests/responses in Go and just limited time means it hasn’t happened yet.

This is what I meant when I said we use Go as much to test decisions/assumptions around the code, with a strictly typed language, as we do for its speed and muscle.

I see this inability to write a WOF -> ES thingy in Go, quickly and with only a manageable amount of burden, as a shortcoming and something to be addressed going forward. We have a couple of other places where we’ve painted ourselves in to a language-specific corner (Python) and it’s useful only in that it illustrates how not to do things in the future.

WOF can’t support all the languages itself but it definitely shouldn’t make [ insert language here ] a requirement for common things… like indexing a database. Today it does.

The JSON Schema stuff was an early attempt to see if we could enforce a certain amount of consistency and quality control around document types without making the mistakes that XHTML 2.0 made around strict-iness (who remembers XHTML 2.0 right…?) It was also an attempt to see if it could be used to automate some parts of Boundary Issues (the editorial tools), which are written in PHP/Flamework and JavaScript, without going Full Metal XForms (no offense to Micah) about everything.

There’s a whole other discussion about how Boundary Issues has to wrangle and round-trip documents and forms and UI/UX behaviour based on property types and ACLs but that’s a conversation for another day.

I think, as I write this now, that I had faint hopes of being able to generate the ES schemas from the JSON schemas, or at least use the latter as a starter kit for the former. In the end the JSON Schema stuff never warranted the time to prove or disprove or, more specifically, it was going to take too long to prove or disprove itself either way and we didn’t have the luxury of finding out.
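To make the "starter kit" idea concrete, here is a rough sketch of deriving the "properties" part of an ES mapping from a JSON Schema object definition. Nothing like this exists in the Who’s On First repos as far as this post says: the function name, the type table and the string_type parameter are all assumptions. The string type is left as a parameter because ES 2.x uses "string" while later versions split it into "text" and "keyword".

```python
# Hypothetical lookup from JSON Schema primitive types to ES field types.
JSON_TO_ES_TYPES = {
    "integer": "long",
    "number": "double",
    "boolean": "boolean",
}


def es_properties_from_json_schema(schema, string_type="keyword"):
    """Build an ES mapping fragment from a JSON Schema "object" definition.

    Only single, primitive types and nested objects are handled. Arrays,
    union types (a list of types) and untyped properties are skipped, which
    leaves them to whatever dynamic mapping the index is configured with.
    """
    properties = {}
    for name, definition in schema.get("properties", {}).items():
        json_type = definition.get("type")
        if not isinstance(json_type, str):
            # e.g. {"type": ["integer", "string"]} or no type at all
            continue
        if json_type == "object":
            properties[name] = es_properties_from_json_schema(definition, string_type)
        elif json_type == "string":
            properties[name] = {"type": string_type}
        elif json_type in JSON_TO_ES_TYPES:
            properties[name] = {"type": JSON_TO_ES_TYPES[json_type]}
    return {"properties": properties}
```

The skipped cases are where the real work would be: a property that is declared as "integer or string" in the JSON Schema is exactly the kind of field that causes mapping mismatches at index time.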

I am totally open to the idea that we might be able to revisit JSON Schema with more success now but it does point out one of the built-in tensions around a dataset like WOF: Namely that, outside of a handful of core/common/required properties which should enforce strict-iness, in any dataset as large as ours there will be type inconsistencies. It’s not ideal but it is also entirely realistic to expect and it’s not clear to me that a random qs:whatever property should cause a fatal error that prevents the casual user from getting started.

A good example of this is the nightmare around LFS and large geometries (hello, New Zealand) which is why we’re talking about making the default/common geometries in records a max of 2-10 MB and moving all the “ground truth” geometries in to dedicated “you will need to use LFS” repos.

It’s also the motivating factor behind the “standard places response” (modeled after Flickr’s standard photo response) to try and identify what the strictly enforced properties in a record are and, by extension, what we and consumers should “be liberal” about:

See the way the definition for the Id() method returns a string instead of an integer? Yeah… something something something other data sources something something something…
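The strict-about-the-core, liberal-about-everything-else split described above can be sketched in a few lines of Python. This is an illustration only and not how Who’s On First validates records: the choice of core properties, the function names and the errors/warnings split are all assumptions.

```python
# Hypothetical set of strictly enforced properties and their expected types.
CORE_PROPERTIES = {
    "wof:id": int,
    "wof:name": str,
    "wof:placetype": str,
}


def matches_type(value, expected):
    """isinstance() check that does not let booleans pass as integers."""
    if isinstance(value, bool):
        return expected is bool
    return isinstance(value, expected)


def check_record(props, other_types=None):
    """Return (errors, warnings) for a record's properties dictionary.

    Problems with core properties are errors and should stop processing.
    Problems with any other property listed in other_types are warnings,
    so a stray qs:whatever value never prevents someone from getting started.
    """
    errors = []
    warnings = []
    for key, expected in CORE_PROPERTIES.items():
        if key not in props:
            errors.append("missing required property %s" % key)
        elif not matches_type(props[key], expected):
            errors.append("%s should be %s" % (key, expected.__name__))
    for key, expected in (other_types or {}).items():
        if key in CORE_PROPERTIES or key not in props:
            continue
        if not matches_type(props[key], expected):
            warnings.append("%s should be %s" % (key, expected.__name__))
    return errors, warnings
```

A caller would refuse to index a record only when the first list is non-empty and would log the second list somewhere a human might eventually read it.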

To answer your immediate question of “Is this stuff all documented somewhere?” the answer is yes, here:

Some of those records are incomplete but there should at least be a placeholder for all the properties. Ultimately I would like for all those machine readable documents to be used to generate human-readable documentation and code-level sanity checking (like we already do for placetypes and sources) but that is not the case today:

On top of all this, I am just going to assume that all the query syntax for ES 6-8 has changed as well? Which means both the Spelunker and the API will need to be updated.

In case you’re wondering I do periodically question whether or not we should just define all our ES queries as language-agnostic templates but after a few moments of hating myself move on to other things.

So. That’s maybe more than you were hoping for, by way of answers, but welcome to my world.

It seems like it’s time to spend some energy on ES 7. Presumably there is no point in stopping at 6 if 8 is already being discussed? I would start by making a new branch and version-specific folders in:

And then updating the Python libraries accordingly. There is nothing precious in them so most things are fair game, short of embarking on a wholesale Python 2 -> 3 rewrite (which is in the cards, but not right now…) Moving as much of the type-checking hoohah as possible out of the Python code and in to the schema would be a good thing but I have a feeling expediency will dictate keeping some of the former.

The subtext here is that I too would like to have a dataset where we can enforce type-ish consistency across the board and where the penalty for stopping and fixing one bad record isn’t the potential of having to repeat the process ad infinitum. Our burden, I think, will be to live in the interim working towards that better world…

The first step should just be “indexing the data” and adjusting the schemas and “prepare” code as the circumstances demand. After that we can sort out / update all the query nonsense. If we need to sacrifice the emoji support in the short-term then that’s probably the right thing, although it is pretty cool to be able to do this:

Good times…

One immediate side-effect of the email thread with Gary is that there are now tools to crawl all the Who’s On First records and ensure that each of their properties has a corresponding record in the whosonfirst-properties repo and Gary is writing code to test those files.

Each of those records still needs to have descriptive metadata added and some of the records will be bunk and in need of being superseded or deprecated but at least it’s progress.
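The crawl itself is conceptually simple. The sketch below is not the actual tooling: the function names are made up and it sidesteps the layout of the whosonfirst-properties repo entirely by assuming the documented property names have already been loaded in to a set. It only shows the shape of the check, which is to collect every property key used in a tree of GeoJSON files and report the ones with no corresponding record.

```python
import json
import os


def iter_property_names(root):
    """Yield every key in the "properties" of each .geojson file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if not filename.endswith(".geojson"):
                continue
            path = os.path.join(dirpath, filename)
            with open(path, encoding="utf-8") as fh:
                feature = json.load(fh)
            for name in feature.get("properties", {}):
                yield name


def undocumented_properties(root, documented):
    """Return a sorted list of property names used under root but not documented.

    documented is any iterable of property names that already have a record;
    how that list is built from the properties repo is left out of this sketch.
    """
    seen = set(iter_property_names(root))
    return sorted(seen - set(documented))
```

Every file is parsed in full, so this is slow on a large checkout, but it is the sort of thing that only needs to run occasionally.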

Tiny steps may be tiny but they are still forward momentum so “Onwards!” and all that good stuff…