Gather round the campfire, children. I want to tell you a little story about Elasticsearch. It’s a story of heartache and pain, as all good old stories are, and a lesson for the ages about the perils of schemalessness gone wrong. In the end, I’ll tell you how Orchestrate righted these wrongs once and for all, but first let me take you back to bygone days, back before we got ourselves into this mess.

Back before Elasticsearch, we built search engines directly on Lucene. It was a compact library, without a REST API or JSON (neither had been invented yet), and there was no such thing as a schema, because the only thing you could store in a document was text. For a vestige of this past, consider that most search still defaults to relevance sorting.

Over the next few years, Lucene added lots of new field types (numeric, boolean, geospatial, and chronological, for starters), and the SOLR project wrapped all those features into a convenient web service. But SOLR also introduced something insidious: the software wouldn’t do anything at all until you defined a schema for all your data. And even then, if you ever wanted to edit your field mappings, you’d have to restart the server and rebuild the whole index.

When Elasticsearch appeared on the scene, it changed everything. Instead of using XML and an application server, Elasticsearch used JSON over REST, with server nodes automatically clustering themselves to balance the load. Most intriguingly, Elasticsearch promised to be ‘schemaless.’ You could deploy a server and start adding documents without having to fuss around with schema definitions and field types.

Of course, you could always set up a schema anyhow (called a “mapping”), but according to their documentation, one of the most important features of Elasticsearch is its ability to be schema-less. Even the case studies on the Elasticsearch website make a mighty big deal about it. But that’s where things went wrong… because Elasticsearch was really just Lucene in disguise!

Elasticsearch uses Lucene internally for all of its core storage and text analysis. And just like Lucene and SOLR, Elasticsearch expects every document in an index to comply with a schema. The only difference is that Elasticsearch will try to figure out your schema by looking at your documents whenever you skip creating it yourself.

Take a look at what happens when I add two documents to a normal Elasticsearch index, each with a different type of value for a field named ‘a’.

curl -XPUT 'localhost:9200/index/document/1' -d '{ "a" : "b" }'
curl -XPUT 'localhost:9200/index/document/2' -d '{ "a" : 1 }'

Looks good. Everything seems to work correctly. We can even ask Elasticsearch for its inferred mapping, to see how it’s treating the field internally:

curl -XGET 'localhost:9200/index/document/_mapping'
{
  "index": {
    "mappings": {
      "document": {
        "properties": {
          "a": { "type": "string" }
        }
      }
    }
  }
}

The response shows that Elasticsearch recognized the ‘string’ type from the first document, adding the observed field to the schema definition. When the second document was indexed, Elasticsearch internally coerced the numeric value (1) into a string (‘1’). No harm done, right?
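If you want to double-check, a couple of quick URI searches (assuming a stock single-node setup on localhost, and once the index has refreshed) should each return the corresponding document, since both values are now indexed as text:

curl -XGET 'localhost:9200/index/document/_search?q=a:b'
curl -XGET 'localhost:9200/index/document/_search?q=a:1'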

But what happens if we delete the index and start over again, this time reversing the order of the incoming documents and submitting the document with the numeric value first?

curl -XPUT 'localhost:9200/index/document/1' -d '{ "a" : 1 }'
curl -XPUT 'localhost:9200/index/document/2' -d '{ "a" : "b" }'

In this case, we don’t get very far. The server returns a 400 Bad Request when we submit the second document:

{
  "status" : 400,
  "error" : "MapperParsingException[failed to parse [a]]; nested: NumberFormatException[For input string: \"b\"]; "
}

Looks like trouble: Elasticsearch is trying to parse the string value ‘b’ as a number and throwing a NumberFormatException when this string can’t be parsed numerically. Sure enough, if we ask for the type mapping, here’s what we get:

{
  "index": {
    "mappings": {
      "document": {
        "properties": {
          "a": { "type": "long" }
        }
      }
    }
  }
}

As you can probably figure for yourself, this is where the whole ‘schemaless’ thing starts to fall apart. You might think you can solve this issue by deleting the first document before trying to reindex the second one, but it turns out that still won’t work:

curl -XDELETE 'localhost:9200/index/document/1'
curl -XPUT 'localhost:9200/index/document/2' -d '{ "a" : "b" }'

We get the same 400 Bad Request error as before, with the same MapperParsingException message. Even though the index is completely empty now, the inferred mapping configuration just won’t go away until you deliberately remove the schema and start from scratch.
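In practice, the most straightforward way out of this hole is to drop the inferred schema along with the index that holds it and reindex from scratch. Against the same local setup as above, that would look something like this:

curl -XDELETE 'localhost:9200/index'
curl -XPUT 'localhost:9200/index/document/2' -d '{ "a" : "b" }'

Deleting the index discards the mapping, so re-adding the document with the string value now succeeds (and, of course, kicks off a fresh round of type inference).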

When I originally uncovered these issues, I got involved on the Elasticsearch mailing list and commented on various issues in the project’s bug tracker. But the core Elasticsearch developers don’t seem to agree that this is even a problem: their position is that fields with the same name in different types usually represent the same ‘thing’, and so should be mapped in the same way; in other words, things should stay as they are.

At Orchestrate, where we’re using Elasticsearch to implement the search API for our multi-tenant, multi-storage-engine database service, this caused more than its share of problems. Over time, our Elasticsearch cluster accumulated thousands of indices and tens of thousands of field definitions, each with its own persistent in-memory and on-disk data structures.

That doesn’t sound very ‘schemaless’ now, does it?

Forcing our users to define schemas (or to manage the schemas inferred by Elasticsearch) is completely antithetical to our simplicity goals. We’re asking them to design their data around structures that work well for Elasticsearch rather than the structures that work well for their application.

And that ain’t right.

Ultimately, we wanted to design an abstraction that would let us truly offer schemaless storage and retrieval, while continuing to use Elasticsearch as a low-level storage and indexing engine.

And so today, I’m happy to announce a new algorithm called the ‘Tuplewise Transform’, which jettisons all the weird baggage from Lucene and Elasticsearch and finally lets us offer a truly schemaless service.

Here’s how it works:

Before storing a JSON document, we transform it into an array of tuples, each with a field name and a value. For example, take a look at this simple JSON document:

{
  "id" : 123,
  "message" : "free ponies!",
  "author" : {
    "id" : "abc",
    "name" : "benji"
  }
}

Processing this JSON with the Tuplewise Transform would yield a new JSON document, which we would index into Elasticsearch like this:

{
  "tuples" : [{
    "field_name" : "id",
    "num_value" : 123
  },{
    "field_name" : "message",
    "text_value" : "free ponies!"
  },{
    "field_name" : "author.id",
    "text_value" : "abc"
  },{
    "field_name" : "author.name",
    "text_value" : "benji"
  }]
}

As you can see, the hierarchy of key/value pairs from the original JSON has been flattened into an array of tuples, each with a field name and a value. We can include two different fields named ‘id’ with different core types (strings and integers) without ever throwing schema conflict errors. This lets us encode every possible JSON document that a user might submit, using only a handful of static field definitions, into an isomorphic representation optimized for the underlying Lucene internals.
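Orchestrate hasn’t published its exact index configuration in this post, but to make ‘a handful of static field definitions’ concrete, the mapping behind the tuples array could be sketched roughly like this (the index name, the not_analyzed field_name, and the choice of double for numeric values are illustrative assumptions, not the production schema):

# illustrative sketch only; index name and field types are assumptions
curl -XPUT 'localhost:9200/tuplewise' -d '{
  "mappings" : {
    "document" : {
      "properties" : {
        "tuples" : {
          "type" : "nested",
          "properties" : {
            "field_name" : { "type" : "string", "index" : "not_analyzed" },
            "text_value" : { "type" : "string" },
            "num_value" : { "type" : "double" }
          }
        }
      }
    }
  }
}'

Mapping tuples as a nested type matters here: it lets a query match a field_name and a value within the same tuple, instead of accidentally matching a field name from one tuple against a value from another.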

Once we’ve transformed the JSON, we never need to infer new field definitions into a permanent schema or check for conflicting types. And since indexing conflicts are a thing of the past, we don’t have to partition customer data into application-specific indices. With the Tuplewise Transform, we can enforce isolation between tenants without having to allocate any permanent in-memory or on-disk resources. Creating new applications incurs no cost whatsoever.
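To give a feel for how that isolation could work (the tenant_key field here is purely hypothetical, not Orchestrate’s actual schema), each stored document could carry a tenant identifier alongside its tuples array, and every query could be wrapped in a term filter on that identifier:

# hypothetical example: "tenant_key" is an assumed field name
curl -XGET 'localhost:9200/tuplewise/document/_search' -d '{
  "query" : {
    "filtered" : {
      "query" : { "match_all" : {} },
      "filter" : {
        "term" : { "tenant_key" : "customer-123" }
      }
    }
  }
}'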

Likewise, a typical Lucene query against this index:

id:[0 TO 99] AND message:hello

…when processed by the Tuplewise Transform, would yield a new Elasticsearch query with clauses matching the Tuplewise representation of the JSON, like this:

{
  "query" : {
    "filtered" : {
      "query" : {
        "nested" : {
           "query" : {
             "bool" : {
               "must" : [{
                 "term" : {
                   "tuples.field_name" : "message"
               }},{
                 "match" : {
                   "tuples.text_value" : "hello"
               }}]
             }
           },
           "path" : "tuples"
        }
      },
      "filter" : {
        "nested" : {
          "filter" : {
            "bool" : {
              "must" : [{
                "term" : {
                  "tuples.field_name" : "id"
              }},{
                "range" : {
                  "tuples.num_value" : {
                    "gte" : 0,
                    "lte" : 99
                  }
              }}]
            }
          },
          "path" : "tuples"
        }
      }
    }
  }
}

It looks complicated, but the logic is straightforward. More importantly, as a user you’ll never see any of these transformations. Before we return your results, we reconstruct the original JSON hierarchy, so your data looks exactly the same as it did before undergoing the Tuplewise Transform. And we make sure these queries live up to our own performance standards.

Generally, the Tuplewise Transform adds only a few milliseconds of overhead: for example, a query that executes in 30 milliseconds on vanilla Elasticsearch will run in about 35 milliseconds on a Tuplewise Elasticsearch deployment.

I told you this story would have a happy ending, didn’t I? The bygone days are truly gone, and we can finally stop fretting over the conflicts of the past. Orchestrate’s new Tuplewise Transform finally lets you index and query any valid JSON data, without giving a moment of thought to schema conformance. It’s the first-ever truly schemaless implementation of Elasticsearch, and we’re proud to make it available today as part of the Orchestrate platform.