The concept of a primary key in a relational database is an interesting subject because in almost all examples online, in every blog post and tutorial, you’ll see the primary key be defined as an auto-incrementing ID.


This field is created and managed by the database. Its purpose is to uniquely identify the record, but the value itself has no semantic relationship associated to the rest of the data in that row.

In SQL storage engines, auto-incremented IDs are a nice feature. They're created and managed for you. They’re guaranteed to be monotonically increasing. This ordering is important as it ensures the items in a table are insertion-ordered and, therefore, clustered on disk based on time of insertion. This is a useful property when data generated at the same time is generally accessed at the same time.

Unfortunately, distributed storage engines (e.g., NoSQL technologies, like those that power Orchestrate) generally don’t provide auto-incrementing primary keys. This is because auto-incrementing IDs are hard in distributed systems. They require a coordination system to maintain the property of monotonicity. Further, insertion order is difficult to define in a distributed environment largely due to clock skew.

(Aside: Google’s Spanner has a novel solution to this problem, something called “TrueTime”. “TrueTime” is a time API that is built with the understanding that the exact time is uncertain. When calling the now() method, a range is returned with the lower bound being the earliest time it could be and the upper being the latest it could be. This interval is the window of uncertainty. Google minimizes this range by using atomic clocks and GPS for clock synchronization.)

While auto-incrementing IDs are a nifty feature, they require a level of coordination in a database that doesn't play well with the kind of distributed storage solutions being built today. It’s also worth noting that over their years of use in SQL databases, they’ve become somewhat of a crutch when designing the data model for a system.

In Orchestrate, we do plan to add server-generated IDs to the service soon. You can find the UserVoice for the feature here (please upvote it if you need it), but in the meantime it’s worth noting that, in many cases, clever naming strategies for your keys can serve you much better than auto-increment IDs. It’s worth asking yourself, is the insert-ordering for this type of data necessary?

The “Natural Key”

A natural key is a term we sometimes use to describe a field that is the obvious and most common identifier for a piece of data. For example, if we imagine a user document of some kind it is likely to have a common set of fields like so:

  "uid": "some UUID",
  "nickname": "jsmith",
  "fullname": "John Smith",
  "email": "[email protected]",
  "username": "orcy",
  "password": "some hashed value",
  "created_at": 1394019038077,
  "verified": true

In the document above we have three natural keys, any of which could be used to uniquely identify that document. The “email” field, “uid” field and the “username” field. An email address is a piece of information that is guaranteed to be unique to each user because everybody has their own email address. We know, based on the purpose of the data in those fields, they are not going to be the same between user documents. They're unique to a particular user’s record.

In fact, even in RDBMS’s, developers sometimes should be using a natural key as their primary key instead of an auto-incremented ID. This will lead to better performance when the natural key is the most commonly used identifier, but is often not considered out of habit.

An important factor for deciding on a key is that it should be something that will never change. Your data model will likely involve a graph of relationships between objects. These relations will include the keys for the items involved (foreign keys in RDBMS). If you choose a field whose value may change for an item, then migrating those related items when the field changes may be complex and expensive. In the example above, the email or username may not be a good choice if users are allowed to change either value.

A natural key is often a better fit for data than what is typically chosen as the primary key, but they are not a panacea in this problem space. What if you have a use case that benefits from particular sets of keys being grouped together? For this case you should consider a “composite key”.

The Composite Key

Like a natural key, a composite key can be built from a group of fields in a document or from other categorization information about the dataset as a whole. Composite keys are very useful when you have a use case that benefits from clustered keys and you have a database that provides ordered access. Orchestrate is an example of such a system.

If we go back to our user document, we could build a composite key by taking the “createdat” field, a separator character like “”, and prefixing them to the “username” field. From the example above this would give us the key “1394019038077_orcy”. We’re now able to take advantage of the semantics of this key format to ensure that user records are stored in the order that they were created at. Keys with similar timestamps are clustered together.

We can also use composite keys to great effect as a namespacing strategy. For example, let’s take a database that supports key filtering. With it we could create a composite key format for an IoT sensor device something like this: “{deviceID}{sensorID}{timestamp}” and query over it in useful ways. We could filter to get all documents for the keys with a specific “deviceID” or filter further and request all documents with a particular “deviceID” and “sensorID”. We could even make use of the timestamp field and filter down to documents that only match a particular point in time going forwards.

Orchestrate supports key filtering in our API via the KV list query. We could perform one of those queries like this:

curl -X GET "{deviceId}_{sensorID}_"
        --user "{yourAPIKey}:"

Composite keys are a powerful key format that builds on top of the concept of a natural key by encoding more information about a document into a key, making it possible to filter and query the data efficiently in interesting ways.

Despite the options for avoiding auto-incremented IDs, there are use cases that will require this kind of ID generation strategy in a distributed system. As mentioned above this is a hard problem to solve, but there are a few solutions out there.

“Roughly” Insertion-Ordered Keys

Back in 2010, Twitter reached a scaling problem with their use of MySQL. They were in the process of switching databases and realized that, in the new database, they would not be able to rely on auto-incremented IDs. They'd relied on it up until that point to ensure that tweets posted around the same time had IDs that indicated as much.

In their words, “We needed something that could generate tens of thousands of ids per second in a highly available manner. This naturally led us to choose an uncoordinated approach.” In pursuit of a suitable replacement for the monotonicity guarantee provided by auto-incrementing IDs, they developed a service called Snowflake. You can read more about the rationale for this service in their blog post.

It's possible to generate snowflake-like keys without much effort. Snowflake keys are 64 bits. The first 41 bits represent the time in milliseconds. The next 10 bits are a machine identifier. The 12 bits a sequence ID. You can easily build a similar solution using epoch timestamps, a unique machine identifier, and a counter that rolls-over. These keys will not be as compact as snowflakes’, but will maintain the same “roughly” time ordered property.

Next Steps

Data modeling can be difficult. Choosing the right keys can be a challenging process. If you’re experimenting with Orchestrate or would like help with the design of your data model, you can get in touch with us very easily through the Dashboard. We’re happy to help in any way we can. We also have a Community channel where you can find us throughout the day or by email at [email protected]. We’re happy to answer questions, look at code, and point you to resources.

Most importantly, have fun building great things with Orchestrate. By the way, we're live now in the CenturyLink Cloud. Sign up today for Orchestrate at