Polyglot persistence – using more than one database for a specific application – has emerged as the norm in recent years. From travel giant Orbitz to innovative SEO startup Moz, engineers have chosen utility over simplicity, and reaped the benefits (and downsides) of running multiple databases in production.
This choice has been driven by a number of factors, but the central one is that users like applications with rich experiences. They need to be able to easily search their data, connect with friends, and find relevant places near them.
Simultaneously, an abundance of new databases has emerged to fill various storage, querying, and scaling niches, so polyglot persistence is no longer the domain of just the biggest web properties.
Though the term is generally credited to Scott Leberknight, it was Martin Fowler of ThoughtWorks who popularized the concept of polyglot persistence. He defines it as “using multiple data storage technologies, chosen based upon the way data is being used by individual applications.” It’s a pragmatic notion that immediately seems obvious.
Fowler goes on to say that any organization (or application, for that matter) will require “a variety of different data storage technologies for different kinds of data.” Fowler employs the term as a way to organize the trend associated with the explosion of new databases.
In his explanation of polyglot persistence, Fowler doesn’t identify the main driving force behind the current push to polyglot systems. The growth in users’ expectations of what applications should do is driving developers to adopt multiple querying and storage technologies.
Any nontrivial multi-platform application ingesting data of varying formats and scopes requires differentiated querying modes. These modes include key-value lookup, full-text search, geolocation, graph traversal, and time-ordered events.
This pattern can be seen in many common applications. Take Foursquare as an example. It integrates social networking (graph traversal) with geolocation and faceted search. It coalesces check-ins, comments, and reviews based on time and place. The functionality of other common apps like Yelp, Yammer, and Twitter exposes the same pattern: a variety of data types alongside a variety of query modes.
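To make the idea of differing query modes concrete, here is a minimal Python sketch that answers five kinds of questions over the same small set of check-ins. Everything in it is an assumption made for illustration: the records, the names (`CHECKINS`, `FRIENDS`, `search`, `near`, `friends_of_friends`, `timeline`), and the in-memory data structures, which stand in for the specialized databases a production system would use. It is not how Foursquare or any other app mentioned here is implemented.

```python
from collections import defaultdict
from math import asin, cos, radians, sin, sqrt

# Hypothetical check-in records, invented for illustration only.
CHECKINS = [
    {"id": "c1", "user": "ana", "place": "Blue Door Cafe",
     "lat": 45.5231, "lon": -122.6765, "ts": 100,
     "note": "great espresso and pastries"},
    {"id": "c2", "user": "ben", "place": "Corner Books",
     "lat": 45.5230, "lon": -122.6810, "ts": 200,
     "note": "quiet reading room"},
    {"id": "c3", "user": "cy", "place": "Harbor Grill",
     "lat": 47.6062, "lon": -122.3321, "ts": 300,
     "note": "espresso was fine"},
]

# Hypothetical friendship graph stored as an adjacency list.
FRIENDS = {"ana": {"ben"}, "ben": {"ana", "cy"}, "cy": {"ben"}}

# 1. Key-value lookup: a plain hash map keyed by check-in id.
BY_ID = {c["id"]: c for c in CHECKINS}

# 2. Full-text search: a tiny inverted index from word to check-in ids.
INDEX = defaultdict(set)
for c in CHECKINS:
    for word in c["note"].lower().split():
        INDEX[word].add(c["id"])


def search(term):
    """Return ids of check-ins whose note contains the exact word."""
    return sorted(INDEX.get(term.lower(), set()))


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))


# 3. Geolocation: a linear scan here; a real store would use a spatial index.
def near(lat, lon, radius_km):
    """Return ids of check-ins within radius_km of the given point."""
    return sorted(c["id"] for c in CHECKINS
                  if haversine_km(lat, lon, c["lat"], c["lon"]) <= radius_km)


# 4. Graph traversal: a two-hop walk over the friendship graph.
def friends_of_friends(user):
    """Return people two hops away who are not already direct friends."""
    direct = FRIENDS.get(user, set())
    second = set()
    for friend in direct:
        second |= FRIENDS.get(friend, set())
    return second - direct - {user}


# 5. Time-ordered events: newest check-ins first for a set of users.
def timeline(users):
    """Return ids of the given users' check-ins, most recent first."""
    mine = [c for c in CHECKINS if c["user"] in users]
    return [c["id"] for c in sorted(mine, key=lambda c: c["ts"], reverse=True)]
```

Each function needs a different index or access path (a hash map, an inverted index, distance math, an adjacency list, a sort on time), which is a small-scale hint at why a single storage engine rarely serves all of these queries well.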
Databases have emerged to handle specific query types. The maturation of Lucene has put sophisticated search into the hands of average developers. The development of BigTable and Dynamo clones has introduced KV-style stores with horizontal scaling properties to the masses. Graph stores like Neo4j have introduced a different kind of persistence and querying model to many developers.
Unfortunately, no individual database can handle all of the query types that drive rich user experiences.
Armed with a handful of new tools, companies have built polyglot systems to contend with the expectations of their users. Imgur is a simple photo hosting and sharing site. Even with relatively basic needs, it uses five different persistent stores, including Redis, ElasticSearch, and MySQL (via Amazon RDS). Facebook is famous for its use of MySQL, but it also uses HBase, Lucene, Memcached, and HDFS. Klout, the social impact analysis app, uses HBase, MySQL, ElasticSearch, MongoDB, and HDFS.
The lesson is clear: in order to build compelling apps for users, developers require multiple ways to store and query data. They need tools to deliver relevant content to users based on context: where they are, what they like, and who they know. To do this, developers must lean on a number of different persistence and querying technologies. This requirement is driving the construction of polyglot systems.
At Orchestrate, we’ve taken this lesson to heart. We also know that polyglot persistence is not an easily achievable goal for all developers. The cost and complexity of running multiple databases in production can be a distraction for engineers trying to build great apps.
We believe that by providing developers with an API that supports multiple persistence and querying modes, we can offer the upsides of polyglot persistence without the pain of running multiple databases. I will be writing more on this topic, as well as on APIs and design concepts, in the coming months.
– Ian Plosker @dstryallmodels