NPM, or Node Packaged Modules, makes the Node.js community’s package ecosystem possible. Most Node.js developers know NPM as the command-line utility used to manage dependencies, publish packages, and run scripts, but behind the scenes NPM Inc. runs a beastly infrastructure that serves more than 1,500 requests per second.

Every time you run npm install orchestrate, you’re making a series of requests to NPM’s infrastructure to deliver the orchestrate.js package and all of its dependencies. And it’s a whole ton of requests, too: getting orchestrate.js takes 31 of them. Many of those are for dependencies stored as large blob files, and yet installing the orchestrate.js package takes just over two seconds.


As an experiment, I wanted to see if I could replicate NPM’s package metadata (that is, not the packages themselves) into Orchestrate, to see what scaling concerns emerged from handling such an enormous and volatile dataset. For example:

  • Node packages store metadata in a package.json file with a very loose schema, potentially making it hell for ElasticSearch to index effectively.
  • Although NPM has (as of this writing) only 83,432 packages, its changes feed contains many times that number of changes, so catching up to NPM’s current state could be an ordeal.
  • Normally, one replicates from NPM using CouchDB’s replication protocol; I wanted to see if I could create a robust equivalent for Orchestrate. Well, I did it. You can check out the Node.js program on GitHub.

What did it take, and how did it turn out?

Normalization

When you first insert an item into an Orchestrate collection, we construct a schema based on the item’s properties, and index subsequent objects based on this schema. Orchestrate will generate a conflict if a known field ever contains an unexpected type. New fields are absorbed into the schema, but adding many fields (as in thousands) will cause ElasticSearch to stutter and even crash. This is because ElasticSearch loads these schemas into memory as part of its indexing process. If a schema is too big to store in memory, bad things will happen.
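Package metadata runs into this constantly. The repository field, for instance, has historically appeared both as a plain string and as an object, so an early package like this (both package names here are made up)…

{ "name": "package-a", "repository": "git://github.com/example/package-a.git" }

… would teach the schema to expect a string, and a later package like this would generate a conflict:

{
    "name": "package-b",
    "repository": {
        "type": "git",
        "url": "git://github.com/example/package-b.git"
    }
}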

Luckily, there’s npm-normalize to normalize metadata so it’s easier to index. npmsearch uses it to format changes from NPM before indexing them in its own ElasticSearch cluster. But I had to write two more normalizers of my own, covering the time, times, and scripts fields, to turn fields like this…

"time": {
    "0.0.1": "2011-03-28T21:55:11.207Z",
    "0.0.2": "2011-03-29T04:34:10.646Z",
    "0.0.3": "2011-04-03T04:01:34.520Z",
    "0.0.4": "2011-04-24T21:26:05.885Z",
    "0.0.5": "2011-05-18T04:44:12.230Z",
    "0.0.6": "2011-05-18T04:52:03.856Z",
    "0.0.7": "2011-06-14T06:17:21.903Z",
    "0.0.8": "2011-08-14T01:40:29.803Z",
    "0.0.9": "2011-08-14T09:52:39.879Z",
}

… into this:

"time": [
    { "version": "0.0.1", "date": "2011-03-28T21:55:11.207Z" },
    { "version": "0.0.2", "date": "2011-03-29T04:34:10.646Z" },
    { "version": "0.0.3", "date": "2011-04-03T04:01:34.520Z" },
    { "version": "0.0.4", "date": "2011-04-24T21:26:05.885Z" },
    { "version": "0.0.5", "date": "2011-05-18T04:44:12.230Z" },
    { "version": "0.0.6", "date": "2011-05-18T04:52:03.856Z" },
    { "version": "0.0.7", "date": "2011-06-14T06:17:21.903Z" },
    { "version": "0.0.8", "date": "2011-08-14T01:40:29.803Z" },
    { "version": "0.0.9", "date": "2011-08-14T09:52:39.879Z" }
]

The difference is that an object with an indeterminate number of fields (the first example) is hard to index, while the array (the second) is easy. Not to mention that a question like “which packages have ever had a 1.0.0 release?” is impossible to answer in the first case; in the second, you just query value.time.version:1.0.0.
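For the curious, the time normalizer boils down to a few lines of JavaScript. This is a minimal sketch of the idea rather than the exact code in npm-orchestrate, and the function name is my own:

// Turn the registry's { version: date } map into an array of
// { version, date } objects with a fixed, indexable shape.
function normalizeTime (doc) {
  if (!doc.time) return doc;
  doc.time = Object.keys(doc.time).map(function (version) {
    return { version: version, date: doc.time[version] };
  });
  return doc;
}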

As npmsearch has discovered, oversized ElasticSearch mappings are a reliable source of problems, and we’ve been there too. Normalizing our data ensures our mappings never get too big, so we never hit that class of problem.

Replication

Every time a package is published, updated, or deleted, NPM logs that as a change. Since we’ll be staying in sync with NPM using its changes feed, we’ll have to process every one of the hundreds of thousands of changes that NPM has ever accepted. To catch up, we’ll want to process them as quickly as possible. Can Orchestrate handle it? Yep!

Though to be fair, npm-orchestrate only processes one change at a time, or about 5-10 changes per second. Orchestrate handled that just fine. To further ensure I didn’t cause our operations team too much strife, npm-orchestrate quits at the first sign of trouble. Restarting it doesn’t lose progress, so a tool like forever could intelligently keep npm-orchestrate syncing, well, forever.
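At its core, that loop is just a paused-and-resumed CouchDB changes feed. Here’s a rough sketch of the idea using the follow module, npm-normalize, and the orchestrate.js client; the registry URL and the npm collection name are my own choices, and the real npm-orchestrate code differs in its details:

var follow = require('follow');
var normalize = require('npm-normalize');
var db = require('orchestrate')(process.env.ORCHESTRATE_API_KEY);

var feed = new follow.Feed({
  db: 'https://skimdb.npmjs.com/registry', // a public CouchDB endpoint for the registry's metadata
  since: 0,                                // in practice, the checkpointed sequence value (see below)
  include_docs: true
});

function quit (err) {
  // quit at the first sign of trouble; a supervisor like forever restarts us
  console.error(err);
  process.exit(1);
}

feed.on('change', function (change) {
  feed.pause(); // one change at a time

  var write = change.deleted
    // deleted packages get removed rather than written
    ? db.remove('npm', change.id)
    // npm-normalize handles most fields; the extra time, times, and
    // scripts normalizers described above would also run here
    : db.put('npm', change.id, normalize(change.doc));

  write
    .then(function () { feed.resume(); })
    .fail(quit);
});

feed.on('error', quit);
feed.follow();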

To ensure restarts, crashes, and any other problems don’t lose progress, we use a second collection called checkpoints, which we update after every change with that change’s sequence value. That way, when npm-orchestrate restarts, it consults the latest sequence value and starts from there, much like how CouchDB uses checkpoints in its replication protocol. If npm-orchestrate fails to update its sequence value, it rolls back the last change and quits. If updating a record fails, it doesn’t update the sequence value, and quits. In both cases, restarting npm-orchestrate picks up from just before the last error.
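In code, that bookkeeping is tiny. A simplified sketch, again using the orchestrate.js client; the checkpoints collection matches the description above, but the key and the shape of the stored document are my own:

var db = require('orchestrate')(process.env.ORCHESTRATE_API_KEY);

// Look up the last recorded sequence value, or start from 0 on a fresh run.
function getCheckpoint () {
  return db.get('checkpoints', 'npm')
    .then(function (res) { return res.body.seq; })
    .fail(function () { return 0; });
}

// After each successfully processed change, record its sequence value.
// If this write fails, npm-orchestrate quits and the change is replayed on restart.
function saveCheckpoint (seq) {
  return db.put('checkpoints', 'npm', { seq: seq });
}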

Once npm-orchestrate caught up with NPM’s latest state, it went from making 5-10 writes per second to only one every few seconds, in keeping with the writes NPM’s registry processes in realtime. When that happened, nothing changed internally: it continues to consume NPM’s changes feed like always, processing new packages, updating its checkpoints, and so on.

I’m proud it works, but more could be done to make the replication process robust. Consider the current version a proof of concept.

Querying

Now that we’ve got all of NPM in Orchestrate, we can run search queries against it. For example, how many projects does substack maintain? 404! How about with an MIT license? 384!
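Both numbers come from ordinary Lucene-style queries against the collection. Here’s a quick sketch with the orchestrate.js client; it assumes the normalized documents keep maintainers and license where the registry puts them, and that the collection is named npm:

var db = require('orchestrate')(process.env.ORCHESTRATE_API_KEY);

// Everything substack maintains...
db.search('npm', 'value.maintainers.name:substack')
  .then(function (res) {
    console.log(res.body.total_count + ' packages');
  });

// ...and just the MIT-licensed subset.
db.search('npm', 'value.maintainers.name:substack AND value.license:MIT')
  .then(function (res) {
    console.log(res.body.total_count + ' MIT-licensed packages');
  });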

These queries aren’t exactly revolutionary compared to the complex scoring algorithms npmsearch uses, but I’m working on it. Check back soon :D

Conclusions

After some initial work on normalization, npm-orchestrate was able to sync all of the NPM registry’s metadata, effectively giving us full-text search over the registry without writing any search code.

Though to be frank, as easy as this was, it shouldn’t even be this hard. You shouldn’t have to consider the intricacies of your databases to build your applications. That’s the whole reason Orchestrate exists. Over the next few months, we’ll be changing how we index data for search to avoid these problems, so even what normalization we performed may not be necessary.

To do all this yourself, follow these steps:

  1. Sign up today!
  2. Create an application and copy its API key
  3. In a terminal, npm install -g npm-orchestrate
  4. export ORCHESTRATE_API_KEY=YOUR_API_KEY
  5. npm-orchestrate
  6. Enjoy your very own collection of NPM metadata :D

Happy coding!