NPM is wonderful. It’s an indispensable tool. Discovery, however, isn't as transparent as you might like it to be. So let’s take advantage of Orchestrate’s powerful search functionality to index NPM and join it with Github data. That way, we can search for packages and easily pick the best one for our applications. We described the approach in an earlier post, but in this one we'll start to explain how we built it.

Scout JS example search

We’ll use a few different technologies to make this work.

Check out the final application here:


We start by pulling in all the packages from NPM. Luckily, this is pretty easy because NPM runs a public CouchDB instance that we can replicate all the data from.

  • We use Follow to get a feed of the data and then add each item to Orchestrate.
var follow = require('follow');
var feed;

function getFeed (seq) {
  feed = new follow.Feed({
    db: creds.npm_url,
    include_docs: true,
    since: seq || 0
  });

  feed.on('change', function (change) {
    // hand each change to the async queue (set up below)
    queue.push(change);
  });

  feed.on('error', function (er) {
    console.error('Since Follow always retries on errors, this must be serious', er);
    throw er;
  });

  feed.follow();
}

  • We use Async to queue all the changes and add them into Orchestrate. We want to pause the NPM data stream at the max concurrent operations so we can throttle the connections. To do that, we call pause() on the feed when the queue is saturated and resume() when it drains (the queue worker itself is sketched after this snippet).
var async = require('async');

var queue = async.queue(queueWorker, CONCURRENT_DOWNLOADS);

queue.drain = function () {
  // the queue is empty again, resume the NPM feed
  feed.resume();
};

queue.saturated = function () {
  // too many concurrent downloads, pause the NPM feed
  feed.pause();
};
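
The queue worker itself just hands each change document to the update function described below. A minimal sketch, assuming the worker only needs to call update and signal the queue when it finishes:

function queueWorker (change, done) {
  // normalize and store one package (update is defined below),
  // then tell the queue this slot is free
  update(change.doc)
  .fin(done);
}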
  • We need to normalize the NPM data. Since NPM data is user submitted, it can vary from package to package. We use npm-normalize for the common parts, then custom functions to normalize the scripts and time fields.
function normalizeTime (doc) {
  var names = ['times', 'time'];
  names.forEach(function (time) {
    if (doc[time]) {
      var times = [];
      Object.keys(doc[time]).forEach(function (field) {
        // turn the { version: date } map into an array of objects
        times.push({
          version: field,
          date: doc[time][field]
        });
      });
      doc[time] = times;
    }
  });

  return doc;
}

function normalizeScripts (doc) {
  if (doc.scripts) {
    var scripts = [];

    Object.keys(doc.scripts).forEach(function (field) {
      var command = doc.scripts[field];

      // handle fickle script objects
      if (typeof command === 'object') {
        command = JSON.stringify(command);
      }

      scripts.push({
        script: field,
        command: command
      });
    });

    doc.scripts = scripts;
  }

  return doc;
}
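
To make the normalization concrete, here's what normalizeTime does to a hypothetical time field; a map keyed by version becomes an array of objects that's friendlier to index and search:

// before normalizeTime (hypothetical values)
{ time: { '1.0.0': '2014-01-01T00:00:00Z', '1.0.1': '2014-02-01T00:00:00Z' } }

// after normalizeTime
{ time: [ { version: '1.0.0', date: '2014-01-01T00:00:00Z' },
          { version: '1.0.1', date: '2014-02-01T00:00:00Z' } ] }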
  • We have an update function that runs the data through the normalize functions and saves it into Orchestrate.
function update (doc) {
  var id = doc._id;

  // run the doc through each normalize step in turn
  var normalized = [
    normalize,        // npm-normalize
    normalizeTime,
    normalizeScripts
  ].reduce(function (doc, func) {
    if (doc) {
      return func(doc);
    } else {
      return doc;
    }
  }, doc);

  var data = normalized || {};
  data['date_scraped'] =;

  return db.put('npm', id, data)
  .fail(function (err) {
    console.log('error with update', err);
  });
}
  • You can view the full code here: npm.js

Github Repo Data

Now that we have the NPM packages in our database, we can start bringing in the other data we want: Github repo details like stars and forks, plus NPM download counts for the past 30 days and for individual months.

  • To access the Github API from node, we’ll use the Octonode package.

  • There are a few things we want to be aware of. First, not all packages have Github repos. We use git-url-parse to parse the repository field and determine whether it's a Github repo. The second thing to watch out for is Github's rate limit on their API. We authenticate the API requests so the limit is 5,000/hour; otherwise, it's 60/hour. The third thing to watch out for is that some packages list a Github repo in their package.json, but the repo has since been deleted or moved.
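
Here's a minimal sketch of that repository check; githubRepoName is a hypothetical helper, but git-url-parse is the real package:

var gitUrlParse = require('git-url-parse');

// returns 'owner/name' for Github repos, null for everything else
function githubRepoName (repository) {
  // many packages have no repository field at all
  if (!repository || !repository.url) return null;

  // git-url-parse understands git://, git@, and https:// style URLs
  var parsed = gitUrlParse(repository.url);
  if (parsed.source !== '') return null;

  return parsed.owner + '/' +;
}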

function getRepo (repo) {
  var deferred = Q.defer();

  // get the repo
  var client = github.client(creds.github);
  var ghrepo = client.repo(repo); (err, data) {
    if (err) {

      // the repo doesn't exist anymore, skip it
      if (err.statusCode === 404) {
        console.log(repo, 'not found');
        return deferred.resolve(null);
      }

      // we are over our rate limit
      if (err.statusCode === 403) {
        var resetTime = (_.get(err, 'headers.x-ratelimit-reset') * 1000);
        var delay = resetTime -;

        return deferred.reject({
          type: 'API-limit',
          delay: delay,
          resetTime: resetTime
        });
      }

      // any other error
      return deferred.reject({
        error: err
      });
    }

    deferred.resolve(data);

  return deferred.promise;
  • Now the update function can get the repo and then update our database.
function update (repo) {
  if (!repo) return Q({});

  return getRepo(repo)
  .then(function (data) {
    if (!data) return null;

    // add the date_scraped
    data['date_scraped'] =;

    // save to database
    return db.put('github', repo, data);
  });
}

NPM Downloads

Next up is getting the download counts from NPM. This data doesn't come from the packages database; instead, there's a separate API we can use to get the downloads for a given time period.

  • First, we get the downloads for a given month: we pass in the package ID and a date anywhere in the month we want, find the first and last days of that month, and query the API.
function getDownloadsForDate (package, date) {
  var deferred = Q.defer();

  var startOfMonth = moment(date).startOf('month').format('YYYY-MM-DD');
  var endOfMonth = moment(date).endOf('month').format('YYYY-MM-DD');

  var url = '' + startOfMonth + ':' + endOfMonth + '/' + package;

  request.get(url, function (error, response, body) {
    if (error) {
      deferred.reject(new Error(error));
    } else {
      var data = JSON.parse(body);

      deferred.resolve({
        downloads: _.get(data, 'downloads'),
        date: {
          year: moment(date).startOf('month').format('YYYY'),
          month: moment(date).startOf('month').format('MM')
        }
      });
    }
  });

  return deferred.promise;
}
  • Next, we get the download count for the past 30 days. This comes back as a count for each day; we store the daily values and also loop over them to add up a total.
function getDownloadsForLastFullMonth (package) {
  var deferred = Q.defer();

  var url = '' + package;

  request.get(url, function (error, response, body) {
    if (error) {
      deferred.reject(new Error(error));
    } else {
      var data = JSON.parse(body);

      var downloads = 0;
      var dates =, 'downloads'), function (item) {
        downloads += item.downloads;

        return {
          'date': new Date(,
          'count': item.downloads
        };
      });

      deferred.resolve({
        downloads: downloads,
        dates: dates
      });
    }
  });

  return deferred.promise;
}
  • Then our update function can call these two functions and save the data into the database.
function update (package) {
  var data;

  return Q.all([
    getDownloadsForLastFullMonth(package),
    getDownloadsForDate(package, moment().subtract(1, 'months'))
  ])
  .then(function (results) {
    var dailyResults = results[0];
    var monthlyResults = results[1];

    // build a new object to be merged into the database
    data = {
      'daily': dailyResults.dates,
      'daily_total': dailyResults.downloads || 0
    };

    // save the monthly download count as a unique field. this way it keeps it over time,
    // but overwrites any existing month in case the data changed.
    // stores data in `month_YEAR_MONTH` format.
    data['month_' + + '_' +] = monthlyResults.downloads;

    return db.merge('downloads', package, data, {'upsert': true});
  })
  .then(function () {
    return data;
  })
  .fail(function (err) {
    console.log('err', err);
  });
}
  • We use Orchestrate’s merge and upsert options so it merges the data if it already exists, or creates a new record if not. This way, we start building historical data instead of overwriting it with each update.
db.merge('downloads', package, data, {'upsert': true});
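
After a few monthly runs, a record in the downloads collection accumulates one field per month alongside the rolling daily data, something like this (hypothetical values):

{
  "daily": [ { "date": "2014-12-01T00:00:00.000Z", "count": 1371 } ],
  "daily_total": 42034,
  "month_2014_11": 39982,
  "month_2014_12": 42034
}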

Normalizing and Denormalizing Data

Since we’re pulling in data from multiple sources, we want to keep it in separate collections in Orchestrate. The NPM package data goes in a collection called npm and the Github data goes in, yes, you guessed it, github. The NPM download data goes in downloads. This makes it easy to handle updates from any source. However, in order to search across these different fields, we need to denormalize the data into one collection. We’re going to use a new collection called packages.
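
That packages collection is what the app actually queries. As a rough sketch of the payoff, one search call can match package fields and sort by the rank field we build next (the query below is illustrative, not the app's actual code):

  .sort('rank', 'desc')
  .query('http server')
  .then(function (result) {
    console.log(result.body.results);
  })
  .fail(function (err) {
    console.log('search error', err);
  });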

  • When we denormalize the data, we create a new ranking field that we can use to sort the results. It weighs the number of downloads, stars and forks on Github, and whether the package was updated in the last 180 days.
// MAX, WEIGHTS, and MS_PER_DAY are tuning constants defined elsewhere in the file
function calculateRank (data) {
  var downloads = _.get(data, 'downloads.daily_total') || 0;
  var stars = _.get(data, 'github.stargazers_count') || 0;
  var forks = _.get(data, 'github.forks_count') || 0;
  var daysSinceUpdate = (new Date() - new Date(_.get(data, 'modified'))) / MS_PER_DAY;
  var updatedWeight = (daysSinceUpdate < 180) ? 1 : 0;

  // handle packages that point at another project's repo:
  // if downloads are tiny relative to stars, don't credit the stars
  if ((downloads / stars) < 0.3) {
    stars = 0;
    forks = 0;
  }

  return (
    ((stars / MAX.STARS) * WEIGHTS.STARS) +
    ((forks / MAX.FORKS) * WEIGHTS.FORKS) +
    (updatedWeight * WEIGHTS.UPDATES)
  );
}
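
Putting it all together, the denormalize step reads the three source records, merges them into a single document, attaches the rank, and writes it to packages. A minimal sketch with an assumed signature; how we track each package's repo key, and the real error handling, are left out:

function denormalize (id, repo) {
  // fetch all three sources; allSettled tolerates packages that
  // have no github or downloads record yet
  return Q.allSettled([
    db.get('npm', id),
    repo ? db.get('github', repo) : Q.reject(new Error('no repo')),
    db.get('downloads', id)
  ])
  .then(function (results) {
    // the npm record is required
    if (results[0].state !== 'fulfilled') throw results[0].reason;
    var data = results[0].value.body;

    // graft on whatever else we found
    if (results[1].state === 'fulfilled') data.github = results[1].value.body;
    if (results[2].state === 'fulfilled') data.downloads = results[2].value.body;

    // the rank we sort search results by
    data.rank = calculateRank(data);

    return db.put('packages', id, data);
  });
}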

Deploying to AppFog

Now that our code is ready, we want to deploy it to a server so we can have it run all the time and constantly update the data. We’ll use AppFog’s Platform-as-a-Service (PaaS) to handle all the hardware and hosting for us.

  • First, we need to create a Cloud Foundry account and add AppFog to it. There’s a detailed tutorial here: Deploy an Application to AppFog
  • Once the account is set up we install the Cloud Foundry CLI. This lets us manage our instances and deploy from the command line. Download the installer from Github.
  • When the CLI is installed, we can log in to our account using our username and organization.
$ cf login -a -u fox.mulder -o XFILES
  • Then, it’s easy to deploy to AppFog with a single command.
$ cf push yourappname

This command uses the npm start script in our package.json file.
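
For that to work, our package.json needs a start script; a minimal example, assuming the scraper's entry point is index.js (hypothetical):

{
  "scripts": {
    "start": "node index.js"
  }
}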


That's how we pull NPM and Github data into Orchestrate, make it searchable, and keep it fresh with a scraper running on AppFog.