NPM is wonderful. It's an indispensable tool. Discovery, however, isn't as transparent as you might like it to be. So let's index NPM in MongoDB and join it with Github data to make packages searchable. That way, we can search for packages and easily pick the best one for our applications. We described the approach in an earlier post, but in this one we'll start to explain how we built it.

Scout JS example search

We'll use a few different technologies to make this work: Node.js, MongoDB on a CenturyLink Cloud virtual server, and AppFog to host the scraper.

Check out the final application here: scoutjs.com

Deploy a New Virtual Server with MongoDB

If you don't have a CenturyLink Cloud account yet, head over to our website and sign up for a free trial. You'll need it to access CenturyLink Cloud products.

Our first step is to deploy a new CenturyLink Cloud virtual server. Follow the steps below.

  1. Log into the CenturyLink Cloud control portal at https://control.ctl.io/.
  2. On the left side menu, click Infrastructure and then Servers.


  3. On the left-hand side of the server panel, click on the region for the server we will provision.

  4. Click create and then server.
  5. Complete the setup form for your new server. Be sure to fill out the fields for server name and admin/root password.
  6. For operating system, select "CentOS 7 | 64-bit".
  7. Click create server.
  8. Your server provisioning request will enter the queue. You can watch the progress of your request on the screen. Your server is provisioned when the status of all tasks in the queue is complete.


  9. After your new server is provisioned, in the CenturyLink control portal, click Infrastructure on the left side menu, and then click Servers.

  10. Navigate to your new server and click on its name.
  11. Click the more menu, and then click add public ip.
  12. Check the box for SSH/SFTP (22).
  13. Click custom port... and then single port.
  14. Type "27017" in the blank box to open up the MongoDB server port.


  15. Click add public ip address.

Installing and Configuring MongoDB

  1. Navigate to your server in the CenturyLink Cloud control panel as in the previous section. Your server's public IP address will be noted on the screen.
  2. From a shell on your local machine, connect to your new server with the following command. Replace "YOUR.VPS.IP" with your server's public IP address.
    $ ssh root@YOUR.VPS.IP
    
  3. Install the MongoDB server software by running the following commands. On a stock CentOS 7 image these packages come from the EPEL repository, so enable it first.
    $ yum install -y epel-release
    $ yum install -y mongodb mongodb-server
    
  4. With your favorite text editor, open /etc/mongod.conf. Look for the line that begins "bind_ip" and comment it out so MongoDB listens on all interfaces. (For anything beyond this walkthrough, restrict the bind address or enable authentication, since port 27017 is open to the internet.) The top of your file should now look like this:

    ##
    ### Basic Defaults
    ##
    
    # Comma separated list of ip addresses to listen on (all local ips by default)
    #bind_ip = 127.0.0.1
    
  5. Start the MongoDB service by running the following command.
    $ service mongod start
    

Data

We start by pulling in all the packages from NPM. Luckily, this is pretty easy because NPM's registry is a CouchDB database, so we can replicate all the data from its changes feed.

  • We use Follow to get a feed of the data and then add each item to MongoDB.

    // `feed` lives at module scope so the queue callbacks below can
    // pause and resume it.
    var feed;

    function getFeed (seq) {
      feed = new follow.Feed({
        db: creds.npm_url,   // URL of NPM's public registry database
        include_docs: true,
        since: seq || 0      // resume from a saved sequence number, or start over
      });

      feed.on('change', function (change) {
        queue.push(change);
      });

      feed.on('error', function (er) {
        console.error('Since Follow always retries on errors, this must be serious', er);
        throw er;
      });

      feed.follow();
    }
    
  • We use Async to queue the changes and insert them into MongoDB. To throttle connections, we pause the NPM feed whenever the queue reaches the maximum number of concurrent operations and resume it once the queue drains, using the feed's pause() and resume() methods.

    var CONCURRENT_DOWNLOADS = 10;
    var async = require('async');
    
    var queue = async.queue(queueWorker, CONCURRENT_DOWNLOADS);
    
    queue.drain = function () {
      feed.resume();
    };
    
    queue.saturated = function () {
      feed.pause();
    };
    
  • We need to normalize the NPM data. Since it's user-submitted, the data can vary from package to package. We use npm-normalize for the common parts, then custom functions to normalize the scripts and time fields.

    function normalizeTime (doc) {
      var names = ['times', 'time'];
      names.forEach(function (time) {
        if (doc[time]) {
          var times = [];
          Object.keys(doc[time]).forEach(function (field) {
            times.push({
              version: field,
              date: doc[time][field]
            });
          });
          doc[time] = times;
        }
      });
    
      return doc;
    };
    
    function normalizeScripts (doc) {
      if (doc.scripts) {
        var scripts = [];
    
        Object.keys(doc.scripts).forEach(function (field) {
          var command = doc.scripts[field];
    
          // handle fickle script objects
          if (typeof command === 'object') {
            command = JSON.stringify(command);
          }
    
          scripts.push({
            script: field,
            command: command
          });
        });
    
        doc.scripts = scripts;
      }
    
      return doc;
    };
    
  • We have an update function that runs the data through the normalize functions and saves it into MongoDB.

    function update (doc) {
      var deferred = Q.defer();
      var normalized = [
        normalize,
        normalizeTime,
        normalizeScripts
      ].reduce(function (doc, func) {
        // skip remaining steps once a normalizer has rejected the doc
        return doc ? func(doc) : doc;
      }, doc);

      var data = normalized || {};
      data['id'] = doc._id;
      data['date_scraped'] = Date.now();

      MongoClient.connect(mongodbUrl, function (err, db) {
        if (err) return deferred.reject(err);

        var collection = db.collection('npm');
        collection.insert(data, function (err, rec) {
          if (err) deferred.reject(err);
          else deferred.resolve(rec);

          db.close();
        });
      });

      return deferred.promise;
    }
    
  • You can view the full code here: npm.js
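As a quick sanity check on the normalization, here's the scripts normalizer from above applied to a sample document (a standalone copy, so it runs on its own):

```javascript
// Standalone copy of the scripts normalizer shown above, applied to a
// sample package document so you can see the transformation.
function normalizeScripts (doc) {
  if (doc.scripts) {
    var scripts = [];

    Object.keys(doc.scripts).forEach(function (field) {
      var command = doc.scripts[field];

      // handle fickle script objects
      if (typeof command === 'object') {
        command = JSON.stringify(command);
      }

      scripts.push({ script: field, command: command });
    });

    doc.scripts = scripts;
  }

  return doc;
}

var sample = normalizeScripts({
  name: 'demo',
  scripts: { test: 'mocha', start: 'node index.js' }
});

// The arbitrary-keyed object becomes an array of { script, command }
// pairs, which is much easier to query and index.
console.log(sample.scripts);
```

The same idea applies to the time field: objects with unpredictable keys become arrays of uniform documents.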

Github Repo Data

Now that we have the NPM packages in our database, we can start bringing in the other data we want. From the Github API we pull repo details such as star and fork counts; download numbers come from a separate NPM API, covered in the next section.

  • To access the Github API from node, we'll use the Octonode package.

  • There are a few things to be aware of. First, not all packages have Github repos; we use git-url-parse to parse the repository field and determine whether it points at Github. Second, Github rate-limits its API: authenticated requests get 5,000 per hour, while unauthenticated requests get only 60, so we authenticate. Third, some packages list a Github repo in package.json that has since been deleted or moved.

    function getRepo (repo) {
      var deferred = Q.defer();

      // get the repo
      var client = github.client(creds.github);
      var ghrepo = client.repo(repo);
      ghrepo.info(function (err, data) {
        if (err) {

          // the repo doesn't exist anymore, skip it
          if (err.statusCode === 404) {
            console.log(repo, 'not found');
            return deferred.resolve(null);
          }

          // we are over our rate limit; tell the caller when it resets
          if (err.statusCode === 403) {
            var resetTime = (_.get(err, 'headers.x-ratelimit-reset') * 1000);
            var delay = resetTime - Date.now();

            return deferred.reject({
              type: 'API-limit',
              delay: delay,
              resetTime: resetTime
            });
          }

          return deferred.reject({
            error: err
          });
        }

        deferred.resolve(data);
      });

      return deferred.promise;
    }
    
  • Now the update function can get the repo and then update our database.

    function update (repo) {
      if (!repo) return Q({});
    
      return getRepo(repo)
        .then(function (data){
          if (!data) return null;
          var deferred = Q.defer();
    
          // Add repo as a key
          data['repo'] = repo;
    
          // add the date_scraped
          data['date_scraped'] = Date.now();
    
          MongoClient.connect(mongodbUrl, function (err, db) {
            var collection = db.collection('github');
    
        collection.insert(data, function (err, result) {
          if (err) deferred.reject(err);
          else deferred.resolve(result);
          db.close();
        });
          });
    
          return deferred.promise;
        });
    };
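The Github-or-not check mentioned above uses git-url-parse in the real code. As a rough illustration of what that check boils down to, here's a hand-rolled stand-in (the parseGithubRepo helper is hypothetical, not the library's API):

```javascript
// Hypothetical stand-in for the git-url-parse check: return the
// "owner/name" slug when a repository URL points at Github, else null.
function parseGithubRepo (url) {
  if (!url) return null;
  var match = url.match(/github\.com[\/:]([^\/]+)\/([^\/]+?)(?:\.git)?$/);
  return match ? match[1] + '/' + match[2] : null;
}

console.log(parseGithubRepo('git://github.com/caolan/async.git')); // 'caolan/async'
console.log(parseGithubRepo('git@github.com:caolan/async.git'));   // 'caolan/async'
console.log(parseGithubRepo('https://bitbucket.org/foo/bar'));     // null
```

The real library handles far more URL shapes (protocols, trailing slashes, odd hosts), which is why we use it instead of a regex like this.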
    

NPM Downloads

Next up is getting the download counts from NPM. This data doesn't come from the packages database. Instead, there's a separate API we can use to get the downloads for a given time period.

  • First, we get the downloads for a given month. We pass in the package name and any date within the month we want, find the first and last days of that month, and query the API for that range.

    function getDownloadsForDate (package, date) {
      var deferred = Q.defer();
    
      var startOfMonth = moment(date).startOf('month').format('YYYY-MM-DD');
      var endOfMonth = moment(date).endOf('month').format('YYYY-MM-DD');
    
      var url = 'https://api.npmjs.org/downloads/point/' + startOfMonth + ':' + endOfMonth + '/' + package;
    
      request.get(url, function (error, response, body){
        if (error) {
          deferred.reject(new Error(error));
        } else {
          var data = JSON.parse(body);
          deferred.resolve({
            downloads: _.get(data, 'downloads'),
            date: {
              year: moment(date).startOf('month').format('YYYY'),
              month: moment(date).startOf('month').format('MM'),
            },
          });
        }
      });
    
      return deferred.promise;
    };
    
  • Next, we'll get the download counts for the past 30 days. The API returns a count for each day, so we store the daily values and sum them to get a total.

    function getDownloadsForLastFullMonth (package) {
      var deferred = Q.defer();
    
      var url = 'https://api.npmjs.org/downloads/range/last-month/' + package;
    
      request.get(url, function (error, response, body){
        if (error) {
          deferred.reject(new Error(error));
        } else {
          var data = JSON.parse(body);
    
          var downloads = 0;
          var dates = _.map(_.get(data, 'downloads'), function(item){
            downloads += item.downloads;
    
            return {
              'date': new Date(item.day).getTime(),
              'count': item.downloads,
            };
          });
    
          deferred.resolve({
            downloads: downloads,
            dates: dates,
          });
        }
      });
    
      return deferred.promise;
    };
    
  • Then our update function can call these two functions and save the data into the database. We use MongoDB's findOneAndUpdate method with the upsert option so it merges the data into an existing record, or creates a new record if there isn't one. This way, we build up historical data instead of overwriting it with each update.

    function update (package) {
      var data;
    
      return Q.all([
        getDownloadsForLastFullMonth(package),
        getDownloadsForDate(package, moment().subtract(1,'months')),
      ])
      .then(function(results){
        var dailyResults = results[0];
        var monthlyResults = results[1];
        var deferred = Q.defer();
    
        // build a new object to be merged into the database
        data = {
          'package': package,
          'date_scraped': Date.now(),
          'daily': dailyResults.dates,
          'daily_total': dailyResults.downloads || 0,
        };
    
        // save the monthly download count under a per-month field so it
        // accumulates over time, but overwrites the same month in case
        // the data changed. fields are named `month_YEAR_MONTH`.
        data['month_' + monthlyResults.date.year + '_' + monthlyResults.date.month] = monthlyResults.downloads;
    
        MongoClient.connect(mongodbUrl, function (err, db) {
          var collection = db.collection('downloads');
    
          collection.findOneAndUpdate(
            {'package': package},
            {'$set': data},  // $set merges fields instead of replacing the
                             // document, so earlier month_* fields survive
            {'upsert': true, 'returnOriginal': false},
            function (err, record) {
              deferred.resolve(record);
              db.close();
            });
        });
    
        return deferred.promise;
      })
        .then(function(){
          return data;
        })
        .fail(function (err) {
          console.log('err', err);
        });
    };
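The `month_YEAR_MONTH` field name used above can be derived from any date. Here's a small sketch with plain Date (the real code uses moment):

```javascript
// Build the `month_YEAR_MONTH` field name used to store monthly
// download counts, e.g. 'month_2015_08'.
function monthKey (date) {
  var year = date.getUTCFullYear();
  var month = ('0' + (date.getUTCMonth() + 1)).slice(-2); // zero-pad to two digits
  return 'month_' + year + '_' + month;
}

console.log(monthKey(new Date(Date.UTC(2015, 7, 15)))); // 'month_2015_08'
```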
    

Normalizing and Denormalizing Data

Since we're pulling in data from multiple sources, we keep it in separate MongoDB collections. The NPM package data goes in a collection called npm, the Github data goes in (yes, you guessed it) github, and the NPM download data goes in downloads. This makes it easy to handle updates from any source. However, in order to search across these different fields, we need to denormalize the data into one collection, which we call packages.
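The merge step itself is straightforward once all three collections key off the package name. Here's a minimal sketch (the denormalize helper and its field choices are illustrative, not the exact production code):

```javascript
// Merge one document from each source collection into a single
// searchable record destined for the `packages` collection.
function denormalize (npmDoc, githubDoc, downloadsDoc) {
  return {
    id: npmDoc.id,
    name: npmDoc.name,
    description: npmDoc.description,
    github: githubDoc || {},       // not every package has a Github repo
    downloads: downloadsDoc || {}, // or any recorded downloads yet
    date_scraped: Date.now()
  };
}

var merged = denormalize(
  { id: 'async', name: 'async', description: 'Higher-order functions for async code' },
  { stargazers_count: 20000, forks_count: 1800 },
  { daily_total: 9000000 }
);
```

Nesting the Github and downloads data under their own keys keeps the sources distinguishable while still letting us query across all of them in one collection.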

  • When we denormalize the data, we also create a new ranking field that we can sort results by. It looks at the number of downloads, the stars and forks on Github, and the number of days since the last update.

    function calculateRank (data) {
      var downloads = _.get(data, 'downloads.daily_total') || 0;
      var stars = _.get(data, 'github.stargazers_count') || 0;
      var forks = _.get(data, 'github.forks_count') || 0;
      var daysSinceUpdate = (new Date() - new Date(_.get(data, 'modified'))) / MS_PER_DAY;
      var updatedWeight = (daysSinceUpdate < 180) ? 1 : 0;
    
      // handle packages that link to other repos
      // like this: https://www.npmjs.com/package/node-core-lib
      if ((downloads / stars) < 0.3) {
        stars = 0;
        forks = 0;
      }
    
      return (
        ((downloads/MAX.DOWNLOADS) * WEIGHTS.DOWNLOADS) +
        ((stars/MAX.STARS) * WEIGHTS.STARS) +
        ((forks/MAX.FORKS) * WEIGHTS.FORKS) +
        ((updatedWeight) * WEIGHTS.UPDATES)
      );
    };
    
  • You can check out the full source for all the parts here.
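calculateRank leans on MS_PER_DAY, MAX, and WEIGHTS constants defined elsewhere in the source. To make the formula concrete, here's the same weighted sum with assumed, purely illustrative values (not the real weights):

```javascript
// Illustrative constants only; the real values live in the full source.
var MAX = { DOWNLOADS: 10000000, STARS: 50000, FORKS: 10000 };
var WEIGHTS = { DOWNLOADS: 0.5, STARS: 0.25, FORKS: 0.125, UPDATES: 0.125 };

// A package at half the download ceiling, with no stars or forks but a
// recent update: 0.5 * 0.5 + 0.125 = 0.375 under these weights.
var rank =
  ((5000000 / MAX.DOWNLOADS) * WEIGHTS.DOWNLOADS) +
  ((0 / MAX.STARS) * WEIGHTS.STARS) +
  ((0 / MAX.FORKS) * WEIGHTS.FORKS) +
  (1 * WEIGHTS.UPDATES);

console.log(rank); // 0.375
```

Normalizing each metric by its maximum keeps any single signal from dominating, and the update term rewards actively maintained packages.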

Deploying to AppFog

Now that our code is ready, we want to deploy it to a server so we can have it run all the time and constantly update the data. We'll use AppFog's Platform-as-a-Service (PaaS) to handle all the hardware and hosting for us.

  • First, we need to enable AppFog in our CenturyLink Cloud account. There's a detailed tutorial here: Deploy an Application to AppFog
  • Once the account is set up, we install the Cloud Foundry CLI. This lets us manage our instances and deploy from the command line. Download the installer from Github.
  • When the CLI is installed, we can log in to our account using our username and organization.

    $ cf login -a https://api.useast.appfog.ctl.io -u fox.mulder -o XFILES
    
  • Then, it's easy to deploy to AppFog with a single command.

    $ cf push yourappname
    

This command uses the npm start script in our package.json file.
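For reference, the relevant part of such a package.json might look like this (the names here are placeholders):

```json
{
  "name": "scout-scraper",
  "scripts": {
    "start": "node index.js"
  }
}
```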

Summary

That's how we index NPM and Github data in MongoDB, and keep it fresh with a scraper running on AppFog. Try it out at scoutjs.com.