Ever wondered what "sort by relevance" really means in search results? Maybe you've been frustrated that they aren't actually relevant. Now you can do something about it, at least with your own applications. When we created a way to search Node modules, sorting was on our minds. We wanted the best Node modules to show up at the top of the results, not those with the name that most closely matches the search term. There are many different ways to look at relevance, and in this tutorial you'll see how to apply your own view to create your own search ranking.

The Problem With Sort Scoring

First, let's consider how search works by default in Orchestrate, our NoSQL database-as-a-service. The search features are built on top of Elasticsearch, and Orchestrate intelligently makes all of the indexing and schema choices for you. When executing a search with no sort specified, Orchestrate uses the search score provided by Elasticsearch. The default score works well when the goal is simply to match a search term as closely as possible. It is too simplistic when your data contains other factors, such as popularity or freshness, that should influence the order.

To build ScoutJS, we merged data from NPM with GitHub. Among the fields we cared about from GitHub were stars, forks, and updates. NPM had all the data about each module, as well as a valuable popularity metric--number of downloads. We needed to include all of these factors in our rankings.

As a way of showing the difference, let's look at the example of searching for "auth" in our combined dataset. The first result is the "auth" module. That may seem highly relevant until you realize it was last updated three years ago. It has other signs of non-popularity as well--no forks, one star, and relatively few downloads.

But how is Orchestrate or Elasticsearch supposed to know?

Determine Your Sort Factors

Every application is different. Your input is needed to identify what matters in your data. You can use the same approach we did with ScoutJS in your applications.

If you look at the same search for "auth" using our finished search ranking algorithm, you'll see results that are relevant, not because of name matching, but because of popularity.

Search is ranked by data relevance, not search term

The top results have hundreds of forks, thousands of stars, and hundreds of thousands of downloads.

Clearly, the sort factors for ScoutJS are:

  • Downloads
  • Stars
  • Forks
  • Updates

What's important to your application? Jot down some potential fields in your dataset that you want to use for sort ranking. You only need a handful, and they should jump out as obviously important. However, they don't need to be equally important, as we'll see in the next section.

Apply Weights and Adjust as Needed

Now that you've determined which fields in your data count for sorting, you need to decide how much they count. We'll use these weights along with each of the sort factors to determine a popularity value upon which we can sort. We'll add that pre-calculated value to our data so it's ready when we need it.

Here's the section of ScoutJS code where the rank is calculated:

// `_` is assumed to be lodash; MS_PER_DAY, WEIGHTS, and MAX are declared elsewhere
function calculateRank (data) {
  // missing fields fall back to 0 so they add nothing to the rank
  var downloads = _.get(data, 'downloads.daily_total') || 0;
  var stars = _.get(data, 'github.stargazers_count') || 0;
  var forks = _.get(data, 'github.forks_count') || 0;
  var daysSinceUpdate = (new Date() - new Date(_.get(data, 'modified'))) / MS_PER_DAY;
  // all-or-nothing credit for an update within the last 180 days
  var updatedWeight = (daysSinceUpdate < 180) ? 1 : 0;

  // handle packages that link to other repos
  // like this: https://www.npmjs.com/package/node-core-lib
  if ((downloads / stars) < 0.3) {
    stars = 0;
    forks = 0;
  }

  // each factor is scaled by its maximum, then multiplied by its weight
  return (
    ((stars/MAX.STARS) * WEIGHTS.STARS) +
    ((forks/MAX.FORKS) * WEIGHTS.FORKS) +
    ((updatedWeight) * WEIGHTS.UPDATES)
  );
}

The first thing we do is store the factors in variables, defaulting to zero when a field is missing. Next, we create a new variable, updatedWeight, where we make our first decision about how much each factor matters. We decided that we cared only whether a module was updated within the last 180 days, or roughly six months. If its last update is older than that, it gets no credit for being updated. Otherwise, it gets full credit.
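To make the recency rule concrete, here is a minimal sketch of the 180-day cutoff on its own. The `updatedWeight` function wrapper and the `now` parameter are not part of the ScoutJS code; they are added here so the example gives the same answer whenever it runs. It also assumes MS_PER_DAY is the number of milliseconds in a day.

```javascript
// Assumption: MS_PER_DAY is the number of milliseconds in one day.
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns 1 if `modified` is fewer than 180 days before `now`, otherwise 0.
// `now` is passed in (hypothetical parameter) instead of calling new Date().
function updatedWeight(modified, now) {
  const daysSinceUpdate = (now - new Date(modified)) / MS_PER_DAY;
  return daysSinceUpdate < 180 ? 1 : 0;
}

const now = new Date('2015-06-01T00:00:00Z');

// Updated 92 days before `now`: gets credit.
console.log(updatedWeight('2015-03-01T00:00:00Z', now));
// Updated about three years before `now`: no credit.
console.log(updatedWeight('2012-06-01T00:00:00Z', now));
// Missing date: the subtraction yields NaN, the comparison is false, so no credit.
console.log(updatedWeight(undefined, now));
```

Note that a record with no modified date quietly falls into the "no credit" bucket, which seems like a reasonable default for a popularity ranking.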

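We also realized that modules with very few downloads but many stars (fewer than 0.3 downloads per star) are anomalies that should be dropped in the rankings. As the code comment notes, these tend to be packages that link to other repos. For those, we reset stars and forks to zero.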
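The ratio check is easy to try in isolation. The sketch below pulls it into a hypothetical helper, `discountAnomalies`, which is not a ScoutJS function, and runs it against made-up numbers, including the edge cases where stars is zero.

```javascript
// Hypothetical helper: zero out stars and forks when a module has
// fewer than 0.3 downloads per star, as in the rule described above.
function discountAnomalies(downloads, stars, forks) {
  if ((downloads / stars) < 0.3) {
    return { stars: 0, forks: 0 };
  }
  return { stars: stars, forks: forks };
}

// 100 / 5000 = 0.02, below the threshold: stars and forks are dropped.
console.log(discountAnomalies(100, 5000, 300));
// 50000 / 2000 = 25, well above the threshold: values are kept.
console.log(discountAnomalies(50000, 2000, 150));
// stars === 0 gives Infinity (or NaN when downloads is also 0);
// neither is < 0.3, so the values pass through unchanged.
console.log(discountAnomalies(100, 0, 0));
console.log(discountAnomalies(0, 0, 0));
```

Because JavaScript division by zero does not throw, modules with no stars simply skip the rule, and they contribute nothing from stars or forks anyway.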

Lastly, the weights are applied based on how close each module is to the maximum value for its field. The weights and maximums themselves are static values declared elsewhere:

var WEIGHTS = {
  STARS: 0.8,
  FORKS: 0.4,
  UPDATES: 0.2,
};

var MAX = {
  DOWNLOADS: 2000000,
  STARS: 10000,
  FORKS: 10000,
};

We tried various values before arriving at these weights. You'll do the same thing with your data, because you know it best. If something seems amiss, adjust your weights, or even try removing some factors. They might not be as important as you thought.
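One way to experiment is to score a few sample records under different weights and see how the order changes. The sketch below reuses the weights and maximums shown above with the same stars, forks, and updates formula. The three modules and the `rank` and `sortByRank` helpers are made up for illustration, and the alternative weight set is just one example of a tweak.

```javascript
const WEIGHTS = { STARS: 0.8, FORKS: 0.4, UPDATES: 0.2 };
const MAX = { STARS: 10000, FORKS: 10000 };

// Same shape as the rank formula above, with the weights passed in
// so different sets can be compared side by side.
function rank(m, weights) {
  return (m.stars / MAX.STARS) * weights.STARS +
         (m.forks / MAX.FORKS) * weights.FORKS +
         m.updatedWeight * weights.UPDATES;
}

// Hypothetical sample data, not real modules.
const modules = [
  { name: 'popular-but-stale', stars: 2000, forks: 200, updatedWeight: 0 },
  { name: 'fresh-but-unknown', stars: 10, forks: 2, updatedWeight: 1 },
  { name: 'popular-and-active', stars: 5000, forks: 500, updatedWeight: 1 }
];

// Returns module names ordered from highest to lowest rank.
function sortByRank(list, weights) {
  return list
    .slice() // copy so the input array is not reordered
    .sort((a, b) => rank(b, weights) - rank(a, weights))
    .map(m => m.name);
}

// With the original weights, a recent update alone outranks 2,000 stars.
console.log(sortByRank(modules, WEIGHTS));

// Halving the update weight lets the starred-but-stale module move up.
const tweaked = { STARS: 0.8, FORKS: 0.4, UPDATES: 0.1 };
console.log(sortByRank(modules, tweaked));
```

Running a small fixed sample like this after every change makes it easier to notice when a weight is doing more than you intended.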

Denormalization and Planning Ahead

Remember that ScoutJS merges disparate datasets, which themselves are always updating. Periodically we grab the latest downloads from NPM and popularity data from GitHub. Each of those is stored separately, which is "normalized data" in database terms. Then our application "denormalizes" the data into a central collection. That's when our ranking is calculated.
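As a rough sketch of that denormalization step, the code below merges one NPM record and one GitHub record into a single document and stores a rank on it. The field names follow the ones read by the rank calculation earlier; the sample values, the `denormalize` helper, and the `simpleRank` stand-in are assumptions for illustration and not the ScoutJS implementation.

```javascript
// Hypothetical records, stored separately ("normalized") per source.
const npmRecord = {
  name: 'example-module',
  modified: '2015-05-01T00:00:00Z',
  downloads: { daily_total: 1200 }
};
const githubRecord = { stargazers_count: 340, forks_count: 25 };

// Stand-in for the real ranking function, only so this sketch runs on its own.
function simpleRank(doc) {
  return doc.github.stargazers_count + doc.github.forks_count;
}

// Merge both sources into one document and store the rank alongside the data.
function denormalize(npm, github, rankFn) {
  const merged = Object.assign({}, npm, { github: github });
  merged.rank = rankFn(merged);
  return merged;
}

const doc = denormalize(npmRecord, githubRecord, simpleRank);
console.log(doc);
```

The source records are left untouched, so either one can be refreshed later and the merged document rebuilt from them.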

Wondering why we don't calculate rankings or join the data on the fly? That's a perfectly natural question, especially if your background is mostly relational databases. It can be a little strange to transition from relational to NoSQL databases.

Our applications benefit in several ways from planning ahead. First, we're able to update our data from different sources at different times. We also gain the ability to look only at the data from GitHub, for example, and to look back at its changes by using Orchestrate's powerful Refs feature. Finally, we see performance gains by calculating the static ranking value a single time and storing that value for usage when we need it.
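With the rank already stored, query time has very little work left: match the term, then order by the stored number. The sketch below shows that idea with a plain in-memory array. It does not use the Orchestrate API, and the `search` helper, module names, and rank values are all hypothetical.

```javascript
// Hypothetical denormalized documents, each carrying a precomputed rank.
const collection = [
  { name: 'auth-exact-match', rank: 0.00008 },
  { name: 'popular-auth-lib', rank: 0.62 },
  { name: 'unrelated-logger', rank: 0.9 }
];

// Keep items whose name contains the term, then sort by stored rank, highest first.
function search(items, term) {
  return items
    .filter(item => item.name.indexOf(term) !== -1) // filter returns a new array
    .sort((a, b) => b.rank - a.rank);
}

console.log(search(collection, 'auth').map(item => item.name));
```

The term still decides which documents qualify; the stored rank only decides their order, which is why a high-ranking but unrelated module never appears.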

Now That We've Got That Sorted...

Hopefully you've been thinking about your own data. Are there ways you can sort it in a more relevant way? Do you know which fields of data matter in your applications? Use this search ranking algorithm approach to put some weight into your results.