We all know the Orchestrate API is great for storing structured data in the cloud. And we all know that the search API gives you an incredibly powerful query language for filtering records and sorting results. But until now, there’s been no good way to see the big picture or summarize the data in an Orchestrate collection. Today we’re launching a family of Aggregate Functions so that you can finally see the big picture, allowing you to build dashboard charts and graphs, faceted product exploration, or anything else that relies upon a birds eye view of your data.

You can watch an introductory screencast (above) to get the three minute version, or read on to learn about my fictional (or is it?) transportation company.

Let’s pretend I’ve launched a self-driving taxi service. Something like Uber or Lyft, but with a fleet of autonomous robot cars instead of human drivers. Every time a passenger completes their travel, the system inserts a new “trip” record into an Orchestrate database:

{
  "vehicle_id" : "ABCD1234",
  "passenger_id" : "BenjiSmith",
  "start_time" : "2014-12-01T18:06:45.123",
  "duration_minutes" : 10.625,
  "geo" : {
    "start" : {
      "lat" : 45.525415,
      "lon" : -122.673274
    },
    "end" : {
      "lat" : 45.519136,
      "lon" : -122.706780
    }
  }
  "distance_mi" : 2.502,
  "fare_usd" : 5.03
}

With this kind of data, I can easily use the Orchestrate search API to find all trips for a particular passenger, or a particular vehicle. I can search for all trips lasting less than ten minutes or traveling more than five miles.

But for each of those queries, the result is always a list of individual trips. As the founder of this company, I want to see the big picture! How much total revenue did my swarm of robot cars generate within the last 24 hours? How many total miles have they traveled this week? How are the trips geographically distributed?

The Orchestrate family of Aggregate Functions includes four different flavors:

Statistical Aggregates

This aggregate function provides a statistical summary of the numeric values in a particular field. For example, I could calculate a statistical summary of the fare_usd field using the Orchestrate REST API, like this:

curl -i https://api.orchestrate.io/v0/trips
        ?query=*
        &aggregate=value.fare_usd:stats

The aggregate parameter specifies the fully-qualified name of the field, followed by a colon and the name of the aggregate type (in this case, “stats”). The results look exactly like normal Orchestrate search results, except now there’s a new “aggregates” field:

{
  "count": 10,
  "total_count": 40473,
  "results": [
    // Result items omitted for brevity
  ],
  "aggregates": [{
    "aggregate_kind": "stats",
    "field_name": "value.fare_usd",
    "value_count": 40473,
    "statistics": {
      "min": 1.82,
      "max": 156.40,
      "mean": 11.63989944,
      "sum": 471101.65,
      "sum_of_squares": 8817584.179,
      "variance": 82.37814657,
      "std_dev": 9.076240773
    }
  }]
}

This aggregate gives me lots of useful summary metrics: my cars have made a total of 40,473 trips, earning a sum of $471,101.65 in revenue. The minimum fare anyone paid was $1.82, and the maximum fare was $156.40, while an average trip cost the passenger about $11.64.

This particular aggregate applies to my entire database of trips, since I coupled the aggregate with a match-all query (*) in my search request. But what if I used a range query instead? If I query against the value.start_time field, I can get a statistical summary for a single one-day period, like this:

curl -i https://api.orchestrate.io/v0/trips
        ?query=value.start_time:[2014-12-01 TO 2014-12-02]
        &aggregate=value.fare_usd:stats

Not only will the result items be filtered according to the date range query, but so will the result items underlying the aggregate, which now looks like this:

{
  "count": 10,
  "total_count": 770,
  "results": [
    // Result items omitted for brevity
  ],
  "aggregates": [{
    "aggregate_kind": "stats",
    "field_name": "value.fare_usd",
    "value_count": 770,
    "statistics": {
      "min": 2.11,
      "max": 44.89,
      "mean": 11.37687013,
      "sum": 8760.19,
      "sum_of_squares": 143119.8785,
      "variance": 56.51018798,
      "std_dev": 7.517325853
    }
  }]
}

These results tell me that the fares on December 1st ranged between $2.11 and $44.89, with an average of about $11.38, and a grand total of $8,760.19 in revenue for the day.

The most important thing to remember about aggregates is that they always always reflect the subset of database items matched by the query. So you can combine queries and aggregates to create summarized views of different subsets of your collections.

Range Aggregates

Range aggregates also provide a big-picture summary of the numerical data in your collection. But unlike statistical aggregates, range aggregates allow you to configure a collection of numeric ranges, and the system will count the number of field values in your collection that fall within the designated ranges.

For example, if I wanted to build a dashboard chart showing me a histogram of trip durations (in minutes), this is how the API call would look:

curl -i https://api.orchestrate.io/v0/trips
        ?query=*
        &aggregate=value.duration_minutes:range:*~5:5~10:10~30:30~*

Just like before, the aggregate parameter specifies the fully-qualified name of the field, followed by a colon and the name of the aggregate type (“range”). But range aggregates also expect a list of numeric ranges, separated by colons and using tilde characters between min and max values. Asterisks represent ranges with open boundaries.

In this particular example, we’ve created four different buckets: all rides taking less than five minutes, between five and ten minutes, between ten and thirty minutes, and greater than thirty minutes.

The results look like this:

{
  "count": 10,
  "total_count": 40473,
  "results": [
    // Result items omitted for brevity
  ],
  "aggregates": [{
    "aggregate_kind": "range",
    "field_name": "value.duration_minutes",
    "value_count": 40473,
    "buckets": [{
      "max": 5,
      "count": 3724
    },{
      "min": 5,
      "max": 10,
      "count": 22019
    },{
      "min": 10,
      "max": 30,
      "count": 13867
    },{
      "min": 30,
      "count": 863
    }]
  }]
}

From these results, we can see that over half of all trips (22,019 out of 40,473) take between five and ten minutes. We could create a bar chart with these results, which would look something like this:

orchestrate-range-aggregates

Distance Aggregates

Distance aggregates let you count how many geo-locations in your dataset fall within a set of distance-range buckets from a central location. Unlike statistical and range aggregates, which operate upon raw numerical values in your data, distance aggregates operate only upon geospatial data.

For example, I could build a map with a bulls-eye chart showing how many trips started and stopped within various distance ranges of a central location. Here’s how the query would look:

curl -i https://api.orchestrate.io/v0/trips
        ?query=value.geo.start:NEAR:{lat:45.5 lon:-122.6 dist:100mi}
        &aggregate=value.geo.start:distance:*~1:1~3:3~5:5~*

It’s important to remember that distance aggregates can only be used when your query includes a NEAR clause, which provides the central anchor-point for your distance bulls-eye. For this particular query, the results would look like this:

{
  "count": 10,
  "total_count": 40473,
  "results": [
    // Result items omitted for brevity
  ],
  "aggregates": [{
    "aggregate_kind": "distance",
    "field_name": "value.geo.start",
    "value_count": 40473,
    "buckets": [{
      "max": 1,
      "count": 25743
    },{
      "min": 1,
      "max": 2,
      "count": 10252
    },{
      "min": 2,
      "max": 3,
      "count": 4387
    },{
      "min": 3,
      "count": 91
    }]
  }]
}

In this example, 25,743 trips had a start-point within 1 mile of the centroid (45.5 lat, -122.6 lon), and there were only 91 trips with start points more than 5 miles away. The bulls-eye chart on the map would look something like this:

orchestrate-distance-aggregates

Time Series Aggregates

Time series aggregates let you count how many date-time values in your dataset fall within various chronological intervals. Unlike the other kinds of aggregates, which operate upon raw numerical values or geo-locations in your data, time series aggregates operate only upon chronological data.

For example, I could draw a time-series chart showing the number of trips per day, over the past 30 days. Here’s how the query would look:

curl -i https://api.orchestrate.io/v0/trips
        ?query=value.start_time:[2014-11-01 TO 2014-12-01]
        &aggregate=value.start_time:time_series:day

Time series aggregates take one parameter indicating the bucketing interval. The possible values for this parameter are year, quarter, month, week, day, and hour. For this particular query, which uses the “day” bucketing interval, the results would look like this:

{
  "count": 10,
  "total_count": 40473,
  "results": [
    // Result items omitted for brevity
  ],
  "aggregates": [{
    "aggregate_kind": "time_series",
    "field_name": "value.start_time",
    "interval": "day",
    "value_count": 40473,
    "buckets": [{
      "bucket": "2014-11-01",
      "count": 482
    },{
      "bucket": "2014-11-02",
      "count": 309
    },{
      // Buckets omitted for brevity
    },{
      "bucket": "2014-11-29",
      "count": 898
    },{
      "bucket": "2014-11-30",
      "count": 776
    }]
  }]
}

We could plot the results on a time-series chart, which would look like this:

orchestrate-time-aggregates

You Can Use Them Now

All four kinds of aggregates (statistical, range, distance, and time series) are available now on the Orchestrate API.