With the advent of the 'microservices' architecture and the growing adoption of Docker, monolithic applications are being broken up into smaller, independent services. The idea is to keep each service small so that a small group of developers can work on it, upgrade or patch it quickly, and build and release it continuously. Although that vision is promising, it introduces complexity as the number of services grows. With it grows the need to monitor these services around the clock to keep the application healthy.

Overview

Here we look at Prometheus and demonstrate its capabilities by using it to monitor Panamax and its containerized services. Written in Go, Prometheus is an open-source monitoring service and alerting toolkit built at SoundCloud. It offers a variety of features and components that made it interesting for me to evaluate internally at CenturyLink Labs.

Architecture

Prometheus was written from the ground up, based on real use cases and experiences at SoundCloud, designed to tackle real problems faced in real production systems.

prometheus architecture

[Pic Courtesy: prometheus.io]

At the heart of the system is the Prometheus server, backed by a local database. Prometheus is based on a 'pull' mechanism: it scrapes metrics from the configured targets. For short-lived jobs, which may not live long enough to be scraped, it provides an intermediary push gateway that the jobs push their metrics to and that Prometheus then scrapes. It also provides PromDash, a visualization dashboard for the collected data; an Expression Browser with a query language to ease filtering of data; and an AlertManager that sends notifications for alerts triggered by the alert rules engine. You can find more resources on their media page.

Monitoring Panamax

So to give Prometheus a whirl, I decided to monitor Panamax and its services. The goals were:

  • Set up Prometheus
  • Configure it to monitor Panamax services
      - use the existing cAdvisor endpoint exposed by Panamax
      - use the 'container-exporter' provided by Prometheus
  • Run all components as Dockerized services
  • Manage alerts and notifications
      - configure alert rules
      - set up AlertManager to send notifications to Hipchat
  • Visualize and query metrics on the Prometheus GUI
  • Receive notifications on Hipchat

So without further ado, let's get on with it.

Note: I am assuming that you have a working Docker installation and a working installation of Panamax on your machine, if you want to follow along.

Setup & Configuration

The goal was to run Prometheus as a Docker service although it can be installed as a binary from the available releases, or built from source. Luckily, all the Prometheus services are available as Docker images.

Configure Prometheus

In preparation to run Prometheus, we have to create a configuration file named prometheus.yml that allows setting up of jobs and targets for scraping. Create a folder named prometheus and create a new yaml file named prometheus.yml with the contents shown below.

$ mkdir prometheus && cd prometheus && touch prometheus.yml
## prometheus.yml ##

global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.
  evaluation_interval: 15s # By default, evaluate rules every 15 seconds.
  # scrape_timeout is set to the global default (10s).

  # Attach these extra labels to all time-series collected by this Prometheus instance.
  labels:
    monitor: 'panamax-monitor'

rule_files:
  - '/etc/prometheus/alert.rules'

# A scrape configuration containing exactly one endpoint to scrape:
scrape_configs:
# The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.

  # Panamax
  - job_name: 'panamax'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    scrape_timeout: 10s

    target_groups:
      - targets: ['10.0.0.200:3002']
        labels:
          group: 'development'

The global section describes and overrides some defaults. The labels section attaches a specific label to this instance of the Prometheus server. The rule_files section lists all rule files (recording or alert rules) that Prometheus needs to load and process. We will look at the rule file listed here later.

The scrape_configs section describes the job(s) that Prometheus needs to process. In our case, we have a job named panamax, with some config items, including the target_groups sub-section. We add a target that points to the address of the cAdvisor instance running as part of the Panamax application.

Note: The IP address 10.0.0.200 is my Panamax address, also aliased as panamax.local.
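To make the relationship between jobs, targets and labels a little more concrete, here is a minimal Python sketch that walks a hand-written dict mirroring the scrape_configs section and lists what would be scraped. The names CONFIG and scrape_plan are hypothetical helpers of mine, not part of Prometheus, and the sketch assumes the default /metrics path and that the job and instance labels are derived from the job name and the target address.

```python
# A plain-dict mirror of the relevant parts of prometheus.yml.
# (Loading the YAML file itself would need a third-party parser, so the
# structure is written out by hand here.)
CONFIG = {
    "global": {"scrape_interval": "15s"},
    "scrape_configs": [
        {
            "job_name": "panamax",
            "scrape_interval": "5s",
            "target_groups": [
                {
                    "targets": ["10.0.0.200:3002"],
                    "labels": {"group": "development"},
                }
            ],
        }
    ],
}


def scrape_plan(config, metrics_path="/metrics"):
    """Return one (job, url, interval, labels) entry per configured target."""
    global_interval = config["global"]["scrape_interval"]
    plan = []
    for job in config["scrape_configs"]:
        # A job-level scrape_interval overrides the global one.
        interval = job.get("scrape_interval", global_interval)
        for group in job["target_groups"]:
            for target in group["targets"]:
                labels = {"job": job["job_name"], "instance": target}
                labels.update(group.get("labels", {}))
                url = "http://" + target + metrics_path
                plan.append((job["job_name"], url, interval, labels))
    return plan


for job, url, interval, labels in scrape_plan(CONFIG):
    print(job, url, interval, labels)
```

The job and instance labels produced here are the same ones that show up later when querying the collected metrics.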

Configure cAdvisor for Prometheus

If you are already using cAdvisor, versions 0.11.0 and above come with Prometheus integration. Prometheus can leverage the host and container level metrics exposed by cAdvisor. To see the metrics exposed by cAdvisor, go to:

# cAdvisor metrics endpoint
10.0.0.200:3002/metrics

Since we specified the cAdvisor address as the target in the prometheus.yml configuration, Prometheus will automatically look for the /metrics endpoint on that address and scrape the metrics from it.
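The /metrics endpoint returns plain text, one sample per line. As a rough illustration of what a scraper has to do with that text, here is a small Python sketch that parses simple lines into a metric name, a set of labels and a value. The SAMPLE text is made up for illustration (it is not real cAdvisor output), parse_metrics is a hypothetical helper, and the parser deliberately ignores timestamps and escape sequences.

```python
import re

# Hypothetical sample of what a /metrics response might contain; the values
# are made up for illustration.
SAMPLE = """\
# HELP container_memory_usage_bytes Current memory usage in bytes.
# TYPE container_memory_usage_bytes gauge
container_memory_usage_bytes{name="PMX_UI"} 52428800
container_memory_usage_bytes{name="PMX_API"} 104857600
"""

# metric_name, optional {labels}, then the sample value.
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# key="value" pairs inside the braces.
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse simple exposition lines into (name, labels, value) tuples.

    Comment lines (# HELP / # TYPE) and blank lines are skipped. Escaped
    characters inside label values are kept as-is, and timestamps are ignored.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, raw_value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```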

Run container-exporter as a service

If you are looking to capture host and container level metrics, Prometheus also provides a container-exporter that can be run side by side with your other Docker services. Many other exporters and third-party integrations are also provided by Prometheus.

docker run -d --name PROM_CON_EXP \
              -p 9104:9104 \
              -v /sys/fs/cgroup:/cgroup \
              -v /var/run/docker.sock:/var/run/docker.sock \
              prom/container-exporter

In this case, the target_groups section of the prometheus.yml file will have a target that points to the address of the 'container-exporter' like so:

...
     target_groups:
       - targets: ['10.0.0.200:9104']
         labels:
           group: 'development'
...

This tells Prometheus to leverage the host and container level metrics exposed by the 'container-exporter' service. To see the metrics exposed by the 'container-exporter' service, go to:

# 'container-exporter' service metrics endpoint #
10.0.0.200:9104/metrics

Note: You only need one target that exposes metrics for your application. In our case, we are using cAdvisor for collecting host and container level metrics for Panamax.

Setup Alerts and Notifications

Various alert rules can be configured within Prometheus to detect events based on the metrics that Prometheus tracks. To send notifications based on these alerts, the AlertManager component is used. Prometheus is pointed at an AlertManager instance via the alertmanager.url flag when it starts, which enables notifications to be sent when alerts are triggered. To start off, I wanted to set up a simple alert that detects if the Panamax application is down and notifies me in my Hipchat room.

Add alert rules to Prometheus

To configure an alert in Prometheus, we need to create an alert rules file. Create a new text file named alert.rules with the contents shown below.

$ cd prometheus && touch alert.rules
## alert.rules ##

# Alert for any instance that is unreachable for >5 minutes.
ALERT pmx_down
  IF up == 0
  FOR 5m
  WITH {
    severity="page"
  }
  SUMMARY "Instance {{$labels.instance}} down"
  DESCRIPTION "{{$labels.instance}} of job {{$labels.job}} has been down for more than 5 minutes."

Here we are setting up an alert named pmx_down. The IF clause specifies the condition up == 0, and the FOR clause specifies that the alert is triggered only after the condition has remained true for 5m. In other words, if Panamax is down for 5 minutes, this alert will be triggered. The WITH clause attaches an additional label, severity="page", to the alert. The SUMMARY and DESCRIPTION clauses are self-explanatory, but we will soon see that the text in the SUMMARY clause is what gets written as the notification text on Hipchat.
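The interplay between the IF condition and the FOR duration can be sketched in a few lines of Python. This is only a simplified model of the idea, not how Prometheus implements it: alert_states is a hypothetical function, and I am assuming one evaluation per minute purely to keep the example short.

```python
def alert_states(samples, for_seconds=300):
    """Track an `up == 0` alert through inactive -> pending -> firing.

    `samples` is a list of (timestamp_seconds, up_value) pairs in time order.
    Returns a list of (timestamp_seconds, state) pairs, one per sample.
    """
    states = []
    pending_since = None  # when the condition first became true
    for ts, up in samples:
        if up != 0:
            # Condition is false: the alert resets to inactive.
            pending_since = None
            states.append((ts, "inactive"))
            continue
        if pending_since is None:
            pending_since = ts
        # The alert only fires once the condition has held for the FOR duration.
        if ts - pending_since >= for_seconds:
            states.append((ts, "firing"))
        else:
            states.append((ts, "pending"))
    return states


# One evaluation per minute: the target goes down at t=60 and recovers at t=480.
samples = [(t, 1 if t < 60 or t >= 480 else 0) for t in range(0, 600, 60)]
for ts, state in alert_states(samples):
    print(ts, state)
```

In this run the alert stays pending from t=60 and only starts firing at t=360, five minutes after the target first went down.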

Configure AlertManager

Adding an alert as we did above sets up Prometheus to trigger an alert when its conditions are met, but to send notifications, Prometheus relies on the AlertManager component. So, let's set that up so we can send notifications to Hipchat when our alert is triggered. To do so, we need to create a configuration file. Create a new text file named alertmanager.conf with the contents shown below.

$ cd prometheus && touch alertmanager.conf
## alertmanager.conf ##

notification_config {
  name: "alertmanager_hipchat"
  hipchat_config {
    auth_token: "<hipchat_token_here>"
    room_id: 123456
    send_resolved: true
  }
}

aggregation_rule {
  repeat_rate_seconds: 7200
  notification_config_name: "alertmanager_hipchat"
}

We are setting up a notification_config for Hipchat, with some specific keys required by Hipchat. The send_resolved setting is used to trigger an additional notification when the alert condition is 'resolved'. In our case, that would be when the Panamax application is back up. The aggregation_rule sets the repeat_rate_seconds attribute, which configures how often, in seconds, the notification is repeated. In our case, we want the notifications to be repeated every 2 hours while the Panamax application is down. The notifications stop when the alert condition is no longer met or the alert is manually silenced from the Prometheus UI.
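To get a feel for how the repeat interval and send_resolved combine, here is a small Python sketch that lists the notifications one outage would produce. It is a simplified model under my own assumptions (the first notification goes out when the alert fires, repeats follow at a fixed interval, and a single 'resolved' notification closes the sequence), and notification_times is a hypothetical helper, not an AlertManager function.

```python
def notification_times(fired_at, resolved_at, repeat_rate_seconds, send_resolved=True):
    """List (time, kind) notifications for one alert.

    Assumes the first notification goes out when the alert fires, repeats
    follow every `repeat_rate_seconds` while it is still firing, and one
    final "resolved" notification is sent if `send_resolved` is true.
    """
    events = []
    t = fired_at
    while t < resolved_at:
        events.append((t, "firing"))
        t += repeat_rate_seconds
    if send_resolved:
        events.append((resolved_at, "resolved"))
    return events


# A five-hour outage with a two-hour repeat interval (7200 seconds).
for when, kind in notification_times(0, 5 * 3600, 7200):
    print(when, kind)
```

With these inputs the sketch yields three 'firing' notifications (at 0, 2 and 4 hours) followed by one 'resolved' notification at the 5-hour mark.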

Run AlertManager as a service

Now that we have a configuration for the AlertManager, we can run it as a container service, passing in alertmanager.conf via the config.file flag.

docker run -d -p 9093:9093 \
              -v $PWD/alertmanager.conf:/alertmanager.conf \
              prom/alertmanager \
              -config.file=/alertmanager.conf

And, we can see our container running:

CONTAINER ID IMAGE             CREATED         PORTS
bd947de3d58c prom/alertmanager 22 hours ago    0.0.0.0:9093->9093/tcp

Note: We will record the port where the AlertManager is running as we need it in the next section.

Run Prometheus as a service

With prometheus.yml, the metrics endpoint, the alert rules and the AlertManager configuration all set up, we can finally run the Prometheus server as a container service. As soon as the service starts, it will start scraping the metrics and make them available on the Prometheus UI.

docker run -d -p 9090:9090 \
              -v $PWD/prometheus.yml:/etc/prometheus/prometheus.yml \
              -v $PWD/alert.rules:/etc/prometheus/alert.rules \
              prom/prometheus \
              -config.file=/etc/prometheus/prometheus.yml \
              -alertmanager.url=http://192.168.59.103:9093

We expose the Prometheus UI on port 9090 and volume-mount the local prometheus.yml and alert.rules files into /etc/prometheus/ in the container, where Prometheus picks them up. We also pass the configuration file path via the config.file flag and the AlertManager URL via the alertmanager.url flag.

Note: The IP address http://192.168.59.103 is my Docker Host address.

And, we can see our container running:

CONTAINER ID IMAGE           CREATED          PORTS
34af30279267 prom/prometheus 22 hours ago     0.0.0.0:9090->9090/tcp

This completes our setup and configuration, leaving us with the Prometheus server and the AlertManager both running as container services.

Prometheus UI and Querying

Now that we are running Prometheus and scraping metrics off the Panamax application, we can head over to the Prometheus UI to visualize the metrics and query them.

Note: There is a separate component, PromDash, which is a more elaborate Prometheus dashboard that I talk about at the end of the article.

Prometheus UI

The Prometheus UI is available at your Docker Host address on port 9090. Click on the 'Graph' menu item to open the Expression Browser.

PromDash

The above screenshot shows the metrics exposed by cAdvisor that were picked up by Prometheus.

Querying

In the query field, paste the following query and hit 'Execute'. Then click on the 'Graph' tab to see the container_memory_usage_bytes metric visualized for the PMX_UI container.

container_memory_usage_bytes{instance="10.0.0.200:3002",job="panamax", name="PMX_UI"}

PromDash_PMX_UI

Next, click on the 'Add Graph' button, paste the following query and hit 'Execute'. Then click on the 'Graph' tab to see the container_memory_usage_bytes metric visualized for the PMX_API container.

container_memory_usage_bytes{instance="10.0.0.200:3002",job="panamax", name="PMX_API"}

PromDash_PMX_API

Next, click on the 'Add Graph' button, paste the following query and hit 'Execute'. Then click on the 'Graph' tab to see the container_memory_usage_bytes metric visualized for the WP container. The WP container was actually started by Panamax. Here you can see 5 instances of the WP container starting and stopping at different points in time.

container_memory_usage_bytes{instance="10.0.0.200:3002",job="panamax",name="WP"}

PromDash_WP

Note: You can toggle the 'duration' parameter to zoom in/out on the data points across time.
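The part of each query inside the curly braces is a set of label matchers: only series whose labels equal every listed value are returned. The Python sketch below mimics that filtering on a handful of in-memory series. The SERIES data is made up for illustration, and select is a hypothetical helper that only covers exact-match comparisons, not the full query language.

```python
# Hypothetical series, shaped like the ones scraped from cAdvisor; values are made up.
SERIES = [
    ("container_memory_usage_bytes",
     {"instance": "10.0.0.200:3002", "job": "panamax", "name": "PMX_UI"}, 52428800.0),
    ("container_memory_usage_bytes",
     {"instance": "10.0.0.200:3002", "job": "panamax", "name": "PMX_API"}, 104857600.0),
    ("container_memory_usage_bytes",
     {"instance": "10.0.0.200:3002", "job": "panamax", "name": "WP"}, 31457280.0),
]


def select(series, metric, **matchers):
    """Keep series with the given metric name whose labels equal every matcher."""
    return [
        (name, labels, value)
        for name, labels, value in series
        if name == metric
        and all(labels.get(key) == wanted for key, wanted in matchers.items())
    ]


# Roughly: container_memory_usage_bytes{job="panamax",name="WP"}
for name, labels, value in select(SERIES, "container_memory_usage_bytes",
                                  job="panamax", name="WP"):
    print(name, labels, value)
```

Dropping a matcher widens the result: selecting on job="panamax" alone returns all three series in this sample.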

Status

Click on the 'Status' menu item to see the runtime/build information, configuration, rules, targets and startup flags that are active for the Prometheus server.

PromDash_Status

Alerts and Notifications

We had set up an alert in Prometheus and configured notifications to be sent to Hipchat if Panamax was down. Let's test it out.

Triggering Alerts

On the Prometheus UI, go to the 'Alerts' menu, and you will see the pmx_down alert inactive and green in color. If you click on it, you can see the actual alert condition that we set up earlier.

PromDash_Alerts

To trigger this alert we need to shut down Panamax. So, let's do that now by running panamax pause. If you click on the 'Alerts' menu, you will see that the alert has become active, is red in color, and its State shows as firing.

PromDash_AlertFiring

You can also open up the AlertManager at http://192.168.59.103:9093 to see the alerts that have been triggered.

AlertManager

And, view the API endpoint for the AlertManager at http://192.168.59.103:9093/api/alerts.

AlertManager_API

Note: The IP address http://192.168.59.103 is my Docker Host address.

Getting Notifications

We expect to receive a notification on Hipchat, and we do, as shown below. The notifications are repeated every 2 hours until Panamax comes back up.

PromDash_Alert_Hipchat

When Panamax is back up, the alert is deemed resolved and a new notification to that effect is sent to Hipchat. The alert status is inactive and green in color again.

PromDash_Alerts_Hipchat_Res

Update:

Based on a few requests, I have created a docker-compose.yml file in Lorry.io for the deployment of the above setup. Note that the PromDash setup is not included, as it needs some manual setup steps.

Prometheus Dashboard

Prometheus also comes with a graphical dashboard named PromDash. Let's set up PromDash as a container service.

Creating a local SQLite3 database

PromDash needs a database to store its data, so let's create a local file-based SQLite3 database for simplicity.

$ cd prometheus
$ sqlite3
SQLite version 3.8.5 2014-08-15 22:37:57
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> .open promdash.sqlite3
sqlite> .databases
seq name file
--- --------------- ----------------------------------------------------------
0 main /my/path/prometheus/promdash.sqlite3
sqlite> .exit
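If you do not have the sqlite3 command-line shell installed, the same kind of empty database file can be created from Python's standard library. This is only an alternative sketch: create_database is a hypothetical helper, and the example writes to a temporary directory so that it has no side effects, whereas in this walkthrough the file would live in the prometheus folder.

```python
import os
import sqlite3
import tempfile


def create_database(path):
    """Create (or open) a file-based SQLite database and return its attached files."""
    conn = sqlite3.connect(path)
    try:
        # Equivalent of the shell's `.databases` command: (seq, name, file) rows.
        return conn.execute("PRAGMA database_list").fetchall()
    finally:
        conn.close()


# Written to a temporary directory here so the sketch has no side effects.
db_path = os.path.join(tempfile.mkdtemp(), "promdash.sqlite3")
for seq, name, file in create_database(db_path):
    print(seq, name, file)
```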

Now that we have our database created, we need to load the schema into it. Since PromDash is a Rails application, we will just run the db migrations.

docker run -v $PWD:/tmp/prom \
           -e DATABASE_URL=sqlite3:/tmp/prom/promdash.sqlite3 \
           prom/promdash \
           ./bin/rake db:migrate

And, now that the database is all set up, let's run the PromDash UI as a container, on port 4000.

docker run -d -p 4000:3000 \
              -v $PWD:/tmp/prom \
              -e DATABASE_URL=sqlite3:/tmp/prom/promdash.sqlite3 \
              prom/promdash

And, we can see our container running:

CONTAINER ID IMAGE             CREATED         PORTS
ee6275f1b625 prom/promdash     22 hours ago    0.0.0.0:4000->3000/tcp

We can now head over to http://192.168.59.103:4000/ to use the PromDash UI.

Note: The IP address http://192.168.59.103 is my Docker Host address.

Without getting into details, here is what my PromDash UI looks like:

PromDash_Dashboard

Conclusion

In summary, we ran a Prometheus server, configured a metric scraping target, created alerts, ran an AlertManager and enabled notifications to Hipchat. We then used the Prometheus UI and PromDash to visualize the collected data and run queries against it. Finally, we triggered an alert by shutting down our monitored application and received the notifications on Hipchat.

Prometheus is an excellent monitoring service and alerting toolkit that could help you better monitor your applications and their containerized services. We have just scratched the surface in this article, but you should find Prometheus able to handle most monitoring scenarios.