Playing with the python prometheus query API

The last few days have been a bit icky around here, with my house apparently proudly residing in the major city with the dirtiest air in the world. So, I needed a distraction…

It has also been quite hot, so I wondered how my energy usage was going. I have prometheus monitoring of my power draw, so now seemed as good a time as any to learn how to do some historical querying over the API. I ended up with a python script which can output things like this: “Yesterday had a maximum temperature of 38 and we used 28.36 kWh. The average for similar days is 25.56 kWh.”

The code is on github if it is of interest to others. I am sure I could push more of this processing down into the prometheus engine, but I couldn’t see how to do it today. Hints welcome!
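If you want to try something similar, here is a minimal sketch of the pieces. It is not the code on github. It assumes, hypothetically, that power draw is exported in watts as a gauge called power_watts and that prometheus listens on localhost:9090. The sketch builds a request for the /api/v1/query_range endpoint, integrates the returned samples into kilowatt hours, and averages the usage of days with a similar maximum temperature.

```python
# Sketch only: power_watts and localhost:9090 are hypothetical assumptions.
import json
import urllib.parse
import urllib.request


def query_range_url(base, query, start, end, step):
    """Build a URL for the prometheus HTTP API /api/v1/query_range endpoint."""
    params = urllib.parse.urlencode(
        {"query": query, "start": start, "end": end, "step": step})
    return f"{base}/api/v1/query_range?{params}"


def fetch(url):
    """Fetch and decode a JSON API response (needs a running prometheus)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def kwh_from_matrix(response):
    """Integrate watt samples with the trapezoid rule and return kWh.

    Each series in a query_range result holds [timestamp, "value"] pairs,
    with the values encoded as strings. Energy from every series is summed.
    """
    joules = 0.0
    for series in response["data"]["result"]:
        samples = [(float(t), float(v)) for t, v in series["values"]]
        for (t0, w0), (t1, w1) in zip(samples, samples[1:]):
            joules += (w0 + w1) / 2 * (t1 - t0)
    return joules / 3.6e6  # 1 kWh is 3,600,000 joules


def similar_day_average(history, max_temp, tolerance=2.0):
    """Average kWh of (max_temp, kwh) days within tolerance degrees."""
    similar = [kwh for temp, kwh in history if abs(temp - max_temp) <= tolerance]
    return sum(similar) / len(similar) if similar else None
```

As for pushing more of the work down into prometheus, one possibility is a query like avg_over_time(power_watts[1d]) * 24 / 1000. Assuming evenly spaced scrapes, it converts a day's average wattage into kWh inside the engine.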

Prometheus 2.12, query logging, and startup failures on macos

Prometheus v2.12 added active query logging. The basic idea is that there is an mmap-ed JSON file that contains all of the queries currently running. If prometheus were to crash, that file would therefore be a list of the queries running at the time of the crash. Overall, not a bad idea.

Some friends had recently added prometheus to their development environments. This is wired up to grafana dashboards for their microservices, and prometheus is configured to store 14 days’ worth of time series data via a persistent volume from the developer desktops. We did this because it is valuable for the developers to be able to see the history of metrics before and after their changes.

Now we have a developer using macos as their primary development platform, and since prometheus 2.12 it hasn’t worked. Specifically, this developer is using parallels to provide the docker virtual machine on their mac. You can summarise the startup for prometheus in the dev environment like this:

$ docker run ...stuff...
...snip...
level=error ts=2019-09-15T02:20:23.520Z caller=query_logger.go:94 component=activeQueryTracker msg="Failed to mmap" file=/prometheus-data/data/queries.active Attemptedsize=20001 err="invalid argument"
panic: Unable to create mmap-ed active query log

goroutine 1 [running]:
github.com/prometheus/prometheus/promql.NewActiveQueryTracker(0x7fff9917af38, 0x15, 0x14, 0x2a6b7c0, 0xc00003c7e0, 0x2a6b7c0)
	/app/promql/query_logger.go:112 +0x4d2
main.main()
	/app/cmd/prometheus/main.go:361 +0x52bd

And here’s the underlying problem — because of the way the persistent data is mapped into this container (via parallels sharing in this case), the mmap of the active queries file fails and prometheus fails to start.

In other words, since prometheus 2.12 your prometheus data files have to be stored on a filesystem which supports mmap. Additionally, there is no flag to just disable the active query logger.
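You can reproduce the failure outside prometheus with a few lines of python. This sketch mimics what the log above describes, but it is not prometheus' own code. It creates a file, sizes it to the 20001 bytes from the error message, and maps it read/write and shared. The helper name supports_mmap is hypothetical.

```python
# Sketch: check whether a directory supports a shared read/write mmap.
import mmap
import os
import tempfile


def supports_mmap(directory, size=20001):
    """Return True if a size-byte file in directory can be mmap-ed."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.ftruncate(fd, size)
        # On Unix the defaults are MAP_SHARED with read and write access.
        mapping = mmap.mmap(fd, size)
        mapping.close()
        return True
    except OSError:
        # For example EINVAL ("invalid argument") on some shared filesystems.
        return False
    finally:
        os.close(fd)
        os.unlink(path)
```

Running something like supports_mmap("/prometheus-data/data") inside the container should tell you whether the volume is usable before prometheus tries.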

So how do we work around this? Well, here’s a horrible workaround: in the data directory that is volume mapped into the container, create a symlink to a path that is mmap-able inside the docker container, even if that path doesn’t exist outside the container. For example, given that we store the prometheus time series at $CONFIG/prometheus-data:

$ ln -s /tmp/queries.active "$CONFIG/prometheus-data/queries.active"

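Note that /tmp/queries.active does not exist on the developer’s mac. The symlink is only resolved inside the container, where /tmp is on the container’s own filesystem rather than the parallels share. Prometheus now starts and it’s puppies and kittens the whole way down.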

A pythonic example of recording metrics about ephemeral scripts with prometheus

In my previous post we talked about how to record information from short lived scripts (I call them ephemeral scripts by the way) with prometheus. The example there was a script which checked the SMART status of each of the disks in a machine and reported that via pushgateway. I now want to work through a slightly more complicated example.

I think you hit the limits of reporting simple values in shell scripts via curl requests fairly quickly. For example with the SMART monitoring script, SMART is capable of returning a whole heap of metrics about the performance of a disk, but we boiled that down to a single “health” value. This is largely because writing a parser for all the other values that smartctl returns would be inefficient and fragile in shell. So for this post, we’re going to work through an example of how to report a variety of values from a python script. Those values could be the parsed output of smartctl, but to mix things up a bit, I’m going to use a different script I wrote recently.
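To show why python makes this easier, here is a sketch of a parser for the attribute table printed by smartctl -A. It assumes the usual ATA layout of ten columns: ID, name, flag, value, worst, threshold, type, updated, when_failed and raw value. NVMe disks print something different, and this sketch does not handle them.

```python
# Sketch: turn `smartctl -A` attribute rows into {name: raw value}.
import re


def parse_smart_attributes(output):
    """Return {attribute_name: raw_value} for each attribute row."""
    attributes = {}
    for line in output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        # Raw values can carry extra text, e.g. "36 (Min/Max 20/45)".
        match = re.match(r"\d+", fields[9])
        if match:
            attributes[fields[1].lower()] = int(match.group())
    return attributes
```

Each resulting name could then become a gauge such as smart_temperature_celsius.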

This new script uses the Weather Underground API to look up weather stations near my house, and then generate graphics of the weather forecast. These graphics are displayed on the various Cisco SIP phones I already had around the house. The forecasts look like this:

The script to generate these weather forecasts is relatively simple python, and you can see the source code on github.

My cunning plan here is to use prometheus’ time series database and alert capabilities to drive home automation around my house. The first step for that is to start gathering some simple facts about the home environment so that we can do trending and decision making on them. The code to do this isn’t all that complicated. First off, we need to add the python prometheus client to our python environment, which is hopefully a venv:

pip install prometheus_client
pip install six

That second dependency isn’t a strict requirement for prometheus, but the script I’m working on needs it (because it needs to work out what’s a text value, and python 3 is bonkers).

Next we import the prometheus client in our code and set up the collector registry. At the same time I record when the script was run:

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
Gauge('job_last_success_unixtime', 'Last time the weather job ran',
      registry=registry).set_to_current_time()

And then we just add gauges for any values we want to send to the pushgateway:

# field is a sequence of name parts, e.g. ('temperature', 'outside'); value is a number
Gauge('_'.join(field), '', registry=registry).set(value)
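Where do field and value come from? One way to produce them is to flatten nested readings into tuples of name parts. This is a sketch, and the input shape is hypothetical rather than the weather script's real data. Text values are skipped because a gauge needs a number, which is the same problem the six dependency helps with in the real script.

```python
# Sketch: flatten nested readings into (field, value) pairs for gauges.
import re


def flatten(readings, prefix=()):
    """Yield (field, value) where field is a tuple of metric name parts."""
    for key, value in readings.items():
        # Metric names only allow letters, digits and underscores here.
        field = prefix + (re.sub(r"[^a-zA-Z0-9_]", "_", str(key)),)
        if isinstance(value, dict):
            yield from flatten(value, field)
        elif isinstance(value, (int, float)):
            yield field, value
        # Anything else (strings, lists, None) cannot be a gauge value.
```

Usage would be a loop such as: for field, value in flatten(readings): Gauge('_'.join(field), '', registry=registry).set(value)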

Finally, the values don’t exist in the pushgateway until we actually push them there, which we do like this:

push_to_gateway('localhost:9091', job='weather', registry=registry)

You can see the entire patch I wrote to add prometheus support on github if you’re interested in an example with more context.

Now we can have pretty graphs of temperature and stuff!

Recording performance information from short lived processes with prometheus

Now that I’m recording basic statistics about the behavior of my machines, I want to start tracking some statistics from various scripts I have lying around in cron jobs. In order to make myself sound smarter, I’m going to call these short lived scripts “ephemeral scripts” throughout this document. You’re welcome.

The promethean way of doing this is to have a relay process. Prometheus really wants to know where to find web servers to learn things from, and my ephemeral scripts are both not permanently around and also not running web servers. Luckily, prometheus has a thing called the pushgateway which is designed to handle this situation. I can run just one of these, and then have all my little scripts just tell it things to add to its metrics. Then prometheus regularly scrapes this one process and learns things about those scripts. It’s like a game of Telephone, but for processes really.

First off, let’s get the pushgateway running. This is basically the same as the node_exporter from last time:

$ wget https://github.com/prometheus/pushgateway/releases/download/v0.3.1/pushgateway-0.3.1.linux-386.tar.gz
$ tar xvzf pushgateway-0.3.1.linux-386.tar.gz
$ cd pushgateway-0.3.1.linux-386
$ ./pushgateway

Let’s assume once again that we’re all adults and did something nicer than that involving configuration management and init scripts.

The pushgateway implements a relatively simple HTTP protocol to add values to the metrics that it reports. Note that the values won’t change once set until you change them again; they’re not garbage collected or aged out or anything fancy. Here’s a trivial example of adding a value to the pushgateway:

echo "some_metric 3.14" | curl --data-binary @- http://pushgateway.example.org:9091/metrics/job/some_job

This is stolen straight from the pushgateway README of course. The above command will have the pushgateway start to report a metric called “some_metric” with the value “3.14”, for a job called “some_job”. In other words, we’ll get this in the pushgateway metrics URL:

# TYPE some_metric untyped
some_metric{instance="",job="some_job"} 3.14
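The same push can be done from python with only the standard library. This sketch builds the request without sending it, so you can see the pieces. The path is /metrics/job/<job>, optionally followed by /instance/<instance> to set the instance label. The helper names are hypothetical.

```python
# Sketch: the curl push, built with urllib instead.
import urllib.parse
import urllib.request


def push_request(gateway, job, body, instance=None, method="POST"):
    """Build (but do not send) a request pushing text-format metrics."""
    path = "/metrics/job/" + urllib.parse.quote(job, safe="")
    if instance is not None:
        path += "/instance/" + urllib.parse.quote(instance, safe="")
    return urllib.request.Request(
        gateway + path, data=body.encode("utf-8"), method=method)


def push(request):
    """Send the request; this needs a pushgateway listening on the URL."""
    with urllib.request.urlopen(request) as response:
        return response.status
```

POST is what curl --data-binary sends. PUT, which the python client's push_to_gateway uses, replaces every metric in the group instead of only those with the same names. The body should end with a newline, as echo's output does.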

You can see that this isn’t perfect because the metric is untyped (what types exist? we haven’t covered that yet!), and has these confusing instance and job labels. One tangent at a time, so let’s explain instances and jobs first.

On jobs and instances

Prometheus is built for a universe a little bit unlike my home lab. Specifically, it expects there to be groups of processes doing a thing instead of just one. This is especially true because it doesn’t really expect things like the pushgateway to be proxying your metrics for you because there is an assumption that every process will be running its own metrics server. This leads to some warts, which I’ll explain in a second. Let’s start by explaining jobs and instances.

For a moment, assume that we’re running the world’s most popular wordpress site. The basic architecture for our site is web frontends which run wordpress, and database servers which store the content that wordpress is going to render. When we first started our site it was all easy, as they could both be on the same machine or cloud instance. As we grew, we were first forced to split apart the frontend and the database into separate instances, and then forced to scale those two independently — perhaps we have reasonable database performance so we ended up with more web frontends than we did database servers.

So, we go from something like this:

To an architecture which looks a bit like this:

Now, in prometheus (i.e. google) terms, there are three jobs here. We have web frontends, database masters (the top one which is getting all the writes), and database slaves (the bottom one which everyone is reading from). For one of the jobs, the frontends, there is more than one instance of the job. To put that into pictures:

So, the topmost frontend job would be job="fe" and instance="0". Google also had a cool way to look up jobs and instances via DNS, but that’s a story for another day.

To harp on a point here, all of these processes would be running a web server exporting metrics in google land. That means that prometheus would know that it’s monitoring a frontend job because it would be listed in the configuration file as such. You can see this in the configuration file from the previous post. Here’s the relevant snippet again:

  - job_name: 'node'
    static_configs:
      - targets: ['molokai:9100', 'dell:9100', 'eeebox:9100']

The job “node” runs on three targets (instances), named “molokai:9100”, “dell:9100”, and “eeebox:9100”.

However, we live in the ghetto for these ephemeral scripts and want to use the pushgateway for more than one such script, so we have to tell lies via the pushgateway. So for my simple ephemeral script, we’ll tell the pushgateway that the job is the script name and the instance can be an empty string. If we don’t do that, then prometheus will think that the metric relates to the pushgateway process itself, instead of the ephemeral process.

We tell the pushgateway what job and instance to use like this:

echo "some_metric 3.14" | curl --data-binary @- http://localhost:9091/metrics/job/frontend/instance/0

Now we’ll get this at the metrics URL:

# TYPE some_metric untyped
some_metric{instance="",job="some_job"} 3.14
some_metric{instance="0",job="frontend"} 3.14

The first metric there is from our previous attempt (remember when I said that values are never cleared out automatically?), and the second one is from our second attempt. To clear out values you’ll need to either send an HTTP DELETE to the same URL (for example curl -X DELETE http://localhost:9091/metrics/job/some_job) or restart the pushgateway process. For simple ephemeral scripts, I think it’s ok to leave the instance empty and just set a job name, as long as that job name is globally unique.

We also need to tell prometheus to believe our lies about the job and instance for things reported by the pushgateway. The scrape configuration for the pushgateway therefore ends up looking like this:

  - job_name: 'pushgateway'
    honor_labels: true
    static_configs:
      - targets: ['molokai:9091']

Note the honor_labels there: that’s the believing-the-lies bit.
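To make the lies part concrete, here is a simplified python model of how prometheus settles a clash between labels in the scraped data and labels it attaches to the target itself. It follows the documented behaviour but is not prometheus' code, and it ignores corner cases such as an exported_ label that already exists. With honor_labels false, the scraped value is kept under exported_<name>. With honor_labels true, the scraped value simply wins.

```python
# Sketch: a simplified model of honor_labels conflict resolution.


def merge_labels(scraped, target, honor_labels):
    """Combine labels from a scraped sample with the scrape target's labels."""
    merged = dict(scraped)
    for name, value in target.items():
        if name not in merged:
            merged[name] = value
        elif not honor_labels:
            # Keep the scraped value under a new name; the target label wins.
            merged["exported_" + name] = merged[name]
            merged[name] = value
        # With honor_labels true the scraped value is left untouched.
    return merged
```

Without honor_labels, every pushed metric would claim to come from job="pushgateway", which is exactly what we are trying to avoid.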

There is one thing to remember here before we can move on. Job names are being blindly trusted from our reporting. So, it’s now up to us to keep job names unique. If we export a metric on every machine, we might want to keep the job name specific to the machine. That said, it really depends on what you’re trying to do, so just pay attention when picking job and instance names.

On metric types

Prometheus supports several types for the metrics which are exported. For now we’ll focus on the first two, and cover the third properly later. The types are:

  • Gauge: a value which goes up and down over time, like the fuel gauge in your car. Non-motoring examples would include the amount of free disk space on a given partition, the amount of CPU in use, and so forth.
  • Counter: a value which only ever increases. This might be something like the number of bytes sent by a network card; the value only resets when the network card is reset (probably by a reboot). These only-increasing types are valuable because it’s easier to do maths on them in the monitoring system.
  • Histogram: a set of values broken into buckets. For example, the response time for a given web page would probably be reported as a histogram. We’ll discuss histograms in more detail in a later post.

I don’t really want to dig too deeply into the value types right now, apart from explaining that our previous examples haven’t specified a type for the metrics being provided, and that this is undesirable. For now we just need to decide if the value goes up and down (a gauge) or just up (a counter). You can read more about prometheus types at https://prometheus.io/docs/concepts/metric_types/ if you want to.
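Why is a counter easier to do maths on? Because any drop in its value must be a reset, you can still work out how much it grew over a period. Here is a small python sketch of that idea, loosely mirroring what PromQL's increase() does but without its extrapolation. A gauge can't be treated this way, since a drop there is real.

```python
# Sketch: total growth of a counter across a series of readings.


def counter_increase(samples):
    """Total increase across counter readings, allowing for resets."""
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            # The counter reset to zero, so everything since then is new.
            total += current
    return total
```

For example, readings of 100, 150, 200, then 20 after a reboot, then 50 give a total increase of 150.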

A typed example

So now we can go back and do the same thing as before, but we can do it with typing like adults would. Let’s assume that the value of pi is a gauge, and goes up and down depending on the vagaries of space time. Let’s also show that we can add a second metric at the same time because we’re fancy like that. We’d therefore need to end up doing something like (again heavily based on the contents of the README):

cat <<EOF | curl --data-binary @- http://pushgateway.example.org:9091/metrics/job/frontend/instance/0
# TYPE some_metric gauge
# HELP some_metric Approximate value of pi in the current space time continuum
some_metric 3.14
# TYPE another_metric counter
# HELP another_metric Just an example.
another_metric 2398
EOF

And we’d end up with values like this in the pushgateway metrics URL:

# HELP some_metric Approximate value of pi in the current space time continuum
# TYPE some_metric gauge
some_metric{instance="0",job="frontend"} 3.14
# HELP another_metric Just an example.
# TYPE another_metric counter
another_metric{instance="0",job="frontend"} 2398
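If you'd rather generate that text than hand-write heredocs, a small python function can do it. This is a sketch with a hypothetical helper name and input layout. The output follows the text format used above, where both the HELP and TYPE lines name the metric they describe.

```python
# Sketch: render typed metrics in the prometheus text format.


def render(metrics):
    """metrics is a list of (name, type, help, value) tuples."""
    lines = []
    for name, metric_type, help_text, value in metrics:
        # Backslashes and newlines are the characters HELP text must escape.
        escaped = help_text.replace("\\", "\\\\").replace("\n", "\\n")
        lines.append(f"# HELP {name} {escaped}")
        lines.append(f"# TYPE {name} {metric_type}")
        lines.append(f"{name} {value}")
    # The format requires the last line to end with a newline.
    return "\n".join(lines) + "\n"
```

The resulting string can be piped to curl exactly like the heredoc.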

A tangible example

So that’s a lot of talking. Let’s deploy this in my home lab for something actually useful. The node_exporter does not report any SMART health details for disks, and that’s probably a thing I’d want to alert on. So I wrote this simple script:

#!/bin/bash

# Short hostname, e.g. "molokai" rather than "molokai.example.com".
hostname=$(hostname | cut -f 1 -d ".")

for device in /dev/sd[a-z]
do
  disk=$(basename "$device")
  # Run smartctl once and inspect its output twice.
  health=$(/usr/sbin/smartctl -H "$device")

  # Is this a USB thumb drive? Those don't report SMART health, so count them as passing.
  if echo "$health" | grep -q "Unknown USB bridge"
  then
    result=1
  else
    result=$(echo "$health" | grep -c "overall-health self-assessment test result: PASSED")
  fi

  # The pushed text must not be indented, and HELP must name the metric.
  cat <<EOF | curl --data-binary @- "http://localhost:9091/metrics/job/$hostname/instance/$disk"
# TYPE smart_health_passed gauge
# HELP smart_health_passed Whether or not a disk passed "smartctl -H /dev/sdX"
smart_health_passed $result
EOF
done

Now, that’s not perfect and I am sure that I’ll re-write this in python later, but it is actually quite useful already. It will report if a SMART health check failed, and now I could write an alerting rule which looks for disks with a health value of 0 and send myself an email to go to the hard disk shop. Once your pushgateways are being scraped by prometheus, you’ll end up with something like this in the console:

I’ll explain how to turn this into alerting later.
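For when that python rewrite happens, here is a sketch of how it might be structured. It is not code from this post. The smartctl path matches the shell script, and the helper names are hypothetical. Splitting out the output check makes it testable without real disks. The results could then be pushed with the client library's push_to_gateway as in the weather example.

```python
# Sketch: SMART health checks in python instead of shell.
import glob
import os
import subprocess


def health_from_output(output):
    """1 if smartctl reported PASSED or the disk is a USB stick, else 0."""
    if "Unknown USB bridge" in output:
        return 1  # USB thumb drives don't report SMART health; don't alert.
    return int("overall-health self-assessment test result: PASSED" in output)


def check_disks(smartctl="/usr/sbin/smartctl"):
    """Return {disk name: health} for every /dev/sd[a-z] device."""
    results = {}
    for device in sorted(glob.glob("/dev/sd[a-z]")):
        completed = subprocess.run(
            [smartctl, "-H", device], capture_output=True, text=True)
        results[os.path.basename(device)] = health_from_output(completed.stdout)
    return results
```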

Basic prometheus setup

I’ve been playing with prometheus for monitoring. It feels quite familiar to me because it’s based on an internal google technology called borgmon, but I suspect that means it feels really weird to everyone else.

The first thing to realize is that everything at google is a web server. Your short lived tool that copies some files around probably runs a web server. All of these web servers have built in URLs which report the progress and status of the task at hand. Prometheus is built to: scrape those web servers; aggregate the data; store the data into a time series database; and then perform dashboarding, trending and alerting on that data.
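To make "everything is a web server" concrete, here is about the smallest possible python process that exposes a /metrics URL, along with a scrape that fetches it the way prometheus would. It uses only the standard library and is purely illustrative. The metric name demo_uptime_seconds is made up, and real exporters would use a client library.

```python
# Sketch: a tiny /metrics web server plus a scrape of it.
import http.server
import threading
import time
import urllib.request

START = time.time()


class MetricsHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = (
            "# HELP demo_uptime_seconds Seconds since this process started.\n"
            "# TYPE demo_uptime_seconds gauge\n"
            f"demo_uptime_seconds {time.time() - START}\n"
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def start_in_background(host="127.0.0.1", port=0):
    """Serve on a daemon thread; port 0 asks the OS for a free port."""
    server = http.server.HTTPServer((host, port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def scrape(url):
    """Fetch a metrics page, which is all a prometheus scrape really is."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")
```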

The most basic example is to just export metrics for each machine on my home network. This is the easiest first step, because we don’t need to build any software to do this. First off, let’s install node_exporter on each machine. node_exporter is the tool which runs a web server to export metrics for each node. Everything in prometheus land is written in go, which is new to me. However, it does make running node exporter easy — just grab the relevant binary from https://prometheus.io/download/, untar, and run. Let’s do it in a command line script example thing:

$ wget https://github.com/prometheus/node_exporter/releases/download/v0.14.0-rc.1/node_exporter-0.14.0-rc.1.linux-386.tar.gz
$ tar xvzf node_exporter-0.14.0-rc.1.linux-386.tar.gz
$ cd node_exporter-0.14.0-rc.1.linux-386
$ ./node_exporter

That’s all it takes to run the node_exporter. This runs a web server at port 9100, which exposes the following metrics:

$ curl -s http://localhost:9100/metrics | grep filesystem_free | grep 'mountpoint="/data"'
node_filesystem_free{device="/dev/mapper/raidvg-srvlv",fstype="xfs",mountpoint="/data"} 6.811044864e+11

Here you can see that the system I’m running on is exporting a filesystem_free value for the filesystem mounted at /data. There’s a lot more than that exported, and I’d encourage you to poke around at that URL a little before continuing on.
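The curl | grep pipeline above can also be done in python. This is a sketch of a parser for simple sample lines of the form name{label="value",...} number. It ignores escaped quotes in label values and any trailing timestamps, so it is not a complete implementation of the text format. The helper names are hypothetical.

```python
# Sketch: parse and filter sample lines from a /metrics page.
import re

SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_samples(text):
    """Yield (name, labels, value) for each sample line in the text."""
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # HELP and TYPE comments
        match = SAMPLE.match(line)
        if match:
            name, labels, value = match.groups()
            yield name, dict(LABEL.findall(labels or "")), float(value)


def select_samples(text, name, **matchers):
    """Samples with this metric name whose labels equal every matcher."""
    return [
        (labels, value)
        for sample_name, labels, value in parse_samples(text)
        if sample_name == name
        and all(labels.get(k) == v for k, v in matchers.items())
    ]
```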

So that’s lovely, but we really want to record that over time. So let’s assume that you have one of those running on each of your machines, and that you have it set up to start on boot. I’ll leave the details of that out of this post, but let’s just say I used my existing puppet infrastructure.

Now we need the central process which collects and records the values. That’s the actual prometheus binary. Installation is again trivial:

$ wget https://github.com/prometheus/prometheus/releases/download/v1.5.0/prometheus-1.5.0.linux-386.tar.gz
$ tar xvzf prometheus-1.5.0.linux-386.tar.gz
$ cd prometheus-1.5.0.linux-386

Now we need to move some things around to install this nicely. I did the puppet equivalent of:

  • Moving the prometheus file to /usr/bin
  • Creating an /etc/prometheus directory and moving console_libraries and consoles into it
  • Creating an /etc/prometheus/prometheus.yml config file, more on the contents of this one in a second
  • And creating an empty data directory, in my case at /data/prometheus

The config file needs to list all of your machines. I am sure this could be generated with puppet templating or something like that, but for now here’s my simple hard coded one:

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
      monitor: 'stillhq'

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first.rules"
  # - "second.rules"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ['molokai:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['molokai:9100', 'dell:9100', 'eeebox:9100']

Here you can see that I want to scrape every web server which exports metrics every 15 seconds, and I also want to evaluate rules (such as alerting rules) every 15 seconds. This might not scale if you have bajillions of processes or machines to monitor. I also attach a monitor label naming my domain to anything sent to external systems, so that if I ever aggregate these values with another prometheus from somewhere else the origin will be clear.

The other interesting bit for now is the scrape configuration. This lists the metrics exporters to monitor. In this case it’s prometheus itself (molokai:9090), and then each of my machines in the home lab (molokai, dell, and eeebox, all on port 9100). Remember, port 9090 is the prometheus binary itself and port 9100 is that node_exporter binary we now have running on all of our machines.
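The hard coded config could also be generated. Here is a sketch in python as a stand-in for the puppet templating I mentioned. It produces the same scrape_configs section as the file above from a list of hosts. Plain string building keeps it dependency free, and the function name is hypothetical.

```python
# Sketch: build the scrape_configs section from a list of hosts.


def scrape_configs(prometheus_host, node_hosts, node_port=9100):
    """Return scrape_configs YAML for prometheus itself and node_exporter."""
    targets = ", ".join(f"'{host}:{node_port}'" for host in node_hosts)
    return (
        "scrape_configs:\n"
        "  - job_name: 'prometheus'\n"
        "    static_configs:\n"
        f"      - targets: ['{prometheus_host}:9090']\n"
        "\n"
        "  - job_name: 'node'\n"
        "    static_configs:\n"
        f"      - targets: [{targets}]\n"
    )
```

Calling scrape_configs("molokai", ["molokai", "dell", "eeebox"]) yields the two jobs shown in my config.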

Now if we start prometheus, it will do its thing. There is some configuration which needs to be passed on the command line here (instead of in the configuration file), so my command line looks like this:

/usr/bin/prometheus -config.file=/etc/prometheus/prometheus.yml \
    -web.console.libraries=/etc/prometheus/console_libraries \
    -web.console.templates=/etc/prometheus/consoles \
    -storage.local.path=/data/prometheus

Prometheus also presents an interactive user interface on port 9090, which is handy. Here’s an example of it graphing the load average on each of my machines (it was something which caused a nice jaggy line):

You can see here that the user interface has a drop down for selecting values that are known, and that the key at the bottom tells you things about each time series in the graph. So for example, if we added {instance="eeebox:9100"} to the end of the value in the text box at the top, then we’d be filtering for values with that label set, and would as a result only show one value in the graph (the one for eeebox).

If you’re interested in very simple dashboarding of basic system metrics, that’s actually all you need to do. In my next post about prometheus I’m going to show how to write your own binary which exports values to be graphed. In my case, the temperature outside my house.