Blog

Prometheus and DST

Prometheus only deals with UTC. It does not even try to do anything else. But when you want to compare your business metrics with your usual traffic, you need to take DST into account.

Here is my take on the problem. Note that I am in TZ=Europe/Brussels, where DST ended on October 29.

Let’s say that I want to compare one metric with the same metric 2 weeks ago. In this example, the metric would be rate(app_requests_count{env="prod"}[5m]). If it is the 1st of December, we need to look back 14 days. But if it is the 1st of November, we need to look back 14 days + 1 hour, because the DST change happened on October 29.
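
A quick sanity check with GNU date, comparing local noon on both sides of the change:

$ a=$(TZ=Europe/Brussels date -d '2017-11-01 12:00' +%s)
$ b=$(TZ=Europe/Brussels date -d '2017-10-18 12:00' +%s)
$ echo $(( (a - b) / 3600 ))
337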

To achieve that, we will take advantage of Prometheus recording rules and functions. This example is based on Prometheus 2.0.

First, I set up a recording rule that tells me when I need to add an extra hour:

- record: dst
  expr: |
    0*(month() < 12) + 0*(day_of_month() < 13) + 0*(year() == 2017)
  labels:
    when: 14d

That metric dst{when="14d"} will be 0 until the 13th of November, and will have no value afterwards. If you really care, you can play with the hour() function as well.
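
You can check what that rule currently evaluates to through the HTTP API (assuming Prometheus listens on localhost:9090):

$ curl -sG 'http://localhost:9090/api/v1/query' --data-urlencode 'query=dst{when="14d"}'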

Then, I create a second rule with two different offsets and an or. Note that within a recording group, Prometheus evaluates the rules sequentially.

- record: app_request_rate
  expr: |
    (
      sum(dst{when="14d"})
      + (
         sum(
          rate(
           app_requests_count{env="prod"}[5m]
           offset 337h
          )
         )
         or vector(0)
        )
    )
    or
    (
     sum(
      rate(
       app_requests_count{env="prod"}[5m]
       offset 14d)
      )
      or vector(0)
    )
  labels:
    when: 14d
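
Note that in Prometheus 2.0 these rules have to live in a rule file under the groups:/rules: structure. Assuming they are saved in rules.yml, promtool can validate them:

$ promtool check rules rules.yml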

Let’s analyze this.

The recording rule is split into two parts by an or:

    (
      sum(dst{when="14d"})
      + (
         sum(
          rate(
           app_requests_count{env="prod"}[5m]
           offset 337h
          )
         )
         or vector(0)
        )
    )
    (
     sum(
      rate(
       app_requests_count{env="prod"}[5m]
       offset 14d)
      )
      or vector(0)
    )

If the first part does not return any value, then we get the second part.

The second part is easy, so let’s start with it:

  • We sum the 5-minute rates of app_requests_count, env="prod", 14 days ago.
  • If we get no metrics (e.g. Prometheus was down), we get 0.

The first part is, however, a bit more complex. Part of it is like the second part, but with an offset of 14d + 1h (337h).

Now, to detect whether we need the first or the second offset, we add sum(dst{when="14d"}) to the first part. When we need to add the extra hour, the value of sum(dst{when="14d"}) is 0, so the addition changes nothing. Otherwise, dst has no value, so the whole first part has no value either, and Prometheus falls back to the second part of the rule.

Note: in this rule, the sum in sum(dst{when="14d"}) is there to remove the labels and allow the + operation.

It is a bit tricky, but it should do the job. In the future, I think I will also create recording rules for day_of_month(), month() and year(), so that I can apply an offset to their values.

I will probably revisit this in March 2018…

Permalink. Category: monitoring. Tags: prometheus.
First published on Thu 9 November 2017.

FoxyProxy 5.0 is BAD

FoxyProxy devs have deployed a new major release of their addon. It ignores your current config; you need to create a new one. There is no migration path.

Fortunately, the old config is still on disk. Simply disable the addon auto-update and install the 4.6.5 release to get back to work.

I really don’t get why the devs have broken such a useful extension…

Permalink. Category: firefox. Tags: internet.
First published on Wed 6 September 2017.

Custom Sudo command with Ansible

While introducing Ansible at a customer, I noticed the following problem when using the Ansible become feature:

sudo: unable to create /var/log/sudo-io/170718-124212: File exists
sudo: error initializing I/O plugin sudoers_io

That was because the sudoers configuration contains:

Defaults syslog=auth,log_input,log_output,iolog_file=%y%m%d-%H%M%S

This means that sudo sessions are logged, but also that there can only be one sudo session per second. I discussed with colleagues what the best way to address this would be; several options are possible: not logging, logging with seq (Defaults:ansible iolog_file=ansible/%{seq}), logging microseconds, … In the meantime, I implemented a workaround.
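
The collision is easy to reproduce by hand: with that iolog_file format, two sudo sessions started within the same second end up with the same log name, and the second one fails:

$ sudo true; sudo true
sudo: unable to create /var/log/sudo-io/170718-124212: File exists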

What is interesting is that it uses an undocumented feature: become_exe.

I dropped an ansible.cfg file in the Ansible repository with the following content:

[privilege_escalation]
become_exe = /bin/sleep 1 && /bin/sudo

And, magically, Ansible now waits one second before each sudo command. Enough to let me continue working while looking for the best resolution.
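
A quick way to confirm the workaround is to time an ad-hoc become task (all is just an example host pattern); each sudo invocation is now preceded by a one-second sleep, so two sessions can no longer start within the same second:

$ time ansible all -b -a 'true'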

Permalink. Category: automation. Tags: linux security ansible planet-inuits.
First published on Thu 20 July 2017.

A link between DigitalOcean, Packer and Terraform

In the coming months, I will run several workshops about Jenkins. Those workshops will be hands-on, so people will bring their own laptops to hack on Jenkins instances.

For that, I expect lots of them will simply run Jenkins on their laptops, or on an external machine. But, as a backup, or as a primary solution, I plan to provision multiple instances of Jenkins, ready to be used, in the cloud.

Amongst the cloud providers, DigitalOcean is great for this purpose: while it is feature-limited, it is simple. Its pricing model is simple too.

In the cloud, you run images. Since I will spin up a lot of them, I want to have those images ready, so I just need to boot them and configure them. The tool for that is Packer. Packer builds images for DigitalOcean, Docker, AWS, VirtualBox… It has a great list of builders.

That part was easy. I got my snapshots built quickly.

Then enters Terraform. Terraform can deploy the snapshots built by Packer. Just like this:

resource "digitalocean_droplet" "jenkins" {
  image = "1028374"
  count  = "${var.count}"
  name = "${format("jenkins%02d", count.index + 1)}"
  ssh_keys = [ "${var.do_ssh_key}" ]
  region = "${var.do_datacenter}"
  private_networking = "true",
  size = "512mb"
}

That configuration works but has a problem: the image parameter is an ID. For snapshots, you cannot use names or slugs; you have to use numeric IDs. This is very annoying, because I do not want the image 1028374, I want the image “jenkins-1.0.0”, which is the name I defined in Packer.
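
For reference, the name-to-ID lookup I actually want looks like this with doctl; the workaround below automates it:

$ doctl compute snapshot list 'jenkins-1.0.0' -o json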

The workaround

Like lots of problems with open-source software, this one has a workaround. Terraform has an external data source plugin that can run any script. Here is mine:

#!/bin/bash -e
set -o pipefail
# Terraform passes the query as a JSON object on stdin; jq's @sh builds
# a properly quoted doctl command line from it, which eval then runs.
snapshots="$(eval $(jq -r '@sh "doctl -t=\(.api_token) compute snapshot list \(.id) -o json"'))"
# The name must match exactly one snapshot.
nr="$(echo "$snapshots"|jq '.|length')"
if [[ "$nr" -ne 1 ]]
then
    echo "Expected 1 snapshot, found $nr" >&2
    exit 1
fi
# Return the snapshot ID as the JSON object the external data source expects.
id="$(echo "$snapshots"|jq -r .[0].id)"
jq -r -n --arg id "$id" '{"id":$id}'
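
The external data source protocol passes the query as a JSON object on the script’s stdin and expects a JSON object on stdout, so the script is easy to test by hand (the token here is a placeholder):

$ echo '{"id":"jenkins-1.0.0","api_token":"XXXX"}' | ./do_snapshot_id
{"id":"1028374"}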

In Terraform:

data "external" "jenkins_snapshot" {
  program = ["./do_snapshot_id"]
  query {
    id = "jenkins-${var.do_jenkins_droplet_version}"
    api_token = "${var.do_api_token}"
  }
}

resource "digitalocean_droplet" "jenkins" {
  image = "${data.external.jenkins_snapshot.result.id}"
  count  = "${var.count}"
  name = "${format("jenkins%02d", count.index + 1)}"
  ssh_keys = [ "${var.do_ssh_key}" ]
  region = "${var.do_datacenter}"
  private_networking = "true",
  size = "512mb"
}

That is already better. I can now use the same name as the one I use in Packer. No more meaningless IDs.

But this approach has multiple problems:

  • Bash scripting is unreadable; it is just 10 lines, but it is a mess.
  • It introduces jq and doctl as dependencies, so it will not work for everyone.
  • It hands the DigitalOcean credentials to multiple processes. I do not like that.

The Solution

I ended up spending some time on an implementation in Go. It is now merged into Terraform, which means I had to write documentation and tests, so everyone can use it now. Here is the DigitalOcean image data source:

data "digitalocean_image" "jenkins" {
  name = "jenkins-${var.do_jenkins_droplet_version}"
}

resource "digitalocean_droplet" "jenkins" {
  image  = "${data.digitalocean_image.jenkins.image}"
  count  = "${var.count}"
  name = "${format("jenkins%02d", count.index + 1)}"
  ssh_keys = [ "${var.do_ssh_key}" ]
  region = "${var.do_datacenter}"
  private_networking = "true",
  size = "512mb"
}

It is a lot better: no external dependencies, no bash scripts, and it is available to everyone. It is also shorter and easier to understand.

I hope you will enjoy it. It will be available in the next release of Terraform (0.9.4).

Permalink. Category: cloud. Tags: linux hashicorp.
First published on Fri 21 April 2017.

Running Nightly jobs with Jenkins

Jenkins can spread the load of jobs by using H instead of * in the cron fields. For example:

H 3 * * *

Means: run every day between 3 and 4 am.

The minute will be decided by Jenkins, by applying a hash function over the job name.

What about this one:

H H * * *

Means: run once a day. The moment will be calculated by Jenkins based on the job name.

But what if I have hundreds of jobs that I want to run once a day, but during the night? Something like:

H H(0-5) * * *

Means: run the job once every day between 12 am and 6 am.

But when you scale up your Jenkins, you want the jobs to run between e.g. 7 pm and 6 am, because you also want to use the hours before midnight.

There is a BAD WAY to do it:

H H(0-6),H(19-23) H/2 * *

That would run the jobs in the morning and in the evening, every 2 days. It is complex and will not behave correctly at the end of the month.

The GOOD WAY to do it is to set the timezone in the cron expression, something which is not documented yet. It has been there since Jenkins 1.615, so you probably have it in your Jenkins:

TZ=GMT+7
H H(0-10) * * *

What does this mean? In the timezone GMT+7, run the jobs once between 12 am and 11 am, which means between 7 pm and 6 am in my timezone.

$ date -d '0:00 GMT+7'
Wed Mar 29 19:00:00 CEST 2017
$ date -d '11:00 GMT+7'
Thu Mar 30 06:00:00 CEST 2017

It is a much simpler syntax and it is more reliable. Please note that the validation (preview) below the cron settings does not use that TZ (I opened JENKINS-43228 for this).

Permalink. Category: Automation. Tags: Jenkins Testing Planet-inuits.
First published on Thu 30 March 2017.