Blog

Prometheus Google Compute Engine discovery example

Here is a small example of how to use Prometheus to scrape your GCE instances.

I recommend looking at the Prometheus documentation to see how you can pass the credentials to your Prometheus instance.

scrape_configs:
  - job_name: node_gce
    gce_sd_configs:
      - zone: europe-west1-b
        project: myproject
      - zone: europe-west1-d
        project: myproject
      - zone: europe-west1-c
        project: myproject
    relabel_configs:
      - source_labels: [__meta_gce_public_ip]
        target_label: __address__
        replacement: "${1}:9090"
      - source_labels: [__meta_gce_zone]
        regex: ".+/([^/]+)"
        target_label: zone
      - source_labels: [__meta_gce_project]
        target_label: project
      - source_labels: [__meta_gce_instance_name]
        target_label: instance
      - regex: "__meta_gce_metadata_(.+)"
        action: labelmap

Let’s analyze it.

Zones and projects

    gce_sd_configs:
      - zone: europe-west1-b
        project: project1
      - zone: europe-west1-d
        project: project1
      - zone: europe-west1-c
        project: project2

We have a job named node_gce, which has 3 gce_sd_config objects. Each object is attached to a single zone and a single project, which lets you mix several zones and projects in one job.

Relabeling

Setting the address

This example replaces the private IP with the public IP of your node, and uses port 9090. __address__ is a hidden label used by Prometheus to know which address to scrape.

      - source_labels: [__meta_gce_public_ip]
        target_label: __address__
        replacement: "${1}:9090"
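
For example, assuming an instance whose public IP is 203.0.113.7 (a hypothetical address), this rule turns the discovered target into:

      # before:  __meta_gce_public_ip="203.0.113.7"
      # after:   __address__="203.0.113.7:9090"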

Zone and project

Now, let's automatically add a zone label that matches the GCE zone. The __meta_gce_zone label contains the full zone URL, so the regex keeps only the last path component:

      - source_labels: [__meta_gce_zone]
        regex: ".+/([^/]+)"
        target_label: zone
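
As an illustration, for an instance in europe-west1-b the result is roughly:

      # before:  __meta_gce_zone="https://www.googleapis.com/compute/v1/projects/myproject/zones/europe-west1-b"
      # after:   zone="europe-west1-b"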

Let’s get a project label, too:

      - source_labels: [__meta_gce_project]
        target_label: project

Instance name

And a human-readable instance label that matches the GCE instance name:

      - source_labels: [__meta_gce_instance_name]
        target_label: instance

Metadata

The last part of the config turns every metadata entry of the instance into a Prometheus label:

      - regex: "__meta_gce_metadata_(.+)"
        action: labelmap
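
For example, assuming an instance has a (hypothetical) metadata entry team=sre, the labelmap action produces:

      # before:  __meta_gce_metadata_team="sre"
      # after:   team="sre"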
Permalink. Category: monitoring. Tags: prometheus.
First published on Sat 9 March 2019.

Dealing with flapping metrics in prometheus

Prometheus allows you to get metrics from a lot of systems.

We are integrated with third-party suppliers that expose a balance to us: the amount of resources we can use.

That is exposed as the following metric:

available_sms{route="1",env="prod"} 1000

This is a gauge, therefore we can write an alerting rule like this:

- alert: NoMoreSMS
  expr: |
    available_sms < 1000

That works well... when the provider API is available. In our use case, the API sometimes refuses access for 10 minutes, which means that if our balance is below 1000 we will get two tickets, as the alert will fire twice.

An alternative would be to do:

- alert: NoMoreSMS
  expr: |
    max_over_time(available_sms[1h]) < 1000

Picking min_over_time would mean that the alert is only resolved one hour after the balance is back above the threshold. max_over_time means that the alert will be triggered one hour too late.
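
For reference, the min_over_time variant discussed above would look like this (same metric and threshold as before):

- alert: NoMoreSMS
  expr: |
    min_over_time(available_sms[1h]) < 1000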

We use an alternative approach, which is to record the last known value:

- record: available_sms_last
  expr: available_sms or available_sms_last
- alert: NoMoreSMS
  expr: |
    available_sms_last < 1000
- alert: NoMoreSMSBalance
  expr: |
    absent(available_sms)
  for: 1h

That rule ensures that when the API is not available, the available_sms_last metric still contains the last known value. We can therefore alert on it without alerting too soon or too late! This relies on the PromQL or operator: available_sms when it exists, the previously recorded available_sms_last otherwise.
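
As a small illustration of how the recording rule behaves over time (hypothetical values, one evaluation per minute):

# t+0m   available_sms=950 is present       -> available_sms_last=950
# t+1m   API down, available_sms is absent  -> available_sms_last=950 (previous value of available_sms_last)
# t+10m  API back, available_sms=950        -> available_sms_last=950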

Another alert, on absent(available_sms), lets us know when the API is down for a long time.

Permalink. Category: monitoring. Tags: prometheus.
First published on Thu 21 February 2019.

Prometheus and DST

Prometheus only deals with UTC. It does not even try to do anything else. But when you want to compare your business metrics with your usual traffic, you need to take DST into account.

Here is my take on the problem. Note that I am in TZ=Europe/Brussels. The DST change happened on October 29.

Let's say that I want to compare one metric with the same metric 2 weeks ago. In this example, the metric is rate(app_requests_count{env="prod"}[5m]). If we are on the 1st of December, we need to look back 14 days. But if we are on the 1st of November, we need to look back 14 days + 1 hour, because the DST change happened on October 29.

To achieve that, we will take advantage of Prometheus recording rules and functions. This example is based on Prometheus 2.0.

First, I set up a recording rule that tells me when I need to add an extra hour:

- record: dst
  expr: |
    0*(month() < 12) + 0*(day_of_month() < 13) + 0*(year() == 2017)
  labels:
    when: 14d

That metric dst{when="14d"} will be 0 until the 13th of November, and will have no value from then on. Each comparison acts as a filter: it returns a value when the condition holds and nothing otherwise, and multiplying by 0 collapses that value to 0. If you really care, you can play with the hour() function as well.
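
To make that concrete, here is how each term evaluates on 9 November 2017, for example:

# month()        = 11   -> month() < 12        returns 11   -> 0*(...) = 0
# day_of_month() = 9    -> day_of_month() < 13 returns 9    -> 0*(...) = 0
# year()         = 2017 -> year() == 2017      returns 2017 -> 0*(...) = 0
# sum of the three terms: dst{when="14d"} = 0
# On 13 November, day_of_month() < 13 returns nothing, so the whole
# expression has no value and dst{when="14d"} disappears.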

Then, I create a second rule with two different offsets and an or. Note that within a recording group, Prometheus computes the rules sequentially.

- record: app_request_rate
  expr: |
    (
      sum(dst{when="14d"})
      + (
         sum(
          rate(
           app_requests_count{env="prod"}[5m]
           offset 337h
          )
         )
         or vector(0)
        )
    )
    or
    (
     sum(
      rate(
       app_requests_count{env="prod"}[5m]
       offset 14d)
      )
      or vector(0)
    )
  labels:
    when: 14d

Let’s analyze this.

The recording rule is split into two parts by an or:

    (
      sum(dst{when="14d"})
      + (
         sum(
          rate(
           app_requests_count{env="prod"}[5m]
           offset 337h
          )
         )
         or vector(0)
        )
    )
    (
     sum(
      rate(
       app_requests_count{env="prod"}[5m]
       offset 14d)
      )
      or vector(0)
    )

If the first part does not return any value, then we get the second part.

The second part is easy, so let’s start with it:

  • We sum the 5min rates of app_requests_count, env=prod, 14 days ago.
  • If we get no metrics (e.g. Prometheus was down) we get 0.

The first part is, however, a bit more complex. Part of it is like the second part, but with an offset of 14d+1h (337h).

Now, to detect whether we need the first or the second offset, we add sum(dst{when="14d"}) to the first part. When we need to add an extra hour, the value of sum(dst{when="14d"}) is 0. Otherwise, there is no value and Prometheus falls back to the second part of the rule.

Note: in this rule, the sum in sum(dst{when="14d"}) is there to remove the labels and allow the + operation.

It is a bit tricky, but it should do the job. I think I will also create recording rules for day_of_month(), month() and year() in the future, so I can apply an offset to their values.
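
A minimal sketch of what those recording rules could look like (the rule names are my own invention; only the expressions matter):

- record: datetime_day_of_month
  expr: day_of_month()
- record: datetime_month
  expr: month()
- record: datetime_year
  expr: year()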

I will probably revisit this in March 2018…

Permalink. Category: monitoring. Tags: prometheus.
First published on Thu 9 November 2017.

Foxyproxy 5.0 is BAD

The FoxyProxy devs have deployed a new major release of their addon. It ignores your current config and you need to create a new one. There is no migration path.

Fortunately, the old config is still on disk. Simply disable the addon's autoupdate and install the 4.6.5 release to get back to work.

I really don't get why the devs have broken such a useful extension…

Permalink. Category: firefox. Tags: internet.
First published on Wed 6 September 2017.

Custom Sudo command with Ansible

While introducing Ansible at a customer, I noticed the following problem when using Ansible's become feature:

sudo: unable to create /var/log/sudo-io/170718-124212: File exists
sudo: error initializing I/O plugin sudoers_io

That was because in the sudoers configuration, you have:

Defaults syslog=auth,log_input,log_output,iolog_file=%y%m%d-%H%M%S

This means that sudo sessions are logged, but there can only be one sudo session per second, since the I/O log file name only has second-level resolution. While discussing with colleagues what would be the best way to address this (several options are possible: not logging, logging with a sequence number via Defaults:ansible iolog_file=ansible/%{seq}, logging microseconds, …), I implemented a workaround.

What is interesting is that it uses an undocumented feature: become_exe.

I dropped an ansible.cfg file in the Ansible repo with the following content:

[privilege_escalation]
become_exe = /bin/sleep 1 && /bin/sudo

And, magically, Ansible now waits one second between sudo commands. That is enough to let me continue working while looking for the best resolution.

Permalink. Category: automation. Tags: linux security ansible planet-inuits.
First published on Thu 20 July 2017.