Dealing with flapping metrics in prometheus

Prometheus allows you to get metrics from a lot of systems.

We are integrated with third party suppliers that expose us a balance, an amount of resources we can use.

That is exposed as the following metric:

available_sms{route="1",env="prod"} 1000

This is a gauge, therefore we can write an alerting rule like this:

- alert: No more SMS
  expr: |
    available_sms < 1000

That works well.. when the provider API is available. In our use case, sometimes, the api is refusing access for 10 minutes. Which means that if our balance is below 1000 we will get two tickets as the alert will start twice.

An alternative would be to do:

- alert: No more SMS
  expr: |
    max_over_time(available_sms[1h]) < 1000

Picking min_over_time means that the alert will be resolved only one hour after the original result. max_over_time means that the alert will be triggered one hour too late.

We use an alternative approach, which is to record the last known value:

- record: available_sms_last
  expr: available_sms or available_sms_last
- alert: No more SMS
  expr: |
    available_sms_last < 1000
- alert: No more SMS balance
  expr: |
  for: 1h

That rule will ensure that in case the api is not available, the available_sms_last metric will contain the last known value. We can therefore alert on that, without alerting too soon or too late! This is using prometheus 1-to-1 vector matching.

Another alert, on absent(available_sms) enables us to know when the api is down for a long time.

Permalink. Category: monitoring. Tags: prometheus.
First published on Thu 21 February 2019.