How to gracefully avoid divide by zero in Prometheus

If there is no activity during the specified time period, the rate() in the denominator becomes 0 and the result of the division becomes NaN.

This is the correct behaviour: NaN is what you want the result to be.

aggregations work OK.

You can't aggregate ratios. You need to aggregate the numerator and denominator separately and then divide.

So:

   sum by (command_group, command_name)(rate(hystrix_command_latency_total_seconds_sum[5m]))
  /
   sum by (command_group, command_name)(rate(hystrix_command_latency_total_seconds_count[5m]))

Finally I have a solution for my specific problem:

A division by zero leads to NaN being displayed. That is correct as a technical result, but it is not what the user wants to see (it does not fulfil the business requirement).

So I searched a bit and found a "solution" for my problem in the Grafana community:

Surround your problematic value with max(YOUR_PROBLEMATIC_QUERY or vector(-1)). An additional value mapping in Grafana then leads to a useful output.

(Of course you have to adapt the solution to your problem: min or max, vector(42), vector(101), and so on.)
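As a rough sketch of that wrapper, applied to the Hystrix latency metrics used elsewhere in this thread (an assumption on my part, your query will differ):

max(
  (
    rate(hystrix_command_latency_total_seconds_sum[5m])
    /
    rate(hystrix_command_latency_total_seconds_count[5m])
  )
  or vector(-1)
)

This relies on two things. First, the inner series must still carry labels, so that the label-less vector(-1) is added by "or" instead of being matched away. Second, as far as I can tell, max() prefers any non-NaN sample over NaN, so the result is -1 only when every inner series is NaN. Note that this collapses everything into a single series holding the highest ratio, which may or may not be what your panel needs. The -1 can then be mapped to a text such as "no data" with a Grafana value mapping.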

Update (1)

It turns out to be a bit more tricky, depending on the query. For example, I have another query that also returns NaN as the result of a division by zero, and the above solution does not work for it. There I had to wrap the query in parentheses and append > 0 or on() vector(100).
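A minimal sketch of that second form, again using the Hystrix metrics from this thread as a stand-in for the real query (the metric names and the fallback of 100 are placeholders):

(
  sum(rate(hystrix_command_latency_total_seconds_sum[5m]))
  /
  sum(rate(hystrix_command_latency_total_seconds_count[5m]))
) > 0 or on() vector(100)

The comparison binds tighter than "or", so the ratio is filtered first: NaN > 0 is false and the sample is dropped. Because on() matches on no labels at all, vector(100) is only used when the left-hand side is empty. Be aware that a genuine result of 0 is dropped by > 0 as well and would also be replaced by the fallback.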


Just add > smallest_value to the query before wrapping it in an aggregate function such as avg(), where smallest_value is a value smaller than any expected valid result of the inner query. For example:

avg((
  rate({__name__="hystrix_command_latency_total_seconds_sum"}[60s])
  /
  rate({__name__="hystrix_command_latency_total_seconds_count"}[60s])
) > -1e12)

Prometheus drops NaN values when they are compared to any number with the > operator, because every comparison with NaN is false. For example, NaN >bool -1e12 evaluates to 0. The same applies to the < operator, e.g. NaN <bool 1e12 also evaluates to 0. So either > or < may be used for filtering out NaN values before aggregating them with aggregate functions.

P.S. This trick isn't needed in MetricsQL, since VictoriaMetrics automatically skips NaN values when aggregate functions are applied to them.
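For completeness, the same filter written with < could look like this. It is only a sketch, and it assumes every valid average latency is below 1e12 seconds:

avg((
  rate({__name__="hystrix_command_latency_total_seconds_sum"}[60s])
  /
  rate({__name__="hystrix_command_latency_total_seconds_count"}[60s])
) < 1e12)

Unlike the > variant, this one would also drop +Inf results, which can appear when only the denominator is zero.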

Tags:

Prometheus