# Instance sizing analysis
Once we gather the instance metrics with the Node exporter, we can run a statistical analysis of their evolution over time to detect which instances are undersized or oversized.
## RAM analysis
The instance RAM usage percentage can be calculated with the following Prometheus rule:
```yaml
- record: instance_path:node_memory_MemAvailable_percent
  expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
```
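If you just want to eyeball the result before building anything on top of it, a query like this (a sketch, not part of the rules) sorts the instances by their current RAM usage in the Prometheus web interface:

```promql
# Instances ordered by current RAM usage percentage
sort_desc(instance_path:node_memory_MemAvailable_percent)
```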
The average, standard deviation and standard score (z-score) over the last two weeks would be:
```yaml
- record: instance_path:node_memory_MemAvailable_percent:avg_over_time_2w
  expr: avg_over_time(instance_path:node_memory_MemAvailable_percent[2w])

- record: instance_path:node_memory_MemAvailable_percent:stddev_over_time_2w
  expr: stddev_over_time(instance_path:node_memory_MemAvailable_percent[2w])

- record: instance_path:node_memory_MemAvailable_percent:z_score
  expr: >
    (
      instance_path:node_memory_MemAvailable_percent
      - instance_path:node_memory_MemAvailable_percent:avg_over_time_2w
    ) / instance_path:node_memory_MemAvailable_percent:stddev_over_time_2w
```
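To get a feeling for how far each instance currently deviates from its two-week baseline, a query like the following (again, just a sketch) sorts them by the absolute value of their z-score:

```promql
# Instances whose current RAM usage is furthest from its two-week average,
# measured in standard deviations
sort_desc(abs(instance_path:node_memory_MemAvailable_percent:z_score))
```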
With that data we can define that an instance is oversized if the average plus the standard deviation is less than 60%, and undersized if it's greater than 90%. The average captures the nominal RAM consumption, while the standard deviation accounts for the spikes. For example, an instance averaging 45% RAM usage with a standard deviation of 10% scores 55% and would be flagged as oversized.
Note: tweak these thresholds to your use case. The undersized and oversized criteria above are just a first approximation I'm going to use; you can take them as a baseline, but don't follow them blindly. See the disclaimer below for more information.
```yaml
# RAM
- record: instance_path:wrong_resource_size
  expr: >
    instance_path:node_memory_MemAvailable_percent:avg_plus_stddev_over_time_2w < 60
  labels:
    type: EC2
    metric: RAM
    problem: oversized

- record: instance_path:wrong_resource_size
  expr: >
    instance_path:node_memory_MemAvailable_percent:avg_plus_stddev_over_time_2w > 90
  labels:
    type: EC2
    metric: RAM
    problem: undersized
```
Where `avg_plus_stddev_over_time_2w` is:
```yaml
- record: instance_path:node_memory_MemAvailable_percent:avg_plus_stddev_over_time_2w
  expr: >
    instance_path:node_memory_MemAvailable_percent:avg_over_time_2w
    + instance_path:node_memory_MemAvailable_percent:stddev_over_time_2w
```
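Once these rules are loaded, selectors like the following (a sketch) list the instances currently flagged on RAM, with the value being their average plus standard deviation:

```promql
# Instances flagged as oversized on RAM
instance_path:wrong_resource_size{metric="RAM", problem="oversized"}

# Instances flagged as undersized on RAM
instance_path:wrong_resource_size{metric="RAM", problem="undersized"}
```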
## CPU analysis
The instance CPU usage percentage can be calculated with the following Prometheus rule:
```yaml
- record: instance_path:node_cpu_percent:rate1m
  expr: >
    (1 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])))) * 100
```
The `node_cpu_seconds_total` metric is a counter of the seconds each CPU has spent in each mode, not a usage percentage, so we take the `rate` of the `idle` mode over the last minute and average it across the instance's cores.
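If you want to double-check the recorded value against the raw data, a query like this (a sketch) shows how the CPU time of each instance splits across modes:

```promql
# Fraction of CPU time spent in each mode, averaged across the instance's cores
avg by (instance, mode) (rate(node_cpu_seconds_total[1m]))
```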
The average, standard deviation, standard score, and the undersized/oversized criteria are similar to the RAM case, so I'm adding them folded for reference only.
CPU usage rules
```yaml
# ---------------------------------------
# -- Resource consumption calculations --
# ---------------------------------------

# CPU
- record: instance_path:node_cpu_percent:rate1m
  expr: >
    (1 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])))) * 100

- record: instance_path:node_cpu_percent:rate1m:avg_over_time_2w
  expr: avg_over_time(instance_path:node_cpu_percent:rate1m[2w])

- record: instance_path:node_cpu_percent:rate1m:stddev_over_time_2w
  expr: stddev_over_time(instance_path:node_cpu_percent:rate1m[2w])

- record: instance_path:node_cpu_percent:rate1m:avg_plus_stddev_over_time_2w
  expr: >
    instance_path:node_cpu_percent:rate1m:avg_over_time_2w
    + instance_path:node_cpu_percent:rate1m:stddev_over_time_2w

- record: instance_path:node_cpu_percent:rate1m:z_score
  expr: >
    (
      instance_path:node_cpu_percent:rate1m
      - instance_path:node_cpu_percent:rate1m:avg_over_time_2w
    ) / instance_path:node_cpu_percent:rate1m:stddev_over_time_2w

# ----------------------------------
# -- Resource sizing calculations --
# ----------------------------------

# CPU
- record: instance_path:wrong_resource_size
  expr: instance_path:node_cpu_percent:rate1m:avg_plus_stddev_over_time_2w < 60
  labels:
    type: EC2
    metric: CPU
    problem: oversized

- record: instance_path:wrong_resource_size
  expr: instance_path:node_cpu_percent:rate1m:avg_plus_stddev_over_time_2w > 80
  labels:
    type: EC2
    metric: CPU
    problem: undersized
```
## Network analysis
We can deduce the network usage from the `node_network_receive_bytes_total` and `node_network_transmit_bytes_total` metrics. For example, for transmit, the gigabits per second sent can be calculated with the following Prometheus rule:
```yaml
- record: instance_path:node_network_transmit_gigabits_per_second:rate1m
  expr: >
    increase(
      node_network_transmit_bytes_total{device=~"(eth0|ens.*)"}[1m]
    ) * 7.450580596923828 * 10^-9 / 60
```
Where we:

- Filter the traffic to the external network interfaces with `node_network_transmit_bytes_total{device=~"(eth0|ens.*)"}`. Those are the ones used by AWS, but you'll need to tweak the regex for your case.
- Convert the `increase` of bytes over the last minute (`[1m]`) to gigabits per second by multiplying it by `7.450580596923828 * 10^-9 / 60`: the first factor equals `8 / 1024^3`, which converts bytes to (binary) gigabits, and the division by 60 turns the per-minute increase into a per-second value.
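An equivalent way of writing the same expression (a sketch, not the rule I use) is to let `rate` do the per-second conversion and keep the byte-to-gigabit factor explicit:

```promql
# Gigabits per second transmitted: rate() already divides by the window,
# so only the bytes-to-gigabits conversion (8 / 1024^3) remains
rate(node_network_transmit_bytes_total{device=~"(eth0|ens.*)"}[1m]) * 8 / 1024^3
```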
The average, standard deviation, standard score, and the undersized/oversized criteria are similar to the RAM case, so I'm adding them folded for reference only.
Network usage rules
```yaml
# ---------------------------------------
# -- Resource consumption calculations --
# ---------------------------------------

# NetworkReceive
- record: instance_path:node_network_receive_gigabits_per_second:rate1m
  expr: >
    increase(
      node_network_receive_bytes_total{device=~"(eth0|ens.*)"}[1m]
    ) * 7.450580596923828 * 10^-9 / 60

- record: instance_path:node_network_receive_gigabits_per_second:rate1m:avg_over_time_2w
  expr: >
    avg_over_time(
      instance_path:node_network_receive_gigabits_per_second:rate1m[2w]
    )

- record: instance_path:node_network_receive_gigabits_per_second:rate1m:stddev_over_time_2w
  expr: >
    stddev_over_time(
      instance_path:node_network_receive_gigabits_per_second:rate1m[2w]
    )

- record: instance_path:node_network_receive_gigabits_per_second:rate1m:avg_plus_stddev_over_time_2w
  expr: >
    instance_path:node_network_receive_gigabits_per_second:rate1m:avg_over_time_2w
    + instance_path:node_network_receive_gigabits_per_second:rate1m:stddev_over_time_2w

- record: instance_path:node_network_receive_gigabits_per_second:rate1m:z_score
  expr: >
    (
      instance_path:node_network_receive_gigabits_per_second:rate1m
      - instance_path:node_network_receive_gigabits_per_second:rate1m:avg_over_time_2w
    ) / instance_path:node_network_receive_gigabits_per_second:rate1m:stddev_over_time_2w

# NetworkTransmit
- record: instance_path:node_network_transmit_gigabits_per_second:rate1m
  expr: >
    increase(
      node_network_transmit_bytes_total{device=~"(eth0|ens.*)"}[1m]
    ) * 7.450580596923828 * 10^-9 / 60

- record: instance_path:node_network_transmit_gigabits_per_second:rate1m:avg_over_time_2w
  expr: >
    avg_over_time(
      instance_path:node_network_transmit_gigabits_per_second:rate1m[2w]
    )

- record: instance_path:node_network_transmit_gigabits_per_second:rate1m:stddev_over_time_2w
  expr: >
    stddev_over_time(
      instance_path:node_network_transmit_gigabits_per_second:rate1m[2w]
    )

- record: instance_path:node_network_transmit_gigabits_per_second:rate1m:avg_plus_stddev_over_time_2w
  expr: >
    instance_path:node_network_transmit_gigabits_per_second:rate1m:avg_over_time_2w
    + instance_path:node_network_transmit_gigabits_per_second:rate1m:stddev_over_time_2w

- record: instance_path:node_network_transmit_gigabits_per_second:rate1m:z_score
  expr: >
    (
      instance_path:node_network_transmit_gigabits_per_second:rate1m
      - instance_path:node_network_transmit_gigabits_per_second:rate1m:avg_over_time_2w
    ) / instance_path:node_network_transmit_gigabits_per_second:rate1m:stddev_over_time_2w

# ----------------------------------
# -- Resource sizing calculations --
# ----------------------------------

# NetworkReceive
- record: instance_path:wrong_resource_size
  expr: >
    instance_path:node_network_receive_gigabits_per_second:rate1m:avg_plus_stddev_over_time_2w < 0.5
  labels:
    type: EC2
    metric: NetworkReceive
    problem: oversized

- record: instance_path:wrong_resource_size
  expr: >
    instance_path:node_network_receive_gigabits_per_second:rate1m:avg_plus_stddev_over_time_2w > 3
  labels:
    type: EC2
    metric: NetworkReceive
    problem: undersized

# NetworkTransmit
- record: instance_path:wrong_resource_size
  expr: >
    instance_path:node_network_transmit_gigabits_per_second:rate1m:avg_plus_stddev_over_time_2w < 0.5
  labels:
    type: EC2
    metric: NetworkTransmit
    problem: oversized

- record: instance_path:wrong_resource_size
  expr: >
    instance_path:node_network_transmit_gigabits_per_second:rate1m:avg_plus_stddev_over_time_2w > 3
  labels:
    type: EC2
    metric: NetworkTransmit
    problem: undersized
```
The difference with the network is that we don't have a percentage of the total instance bandwidth. In my case, my instances support from 0.5 to 5 Gbps, which is more than I need, so most of them are marked as oversized by the `< 0.5` rule. I will manually study the ones that go over 3 Gbps.
The correct way to do it is to tag the baseline, burst, and/or maximum network performance by instance type. In the AWS case, that data can be extracted from the AWS docs or from external benchmarks.
Once you know the network performance per instance type, you can use relabeling in the Node exporter service monitor to add a label like `max_network_performance` and use it later in the rules, as sketched below.
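For example, if you use the Prometheus Operator, a hypothetical sketch of that relabeling in the `ServiceMonitor` could look like the following. The instance types, the Gbps values and the reliance on an existing `instance_type` label are assumptions you'd need to replace with your own data:

```yaml
metricRelabelings:
  # Hypothetical mapping: one rule per instance type, with example values
  # taken from your own benchmarks or the AWS docs (these numbers are made up).
  - sourceLabels: [instance_type]
    regex: c4.2xlarge
    targetLabel: max_network_performance
    replacement: "1"
    action: replace
  - sourceLabels: [instance_type]
    regex: c5.4xlarge
    targetLabel: max_network_performance
    replacement: "5"
    action: replace
```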
If you do follow this path, please contact me or open a pull request so I can test your solution.
## Overall analysis
Now that we have all the analyses under the `instance_path:wrong_resource_size` metric with their labels, we can aggregate them to see how many rules each instance is breaking with the following rule:
```yaml
# Count the number of sizing rules broken by each instance
- record: instance_path:wrong_instance_size
  expr: count by (instance) (sum by (metric, instance) (instance_path:wrong_resource_size))
```
By executing `sort_desc(instance_path:wrong_instance_size)` in the Prometheus web interface, we'll be able to see those instances:
```
instance_path:wrong_instance_size{instance="frontend-production:192.168.1.2"} 4
instance_path:wrong_instance_size{instance="backend-production-instance:172.30.0.195"} 2
...
```
To see the detail of which rules an instance is breaking, we can use something like `instance_path:wrong_resource_size{instance=~'frontend.*'}`:
instance_path:wrong_resource_size{instance="fronted-production:192.168.1.2",instance_type="c4.2xlarge",job="node-exporter",metric="RAM",problem="oversized",type="EC2"} 5.126602454544287
instance_path:wrong_resource_size{instance="fronted-production:192.168.1.2",metric="CPU",problem="oversized",type="EC2"} 0.815639209497615
instance_path:wrong_resource_size{device="ens3",instance="fronted-production:192.168.1.2",instance_type="c4.2xlarge",job="node-exporter",metric="NetworkReceive",problem="oversized",type="EC2"} 0.02973250128744766
instance_path:wrong_resource_size{device="ens3",instance="fronted-production:192.168.1.2",instance_type="c4.2xlarge",job="node-exporter",metric="NetworkTransmit",problem="oversized",type="EC2"} 0.01586461503849804
Here we see that `frontend-production` is a `c4.2xlarge` instance whose average plus standard deviation is 0.81% for CPU, 5.12% for RAM, 0.015 Gbps for NetworkTransmit and 0.029 Gbps for NetworkReceive, which flags it as `oversized` on all four metrics.
If you want to see the evolution over time, click on *Graph* instead of *Console* under the text box where you entered the query.
With this information, we can decide which instance type is right for each application. Once all instances are migrated to their ideal size, we can add alerts on these metrics to get a continuous analysis of our instances. Once I've done it, I'll add the alerts here.
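In the meantime, a hypothetical sketch of what such an alert could look like (the threshold, duration and severity are assumptions, not something tested in production):

```yaml
# Hypothetical alert on the aggregated sizing metric
- alert: WrongInstanceSize
  expr: instance_path:wrong_instance_size >= 3
  for: 1d
  labels:
    severity: info
  annotations:
    summary: "{{ $labels.instance }} breaks {{ $value }} sizing rules"
```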
## Disclaimer
We haven't tested these rules in production to resize our infrastructure yet (we will soon), so use all the information in this document cautiously.
What I expect may fail is the average plus one standard deviation criterion: it may not be enough. Maybe I need to increase the resolution of the standard deviation so it's more sensitive to spikes, or use a safety factor of 2 or 3 standard deviations. We'll see :)
Read thoroughly the GitLab post on anomaly detection using Prometheus; it's awesome, and it may give you insights on why this approach might not work for you, as well as other algorithms that, for example, take into account the seasonality of the metrics.
In particular, it's interesting to analyze the evolution of your resources' z-scores over time. If all values fall in the `+4` to `-4` range, you can statistically assert that your metric approximately follows a normal distribution and assume that any z-score above 3 is an anomaly. If your results come back in a range of `+20` to `-20`, the tail is too long and your results will be skewed.
To check it, you can use the following queries on the RAM behaviour, and adapt them for the rest of the resources:
```promql
# Minimum z_score value
sort_desc(abs((min_over_time(instance_path:node_memory_MemAvailable_percent[1w]) - instance_path:node_memory_MemAvailable_percent:avg_over_time_2w) / instance_path:node_memory_MemAvailable_percent:stddev_over_time_2w))

# Maximum z_score value
sort_desc(abs((max_over_time(instance_path:node_memory_MemAvailable_percent[1w]) - instance_path:node_memory_MemAvailable_percent:avg_over_time_2w) / instance_path:node_memory_MemAvailable_percent:stddev_over_time_2w))
```
For a less exhaustive but more graphical analysis, execute `instance_path:node_memory_MemAvailable_percent:z_score` in *Graph* mode. In my case the RAM stays within the ±5 interval, with some peaks of 20, but after reviewing `instance_path:node_memory_MemAvailable_percent:avg_plus_stddev_over_time_2w` in those periods, I feel it's still safe to use the assumption.
The same criteria apply to the `instance_path:node_cpu_percent:rate1m:z_score`, `instance_path:node_network_receive_gigabits_per_second:rate1m:z_score`, and `instance_path:node_network_transmit_gigabits_per_second:rate1m:z_score` metrics.