# Node Exporter
Node Exporter is a Prometheus exporter for hardware and OS metrics exposed by *NIX kernels, written in Go with pluggable metric collectors.
## Install
To install it on Kubernetes nodes, use this chart. Elsewhere, use this ansible role.

If you run node exporter agents outside Kubernetes, you need to configure a Prometheus service discovery to scrape the metrics from them.

To auto discover EC2 instances, use the `ec2_sd_config` configuration. It can be added in the helm chart `values.yaml` under the `prometheus.prometheusSpec.additionalScrapeConfigs` key:
```yaml
- job_name: node_exporter
  ec2_sd_configs:
    - region: us-east-1
      port: 9100
      refresh_interval: 1m
  relabel_configs:
    - source_labels: ['__meta_ec2_tag_Name', '__meta_ec2_private_ip']
      separator: ':'
      target_label: instance
    - source_labels:
        - __meta_ec2_instance_type
      target_label: instance_type
```
The `relabel_configs` part will substitute the `instance` label of each target from `{{ instance_ip }}:9100` to `{{ instance_name }}:{{ instance_ip }}`.
If the worker nodes already have an IAM role with the `ec2:DescribeInstances` permission, there is no need to specify the `role_arn` or the `access_key` and `secret_key`.
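As a reference for where this job ends up, here is a minimal sketch of the `values.yaml` placement; it assumes a prometheus-operator / kube-prometheus-stack style chart where `prometheus.prometheusSpec.additionalScrapeConfigs` accepts a plain list of scrape configs:

```yaml
# Sketch: placement of the scrape job inside the helm chart values.
# Assumes the chart exposes prometheus.prometheusSpec.additionalScrapeConfigs
# as a plain list of Prometheus scrape configs.
prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
      - job_name: node_exporter
        ec2_sd_configs:
          - region: us-east-1
            port: 9100
            refresh_interval: 1m
```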
If you have stopped instances, Prometheus will raise an alert because it won't be able to scrape their metrics. To only fetch data from running instances, add a filter:
```yaml
ec2_sd_configs:
  - region: us-east-1
    filters:
      - name: instance-state-name
        values:
          - running
```
To monitor only the instances in a given list of VPCs, use this filter:
```yaml
ec2_sd_configs:
  - region: us-east-1
    filters:
      - name: vpc-id
        values:
          - vpc-xxxxxxxxxxxxxxxxx
          - vpc-yyyyyyyyyyyyyyyyy
```
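Both filters can also be combined in a single `ec2_sd_configs` entry; EC2's `DescribeInstances` call ANDs the filters together, so only running instances inside the listed VPCs would be discovered. A small sketch:

```yaml
# Sketch: combining the instance state and VPC filters in one discovery config.
ec2_sd_configs:
  - region: us-east-1
    filters:
      - name: instance-state-name
        values:
          - running
      - name: vpc-id
        values:
          - vpc-xxxxxxxxxxxxxxxxx
```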
By default, Prometheus will try to scrape the private instance IP. To use the public one, you need to relabel it with the following snippet:
```yaml
ec2_sd_configs:
  - region: us-east-1
relabel_configs:
  - source_labels: ['__meta_ec2_public_ip']
    regex: ^(.*)$
    target_label: __address__
    replacement: '${1}:9100'
```
I'm using the 11074 grafana dashboard for the node exporter, which worked straight out of the box. Taking as reference the grafana helm chart values, add the following yaml under the `grafana` key in the `prometheus-operator` `values.yaml`:
```yaml
grafana:
  enabled: true
  defaultDashboardsEnabled: true
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: 'default'
          orgId: 1
          folder: ''
          type: file
          disableDeletion: false
          editable: true
          options:
            path: /var/lib/grafana/dashboards/default
  dashboards:
    default:
      node_exporter:
        # Ref: https://grafana.com/dashboards/11074
        gnetId: 11074
        revision: 4
        datasource: Prometheus
```
And install:

```bash
helmfile diff
helmfile apply
```
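If you don't have a helmfile yet, this is a minimal sketch of one; the repository URL, chart name, namespace and values file are assumptions to adapt to your setup (the chart is published nowadays as `kube-prometheus-stack`, formerly `prometheus-operator`):

```yaml
# Sketch of a minimal helmfile for the prometheus-operator / kube-prometheus-stack release.
repositories:
  - name: prometheus-community
    url: https://prometheus-community.github.io/helm-charts

releases:
  - name: prometheus-operator
    namespace: monitoring
    chart: prometheus-community/kube-prometheus-stack
    values:
      - values.yaml
```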
## Node exporter size analysis
Once the instance metrics are being ingested, we can do a periodic analysis to deduce which instances are undersized or oversized.
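A hedged sketch of what that analysis could look like as Prometheus recording rules: average CPU and memory utilisation per instance over the last week, assuming the standard node exporter metric names; the rule names are made up for illustration:

```yaml
groups:
  - name: node_exporter_sizing
    rules:
      # Average CPU utilisation (%) per instance over the last week.
      - record: instance:cpu_utilisation_percent:avg_over_time_1w
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1w])) * 100)
      # Average memory utilisation (%) per instance over the last week.
      - record: instance:memory_utilisation_percent:avg_over_time_1w
        expr: avg_over_time(((1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)[1w:5m])
```

Instances that stay far below those values are candidates for downsizing, while the ones constantly near the top are candidates for upsizing.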
## Node exporter alerts

Now that we've got the metrics, we can define the alert rules. Most have been tweaked from the Awesome prometheus alert rules collection.
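A sketch of how these rules can be loaded through the chart values, assuming the prometheus-operator / kube-prometheus-stack chart, which supports `additionalPrometheusRulesMap` (the group name is a placeholder):

```yaml
# Sketch: loading the alert rules through the chart values.
additionalPrometheusRulesMap:
  node-exporter-rules:
    groups:
      - name: node_exporter
        rules:
          # The alert rules from the sections below go here, for example:
          - alert: HostOutOfMemory
            expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
            for: 5m
            labels:
              severity: warning
```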
### Host out of memory

Node memory is filling up (< 10% left).

```yaml
- alert: HostOutOfMemory
  expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Host out of memory (instance {{ $labels.instance }})"
    message: "Node memory is filling up (< 10% left)\n VALUE = {{ $value }}"
    grafana: "{{ grafana_url }}?var-job=node_exporter&var-hostname=All&var-node={{ $labels.instance }}"
```
### Host memory under memory pressure

The node is under heavy memory pressure. High rate of major page faults.

```yaml
- alert: HostMemoryUnderMemoryPressure
  expr: rate(node_vmstat_pgmajfault[1m]) > 1000
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Host memory under memory pressure (instance {{ $labels.instance }})"
    message: "The node is under heavy memory pressure. High rate of major page faults."
    grafana: "{{ grafana_url }}?var-job=node_exporter&var-hostname=All&var-node={{ $labels.instance }}"
```
### Host unusual network throughput in

Host network interfaces are probably receiving too much data (> 100 MB/s).

```yaml
- alert: HostUnusualNetworkThroughputIn
  expr: sum by (instance) (irate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 100
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Host unusual network throughput in (instance {{ $labels.instance }})"
    message: "Host network interfaces are probably receiving too much data (> 100 MB/s)\n VALUE = {{ $value }}."
    grafana: "{{ grafana_url }}?var-job=node_exporter&var-hostname=All&var-node={{ $labels.instance }}"
```
### Host unusual network throughput out

Host network interfaces are probably sending too much data (> 100 MB/s).

```yaml
- alert: HostUnusualNetworkThroughputOut
  expr: sum by (instance) (irate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 100
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Host unusual network throughput out (instance {{ $labels.instance }})"
    message: "Host network interfaces are probably sending too much data (> 100 MB/s)\n VALUE = {{ $value }}."
    grafana: "{{ grafana_url }}?var-job=node_exporter&var-hostname=All&var-node={{ $labels.instance }}"
```
### Host unusual disk read rate

Disk is probably reading too much data (> 50 MB/s).

```yaml
- alert: HostUnusualDiskReadRate
  expr: sum by (instance) (irate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 50
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Host unusual disk read rate (instance {{ $labels.instance }})"
    message: "Disk is probably reading too much data (> 50 MB/s)\n VALUE = {{ $value }}."
    grafana: "{{ grafana_url }}?var-job=node_exporter&var-hostname=All&var-node={{ $labels.instance }}"
```
### Host unusual disk write rate

Disk is probably writing too much data (> 50 MB/s).

```yaml
- alert: HostUnusualDiskWriteRate
  expr: sum by (instance) (irate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Host unusual disk write rate (instance {{ $labels.instance }})"
    message: "Disk is probably writing too much data (> 50 MB/s)\n VALUE = {{ $value }}."
    grafana: "{{ grafana_url }}?var-job=node_exporter&var-hostname=All&var-node={{ $labels.instance }}"
```
### Host out of disk space

The disk is worryingly close to full (< 10% left).

```yaml
- alert: HostOutOfDiskSpace
  expr: (node_filesystem_avail_bytes{fstype!~"tmpfs"} * 100) / node_filesystem_size_bytes{fstype!~"tmpfs"} < 10
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Host out of disk space (instance {{ $labels.instance }})"
    message: "Host disk is almost full (< 10% left)\n VALUE = {{ $value }}"
    grafana: "{{ grafana_url }}?var-job=node_exporter&var-hostname=All&var-node={{ $labels.instance }}"
```
There's also a warning-level rule for when the disk is almost full (< 20% left):

```yaml
- alert: HostReachingOutOfDiskSpace
  expr: (node_filesystem_avail_bytes{fstype!~"tmpfs"} * 100) / node_filesystem_size_bytes{fstype!~"tmpfs"} < 20
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Host reaching out of disk space (instance {{ $labels.instance }})"
    message: "Host disk is almost full (< 20% left)\n VALUE = {{ $value }}"
    grafana: "{{ grafana_url }}?var-job=node_exporter&var-hostname=All&var-node={{ $labels.instance }}"
```
### Host disk will fill in 4 hours

Disk will fill in 4 hours at the current write rate.

```yaml
- alert: HostDiskWillFillIn4Hours
  expr: predict_linear(node_filesystem_free_bytes{fstype!~"tmpfs"}[1h], 4 * 3600) < 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Host disk will fill in 4 hours (instance {{ $labels.instance }})"
    message: "Disk will fill in 4 hours at current write rate\n VALUE = {{ $value }}."
    grafana: "{{ grafana_url }}?var-job=node_exporter&var-hostname=All&var-node={{ $labels.instance }}"
```
### Host out of inodes

Disk is almost running out of available inodes (< 10% left).

```yaml
- alert: HostOutOfInodes
  expr: node_filesystem_files_free{fstype!~"tmpfs"} / node_filesystem_files{fstype!~"tmpfs"} * 100 < 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Host out of inodes (instance {{ $labels.instance }})"
    message: "Disk is almost running out of available inodes (< 10% left)\n VALUE = {{ $value }}."
    grafana: "{{ grafana_url }}?var-job=node_exporter&var-hostname=All&var-node={{ $labels.instance }}"
```
### Host unusual disk read latency

Disk latency is growing (read operations > 100ms).

```yaml
- alert: HostUnusualDiskReadLatency
  # The metrics are in seconds, so 0.1 corresponds to a 100ms average read latency.
  expr: rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Host unusual disk read latency (instance {{ $labels.instance }})"
    message: "Disk latency is growing (read operations > 100ms)\n VALUE = {{ $value }}"
    grafana: "{{ grafana_url }}?var-job=node_exporter&var-hostname=All&var-node={{ $labels.instance }}"
```
### Host unusual disk write latency

Disk latency is growing (write operations > 100ms).

```yaml
- alert: HostUnusualDiskWriteLatency
  # The metrics are in seconds, so 0.1 corresponds to a 100ms average write latency.
  expr: rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m]) > 0.1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Host unusual disk write latency (instance {{ $labels.instance }})"
    message: "Disk latency is growing (write operations > 100ms)\n VALUE = {{ $value }}"
    grafana: "{{ grafana_url }}?var-job=node_exporter&var-hostname=All&var-node={{ $labels.instance }}"
```
### Host high CPU load

CPU load is > 80%.

```yaml
- alert: HostHighCpuLoad
  expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Host high CPU load (instance {{ $labels.instance }})"
    message: "CPU load is > 80%\n VALUE = {{ $value }}."
    grafana: "{{ grafana_url }}?var-job=node_exporter&var-hostname=All&var-node={{ $labels.instance }}"
```
### Host context switching

Context switching is growing on the node (> 1000 / s).

```yaml
# 1000 context switches is an arbitrary number.
# The alert threshold depends on the nature of the application.
# Please read: https://github.com/samber/awesome-prometheus-alerts/issues/58
- alert: HostContextSwitching
  expr: (rate(node_context_switches_total[5m])) / (count without(cpu, mode) (node_cpu_seconds_total{mode="idle"})) > 1000
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Host context switching (instance {{ $labels.instance }})"
    message: "Context switching is growing on node (> 1000 / s)\n VALUE = {{ $value }}."
    grafana: "{{ grafana_url }}?var-job=node_exporter&var-hostname=All&var-node={{ $labels.instance }}"
```
### Host swap is filling up

Swap is filling up (> 80%).

```yaml
- alert: HostSwapIsFillingUp
  expr: (1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Host swap is filling up (instance {{ $labels.instance }})"
    message: "Swap is filling up (>80%)\n VALUE = {{ $value }}."
    grafana: "{{ grafana_url }}?var-job=node_exporter&var-hostname=All&var-node={{ $labels.instance }}"
```
### Host SystemD service crashed

SystemD service crashed.

```yaml
- alert: HostSystemdServiceCrashed
  expr: node_systemd_unit_state{state="failed"} == 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Host SystemD service crashed (instance {{ $labels.instance }})"
    message: "SystemD service crashed\n VALUE = {{ $value }}"
    grafana: "{{ grafana_url }}?var-job=node_exporter&var-hostname=All&var-node={{ $labels.instance }}"
```
### Host physical component too hot

Physical hardware component too hot.

```yaml
- alert: HostPhysicalComponentTooHot
  expr: node_hwmon_temp_celsius > 75
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Host physical component too hot (instance {{ $labels.instance }})"
    message: "Physical hardware component too hot\n VALUE = {{ $value }}."
    grafana: "{{ grafana_url }}?var-job=node_exporter&var-hostname=All&var-node={{ $labels.instance }}"
```
### Host node overtemperature alarm

Physical node temperature alarm triggered.

```yaml
- alert: HostNodeOvertemperatureAlarm
  expr: node_hwmon_temp_alarm == 1
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Host node overtemperature alarm (instance {{ $labels.instance }})"
    message: "Physical node temperature alarm triggered\n VALUE = {{ $value }}."
    grafana: "{{ grafana_url }}?var-job=node_exporter&var-hostname=All&var-node={{ $labels.instance }}"
```
### Host RAID array got inactive

RAID array {{ $labels.device }} is in a degraded state due to one or more disk failures. The number of spare drives is insufficient to fix the issue automatically.

```yaml
- alert: HostRaidArrayGotInactive
  expr: node_md_state{state="inactive"} > 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Host RAID array got inactive (instance {{ $labels.instance }})"
    message: "RAID array {{ $labels.device }} is in a degraded state due to one or more disk failures. The number of spare drives is insufficient to fix the issue automatically.\n VALUE = {{ $value }}"
    grafana: "{{ grafana_url }}?var-job=node_exporter&var-hostname=All&var-node={{ $labels.instance }}"
```
### Host RAID disk failure

At least one device in RAID array on {{ $labels.instance }} failed. Array {{ $labels.md_device }} needs attention and possibly a disk swap.

```yaml
- alert: HostRaidDiskFailure
  expr: node_md_disks{state="fail"} > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Host RAID disk failure (instance {{ $labels.instance }})"
    message: "At least one device in RAID array on {{ $labels.instance }} failed. Array {{ $labels.md_device }} needs attention and possibly a disk swap\n VALUE = {{ $value }}."
    grafana: "{{ grafana_url }}?var-job=node_exporter&var-hostname=All&var-node={{ $labels.instance }}"
```
### Host kernel version deviations

Different kernel versions are running.

```yaml
- alert: HostKernelVersionDeviations
  expr: count(sum(label_replace(node_uname_info, "kernel", "$1", "release", "([0-9]+.[0-9]+.[0-9]+).*")) by (kernel)) > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Host kernel version deviations (instance {{ $labels.instance }})"
    message: "Different kernel versions are running\n VALUE = {{ $value }}"
    grafana: "{{ grafana_url }}?var-job=node_exporter&var-hostname=All&var-node={{ $labels.instance }}"
```
### Host OOM kill detected

OOM kill detected.

```yaml
- alert: HostOomKillDetected
  expr: increase(node_vmstat_oom_kill[5m]) > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Host OOM kill detected (instance {{ $labels.instance }})"
    message: "OOM kill detected\n VALUE = {{ $value }}"
    grafana: "{{ grafana_url }}?var-job=node_exporter&var-hostname=All&var-node={{ $labels.instance }}"
```
### Host Network Receive Errors

{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} receive errors in the last five minutes.

```yaml
- alert: HostNetworkReceiveErrors
  expr: increase(node_network_receive_errs_total[5m]) > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Host Network Receive Errors (instance {{ $labels.instance }})"
    message: "{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} receive errors in the last five minutes.\n VALUE = {{ $value }}"
    grafana: "{{ grafana_url }}?var-job=node_exporter&var-hostname=All&var-node={{ $labels.instance }}"
```
### Host Network Transmit Errors

{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} transmit errors in the last five minutes.

```yaml
- alert: HostNetworkTransmitErrors
  expr: increase(node_network_transmit_errs_total[5m]) > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Host Network Transmit Errors (instance {{ $labels.instance }})"
    message: "{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} transmit errors in the last five minutes.\n VALUE = {{ $value }}"
    grafana: "{{ grafana_url }}?var-job=node_exporter&var-hostname=All&var-node={{ $labels.instance }}"
```