# ZFS Prometheus exporter
You can use the [zfs_exporter](https://github.com/pdf/zfs_exporter) to create alerts on your ZFS pools, filesystems, snapshots and volumes.
## Available metrics
It's not easy to match the exporter metrics with the output of `zfs list -o space`. Here is a correlation table:
- USED: `zfs_dataset_used_bytes{type="filesystem"}`
- AVAIL: `zfs_dataset_available_bytes{type="filesystem"}`
- LUSED: `zfs_dataset_logical_used_bytes{type="filesystem"}`
- USEDDS: `zfs_dataset_used_by_dataset_bytes{type="filesystem"}`
- USEDSNAP: Currently there is no published metric for this data. You can either use `zfs_dataset_used_bytes - zfs_dataset_used_by_dataset_bytes`, which will show wrong data if the dataset has children, or try `sum by (hostname,filesystem) (zfs_dataset_used_bytes{type='snapshot'})`, which returns smaller sizes than expected. An example of both approximations is shown below.
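For example, a minimal PromQL sketch of both USEDSNAP approximations (this assumes the `filesystem` label produced by the relabeling configured below, and a `hostname` label coming from your scrape setup):

```promql
# Approximation 1: USED - USEDDS. Overestimates on datasets with children,
# since USED also accounts for the descendants' space.
zfs_dataset_used_bytes{type="filesystem"} - zfs_dataset_used_by_dataset_bytes{type="filesystem"}

# Approximation 2: add up the per-snapshot sizes. Underestimates, because a
# snapshot's "used" only counts the space unique to that snapshot, so space
# shared by several snapshots is not attributed to any of them.
sum by (hostname, filesystem) (zfs_dataset_used_bytes{type="snapshot"})
```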
## Installation

### Install the exporter
Download the latest release for your platform and unpack it somewhere on your filesystem.
If you use Ansible you can use the following task:
```yaml
---
- name: Test if zfs_exporter binary exists
  stat:
    path: /usr/local/bin/zfs_exporter
  register: zfs_exporter_binary

- name: Install the zfs exporter
  block:
    - name: Download the zfs exporter
      delegate_to: localhost
      ansible.builtin.unarchive:
        src: https://github.com/pdf/zfs_exporter/releases/download/v{{ zfs_exporter_version }}/zfs_exporter-{{ zfs_exporter_version }}.linux-amd64.tar.gz
        dest: /tmp/
        remote_src: yes

    - name: Upload the zfs exporter to the server
      become: true
      copy:
        src: /tmp/zfs_exporter-{{ zfs_exporter_version }}.linux-amd64/zfs_exporter
        dest: /usr/local/bin
        mode: 0755
  when: not zfs_exporter_binary.stat.exists

- name: Create the systemd service
  become: true
  template:
    src: service.j2
    dest: /etc/systemd/system/zfs_exporter.service
  notify: Restart the service
```
With this service template:
```ini
[Unit]
Description=zfs_exporter
After=network-online.target

[Service]
Restart=always
RestartSec=5
TimeoutSec=5
User=root
Group=root
ExecStart=/usr/local/bin/zfs_exporter {{ zfs_exporter_arguments }}

[Install]
WantedBy=multi-user.target
```
This defaults file:
```yaml
---
zfs_exporter_version: 2.2.8
zfs_exporter_arguments: --collector.dataset-snapshot
```
And this handler:
```yaml
- name: Restart the service
  become: true
  systemd:
    name: zfs_exporter
    enabled: true
    daemon_reload: true
    state: restarted
```
I know I should publish a role, but I'm lazy right now :P
### Configure the exporter
Configure the scraping in your Prometheus configuration:
```yaml
- job_name: zfs_exporter
  metrics_path: /metrics
  scrape_timeout: 60s
  static_configs:
    - targets: ['192.168.3.236:9134']
  metric_relabel_configs:
    - source_labels: ['name']
      regex: ^([^@]*).*$
      target_label: filesystem
      replacement: ${1}
    - source_labels: ['name']
      regex: ^.*:.._(.*)$
      target_label: snapshot_type
      replacement: ${1}
```
Remember to set the `scrape_timeout` to at least 60s, as the exporter is sometimes slow to answer, especially on hosts with low hardware resources.
The relabeling is done to be able to extract the `filesystem` and the backup type from the snapshots' metrics. This assumes that you are using sanoid to do the backups, which gives metrics such as:

```
zfs_dataset_written_bytes{name="main/apps/nginx_public@autosnap_2023-06-03_00:00:18_daily",pool="main",type="snapshot"} 0
```

For that metric you'll get that the `filesystem` is `main/apps/nginx_public` and the backup type is `daily`.
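After those relabel rules, the same series would be stored with the extracted labels, something like:

```
zfs_dataset_written_bytes{name="main/apps/nginx_public@autosnap_2023-06-03_00:00:18_daily",pool="main",type="snapshot",filesystem="main/apps/nginx_public",snapshot_type="daily"} 0
```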
### Configure the alerts
The people of Awesome Prometheus Alerts give some good ZFS alerts for this exporter:
```yaml
- alert: ZfsPoolOutOfSpace
  expr: zfs_pool_free_bytes * 100 / zfs_pool_size_bytes < 10 and ON (instance, device, mountpoint) zfs_pool_readonly == 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: ZFS pool out of space (instance {{ $labels.instance }})
    description: "Disk is almost full (< 10% left)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

- alert: ZfsPoolUnhealthy
  expr: last_over_time(zfs_pool_health[1h]) > 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: ZFS pool unhealthy (instance {{ $labels.instance }})
    description: "ZFS pool state is {{ $value }}. Where:\n - 0: ONLINE\n - 1: DEGRADED\n - 2: FAULTED\n - 3: OFFLINE\n - 4: UNAVAIL\n - 5: REMOVED\n - 6: SUSPENDED\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

- alert: ZfsCollectorFailed
  expr: zfs_scrape_collector_success != 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: ZFS collector failed (instance {{ $labels.instance }})
    description: "ZFS collector for {{ $labels.instance }} has failed to collect information\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
#### Snapshot alerts
You can also monitor the status of the snapshots:
```yaml
- alert: ZfsDatasetWithNoSnapshotsError
  expr: zfs_dataset_used_by_dataset_bytes{type="filesystem"} > 200e3 unless on (hostname,filesystem) count by (hostname, filesystem, job) (zfs_dataset_used_bytes{type="snapshot"}) > 1
  for: 5m
  labels:
    severity: error
  annotations:
    summary: The dataset {{ $labels.filesystem }} at {{ $labels.hostname }} doesn't have any snapshot.
    description: "There might be an error on the snapshot system\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

- alert: ZfsSnapshotTypeFrequentlySizeError
  expr: increase(sum by (hostname, filesystem, job) (zfs_dataset_used_bytes{type='snapshot',snapshot_type='frequently'})[60m:15m]) == 0 and count_over_time(zfs_dataset_used_bytes{type="filesystem"}[60m:15m]) == 4
  for: 5m
  labels:
    severity: error
  annotations:
    summary: The size of the frequently snapshots has not changed for the dataset {{ $labels.filesystem }} at {{ $labels.hostname }}.
    description: "There might be an error on the snapshot system or the data has not changed in the last hour\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

- alert: ZfsSnapshotTypeHourlySizeError
  expr: increase(sum by (hostname, filesystem, job) (zfs_dataset_used_bytes{type='snapshot',snapshot_type='hourly'})[2h:30m]) == 0 and count_over_time(zfs_dataset_used_bytes{type="filesystem"}[2h:30m]) == 4
  for: 5m
  labels:
    severity: error
  annotations:
    summary: The size of the hourly snapshots has not changed for the dataset {{ $labels.filesystem }} at {{ $labels.hostname }}.
    description: "There might be an error on the snapshot system or the data has not changed in the last hour\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

- alert: ZfsSnapshotTypeDailySizeError
  expr: increase(sum by (hostname, filesystem, job) (zfs_dataset_used_bytes{type='snapshot',snapshot_type='daily'})[2d:12h]) == 0 and count_over_time(zfs_dataset_used_bytes{type="filesystem"}[2d:12h]) == 4
  for: 5m
  labels:
    severity: error
  annotations:
    summary: The size of the daily snapshots has not changed for the dataset {{ $labels.filesystem }} at {{ $labels.hostname }}.
    description: "There might be an error on the snapshot system or the data has not changed in the last hour\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

- alert: ZfsSnapshotTypeMonthlySizeError
  expr: increase(sum by (hostname, filesystem, job) (zfs_dataset_used_bytes{type='snapshot',snapshot_type='monthly'})[60d:15d]) == 0 and count_over_time(zfs_dataset_used_bytes{type="filesystem"}[60d:15d]) == 4
  for: 5m
  labels:
    severity: error
  annotations:
    summary: The size of the monthly snapshots has not changed for the dataset {{ $labels.filesystem }} at {{ $labels.hostname }}.
    description: "There might be an error on the snapshot system or the data has not changed in the last hour\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

- alert: ZfsSnapshotTypeFrequentlyUnexpectedNumberError
  expr: increase((count by (hostname, filesystem, job) (zfs_dataset_used_bytes{snapshot_type="frequently",type="snapshot"}) < 4)[16m:8m]) < 1 and count_over_time(zfs_dataset_used_bytes{type="filesystem"}[16m:8m]) == 2
  for: 5m
  labels:
    severity: error
  annotations:
    summary: The number of the frequent snapshots has not changed for the dataset {{ $labels.filesystem }} at {{ $labels.hostname }}.
    description: "There might be an error on the snapshot system\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

- alert: ZfsSnapshotTypeHourlyUnexpectedNumberError
  expr: increase((count by (hostname, filesystem, job) (zfs_dataset_used_bytes{snapshot_type="hourly",type="snapshot"}) < 24)[1h10m:10m]) < 1 and count_over_time(zfs_dataset_used_bytes{type="filesystem"}[1h10m:10m]) == 7
  for: 5m
  labels:
    severity: error
  annotations:
    summary: The number of the hourly snapshots has not changed for the dataset {{ $labels.filesystem }} at {{ $labels.hostname }}.
    description: "There might be an error on the snapshot system\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

- alert: ZfsSnapshotTypeDailyUnexpectedNumberError
  expr: increase((count by (hostname, filesystem, job) (zfs_dataset_used_bytes{type='snapshot',snapshot_type='daily'}) < 30)[1d2h:2h]) < 1 and count_over_time(zfs_dataset_used_bytes{type="filesystem"}[1d2h:2h]) == 13
  for: 5m
  labels:
    severity: error
  annotations:
    summary: The number of the daily snapshots has not changed for the dataset {{ $labels.filesystem }} at {{ $labels.hostname }}.
    description: "There might be an error on the snapshot system\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

- alert: ZfsSnapshotTypeMonthlyUnexpectedNumberError
  expr: increase((count by (hostname, filesystem, job) (zfs_dataset_used_bytes{type='snapshot',snapshot_type='monthly'}) < 6)[31d:1d]) < 1 and count_over_time(zfs_dataset_used_bytes{type="filesystem"}[31d:1d]) == 31
  for: 5m
  labels:
    severity: error
  annotations:
    summary: The number of the monthly snapshots has not changed for the dataset {{ $labels.filesystem }} at {{ $labels.hostname }}.
    description: "There might be an error on the snapshot system\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

- record: zfs_dataset_snapshot_bytes
  # This expression is not accurate for datasets that have children, so we're
  # going to create this metric only for the datasets that don't have children.
  # I'm also going to assume that the datasets that have children don't hold data.
  expr: zfs_dataset_used_bytes - zfs_dataset_used_by_dataset_bytes and zfs_dataset_used_by_dataset_bytes > 200e3

- alert: ZfsSnapshotTooMuchSize
  expr: zfs_dataset_snapshot_bytes / zfs_dataset_used_by_dataset_bytes > 2 and zfs_dataset_snapshot_bytes > 10e9
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: The snapshots of the dataset {{ $labels.filesystem }} at {{ $labels.hostname }} use more than two times the data space
    description: "The snapshots of the dataset {{ $labels.filesystem }} at {{ $labels.hostname }} use more than two times the data space\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
## Useful inhibits

Sometimes you may want to inhibit some of these rules for some of your datasets. These subsections should be added to the `alertmanager.yml` file under the `inhibit_rules` field, as sketched below.
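For reference, a minimal sketch of how the entries from the next subsections fit in the file (the matcher values here are placeholders):

```yaml
# alertmanager.yml (fragment); replace the placeholder matchers with the
# real ones from the subsections below.
inhibit_rules:
  - target_matchers:
      - alertname = SomeZfsAlert
      - filesystem = some_dataset
```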
### Ignore snapshots on some datasets

Sometimes you don't want to take snapshots of a dataset:
```yaml
- target_matchers:
    - alertname = ZfsDatasetWithNoSnapshotsError
    - hostname = my_server_1
    - filesystem = tmp
```
### Ignore snapshots growth

Sometimes you don't mind if the size of the data saved in a filesystem doesn't change much between snapshots, especially for the most frequent backups, because you prefer to keep the backup cadence. It's still interesting to have the alert, though, so that you get notified of the datasets that don't change much and can tweak your backup policy accordingly (even if ZFS snapshots are almost free).
```yaml
- target_matchers:
    - alertname =~ "ZfsSnapshotType(Frequently|Hourly)SizeError"
    - filesystem =~ "(media/(docs|music))"
```