# Sanoid
Sanoid is the most popular tool right now. With it you can create, automatically thin, and monitor snapshots and pool health from a single, eminently human-readable TOML config file at `/etc/sanoid/sanoid.conf`. Sanoid also requires a "defaults" file located at `/etc/sanoid/sanoid.defaults.conf`, which is not user-editable. A typical Sanoid system would have a single cron job:
```
* * * * * TZ=UTC /usr/local/bin/sanoid --cron
```
And its `/etc/sanoid/sanoid.conf` might look something like this:
```ini
[data/home]
use_template = production

[data/images]
use_template = production
recursive = yes
process_children_only = yes

[data/images/win7]
hourly = 4

#############################
# templates below this line #
#############################

[template_production]
frequently = 0
hourly = 36
daily = 30
monthly = 3
yearly = 0
autosnap = yes
autoprune = yes
```
This would be enough to tell `sanoid` to take and keep 36 hourly snapshots, 30 dailies, 3 monthlies, and no yearlies for all datasets under `data/images` (but not `data/images` itself, since `process_children_only` is set), except in the case of `data/images/win7`, which follows the same template (since it's a child of `data/images`) but only keeps 4 hourlies for whatever reason.
For full details on `sanoid.conf` settings, see their wiki page.
Monitoring is designed to be done with Nagios, although there is some work in progress to add Prometheus metrics, and there is an exporter.
What I like about `sanoid`:
- It's popular.
- It has hooks to run your scripts at various stages in the lifecycle of a snapshot.
- It also handles the process of sending the backups to other locations with `syncoid`.
- It lets you search all changes of a given file (or folder) over all available snapshots with `findoid`, which is useful in case you need to recover a file or folder but don't want to roll back an entire snapshot (although when I used it, it gave me an error :/).
- It's in the official repos.
What I don't like:

- The last release was almost 2 years ago.
- The last commit to `master` was done a year ago.
- It's made in Perl.
## Installation
### Stable version
The tool is in the official repositories, so:

```bash
sudo apt-get install sanoid
```
### Latest version
```bash
cd /tmp
git clone https://github.com/jimsalterjrs/sanoid.git
cd sanoid

# checkout latest stable release or stay on master for bleeding edge stuff (but expect bugs!)
git checkout $(git tag | grep "^v" | tail -n 1)
ln -s packages/debian .
dpkg-buildpackage -uc -us
sudo apt install ../sanoid_*_all.deb
```
Enable the `sanoid` timer:

```bash
# enable and start the sanoid timer
sudo systemctl enable --now sanoid.timer
```
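You can confirm the timer is active and see when it will next fire with:

```bash
systemctl list-timers sanoid.timer
```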
## Configuration
You can find the example config file at `/usr/share/doc/sanoid/examples/sanoid.conf` and copy it to `/etc/sanoid/sanoid.conf`:

```bash
mkdir /etc/sanoid/
cp /usr/share/doc/sanoid/examples/sanoid.conf /etc/sanoid/sanoid.conf
cp /usr/share/sanoid/sanoid.defaults.conf /etc/sanoid/sanoid.defaults.conf
```
Edit `/etc/sanoid/sanoid.conf` to suit your needs. The `/etc/sanoid/sanoid.defaults.conf` file contains the default values and should not be touched; use it only for reference.
An example configuration could be:

```ini
######################
# Filesystem Backups #
######################

[main/backup]
use_template = daily
recursive = yes

[main/lyz]
use_template = frequent

#############
# Templates #
#############

[template_daily]
daily = 30
monthly = 6

[template_frequent]
frequently = 4
hourly = 25
daily = 30
monthly = 6
```
During installation from the Debian repositories, the `systemd` timer unit `sanoid.timer` is created and set to run `sanoid` every 15 minutes, so there is no need to create a crontab entry. Having a crontab entry in addition to `sanoid.timer` will result in errors similar to `cannot create snapshot '<pool>/<dataset>@<snapshot>': dataset already exists`.
By default, the `sanoid.timer` timer unit runs the `sanoid-prune` service followed by the `sanoid` service. To edit any of the command-line options, you can edit the unit files under `/lib/systemd/system/` (e.g. `sanoid.timer`).
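Rather than editing the shipped unit files in place, a drop-in override survives package upgrades. A minimal sketch, assuming the Debian layout where the binary lives at `/usr/sbin/sanoid` (the `--verbose` flag is just an illustrative addition):

```bash
# Opens an editor on /etc/systemd/system/sanoid.service.d/override.conf
sudo systemctl edit sanoid.service
```

And in the override:

```ini
[Service]
# An empty ExecStart= clears the packaged command before redefining it
ExecStart=
ExecStart=/usr/sbin/sanoid --take-snapshots --verbose
```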
Also, `recursive` is not set by default, so the dataset's children won't be backed up unless you set this option.
## Usage
`sanoid` runs in the background through its `systemd` service, so there is nothing you need to do for it to run. To check the logs use `journalctl -eu sanoid`.
To manage the snapshots, look at the `zfs` article.
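For instance, to list the snapshots `sanoid` has taken for a dataset (the dataset name is illustrative):

```bash
zfs list -t snapshot -r data/home
```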
### Prune snapshots
If you want to manually prune the snapshots after tweaking `sanoid.conf`, you can run:

```bash
sanoid --prune-snapshots
```
## Syncoid
`Sanoid` also includes a replication tool, `syncoid`, which facilitates the asynchronous incremental replication of ZFS filesystems. A typical `syncoid` command might look like this:

```bash
syncoid data/images/vm backup/images/vm
```
This would replicate the specified ZFS filesystem (aka dataset) from the data pool to the backup pool on the local system, or:

```bash
syncoid data/images/vm root@remotehost:backup/images/vm
```

This would push-replicate the specified ZFS filesystem from the local host to remotehost over an SSH tunnel, or:

```bash
syncoid root@remotehost:data/images/vm backup/images/vm
```

This would pull-replicate the filesystem from the remote host to the local system over an SSH tunnel. When in doubt, the pull strategy is preferred.
`Syncoid` supports recursive replication (replication of a dataset and all its child datasets) and uses `mbuffer` buffering, `lzop` compression, and `pv` progress bars if those utilities are available on the systems used. It also automatically supports and enables resumable send/receive streams when both the source and target support this feature.
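For example, a recursive run over a dataset tree towards a remote host could look like this (hosts and dataset names are illustrative):

```bash
syncoid --recursive data/images root@remotehost:backup/images
```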
### Configuration
#### Syncoid configuration caveats
One key point is that pruning is not done by `syncoid` but only and always by `sanoid`. This means `sanoid` has to be run on the backup datasets as well, but without creating snapshots, only pruning (as set in the template).
Also, the template section is called `template_something`, but only `something` must be used with `use_template`.
```ini
[SAN200/projects]
use_template = production
recursive = yes
process_children_only = yes

[BACKUP/SAN200/projects]
use_template = backup
recursive = yes
process_children_only = yes
```
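The `backup` template used above is not shown in the snippet; a minimal sketch of it could disable snapshot creation while keeping pruning enabled (retention values are illustrative):

```ini
[template_backup]
autoprune = yes
autosnap = no
daily = 60
monthly = 12
```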
`post_snapshot_script` cannot be used with `syncoid`, especially with `recursive = yes`. This is because there cannot be two zfs send/receive operations running at the same time on the same dataset, and `sanoid` does not wait for the script's completion before continuing. This means that should the `syncoid` process take a bit too much time, a new one will be spawned. And for reasons unknown to me yet, a new `syncoid` process will cancel the previous one (instead of just exiting). As some of the spawned `syncoid` processes will produce errors, the entire `sanoid` process will fail.
So this approach does not work and the replication has to be done independently, it seems. The good news is that a systemd service of `Type=oneshot` can have several `ExecStart=` lines.
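A sketch of such a unit, with hypothetical dataset names and assuming the Debian path `/usr/sbin/syncoid`:

```ini
# /etc/systemd/system/syncoid-replication.service (hypothetical name)
[Unit]
Description=Replicate ZFS datasets with syncoid

[Service]
Type=oneshot
# For Type=oneshot the ExecStart lines run sequentially, never in parallel,
# which avoids the concurrent send/receive problem described above
ExecStart=/usr/sbin/syncoid --recursive data/home backup/data/home
ExecStart=/usr/sbin/syncoid --recursive data/images backup/data/images
```

You can then trigger it periodically with a matching `syncoid-replication.timer` unit.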
#### Send encrypted backups to an encrypted dataset
`syncoid`'s default behaviour is to create the destination dataset without encryption, so the snapshots are transferred and can be read without encryption. You can check this with the `zfs get encryption,keylocation,keyformat` command both on source and destination.
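For example (the dataset name is taken from the session below):

```bash
# Run on both the source and the destination and compare the output
sudo zfs get encryption,keylocation,keyformat server_data/nextcloud
```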
To prevent this from happening you have to [pass `--sendoptions='w'`](https://github.com/jimsalterjrs/sanoid/issues/548) to `syncoid` so that it tells zfs to send a raw stream. If you do so, you also need to [transfer the key file](https://github.com/jimsalterjrs/sanoid/issues/648) to the destination server so that it can do a `zfs load-key` and then mount the dataset. For example:
```bash
server-host:$ sudo zfs list -t filesystem
NAME                    USED  AVAIL  REFER  MOUNTPOINT
server_data             232M  38.1G   230M  /var/server_data
server_data/log         111K  38.1G   111K  /var/server_data/log
server_data/mail        111K  38.1G   111K  /var/server_data/mail
server_data/nextcloud   111K  38.1G   111K  /var/server_data/nextcloud
server_data/postgres    111K  38.1G   111K  /var/server_data/postgres

server-host:$ sudo zfs get keylocation server_data/nextcloud
NAME                   PROPERTY     VALUE                                    SOURCE
server_data/nextcloud  keylocation  file:///root/zfs_dataset_nextcloud_pass  local

server-host:$ sudo syncoid --recursive --skip-parent --sendoptions=w server_data root@192.168.122.94:backup_pool
INFO: Sending oldest full snapshot server_data/log@autosnap_2021-06-18_18:33:42_yearly (~ 49 KB) to new target filesystem:
 17.0KiB 0:00:00 [1.79MiB/s] [=================================================>                 ]  34%
INFO: Updating new target filesystem with incremental server_data/log@autosnap_2021-06-18_18:33:42_yearly ... syncoid_caedrium.com_2021-06-22:10:12:55 (~ 15 KB):
 41.2KiB 0:00:00 [78.4KiB/s] [===================================================================] 270%
INFO: Sending oldest full snapshot server_data/mail@autosnap_2021-06-18_18:33:42_yearly (~ 49 KB) to new target filesystem:
 17.0KiB 0:00:00 [ 921KiB/s] [=================================================>                 ]  34%
INFO: Updating new target filesystem with incremental server_data/mail@autosnap_2021-06-18_18:33:42_yearly ... syncoid_caedrium.com_2021-06-22:10:13:14 (~ 15 KB):
 41.2KiB 0:00:00 [49.4KiB/s] [===================================================================] 270%
INFO: Sending oldest full snapshot server_data/nextcloud@autosnap_2021-06-18_18:33:42_yearly (~ 49 KB) to new target filesystem:
 17.0KiB 0:00:00 [ 870KiB/s] [=================================================>                 ]  34%
INFO: Updating new target filesystem with incremental server_data/nextcloud@autosnap_2021-06-18_18:33:42_yearly ... syncoid_caedrium.com_2021-06-22:10:13:42 (~ 15 KB):
 41.2KiB 0:00:00 [50.4KiB/s] [===================================================================] 270%
INFO: Sending oldest full snapshot server_data/postgres@autosnap_2021-06-18_18:33:42_yearly (~ 50 KB) to new target filesystem:
 17.0KiB 0:00:00 [1.36MiB/s] [===============================================>                   ]  33%
INFO: Updating new target filesystem with incremental server_data/postgres@autosnap_2021-06-18_18:33:42_yearly ... syncoid_caedrium.com_2021-06-22:10:14:11 (~ 15 KB):
 41.2KiB 0:00:00 [48.9KiB/s] [===================================================================] 270%

server-host:$ sudo scp /root/zfs_dataset_nextcloud_pass 192.168.122.94:

backup-host:$ sudo zfs set keylocation=file:///root/zfs_dataset_nextcloud_pass backup_pool/nextcloud
backup-host:$ sudo zfs load-key backup_pool/nextcloud
backup-host:$ sudo zfs mount backup_pool/nextcloud
```
If you also want to keep the `encryptionroot`, you need to let zfs take care of the recursion instead of syncoid. In this case you can't use syncoid features like `--exclude`. From the manpage of zfs:
```
-R, --replicate
    Generate a replication stream package, which will replicate the specified
    file system, and all descendent file systems, up to the named snapshot.
    When received, all properties, snapshots, descendent file systems, and
    clones are preserved.

    If the -i or -I flags are used in conjunction with the -R flag, an
    incremental replication stream is generated. The current values of
    properties, and current snapshot and file system names are set when the
    stream is received. If the -F flag is specified when this stream is
    received, snapshots and file systems that do not exist on the sending
    side are destroyed. If the -R flag is used to send encrypted datasets,
    then -w must also be specified.
```
In this case this should work:

```bash
/sbin/syncoid --recursive --force-delete --sendoptions="Rw" zpool/backups zfs-recv@10.29.3.27:zpool/backups
```
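Afterwards you can check that the `encryptionroot` survived the transfer (pool names follow the example above):

```bash
zfs get -r encryptionroot zpool/backups
```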
## Monitoring

You can monitor these issues with Loki using the following alerts:
```yaml
groups:
  - name: zfs
    rules:
      - alert: SyncoidCorruptedSnapshotSendError
        expr: |
          count_over_time({syslog_identifier="syncoid_send_backups"} |= `cannot receive incremental stream: invalid backup stream` [15m]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Error trying to send a corrupted snapshot at {{ $labels.hostname}}"
          message: "Look at the context on loki to identify the snapshot in question. Delete it and then run the sync again"

      - alert: SanoidNotRunningError
        expr: |
          sum by (hostname) (count_over_time({job="systemd-journal", syslog_identifier="sanoid"}[1h])) or sum by (hostname) (count_over_time({job="systemd-journal"}[1h]) * 0)
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Sanoid has not shown signs of being alive for at least the last hour in arva and helm"

      - alert: ErrorInSanoidLogs
        expr: |
          count_over_time({job="systemd-journal", syslog_identifier="sanoid"} |= `ERROR` [5m])
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Errors found in the sanoid logs at {{ $labels.hostname}}"

      - alert: SlowSpaSyncZFSError
        expr: |
          count_over_time({job="zfs"} |~ `spa_deadman.*slow spa_sync` [10m]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Slow sync traces found in the ZFS debug logs at {{ $labels.hostname}}"
          message: "This usually happens before ZFS becomes unresponsive"
```
The `SanoidNotRunningError` alert uses a broader search that ensures all hosts are included and multiplies it by 0, so the alert is raised if no `sanoid` log line shows up for a host.
## Troubleshooting
### Syncoid: no tty present and no askpass program specified
If you try to just sync a ZFS dataset between two machines with something like `syncoid pool/dataset user@remote:pool/dataset`, you'll eventually see `syncoid` throwing a sudo error: `sudo: no tty present and no askpass program specified`. That's because it's trying to run a sudo command on the remote, and sudo doesn't have a way to ask for a password with the way `syncoid` runs commands on the remote.
Searching online, many people just say to enable SSH as root, which might be fine on a local network, but it's not the best solution. Instead, you can enable passwordless `sudo` for `zfs` commands for an unprivileged user. Getting this done is very simple:

```bash
sudo visudo -f /etc/sudoers.d/zfs_receive_for_syncoid
```

And then fill it with the following:

```
<your user> ALL=NOPASSWD: /usr/sbin/zfs *
```
If you really want to put in the effort, you can even take a look at which `zfs` commands `syncoid` is actually invoking, and then restrict passwordless sudo to only those commands. It's important that you do this for all the commands that `syncoid` uses; it runs a few `zfs` commands with sudo to list snapshots and get some other information on the remote machine before doing the transfer.