Smartctl

Self-Monitoring, Analysis, and Reporting Technology (S.M.A.R.T. or SMART) is a monitoring system included in computer hard disk drives (HDDs) and solid-state drives (SSDs). Its primary function is to detect and report various indicators of drive reliability, with the aim of anticipating imminent hardware failures.

When S.M.A.R.T. data indicates a possible imminent drive failure, software running on the host system may notify the user so that action can be taken to prevent data loss and the failing drive can be replaced.

General information

Accuracy

A field study at Google covering over 100,000 consumer-grade drives from December 2005 to August 2006 found correlations between certain S.M.A.R.T. information and annualized failure rates:

In the 60 days following the first uncorrectable error on a drive (S.M.A.R.T. attribute 0xC6 or 198) detected as a result of an offline scan, the drive was, on average, 39 times more likely to fail than a similar drive for which no such error occurred.
First errors in reallocations, offline reallocations (S.M.A.R.T. attributes 0xC4 and 0x05 or 196 and 5) and probational counts (S.M.A.R.T. attribute 0xC5 or 197) were also strongly correlated to higher probabilities of failure.
Conversely, little correlation was found for increased temperature and no correlation for usage level. However, the research showed that a large proportion (56%) of the failed drives failed without recording any count in the "four strong S.M.A.R.T. warnings" identified as scan errors, reallocation count, offline reallocation, and probational count.
Further, 36% of failed drives did so without recording any S.M.A.R.T. error at all, except the temperature, meaning that S.M.A.R.T. data alone was of limited usefulness in anticipating failures.

Installation

On Debian systems:

sudo apt-get install smartmontools

By default, once the package is installed, the smartd daemon checks all your drives periodically. It runs under the smartmontools systemd service.

Usage

Running the tests

You can run the tests even if the disk is in use as the checks do not modify anything on your disk and thus will not interfere with your normal usage.

Test types

S.M.A.R.T. drives may offer a number of self-tests:

Short: Checks the electrical and mechanical performance as well as the read performance of the disk. Electrical tests might include a test of buffer RAM, a read/write circuitry test, or a test of the read/write head elements. Mechanical test includes seeking and servo on data tracks. Scans small parts of the drive's surface (area is vendor-specific and there is a time limit on the test). Checks the list of pending sectors that may have read errors, and it usually takes under two minutes.
Long/extended: A longer and more thorough version of the short self-test, scanning the entire disk surface with no time limit. This test usually takes several hours, depending on the read/write speed of the drive and its size. It is possible for the long test to pass even if the short test fails.
Conveyance: Intended as a quick test to identify damage incurred during transporting of the device from the drive manufacturer to the computer manufacturer. Only available on ATA drives, and it usually takes several minutes.

Drives remain operable during self-test, unless a "captive" option (ATA only) is requested.
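
The test types above map onto the `-t` option of smartctl, with `-C` requesting captive mode. As a minimal sketch, the helper below only builds the argument list and does not run anything. The function name `selftest_command` and the constant `TEST_TYPES` are made up for this example.

```python
from typing import List

# Test names accepted by `smartctl -t`; conveyance is ATA-only.
TEST_TYPES = ("short", "long", "conveyance")


def selftest_command(device: str, test: str = "short", captive: bool = False) -> List[str]:
    """Build the smartctl argument list that starts a self-test (hypothetical helper)."""
    if test not in TEST_TYPES:
        raise ValueError(f"unknown self-test type: {test!r}")
    cmd = ["smartctl", "-t", test]
    if captive:
        # -C runs the test in captive mode (ATA only): the drive is busy until it ends.
        cmd.append("-C")
    cmd.append(device)
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it; hand the list to subprocess.run to execute it.
    print(" ".join(selftest_command("/dev/sdd", "long")))
```

Returning a list rather than a string keeps the command safe to pass to `subprocess.run` without shell quoting.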

Long test

Start with a long self test with smartctl. Assuming the disk to test is /dev/sdd:

smartctl -t long /dev/sdd

The command will respond with an estimate of how long it thinks the test will take to complete.

To check progress use:

smartctl -a /dev/sdd | grep remaining
# or
smartctl -c /dev/sdd | grep remaining

Don't check too often because it can abort the test with some drives. If you receive an empty output, examine the reported status with:

smartctl -l selftest /dev/sdd

If errors are shown, check dmesg, as it usually contains useful traces of the error.
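
If you prefer to script the progress check instead of using grep, the idea can be sketched as below. The sample text is an illustrative assumption of what an ATA drive prints while a test is running, and the name `remaining_percent` is made up for this example. Other drives may word the status differently.

```python
import re
from typing import Optional

# Matches the "NN% of test remaining" phrase that the grep commands look for.
REMAINING_RE = re.compile(r"(\d+)% of test remaining")


def remaining_percent(smartctl_output: str) -> Optional[int]:
    """Return the percentage of the self-test still to run, or None if the phrase is absent."""
    match = REMAINING_RE.search(smartctl_output)
    return int(match.group(1)) if match else None


# Illustrative sample only; real output depends on the drive and smartctl version.
SAMPLE = (
    "Self-test execution status:      ( 249)\tSelf-test routine in progress...\n"
    "\t\t\t\t\t90% of test remaining.\n"
)

if __name__ == "__main__":
    print(remaining_percent(SAMPLE))
```

A `None` result corresponds to the empty grep output mentioned above, in which case the self-test log is the place to look.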

Understanding the tests

The output of a smartctl command is difficult to read:

smartctl 5.40 2010-03-16 r3077 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint F2 EG series
Device Model:     SAMSUNG HD502HI
Serial Number:    S1VZJ9CS712490
Firmware Version: 1AG01118
User Capacity:    500,107,862,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 3b
Local Time is:    Wed Feb  9 15:30:42 2011 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:          (6312) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 106) minutes.
Conveyance self-test routine
recommended polling time:      (  12) minutes.
SCT capabilities:            (0x003f)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   099   099   051    Pre-fail  Always       -       2376
  3 Spin_Up_Time            0x0007   091   091   011    Pre-fail  Always       -       3620
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       405
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   253   253   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   100   100   015    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       717
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       405
 13 Read_Soft_Error_Rate    0x000e   099   099   000    Old_age   Always       -       2375
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   000    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       2375
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   084   074   000    Old_age   Always       -       16 (Lifetime Min/Max 16/16)
194 Temperature_Celsius     0x0022   084   071   000    Old_age   Always       -       16 (Lifetime Min/Max 16/16)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       3558
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   098   098   000    Old_age   Always       -       81
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       1
200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   253   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
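
To make the attribute table in a report like this easier to work with, a small parser can turn each row into a dictionary. This is a sketch under the assumption that the table has the ten columns shown above. The names `COLUMNS` and `parse_attributes` are invented for the example.

```python
from typing import Dict, List

# Column names chosen for this sketch, in the order they appear in the table.
COLUMNS = ["id", "name", "flag", "value", "worst", "thresh",
           "type", "updated", "when_failed", "raw"]


def parse_attributes(report: str) -> List[Dict[str, str]]:
    """Extract the rows of the SMART attribute table from a smartctl text report."""
    rows: List[Dict[str, str]] = []
    in_table = False
    for line in report.splitlines():
        if line.startswith("ID# ATTRIBUTE_NAME"):
            in_table = True
            continue
        if in_table:
            # Split into at most 10 fields so that a raw value containing spaces stays whole.
            fields = line.strip().split(None, 9)
            if len(fields) < 10 or not fields[0].isdigit():
                break  # blank line or next section: the table has ended
            rows.append(dict(zip(COLUMNS, fields)))
    return rows


# Three rows copied from the report above.
SAMPLE = """ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   099   099   051    Pre-fail  Always       -       2376
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
190 Airflow_Temperature_Cel 0x0022   084   074   000    Old_age   Always       -       16 (Lifetime Min/Max 16/16)

SMART Error Log Version: 1
"""

if __name__ == "__main__":
    for row in parse_attributes(SAMPLE):
        print(row["id"], row["name"], row["raw"])
```

smartmontools 7.0 and later can also emit JSON with `--json`, which avoids text parsing altogether when it is available.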

Checking overall health

Somewhere in your report you'll see something like:

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

If it doesn’t return PASSED, you should immediately back up all your data. Your hard drive is probably failing.

That message can also be shown with smartctl -H /dev/sda.
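
When scripting the health check, the exit status of smartctl is also useful: it is a bit mask described in the smartctl man page. As a sketch, the snippet below decodes such a status into messages. The wording of the messages is paraphrased, and the names `EXIT_BITS` and `decode_exit_status` are invented for this example.

```python
from typing import List

# Meaning of each bit in the smartctl exit status, paraphrased from the man page.
EXIT_BITS = {
    0: "command line did not parse",
    1: "device open failed or device is in a low-power mode",
    2: "a SMART or other ATA command to the disk failed",
    3: "SMART status check returned DISK FAILING",
    4: "pre-fail attributes found at or below threshold",
    5: "attributes were at or below threshold at some time in the past",
    6: "the device error log contains records of errors",
    7: "the device self-test log contains records of errors",
}


def decode_exit_status(status: int) -> List[str]:
    """Return one message per bit set in the smartctl exit status."""
    return [message for bit, message in EXIT_BITS.items() if status & (1 << bit)]


if __name__ == "__main__":
    # Example: a status of 8 means only bit 3 is set.
    for message in decode_exit_status(8):
        print(message)
```

The status itself can be obtained from the `returncode` of `subprocess.run(["smartctl", "-H", "/dev/sda"])`.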

Checking the SMART attributes

Each drive manufacturer defines a set of attributes and sets threshold values beyond which attributes should not pass under normal operation. Because manufacturers do not agree on precise attribute definitions and measurement units, the following list of attributes is a general guide only.

If one or more attributes have the "prefailure" flag, and the "current value" of such a prefailure attribute is smaller than or equal to its "threshold value" (unless the "threshold value" is 0), that will be reported as a "drive failure". In addition, utility software can send the SMART RETURN STATUS command to the ATA drive, which may report one of three statuses: "drive OK", "drive warning" or "drive failure".
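
The "drive failure" rule described above can be written down as a small predicate. This is a sketch of the rule as stated here, not of any drive's firmware logic. The `Attribute` class and the function name `is_failing` are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Attribute:
    """Minimal view of one SMART attribute (hypothetical structure for this sketch)."""
    id: int
    name: str
    type: str    # "Pre-fail" or "Old_age", as printed by smartctl
    value: int   # current normalized value
    thresh: int  # threshold value


def is_failing(attr: Attribute) -> bool:
    """Apply the rule: a pre-fail attribute at or below a non-zero threshold means failure."""
    return attr.type == "Pre-fail" and attr.thresh != 0 and attr.value <= attr.thresh


if __name__ == "__main__":
    healthy = Attribute(5, "Reallocated_Sector_Ct", "Pre-fail", value=100, thresh=10)
    worn = Attribute(5, "Reallocated_Sector_Ct", "Pre-fail", value=9, thresh=10)
    print(is_failing(healthy), is_failing(worn))
```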

SMART attributes columns

Each SMART attribute has several columns, as shown by “smartctl -a”:

ID: The ID number of the attribute, good for comparing with other lists like Wikipedia: S.M.A.R.T.: Known ATA S.M.A.R.T. attributes because the attribute names sometimes differ.
Name: The name of the SMART attribute.
Value: The current, normalized value of the attribute. Higher values are always better (except for temperature for hard disks of some manufacturers). The range is normally 0-100, for some attributes 0-255 (so that 100 or 255, respectively, is best and 0 is worst). There is no standard on how manufacturers convert their raw value to this normalized one: as the normalized value approaches the threshold, it can do so linearly, exponentially, logarithmically or in any other way, meaning that a doubled normalized value does not necessarily mean “twice as good”.
Worst: The worst (normalized) value that this attribute had at any point in time while SMART was enabled. There seems to be no mechanism to reset current SMART attribute values, but this still makes sense because some SMART attributes, for some manufacturers, fluctuate over time, so keeping the worst value ever seen is meaningful.
Threshold: The threshold below which the normalized value will be considered “exceeding specifications”. If the attribute type is “Pre-fail”, this means that SMART thinks the hard disk is about to fail. This will “trigger” SMART: setting it from “SMART test passed” to “SMART impending failure” or a similar status.
Type: The type of the attribute. Either “Pre-fail” for attributes that are said to indicate impending failure, or “Old_age” for attributes that just indicate wear and tear. Note that one and the same attribute can be classified as “Pre-fail” by one manufacturer or for one model and as “Old_age” by another or for another model. This is the case, for example, for attribute Seek_Error_Rate (ID 7): seek errors are a widespread phenomenon on many disks and are not considered critical by some manufacturers, but Seagate has it as “Pre-fail”.
Raw value: The current raw value that was converted to the normalized value above. smartctl shows all raw values as decimal numbers, but some attribute values of some manufacturers cannot be reasonably interpreted that way.

Reacting to SMART Values

It is said that a drive that starts getting bad sectors (attribute ID 5) or “pending” bad sectors (attribute ID 197; they are most likely bad, too) will usually be trash in 6 months or less. The exception is a drive whose bad sector count increases once and then stays stable for a long time, such as a year or more. For that reason, one normally needs a diagramming / journaling tool for SMART. Many admins will replace the hard drive as soon as it gets reallocated sectors (ID 5) or sectors “under investigation” (ID 197).
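
Telling a growing bad sector count from a stable one requires some history of the raw value. The following is a minimal sketch, assuming you record `(timestamp, raw value)` samples yourself in chronological order. The function name `grew_recently` and the data layout are assumptions for this example.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Sample = Tuple[datetime, int]  # (time of reading, raw attribute value)


def grew_recently(history: List[Sample], now: datetime, window: timedelta) -> bool:
    """Return True if the raw count rose within `window` before `now`.

    `history` is assumed to be sorted by timestamp, oldest first.
    """
    cutoff = now - window
    recent = [raw for ts, raw in history if cutoff <= ts <= now]
    older = [raw for ts, raw in history if ts < cutoff]
    if not recent:
        return False
    # Compare against the last reading before the window, or the first one inside it.
    baseline = older[-1] if older else recent[0]
    return max(recent) > baseline


if __name__ == "__main__":
    start = datetime(2023, 1, 1)
    stable = [(start, 0), (start + timedelta(days=10), 8), (start + timedelta(days=400), 8)]
    print(grew_recently(stable, start + timedelta(days=400), timedelta(days=365)))
```

In the example the count jumped to 8 early on and then stayed flat for more than a year, which is the stable case described above.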

Critical SMART attributes

Of all the attributes, I'm going to analyse only the critical ones.

Read Error Rate

ID: 01 (0x01)
Ideal: Low
Correlation with probability of failure: not clear

(Vendor specific raw value.) Stores data related to the rate of hardware read errors that occurred when reading data from a disk surface. The raw value has different structure for different vendors and is often not meaningful as a decimal number. For some drives, this number may increase during normal operation without necessarily signifying errors.

Reallocated Sectors Count

ID: 05 (0x05)
Ideal: Low
Correlation with probability of failure: Strong

Count of reallocated sectors. The raw value represents a count of the bad sectors that have been found and remapped. Thus, the higher the attribute value, the more sectors the drive has had to reallocate. This value is primarily used as a metric of the life expectancy of the drive; a drive which has had any reallocations at all is significantly more likely to fail in the immediate months. If the raw value of the 0x05 attribute is higher than its threshold value, that will be reported as a "drive warning".

Spin Retry Count

ID: 10 (0x0A)
Ideal: Low
Correlation with probability of failure: Strong

Count of retry of spin start attempts. This attribute stores a total count of the spin start attempts to reach the fully operational speed (under the condition that the first attempt was unsuccessful). An increase of this attribute value is a sign of problems in the hard disk mechanical subsystem.

Current Pending Sector Count

ID: 197 (0xC5)
Ideal: Low
Correlation with probability of failure: Strong

Count of "unstable" sectors (waiting to be remapped, because of unrecoverable read errors). If an unstable sector is subsequently read successfully, the sector is remapped and this value is decreased. Read errors on a sector will not remap the sector immediately (since the correct value cannot be read and so the value to remap is not known, and also it might become readable later); instead, the drive firmware remembers that the sector needs to be remapped, and will remap it the next time it has been successfully read.

However, some drives will not immediately remap such sectors when successfully read; instead the drive will first attempt to write to the problem sector, and if the write operation is successful the sector will then be marked as good (in this case, the "Reallocation Event Count" (0xC4) will not be increased). This is a serious shortcoming, for if such a drive contains marginal sectors that consistently fail only after some time has passed following a successful write operation, then the drive will never remap these problem sectors. If the raw value of the 0xC5 attribute is higher than its threshold value, that will be reported as a "drive warning".

(Offline) Uncorrectable Sector Count

ID: 198 (0xC6)
Ideal: Low
Correlation with probability of failure: Strong

The total count of uncorrectable errors when reading/writing a sector. A rise in the value of this attribute indicates defects of the disk surface and/or problems in the mechanical subsystem.

As the Google field study cited in the accuracy section found, in the 60 days following the first uncorrectable error detected by an offline scan, a drive was, on average, 39 times more likely to fail than a similar drive for which no such error occurred.
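
The critical attributes above can be checked together with a few lines of code. This sketch flags any of them whose raw value is above zero, which is a deliberately strict policy chosen for the example rather than a vendor rule. The names `CRITICAL_IDS` and `critical_findings` are invented here, and attribute 1 is left out because its raw value is vendor specific.

```python
from typing import Dict, List

# Attribute IDs discussed above whose raw value is a plain count.
CRITICAL_IDS = {
    5: "Reallocated Sectors Count",
    10: "Spin Retry Count",
    197: "Current Pending Sector Count",
    198: "(Offline) Uncorrectable Sector Count",
}


def critical_findings(raw_by_id: Dict[int, int]) -> List[str]:
    """Return a message for every critical attribute with a raw value above zero."""
    return [
        f"{CRITICAL_IDS[attr_id]} (ID {attr_id}) raw value is {raw_by_id[attr_id]}"
        for attr_id in sorted(CRITICAL_IDS)
        if raw_by_id.get(attr_id, 0) > 0
    ]


if __name__ == "__main__":
    # Raw values taken from the sample report earlier in this page.
    sample = {5: 0, 10: 0, 195: 3558, 197: 81, 198: 0}
    for finding in critical_findings(sample):
        print(finding)
```

For the sample report, only the 81 pending sectors of attribute 197 are flagged. The large Hardware ECC Recovered value (ID 195) is ignored on purpose.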

Non critical SMART attributes

The following attributes may change in the logs, but that doesn't mean that anything is going wrong.

Hardware ECC Recovered

ID: 195 (0xC3)
Ideal: Varies
Correlation with probability of failure: Low

(Vendor-specific raw value.) The raw value has different structure for different vendors and is often not meaningful as a decimal number. For some drives, this number may increase during normal operation without necessarily signifying errors.

Monitoring

To monitor your drive health you can use prometheus with alertmanager for alerts and grafana for dashboards.

Installing the exporter

The prometheus community has its own smartctl exporter.

Using the binary

You can download the latest binary from the repository releases and configure the systemd service:

unp smartctl_exporter-0.13.0.linux-amd64.tar.gz
sudo mv smartctl_exporter-0.13.0.linux-amd64/smartctl_exporter /usr/bin

Add the service to /etc/systemd/system/smartctl-exporter.service:

[Unit]
Description=smartctl exporter service
After=network-online.target

[Service]
Type=simple
PIDFile=/run/smartctl_exporter.pid
ExecStart=/usr/bin/smartctl_exporter
User=root
Group=root
SyslogIdentifier=smartctl_exporter
Restart=on-failure
RemainAfterExit=no
RestartSec=100ms
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target

Then enable it:

sudo systemctl enable smartctl-exporter
sudo systemctl start smartctl-exporter

Using docker

---
services:
  smartctl-exporter:
    container_name: smartctl-exporter
    image: prometheuscommunity/smartctl-exporter
    privileged: true
    user: root
    ports:
      - "9633:9633"
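
Whichever installation method you use, it is worth checking that the exporter reports your drives before wiring it into prometheus. The sketch below parses the text returned by the exporter's /metrics endpoint and extracts the `smartctl_device_smart_status` metric per device. The sample text, its HELP line and the function name `smart_status_by_device` are assumptions for this example.

```python
import re
from typing import Dict

# One sample line of the smartctl_device_smart_status metric in the prometheus text format.
LINE_RE = re.compile(r"^smartctl_device_smart_status\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$")
DEVICE_RE = re.compile(r'(?:^|,)device="([^"]*)"')


def smart_status_by_device(metrics_text: str) -> Dict[str, bool]:
    """Map each device label to True when its SMART status metric is 1 (healthy)."""
    result: Dict[str, bool] = {}
    for line in metrics_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # comments, blank lines and other metrics
        device = DEVICE_RE.search(match.group("labels"))
        if device:
            result[device.group(1)] = float(match.group("value")) >= 1
    return result


# Hypothetical scrape output; real exporters attach more labels and metrics.
SAMPLE = """# HELP smartctl_device_smart_status General smart status
# TYPE smartctl_device_smart_status gauge
smartctl_device_smart_status{device="sda"} 1
smartctl_device_smart_status{device="nvme0"} 0
"""

if __name__ == "__main__":
    print(smart_status_by_device(SAMPLE))
```

To use it against a running exporter, fetch the text first, for example with `urllib.request.urlopen("http://localhost:9633/metrics").read().decode()`, assuming the exporter listens on port 9633 of the local machine.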

Configuring prometheus

Add the following scrape configuration:

- job_name: smartctl_exporter
  metrics_path: /metrics
  scrape_timeout: 60s
  static_configs:
    - targets: [smartctl-exporter:9633]
      labels:
        hostname: "your-hostname"

Configuring the alerts

Taking the awesome prometheus rules and this wired post as a reference, I'm using the following rules:

---
groups:
  - name: smartctl exporter
    rules:
      - alert: SmartDeviceTemperatureWarning
        expr: smartctl_device_temperature > 60
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Smart device temperature warning (instance {{ $labels.hostname }})
          description: "Device temperature  warning (instance {{ $labels.hostname }})\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

      - alert: SmartDeviceTemperatureCritical
        expr: smartctl_device_temperature > 80
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: Smart device temperature critical (instance {{ $labels.hostname }})
          description: "Device temperature critical  (instance {{ $labels.hostname }})\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

      - alert: SmartCriticalWarning
        expr: smartctl_device_critical_warning > 0
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: Smart critical warning (instance {{ $labels.hostname }})
          description: "device has critical warning (instance {{ $labels.hostname }})\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

      - alert: SmartNvmeWearoutIndicator
        expr: smartctl_device_available_spare{device=~"nvme.*"} < smartctl_device_available_spare_threshold{device=~"nvme.*"}
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: Smart NVME Wearout Indicator (instance {{ $labels.hostname }})
          description: "NVMe device is wearing out (instance {{ $labels.hostname }})\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

      - alert: SmartNvmeMediaError
        expr: smartctl_device_media_errors > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: Smart NVME Media errors (instance {{ $labels.hostname }})
          description: "Contains the number of occurrences where the controller detected an unrecovered data integrity error. Errors such as uncorrectable ECC, CRC checksum failure, or LBA tag mismatch are included in this field (instance {{ $labels.hostname }})\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

      - alert: SmartSmartStatusError
        expr: smartctl_device_smart_status < 1
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: Smart general status error (instance {{ $labels.hostname }})
          description: " (instance {{ $labels.hostname }})\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

      - alert: DiskReallocatedSectorsIncreased
        # Fires when the raw value is higher than it was one hour ago.
        expr: smartctl_device_attribute{attribute_id="5", attribute_value_type="raw"} > (smartctl_device_attribute{attribute_id="5", attribute_value_type="raw"} offset 1h)
        labels:
          severity: warning
        annotations:
          summary: "SMART Attribute Reallocated Sectors Count Increased"
          description: "The SMART attribute 5 (Reallocated Sectors Count) has increased on {{ $labels.device }} (instance {{ $labels.hostname }})\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

      - alert: DiskSpinRetryCountIncreased
        # Fires when the raw value is higher than it was one hour ago.
        expr: smartctl_device_attribute{attribute_id="10", attribute_value_type="raw"} > (smartctl_device_attribute{attribute_id="10", attribute_value_type="raw"} offset 1h)
        labels:
          severity: warning
        annotations:
          summary: "SMART Attribute Spin Retry Count Increased"
          description: "The SMART attribute 10 (Spin Retry Count) has increased on {{ $labels.device }} (instance {{ $labels.hostname }})\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

      - alert: DiskCurrentPendingSectorCountIncreased
        # Fires when the raw value is higher than it was one hour ago.
        expr: smartctl_device_attribute{attribute_id="197", attribute_value_type="raw"} > (smartctl_device_attribute{attribute_id="197", attribute_value_type="raw"} offset 1h)
        labels:
          severity: warning
        annotations:
          summary: "SMART Attribute Current Pending Sector Count Increased"
          description: "The SMART attribute 197 (Current Pending Sector Count) has increased on {{ $labels.device }} (instance {{ $labels.hostname }})\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

      - alert: DiskUncorrectableSectorCountIncreased
        # Fires when the raw value is higher than it was one hour ago.
        expr: smartctl_device_attribute{attribute_id="198", attribute_value_type="raw"} > (smartctl_device_attribute{attribute_id="198", attribute_value_type="raw"} offset 1h)
        labels:
          severity: warning
        annotations:
          summary: "SMART Attribute Uncorrectable Sector Count Increased"
          description: "The SMART attribute 198 (Uncorrectable Sector Count) has increased on {{ $labels.device }} (instance {{ $labels.hostname }})\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

Configuring the grafana dashboards

Of the different grafana dashboards (1, 2, 3) I went for the first one.

Import it with the UI of grafana, make it work and then export the json to store it in your infra as code repository.