rasdaemon
rasdaemon
is a RAS (Reliability, Availability and Serviceability) logging tool. It records memory errors, using the EDAC tracing events. EDAC is a Linux kernel subsystem with handles detection of ECC errors from memory controllers for most chipsets on i386 and x86_64 architectures. EDAC drivers for other architectures like arm also exists.
Installation⚑
apt-get install rasdaemon
The output will be available via syslog but you can show it to the foreground (-f
) or to an sqlite3 database (-r
)
To post-process and decode received MCA errors on AMD SMCA systems, run:
rasdaemon -p --status <STATUS_reg> --ipid <IPID_reg> --smca --family <CPU Family> --model <CPU Model> --bank <BANK_NUM>
Status and IPID Register values (in hex) are mandatory. The smca flag with family and model are required if not decoding locally. Bank parameter is optional.
You may also start it via systemd:
systemctl start rasdaemon
The rasdaemon will then output the messages to journald.
Usage⚑
At this point rasdaemon
should already be running on your system. You can now use the ras-mc-ctl
tool to query the errors that have been detected. If everything is well configured you'll see something like:
$: ras-mc-ctl --error-count
Label CE UE
mc#0csrow#2channel#0 0 0
mc#0csrow#2channel#1 0 0
mc#0csrow#3channel#1 0 0
mc#0csrow#3channel#0 0 0
If it's not you'll see:
ras-mc-ctl: Error: No DIMMs found in /sys or new sysfs EDAC interface not found.
The CE
column represents the number of corrected errors for a given DIMM, UE
represents uncorrectable errors that were detected. The label on the left shows the EDAC path under /sys/devices/system/edac/mc/
of every DIMM. This is not very readable, if you wish to improve the labeling read this article
More ways to check is to run:
$: ras-mc-ctl --status
ras-mc-ctl: drivers are loaded.
You can also see a summary of the state with:
$: ras-mc-ctl --summary
No Memory errors.
No PCIe AER errors.
No Extlog errors.
DBD::SQLite::db prepare failed: no such table: devlink_event at /usr/sbin/ras-mc-ctl line 1183.
Can't call method "execute" on an undefined value at /usr/sbin/ras-mc-ctl line 1184.
Monitorization⚑
You can use loki to monitor ECC errors shown in the logs with the next alerts:
groups:
- name: ecc
rules:
- alert: ECCError
expr: |
count_over_time({job="systemd-journal", unit="rasdaemon.service", level="error"} [5m]) > 0
for: 1m
labels:
severity: critical
annotations:
summary: "Possible ECC error detected in {{ $labels.hostname}}"
- alert: ECCWarning
expr: |
count_over_time({job="systemd-journal", unit="rasdaemon.service", level="warning"} [5m]) > 0
for: 1m
labels:
severity: warning
annotations:
summary: "Possible ECC warning detected in {{ $labels.hostname}}"
- alert: ECCAlert
expr: |
count_over_time({job="systemd-journal", unit="rasdaemon.service", level!~"info|error|warning"} [5m]) > 0
for: 1m
labels:
severity: info
annotations:
summary: "ECC log trace with unknown severity level detected in {{ $labels.hostname}}"