13th Week of 2024
Activism⚑
Ludditest⚑
-
New: Nice comic about the luddites.
Life Management⚑
Time management⚑
Time management abstraction levels⚑
-
Correction: Rename Task to Action.
To remove the productivity capitalist load from the concept
Coding⚑
Languages⚑
Bash snippets⚑
-
New: Compare two semantic versions.
This article gives a lot of ways to do it. For my case the simplest is to use `dpkg` to compare two strings in dot-separated version format in bash.
Usage: dpkg --compare-versions <version1> <operator> <version2>
If the condition is true, the status code returned by `dpkg` will be zero (indicating success). So, we can use this command in an `if` statement to compare two version numbers:
$ if dpkg --compare-versions "2.11" "lt" "3"; then echo true; else echo false; fi
true
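On systems without `dpkg`, the same kind of comparison can be approximated in Python. This is a minimal sketch, assuming purely numeric dot-separated versions. The `parse_version` and `version_lt` helpers are hypothetical names, and they do not implement dpkg's full comparison rules (epochs, tildes, letters).

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '2.11' into (2, 11) so that tuples compare numerically."""
    return tuple(int(part) for part in version.split("."))


def version_lt(left: str, right: str) -> bool:
    """Return True if `left` is a lower version than `right`."""
    return parse_version(left) < parse_version(right)


print(version_lt("2.11", "3"))  # True
```

Comparing integer tuples avoids the classic string-comparison trap where "2.11" would sort before "2.9".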
-
New: Exclude list of extensions from find command.
find . -not \( -name '*.sh' -o -name '*.log' \)
Protocols⚑
-
New: Introduce Python Protocols.
The Python type system supports two ways of deciding whether two objects are compatible as types: nominal subtyping and structural subtyping.
Nominal subtyping is strictly based on the class hierarchy. If class `Dog` inherits class `Animal`, it's a subtype of `Animal`. Instances of `Dog` can be used when `Animal` instances are expected. This form of subtyping is what Python's type system predominantly uses: it's easy to understand, produces clear and concise error messages, and matches how the native `isinstance` check works, based on class hierarchy.
Structural subtyping is based on the operations that can be performed with an object. Class `Dog` is a structural subtype of class `Animal` if the former has all attributes and methods of the latter, and with compatible types.
Structural subtyping can be seen as a static equivalent of duck typing, which is well known to Python programmers. See PEP 544 for the detailed specification of protocols and structural subtyping in Python.
Usage
You can define your own protocol class by inheriting the special Protocol class:
from typing import Iterable
from typing_extensions import Protocol

class SupportsClose(Protocol):
    # Empty method body (explicit '...')
    def close(self) -> None: ...

class Resource:  # No SupportsClose base class!

    def close(self) -> None:
        self.resource.release()

    # ... other methods ...

def close_all(items: Iterable[SupportsClose]) -> None:
    for item in items:
        item.close()

close_all([Resource(), open('some/file')])  # OK
`Resource` is a subtype of the `SupportsClose` protocol since it defines a compatible `close` method. Regular file objects returned by `open()` are similarly compatible with the protocol, as they support `close()`.
If you want to define a docstring on the method use the next syntax:
def load(self, filename: Optional[str] = None) -> None:
    """Load a configuration file."""
    ...
Make protocols work with isinstance
To check an instance against the protocol using `isinstance`, we need to decorate our protocol with `@runtime_checkable`.
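As a minimal sketch of the decorator in use (the `Resource` and `NotClosable` classes are made up for illustration):

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SupportsClose(Protocol):
    def close(self) -> None: ...


class Resource:
    # No SupportsClose base class, only a compatible method
    def close(self) -> None:
        print("released")


class NotClosable:
    pass


print(isinstance(Resource(), SupportsClose))     # True
print(isinstance(NotClosable(), SupportsClose))  # False
```

Keep in mind that the runtime check only verifies that the protocol members exist; it does not validate their signatures or return types.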
Make a protocol property variable
References
- Mypy article on protocols
- Predefined protocols reference
FastAPI⚑
-
New: Launch the server from within python.
import uvicorn

if __name__ == "__main__":
    uvicorn.run("main:app", host="0.0.0.0", port=8000, reload=True)
-
New: Add the request time to the logs.
For more information on changing the logging read 1
To set the datetime of the requests use this configuration
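import logging
from contextlib import asynccontextmanager

import uvicorn
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(api: FastAPI):
    # Replace the formatter of uvicorn's access logger so each line shows the time
    logger = logging.getLogger("uvicorn.access")
    console_formatter = uvicorn.logging.ColourizedFormatter(
        "{asctime} {levelprefix} : {message}", style="{", use_colors=True
    )
    logger.handlers[0].setFormatter(console_formatter)
    yield

api = FastAPI(lifespan=lifespan)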
Python Snippets⚑
-
New: Expire the cache of the lru_cache.
The `lru_cache` decorator caches forever. A way to prevent it is to add one more parameter to your expensive function: `ttl_hash=None`. This new parameter is a so-called "time sensitive hash"; its only purpose is to affect `lru_cache`. For example:
from functools import lru_cache
import time

@lru_cache()
def my_expensive_function(a, b, ttl_hash=None):
    del ttl_hash  # to emphasize we don't use it and to shut pylint up
    return a + b  # horrible CPU load...

def get_ttl_hash(seconds=3600):
    """Return the same value within `seconds` time period"""
    return round(time.time() / seconds)

res = my_expensive_function(2, 2, ttl_hash=get_ttl_hash())
Goodconf⚑
-
New: Initialize the config with a default value if the file doesn't exist.
def load(self, filename: Optional[str] = None) -> None:
    self._config_file = filename
    if not self.store_dir.is_dir():
        log.warning("The store directory doesn't exist. Creating it")
        os.makedirs(str(self.store_dir))
    if not Path(self.config_file).is_file():
        log.warning("The yaml store file doesn't exist. Creating it")
        self.save()
    super().load(filename)
feat(goodconf#Config saving)
So far `goodconf` doesn't support saving the config. Until it's ready you can use the next snippet:
class YamlStorage(GoodConf):
    """Adapter to store and load information from a yaml file."""

    @property
    def config_file(self) -> str:
        """Return the path to the config file."""
        return str(self._config_file)

    @property
    def store_dir(self) -> Path:
        """Return the path to the store directory."""
        return Path(self.config_file).parent

    def reload(self) -> None:
        """Reload the contents of the authentication store."""
        self.load(self.config_file)

    def load(self, filename: Optional[str] = None) -> None:
        """Load a configuration file."""
        if not filename:
            filename = f"{self.store_dir}/data.yaml"
        super().load(self.config_file)

    def save(self) -> None:
        """Save the contents of the authentication store."""
        with open(self.config_file, "w+", encoding="utf-8") as file_cursor:
            yaml = YAML()
            yaml.default_flow_style = False
            yaml.dump(self.dict(), file_cursor)
feat(google_chrome#Open a specific profile): Open a specific profile
google-chrome --profile-directory="Profile Name"
Where `Profile Name` is one of the profiles listed under `ls ~/.config/chromium | grep -i profile`.
Python Mysql⚑
-
New: Get the last row of a table.
SELECT * FROM Table ORDER BY ID DESC LIMIT 1
IDES⚑
Git management configuration⚑
-
New: Update all git submodules.
If it's the first time you check-out a repo you need to use `--init` first:
git submodule update --init --recursive
To update to latest tips of remote branches use:
git submodule update --recursive --remote
DevOps⚑
Infrastructure as Code⚑
Ansible Snippets⚑
-
New: Filter json data.
To select a single element or a data subset from a complex data structure in JSON format (for example, Ansible facts), use the `community.general.json_query` filter. The `community.general.json_query` filter lets you query a complex JSON structure and iterate over it using a loop structure.
This filter is built upon jmespath, and you can use the same syntax. For examples, see jmespath examples.
A complex example would be:
"{{ ec2_facts | json_query('instances[0].block_device_mappings[?device_name!=`/dev/sda1` && device_name!=`/dev/xvda`].{device_name: device_name, id: ebs.volume_id}') }}"
This snippet:
- Gets all dictionaries under the block_device_mappings list whose `device_name` is not equal to `/dev/sda1` or `/dev/xvda`.
- From those results it extracts and flattens only the desired values. In this case `device_name` and the `id` which is at the key `ebs.volume_id` of each of the items of the block_device_mappings list.
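To make the two steps of the query more concrete, here is roughly the same selection written in plain Python. This is only a sketch of what the JMESPath expression does, not how `json_query` is implemented, and the contents of `ec2_facts` are invented sample data.

```python
# Hypothetical sample data shaped like the registered ec2_facts variable
ec2_facts = {
    "instances": [
        {
            "block_device_mappings": [
                {"device_name": "/dev/sda1", "ebs": {"volume_id": "vol-root"}},
                {"device_name": "/dev/sdf", "ebs": {"volume_id": "vol-data"}},
            ]
        }
    ]
}

EXCLUDED_DEVICES = {"/dev/sda1", "/dev/xvda"}

volumes = [
    # Projection step: keep only device_name and the nested ebs.volume_id
    {"device_name": mapping["device_name"], "id": mapping["ebs"]["volume_id"]}
    for mapping in ec2_facts["instances"][0]["block_device_mappings"]
    # Filter step: skip the root devices
    if mapping["device_name"] not in EXCLUDED_DEVICES
]

print(volumes)  # [{'device_name': '/dev/sdf', 'id': 'vol-data'}]
```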
-
New: Do asserts.
- name: After version 2.7 both 'msg' and 'fail_msg' can customize failing assertion message
  ansible.builtin.assert:
    that:
      - my_param <= 100
      - my_param >= 0
    fail_msg: "'my_param' must be between 0 and 100"
    success_msg: "'my_param' is between 0 and 100"
-
New: Split a variable in ansible.
{{ item | split ('@') | last }}
-
New: Get a list of EC2 volumes mounted on an instance and their mount points.
Assuming that each volume has a tag `mount_point` you could:
- name: Gather EC2 instance metadata facts
  amazon.aws.ec2_metadata_facts:

- name: Gather info on the mounted disks
  delegate_to: localhost
  block:
    - name: Gather information about the instance
      amazon.aws.ec2_instance_info:
        instance_ids:
          - "{{ ansible_ec2_instance_id }}"
      register: ec2_facts

    - name: Gather volume tags
      amazon.aws.ec2_vol_info:
        filters:
          volume-id: "{{ item.id }}"
      # We exclude the root disk as they are already mounted and formatted
      loop: "{{ ec2_facts | json_query('instances[0].block_device_mappings[?device_name!=`/dev/sda1` && device_name!=`/dev/xvda`].{device_name: device_name, id: ebs.volume_id}') }}"
      register: volume_tags_data

    - name: Save the required volume data
      set_fact:
        volumes: "{{ volume_tags_data | json_query('results[0].volumes[].{id: id, mount_point: tags.mount_point}') }}"

    - name: Display volumes data
      debug:
        msg: "{{ volumes }}"

    - name: Make sure that all volumes have a mount point
      assert:
        that:
          - item.mount_point is defined
          - item.mount_point|length > 0
        fail_msg: "Configure the 'mount_point' tag on the volume {{ item.id }} on the instance {{ ansible_ec2_instance_id }}"
        success_msg: "The volume {{ item.id }} has the mount_point tag well set"
      loop: "{{ volumes }}"
-
New: Create a list of dictionaries using ansible.
- name: Create and Add items to dictionary
  set_fact:
    userdata: "{{ userdata | default({}) | combine ({ item.key : item.value }) }}"
  with_items:
    - { 'key': 'Name' , 'value': 'SaravAK'}
    - { 'key': 'Email' , 'value': 'sarav@gritfy.com'}
    - { 'key': 'Location' , 'value': 'Coimbatore'}
    - { 'key': 'Nationality' , 'value': 'Indian'}
-
New: Merge two dictionaries on a key.
If you have these two lists:
"list1": [
  { "a": "b", "c": "d" },
  { "a": "e", "c": "f" }
]
"list2": [
  { "a": "e", "g": "h" },
  { "a": "b", "g": "i" }
]
And want to merge them using the value of key "a":
"list3": [
  { "a": "b", "c": "d", "g": "i" },
  { "a": "e", "c": "f", "g": "h" }
]
If you can install the collection community.general use the filter lists_mergeby. The expression below gives the same result
list3: "{{ list1|community.general.lists_mergeby(list2, 'a') }}"
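To reason about what the merge does, it can help to see it in plain Python. This is an illustrative sketch, not the implementation of `lists_mergeby`; the `merge_by` helper is a hypothetical name, and sorting the result by the key is an assumption made here to keep the output deterministic.

```python
list1 = [{"a": "b", "c": "d"}, {"a": "e", "c": "f"}]
list2 = [{"a": "e", "g": "h"}, {"a": "b", "g": "i"}]


def merge_by(key, *lists):
    """Merge dictionaries that share the same value under `key`."""
    merged = {}
    for items in lists:
        for item in items:
            # Later lists override earlier ones on conflicting keys
            merged.setdefault(item[key], {}).update(item)
    return sorted(merged.values(), key=lambda item: item[key])


list3 = merge_by("a", list1, list2)
print(list3)
# [{'a': 'b', 'c': 'd', 'g': 'i'}, {'a': 'e', 'c': 'f', 'g': 'h'}]
```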
Storage⚑
NAS⚑
-
Correction: Add suggestions when buying a motherboard.
When choosing a motherboard make sure that:
- If you want ECC that it truly supports ECC.
- It is IPMI compliant, if you want to have hardware watchdog support
OpenZFS⚑
-
New: Monitor dbgmsg with loki.
If you use loki remember to monitor the `/proc/spl/kstat/zfs/dbgmsg` file:
- job_name: zfs
  static_configs:
    - targets:
        - localhost
      labels:
        job: zfs
        __path__: /proc/spl/kstat/zfs/dbgmsg
-
Correction: Add loki alerts on the kernel panic error.
You can monitor this issue with loki using the next alerts:
groups:
  - name: zfs
    rules:
      - alert: SlowSpaSyncZFSError
        expr: |
          count_over_time({job="zfs"} |~ `spa_deadman.*slow spa_sync` [5m])
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Slow sync traces found in the ZFS debug logs at {{ $labels.hostname}}"
          message: "This usually happens before the ZFS becomes unresponsive"
-
New: Monitorization.
You can monitor this issue with loki using the next alerts:
groups:
  - name: zfs
    rules:
      - alert: ErrorInSanoidLogs
        expr: |
          count_over_time({job="systemd-journal", syslog_identifier="sanoid"} |= `ERROR` [5m])
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Errors found on sanoid log at {{ $labels.hostname}}"
Resilience⚑
-
New: Introduce linux resilience.
Increasing the resilience of the servers is critical when hosting services for others. This is the roadmap I'm following for my servers.
Autostart services if the system reboots
Using init system services to manage your services.
Get basic metrics traceability and alerts
Set up Prometheus with:
- The blackbox exporter to track if the services are available to your users and to monitor SSL certificates health.
- The node exporter to keep track of the resource usage of your machines and set alerts to get notified when concerning events happen (disks are getting filled, CPU usage is too high).
Get basic logs traceability and alerts
Set up Loki and clear up your system log errors.
Improve the resilience of your data
If you're still using `ext4` for your filesystems instead of `zfs` you're missing a big improvement. To set it up:
- Plan your zfs storage architecture
- Install ZFS
- Create ZFS local and remote backups
- Monitor your ZFS
Automatically react on system failures
- Kernel panics
- watchdog
Future undeveloped improvements
- Handle the system reboots after kernel upgrades
Memtest⚑
-
New: Introduce memtest.
memtest86 is a testing software for RAM.
Installation
apt-get install memtest86+
After the installation you'll get Memtest entries in grub which you can spawn.
For some unknown reason the memtest of the boot menu didn't work for me. So I downloaded the latest free version of memtest (it's at the bottom of the screen), burnt it to a USB drive and booted from there.
Usage
It will run by itself. For 64GB of ECC RAM it took approximately 100 minutes to run all the tests.
Check ECC errors
MemTest86 directly polls ECC errors logged in the chipset/memory controller registers and displays them to the user on-screen. In addition, ECC errors are written to the log and report file.
watchdog⚑
-
New: Introduce the watchdog.
A watchdog timer (WDT, or simply a watchdog), sometimes called a computer operating properly timer (COP timer), is an electronic or software timer that is used to detect and recover from computer malfunctions. Watchdog timers are widely used in computers to facilitate automatic correction of temporary hardware faults, and to prevent errant or malevolent software from disrupting system operation.
During normal operation, the computer regularly restarts the watchdog timer to prevent it from elapsing, or "timing out". If, due to a hardware fault or program error, the computer fails to restart the watchdog, the timer will elapse and generate a timeout signal. The timeout signal is used to initiate corrective actions. The corrective actions typically include placing the computer and associated hardware in a safe state and invoking a computer reboot.
Microcontrollers often include an integrated, on-chip watchdog. In other computers the watchdog may reside in a nearby chip that connects directly to the CPU, or it may be located on an external expansion card in the computer's chassis.
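The kick-or-expire logic described above can be modelled in a few lines of Python. This is a toy simulation for illustration only: the `WatchdogTimer` class is a made-up name and it has nothing to do with how a hardware watchdog is wired. The clock is injectable so the example doesn't need to sleep.

```python
import time


class WatchdogTimer:
    """Toy watchdog: it expires unless `kick()` is called often enough."""

    def __init__(self, timeout: float, clock=time.monotonic):
        self.timeout = timeout
        self._clock = clock
        self._deadline = clock() + timeout

    def kick(self) -> None:
        """Restart the countdown; a healthy system calls this regularly."""
        self._deadline = self._clock() + self.timeout

    def expired(self) -> bool:
        """Return True once the timeout elapsed without a kick."""
        return self._clock() >= self._deadline


# Fake clock so the timeline below is deterministic
now = [0.0]
watchdog = WatchdogTimer(timeout=30, clock=lambda: now[0])

now[0] = 20
watchdog.kick()            # healthy: deadline moves to t=50
now[0] = 45
print(watchdog.expired())  # False
now[0] = 51                # the system hung and stopped kicking
print(watchdog.expired())  # True
```

In a real watchdog the expiry would trigger the corrective action (usually a reset) instead of returning a boolean.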
Hardware watchdog
Before you start using the hardware watchdog you need to check if your hardware actually supports it.
If you see the `Watchdog hardware is disabled` error on boot things are not looking good.
Check if the hardware watchdog is enabled
You can see if hardware watchdog is loaded by running `wdctl`. For example for a machine that has it enabled you'll see:
Device:        /dev/watchdog0
Identity:      iTCO_wdt [version 0]
Timeout:       30 seconds
Pre-timeout:    0 seconds
Timeleft:      30 seconds
FLAG           DESCRIPTION               STATUS BOOT-STATUS
KEEPALIVEPING  Keep alive ping reply          1           0
MAGICCLOSE     Supports magic close char      0           0
SETTIMEOUT     Set timeout (in seconds)       0           0
On a machine that doesn't you'll see:
wdctl: No default device is available.: No such file or directory
Another option is to run `dmesg | grep wd` or `dmesg | grep watc -i`. For example for a machine that has enabled the hardware watchdog you'll see something like:
[   20.708839] iTCO_wdt: Intel TCO WatchDog Timer Driver v1.11
[   20.708894] iTCO_wdt: Found a Intel PCH TCO device (Version=4, TCOBASE=0x0400)
[   20.709009] iTCO_wdt: initialized. heartbeat=30 sec (nowayout=0)
For one that is not you'll see:
[    1.934999] sp5100_tco: SP5100/SB800 TCO WatchDog Timer Driver
[    1.935057] sp5100-tco sp5100-tco: Using 0xfed80b00 for watchdog MMIO address
[    1.935062] sp5100-tco sp5100-tco: Watchdog hardware is disabled
If you're out of luck and your hardware doesn't support it you can delegate the task to the software watchdog or get a USB watchdog.
Starting with version 183 systemd provides full support for hardware watchdogs (as exposed in /dev/watchdog to userspace), as well as supervisor (software) watchdog support for individual system services. The basic idea is the following: if enabled, systemd will regularly ping the watchdog hardware. If systemd or the kernel hang this ping will not happen anymore and the hardware will automatically reset the system. This way systemd and the kernel are protected from boundless hangs -- by the hardware. To make the chain complete, systemd then exposes a software watchdog interface for individual services so that they can also be restarted (or some other action taken) if they begin to hang. This software watchdog logic can be configured individually for each service in the ping frequency and the action to take. Putting both parts together (i.e. hardware watchdogs supervising systemd and the kernel, as well as systemd supervising all other services) we have a reliable way to watchdog every single component of the system.
Configuring the watchdog
To make use of the hardware watchdog it is sufficient to set the `RuntimeWatchdogSec=` option in `/etc/systemd/system.conf`. It defaults to `0` (i.e. no hardware watchdog use). Set it to a value like `20s` and the watchdog is enabled. After 20s of no keep-alive pings the hardware will reset itself. Note that `systemd` will send a ping to the hardware at half the specified interval, i.e. every 10s.
Note that the hardware watchdog device (`/dev/watchdog`) is single-user only. That means that you can either enable this functionality in systemd, or use a separate external watchdog daemon, such as the aptly named `watchdog`. However, the built-in hardware watchdog support of systemd does not conflict with other watchdog software by default: systemd does not make use of `/dev/watchdog` by default, and you are welcome to use external watchdog daemons in conjunction with systemd, if this better suits your needs.
`ShutdownWatchdogSec=` is another option that can be configured in `/etc/systemd/system.conf`. It controls the watchdog interval to use during reboots. It defaults to 10min, and adds extra reliability to the system reboot logic: if a clean reboot is not possible and shutdown hangs, we rely on the watchdog hardware to reset the system abruptly, as extra safety net.
Now, let's have a look how to add watchdog logic to individual services.
First of all, to make software watchdog-supervisable it needs to be patched to send out "I am alive" signals in regular intervals in its event loop. Patching this is relatively easy. First, a daemon needs to read the `WATCHDOG_USEC=` environment variable. If it is set, it will contain the watchdog interval in usec formatted as ASCII text string, as it is configured for the service. The daemon should then issue `sd_notify("WATCHDOG=1")` calls every half of that interval. A daemon patched this way should transparently support watchdog functionality by checking whether the environment variable is set and honouring the value it is set to.
To enable the software watchdog logic for a service (which has been patched to support the logic pointed out above) it is sufficient to set the `WatchdogSec=` to the desired failure latency. See `systemd.service(5)` for details on this setting. This causes `WATCHDOG_USEC=` to be set for the service's processes and will cause the service to enter a failure state as soon as no keep-alive ping is received within the configured interval.
The next step is to configure whether the service shall be restarted and how often, and what to do if it then still fails. To enable automatic service restarts on failure set `Restart=on-failure` for the service. To configure how many times a service shall be attempted to be restarted use the combination of `StartLimitBurst=` and `StartLimitInterval=` which allow you to configure how often a service may restart within a time interval. If that limit is reached, a special action can be taken. This action is configured with `StartLimitAction=`. The default is `none`, i.e. no further action is taken and the service simply remains in the failure state without any further attempted restarts. The other three possible values are `reboot`, `reboot-force` and `reboot-immediate`:
- `reboot` attempts a clean reboot, going through the usual, clean shutdown logic.
- `reboot-force` is more abrupt: it will not actually try to cleanly shutdown any services, but immediately kills all remaining services and unmounts all file systems and then forcibly reboots (this way all file systems will be clean but reboot will still be very fast).
- `reboot-immediate` does not attempt to kill any process or unmount any file systems. Instead it just hard reboots the machine without delay. `reboot-immediate` hence comes closest to a reboot triggered by a hardware watchdog.
All these settings are documented in `systemd.service(5)`.
Putting this all together we now have pretty flexible options to watchdog-supervise a specific service and configure automatic restarts of the service if it hangs, plus take ultimate action if that doesn't help.
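The service-side patch described above (read `WATCHDOG_USEC`, then send `WATCHDOG=1` every half interval) could look roughly like this in Python. It is a minimal sketch that talks to the `NOTIFY_SOCKET` datagram socket directly instead of using a systemd binding; the helper names are made up, and the `environ` parameter exists only to make the functions easy to test.

```python
import os
import socket
from typing import Mapping, Optional


def watchdog_ping_interval(environ: Mapping[str, str] = os.environ) -> Optional[float]:
    """Return half of the WATCHDOG_USEC interval in seconds, or None if unset."""
    usec = environ.get("WATCHDOG_USEC")
    if not usec:
        return None
    return int(usec) / 1_000_000 / 2


def sd_notify(message: str, environ: Mapping[str, str] = os.environ) -> bool:
    """Send `message` to systemd. Return False when not run under systemd."""
    address = environ.get("NOTIFY_SOCKET")
    if not address:
        return False
    if address.startswith("@"):
        # A leading '@' denotes a socket in the Linux abstract namespace
        address = "\0" + address[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode(), address)
    return True


# Inside the daemon's event loop one would then do something like:
#     interval = watchdog_ping_interval()
#     if interval is not None:
#         sd_notify("WATCHDOG=1")   # repeat every `interval` seconds
```

With `WatchdogSec=30s` in the unit, `watchdog_ping_interval()` would return 15 seconds.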
Here's an example unit file:
[Unit]
Description=My Little Daemon
Documentation=man:mylittled(8)

[Service]
ExecStart=/usr/bin/mylittled
WatchdogSec=30s
Restart=on-failure
StartLimitInterval=5min
StartLimitBurst=4
StartLimitAction=reboot-force
This service will automatically be restarted if it hasn't pinged the system manager for longer than 30s or if it fails otherwise. If it is restarted this way more often than 4 times in 5min action is taken and the system quickly rebooted, with all file systems being clean when it comes up again.
To write the code of the watchdog service you can follow one of these guides:
- [Python based watchdog](https://sleeplessbeastie.eu/2022/08/15/how-to-create-watchdog-for-systemd-service/)
- [Bash based watchdog](https://www.medo64.com/2019/01/systemd-watchdog-for-any-service/)
[Testing a watchdog](https://serverfault.com/questions/375220/how-to-check-what-if-hardware-watchdogs-are-available-in-linux)
One simple way to test a watchdog is to trigger a kernel panic. This can be done as root with:
echo c > /proc/sysrq-trigger
The kernel will stop responding to the watchdog pings, so the watchdog will trigger.
SysRq is a 'magical' key combo you can hit which the kernel will respond to regardless of whatever else it is doing, unless it is completely locked up. It can also be used by echoing letters to /proc/sysrq-trigger, like we're doing here.
In this case, the letter c means perform a system crash and take a crashdump if configured.
Troubleshooting
Watchdog hardware is disabled error on boot
According to the discussion at the kernel mailing list it means that the system contains hardware watchdog but it has been disabled (probably by BIOS) and Linux cannot enable the hardware.
If your BIOS doesn't have a switch to enable it, consider the watchdog hardware broken for your system.
Some people are blacklisting the module so that it's not loaded and therefore it doesn't return the error (1, 2).
References
Monitoring⚑
Loki⚑
-
Correction: Use `fake` when using one loki instance in docker.
If you only have one Loki instance you need to save the rule yaml files in the `/etc/loki/rules/fake/` directory, otherwise Loki will silently ignore them (it took me a lot of time to figure this out -.-).
-
New: Add alerts.
Surprisingly I haven't found any compilation of Loki alerts. I'll gather here the ones I create.
There are two kinds of rules: alerting rules and recording rules.
Promtail⚑
-
Use patrickjahns ansible role. Some interesting variables are:
loki_url: localhost
promtail_system_user: root
promtail_config_clients:
  - url: "http://{{ loki_url }}:3100/loki/api/v1/push"
    external_labels:
      hostname: "{{ ansible_hostname }}"
-
New: Troubleshooting promtail.
Find where the `positions.yaml` file is and see if it changes over time.
Sometimes if you are not seeing the logs in loki it's because the query you're running is not correct.
Process Exporter⚑
-
New: Introduce the process exporter.
`process_exporter` is a Prometheus exporter that mines /proc to report on selected processes.
References
- Source
- Grafana dashboard
Hardware⚑
ECC RAM⚑
-
New: Introduce ECC RAM.
Error Correction Code (ECC) is a mechanism used to detect and correct errors in memory data due to environmental interference and physical defects. ECC memory is used in high-reliability applications that cannot tolerate failure due to corrupted data.
Installation: Due to additional circuitry required for ECC protection, specialized ECC hardware support is required by the CPU chipset, motherboard and DRAM module. This includes the following:
- Server-grade CPU chipset with ECC support (Intel Xeon, AMD Ryzen)
- Motherboard supporting ECC operation
- ECC RAM
Consult the motherboard and/or CPU documentation for the specific model to verify whether the hardware supports ECC. Use vendor-supplied list of certified ECC RAM, if provided.
Most ECC-supported motherboards allow you to configure ECC settings from the BIOS setup. They are usually on the Advanced tab. The specific option depends on the motherboard vendor or model such as the following:
- DRAM ECC Enable (American Megatrends, ASUS, ASRock, MSI)
- ECC Mode (ASUS)
Monitorization
The mechanism for how ECC errors are logged and reported to the end-user depends on the BIOS and operating system. In most cases, corrected ECC errors are written to system/event logs. Uncorrected ECC errors may result in kernel panic or blue screen.
The Linux kernel supports reporting ECC errors for ECC memory via the EDAC (Error Detection And Correction) driver subsystem. Depending on the Linux distribution, ECC errors may be reported by the following:
- `rasdaemon`: monitor ECC memory and report both correctable and uncorrectable memory errors on recent Linux kernels.
- `mcelog` (Deprecated): collects and decodes MCA error events on x86.
- `edac-utils` (Deprecated): fills DIMM labels data and summarizes memory errors.
To configure rasdaemon follow this article
Confusion on boards supporting ECC
I've read that even if some motherboards say that they "Support ECC" some of them don't do anything with it.
This post and the kernel docs show that you should see references to ACPI/WHEA in the specs manual. Ideally ACPI5 support.
From the kernel docs: EINJ provides a hardware error injection mechanism. It is very useful for debugging and testing APEI and RAS features in general.
You need to check whether your BIOS supports EINJ first. For that, look for early boot messages similar to this one:
ACPI: EINJ 0x000000007370A000 000150 (v01 INTEL 00000001 INTL 00000001)
Which shows that the BIOS is exposing an EINJ table - it is the mechanism through which the injection is done.
Alternatively, look in `/sys/firmware/acpi/tables` for an "EINJ" file, which is a different representation of the same thing.
It doesn't necessarily mean that EINJ is not supported if those above don't exist: before you give up, go into BIOS setup to see if the BIOS has an option to enable error injection. Look for something called `WHEA` or similar. Often, you need to enable an `ACPI5` support option prior, in order to see the `APEI`, `EINJ`,... functionality supported and exposed by the BIOS menu.
To use `EINJ`, make sure the following options are enabled in your kernel configuration:
CONFIG_DEBUG_FS
CONFIG_ACPI_APEI
CONFIG_ACPI_APEI_EINJ
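The `/sys/firmware/acpi/tables` check mentioned above is easy to script. A minimal sketch, assuming the default sysfs location; the `bios_exposes_einj` helper is a hypothetical name and the directory is a parameter only so it can be pointed somewhere else.

```python
from pathlib import Path


def bios_exposes_einj(tables_dir: Path = Path("/sys/firmware/acpi/tables")) -> bool:
    """Return True if an EINJ table file exists in the ACPI tables directory."""
    return (tables_dir / "EINJ").exists()


if __name__ == "__main__":
    if bios_exposes_einj():
        print("EINJ table found")
    else:
        print("EINJ table not found, check the BIOS setup for WHEA/ACPI5 options")
```

As noted above, a missing table doesn't prove that the hardware lacks support, only that the BIOS isn't currently exposing it.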
One way to test it can be to run memtest as it sometimes shows ECC errors such as `** Warning** ECC injection may be disabled for AMD Ryzen (70h-7fh)`.
Other people (1, 2) say that there are a lot of motherboards that NEVER report any corrected errors to the OS. In order to see corrected errors, PFEH (Platform First Error Handling) has to be disabled. On some motherboards and FW versions this setting is hidden from the user and always enabled, thus resulting in zero correctable errors getting reported.
They also suggest to disable "Quick Boot". In order to initialize ECC, memory has to be written before it can be used. Usually this is done by BIOS, but with some motherboards this step is skipped if "Quick Boot" is enabled.
The people behind memtest have a paid tool to test ECC
-
New: Introduce rasdaemon the ECC monitor.
`rasdaemon` is a RAS (Reliability, Availability and Serviceability) logging tool. It records memory errors, using the EDAC tracing events. EDAC is a Linux kernel subsystem which handles detection of ECC errors from memory controllers for most chipsets on i386 and x86_64 architectures. EDAC drivers for other architectures like arm also exist.
Installation
apt-get install rasdaemon
The output will be available via syslog but you can show it to the foreground (`-f`) or to an sqlite3 database (`-r`).
To post-process and decode received MCA errors on AMD SMCA systems, run:
rasdaemon -p --status <STATUS_reg> --ipid <IPID_reg> --smca --family <CPU Family> --model <CPU Model> --bank <BANK_NUM>
Status and IPID Register values (in hex) are mandatory. The smca flag with family and model are required if not decoding locally. Bank parameter is optional.
You may also start it via systemd:
systemctl start rasdaemon
The rasdaemon will then output the messages to journald.
At this point `rasdaemon` should already be running on your system. You can now use the `ras-mc-ctl` tool to query the errors that have been detected. If everything is well configured you'll see something like:
$: ras-mc-ctl --error-count
Label                   CE      UE
mc#0csrow#2channel#0    0       0
mc#0csrow#2channel#1    0       0
mc#0csrow#3channel#1    0       0
mc#0csrow#3channel#0    0       0
If it's not you'll see:
ras-mc-ctl: Error: No DIMMs found in /sys or new sysfs EDAC interface not found.
The `CE` column represents the number of corrected errors for a given DIMM, `UE` represents uncorrectable errors that were detected. The label on the left shows the EDAC path under `/sys/devices/system/edac/mc/` of every DIMM. This is not very readable, if you wish to improve the labeling read this article.
Another way to check is to run:
$: ras-mc-ctl --status
ras-mc-ctl: drivers are loaded.
You can also see a summary of the state with:
$: ras-mc-ctl --summary
No Memory errors.
No PCIe AER errors.
No Extlog errors.
DBD::SQLite::db prepare failed: no such table: devlink_event at /usr/sbin/ras-mc-ctl line 1183.
Can't call method "execute" on an undefined value at /usr/sbin/ras-mc-ctl line 1184.
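If you want to act on the `--error-count` table from a script, it can be parsed with a few lines of Python. This is a rough sketch that assumes the three-column layout shown earlier and labels without spaces; `parse_error_count` is a hypothetical helper and `SAMPLE_OUTPUT` just reproduces the example output.

```python
from typing import Dict, Tuple

SAMPLE_OUTPUT = """\
Label                   CE      UE
mc#0csrow#2channel#0    0       0
mc#0csrow#2channel#1    0       0
mc#0csrow#3channel#1    0       0
mc#0csrow#3channel#0    0       0
"""


def parse_error_count(output: str) -> Dict[str, Tuple[int, int]]:
    """Map each DIMM label to its (corrected, uncorrected) error counters."""
    counts = {}
    for line in output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) != 3:
            continue  # ignore blank or unexpected lines
        label, corrected, uncorrected = fields
        counts[label] = (int(corrected), int(uncorrected))
    return counts


counts = parse_error_count(SAMPLE_OUTPUT)
dimms_with_errors = [label for label, (ce, ue) in counts.items() if ce or ue]
print(dimms_with_errors)  # []
```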
Monitorization
You can use loki to monitor ECC errors shown in the logs with the next alerts:
groups:
  - name: ecc
    rules:
      - alert: ECCError
        expr: |
          count_over_time({job="systemd-journal", unit="rasdaemon.service", level="error"} [5m]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Possible ECC error detected in {{ $labels.hostname}}"
      - alert: ECCWarning
        expr: |
          count_over_time({job="systemd-journal", unit="rasdaemon.service", level="warning"} [5m]) > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Possible ECC warning detected in {{ $labels.hostname}}"
      - alert: ECCAlert
        expr: |
          count_over_time({job="systemd-journal", unit="rasdaemon.service", level!~"info|error|warning"} [5m]) > 0
        for: 1m
        labels:
          severity: info
        annotations:
          summary: "ECC log trace with unknown severity level detected in {{ $labels.hostname}}"
References
Operating Systems⚑
Linux⚑
Linux Snippets⚑
-
New: Send multiline messages with notify-send.
The title can't have new lines, but the body can.
notify-send "Title" "This is the first line.\nAnd this is the second."
feat(linux_snippets#Find BIOS version): Find BIOS version
dmidecode | less
-
New: Reboot server on kernel panic.
The `/proc/sys/kernel/panic` file gives read/write access to the kernel variable `panic_timeout`. If this is zero, the kernel will loop on a panic; if nonzero it indicates that the kernel should autoreboot after this number of seconds. When you use the software watchdog device driver, the recommended setting is `60`.
To set the value add the next contents to the `/etc/sysctl.d/99-panic.conf` file:
kernel.panic = 60
Or with an ansible task:
- name: Configure reboot on kernel panic
  become: true
  lineinfile:
    path: /etc/sysctl.d/99-panic.conf
    line: kernel.panic = 60
    create: true
    state: present
-
New: Share a calculated value between github actions steps.
You need to set a step's output parameter. Note that the step will need an `id` to be defined to later retrieve the output value.
echo "{name}={value}" >> "$GITHUB_OUTPUT"
For example:
- name: Set color
  id: color-selector
  run: echo "SELECTED_COLOR=green" >> "$GITHUB_OUTPUT"
- name: Get color
  env:
    SELECTED_COLOR: ${{ steps.color-selector.outputs.SELECTED_COLOR }}
  run: echo "The selected color is $SELECTED_COLOR"
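If the value is calculated in a Python step instead of in shell, the same file can be appended to from Python. A minimal sketch, assuming single-line values (multi-line values need GitHub's delimiter syntax, which is not covered here); `set_output` is a hypothetical helper name.

```python
import os


def set_output(name: str, value: str) -> None:
    """Append a name=value pair to the file GitHub Actions reads step outputs from."""
    # GITHUB_OUTPUT is set by the runner for every step
    with open(os.environ["GITHUB_OUTPUT"], "a", encoding="utf-8") as output_file:
        output_file.write(f"{name}={value}\n")


# Inside a step with `id: color-selector` one would call:
#     set_output("SELECTED_COLOR", "green")
```

The value is then available to later steps as `steps.color-selector.outputs.SELECTED_COLOR`, exactly as in the shell version.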
-
New: Split a zip into parts of restricted size.
Something like:
zip -9 myfile.zip *
zipsplit -n 250000000 myfile.zip
Would produce myfile1.zip, myfile2.zip, etc., all independent of each other, and none larger than 250MB (in powers of ten). zipsplit will even try to organize the contents so that each resulting archive is as close as possible to the maximum size.
-
New: Find files that were modified between dates.
The best option is -newerXY. The m and t flags can be used:
- m: the modification time of the file reference
- t: reference is interpreted directly as a time
So the solution is
find . -type f -newermt 20111222 \! -newermt 20111225
The lower bound is inclusive and the upper bound is exclusive, so I added 1 day to it. The search is recursive.
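The same date-window search can be sketched in Python for cases where the result feeds a script. This is an illustration, not a drop-in replacement for find: the function name files_modified_between is hypothetical, and it uses a half-open window [start, end), which can differ from find at the exact boundary instants.

```python
from datetime import datetime
from pathlib import Path


def files_modified_between(root: str, start: datetime, end: datetime) -> list[Path]:
    """Return regular files under root whose mtime falls in [start, end)."""
    lower, upper = start.timestamp(), end.timestamp()
    return sorted(
        path
        for path in Path(root).rglob("*")  # rglob walks subdirectories too
        if path.is_file()
        and not path.is_symlink()
        and lower <= path.stat().st_mtime < upper
    )
```

For the example above, the call would be files_modified_between(".", datetime(2011, 12, 22), datetime(2011, 12, 25)).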
Magic keys⚑
-
New: Introduce the Magic Keys.
The magic SysRq key is a key combination understood by the Linux kernel, which allows the user to perform various low-level commands regardless of the system's state. It is often used to recover from freezes, or to reboot a computer without corrupting the filesystem.[1] Its effect is similar to the computer's hardware reset button (or power switch) but with many more options and much more control.
This key combination provides access to powerful features for software development and disaster recovery. In this sense, it can be considered a form of escape sequence. Principal among the offered commands are means to forcibly unmount file systems, kill processes, recover keyboard state, and write unwritten data to disk. With respect to these tasks, this feature serves as a tool of last resort.
The magic SysRq key cannot work under certain conditions, such as a kernel panic[2] or a hardware failure preventing the kernel from running properly.
The key combination consists of Alt+Sys Req and another key, which controls the command issued.
On some devices, notably laptops, the Fn key may need to be pressed to use the magic SysRq key.
Reboot the machine
A common use of the magic SysRq key is to perform a safe reboot of a Linux computer which has otherwise locked up (abbr. REISUB). This can prevent a fsck being required on reboot and gives some programs a chance to save emergency backups of unsaved work. The QWERTY (or AZERTY) mnemonics: "Raising Elephants Is So Utterly Boring", "Reboot Even If System Utterly Broken" or simply the word "BUSIER" read backwards, are often used to remember the following SysRq-keys sequence:
- unRaw (take control of keyboard back from X),
- tErminate (send SIGTERM to all processes, allowing them to terminate gracefully),
- kIll (send SIGKILL to all processes, forcing them to terminate immediately),
- Sync (flush data to disk),
- Unmount (remount all filesystems read-only),
- reBoot.
When magic SysRq keys are used to kill a frozen graphical program, the program has no chance to restore text mode. This can make everything unreadable. The commands textmode (part of SVGAlib) and the reset command can restore text mode and make the console readable again.
On distributions that do not include a textmode executable, the key command Ctrl+Alt+F1 may sometimes be able to force a return to a text console. (Use F1, F2, F3,..., F(n), where n is the highest number of text consoles set up by the distribution. Ctrl+Alt+F(n+1) would normally be used to reenter GUI mode on a system on which the X server has not crashed.)
Interact with the sysrq through the command line
It can also be used by echoing letters to /proc/sysrq-trigger. For example, to trigger a system crash and take a crashdump you can run:
echo c > /proc/sysrq-trigger
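The echo approach can also be wrapped in a small script. The sketch below maps the REISUB letters described earlier to their meaning and writes a single command letter to the trigger file; the names REISUB and send_sysrq are hypothetical, and writing to the real /proc/sysrq-trigger needs root and acts immediately.

```python
# Command letters of the REISUB sequence, in the order they are pressed.
REISUB = {
    "r": "unRaw: take control of the keyboard back from X",
    "e": "tErminate: send SIGTERM to all processes",
    "i": "kIll: send SIGKILL to all processes",
    "s": "Sync: flush data to disk",
    "u": "Unmount: remount all filesystems read-only",
    "b": "reBoot",
}


def send_sysrq(letter: str, trigger_path: str = "/proc/sysrq-trigger") -> None:
    """Write one SysRq command letter to the trigger file."""
    if len(letter) != 1:
        raise ValueError("expected a single command letter")
    with open(trigger_path, "w") as trigger:
        trigger.write(letter)
```

For example, send_sysrq("s") would request a sync. Passing a temporary file as trigger_path is a safe way to try the function without touching the kernel.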
Pass⚑
-
New: Add rofi launcher.
pass is a command line password store
Configure rofi launcher
- Save this script somewhere in your $PATH
- Configure your window manager to launch it whenever you need a password.
rofi⚑
-
New: Introduce Rofi.
Rofi is a window switcher, application launcher and dmenu replacement.
sudo apt-get install rofi
Usage
To launch rofi directly in a certain mode, specify a mode with rofi -show <mode>. To show the run dialog:
rofi -show run
Or get the options from a script:
~/my_script.sh | rofi -dmenu
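The same pipe can be driven from Python when the options are computed in a script. This is a minimal sketch, assuming rofi is installed and that the menu command prints the chosen entry on stdout; the function name choose and its command parameter are hypothetical.

```python
import subprocess
from typing import Optional, Sequence


def choose(
    options: Sequence[str],
    command: Sequence[str] = ("rofi", "-dmenu"),
) -> Optional[str]:
    """Pipe options (one per line) to the menu command and return the selection."""
    result = subprocess.run(
        list(command),
        input="\n".join(options),
        capture_output=True,
        text=True,
        check=False,
    )
    selection = result.stdout.strip()
    # Treat a non-zero exit status or empty output as "nothing chosen".
    if result.returncode != 0 or not selection:
        return None
    return selection
```

A call such as choose(["window", "run", "ssh"]) would open the menu with three entries. Keeping the command as a parameter makes it easy to swap in another dmenu-compatible program.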
The -modes flag takes an ordered, comma-separated list of modes to enable. Enabled modes can be changed at runtime (the default key is Ctrl+Tab). If no modes are specified, all configured modes will be enabled. To only show the run and ssh launcher:
rofi -modes "run,ssh" -show run
The -combi-modes flag sets the modes to combine in combi mode; its syntax is the same as for -modes. To get one merged view of window, run, and ssh:
rofi -show combi -combi-modes "window,run,ssh" -modes combi
The configuration lives at ~/.config/rofi/config.rasi. To create this file with the default configuration, run:
rofi -dump-config > ~/.config/rofi/config.rasi
To try fzf sorting and fuzzy matching for a single run:
rofi -show run -sorting-method fzf -matching fuzzy
To persist them, set those same values in the configuration.
Theme changing
To change the theme:
- Choose the one you like most looking here
- Run rofi-theme-selector to select it
- Accept it with Alt + a
Plugins
You can write your own custom plugins. If you're using Python, python-rofi seems to be the best option, although it looks unmaintained.
Some interesting examples are:
- Python based plugin
- Creation of nice menus
- Nice collection of possibilities
- Date picker
- Orgmode capture
Android⚑
Signal⚑
-
New: Use the Molly FOSS android client.
Molly is an independent Signal fork for Android. The advantages are:
- Contains no proprietary blobs, unlike Signal.
- Protects database with passphrase encryption.
- Locks down the app automatically when you are gone for a set period of time.
- Securely shreds sensitive data from RAM.
- Automatic backups on a daily or weekly basis.
- Supports SOCKS proxy and Tor via Orbot.
Note, the migration should be done when the available Molly version is equal to or later than the currently installed Signal app version.
1. Verify your Signal backup passphrase. In the Signal app: Settings > Chats > Chat backups > Verify backup passphrase.
2. Optionally, put your phone offline (enable airplane mode or disable data services) until after Signal is uninstalled in step 4. This will prevent the possibility of losing any Signal messages that are received during or after the backup is created.
3. Create a Signal backup. In the Signal app, go to Settings > Chats > Chat backups > Create backup.
4. Uninstall the Signal app. Now you can put your phone back online (disable airplane mode or re-enable data services).
5. Install the Molly or Molly-FOSS app.
6. Open the Molly app. Enable database encryption if desired. As soon as the option is given, tap Transfer or restore account. Answer any permissions questions.
7. Choose to Restore from backup and tap Choose backup. Navigate to your Signal backup location (Signal/Backups/, by default) and choose the backup that was created in step 3.
8. Check the backup details and then tap Restore backup to confirm. Enter the backup passphrase when requested.
9. If asked, choose a new folder for backup storage. Or choose Not Now and do it later.
Consider also:
- Any previously linked devices will need to be re-linked. Go to Settings > Linked devices in the Molly app. If Signal Desktop is not detecting that it is no longer linked, try restarting it.
- Verify your Molly backup settings and passphrase at Settings > Chats > Chat backups (to change the backup folder, disable and then enable backups). Tap Create backup to create your first Molly backup.
- When you are satisfied that Molly is working, you may want to delete the old Signal backups (in Signal/Backups, by default).
Other⚑
- New: Add 2024 Hidden Cup 5 awesome match.