26th March 2024

Activism

Ludditest

Life Management

Time management

Time management abstraction levels

  • Correction: Rename Task to Action.

    This removes the productivity-capitalist connotations from the concept.

Coding

Languages

Bash snippets

  • New: Compare two semantic versions.

    This article gives a lot of ways to do it. For my case the simplest is to use dpkg to compare two strings in dot-separated version format in bash.

    Usage: dpkg --compare-versions <ver1> <op> <ver2>

    Where <op> is one of lt, le, eq, ne, ge or gt.
    

    If the condition is true, the status code returned by dpkg will be zero (indicating success). So, we can use this command in an if statement to compare two version numbers:

    $ if dpkg --compare-versions "2.11" "lt" "3"; then echo true; else echo false; fi
    true
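
    If dpkg is not available, the same idea can be sketched in Python. This is a minimal sketch that assumes purely numeric, dot-separated versions; the function names parse_version and version_lt are made up for the illustration and it does not implement the full Debian version ordering that dpkg uses.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '2.11' into (2, 11) so versions compare numerically, not as text."""
    return tuple(int(part) for part in version.split("."))


def version_lt(left: str, right: str) -> bool:
    """Return True when `left` is an older version than `right`."""
    return parse_version(left) < parse_version(right)


print(version_lt("2.11", "3"))  # True
print(version_lt("2.9", "2.11"), "2.9" < "2.11")  # True False
```

    The second line shows why a plain string comparison is not enough: as text, "2.9" sorts after "2.11".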
    

  • New: Exclude list of extensions from find command.

    find . -not \( -name '*.sh' -o -name '*.log' \)
    

Protocols

  • New: Introduce Python Protocols.

    The Python type system supports two ways of deciding whether two objects are compatible as types: nominal subtyping and structural subtyping.

    Nominal subtyping is strictly based on the class hierarchy. If class Dog inherits class Animal, it’s a subtype of Animal. Instances of Dog can be used when Animal instances are expected. This form of subtyping is what Python’s type system predominantly uses: it’s easy to understand, produces clear and concise error messages, and matches how the native isinstance check works, which is based on class hierarchy.

    Structural subtyping is based on the operations that can be performed with an object. Class Dog is a structural subtype of class Animal if the former has all attributes and methods of the latter, and with compatible types.

    Structural subtyping can be seen as a static equivalent of duck typing, which is well known to Python programmers. See PEP 544 for the detailed specification of protocols and structural subtyping in Python.

    Usage

    You can define your own protocol class by inheriting the special Protocol class:

    # Protocol is part of the standard typing module since Python 3.8
    from typing import Iterable, Protocol
    
    class SupportsClose(Protocol):
        # Empty method body (explicit '...')
        def close(self) -> None: ...
    
    class Resource:  # No SupportsClose base class!
    
        def close(self) -> None:
            self.resource.release()
    
        # ... other methods ...
    
    def close_all(items: Iterable[SupportsClose]) -> None:
        for item in items:
            item.close()
    
    close_all([Resource(), open('some/file')])  # OK
    

    Resource is a subtype of the SupportsClose protocol since it defines a compatible close method. Regular file objects returned by open() are similarly compatible with the protocol, as they support close().

    If you want to define a docstring on the method use the next syntax:

        def load(self, filename: Optional[str] = None) -> None:
            """Load a configuration file."""
            ...
    

    Make protocols work with isinstance

    To check an instance against the protocol using isinstance, we need to decorate our protocol with @runtime_checkable.
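
    A minimal sketch of the decorator, reusing the SupportsClose protocol from the usage example (the Resource body is simplified for the illustration). Keep in mind that the runtime check only verifies that the members exist, not that their signatures match.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SupportsClose(Protocol):
    def close(self) -> None: ...


class Resource:  # No SupportsClose base class
    def close(self) -> None:
        print("released")


print(isinstance(Resource(), SupportsClose))  # True
print(isinstance(object(), SupportsClose))  # False
```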

    Make a protocol property variable

    Make protocol of functions

    References

    • Mypy article on protocols
    • Predefined protocols reference

FastAPI

  • New: Launch the server from within python.

    import uvicorn

    if __name__ == "__main__":
        uvicorn.run("main:app", host="0.0.0.0", port=8000, reload=True)
    
  • New: Add the request time to the logs.

    For more information on changing the logging read 1

    To add the datetime of each request to the logs use this configuration:

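    import logging
    from contextlib import asynccontextmanager
    
    import uvicorn.logging
    from fastapi import FastAPI
    
    
    @asynccontextmanager
    async def lifespan(api: FastAPI):
        # Swap the formatter of uvicorn's access logger for one that prints the time
        logger = logging.getLogger("uvicorn.access")
        console_formatter = uvicorn.logging.ColourizedFormatter(
            "{asctime} {levelprefix} : {message}", style="{", use_colors=True
        )
        logger.handlers[0].setFormatter(console_formatter)
        yield
    
    api = FastAPI(lifespan=lifespan)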
    

Python Snippets

  • New: Investigate a class's attributes.

    Investigate a class's attributes with inspect.
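
    As a rough illustration of what that can look like, the sketch below uses inspect.getmembers from the standard library. The Dog class and the public_members helper are hypothetical names made up for the example.

```python
import inspect


class Dog:
    legs = 4

    def __init__(self, name: str) -> None:
        self.name = name

    def bark(self) -> str:
        return f"{self.name} says woof"


def public_members(obj: object) -> dict:
    """Return the non-underscore attributes and methods of `obj`, keyed by name."""
    return {
        name: value
        for name, value in inspect.getmembers(obj)
        if not name.startswith("_")
    }


members = public_members(Dog("Rex"))
print(sorted(members))  # ['bark', 'legs', 'name']
print(inspect.ismethod(members["bark"]))  # True
```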

  • New: Expire the cache of the lru_cache.

    The lru_cache decorator caches forever. A way to prevent it is to add one more parameter to your expensive function: ttl_hash=None. This new parameter is a so-called "time sensitive hash"; its only purpose is to affect lru_cache. For example:

    from functools import lru_cache
    import time
    
    @lru_cache()
    def my_expensive_function(a, b, ttl_hash=None):
        del ttl_hash  # to emphasize we don't use it and to shut pylint up
        return a + b  # horrible CPU load...
    
    def get_ttl_hash(seconds=3600):
        """Return the same value within `seconds` time period"""
        return round(time.time() / seconds)
    
    res = my_expensive_function(2, 2, ttl_hash=get_ttl_hash())
    

Goodconf

  • New: Initialize the config with a default value if the file doesn't exist.

        def load(self, filename: Optional[str] = None) -> None:
            self._config_file = filename
            if not self.store_dir.is_dir():
                log.warning("The store directory doesn't exist. Creating it")
                os.makedirs(str(self.store_dir))
            if not Path(self.config_file).is_file():
                log.warning("The yaml store file doesn't exist. Creating it")
                self.save()
            super().load(filename)
    
  • New: Config saving.

    So far goodconf doesn't support saving the config. Until it's ready you can use the next snippet:

    class YamlStorage(GoodConf):
        """Adapter to store and load information from a yaml file."""
    
        @property
        def config_file(self) -> str:
            """Return the path to the config file."""
            return str(self._config_file)
    
        @property
        def store_dir(self) -> Path:
            """Return the path to the store directory."""
            return Path(self.config_file).parent
    
        def reload(self) -> None:
            """Reload the contents of the authentication store."""
            self.load(self.config_file)
    
        def load(self, filename: Optional[str] = None) -> None:
            """Load a configuration file."""
            if not filename:
                filename = f"{self.store_dir}/data.yaml"
            super().load(self.config_file)
    
        def save(self) -> None:
            """Save the contents of the authentication store."""
            with open(self.config_file, "w+", encoding="utf-8") as file_cursor:
                yaml = YAML()
                yaml.default_flow_style = False
                yaml.dump(self.dict(), file_cursor)
    
  • New: Open a specific Google Chrome profile.

    google-chrome --profile-directory="Profile Name"
    

    Where Profile Name is one of the profiles listed by ls ~/.config/chromium | grep -i profile.

Python Mysql

IDES

Git management configuration

  • New: Update all git submodules.

    If it's the first time you check-out a repo you need to use --init first:

    git submodule update --init --recursive
    

    To update the submodules to the latest tips of their remote branches use:

    git submodule update --recursive --remote
    

DevOps

Infrastructure as Code

Ansible Snippets

  • New: Filter json data.

    To select a single element or a data subset from a complex data structure in JSON format (for example, Ansible facts), use the community.general.json_query filter. The community.general.json_query filter lets you query a complex JSON structure and iterate over it using a loop structure.

    This filter is built upon jmespath, and you can use the same syntax. For examples, see jmespath examples.

    A complex example would be:

    "{{ ec2_facts | json_query('instances[0].block_device_mappings[?device_name!=`/dev/sda1` && device_name!=`/dev/xvda`].{device_name: device_name, id: ebs.volume_id}') }}"
    

    This snippet:

    • Gets all dictionaries under the block_device_mappings list whose device_name is neither /dev/sda1 nor /dev/xvda.
    • From those results it extracts and flattens only the desired values. In this case device_name and the id, which is at the key ebs.volume_id of each of the items of the block_device_mappings list.
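
    To make the two steps of the query concrete, here is a plain Python equivalent of what it computes. It is only a sketch: the function name non_root_volumes and the sample_facts data are invented for the illustration and are not real ec2_instance_info output.

```python
from typing import Dict, List

ROOT_DEVICES = {"/dev/sda1", "/dev/xvda"}


def non_root_volumes(ec2_facts: dict) -> List[Dict[str, str]]:
    """Mirror the query: filter the first instance's mappings, then project two keys."""
    mappings = ec2_facts["instances"][0]["block_device_mappings"]
    return [
        {"device_name": m["device_name"], "id": m["ebs"]["volume_id"]}
        for m in mappings
        if m["device_name"] not in ROOT_DEVICES
    ]


# Hypothetical facts, shaped like the structure the query walks.
sample_facts = {
    "instances": [
        {
            "block_device_mappings": [
                {"device_name": "/dev/sda1", "ebs": {"volume_id": "vol-root"}},
                {"device_name": "/dev/sdf", "ebs": {"volume_id": "vol-data"}},
            ]
        }
    ]
}
print(non_root_volumes(sample_facts))
# [{'device_name': '/dev/sdf', 'id': 'vol-data'}]
```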
  • New: Do asserts.

    - name: After version 2.7 both 'msg' and 'fail_msg' can customize failing assertion message
      ansible.builtin.assert:
        that:
          - my_param <= 100
          - my_param >= 0
        fail_msg: "'my_param' must be between 0 and 100"
        success_msg: "'my_param' is between 0 and 100"
    
  • New: Split a variable in ansible.

    {{ item | split ('@') | last }}
    
  • New: Get a list of EC2 volumes mounted on an instance and their mount points.

    Assuming that each volume has a tag mount_point you could:

    - name: Gather EC2 instance metadata facts
      amazon.aws.ec2_metadata_facts:
    
    - name: Gather info on the mounted disks
      delegate_to: localhost
      block:
        - name: Gather information about the instance
          amazon.aws.ec2_instance_info:
            instance_ids:
              - "{{ ansible_ec2_instance_id }}"
          register: ec2_facts
    
        - name: Gather volume tags
          amazon.aws.ec2_vol_info:
            filters:
              volume-id: "{{ item.id }}"
          # We exclude the root disk as they are already mounted and formatted
          loop: "{{ ec2_facts | json_query('instances[0].block_device_mappings[?device_name!=`/dev/sda1` && device_name!=`/dev/xvda`].{device_name: device_name, id: ebs.volume_id}') }}"
          register: volume_tags_data
    
        - name: Save the required volume data
          set_fact:
            volumes: "{{ volume_tags_data | json_query('results[].volumes[].{id: id, mount_point: tags.mount_point}') }}"
    
        - name: Display volumes data
          debug:
            msg: "{{ volumes }}"
    
        - name: Make sure that all volumes have a mount point
          assert:
            that:
              - item.mount_point is defined
              - item.mount_point|length > 0
            fail_msg: "Configure the 'mount_point' tag on the volume {{ item.id }} on the instance {{ ansible_ec2_instance_id }}"
            success_msg: "The volume {{ item.id }} has the mount_point tag well set"
          loop: "{{ volumes }}"
    
  • New: Create a list of dictionaries using ansible.

    - name: Create and Add items to dictionary
      set_fact:
          userdata: "{{ userdata | default({}) | combine ({ item.key : item.value }) }}"
      with_items:
        - { 'key': 'Name' , 'value': 'SaravAK'}
        - { 'key': 'Email' , 'value': 'sarav@gritfy.com'}
        - { 'key': 'Location' , 'value': 'Coimbatore'}
        - { 'key': 'Nationality' , 'value': 'Indian'}
    
  • New: Merge two lists of dictionaries on a key.

    If you have these two lists:

    "list1": [
      { "a": "b", "c": "d" },
      { "a": "e", "c": "f" }
    ]
    
    "list2": [
      { "a": "e", "g": "h" },
      { "a": "b", "g": "i" }
    ]
    
    And want to merge them using the value of key "a":

    "list3": [
      { "a": "b", "c": "d", "g": "i" },
      { "a": "e", "c": "f", "g": "h" }
    ]
    

    If you can install the community.general collection, use the lists_mergeby filter. The expression below produces list3:

    list3: "{{ list1|community.general.lists_mergeby(list2, 'a') }}"
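
    For readers who want to see the merge logic spelled out, this is a small Python sketch of the same operation. It is not the filter's implementation: merge_lists_by is a hypothetical helper, and sorting the result by the key is a choice made here to reproduce list3.

```python
from typing import Dict, List


def merge_lists_by(key: str, *lists: List[dict]) -> List[dict]:
    """Merge dictionaries that share the same value for `key`, sorted by that value."""
    merged: Dict[object, dict] = {}
    for items in lists:
        for item in items:
            # Later lists add to (or override) the keys collected so far.
            merged.setdefault(item[key], {}).update(item)
    return [merged[value] for value in sorted(merged)]


list1 = [{"a": "b", "c": "d"}, {"a": "e", "c": "f"}]
list2 = [{"a": "e", "g": "h"}, {"a": "b", "g": "i"}]
print(merge_lists_by("a", list1, list2))
# [{'a': 'b', 'c': 'd', 'g': 'i'}, {'a': 'e', 'c': 'f', 'g': 'h'}]
```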
    

Storage

NAS

OpenZFS

  • New: Monitor dbgmsg with loki.

    If you use loki remember to monitor the /proc/spl/kstat/zfs/dbgmsg file:

    - job_name: zfs
      static_configs:
        - targets:
            - localhost
          labels:
            job: zfs
            __path__: /proc/spl/kstat/zfs/dbgmsg
    
  • Correction: Add loki alerts on the kernel panic error.

    You can monitor this issue with loki using the next alerts:

    groups:
      - name: zfs
        rules:
          - alert: SlowSpaSyncZFSError
            expr: |
              count_over_time({job="zfs"} |~ `spa_deadman.*slow spa_sync` [5m])
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "Slow sync traces found in the ZFS debug logs at {{ $labels.hostname}}"
              message: "This usually happens before the ZFS becomes unresponsive"
    
  • New: Monitoring.

    You can monitor errors in the sanoid logs with loki using the next alert:

    groups:
      - name: zfs
        rules:
          - alert: ErrorInSanoidLogs
            expr: |
              count_over_time({job="systemd-journal", syslog_identifier="sanoid"} |= `ERROR` [5m])
            for: 1m
            labels:
                severity: critical
            annotations:
                summary: "Errors found on sanoid log at {{ $labels.hostname}}"
    

Resilience

  • New: Introduce linux resilience.

    Increasing the resilience of the servers is critical when hosting services for others. This is the roadmap I'm following for my servers.

    Autostart services if the system reboots

    Use init system services to manage your services.

    Get basic metrics traceability and alerts

    Set up Prometheus with:

    • The blackbox exporter to track if the services are available to your users and to monitor SSL certificates health.
    • The node exporter to keep track of the resource usage of your machines and set alerts to get notified when concerning events happen (disks are getting filled, CPU usage is too high).

    Get basic logs traceability and alerts

    Set up Loki and clear up your system log errors.

    Improve the resilience of your data

    If you're still using ext4 for your filesystems instead of zfs you're missing a big improvement.

    Automatically react on system failures

    • Kernel panics
    • watchdog

    Future undeveloped improvements

    • Handle the system reboots after kernel upgrades

Memtest

  • New: Introduce memtest.

    memtest86 is a RAM testing tool.

    Installation

    apt-get install memtest86+
    

    After the installation you'll get Memtest entries in grub which you can spawn.

    For some unknown reason the memtest entry of the boot menu didn't work for me, so I downloaded the latest free version of memtest (it's at the bottom of the screen), burned it to a USB drive and booted from there.

    Usage

    It will run by itself. For 64GB of ECC RAM it took approximately 100 minutes to run all the tests.

    Check ECC errors

    MemTest86 directly polls ECC errors logged in the chipset/memory controller registers and displays them to the user on-screen. In addition, ECC errors are written to the log and report file.

watchdog

  • New: Introduce the watchdog.

    A watchdog timer (WDT, or simply a watchdog), sometimes called a computer operating properly timer (COP timer), is an electronic or software timer that is used to detect and recover from computer malfunctions. Watchdog timers are widely used in computers to facilitate automatic correction of temporary hardware faults, and to prevent errant or malevolent software from disrupting system operation.

    During normal operation, the computer regularly restarts the watchdog timer to prevent it from elapsing, or "timing out". If, due to a hardware fault or program error, the computer fails to restart the watchdog, the timer will elapse and generate a timeout signal. The timeout signal is used to initiate corrective actions. The corrective actions typically include placing the computer and associated hardware in a safe state and invoking a computer reboot.
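
    The restart-or-expire behaviour described above can be modelled in a few lines. This is a toy model for illustration only, not how a hardware watchdog is driven; the WatchdogTimer class and its method names are invented here, and time is passed in explicitly to keep the example deterministic.

```python
class WatchdogTimer:
    """Toy model of a watchdog: it expires unless it is restarted in time."""

    def __init__(self, timeout: float, now: float = 0.0) -> None:
        self.timeout = timeout
        self.last_kick = now

    def kick(self, now: float) -> None:
        """Restart the countdown (the 'keep-alive ping')."""
        self.last_kick = now

    def expired(self, now: float) -> bool:
        """Return True once `timeout` seconds passed without a kick."""
        return now - self.last_kick >= self.timeout


watchdog = WatchdogTimer(timeout=30)
watchdog.kick(now=20)  # a healthy system pings before the deadline
print(watchdog.expired(now=45))  # False: only 25s since the last ping
print(watchdog.expired(now=50))  # True: 30s without a ping, time to reset
```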

    Microcontrollers often include an integrated, on-chip watchdog. In other computers the watchdog may reside in a nearby chip that connects directly to the CPU, or it may be located on an external expansion card in the computer's chassis.

    Hardware watchdog

    Before you start using the hardware watchdog you need to check if your hardware actually supports it.

    If you see a "Watchdog hardware is disabled" error on boot, things are not looking good.

    Check if the hardware watchdog is enabled

    You can see if the hardware watchdog is loaded by running wdctl. For example, on a machine that has it enabled you'll see:

    Device:        /dev/watchdog0
    Identity:      iTCO_wdt [version 0]
    Timeout:       30 seconds
    Pre-timeout:    0 seconds
    Timeleft:      30 seconds
    FLAG           DESCRIPTION               STATUS BOOT-STATUS
    KEEPALIVEPING  Keep alive ping reply          1           0
    MAGICCLOSE     Supports magic close char      0           0
    SETTIMEOUT     Set timeout (in seconds)       0           0
    

    On a machine that doesn't you'll see:

    wdctl: No default device is available.: No such file or directory
    

    Another option is to run dmesg | grep wd or dmesg | grep watc -i. For example for a machine that has enabled the hardware watchdog you'll see something like:

    [   20.708839] iTCO_wdt: Intel TCO WatchDog Timer Driver v1.11
    [   20.708894] iTCO_wdt: Found a Intel PCH TCO device (Version=4, TCOBASE=0x0400)
    [   20.709009] iTCO_wdt: initialized. heartbeat=30 sec (nowayout=0)
    

    For one that is not you'll see:

    [    1.934999] sp5100_tco: SP5100/SB800 TCO WatchDog Timer Driver
    [    1.935057] sp5100-tco sp5100-tco: Using 0xfed80b00 for watchdog MMIO address
    [    1.935062] sp5100-tco sp5100-tco: Watchdog hardware is disabled
    

    If you're out of luck and your hardware doesn't support it, you can delegate the task to the software watchdog or get a USB watchdog.

    Systemd watchdog

    Starting with version 183 systemd provides full support for hardware watchdogs (as exposed in /dev/watchdog to userspace), as well as supervisor (software) watchdog support for individual system services. The basic idea is the following: if enabled, systemd will regularly ping the watchdog hardware. If systemd or the kernel hang this ping will not happen anymore and the hardware will automatically reset the system. This way systemd and the kernel are protected from boundless hangs -- by the hardware. To make the chain complete, systemd then exposes a software watchdog interface for individual services so that they can also be restarted (or some other action taken) if they begin to hang. This software watchdog logic can be configured individually for each service in the ping frequency and the action to take. Putting both parts together (i.e. hardware watchdogs supervising systemd and the kernel, as well as systemd supervising all other services) we have a reliable way to watchdog every single component of the system.

    Configuring the watchdog

    To make use of the hardware watchdog it is sufficient to set the RuntimeWatchdogSec= option in /etc/systemd/system.conf. It defaults to 0 (i.e. no hardware watchdog use). Set it to a value like 20s and the watchdog is enabled. After 20s of no keep-alive pings the hardware will reset the system. Note that systemd will send a ping to the hardware at half the specified interval, i.e. every 10s.

    Note that the hardware watchdog device (/dev/watchdog) is single-user only. That means that you can either enable this functionality in systemd, or use a separate external watchdog daemon, such as the aptly named watchdog. The built-in hardware watchdog support of systemd does not conflict with other watchdog software by default: systemd does not make use of /dev/watchdog unless configured to, and you are welcome to use external watchdog daemons in conjunction with systemd, if this better suits your needs.

    ShutdownWatchdogSec= is another option that can be configured in /etc/systemd/system.conf. It controls the watchdog interval to use during reboots. It defaults to 10min, and adds extra reliability to the system reboot logic: if a clean reboot is not possible and shutdown hangs, we rely on the watchdog hardware to reset the system abruptly, as an extra safety net.

    Now, let's have a look how to add watchdog logic to individual services.

    First of all, to make software watchdog-supervisable it needs to be patched to send out "I am alive" signals in regular intervals in its event loop. Patching this is relatively easy. First, a daemon needs to read the WATCHDOG_USEC= environment variable. If it is set, it will contain the watchdog interval in usec formatted as ASCII text string, as it is configured for the service. The daemon should then issue sd_notify("WATCHDOG=1") calls every half of that interval. A daemon patched this way should transparently support watchdog functionality by checking whether the environment variable is set and honouring the value it is set to.
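
    The patching described above can be sketched in Python without extra dependencies, assuming the usual notification protocol in which systemd passes a Unix datagram socket through the NOTIFY_SOCKET environment variable. The helper names (watchdog_ping_interval, sd_notify, run_with_watchdog) are made up for this sketch; a real daemon may prefer an existing sd_notify binding.

```python
import os
import socket
import time
from typing import Callable, Optional


def watchdog_ping_interval() -> Optional[float]:
    """Return half of WATCHDOG_USEC in seconds, or None if no watchdog is set."""
    usec = os.environ.get("WATCHDOG_USEC")
    if not usec:
        return None
    return int(usec) / 1_000_000 / 2


def sd_notify(message: str) -> bool:
    """Send `message` to the socket named in NOTIFY_SOCKET, if there is one."""
    address = os.environ.get("NOTIFY_SOCKET")
    if not address:
        return False
    if address.startswith("@"):
        # Abstract namespace socket: the leading '@' stands for a NUL byte.
        address = "\0" + address[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode("utf-8"), address)
    return True


def run_with_watchdog(do_one_step: Callable[[], None]) -> None:
    """Hypothetical event loop: do some work, then tell systemd we are alive."""
    interval = watchdog_ping_interval()
    while True:
        do_one_step()
        if interval is not None:
            sd_notify("WATCHDOG=1")
            time.sleep(interval)
```

    When the service is not started by systemd neither variable is set, so sd_notify returns False and the loop simply never pings.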

    To enable the software watchdog logic for a service (which has been patched to support the logic pointed out above) it is sufficient to set the WatchdogSec= to the desired failure latency. See systemd.service(5) for details on this setting. This causes WATCHDOG_USEC= to be set for the service's processes and will cause the service to enter a failure state as soon as no keep-alive ping is received within the configured interval.

    The next step is to configure whether the service shall be restarted and how often, and what to do if it then still fails. To enable automatic service restarts on failure set Restart=on-failure for the service. To configure how many times a service shall be attempted to be restarted use the combination of StartLimitBurst= and StartLimitInterval= which allow you to configure how often a service may restart within a time interval. If that limit is reached, a special action can be taken. This action is configured with StartLimitAction=. The default is none, i.e. no further action is taken and the service simply remains in the failure state without any further attempted restarts. The other three possible values are reboot, reboot-force and reboot-immediate.

    • reboot attempts a clean reboot, going through the usual, clean shutdown logic.
    • reboot-force is more abrupt: it will not actually try to cleanly shutdown any services, but immediately kills all remaining services and unmounts all file systems and then forcibly reboots (this way all file systems will be clean but reboot will still be very fast).
    • reboot-immediate does not attempt to kill any process or unmount any file systems. Instead it just hard reboots the machine without delay. reboot-immediate hence comes closest to a reboot triggered by a hardware watchdog.

    All these settings are documented in systemd.service(5).

    Putting this all together we now have pretty flexible options to watchdog-supervise a specific service and configure automatic restarts of the service if it hangs, plus take ultimate action if that doesn't help.

    Here's an example unit file:

    [Unit]
    Description=My Little Daemon
    Documentation=man:mylittled(8)
    
    [Service]
    ExecStart=/usr/bin/mylittled
    WatchdogSec=30s
    Restart=on-failure
    StartLimitInterval=5min
    StartLimitBurst=4
    StartLimitAction=reboot-force

    This service will automatically be restarted if it hasn't pinged the system manager for longer than 30s or if it fails otherwise. If it is restarted this way more often than 4 times in 5min, action is taken and the system is quickly rebooted, with all file systems being clean when it comes up again.

    To write the code of the watchdog service you can follow one of these guides:

    • Python based watchdog: https://sleeplessbeastie.eu/2022/08/15/how-to-create-watchdog-for-systemd-service/
    • Bash based watchdog: https://www.medo64.com/2019/01/systemd-watchdog-for-any-service/

    Testing a watchdog

    One simple way to test a watchdog (see https://serverfault.com/questions/375220/how-to-check-what-if-hardware-watchdogs-are-available-in-linux) is to trigger a kernel panic. This can be done as root with:

    echo c > /proc/sysrq-trigger
    

    The kernel will stop responding to the watchdog pings, so the watchdog will trigger.

    SysRq is a 'magical' key combo you can hit which the kernel will respond to regardless of whatever else it is doing, unless it is completely locked up. It can also be used by echoing letters to /proc/sysrq-trigger, like we're doing here.

    In this case, the letter c means perform a system crash and take a crashdump if configured.

    Troubleshooting

    Watchdog hardware is disabled error on boot

    According to the discussion at the kernel mailing list it means that the system contains a hardware watchdog but it has been disabled (probably by the BIOS) and Linux cannot enable the hardware.

    If your BIOS doesn't have a switch to enable it, consider the watchdog hardware broken for your system.

    Some people are blacklisting the module so that it's not loaded and therefore it doesn't return the error (1, 2).

Monitoring

Loki

  • Correction: Use fake when using one loki instance in docker.

    If you only have one Loki instance you need to save the rule yaml files in the /etc/loki/rules/fake/ directory, otherwise Loki will silently ignore them (it took me a lot of time to figure this out -.-).

  • New: Add alerts.

    Surprisingly I haven't found any compilation of Loki alerts. I'll gather here the ones I create.

    There are two kinds of rules: alerting rules and recording rules.

Promtail

  • New: Install with ansible.

    Use patrickjahns ansible role. Some interesting variables are:

    loki_url: localhost
    promtail_system_user: root
    
    promtail_config_clients:
      - url: "http://{{ loki_url }}:3100/loki/api/v1/push"
        external_labels:
          hostname: "{{ ansible_hostname }}"
    
  • New: Troubleshooting promtail.

    Find where the positions.yaml file is and check whether its contents change over time.

    Sometimes if you are not seeing the logs in loki it's because the query you're running is not correct.

Process Exporter

Hardware

ECC RAM

  • New: Introduce ECC RAM.

    Error Correction Code (ECC) is a mechanism used to detect and correct errors in memory data due to environmental interference and physical defects. ECC memory is used in high-reliability applications that cannot tolerate failure due to corrupted data.

    Installation: Due to additional circuitry required for ECC protection, specialized ECC hardware support is required by the CPU chipset, motherboard and DRAM module. This includes the following:

    • Server-grade CPU chipset with ECC support (Intel Xeon, AMD Ryzen)
    • Motherboard supporting ECC operation
    • ECC RAM

    Consult the motherboard and/or CPU documentation for the specific model to verify whether the hardware supports ECC. Use vendor-supplied list of certified ECC RAM, if provided.

    Most ECC-supported motherboards allow you to configure ECC settings from the BIOS setup. They are usually on the Advanced tab. The specific option depends on the motherboard vendor or model such as the following:

    • DRAM ECC Enable (American Megatrends, ASUS, ASRock, MSI)
    • ECC Mode (ASUS)

    Monitoring

    The mechanism for how ECC errors are logged and reported to the end-user depends on the BIOS and operating system. In most cases, corrected ECC errors are written to system/event logs. Uncorrected ECC errors may result in kernel panic or blue screen.

    The Linux kernel supports reporting ECC errors for ECC memory via the EDAC (Error Detection And Correction) driver subsystem. Depending on the Linux distribution, ECC errors may be reported by the following:

    • rasdaemon: monitor ECC memory and report both correctable and uncorrectable memory errors on recent Linux kernels.
    • mcelog (Deprecated): collects and decodes MCA error events on x86.
    • edac-utils (Deprecated): fills DIMM labels data and summarizes memory errors.

    To configure rasdaemon follow this article

    Confusion on boards supporting ECC

    I've read that even if some motherboards say that they "Support ECC" some of them don't do anything with it.

    This post and the kernel docs show that you should see references to ACPI/WHEA in the specs manual. Ideally ACPI5 support.

    From the kernel docs: EINJ provides a hardware error injection mechanism. It is very useful for debugging and testing APEI and RAS features in general.

    You need to check whether your BIOS supports EINJ first. For that, look for early boot messages similar to this one:

    ACPI: EINJ 0x000000007370A000 000150 (v01 INTEL           00000001 INTL 00000001)
    

    Which shows that the BIOS is exposing an EINJ table - it is the mechanism through which the injection is done.

    Alternatively, look in /sys/firmware/acpi/tables for an "EINJ" file, which is a different representation of the same thing.
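
    That second check is easy to script. The sketch below only tests whether the file exists (reading its contents usually needs root); the function name einj_table_exposed and its tables_dir parameter are assumptions made for the example.

```python
import os


def einj_table_exposed(tables_dir: str = "/sys/firmware/acpi/tables") -> bool:
    """Return True if the firmware exposes an ACPI EINJ table under `tables_dir`."""
    return os.path.exists(os.path.join(tables_dir, "EINJ"))


if einj_table_exposed():
    print("The BIOS exposes an EINJ table")
else:
    print("No EINJ table found, check the BIOS setup before giving up")
```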

    It doesn't necessarily mean that EINJ is not supported if those above don't exist: before you give up, go into BIOS setup to see if the BIOS has an option to enable error injection. Look for something called WHEA or similar. Often, you need to enable an ACPI5 support option prior, in order to see the APEI,EINJ,... functionality supported and exposed by the BIOS menu.

    To use EINJ, make sure the following options are enabled in your kernel configuration:

    CONFIG_DEBUG_FS
    CONFIG_ACPI_APEI
    CONFIG_ACPI_APEI_EINJ
    

    One way to test it can be to run memtest as it sometimes shows ECC errors such as "Warning ECC injection may be disabled for AMD Ryzen (70h-7fh)".

    Other people (1, 2) say that there are a lot of motherboards that NEVER report any corrected errors to the OS. In order to see corrected errors, PFEH (Platform First Error Handling) has to be disabled. On some motherboards and FW versions this setting is hidden from the user and always enabled, thus resulting in zero correctable errors getting reported.

    They also suggest disabling "Quick Boot". In order to initialize ECC, memory has to be written before it can be used. Usually this is done by the BIOS, but with some motherboards this step is skipped if "Quick Boot" is enabled.

    The people behind memtest have a paid tool to test ECC.

  • New: Introduce rasdaemon the ECC monitor.

    rasdaemon is a RAS (Reliability, Availability and Serviceability) logging tool. It records memory errors, using the EDAC tracing events. EDAC is a Linux kernel subsystem which handles detection of ECC errors from memory controllers for most chipsets on i386 and x86_64 architectures. EDAC drivers for other architectures like arm also exist.

    Installation

    apt-get install rasdaemon
    

    The output will be available via syslog, but you can also run it in the foreground (-f) or record the events to an sqlite3 database (-r).

    To post-process and decode received MCA errors on AMD SMCA systems, run:

    rasdaemon -p --status <STATUS_reg> --ipid <IPID_reg> --smca --family <CPU Family> --model <CPU Model> --bank <BANK_NUM>
    

    Status and IPID Register values (in hex) are mandatory. The smca flag with family and model are required if not decoding locally. Bank parameter is optional.

    You may also start it via systemd:

    systemctl start rasdaemon
    

    The rasdaemon will then output the messages to journald.

    Usage

    At this point rasdaemon should already be running on your system. You can now use the ras-mc-ctl tool to query the errors that have been detected. If everything is well configured you'll see something like:

    $: ras-mc-ctl --error-count
    Label                 CE      UE
    mc#0csrow#2channel#0  0   0
    mc#0csrow#2channel#1  0   0
    mc#0csrow#3channel#1  0   0
    mc#0csrow#3channel#0  0   0
    

    If it's not you'll see:

    ras-mc-ctl: Error: No DIMMs found in /sys or new sysfs EDAC interface not found.
    

    The CE column represents the number of corrected errors for a given DIMM, UE represents uncorrectable errors that were detected. The label on the left shows the EDAC path under /sys/devices/system/edac/mc/ of every DIMM. This is not very readable, if you wish to improve the labeling read this article
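
    If you want to feed those counters into a script, a parser along these lines could work. It is a sketch that assumes the three-column layout shown above; parse_error_count is a hypothetical helper and the sample text is copied from that output rather than captured from a real run.

```python
from typing import Dict, Tuple


def parse_error_count(output: str) -> Dict[str, Tuple[int, int]]:
    """Map each DIMM label to its (corrected, uncorrected) error counts."""
    counts = {}
    for line in output.strip().splitlines()[1:]:  # skip the header row
        # Split from the right so labels containing spaces stay in one piece.
        label, corrected, uncorrected = line.rsplit(None, 2)
        counts[label] = (int(corrected), int(uncorrected))
    return counts


sample = """\
Label                 CE      UE
mc#0csrow#2channel#0  0   0
mc#0csrow#2channel#1  0   0
"""
counts = parse_error_count(sample)
print(any(ce or ue for ce, ue in counts.values()))  # False: no errors logged
```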

    Another way to check is to run:

    $: ras-mc-ctl --status
    ras-mc-ctl: drivers are loaded.
    

    You can also see a summary of the state with:

    $: ras-mc-ctl --summary
    No Memory errors.
    
    No PCIe AER errors.
    
    No Extlog errors.
    
    DBD::SQLite::db prepare failed: no such table: devlink_event at /usr/sbin/ras-mc-ctl line 1183.
    Can't call method "execute" on an undefined value at /usr/sbin/ras-mc-ctl line 1184.
    

    Monitoring

    You can use Loki to monitor the ECC errors shown in the logs with the following alerts:

    groups:
      - name: ecc
        rules:
          - alert: ECCError
            expr: |
              count_over_time({job="systemd-journal", unit="rasdaemon.service", level="error"} [5m])  > 0
            for: 1m
            labels:
                severity: critical
            annotations:
                summary: "Possible ECC error detected in {{ $labels.hostname}}"
    
          - alert: ECCWarning
            expr: |
              count_over_time({job="systemd-journal", unit="rasdaemon.service", level="warning"} [5m])  > 0
            for: 1m
            labels:
                severity: warning
            annotations:
                summary: "Possible ECC warning detected in {{ $labels.hostname}}"
          - alert: ECCAlert
            expr: |
              count_over_time({job="systemd-journal", unit="rasdaemon.service", level!~"info|error|warning"} [5m]) > 0
            for: 1m
            labels:
                severity: info
            annotations:
                summary: "ECC log trace with unknown severity level detected in {{ $labels.hostname}}"
    
    References

Operating Systems

Linux

Linux Snippets

  • New: Send multiline messages with notify-send.

    The title can't have new lines, but the body can.

    notify-send "Title" "This is the first line.\nAnd this is the second."
    
  • New: Find BIOS version.

    dmidecode | less
    
  • New: Reboot server on kernel panic.

    The /proc/sys/kernel/panic file gives read/write access to the kernel variable panic_timeout. If this is zero, the kernel will loop on a panic; if nonzero, it indicates that the kernel should autoreboot after this number of seconds. When you use the software watchdog device driver, the recommended setting is 60.

    To set the value, add the following line to /etc/sysctl.d/99-panic.conf:

    kernel.panic = 60
    

    Or with an ansible task:

    - name: Configure reboot on kernel panic
      become: true
      lineinfile:
        path: /etc/sysctl.d/99-panic.conf
        line: kernel.panic = 60
        create: true
        state: present
    
  • New: Share a calculated value between GitHub Actions steps.

    You need to set a step's output parameter. Note that the step needs an id so that you can retrieve the output value later.

    echo "{name}={value}" >> "$GITHUB_OUTPUT"
    

    For example:

    - name: Set color
      id: color-selector
      run: echo "SELECTED_COLOR=green" >> "$GITHUB_OUTPUT"
    - name: Get color
      env:
        SELECTED_COLOR: ${{ steps.color-selector.outputs.SELECTED_COLOR }}
      run: echo "The selected color is $SELECTED_COLOR"
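
    When the value is calculated by a Python script instead of a shell one-liner, the same mechanism can be used by appending to the file that GITHUB_OUTPUT points to. A minimal sketch, where set_output is a hypothetical helper name and not part of any GitHub library:

```python
import os


def set_output(name: str, value: str) -> None:
    """Append a name=value pair to the file GitHub Actions reads step outputs from."""
    # GITHUB_OUTPUT is set by the runner; outside of a workflow it will be
    # missing and this lookup raises KeyError.
    output_path = os.environ["GITHUB_OUTPUT"]
    with open(output_path, "a", encoding="utf-8") as output_file:
        output_file.write(f"{name}={value}\n")
```

    Calling set_output("SELECTED_COLOR", "green") from a step with id color-selector should make the value available as steps.color-selector.outputs.SELECTED_COLOR. The sketch only handles single-line values.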
    
  • New: Split a zip into parts of a restricted size.

    Something like:

    zip -9 myfile.zip *
    zipsplit -n 250000000 myfile.zip
    

    Would produce myfile1.zip, myfile2.zip, etc., all independent of each other, and none larger than 250MB (in powers of ten). zipsplit will even try to organize the contents so that each resulting archive is as close as possible to the maximum size.

  • New: Find files that were modified between dates.

    The best option is the -newerXY test of find, using m as X and t as Y:

    • m: the modification time of the file is the timestamp being compared.
    • t: the reference is interpreted directly as a time.

    So the solution is:

    find . -type f -newermt 20111222 \! -newermt 20111225
    

    The lower bound is inclusive and the upper bound is exclusive, so I added 1 day to it. The search is recursive.
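
    If you need the same filter inside a Python script, a rough equivalent can be written with pathlib. This is a sketch and not a drop-in replacement for find: modified_between is a made-up name, it treats the range as the half-open interval described above, and unlike -type f it follows symlinks that point to regular files.

```python
from datetime import datetime
from pathlib import Path
from typing import List


def modified_between(root: Path, start: datetime, end: datetime) -> List[Path]:
    """Return regular files under root whose mtime falls in [start, end)."""
    lower, upper = start.timestamp(), end.timestamp()
    return sorted(
        path
        for path in root.rglob("*")  # rglob walks the tree recursively
        if path.is_file() and lower <= path.stat().st_mtime < upper
    )


if __name__ == "__main__":
    # Same range as the find command: from 2011-12-22 up to, not including, 2011-12-25.
    for path in modified_between(Path("."), datetime(2011, 12, 22), datetime(2011, 12, 25)):
        print(path)
```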

Magic keys

  • New: Introduce the Magic Keys.

    The magic SysRq key is a key combination understood by the Linux kernel, which allows the user to perform various low-level commands regardless of the system's state. It is often used to recover from freezes, or to reboot a computer without corrupting the filesystem.[1] Its effect is similar to the computer's hardware reset button (or power switch) but with many more options and much more control.

    This key combination provides access to powerful features for software development and disaster recovery. In this sense, it can be considered a form of escape sequence. Principal among the offered commands are means to forcibly unmount file systems, kill processes, recover keyboard state, and write unwritten data to disk. With respect to these tasks, this feature serves as a tool of last resort.

    The magic SysRq key cannot work under certain conditions, such as a kernel panic[2] or a hardware failure preventing the kernel from running properly.

    The key combination consists of Alt+Sys Req and another key, which controls the command issued.

    On some devices, notably laptops, the Fn key may need to be pressed to use the magic SysRq key.

    Reboot the machine

    A common use of the magic SysRq key is to perform a safe reboot of a Linux computer which has otherwise locked up (abbr. REISUB). This can prevent a fsck being required on reboot and gives some programs a chance to save emergency backups of unsaved work. The QWERTY (or AZERTY) mnemonics: "Raising Elephants Is So Utterly Boring", "Reboot Even If System Utterly Broken" or simply the word "BUSIER" read backwards, are often used to remember the following SysRq-keys sequence:

    • unRaw (take control of keyboard back from X),
    • tErminate (send SIGTERM to all processes, allowing them to terminate gracefully),
    • kIll (send SIGKILL to all processes, forcing them to terminate immediately),
    • Sync (flush data to disk),
    • Unmount (remount all filesystems read-only),
    • reBoot.

    When magic SysRq keys are used to kill a frozen graphical program, the program has no chance to restore text mode. This can make everything unreadable. The commands textmode (part of SVGAlib) and the reset command can restore text mode and make the console readable again.

    On distributions that do not include a textmode executable, the key command Ctrl+Alt+F1 may sometimes be able to force a return to a text console. (Use F1, F2, F3,..., F(n), where n is the highest number of text consoles set up by the distribution. Ctrl+Alt+F(n+1) would normally be used to reenter GUI mode on a system on which the X server has not crashed.)

    Interact with the sysrq through the command line

    It can also be used by echoing letters to /proc/sysrq-trigger. For example, to trigger a system crash and take a crashdump you can run:

    echo c > /proc/sysrq-trigger
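
    The REISUB letters listed earlier can be sent through the same file. The following Python sketch shows the idea; send_sysrq and the REISUB mapping are names made up for illustration, and writing to the real trigger file needs root. Be careful: sending "b" reboots the machine immediately.

```python
from pathlib import Path
from typing import Union

# Letters of the REISUB sequence, in the order they should be sent.
REISUB = {
    "r": "unRaw",
    "e": "tErminate",
    "i": "kIll",
    "s": "Sync",
    "u": "Unmount",
    "b": "reBoot",
}


def send_sysrq(letter: str, trigger: Union[str, Path] = "/proc/sysrq-trigger") -> None:
    """Write a single SysRq command letter to the trigger file."""
    if len(letter) != 1 or not letter.isalnum():
        raise ValueError(f"expected one command letter, got {letter!r}")
    Path(trigger).write_text(letter)
```

    The trigger argument exists so that the function can be pointed at a scratch file while testing, instead of at the kernel interface.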
    

Pass

  • New: Add rofi launcher.

    pass is a command line password store.

    Configure rofi launcher

    • Save this script somewhere in your $PATH
    • Configure your window manager to launch it whenever you need a password.

rofi

  • New: Introduce Rofi.

    Rofi is a window switcher, application launcher and dmenu replacement.

    Installation

    sudo apt-get install rofi
    

    Usage

    To launch rofi directly in a certain mode, specify a mode with rofi -show <mode>. To show the run dialog:

    rofi -show run
    

    Or get the options from a script:

    ~/my_script.sh | rofi -dmenu
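
    The same pipe can be driven from Python, which is handy when the options are computed by a program. A minimal sketch, assuming rofi is installed; pick is a hypothetical helper name and the command argument is only there so that another dmenu-style program can be swapped in:

```python
import subprocess
from typing import Optional, Sequence


def pick(options: Sequence[str], command: Sequence[str] = ("rofi", "-dmenu")) -> Optional[str]:
    """Feed options to a dmenu-style command and return the chosen line, if any."""
    result = subprocess.run(
        list(command),
        input="\n".join(options),  # one option per line on stdin
        capture_output=True,
        text=True,
        check=False,
    )
    choice = result.stdout.strip()
    # A non-zero exit status (for example when the menu is dismissed) yields no choice.
    return choice if result.returncode == 0 and choice else None
```

    For example, pick(["red", "green", "blue"]) opens the menu with three entries and returns the selected one, or None if nothing was selected.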
    

    The -modes option takes an ordered, comma-separated list of modes to enable. Enabled modes can be changed at runtime (the default key is Ctrl+Tab). If no modes are specified, all configured modes are enabled. To only show the run and ssh launchers:

    rofi -modes "run,ssh" -show run
    

    The -combi-modes option sets the modes to combine in combi mode; its syntax is the same as for -modes. To get a single merged view of window, run and ssh:

    rofi -show combi -combi-modes "window,run,ssh" -modes combi
    

    Configuration

    The configuration lives at ~/.config/rofi/config.rasi. To create this file with the default configuration run:

    rofi -dump-config > ~/.config/rofi/config.rasi
    

    Use fzf to do the matching

    To run once:

    rofi -show run -sorting-method fzf -matching fuzzy
    

    To persist them, change those same values in the configuration.

    Theme changing

    To change the theme:

    • Choose the one you like most looking here
    • Run rofi-theme-selector to select it
    • Accept it with Alt + a

    Keybindings change

    Plugins

    You can write your own custom plugins. If you're using Python, python-rofi seems to be the best option, although it looks unmaintained.

    Some interesting examples are:

    Other interesting references are:

    References

    • Source
    • Docs
    • Plugins

Android

Signal

  • New: Use the Molly FOSS android client.

    Molly is an independent Signal fork for Android. The advantages are:

    • Contains no proprietary blobs, unlike Signal.
    • Protects database with passphrase encryption.
    • Locks down the app automatically when you are gone for a set period of time.
    • Securely shreds sensitive data from RAM.
    • Automatic backups on a daily or weekly basis.
    • Supports SOCKS proxy and Tor via Orbot.

    Migrate from Signal

    Note, the migration should be done when the available Molly version is equal to or later than the currently installed Signal app version.

    1. Verify your Signal backup passphrase. In the Signal app: Settings > Chats > Chat backups > Verify backup passphrase.
    2. Optionally, put your phone offline (enable airplane mode or disable data services) until after Signal is uninstalled in step 4. This will prevent the possibility of losing any Signal messages that are received during or after the backup is created.
    3. Create a Signal backup. In the Signal app, go to Settings > Chats > Chat backups > Create backup.
    4. Uninstall the Signal app. Now you can put your phone back online (disable airplane mode or re-enable data services).
    5. Install the Molly or Molly-FOSS app.
    6. Open the Molly app. Enable database encryption if desired. As soon as the option is given, tap Transfer or restore account. Answer any permissions questions.
    7. Choose to Restore from backup and tap Choose backup. Navigate to your Signal backup location (Signal/Backups/, by default) and choose the backup that was created in step 3.
    8. Check the backup details and then tap Restore backup to confirm. Enter the backup passphrase when requested.
    9. If asked, choose a new folder for backup storage. Or choose Not Now and do it later.

    Consider also:

    • Any previously linked devices will need to be re-linked. Go to Settings > Linked devices in the Molly app. If Signal Desktop is not detecting that it is no longer linked, try restarting it.
    • Verify your Molly backup settings and passphrase at Settings > Chats > Chat backups (to change the backup folder, disable and then enable backups). Tap Create backup to create your first Molly backup.
    • When you are satisfied that Molly is working, you may want to delete the old Signal backups (in Signal/Backups, by default).

Other