AlertManager

The Alertmanager handles alerts sent by client applications such as the Prometheus server. It takes care of deduplicating, grouping, and routing them to the correct receiver integrations such as email, PagerDuty, or OpsGenie. It also takes care of silencing and inhibition of alerts.

Configuration

It is configured through the alertmanager.config key of the Helm chart's values.yaml, or through the alertmanager.yaml file if you're using docker-compose.

As stated in the configuration file, it has four main keys (templates is handled separately, in alertmanager.config.templateFiles):

  • global: Main SMTP and API configuration, inherited by the other elements.
  • route: Route tree definition.
  • receivers: Notification integrations configuration.
  • inhibit_rules: Alert inhibition configuration.

Receivers

Notification receivers are the named configurations of one or more notification integrations.

Null receiver

Useful for discarding alerts that shouldn't be inhibited but that you don't want to receive. A receiver with no integrations sends no notifications.

receivers:
  - name: 'null'

Email notifications

To configure email notifications, set up the following in your config:

  config:
    global:
      smtp_from: {{ from_email_address }}
      smtp_smarthost: {{ smtp_server_endpoint }}:{{ smtp_server_port }}
      smtp_auth_username: {{ smtp_authentication_username }}
      smtp_auth_password: {{ smtp_authentication_password }}
    receivers:
    - name: 'email'
      email_configs:
        - to: {{ receiver_email }}
          send_resolved: true

If you need to set smtp_auth_username and smtp_auth_password, you should set their values using helm secrets.

send_resolved, which is false by default for email_configs, defines whether to notify about resolved alerts.

Rocketchat notifications

See pavel-kazhavets' AlertmanagerRocketChat repo for the updated rules.

In RocketChat:

  • Log in as an admin user and go to: Administration => Integrations => New Integration => Incoming WebHook.
  • Set "Enabled" and "Script Enabled" to "True".
  • Set the channel, icons, and other options as you need.
  • Paste the contents of the official AlertmanagerIntegrations.js or my version into the Script field.
AlertmanagerIntegrations.js
class Script {
    process_incoming_request({
        request
    }) {
        console.log(request.content);

        var alertColor = "warning";
        if (request.content.status == "resolved") {
            alertColor = "good";
        } else if (request.content.status == "firing") {
            alertColor = "danger";
        }

        let finFields = [];
        for (let i = 0; i < request.content.alerts.length; i++) {
            var endVal = request.content.alerts[i];
            var elem = {
                title: "alertname: " + endVal.labels.alertname,
                value: "*instance:* " + endVal.labels.instance,
                short: false
            };

            finFields.push(elem);

            if (!!endVal.annotations.summary) {
                finFields.push({
                    title: "summary",
                    value: endVal.annotations.summary
                });
            }

            if (!!endVal.labels.severity) {
                finFields.push({
                    title: "severity",
                    value: endVal.labels.severity
                });
            }

            if (!!endVal.annotations.grafana) {
                finFields.push({
                    title: "grafana",
                    value: endVal.annotations.grafana
                });
            }

            if (!!endVal.annotations.prometheus) {
                finFields.push({
                    title: "prometheus",
                    value: endVal.annotations.prometheus
                });
            }

            if (!!endVal.annotations.message) {
                finFields.push({
                    title: "message",
                    value: endVal.annotations.message
                });
            }

            if (!!endVal.annotations.description) {
                finFields.push({
                    title: "description",
                    value: endVal.annotations.description
                });
            }
        }

        // Post a single attachment that carries every collected field.
        return {
            content: {
                username: "Prometheus Alert",
                attachments: [{
                    color: alertColor,
                    title_link: request.content.externalURL,
                    title: "Prometheus notification",
                    fields: finFields
                }]
            }
        };
    }
}
  • Create Integration. The field Webhook URL will appear in the Integration configuration.

In Alertmanager:

  • Create a new receiver or modify the config of an existing one. You'll need to add webhook_configs to it. Small example:
route:
    repeat_interval: 30m
    group_interval: 30m
    receiver: 'rocketchat'

receivers:
    - name: 'rocketchat'
      webhook_configs:
          - send_resolved: false
            url: '${WEBHOOK_URL}'
  • Reload/restart alertmanager.

In order to test the webhook you can use the following curl (replace {{ webhook-url }}):

curl -X POST -H 'Content-Type: application/json' --data '
{
  "text": "Example message",
  "attachments": [
    {
      "title": "Rocket.Chat",
      "title_link": "https://rocket.chat",
      "text": "Rocket.Chat, the best open source chat",
      "image_url": "https://rocket.chat/images/mockup.png",
      "color": "#764FA5"
    }
  ],
  "status": "firing",
  "alerts": [
    {
      "labels": {
        "alertname": "high_load",
        "severity": "major",
        "instance": "node-exporter:9100"
      },
      "annotations": {
        "message": "node-exporter:9100 of job xxxx is under high load.",
        "summary": "node-exporter:9100 under high load."
      }
    }
  ]
}
' {{ webhook-url }}
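
The same test can be sent from Python with only the standard library. This is a minimal sketch, not part of the original integration: build_test_payload and send_test_alert are hypothetical helper names, and the URL you pass must be the Webhook URL that Rocket.Chat generated for your integration.

```python
import json
import urllib.request


def build_test_payload(alertname="high_load", instance="node-exporter:9100"):
    """Return a payload with the fields the integration script reads."""
    return {
        "status": "firing",
        "alerts": [
            {
                "labels": {
                    "alertname": alertname,
                    "severity": "major",
                    "instance": instance,
                },
                "annotations": {
                    "summary": f"{instance} under high load.",
                    "message": f"{instance} of job xxxx is under high load.",
                },
            }
        ],
    }


def send_test_alert(webhook_url, payload):
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Calling send_test_alert(webhook_url, build_test_payload()) with your own URL should post one red "firing" attachment to the configured channel if the script is enabled.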

Route

A route block defines a node in a routing tree and its children. Its optional configuration parameters are inherited from its parent node if not set.

Every alert enters the routing tree at the configured top-level route, which must match all alerts (i.e. not have any configured matchers). It then traverses the child nodes. If continue is set to false, it stops after the first matching child. If continue is true on a matching node, the alert will continue matching against subsequent siblings. If an alert does not match any children of a node (no matching child nodes, or none exist), the alert is handled based on the configuration parameters of the current node.

A basic configuration would be:

route:
  group_by: [job, alertname, severity]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'email'
  routes:
    - match:
        alertname: Watchdog
      receiver: 'null'
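
The traversal described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Alertmanager's implementation: the Route class and resolve function are hypothetical names, only equality matchers are modelled, every node carries an explicit receiver instead of inheriting it, and HostOutOfMemory is just an example alert name.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    receiver: str
    match: dict = field(default_factory=dict)   # label -> required value
    continue_: bool = False                      # mirrors the `continue` option
    routes: list = field(default_factory=list)  # child routes, in config order

    def matches(self, labels):
        # An empty match (the top-level route) matches every alert.
        return all(labels.get(k) == v for k, v in self.match.items())


def resolve(route, labels):
    """Return the receivers of the deepest routes that handle the alert."""
    if not route.matches(labels):
        return []
    receivers = []
    for child in route.routes:
        found = resolve(child, labels)
        receivers.extend(found)
        # Stop at the first matching child unless it sets continue: true.
        if found and not child.continue_:
            break
    # No child matched: the current node handles the alert.
    return receivers or [route.receiver]


# The basic configuration above, expressed with the hypothetical Route class.
root = Route(
    receiver="email",
    routes=[Route(receiver="null", match={"alertname": "Watchdog"})],
)
print(resolve(root, {"alertname": "Watchdog"}))         # ['null']
print(resolve(root, {"alertname": "HostOutOfMemory"}))  # ['email']
```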

Inhibit rules

Inhibit rules define which alerts triggered by Prometheus shouldn't be forwarded to the notification integrations. For example, the Watchdog alert is meant to test that everything works as expected, but is not meant to be used by the users. Similarly, if you are using EKS, you'll probably have a KubeVersionMismatch alert, because Kubernetes allows a certain version skew between its components, so the alert is stricter than the Kubernetes policy.

To disable both alerts, set a match rule in config.inhibit_rules:

  config:
    inhibit_rules:
      - target_match:
          alertname: Watchdog
      - target_match:
          alertname: KubeVersionMismatch

Inhibit rules between times

To prevent some alerts from being sent during certain hours, you can use the time_intervals Alertmanager configuration.

This can be useful, for example, if your backup system triggers some alerts that you don't need to act on.

# See route configuration at https://prometheus.io/docs/alerting/latest/configuration/#route
route:
  receiver: 'email'
  group_by: [job, alertname, severity]
  group_wait: 5m
  group_interval: 5m
  repeat_interval: 12h
  routes:
    - receiver: 'email'
      matchers:
        - alertname =~ "HostCpuHighIowait|HostContextSwitching|HostUnusualDiskWriteRate"
        - hostname = backup_server
      mute_time_intervals:
        - night
time_intervals:
  - name: night
    time_intervals:
      - times:
          - start_time: 02:00
            end_time: 07:00

The values of time_intervals can be:

- times:
  [ - <time_range> ...]
  weekdays:
  [ - <weekday_range> ...]
  days_of_month:
  [ - <days_of_month_range> ...]
  months:
  [ - <month_range> ...]
  years:
  [ - <year_range> ...]
  location: <string>

All fields are lists. Within each non-empty list, at least one element must be satisfied to match the field. If a field is left unspecified, any value will match the field. For an instant of time to match a complete time interval, all fields must match. Some fields support ranges and negative indices, and are detailed below. If a time zone is not specified, then the times are taken to be in UTC.
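
The matching rule in the previous paragraph (any element within a list, all lists together) can be sketched as follows. This is an illustrative model only: interval_matches is a hypothetical function, ranges are given as already-parsed (low, high) number pairs, and only the weekdays, months, and years fields are covered.

```python
from datetime import datetime


def interval_matches(interval, instant):
    """interval maps a field name to a list of inclusive (low, high) ranges."""
    observed = {
        "weekdays": (instant.weekday() + 1) % 7,  # Sunday = 0 ... Saturday = 6
        "months": instant.month,
        "years": instant.year,
    }
    for name, value in observed.items():
        ranges = interval.get(name)
        if not ranges:
            continue  # unspecified field: any value matches
        if not any(low <= value <= high for low, high in ranges):
            return False  # no element of this list is satisfied
    return True  # every specified field matched


# Monday to Friday, December only; years left unspecified.
workdays_in_december = {"weekdays": [(1, 5)], "months": [(12, 12)]}
print(interval_matches(workdays_in_december, datetime(2024, 12, 25)))  # True
print(interval_matches(workdays_in_december, datetime(2024, 12, 28)))  # False
```

The second call returns False because 28 December 2024 is a Saturday: the months field matches, but the weekdays field does not, and all fields must match.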

  • time_range: Ranges inclusive of the starting time and exclusive of the end time to make it easy to represent times that start/end on hour boundaries. For example, start_time: '17:00' and end_time: '24:00' will begin at 17:00 and finish immediately before 24:00. They are specified like so:

    times:
    - start_time: HH:MM
      end_time: HH:MM
    

  • weekday_range: A list of days of the week, where the week begins on Sunday and ends on Saturday. Days should be specified by name (e.g. 'Sunday'). For convenience, ranges are also accepted of the form <start_day>:<end_day> and are inclusive on both ends. For example: ['monday:wednesday','saturday', 'sunday']

  • days_of_month_range: A list of numerical days in the month. Days begin at 1. Negative values are also accepted which begin at the end of the month, e.g. -1 during January would represent January 31. For example: ['1:5', '-3:-1']. Extending past the start or end of the month will cause it to be clamped. E.g. specifying ['1:31'] during February will clamp the actual end date to 28 or 29 depending on leap years. Inclusive on both ends.

  • month_range: A list of calendar months identified by a case-insensitive name (e.g. 'January') or by number, where January = 1. Ranges are also accepted. For example, ['1:3', 'may:august', 'december']. Inclusive on both ends.

  • year_range: A numerical list of years. Ranges are accepted. For example, ['2020:2022', '2030']. Inclusive on both ends.

  • location: A string that matches a location in the IANA time zone database. For example, 'Australia/Sydney'. The location provides the time zone for the time interval. For example, a time interval with a location of 'Australia/Sydney' that contained something like:

times:
- start_time: 09:00
  end_time: 17:00
weekdays: ['monday:friday']

Would include any time that fell between the hours of 9:00AM and 5:00PM, between Monday and Friday, using the local time in Sydney, Australia. You may also use 'Local' as a location to use the local time of the machine where Alertmanager is running, or 'UTC' for UTC time. If no timezone is provided, the time interval is taken to be in UTC.
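
The boundary rules for time_range and days_of_month_range are easy to get wrong, so here is a small sketch of them. It is an approximation written for illustration: parse_minutes, in_time_range, and in_days_of_month are hypothetical helpers, and the instant is assumed to be already expressed in the interval's location.

```python
import calendar
from datetime import datetime


def parse_minutes(hhmm):
    """Convert 'HH:MM' to minutes since midnight; '24:00' becomes 1440."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def in_time_range(instant, start_time, end_time):
    """The start is inclusive, the end is exclusive."""
    now = instant.hour * 60 + instant.minute
    return parse_minutes(start_time) <= now < parse_minutes(end_time)


def in_days_of_month(instant, low, high):
    """Both ends inclusive; negative days count back from the month's end."""
    last = calendar.monthrange(instant.year, instant.month)[1]

    def resolve(day):
        if day < 0:
            day = last + 1 + day  # -1 is the last day of the month
        return max(1, min(day, last))  # clamp to the real month length

    return resolve(low) <= instant.day <= resolve(high)


print(in_time_range(datetime(2024, 1, 1, 9, 0), "09:00", "17:00"))   # True
print(in_time_range(datetime(2024, 1, 1, 17, 0), "09:00", "17:00"))  # False
print(in_days_of_month(datetime(2024, 1, 31), -3, -1))               # True
print(in_days_of_month(datetime(2023, 2, 28), 1, 31))                # True
```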

If that doesn't work for you, you can use the sleep peacefully guidelines to tackle it at the query level.

Alert rules

Alert rules are a special kind of Prometheus rules that trigger alerts based on PromQL expressions. People have gathered several examples under Awesome prometheus alert rules.

Alerts must be configured in the Prometheus configuration, either through the operator Helm chart under additionalPrometheusRulesMap, or in the prometheus.yml file. For example:

additionalPrometheusRulesMap:
  - groups:
      - name: alert-rules
        rules:
          - alert: BlackboxProbeFailed
            expr: probe_success == 0
            for: 5m
            labels:
              severity: error
            annotations:
              summary: "Blackbox probe failed (instance {{ $labels.target }})"
              description: "Probe failed\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

If you are using prometheus.yml directly, you also need to configure the alerting:

alerting:
  alertmanagers:
    - scheme: http
      static_configs:
        - targets: [ 'alertmanager:9093' ]

Silences

To silence an alert with a regular expression use the matcher alertname=~".*Condition".
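
Alertmanager regex matchers are anchored, so the pattern must match the whole label value. The sketch below models that with Python's re.fullmatch. The regex_matcher helper and the alert names are hypothetical, and Alertmanager uses Go's RE2 syntax, which differs from Python's re in some details.

```python
import re


def regex_matcher(pattern):
    """Build a predicate for a `label=~"pattern"` matcher."""
    compiled = re.compile(pattern)
    # fullmatch models the anchoring: the whole label value must match.
    return lambda value: compiled.fullmatch(value) is not None


silenced = regex_matcher(".*Condition")
print(silenced("DiskCondition"))         # True
print(silenced("DiskConditionCleared"))  # False
```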
