# AlertManager
The Alertmanager handles alerts sent by client applications such as the Prometheus server. It takes care of deduplicating, grouping, and routing them to the correct receiver integrations such as email, PagerDuty, or OpsGenie. It also takes care of silencing and inhibition of alerts.
## Configuration
It is configured through the `alertmanager.config` key of the `values.yaml` of the helm chart, or the `alertmanager.yaml` file if you're using docker-compose.
As stated in the configuration file, it has four main keys (`templates` is handled in `alertmanager.config.templateFiles`):

- `global`: SMTP and API main configuration, which will be inherited by the other elements.
- `route`: Route tree definition.
- `receivers`: Notification integrations configuration.
- `inhibit_rules`: Alert inhibition configuration.
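For orientation, this is a minimal sketch of how those keys fit together under `alertmanager.config` in the helm chart's `values.yaml` (all addresses and receiver names below are placeholders, not taken from a real setup):

```yaml
config:
  global:
    smtp_from: 'alertmanager@example.org'    # placeholder sender address
    smtp_smarthost: 'smtp.example.org:587'   # placeholder SMTP endpoint
  route:
    receiver: 'email'                        # default receiver of the routing tree
    routes: []
  receivers:
    - name: 'email'
      email_configs:
        - to: 'oncall@example.org'           # placeholder destination
  inhibit_rules: []
```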
## Receivers
Notification receivers are the named configurations of one or more notification integrations.
### Null receiver
Useful to discard alerts that you don't want to be notified about without having to inhibit them:
```yaml
receivers:
  - name: 'null'
```
### Email notifications
To configure email notifications, set up the following in your `config`:
```yaml
config:
  global:
    smtp_from: {{ from_email_address }}
    smtp_smarthost: {{ smtp_server_endpoint }}:{{ smtp_server_port }}
    smtp_auth_username: {{ smtp_authentication_username }}
    smtp_auth_password: {{ smtp_authentication_password }}
  receivers:
    - name: 'email'
      email_configs:
        - to: {{ receiver_email }}
          send_resolved: true
```
If you need to set `smtp_auth_username` and `smtp_auth_password`, you should set their values using helm secrets.

`send_resolved`, set to `false` by default, defines whether or not to notify about resolved alerts.
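Once the receiver is configured, you can check that mail actually goes out by pushing a synthetic alert with `amtool`; a sketch, assuming Alertmanager is reachable on `localhost:9093`:

```bash
# Fire a disposable test alert so the default route (and the email receiver) is exercised
amtool alert add alertname=TestEmailNotification severity=warning \
  --annotation=summary="Test alert to verify email delivery" \
  --alertmanager.url=http://localhost:9093
```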
### Rocketchat Notifications
Go to the pavel-kazhavets AlertmanagerRocketChat repo for the updated version of the integration script.
In RocketChat:
- Login as admin user and go to: Administration => Integrations => New Integration => Incoming WebHook.
- Set "Enabled" and "Script Enabled" to "True".
- Set the channel, icons, etc. as you need.
- Paste the contents of the official AlertmanagerIntegrations.js or the version below into the Script field.
AlertmanagerIntegrations.js:

```js
class Script {
    process_incoming_request({ request }) {
        console.log(request.content);

        // Pick the attachment color based on the overall alert status
        var alertColor = "warning";
        if (request.content.status == "resolved") {
            alertColor = "good";
        } else if (request.content.status == "firing") {
            alertColor = "danger";
        }

        if (request.content && request.content.alerts) {
            // Build one field per alert, plus optional fields for the known annotations
            let finFields = [];
            for (let i = 0; i < request.content.alerts.length; i++) {
                var endVal = request.content.alerts[i];
                finFields.push({
                    title: "alertname: " + endVal.labels.alertname,
                    value: "*instance:* " + endVal.labels.instance,
                    short: false
                });

                if (!!endVal.annotations.summary) {
                    finFields.push({ title: "summary", value: endVal.annotations.summary });
                }
                if (!!endVal.labels.severity) {
                    finFields.push({ title: "severity", value: endVal.labels.severity });
                }
                if (!!endVal.annotations.grafana) {
                    finFields.push({ title: "grafana", value: endVal.annotations.grafana });
                }
                if (!!endVal.annotations.prometheus) {
                    finFields.push({ title: "prometheus", value: endVal.annotations.prometheus });
                }
                if (!!endVal.annotations.message) {
                    finFields.push({ title: "message", value: endVal.annotations.message });
                }
                if (!!endVal.annotations.description) {
                    finFields.push({ title: "description", value: endVal.annotations.description });
                }
            }

            return {
                content: {
                    username: "Prometheus Alert",
                    attachments: [{
                        color: alertColor,
                        title_link: request.content.externalURL,
                        title: "Prometheus notification",
                        fields: finFields
                    }]
                }
            };
        }

        // Fallback when the payload doesn't look like an Alertmanager webhook
        return {
            error: {
                success: false
            }
        };
    }
}
```
- Create the Integration. The `Webhook URL` field will appear in the Integration configuration.
In Alertmanager:
- Create a new receiver or modify the config of an existing one. You'll need to add `webhook_configs` to it. Small example:
```yaml
route:
  repeat_interval: 30m
  group_interval: 30m
  receiver: 'rocketchat'

receivers:
  - name: 'rocketchat'
    webhook_configs:
      - send_resolved: false
        url: '${WEBHOOK_URL}'
```
- Reload/restart alertmanager.
In order to test the webhook you can use the following curl (replace `{{ webhook-url }}`):
```bash
curl -X POST -H 'Content-Type: application/json' --data '
{
  "text": "Example message",
  "attachments": [
    {
      "title": "Rocket.Chat",
      "title_link": "https://rocket.chat",
      "text": "Rocket.Chat, the best open source chat",
      "image_url": "https://rocket.chat/images/mockup.png",
      "color": "#764FA5"
    }
  ],
  "status": "firing",
  "alerts": [
    {
      "labels": {
        "alertname": "high_load",
        "severity": "major",
        "instance": "node-exporter:9100"
      },
      "annotations": {
        "message": "node-exporter:9100 of job xxxx is under high load.",
        "summary": "node-exporter:9100 under high load."
      }
    }
  ]
}
' {{ webhook-url }}
```
## Route
A route block defines a node in a routing tree and its children. Its optional configuration parameters are inherited from its parent node if not set.
Every alert enters the routing tree at the configured top-level route, which must match all alerts (i.e. not have any configured matchers). It then traverses the child nodes. If `continue` is set to `false`, it stops after the first matching child. If `continue` is `true` on a matching node, the alert will continue matching against subsequent siblings. If an alert does not match any children of a node (no matching child nodes, or none exist), the alert is handled based on the configuration parameters of the current node.
A basic configuration would be:
```yaml
route:
  group_by: [job, alertname, severity]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'email'
  routes:
    - match:
        alertname: Watchdog
      receiver: 'null'
```
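As a sketch of the `continue` behaviour described above (the receiver names reuse the ones from the previous examples and are assumptions): with `continue: true` a matching alert keeps traversing its sibling routes, so a critical alert below is delivered both to `rocketchat` and to `email`.

```yaml
route:
  receiver: 'email'
  routes:
    - match:
        severity: critical
      receiver: 'rocketchat'
      continue: true          # keep evaluating the sibling routes after this match
    - match:
        severity: critical
      receiver: 'email'
```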
## Inhibit rules
Inhibit rules define which alerts triggered by Prometheus shouldn't be forwarded to the notification integrations. For example, the `Watchdog` alert is meant to test that everything works as expected, but it's not meant to reach the users. Similarly, if you are using EKS, you'll probably have a `KubeVersionMismatch` alert, because Kubernetes allows a certain version skew between its components, so the alert is stricter than the Kubernetes policy.
To disable both alerts, set a `target_match` rule in `config.inhibit_rules`:
```yaml
config:
  inhibit_rules:
    - target_match:
        alertname: Watchdog
    - target_match:
        alertname: KubeVersionMismatch
```
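Inhibit rules can also express the more classical pattern of muting lower-severity alerts while a related critical one is firing; a sketch, assuming `severity` labels like the ones used elsewhere in this article:

```yaml
config:
  inhibit_rules:
    # Mute warning alerts for an instance while a critical alert
    # with the same alertname is already firing on that instance
    - source_match:
        severity: critical
      target_match:
        severity: warning
      equal: ['alertname', 'instance']
```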
### Inhibit rules between times
To prevent some alerts from being sent during certain hours you can use the `time_intervals` Alertmanager configuration. This can be useful, for example, if your backup system triggers some alerts that you don't need to act on.
```yaml
# See route configuration at https://prometheus.io/docs/alerting/latest/configuration/#route
route:
  receiver: 'email'
  group_by: [job, alertname, severity]
  group_wait: 5m
  group_interval: 5m
  repeat_interval: 12h
  routes:
    - receiver: 'email'
      matchers:
        - alertname =~ "HostCpuHighIowait|HostContextSwitching|HostUnusualDiskWriteRate"
        - hostname = backup_server
      mute_time_intervals:
        - night

time_intervals:
  - name: night
    time_intervals:
      - times:
          - start_time: 02:00
            end_time: 07:00
```
The values of `time_intervals` can be:

```yaml
- times:
  [ - <time_range> ...]
  weekdays:
  [ - <weekday_range> ...]
  days_of_month:
  [ - <days_of_month_range> ...]
  months:
  [ - <month_range> ...]
  years:
  [ - <year_range> ...]
  location: <string>
```
All fields are lists. Within each non-empty list, at least one element must be satisfied to match the field. If a field is left unspecified, any value will match the field. For an instant of time to match a complete time interval, all fields must match. Some fields support ranges and negative indices, and are detailed below. If a time zone is not specified, then the times are taken to be in UTC.
- `time_range`: Ranges inclusive of the starting time and exclusive of the end time to make it easy to represent times that start/end on hour boundaries. For example, `start_time: '17:00'` and `end_time: '24:00'` will begin at 17:00 and finish immediately before 24:00. They are specified like so:

    ```yaml
    times:
      - start_time: HH:MM
        end_time: HH:MM
    ```

- `weekday_range`: A list of days of the week, where the week begins on Sunday and ends on Saturday. Days should be specified by name (e.g. 'Sunday'). For convenience, ranges are also accepted of the form `<start_day>:<end_day>` and are inclusive on both ends. For example: `['monday:wednesday', 'saturday', 'sunday']`.
- `days_of_month_range`: A list of numerical days in the month. Days begin at `1`. Negative values are also accepted which begin at the end of the month, e.g. `-1` during January would represent January 31. For example: `['1:5', '-3:-1']`. Extending past the start or end of the month will cause it to be clamped, e.g. specifying `['1:31']` during February will clamp the actual end date to 28 or 29 depending on leap years. Inclusive on both ends.
- `month_range`: A list of calendar months identified by a case-insensitive name (e.g. 'January') or by number, where `January = 1`. Ranges are also accepted. For example, `['1:3', 'may:august', 'december']`. Inclusive on both ends.
- `year_range`: A numerical list of years. Ranges are accepted. For example, `['2020:2022', '2030']`. Inclusive on both ends.
- `location`: A string that matches a location in the IANA time zone database. For example, `'Australia/Sydney'`. The location provides the time zone for the time interval. For example, a time interval with a location of `'Australia/Sydney'` that contained something like:
```yaml
times:
  - start_time: 09:00
    end_time: 17:00
weekdays: ['monday:friday']
```
Would include any time that fell between the hours of 9:00AM and 5:00PM, between Monday and Friday, using the local time in Sydney, Australia. You may also use 'Local'
as a location to use the local time of the machine where Alertmanager is running, or 'UTC'
for UTC time. If no timezone is provided, the time interval is taken to be in UTC.
If that doesn't work for you, you can use the sleep peacefully guidelines to tackle it at query level.
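In any case, it's a good idea to validate the resulting configuration (including the `time_intervals` definitions) before reloading Alertmanager; `amtool` can do that, assuming the file is called `alertmanager.yaml`:

```bash
# Check the configuration file for syntax and schema errors
amtool check-config alertmanager.yaml
```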
## Alert rules
Alert rules are a special kind of Prometheus rules that trigger alerts based on PromQL expressions. People have gathered several examples under Awesome prometheus alert rules.

Alerts must be configured in the Prometheus configuration, either through the operator helm chart, under the `additionalPrometheusRulesMap` key, or in the `prometheus.yml` file. For example:
```yaml
additionalPrometheusRulesMap:
  # Arbitrary name for this set of rules
  custom-alert-rules:
    groups:
      - name: alert-rules
        rules:
          - alert: BlackboxProbeFailed
            expr: probe_success == 0
            for: 5m
            labels:
              severity: error
            annotations:
              summary: "Blackbox probe failed (instance {{ $labels.target }})"
              description: "Probe failed\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
```
Other examples of rules can be found in the Awesome prometheus alert rules collection mentioned above.
If you are using `prometheus.yml` directly, you also need to configure the alerting section:
```yaml
alerting:
  alertmanagers:
    - scheme: http
      static_configs:
        - targets: ['alertmanager:9093']
```
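When running Prometheus standalone like this, the alert rules themselves are loaded through `rule_files` in `prometheus.yml`, pointing at a separate rules file; a minimal sketch (the file name is an assumption):

```yaml
# prometheus.yml
rule_files:
  - 'alert_rules.yml'
```

```yaml
# alert_rules.yml
groups:
  - name: alert-rules
    rules:
      - alert: BlackboxProbeFailed
        expr: probe_success == 0
        for: 5m
```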
## Silences

To silence an alert with a regular expression use the matcher `alertname=~".*Condition"`.
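Silences can be created from the Alertmanager web UI or with `amtool`; a sketch using the regular expression matcher above (duration, comment, and URL are placeholders):

```bash
# Silence every alert whose name matches the regular expression for two hours
amtool silence add 'alertname=~".*Condition"' \
  --duration=2h \
  --comment="Maintenance window" \
  --alertmanager.url=http://localhost:9093
```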