Blackbox Exporter
The blackbox exporter allows blackbox probing of endpoints over HTTP, HTTPS, DNS, TCP and ICMP.
It can be used to test:
- Website accessibility, both for availability and security purposes.
- Website loading time.
- DNS response times to diagnose network latency issues.
- SSL certificate expiration.
- ICMP requests to gather network health information.
- Security protections, such as whether an endpoint stops being protected by a VPN, WAF or SSL client certificate.
- Unauthorized reads or writes of S3 buckets.
When running, the Blackbox exporter exposes an HTTP endpoint that can be used to monitor targets over the network. By default, it serves the probe metrics under the /probe endpoint.
The blackbox exporter is configured with a YAML configuration file made of modules.
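For example, once the exporter is running you can probe a target by hand against its default port 9115, selecting one of the configured modules:
curl "http://localhost:9115/probe?module=http_2xx&target=https://lyz-code.github.io/blue-book"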
Installation⚑
To install the exporter we'll use helmfile to install the stable/prometheus-blackbox-exporter chart.
Add the following lines to your helmfile.yaml:
- name: prometheus-blackbox-exporter
  namespace: monitoring
  chart: stable/prometheus-blackbox-exporter
  values:
    - prometheus-blackbox-exporter/values.yaml
Edit the chart values.
mkdir prometheus-blackbox-exporter
helm inspect values stable/prometheus-blackbox-exporter > prometheus-blackbox-exporter/values.yaml
vi prometheus-blackbox-exporter/values.yaml
Make sure to enable the serviceMonitor in the values and target at least one page:
serviceMonitor:
  enabled: true
  # Default values that will be used for all ServiceMonitors created by `targets`
  defaults:
    labels:
      release: prometheus-operator
    interval: 30s
    scrapeTimeout: 30s
    module: http_2xx
  targets:
    - name: lyz-code.github.io/blue-book
      url: https://lyz-code.github.io/blue-book
The release: prometheus-operator label must match the one your Prometheus instance is configured to select.
If you want to use the icmp probe, make sure to set allowIcmp: true in the chart values.
If you want to probe endpoints protected behind client SSL certificates, until this chart issue is solved you need to create the required secrets manually, as the Prometheus blackbox exporter helm chart doesn't yet create them.
kubectl create secret generic monitor-certificates \
--from-file=monitor.crt.pem \
--from-file=monitor.key.pem \
-n monitoring
Where monitor.crt.pem and monitor.key.pem are the SSL certificate and key for the monitor account.
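Then mount the secret into the exporter pod so the probe modules can read the files under /etc/secrets. A minimal sketch, assuming your chart version exposes the extraSecretMounts value (check the output of helm inspect values for the exact key):

extraSecretMounts:
  - name: monitor-certificates
    mountPath: /etc/secrets
    secretName: monitor-certificates
    readOnly: true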
I've found two Grafana dashboards for the blackbox exporter: 7587 didn't work straight out of the box, while 5345 did. Taking the grafana helm chart values as a reference, add the following yaml under the grafana key in the prometheus-operator values.yaml:
grafana:
  enabled: true
  defaultDashboardsEnabled: true
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: 'default'
          orgId: 1
          folder: ''
          type: file
          disableDeletion: false
          editable: true
          options:
            path: /var/lib/grafana/dashboards/default
  dashboards:
    default:
      blackbox-exporter:
        # Ref: https://grafana.com/dashboards/5345
        gnetId: 5345
        revision: 3
        datasource: Prometheus
And install.
helmfile diff
helmfile apply
Blackbox exporter probes⚑
Modules define how the blackbox exporter is going to query the endpoint, therefore one needs to be created for each request type under the config.modules section of the chart values. The modules are then referenced in the targets section for the desired endpoints:
targets:
  - name: lyz-code.github.io/blue-book
    url: https://lyz-code.github.io/blue-book
    module: https_2xx
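In the chart's values.yaml, the module definitions shown in the next sections live under the config.modules key, so for example the http_2xx module below ends up as:

config:
  modules:
    http_2xx:
      prober: http
      timeout: 5s
      http:
        valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
        valid_status_codes: [200]
        no_follow_redirects: false
        preferred_ip_protocol: "ip4"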
HTTP endpoint working correctly⚑
http_2xx:
  prober: http
  timeout: 5s
  http:
    valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
    valid_status_codes: [200]
    no_follow_redirects: false
    preferred_ip_protocol: "ip4"
HTTPS endpoint working correctly⚑
https_2xx:
  prober: http
  timeout: 5s
  http:
    method: GET
    fail_if_ssl: false
    fail_if_not_ssl: true
    valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
    valid_status_codes: [200]
    no_follow_redirects: false
    preferred_ip_protocol: "ip4"
HTTPS endpoint behind client SSL certificate⚑
https_client_2xx:
  prober: http
  timeout: 5s
  http:
    method: GET
    fail_if_ssl: false
    fail_if_not_ssl: true
    valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
    valid_status_codes: [200]
    no_follow_redirects: false
    preferred_ip_protocol: "ip4"
    tls_config:
      cert_file: /etc/secrets/monitor.crt.pem
      key_file: /etc/secrets/monitor.key.pem
Where the certificate files come from the secret created during the installation, mounted into the exporter pod at /etc/secrets.
HTTPS endpoint with a specific error⚑
If you don't want to configure the authentication, for example for an API, you can check for the expected error instead.
https_client_api:
  prober: http
  timeout: 5s
  http:
    method: GET
    fail_if_ssl: false
    fail_if_not_ssl: true
    valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
    valid_status_codes: [404]
    no_follow_redirects: false
    preferred_ip_protocol: "ip4"
    fail_if_body_not_matches_regexp:
      - '.*ERROR route not.*'
HTTP endpoint returning an error⚑
http_4xx:
  prober: http
  timeout: 5s
  http:
    method: HEAD
    valid_status_codes: [404, 403]
    valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
    no_follow_redirects: false
HTTPS endpoint through an HTTP proxy⚑
https_external_2xx:
  prober: http
  timeout: 5s
  http:
    method: GET
    fail_if_ssl: false
    fail_if_not_ssl: true
    valid_http_versions: ["HTTP/1.0", "HTTP/1.1", "HTTP/2.0"]
    valid_status_codes: [200]
    no_follow_redirects: false
    proxy_url: "http://{{ proxy_url }}:{{ proxy_port }}"
    preferred_ip_protocol: "ip4"
HTTPS endpoint with basic auth⚑
https_basic_auth_2xx:
  prober: http
  timeout: 5s
  http:
    method: GET
    fail_if_ssl: false
    fail_if_not_ssl: true
    valid_http_versions:
      - HTTP/1.1
      - HTTP/2.0
    valid_status_codes:
      - 200
    no_follow_redirects: false
    preferred_ip_protocol: ip4
    basic_auth:
      username: {{ username }}
      password: {{ password }}
HTTPS endpoint with API key⚑
https_api_2xx:
  prober: http
  timeout: 5s
  http:
    method: GET
    fail_if_ssl: false
    fail_if_not_ssl: true
    valid_http_versions:
      - HTTP/1.1
      - HTTP/2.0
    valid_status_codes:
      - 200
    no_follow_redirects: false
    preferred_ip_protocol: ip4
    headers:
      apikey: {{ api_key }}
HTTPS Put file⚑
Test if the probe can upload a file.
https_put_file_2xx:
  prober: http
  timeout: 5s
  http:
    method: PUT
    body: hi
    fail_if_ssl: false
    fail_if_not_ssl: true
    valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
    valid_status_codes: [200]
    no_follow_redirects: false
    preferred_ip_protocol: "ip4"
Check open port⚑
tcp_connect:
  prober: tcp
The port is specified when using the module.
- name: lyz-code.github.io
  url: lyz-code.github.io:389
  module: tcp_connect
Check TCP with TLS⚑
If you want to test, for example, whether an LDAP server is serving a valid certificate on port 636, you can use:
tcp_ssl_connect:
  prober: tcp
  timeout: 10s
  tcp:
    tls: true
- name: Ldap
  url: my-ldap-server:636
  module: tcp_ssl_connect
Ping to the resource⚑
Test if the target is alive. It's useful when you don't know which port to check or if it uses UDP.
ping:
  prober: icmp
  timeout: 5s
  icmp:
    preferred_ip_protocol: "ip4"
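A target for this module only needs a hostname or IP in the url field (my-server.example.org is a placeholder):

- name: my-server
  url: my-server.example.org
  module: ping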
Blackbox exporter alerts⚑
Now that we've got the metrics, we can define the alert rules. Most have been tweaked from the Awesome prometheus alert rules collection. The security tests are covered in the Security alerts section below.
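One way to load these rules, assuming your Prometheus is managed by the prometheus-operator and its ruleSelector matches the same release label used for the ServiceMonitors, is a PrometheusRule resource. A minimal sketch wrapping the first alert (the resource and group names are arbitrary):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: blackbox-exporter-rules
  namespace: monitoring
  labels:
    release: prometheus-operator
spec:
  groups:
    - name: blackbox-exporter
      rules:
        - alert: BlackboxProbeFailed
          expr: probe_success == 0
          for: 5m
          labels:
            severity: error
          annotations:
            summary: "Blackbox probe failed (instance {{ $labels.target }})"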
Availability alerts⚑
The most basic probes test if the service is up and responding.
Blackbox probe failed⚑
Blackbox probe failed.
- alert: BlackboxProbeFailed
  expr: probe_success == 0
  for: 5m
  labels:
    severity: error
  annotations:
    summary: "Blackbox probe failed (instance {{ $labels.target }})"
    message: "Probe failed\n VALUE = {{ $value }}"
    grafana: "{{ grafana_url }}&var-targets={{ $labels.target }}"
    prometheus: "{{ prometheus_url }}/graph?g0.expr=probe_success+%3D%3D+0&g0.tab=1"
If you use the security alerts, use the following expr instead:
expr: probe_success{target!~".*-fail-.*$"} == 0
Blackbox probe HTTP failure⚑
HTTP status code is not 200-399.
- alert: BlackboxProbeHttpFailure
  expr: probe_http_status_code <= 199 OR probe_http_status_code >= 400
  for: 5m
  labels:
    severity: error
  annotations:
    summary: "Blackbox probe HTTP failure (instance {{ $labels.target }})"
    message: "HTTP status code is not 200-399\n VALUE = {{ $value }}"
    grafana: "{{ grafana_url }}&var-targets={{ $labels.target }}"
    prometheus: "{{ prometheus_url }}/graph?g0.expr=probe_http_status_code+%3C%3D+199+OR+probe_http_status_code+%3E%3D+400&g0.tab=1"
Performance alerts⚑
Blackbox slow probe⚑
Blackbox probe took more than 1s to complete.
- alert: BlackboxSlowProbe
  expr: avg_over_time(probe_duration_seconds[1m]) > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Blackbox slow probe (target {{ $labels.target }})"
    message: "Blackbox probe took more than 1s to complete\n VALUE = {{ $value }}"
    grafana: "{{ grafana_url }}&var-targets={{ $labels.target }}"
    prometheus: "{{ prometheus_url }}/graph?g0.expr=avg_over_time%28probe_duration_seconds%5B1m%5D%29+%3E+1&g0.tab=1"
If you use the security alerts, use the following expr instead:
expr: avg_over_time(probe_duration_seconds{target!~".*-fail-.*"}[1m]) > 1
Blackbox probe slow HTTP⚑
HTTP request took more than 1s.
- alert: BlackboxProbeSlowHttp
  expr: avg_over_time(probe_http_duration_seconds[1m]) > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Blackbox probe slow HTTP (instance {{ $labels.target }})"
    message: "HTTP request took more than 1s\n VALUE = {{ $value }}"
    grafana: "{{ grafana_url }}&var-targets={{ $labels.target }}"
    prometheus: "{{ prometheus_url }}/graph?g0.expr=avg_over_time%28probe_http_duration_seconds%5B1m%5D%29+%3E+1&g0.tab=1"
If you use the security alerts, use the following expr instead:
expr: avg_over_time(probe_http_duration_seconds{target!~".*-fail-.*"}[1m]) > 1
Blackbox probe slow ping⚑
Blackbox ping took more than 1s.
- alert: BlackboxProbeSlowPing
  expr: avg_over_time(probe_icmp_duration_seconds[1m]) > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Blackbox probe slow ping (instance {{ $labels.target }})"
    message: "Blackbox ping took more than 1s\n VALUE = {{ $value }}"
    grafana: "{{ grafana_url }}&var-targets={{ $labels.target }}"
    prometheus: "{{ prometheus_url }}/graph?g0.expr=avg_over_time%28probe_icmp_duration_seconds%5B1m%5D%29+%3E+1&g0.tab=1"
SSL certificate alerts⚑
Blackbox SSL certificate will expire in a month⚑
SSL certificate expires in 30 days.
- alert: BlackboxSslCertificateWillExpireSoon
  expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Blackbox SSL certificate will expire soon (instance {{ $labels.target }})"
    message: "SSL certificate expires in 30 days\n VALUE = {{ $value }}"
    grafana: "{{ grafana_url }}&var-targets={{ $labels.target }}"
    prometheus: "{{ prometheus_url }}/graph?g0.expr=probe_ssl_earliest_cert_expiry+-+time%28%29+%3C+86400+%2A+30&g0.tab=1"
Blackbox SSL certificate will expire in a few days⚑
SSL certificate expires in 3 days.
- alert: BlackboxSslCertificateWillExpireSoon
  expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 3
  for: 5m
  labels:
    severity: error
  annotations:
    summary: "Blackbox SSL certificate will expire soon (instance {{ $labels.target }})"
    message: "SSL certificate expires in 3 days\n VALUE = {{ $value }}"
    grafana: "{{ grafana_url }}&var-targets={{ $labels.target }}"
    prometheus: "{{ prometheus_url }}/graph?g0.expr=probe_ssl_earliest_cert_expiry+-+time%28%29+%3C+86400+%2A+3&g0.tab=1"
Blackbox SSL certificate expired⚑
SSL certificate has expired already.
- alert: BlackboxSslCertificateExpired
  expr: probe_ssl_earliest_cert_expiry - time() <= 0
  for: 5m
  labels:
    severity: error
  annotations:
    summary: "Blackbox SSL certificate expired (instance {{ $labels.target }})"
    message: "SSL certificate has expired already\n VALUE = {{ $value }}"
    grafana: "{{ grafana_url }}&var-targets={{ $labels.target }}"
    prometheus: "{{ prometheus_url }}/graph?g0.expr=probe_ssl_earliest_cert_expiry+-+time%28%29+%3C%3D+0&g0.tab=1"
Security alerts⚑
To define the security alerts, I've found it easier to create a probe with the action I want to prevent and make sure that the probe fails.
These probes contain the -fail- key in the target name, followed by the test they perform. This convention allows the concatenation of tests. For example, when testing whether an endpoint is accessible without basic auth and without VPN we'd use:
- name: protected.endpoint.org-fail-without-ssl-and-without-credentials
  url: protected.endpoint.org
  module: https_external_2xx
Test endpoints protected with network policies⚑
Assuming that the blackbox exporter is in the internal network and that there is an HTTP proxy on the external network, create a working probe with the https_external_2xx module containing the -fail-without-vpn key in the target name.
- alert: BlackboxVPNProtectionRemoved
  expr: probe_success{target=~".*-fail-.*without-vpn.*"} == 1
  for: 5m
  labels:
    severity: error
  annotations:
    summary: "VPN protection was removed from (instance {{ $labels.target }})"
    message: "Successful probe to the endpoint from outside the internal network"
    grafana: "{{ grafana_url }}&var-targets={{ $labels.target }}"
    prometheus: "{{ prometheus_url }}/graph?g0.expr=probe_success%7Btarget%3D~%22.*-fail-.*without-vpn.*%22%7D+%3D%3D+1&g0.tab=1"
Test endpoints protected with SSL client certificate⚑
Create a working probe with a module without the SSL client certificate configured, such as https_2xx, and set the -fail-without-ssl key in the target name.
- alert: BlackboxClientSSLProtectionRemoved
  expr: probe_success{target=~".*-fail-.*without-ssl.*"} == 1
  for: 5m
  labels:
    severity: error
  annotations:
    summary: "SSL client certificate protection was removed from (instance {{ $labels.target }})"
    message: "Successful probe to the endpoint without SSL certificate"
    grafana: "{{ grafana_url }}&var-targets={{ $labels.target }}"
    prometheus: "{{ prometheus_url }}/graph?g0.expr=probe_success%7Btarget%3D~%22.*-fail-.*without-ssl.*%22%7D+%3D%3D+1&g0.tab=1"
Test endpoints protected with credentials⚑
Create a working probe with a module without the basic auth credentials configured, such as https_2xx, and set the -fail-without-credentials key in the target name.
- alert: BlackboxCredentialsProtectionRemoved
  expr: probe_success{target=~".*-fail-.*without-credentials.*"} == 1
  for: 5m
  labels:
    severity: error
  annotations:
    summary: "Credentials protection was removed from (instance {{ $labels.target }})"
    message: "Successful probe to the endpoint without credentials"
    grafana: "{{ grafana_url }}&var-targets={{ $labels.target }}"
    prometheus: "{{ prometheus_url }}/graph?g0.expr=probe_success%7Btarget%3D~%22.*-fail-.*without-credentials.*%22%7D+%3D%3D+1&g0.tab=1"
Test endpoints protected with WAF⚑
Create a working probe with a module that bypasses the WAF, for example one that hits the service directly, and set the -fail-without-waf key in the target name.
- alert: BlackboxWAFProtectionRemoved
  expr: probe_success{target=~".*-fail-.*without-waf.*"} == 1
  for: 5m
  labels:
    severity: error
  annotations:
    summary: "WAF protection was removed from (instance {{ $labels.target }})"
    message: "Successful probe to the haproxy endpoint from the internal network (bypassed the WAF)"
    grafana: "{{ grafana_url }}&var-targets={{ $labels.target }}"
    prometheus: "{{ prometheus_url }}/graph?g0.expr=probe_success%7Btarget%3D~%22.*-fail-.*without-waf.*%22%7D+%3D%3D+1&g0.tab=1"
Unauthorized read of S3 buckets⚑
Create a working probe to an existing private object in an S3 bucket and set the -fail-read-object key in the target name.
- alert: BlackboxS3BucketWrongReadPermissions
  expr: probe_success{target=~".*-fail-.*read-object.*"} == 1
  for: 5m
  labels:
    severity: error
  annotations:
    summary: "Wrong read permissions on S3 bucket (instance {{ $labels.target }})"
    message: "Successful read of a private object with an unauthenticated user"
    grafana: "{{ grafana_url }}&var-targets={{ $labels.target }}"
    prometheus: "{{ prometheus_url }}/graph?g0.expr=probe_success%7Btarget%3D~%22.*-fail-.*read-object.*%22%7D+%3D%3D+1&g0.tab=1"
Unauthorized write of S3 buckets⚑
Create a working probe using the https_put_file_2xx module to try to create a file in an S3 bucket and set the -fail-write-object key in the target name.
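For example (the bucket name and object key are placeholders), the probe is expected to fail as long as anonymous writes are denied:

- name: my-bucket-fail-write-object
  url: https://my-bucket.s3.amazonaws.com/canary.txt
  module: https_put_file_2xx

The alert then fires if the unauthenticated write ever succeeds: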
- alert: BlackboxS3BucketWrongWritePermissions
  expr: probe_success{target=~".*-fail-.*write-object.*"} == 1
  for: 5m
  labels:
    severity: error
  annotations:
    summary: "Wrong write permissions on S3 bucket (instance {{ $labels.target }})"
    message: "Successful write of a private object with an unauthenticated user"
    grafana: "{{ grafana_url }}&var-targets={{ $labels.target }}"
    prometheus: "{{ prometheus_url }}/graph?g0.expr=probe_success%7Btarget%3D~%22.*-fail-.*write-object.*%22%7D+%3D%3D+1&g0.tab=1"
Monitoring external access to internal services⚑
There are two possible solutions to simulate traffic from outside your infrastructure to the internal services. Both require the installation of an agent outside of your internal infrastructure; it can be:
- An HTTP proxy.
- A blackbox exporter instance.
Using the proxy has the following advantages:
- It's really easy to set up a transparent http proxy.
- All probe configuration goes in the same blackbox exporter instance values.yaml.
With the following disadvantages:
- When using an external HTTP proxy, the probe runs the DNS resolution locally. Therefore if the record doesn't exist in the local DNS server the probe will fail, even if the proxy DNS resolver has the correct record. The ugly workaround I've implemented is to create a "fake" DNS record in my internal DNS server so the probe sees it exists.
- There is no way to do tcp or ping probes to simulate external traffic.
- The latency between the blackbox exporter and the proxy is added to all the external probes.
Using an external blackbox exporter, on the other hand, gives the following advantages:
- Traffic is completely external to the infrastructure, so the proxy disadvantages would be solved.
And the following disadvantages:
- Simulation of external traffic in AWS could be done by spawning the blackbox exporter instance in another region, but as there is no way of using EKS worker nodes in different regions, there is no way of managing the exporter from within Kubernetes. This means:
  - Losing the advantages of the Prometheus operator, so we have to write the scrape configuration manually (see the sketch after this list).
  - Configuration can't be managed with Helm, so a second tool is needed to manage the monitoring setup (Ansible could be used).
- Even if it's possible to host the second external blackbox exporter within Kubernetes, two independent Helm charts are needed, with the consequent configuration management burden.
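For reference, the manual configuration for an external blackbox exporter would look something like the following scrape config (the exporter address is a placeholder):

scrape_configs:
  - job_name: blackbox-external
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://lyz-code.github.io/blue-book
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      # Address of the external blackbox exporter instance (placeholder).
      - target_label: __address__
        replacement: external-blackbox.example.org:9115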
In conclusion, when using a Kubernetes cluster that allows the creation of worker nodes outside the main infrastructure, or if several non HTTP/HTTPS endpoints need to be probed with the tcp or ping modules, install an external blackbox exporter instance. Otherwise install an HTTP proxy and assume that you can only simulate external HTTP/HTTPS traffic.
Troubleshooting⚑
To get more debugging information of the blackbox probes, add &debug=true to the probe url, for example http://localhost:9115/probe?module=http_2xx&target=https://www.prometheus.io/&debug=true.
Service monitors are not being created⚑
When running helmfile apply several times to update the resources, some are not correctly created. Until the bug is solved, a workaround is to remove the chart release with helm delete --purge prometheus-blackbox-exporter and run helmfile apply again, as shown below.
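That is (helm delete --purge is Helm 2 syntax; with Helm 3 use helm uninstall -n monitoring instead):
helm delete --purge prometheus-blackbox-exporter
helmfile apply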
probe_success == 0 when using an http proxy⚑
Even when using an external http proxy, the probe runs the DNS resolution locally. Therefore if the record doesn't exist in the local server the probe will fail, even if the proxy DNS resolver has the correct record.
The ugly workaround I've implemented is to create a "fake" DNS record in my internal DNS server so the probe sees it exists.