Elasticsearch Exporter

The elasticsearch exporter allows monitoring Elasticsearch clusters with Prometheus.

Installation

We'll use helmfile to install the prometheus-elasticsearch-exporter chart.

Add the following lines to your helmfile.yaml.

- name: prometheus-elasticsearch-exporter
  namespace: monitoring
  chart: prometheus-community/prometheus-elasticsearch-exporter
  values:
    - prometheus-elasticsearch-exporter/values.yaml
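
The chart reference above only resolves if helmfile knows about the prometheus-community repository. As a minimal sketch, assuming the release sits under the usual top-level releases key and the repository isn't declared elsewhere in your setup, a complete helmfile.yaml could look like this:

```yaml
# helmfile.yaml (sketch)
repositories:
  # Helm repository that hosts the prometheus-community charts.
  - name: prometheus-community
    url: https://prometheus-community.github.io/helm-charts

releases:
  - name: prometheus-elasticsearch-exporter
    namespace: monitoring
    chart: prometheus-community/prometheus-elasticsearch-exporter
    values:
      - prometheus-elasticsearch-exporter/values.yaml
```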

Edit the chart values.

mkdir prometheus-elasticsearch-exporter
helm inspect values prometheus-community/prometheus-elasticsearch-exporter > prometheus-elasticsearch-exporter/values.yaml
vi prometheus-elasticsearch-exporter/values.yaml

Comment out all the values you don't edit, so that the chart doesn't break when you upgrade it.

Make sure that the serviceMonitor labels match your Prometheus serviceMonitorSelector; otherwise the exporter won't be added to the scrape configuration.

es:
  ## Address (host and port) of the Elasticsearch node we should connect to.
  ## This could be a local node (localhost:9200, for instance), or the address
  ## of a remote Elasticsearch server. When basic auth is needed,
  ## specify as: <proto>://<user>:<password>@<host>:<port>. e.g., http://admin:pass@localhost:9200.
  ##
  uri: http://localhost:9200

serviceMonitor:
  ## If true, a ServiceMonitor CRD is created for a prometheus operator
  ## https://github.com/coreos/prometheus-operator
  ##
  enabled: true
  #  namespace: monitoring
  labels:
    release: prometheus-operator
  interval: 30s
  # scrapeTimeout: 10s
  # scheme: http
  # relabelings: []
  # targetLabels: []
  metricRelabelings:
    - sourceLabels: [cluster]
      targetLabel: cluster_name
      regex: '.*:(.*)'
  # sampleLimit: 0
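
For reference, this is roughly what the matching selector looks like on the Prometheus side. It's a sketch of a Prometheus custom resource, assuming the release: prometheus-operator label used above; the resource name is hypothetical, and how you set the selector depends on how your Prometheus was deployed.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus        # hypothetical name
  namespace: monitoring
spec:
  # Only ServiceMonitors carrying all of these labels are picked up.
  serviceMonitorSelector:
    matchLabels:
      release: prometheus-operator
```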

You can build the cluster label by following these instructions. I didn't find the required meta tags, so I've built the cluster_name label with the metric relabeling above instead, for alerting purposes.
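
To make the relabeling easier to follow, here is the same rule with the Prometheus defaults spelled out. The example label value in the comments is hypothetical.

```yaml
serviceMonitor:
  metricRelabelings:
    - sourceLabels: [cluster]
      # The regex is fully anchored and the first .* is greedy, so the
      # capture group holds whatever follows the last colon.
      regex: '.*:(.*)'
      targetLabel: cluster_name
      # The two keys below are the Prometheus defaults, written out explicitly.
      replacement: '$1'
      action: replace
# Hypothetical example: cluster="prod:es-logs" gives cluster_name="es-logs".
# A cluster value without a colon does not match, so cluster_name is not set.
```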

The Grafana dashboard I chose is 2322. Taking the Grafana helm chart values as reference, add the following YAML under the grafana key in the prometheus-operator values.yaml.

grafana:
  enabled: true
  defaultDashboardsEnabled: true
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
      - name: 'default'
        orgId: 1
        folder: ''
        type: file
        disableDeletion: false
        editable: true
        options:
          path: /var/lib/grafana/dashboards/default
  dashboards:
    default:
      elasticsearch:
        # Ref: https://grafana.com/dashboards/2322
        gnetId: 2322
        revision: 4
        datasource: Prometheus

And install.

helmfile diff
helmfile apply

Elasticsearch exporter alerts

Now that we've got the metrics, we can define the alert rules. Most have been tweaked from the Awesome prometheus alert rules collection.

Availability alerts

The most basic probes test whether the service is healthy.

- alert: ElasticsearchClusterRed
  expr: elasticsearch_cluster_health_status{color="red"} == 1
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: >
      Elasticsearch Cluster Red
      (cluster {{ $labels.cluster_name }})
    description: |
      Elastic Cluster Red status
      VALUE = {{ $value }}
      LABELS = {{ $labels }}
- alert: ElasticsearchClusterYellow
  expr: elasticsearch_cluster_health_status{color="yellow"} == 1
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: >
      Elasticsearch Cluster Yellow
      (cluster {{ $labels.cluster_name }})
    description: |
      Elastic Cluster Yellow status
      VALUE = {{ $value }}
      LABELS = {{ $labels }}
- alert: ElasticsearchHealthyNodes
  expr: elasticsearch_cluster_health_number_of_nodes < 3
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: >
      Elasticsearch Healthy Nodes
      (cluster {{ $labels.cluster_name }})
    description: |
      Missing node in Elasticsearch cluster
      VALUE = {{ $value }}
      LABELS = {{ $labels }}
- alert: ElasticsearchHealthyMasterNodes
  expr: >
    elasticsearch_cluster_health_number_of_nodes
    - elasticsearch_cluster_health_number_of_data_nodes > 0 < 3
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: >
      Elasticsearch Healthy Master Nodes < 3
      (cluster {{ $labels.cluster_name }})
    description: |
      Missing master node in Elasticsearch cluster
      VALUE = {{ $value }}
      LABELS = {{ $labels }}
- alert: ElasticsearchHealthyDataNodes
  expr: elasticsearch_cluster_health_number_of_data_nodes < 3
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: >
      Elasticsearch Healthy Data Nodes
      (cluster {{ $labels.cluster_name }})
    description: |
      Missing data node in Elasticsearch cluster
      VALUE = {{ $value }}
      LABELS = {{ $labels }}

Performance alerts

- alert: ElasticsearchCPUUsageTooHigh
  expr: elasticsearch_os_cpu_percent > 90
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: >
      Elasticsearch Node CPU Usage Too High
      (cluster {{ $labels.cluster_name }} node {{ $labels.name }})
    description: |
      The CPU usage of node {{ $labels.name }} is over 90%
      VALUE = {{ $value }}
      LABELS = {{ $labels }}
- alert: ElasticsearchCPUUsageWarning
  expr: elasticsearch_os_cpu_percent > 80
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: >
      Elasticsearch Node CPU Usage Warning
      (cluster {{ $labels.cluster_name }} node {{ $labels.name }})
    description: |
      The CPU usage of node {{ $labels.name }} is over 80%
      VALUE = {{ $value }}
      LABELS = {{ $labels }}
- alert: ElasticsearchHeapUsageTooHigh
  expr: >
    (
      elasticsearch_jvm_memory_used_bytes{area="heap"}
      / elasticsearch_jvm_memory_max_bytes{area="heap"}
    ) * 100 > 90
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: >
      Elasticsearch Node Heap Usage Critical
      (cluster {{ $labels.cluster_name }} node {{ $labels.name }})
    description: |
      The heap usage of node {{ $labels.name }} is over 90%
      VALUE = {{ $value }}
      LABELS = {{ $labels }}
- alert: ElasticsearchHeapUsageWarning
  expr: >
    (
      elasticsearch_jvm_memory_used_bytes{area="heap"}
      / elasticsearch_jvm_memory_max_bytes{area="heap"}
    ) * 100 > 80
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: >
      Elasticsearch Node Heap Usage Warning
      (cluster {{ $labels.cluster_name }} node {{ $labels.name }})
    description: |
      The heap usage of node {{ $labels.name }} is over 80%
      VALUE = {{ $value }}
      LABELS = {{ $labels }}
- alert: ElasticsearchDiskOutOfSpace
  expr: >
    elasticsearch_filesystem_data_available_bytes
    / elasticsearch_filesystem_data_size_bytes * 100 < 10
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: >
      Elasticsearch disk out of space
      (cluster {{ $labels.cluster_name }})
    description: |
      The disk usage is over 90%
      VALUE = {{ $value }}
      LABELS = {{ $labels }}
- alert: ElasticsearchDiskSpaceLow
  expr: >
    elasticsearch_filesystem_data_available_bytes
    / elasticsearch_filesystem_data_size_bytes * 100 < 20
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: >
      Elasticsearch disk space low
      (cluster {{ $labels.cluster_name }})
    description: |
      The disk usage is over 80%
      VALUE = {{ $value }}
      LABELS = {{ $labels }}
- alert: ElasticsearchRelocatingShardsTooLong
  expr: elasticsearch_cluster_health_relocating_shards > 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: >
      Elasticsearch relocating shards too long
      (cluster {{ $labels.cluster_name }})
    description: |
      Elasticsearch has been relocating shards for 15 min
      VALUE = {{ $value }}
      LABELS = {{ $labels }}
- alert: ElasticsearchInitializingShardsTooLong
  expr: elasticsearch_cluster_health_initializing_shards > 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: >
      Elasticsearch initializing shards too long
      (cluster {{ $labels.cluster_name }})
    description: |
      Elasticsearch has been initializing shards for 15 min
      VALUE = {{ $value }}
      LABELS = {{ $labels }}
- alert: ElasticsearchUnassignedShards
  expr: elasticsearch_cluster_health_unassigned_shards > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: >
      Elasticsearch unassigned shards
      (cluster {{ $labels.cluster_name }})
    description: |
      Elasticsearch has unassigned shards
      VALUE = {{ $value }}
      LABELS = {{ $labels }}
- alert: ElasticsearchPendingTasks
  expr: elasticsearch_cluster_health_number_of_pending_tasks > 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: >
      Elasticsearch pending tasks
      (cluster {{ $labels.cluster_name }})
    description: |
      Elasticsearch has pending tasks. Cluster works slowly.
      VALUE = {{ $value }}
      LABELS = {{ $labels }}

- alert: ElasticsearchCountOfJVMGarbageCollectorRuns
  expr: rate(elasticsearch_jvm_gc_collection_seconds_count{}[5m]) > 5
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: >
      Elasticsearch JVM Garbage Collector runs > 5
      (cluster {{ $labels.cluster_name }})
    description: |
      Elastic Cluster JVM Garbage Collector runs > 5
      VALUE = {{ $value }}
      LABELS = {{ $labels }}

- alert: ElasticsearchCountOfJVMGarbageCollectorTime
  expr: rate(elasticsearch_jvm_gc_collection_seconds_sum[5m]) > 0.3
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: >
      Elasticsearch JVM Garbage Collector time > 0.3
      (cluster {{ $labels.cluster_name }})
    description: |
      Elastic Cluster JVM Garbage Collector time > 0.3
      VALUE = {{ $value }}
      LABELS = {{ $labels }}
- alert: ElasticsearchJSONParseErrors
  expr: elasticsearch_cluster_health_json_parse_failures > 0
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: >
      Elasticsearch json parse error
      (cluster {{ $labels.cluster_name }})
    description: |
      Elasticsearch json parse error
      VALUE = {{ $value }}
      LABELS = {{ $labels }}
- alert: ElasticsearchCircuitBreakerTripped
  expr: rate(elasticsearch_breakers_tripped{}[5m]) > 0
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: >
      Elasticsearch breaker {{ $labels.breaker }} tripped
      (cluster {{ $labels.cluster_name }}, node {{ $labels.name }})
    description: |
      Elasticsearch breaker {{ $labels.breaker }} tripped
      (cluster {{ $labels.cluster_name }}, node {{ $labels.name }})
      VALUE = {{ $value }}
      LABELS = {{ $labels }}

Snapshot alerts

- alert: ElasticsearchMonthlySnapshot
  expr: >
    time() -
    elasticsearch_snapshot_stats_snapshot_end_time_timestamp{state="SUCCESS"}
    > (3600 * 24 * 32)
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: >
      Elasticsearch monthly snapshot failed
      (cluster {{ $labels.cluster_name }},
      repository {{ $labels.repository }})
    description: |
      Last successful elasticsearch snapshot
      of repository {{ $labels.repository }} is older than 32 days.
      VALUE = {{ $value }}
      LABELS = {{ $labels }}

- record: elasticsearch_indices_search_latency:rate1m
  expr: |
    increase(elasticsearch_indices_search_query_time_seconds[1m])/
    increase(elasticsearch_indices_search_query_total[1m])
- record: elasticsearch_indices_search_rate:rate1m
  expr: increase(elasticsearch_indices_search_query_total[1m])/60
- alert: ElasticsearchSlowSearchLatency
  expr: elasticsearch_indices_search_latency:rate1m > 1
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: >
      Elasticsearch search latency is greater than 1 s
      (cluster {{ $labels.cluster_name }}, node {{ $labels.name }})
    description: |
      Elasticsearch search latency is greater than 1 s
      (cluster {{ $labels.cluster_name }}, node {{ $labels.name }})
      VALUE = {{ $value }}
      LABELS = {{ $labels }}
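
With the Prometheus operator, one way to load the rules on this page is to wrap them in a PrometheusRule resource. The following is a sketch, not the only option: the resource and group names are hypothetical, and the release label is assumed to match your Prometheus ruleSelector in the same way the ServiceMonitor labels match the serviceMonitorSelector.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: elasticsearch-exporter-rules   # hypothetical name
  namespace: monitoring
  labels:
    # Assumed to match the ruleSelector of your Prometheus.
    release: prometheus-operator
spec:
  groups:
    - name: elasticsearch              # hypothetical group name
      rules:
        - alert: ElasticsearchClusterRed
          expr: elasticsearch_cluster_health_status{color="red"} == 1
          for: 0m
          labels:
            severity: critical
          annotations:
            summary: >
              Elasticsearch Cluster Red
              (cluster {{ $labels.cluster_name }})
        # The remaining alert and record rules from this page go here,
        # at the same indentation level.
```

Recording rules and alert rules can live in the same group, so the search latency rules don't need a separate resource.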