agrim (Everything is a file)

Extensively monitoring spark-jobs with StatsD

This post is a continuation with Monitoring Spark jobs with Prometheus StatsD Exporter and Grafana

In the previous post, we were just running statsd on the master node. Our aim here is for every node to have its statsd sidecar and prometheus scraping metrics from all of them.

A high-level diagram of what we are trying to do here

        (Executor 1)                          (Executor 2)
 [Service -> StatsdSink over UDP]    [Service -> StatsdSink over UDP]
             |                                     |
             ▼                                     ▼
 [StatsD Prometheus Exporter]          [StatsD Prometheus Exporter]
             |                                     |
             |                    _________________|
             |                    |
             ▼                    ▼
        +----------------------------+    
        |    Prometheus Scraper      |        +----------+  
        |         using EC2          |-------►|  Grafana |
        |     service discovery      |        +----------+
        +----------------------------+
             ▲                    ▲
             |                    |________________
             |                                     |
             |                                     |
 [StatsD Prometheus Exporter]          [StatsD Prometheus Exporter]
             ▲                                     ▲
             |                                     |
 [Service -> StatsdSink over UDP]    [Service -> StatsdSink over UDP]
        (Executor 3)                          (Executor 4)

We need statsd daemon on every executor.

docker pull prom/statsd-exporter:v0.14.1
docker run -d -p 9102:9102 -p 9125:9125 -p 9125:9125/udp --name statsd prom/statsd-exporter:v0.14.1

metrics.properties

*.sink.statsd.class=org.apache.spark.metrics.sink.StatsdSink
*.sink.statsd.prefix=spark
*.sink.statsd.host=127.0.0.1
*.sink.statsd.port=9125

prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 15s

scrape_configs:
  - job_name: "my-job"
    metrics_path: /metrics
    params:
      format: ["prometheus"]
    ec2_sd_configs:
      - region: "<aws-region>"
        port: 9102
        refresh_interval: 15m
    relabel_configs:
      - source_labels: [ __meta_ec2_tag_Name ]
        regex: my-job
        action: keep
      - source_labels: [__meta_ec2_tag_Name]
        target_label: instance

EC2 tag-based discovery

EC2 SD configurations allow retrieving scrape targets from AWS EC2 instances.

References