Prod / dev divergence monitoring #315

@bodymindarts

The defaults in monitoring/values.yml are somewhat out of date, and many of them are overridden in production.

Ideally we would like to:

  • have fewer prod overrides (i.e. bring as many of the prod settings as possible into the defaults)
  • have a setup that verifiably works in dev - in particular, the metrics coming from the blackbox exporter (which probes the main backend) should work locally.

Here are the current production overrides. Note that some values are injected via Terraform templates (e.g. ${graphql_playground_url}), and I don't know how best to set defaults for those. For the dev setup, at least, we should probably hard-code the values.

prometheus:
  extraScrapeConfigs: |
    - job_name: 'prometheus-blackbox-exporter-noauth'
      metrics_path: /probe
      params:
        module: [buildParameters]
      static_configs:
        - targets:
          - ${graphql_playground_url}
      relabel_configs:
        - source_labels: [__address__]
          target_label: __param_target
        - source_labels: [__param_target]
          target_label: instance
        - target_label: __address__
          replacement: monitoring-prometheus-blackbox-exporter:9115
        - source_labels: [__meta_kubernetes_namespace]
          target_label: namespace
    - job_name: 'prometheus-blackbox-exporter-auth'
      scrape_timeout: 30s
      metrics_path: /probe
      params:
        module: [walletAuth]
      static_configs:
        - targets:
          - ${graphql_playground_url}
      relabel_configs:
        - source_labels: [__address__]
          target_label: __param_target
        - source_labels: [__param_target]
          target_label: instance
        - target_label: __address__
          replacement: monitoring-prometheus-blackbox-exporter:9115

  alerts:
    groups:
    - name: Ingress Controller
      rules:
      - alert: NGINXTooMany500s
        expr: 100 * ( sum( nginx_ingress_controller_requests{status=~"5.+"} ) / sum(nginx_ingress_controller_requests) ) > 5
        for: 1m
        labels:
          severity: critical
        annotations:
          description: Too many 5XXs
          summary: More than 5% of all requests returned 5XX
      - alert: NGINXTooMany400s
        expr: 100 * ( sum( nginx_ingress_controller_requests{status=~"4.+"} ) / sum(nginx_ingress_controller_requests) ) > 5
        for: 1m
        labels:
          severity: critical
        annotations:
          description: Too many 4XXs
          summary: More than 5% of all requests returned 4XX
    - name: ${instance_name}
      rules:
      - alert: PodRestart
        expr: increase(kube_pod_container_status_restarts_total{namespace=~'${galoy_namespace}|${bitcoin_namespace}'}[10m]) >= 2
        labels:
          severity: critical
        annotations:
          summary: "{{$labels.container}} restarted too many times"
      - alert: PodStartupError
        for: 1m
        expr: kube_pod_container_status_waiting_reason{reason!="ContainerCreating",namespace=~'${galoy_namespace}|${bitcoin_namespace}'} == 1
        labels:
          severity: critical
        annotations:
          summary: "{{$labels.container}} is unable to start"
      - alert: GraphqlIssue
        for: 3m
        expr: probe_success{job="prometheus-blackbox-exporter-mainnet"} == 0
        labels:
          severity: critical
        annotations:
          summary: "Graphql is down"
      - alert: GraphqlNoAuthIssue
        for: 3m
        expr: probe_success{namespace=~'${galoy_namespace}', job="prometheus-blackbox-exporter-noauth"} == 0
        labels:
          severity: critical
        annotations:
          summary: "Graphql is down"
  
  alertmanagerFiles:
    alertmanager.yml:
      global:
        pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
      route:
        group_wait: 10s
        group_interval: 10m
        receiver: slack
        repeat_interval: 6h
        routes:
        - receiver: slack-pagerduty
          matchers:
            - severity="critical"
          group_interval: 2m

prometheus-blackbox-exporter:
  secretConfig: true
  config:
    modules:
      buildParameters:
        prober: http
        timeout: 3s
        http:
          method: POST
          headers:
            Content-Type: application/json
          body: '{"query":"query buildParameters { buildParameters { id commitHash buildTime helmRevision minBuildNumberAndroid minBuildNumberIos lastBuildNumberAndroid lastBuildNumberIos }}","variables":{}}'
      walletAuth:
        prober: http
        timeout: 30s
        http:
          method: POST
          fail_if_body_matches_regexp:
            - "errors+"
          headers:
            Content-Type: application/json
          body: '{"query":"query gql_query_logged { prices { __typename id o } earnList { __typename id value completed } wallet { __typename id balance currency transactions { __typename id amount description created_at hash type usd fee feeUsd pending } } getLastOnChainAddress { __typename id } me { __typename id level username phone } maps { __typename id title coordinate { __typename latitude longitude } } nodeStats { __typename id } }","variables":{}}'
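One thing to check when moving these rules into the defaults: the GraphqlIssue alert selects job="prometheus-blackbox-exporter-mainnet", while the scrape jobs above are named prometheus-blackbox-exporter-noauth and prometheus-blackbox-exporter-auth. Also, the namespace label is derived from __meta_kubernetes_namespace, which is normally populated by Kubernetes service discovery, so with static_configs it may never be set and the namespace matcher in GraphqlNoAuthIssue may not match anything. A minimal sketch of dev-friendly versions of the two probe alerts follows. It assumes GraphqlIssue is meant to watch the authenticated probe, and the group name is a hypothetical stand-in for ${instance_name}.

```yaml
# Sketch only, not the production config.
# Assumption: GraphqlIssue should watch the '-auth' probe job defined above.
prometheus:
  alerts:
    groups:
    - name: dev  # hypothetical replacement for ${instance_name}
      rules:
      - alert: GraphqlIssue
        for: 3m
        # Job name aligned with the scrape config instead of '-mainnet'
        expr: probe_success{job="prometheus-blackbox-exporter-auth"} == 0
        labels:
          severity: critical
        annotations:
          summary: "Graphql is down"
      - alert: GraphqlNoAuthIssue
        for: 3m
        # Namespace matcher dropped: static targets may carry no namespace label
        expr: probe_success{job="prometheus-blackbox-exporter-noauth"} == 0
        labels:
          severity: critical
        annotations:
          summary: "Graphql is down"
```

If the namespace matcher is wanted, one option is to attach a static namespace label to the targets in static_configs rather than relying on the relabel rule.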

Another file containing sensitive information is also merged in:

prometheus:
  alertmanagerFiles:
    alertmanager.yml:
      global:
        slack_api_url: ${slack_api_url}
      receivers:
        - name: slack
          slack_configs:
          - channel: '#${slack_alerts_channel_name}'
            title: "{{range .Alerts}}{{.Annotations.summary}}\n{{end}}"
            send_resolved: true
        - name: slack-pagerduty
          pagerduty_configs:
          - service_key: ${pagerduty_service_key}
            send_resolved: true
          slack_configs:
          - channel: '#${slack_alerts_channel_name}'
            title: "{{range .Alerts}}{{.Annotations.summary}}\n{{end}}"
            send_resolved: true

prometheus-blackbox-exporter:
  config:
    modules:
      walletAuth:
        http:
          headers:
            Authorization: Bearer ${probe_auth_token}
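For the Terraform-templated values, a dev default could simply replace each placeholder with a fixed value. The sketch below does this for the no-auth probe job only. The target URL is a hypothetical placeholder (the real dev endpoint is not given here), and the exporter address is copied from the production override on the assumption that the dev release is also named monitoring.

```yaml
# Sketch only: dev default with the Terraform placeholder hard-coded.
prometheus:
  extraScrapeConfigs: |
    - job_name: 'prometheus-blackbox-exporter-noauth'
      metrics_path: /probe
      params:
        module: [buildParameters]
      static_configs:
        - targets:
          # Hypothetical dev URL standing in for ${graphql_playground_url}
          - http://graphql.dev.example:4000/graphql
      relabel_configs:
        # Pass the target URL to the exporter as the ?target= parameter
        - source_labels: [__address__]
          target_label: __param_target
        # Keep the probed URL as the instance label
        - source_labels: [__param_target]
          target_label: instance
        # Send the actual scrape to the blackbox exporter service
        - target_label: __address__
          replacement: monitoring-prometheus-blackbox-exporter:9115
```

The authenticated job would follow the same pattern, with a dev-only token in place of ${probe_auth_token}.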
