The settings in the default `monitoring/values.yml` are currently somewhat out of date.
Many values are being overridden in production.
Ideally we would like to:
- have fewer prod overrides (i.e. bring as many of the prod settings as possible into the defaults)
- have a setup that verifiably works in the dev setup - in particular the blackbox exporter probes (which monitor the main backend) should work locally.

Here are the current production overrides. Note that some values are injected via terraform templates (e.g. `${graphql_playground_url}`), and it's not obvious how best to set defaults for those; at least for the dev setup we should probably hard-code the values (a possible approach is sketched after the overrides below).
```yaml
prometheus:
  extraScrapeConfigs: |
    - job_name: 'prometheus-blackbox-exporter-noauth'
      metrics_path: /probe
      params:
        module: [buildParameters]
      static_configs:
        - targets:
            - ${graphql_playground_url}
      relabel_configs:
        - source_labels: [__address__]
          target_label: __param_target
        - source_labels: [__param_target]
          target_label: instance
        - target_label: __address__
          replacement: monitoring-prometheus-blackbox-exporter:9115
        - source_labels: [__meta_kubernetes_namespace]
          target_label: namespace
    - job_name: 'prometheus-blackbox-exporter-auth'
      scrape_timeout: 30s
      metrics_path: /probe
      params:
        module: [walletAuth]
      static_configs:
        - targets:
            - ${graphql_playground_url}
      relabel_configs:
        - source_labels: [__address__]
          target_label: __param_target
        - source_labels: [__param_target]
          target_label: instance
        - target_label: __address__
          replacement: monitoring-prometheus-blackbox-exporter:9115
  alerts:
    groups:
      - name: Ingress Controller
        rules:
          - alert: NGINXTooMany500s
            expr: 100 * ( sum( nginx_ingress_controller_requests{status=~"5.+"} ) / sum(nginx_ingress_controller_requests) ) > 5
            for: 1m
            labels:
              severity: critical
            annotations:
              description: Too many 5XXs
              summary: More than 5% of all requests returned 5XX
          - alert: NGINXTooMany400s
            expr: 100 * ( sum( nginx_ingress_controller_requests{status=~"4.+"} ) / sum(nginx_ingress_controller_requests) ) > 5
            for: 1m
            labels:
              severity: critical
            annotations:
              description: Too many 4XXs
              summary: More than 5% of all requests returned 4XX
      - name: ${instance_name}
        rules:
          - alert: PodRestart
            expr: increase(kube_pod_container_status_restarts_total{namespace=~'${galoy_namespace}|${bitcoin_namespace}'}[10m]) >= 2
            labels:
              severity: critical
            annotations:
              summary: "{{$labels.container}} restarted too many times"
          - alert: PodStartupError
            for: 1m
            expr: kube_pod_container_status_waiting_reason{reason!="ContainerCreating",namespace=~'${galoy_namespace}|${bitcoin_namespace}'} == 1
            labels:
              severity: critical
            annotations:
              summary: "{{$labels.container}} is unable to start"
          - alert: GraphqlIssue
            for: 3m
            expr: probe_success{job="prometheus-blackbox-exporter-mainnet"} == 0
            labels:
              severity: critical
            annotations:
              summary: "Graphql is down"
          - alert: GraphqlNoAuthIssue
            for: 3m
            expr: probe_success{namespace=~'${galoy_namespace}', job="prometheus-blackbox-exporter-noauth"} == 0
            labels:
              severity: critical
            annotations:
              summary: "Graphql is down"
  alertmanagerFiles:
    alertmanager.yml:
      global:
        pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
      route:
        group_wait: 10s
        group_interval: 10m
        receiver: slack
        repeat_interval: 6h
        routes:
          - receiver: slack-pagerduty
            matchers:
              - severity="critical"
            group_interval: 2m
prometheus-blackbox-exporter:
  secretConfig: true
  config:
    modules:
      buildParameters:
        prober: http
        timeout: 3s
        http:
          method: POST
          headers:
            Content-Type: application/json
          body: '{"query":"query buildParameters { buildParameters { id commitHash buildTime helmRevision minBuildNumberAndroid minBuildNumberIos lastBuildNumberAndroid lastBuildNumberIos }}","variables":{}}'
      walletAuth:
        prober: http
        timeout: 30s
        http:
          method: POST
          fail_if_body_matches_regexp:
            - "errors+"
          headers:
            Content-Type: application/json
          body: '{"query":"query gql_query_logged { prices { __typename id o } earnList { __typename id value completed } wallet { __typename id balance currency transactions { __typename id amount description created_at hash type usd fee feeUsd pending } } getLastOnChainAddress { __typename id } me { __typename id level username phone } maps { __typename id title coordinate { __typename latitude longitude } } nodeStats { __typename id } }","variables":{}}'
```
Another file containing sensitive information is also merged in:
```yaml
prometheus:
  alertmanagerFiles:
    alertmanager.yml:
      global:
        slack_api_url: ${slack_api_url}
      receivers:
        - name: slack
          slack_configs:
            - channel: '#${slack_alerts_channel_name}'
              title: "{{range .Alerts}}{{.Annotations.summary}}\n{{end}}"
              send_resolved: true
        - name: slack-pagerduty
          pagerduty_configs:
            - service_key: ${pagerduty_service_key}
              send_resolved: true
          slack_configs:
            - channel: '#${slack_alerts_channel_name}'
              title: "{{range .Alerts}}{{.Annotations.summary}}\n{{end}}"
              send_resolved: true
prometheus-blackbox-exporter:
  config:
    modules:
      walletAuth:
        http:
          headers:
            Authorization: Bearer ${probe_auth_token}
```
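Similarly, these sensitive values could be given harmless defaults in `values.yml` so the chart renders and the alerting pipeline can be exercised in dev without any terraform-injected secrets, while production keeps merging the real values from the separate file. A rough sketch with purely illustrative placeholders:

```yaml
# Illustrative dev-only defaults - none of these are real credentials or endpoints.
prometheus:
  alertmanagerFiles:
    alertmanager.yml:
      global:
        slack_api_url: http://localhost:9093/dev-null  # placeholder, overridden in prod
      receivers:
        - name: slack
          slack_configs:
            - channel: '#dev-alerts'  # placeholder channel
              send_resolved: true
        - name: slack-pagerduty
          slack_configs:
            - channel: '#dev-alerts'  # placeholder; no real pagerduty config in dev
              send_resolved: true
prometheus-blackbox-exporter:
  config:
    modules:
      walletAuth:
        http:
          headers:
            Authorization: Bearer dev-dummy-token  # placeholder, overridden in prod
```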