Alerting
This guide covers the Prometheus AlertManager integration for Grimnir Radio, including alert rules, configuration, and best practices.
- Overview
- Quick Start
- Alert Rules
- AlertManager Configuration
- Notification Channels
- Alert Severity Levels
- Customization
- Troubleshooting
Overview
Grimnir Radio includes comprehensive alerting rules for monitoring system health, performance, and audio quality. The alerting system uses:
- Prometheus - Metric collection and alert rule evaluation
- AlertManager - Alert routing, grouping, and notification
- Multiple notification channels - Slack, PagerDuty, Email, Webhooks
┌─────────────────┐ ┌──────────────┐ ┌──────────────────┐
│ Grimnir Radio │────▶│ Prometheus │────▶│ AlertManager │
│ /metrics │ │ (rules eval) │ │ (routing/notify) │
└─────────────────┘ └──────────────┘ └──────────────────┘
│
┌──────────────────────────────┼──────────────┐
▼ ▼ ▼
┌──────────┐ ┌─────────┐ ┌────────┐
│ Slack │ │ Email │ │Webhook │
└──────────┘ └─────────┘ └────────┘
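For reference, a minimal prometheus.yml wiring these pieces together might look like the sketch below. The job names are illustrative, the targets follow the metrics endpoints used in the Quick Start, and the shipped deploy/prometheus/prometheus.yml may differ:
# Sketch only - job names are illustrative; see deploy/prometheus/prometheus.yml for the real config
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  - /etc/prometheus/alerts.yml
scrape_configs:
  - job_name: grimnir-control-plane   # control plane metrics on :8080/metrics
    static_configs:
      - targets: ['localhost:8080']
  - job_name: grimnir-media-engine    # media engine metrics on :9092/metrics
    static_configs:
      - targets: ['localhost:9092']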
Quick Start
Docker Compose:
version: '3.8'
services:
prometheus:
image: prom/prometheus:latest
ports:
- "9090:9090"
volumes:
- ./deploy/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
- ./deploy/prometheus/alerts.yml:/etc/prometheus/alerts.yml
- prometheus-data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--web.console.libraries=/usr/share/prometheus/console_libraries'
- '--web.console.templates=/usr/share/prometheus/consoles'
alertmanager:
image: prom/alertmanager:latest
ports:
- "9093:9093"
volumes:
- ./deploy/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
- alertmanager-data:/alertmanager
command:
- '--config.file=/etc/alertmanager/alertmanager.yml'
- '--storage.path=/alertmanager'
volumes:
prometheus-data:
  alertmanager-data:
Manual Installation:
# Download Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.48.0/prometheus-2.48.0.linux-amd64.tar.gz
tar xvf prometheus-2.48.0.linux-amd64.tar.gz
cd prometheus-2.48.0.linux-amd64
# Copy alert rules
cp /path/to/grimnir_radio/deploy/prometheus/alerts.yml .
cp /path/to/grimnir_radio/deploy/prometheus/prometheus.yml .
# Start Prometheus
./prometheus --config.file=prometheus.yml
# Download AlertManager
wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz
tar xvf alertmanager-0.26.0.linux-amd64.tar.gz
cd alertmanager-0.26.0.linux-amd64
# Copy configuration
cp /path/to/grimnir_radio/deploy/alertmanager/alertmanager.yml .
# Start AlertManager
./alertmanager --config.file=alertmanager.yml
Ensure metrics are exposed:
# Control plane metrics on :8080/metrics
curl http://localhost:8080/metrics
# Media engine metrics on :9092/metrics
curl http://localhost:9092/metrics
Check that Prometheus loaded the rules:
curl http://localhost:9090/api/v1/rules | jq
Trigger a test alert by stopping the media engine:
# Stop media engine
pkill mediaengine
# Wait 1 minute, then check alerts in Prometheus UI
# http://localhost:9090/alerts
# Check AlertManager UI
# http://localhost:9093
Alert Rules
Critical Alerts:
| Alert | Condition | For | Description |
|---|---|---|---|
| MediaEngineDown | Connection status = 0 | 1m | Media engine gRPC connection lost |
| MediaEnginePipelineRestartLoop | Restart rate > 0.1/sec | 2m | Pipeline restarting frequently |
| ScheduleGap | Schedule entries = 0 | 5m | No scheduled content |
| AudioSilence | Both channels < -60dBFS | 30s | Dead air detected |
| WebstreamFailoverLoop | Failover rate > 0.05/sec | 5m | All webstream sources failing |
| APIHighErrorRate | 5xx errors > 5% | 3m | API returning server errors |
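As an illustration of how these rows map to Prometheus rule definitions, a MediaEngineDown rule could look roughly like the sketch below. The exact expression and labels live in deploy/prometheus/alerts.yml and may differ; the metric name is the one queried in the Troubleshooting section:
# Sketch only - see deploy/prometheus/alerts.yml for the actual rule
- alert: MediaEngineDown
  expr: grimnir_media_engine_connection_status == 0
  for: 1m
  labels:
    severity: critical
    component: media_engine
  annotations:
    summary: "Media Engine disconnected for executor {{ $labels.executor_id }}"
    description: "The media engine gRPC connection is down for executor {{ $labels.executor_id }}"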
Warning Alerts:
| Alert | Condition | For | Description |
|---|---|---|---|
| PlayoutUnderrunHigh | Underrun rate > 0.05/sec | 2m | Audio dropouts occurring |
| PlayoutBufferLow | Buffer < 4800 samples | 30s | Risk of dropout |
| SchedulerErrors | Error rate > 0.01/sec | 2m | Scheduler experiencing errors |
| WebstreamUnhealthy | Health status = 0 | 2m | Webstream failing health checks |
| AudioLevelLow | Level < -40dBFS | 2m | Audio level too quiet |
| APIHighLatency | P99 latency > 2s | 5m | API responding slowly |
Info Alerts:
| Alert | Condition | For | Description |
|---|---|---|---|
| LiveSessionStuck | Session active unchanged | 15m | DJ session running long |
| AudioLoudnessOutOfRange | LUFS < -23 or > -14 | 5m | Loudness outside broadcast standards |
AlertManager Configuration
Alerts are routed based on:
- Severity: Critical, Warning, Info
- Component: media_engine, scheduler, api, playout, etc.
- Alert Name: Specific alert types
Alerts are grouped by:
- alertname - Same type of alert
- station_id - Alerts affecting the same station
- component - Same system component
This prevents notification spam when multiple related alerts fire.
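In alertmanager.yml, this corresponds to the group_by list on the route, with child routes matching on severity. A minimal sketch; receiver names other than 'critical' are illustrative and the shipped config may nest additional routes:
route:
  receiver: 'critical'                # default receiver (defined under receivers)
  group_by: ['alertname', 'station_id', 'component']
  routes:
    - matchers: ['severity="critical"']
      receiver: 'critical'
    - matchers: ['severity="warning"']
      receiver: 'warning'             # illustrative receiver name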
Certain alerts suppress others to reduce noise:
| Source Alert | Suppresses | Reason |
|---|---|---|
| SchedulerErrors | ScheduleGap | Schedule gap is likely caused by scheduler errors |
| MediaEngineDown | Playout* alerts | Playout issues are caused by media engine being down |
| WebstreamFailoverLoop | WebstreamUnhealthy | Individual failures expected during failover loop |
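These suppressions are expressed as inhibit_rules in alertmanager.yml. A sketch of the first two rows; the equal labels are an assumption about which labels the shipped rules key on:
inhibit_rules:
  # SchedulerErrors suppresses ScheduleGap for the same station (equal label assumed)
  - source_matchers: ['alertname="SchedulerErrors"']
    target_matchers: ['alertname="ScheduleGap"']
    equal: ['station_id']
  # MediaEngineDown suppresses all Playout* alerts for the same executor (equal label assumed)
  - source_matchers: ['alertname="MediaEngineDown"']
    target_matchers: ['alertname=~"Playout.*"']
    equal: ['executor_id']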
Notification Timing:
| Severity | Group Wait | Group Interval | Repeat Interval |
|---|---|---|---|
| Critical | 0s (immediate) | 5m | 30m |
| Warning | 30s | 10s | 6h |
| Info | 5m | 10s | 24h |
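Continuing the route sketch above, these timings map onto per-route settings. For example the critical route; the warning and info routes follow the same pattern with the values from the table:
    - matchers: ['severity="critical"']
      receiver: 'critical'
      group_wait: 0s        # notify immediately
      group_interval: 5m
      repeat_interval: 30m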
Notification Channels
Configure Slack webhook in alertmanager.yml:
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
channel: '#grimnir-alerts-critical'
title: '{{ .GroupLabels.alertname }}'
text: |
{{ range .Alerts }}
*Alert:* {{ .Labels.alertname }}
*Severity:* {{ .Labels.severity }}
*Station:* {{ .Labels.station_id }}
*Description:* {{ .Annotations.description }}
{{ end }}
Create Slack Webhook:
- Go to https://api.slack.com/apps
- Create new app
- Enable Incoming Webhooks
- Add webhook to workspace
- Copy webhook URL
Configure PagerDuty integration:
pagerduty_configs:
- routing_key: 'YOUR_PAGERDUTY_ROUTING_KEY'
description: '{{ .GroupLabels.alertname }}'
severity: '{{ .CommonLabels.severity }}'
Setup PagerDuty:
- Create integration in PagerDuty service
- Select "Prometheus" as integration type
- Copy integration key to routing_key
Configure SMTP settings:
global:
smtp_smarthost: 'smtp.gmail.com:587'
smtp_from: 'alertmanager@grimnir-radio.local'
smtp_auth_username: 'your-email@gmail.com'
smtp_auth_password: 'app-specific-password'
smtp_require_tls: true
receivers:
- name: 'critical'
email_configs:
- to: 'oncall@grimnir-radio.local'
headers:
Subject: '[CRITICAL] {{ .GroupLabels.alertname }}'
Send alerts to a custom endpoint:
webhook_configs:
- url: 'http://localhost:8080/api/v1/alerts/webhook'
send_resolved: true
http_config:
bearer_token: 'your-webhook-token-here'
Webhook Payload Format:
{
"version": "4",
"groupKey": "{}:{alertname=\"MediaEngineDown\"}",
"status": "firing",
"receiver": "critical",
"groupLabels": {
"alertname": "MediaEngineDown"
},
"commonLabels": {
"alertname": "MediaEngineDown",
"severity": "critical",
"executor_id": "exec-123"
},
"commonAnnotations": {
"description": "The media engine gRPC connection is down for executor exec-123",
"summary": "Media Engine disconnected for executor exec-123"
},
"alerts": [
{
"status": "firing",
"labels": {
"alertname": "MediaEngineDown",
"severity": "critical",
"executor_id": "exec-123"
},
"annotations": {
"description": "The media engine gRPC connection is down for executor exec-123",
"summary": "Media Engine disconnected for executor exec-123"
},
"startsAt": "2026-01-22T10:30:00Z",
"endsAt": "0001-01-01T00:00:00Z"
}
]
}
Alert Severity Levels
Critical
Impact: Service degradation or outage
Response Time: Immediate (< 5 minutes)
Notification: Multiple channels (Slack, PagerDuty, Email)
Examples:
- Media engine disconnected
- Audio silence
- No scheduled content
- High API error rate
Action Required: Immediate investigation and resolution
Warning
Impact: Degraded performance, risk of outage
Response Time: Within 30 minutes
Notification: Slack, Email
Examples:
- High playout underruns
- Low buffer depth
- Scheduler errors
- High API latency
Action Required: Investigation during business hours, may escalate to critical
Info
Impact: Informational, no immediate action required
Response Time: Best effort
Notification: Webhook only (logged)
Examples:
- Long-running live session
- Audio loudness outside ideal range
Action Required: Review during normal operations, trend monitoring
Customization
To add custom alert rules, edit deploy/prometheus/alerts.yml:
- name: my_custom_alerts
interval: 30s
rules:
- alert: MyCustomAlert
expr: my_metric > threshold
for: 5m
labels:
severity: warning
component: my_component
annotations:
summary: "Short description"
description: "Detailed description with context"Alert Rule Best Practices:
- Use descriptive alert names (PascalCase)
- Set appropriate for duration to avoid flapping
- Include station_id in labels when applicable
- Provide actionable descriptions
- Use template variables in annotations (see the sketch below)
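A sketch of that last point: annotations can interpolate labels and the measured value. $value renders the result of the alert expression; the wording and label usage here are illustrative:
annotations:
  # illustrative annotation for a buffer alert; station_id comes from the metric's labels
  summary: "Low playout buffer on station {{ $labels.station_id }}"
  description: "Buffer depth is {{ $value }} samples on station {{ $labels.station_id }}, below the 4800-sample threshold"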
Common threshold adjustments:
# Increase buffer low threshold from 4800 to 9600 samples
- alert: PlayoutBufferLow
expr: grimnir_playout_buffer_depth_samples < 9600
for: 30s
# Reduce underrun sensitivity from 0.05 to 0.1/sec
- alert: PlayoutUnderrunHigh
expr: rate(grimnir_playout_dropout_count_total[5m]) > 0.1
for: 2m
# Change audio silence threshold from -60 to -50 dBFS
- alert: AudioSilence
expr: grimnir_media_engine_audio_level_left_db < -50 and grimnir_media_engine_audio_level_right_db < -50
for: 30s
Create template file deploy/alertmanager/templates/custom.tmpl:
{{ define "slack.custom.title" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.alertname }}
{{ end }}
{{ define "slack.custom.text" }}
{{ range .Alerts }}
*Station:* {{ .Labels.station_id }}
*Component:* {{ .Labels.component }}
*Description:* {{ .Annotations.description }}
*Time:* {{ .StartsAt.Format "2006-01-02 15:04:05" }}
{{ end }}
{{ end }}
Reference the templates in alertmanager.yml:
templates:
- '/etc/alertmanager/templates/*.tmpl'
receivers:
- name: 'critical'
slack_configs:
- title: '{{ template "slack.custom.title" . }}'
text: '{{ template "slack.custom.text" . }}'
Troubleshooting
Check Prometheus:
# Verify metrics are being scraped
curl http://localhost:9090/api/v1/targets
# Check if alert rules are loaded
curl http://localhost:9090/api/v1/rules | jq
# Query metric directly
curl 'http://localhost:9090/api/v1/query?query=grimnir_media_engine_connection_status'
Common Issues:
- Metrics endpoint not accessible
- Alert rule syntax error
- Threshold not met
- for duration not elapsed
Check AlertManager:
# View AlertManager status
curl http://localhost:9093/api/v1/status
# View active alerts
curl http://localhost:9093/api/v1/alerts
# View silences
curl http://localhost:9093/api/v1/silences
Common Issues:
- AlertManager not configured in Prometheus (see the sketch after this list)
- Routing rule doesn't match alert labels
- Alert is silenced
- Receiver configuration error (webhook URL, API key, etc.)
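For the first issue, Prometheus needs an alerting block in prometheus.yml pointing at AlertManager. A minimal sketch; the target name assumes the Docker Compose setup from the Quick Start:
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']   # use localhost:9093 outside Docker Compose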
Force an alert to fire:
# Stop media engine to trigger MediaEngineDown
systemctl stop grimnir-media-engine
# Check Prometheus alerts page
# http://localhost:9090/alerts
# Check AlertManager
# http://localhost:9093
Create a test alert rule:
- alert: TestAlert
expr: vector(1)
labels:
severity: warning
annotations:
description: "This is a test alert that always fires"Via AlertManager UI:
- Go to http://localhost:9093
- Click "Silence" button
- Set matcher (e.g., alertname="PlayoutUnderrunHigh")
- Set duration
- Add comment
Via CLI:
amtool silence add alertname=PlayoutUnderrunHigh --duration=2h --comment="Maintenance window"
Via API:
curl -X POST http://localhost:9093/api/v1/silences \
-H "Content-Type: application/json" \
-d '{
"matchers": [{"name": "alertname", "value": "PlayoutUnderrunHigh", "isRegex": false}],
"startsAt": "2026-01-22T10:00:00Z",
"endsAt": "2026-01-22T12:00:00Z",
"createdBy": "admin",
"comment": "Maintenance window"
}'
Create a webhook handler in the Grimnir Radio API:
// internal/api/alerts.go
func (a *API) HandleAlertWebhook(w http.ResponseWriter, r *http.Request) {
var payload AlertManagerPayload
if err := json.NewDecoder(r.Body).Decode(&payload); err != nil {
http.Error(w, "Invalid payload", http.StatusBadRequest)
return
}
for _, alert := range payload.Alerts {
a.logger.Info().
Str("alert", alert.Labels["alertname"]).
Str("status", alert.Status).
Str("severity", alert.Labels["severity"]).
Str("station_id", alert.Labels["station_id"]).
Msg("received alert")
// Store alert in database, trigger remediation, etc.
}
w.WriteHeader(http.StatusOK)
}
Best Practices
- Start with defaults - Use provided alert rules, tune thresholds based on your environment
- Group related alerts - Avoid notification spam by using proper grouping
- Set appropriate for durations - Prevent flapping alerts
- Test notification channels - Verify all receivers work before production
- Document escalation procedures - Define who handles what severity
- Review alerts regularly - Remove noisy alerts, add missing ones
- Use inhibition rules - Prevent cascading notifications
- Monitor AlertManager - Set up alerts for AlertManager itself
- Maintain runbooks - Document remediation steps for each alert
- Silence during maintenance - Use silences for planned work