A small CLI that turns a YAML SLO spec into:
- A Prometheus rules file with multi-window multi-burn-rate alerts (the recipe from Google's SRE workbook chapter 5).
- A Grafana dashboard JSON, one row per SLO with SLI, objective, and error-budget remaining panels.
Drop the generated files into your existing Prometheus + Grafana stack and you're done. No daemon to run, no controller to install, no SaaS.
flowchart LR
Y[spec.yml<br/>SLO definitions] --> CLI[slo-toolkit CLI]
CLI --> P[prometheus rules.yml<br/>recording + multi-window burn-rate alerts]
CLI --> G[grafana dashboard.json<br/>SLI / objective / error budget]
P --> PR[Prometheus]
G --> GF[Grafana]
Sloth and Pyrra do similar work as Kubernetes CRDs. This is a CLI: it emits plain YAML+JSON that lives in your service repo and goes through normal code review. No controller to operate, no SaaS account.
service: payments-api
owner: platform-team
slos:
- name: availability
objective: 99.9
window: 30d
sli:
kind: ratio
good_events: 'rate(http_requests_total{job="payments-api",status!~"5.."}[5m])'
valid_events: 'rate(http_requests_total{job="payments-api"}[5m])'
labels:
tier: critical
pagerduty_service: payments
- name: latency-p95-500ms
objective: 99
window: 28d
sli:
kind: latency_threshold
query: 'histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{job="payments-api"}[5m])))'
threshold_seconds: 0.5pip install -e .
slo-toolkit --spec examples/payments-api.yml --out-dir build/Output:
build/
├── payments-api-slos.yml # drop into prometheus rule_files
└── payments-api-dashboard.json # import into Grafana
Recording rules -- track SLI value and objective for downstream consumers:
- record: slo:sli_value:availability
expr: (rate(...good...) / clamp_min(rate(...valid...), 1))
- record: slo:objective:availability
expr: 99.9Alerting rules -- the multi-window multi-burn-rate pattern:
- alert: SLOFastBurn_availability
expr: burn_rate_over_5m > 14.4 AND burn_rate_over_1h > 14.4
labels: { severity: page, burn_rate: fast }
- alert: SLOSlowBurn_availability
expr: burn_rate_over_30m > 6 AND burn_rate_over_6h > 6
labels: { severity: ticket, burn_rate: slow }The thresholds (14.4 fast, 6 slow) come from the SRE workbook chapter 5 and are encoded as a regression test so they don't drift.
The example output passes promtool check rules cleanly:
$ promtool check rules build/payments-api-slos.yml
SUCCESS: 12 rules foundThat check is part of CI on every push.
- Spec loading: rejects bad windows, out-of-range objectives, invalid names, duplicates, unknown SLI kinds, missing latency thresholds
- Prometheus output: two rule groups per SLO, correct burn-rate thresholds (regression-tested), labels inherited from the spec
- Grafana output: three panels per SLO, unique panel IDs, thresholds reflect the SLI's objective
- CLI: writes both files,
--prom-onlyand--grafana-onlyflags, exit code 2 on bad spec
pytest tests/ -vA few things worth calling out, in case you're considering this for real use.
One config in, two configs out. No state machine, no controller pulling from a CRD. The spec is YAML in your repo. The output is YAML+JSON in your repo. Diffs go through normal review. If the toolkit gets retired, the generated files keep working unchanged.
Multi-window multi-burn-rate, not single-threshold alerts. Single threshold SLO alerts age badly: they spam during minor blips or sleep through real outages. The Google recipe gives you two windows per severity.
Two SLI kinds: ratio + latency_threshold. Covers most of what people actually deploy. Window-based SLIs and bucket-fraction SLIs are not implemented; the SLI dataclass is 30 lines, extend it if needed.
clamp_min(denominator, 1) everywhere. Naive ratio rules divide
by zero when your service has no traffic, and AlertManager treats
the resulting NaN as a flap. Clamp fixes this without distorting
real measurements (1 request vs 1000 requests both yield the right
ratio; 0 requests yields a "perfect" 1.0 instead of an alert storm).
Budget reported in minutes, not just percent. SLO.error_budget_minutes()
returns the absolute downtime budget (e.g. 43.2 minutes for 30d/99.9%).
That is the figure on-callers and incident reviewers think in. "We
have 7 minutes left in this window" reads better than "0.016% remaining".
See docs/PATTERNS.md for ready-to-use SLO
definitions across HTTP availability, latency, background jobs,
webhook delivery, and consumer lag -- with tuning notes for each.
- Burn-rate alert tuning per window length (the workbook has a table for 28d/30d/90d combinations)
- OpenSLO compatibility -- accept the OpenSLO YAML as input too
- Sloth-format export for users already on Sloth
- Datadog monitor JSON output
MIT © 2026 Santiago Arteta
