feat(otel): ClickHouse-first OTEL metrics storage with full-fidelity decode by dviejokfs · Pull Request #158 · gotempsh/temps

dviejokfs · 2026-06-26T01:55:37Z

Summary

Adds a ClickHouse-first, full-fidelity OpenTelemetry metrics path to temps-otel, alongside the existing TimescaleDB one. When TEMPS_CLICKHOUSE_* is configured, OTLP metrics are decoded losslessly and stored in a native metrics ReplacingMergeTree; queries, list_metric_names, and the anomaly-detector helpers run natively against ClickHouse. The default (no-ClickHouse) install and the service_metrics alerting bridge are unchanged.

Context: the premise was "we support traces/spans but not metrics" — in fact metrics were already wired on TimescaleDB but flattened. This PR is the ClickHouse-first leg (CH is far better suited to high-cardinality metric labels + native quantiles), with TimescaleDB parity deferred.

What's included

Decode + types (Phase A): MetricPoint carries temporality, is_monotonic, start_time, exponential-histogram buckets, summary quantiles, exemplars (trace/span), flags, description, and typed labels. Synthetic value=sum/count retained for histograms (the anomaly detector reads it).
Storage (Phase B): new migrations/clickhouse/0003_metrics.sql (metrics table) + native store_metrics, replacing the delegate-to-Timescale stubs. ChMetricRow field order is guarded against the RowBinary positional-serialization landmine by a DDL-parsing unit test.
Query (Phase C): native query_metrics (time-bucket, label filters, group-by, avg/sum/min/max/count/rate/quantile), parameterized + allowlisted (no injection surface). Store-neutral DTOs frozen so TimescaleDB can later satisfy the same contract.
Frontend (Phase D, scaffold): Metrics.tsx / MetricsExplorer.tsx mirroring Traces, wired to the generated SDK (no hand-rolled fetch), routed under the project + a sidebar entry.

Verification (live ClickHouse, not skipped)

cargo test -p temps-otel: 263 lib + 4 CH-storage + 1 decode→store fidelity + 7 e2e + 15 TimescaleDB — all pass. The ClickHouse tests genuinely start a container and assert store→query→read-back of temporality, is_monotonic, histogram buckets, labels, and multi-series grouping. cargo check --lib -p temps-otel -p temps-migrations is clean.

Bugs found by the live tests + adversarial review (all fixed)

Data loss (critical): ReplacingMergeTree ORDER BY excluded labels and was second-resolution → distinct label-series (e.g. http.method=GET vs POST) silently collapsed into one row. Fixed via a MATERIALIZED label-hash + full timestamp in ORDER BY.
Query type error: toUnixTimestamp64Milli(toStartOfInterval(...)) (DateTime, not DateTime64). Fixed to toInt64(toUnixTimestamp(...))*1000.
RowBinary read mismatch: min/max/sum/quantile over Nullable(Float64) → wrapped in assumeNotNull.
Test infra: the ClickHouse testcontainer wait strategy (message_on_stdout) never matched and the tests silently skipped (false green); switched to an HTTP /ping wait (http_wait_plain) + non-empty CH password.
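
The data-loss fix above hinges on an order-independent fingerprint of the label set. A minimal sketch of that idea, assuming std's `DefaultHasher` as a stand-in for ClickHouse's sipHash64 (the hash values will not match the server's) and a hypothetical `series_key` helper that mirrors the shape of the fixed ORDER BY:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Order-independent fingerprint of a label set: sort first, then hash.
/// Assumption: DefaultHasher stands in for ClickHouse's sipHash64.
pub fn label_fingerprint(labels: &[(&str, &str)]) -> u64 {
    let mut sorted: Vec<(&str, &str)> = labels.to_vec();
    sorted.sort_unstable();
    let mut hasher = DefaultHasher::new();
    for (key, value) in &sorted {
        key.hash(&mut hasher);
        value.hash(&mut hasher);
    }
    hasher.finish()
}

/// Hypothetical dedup key: metric name, label fingerprint, full-precision timestamp.
pub fn series_key(metric: &str, labels: &[(&str, &str)], ts_nanos: i64) -> (String, u64, i64) {
    (metric.to_string(), label_fingerprint(labels), ts_nanos)
}
```

With a key of this shape, `http.method=GET` and `http.method=POST` at the same timestamp produce different keys, so a replacing merge no longer treats them as duplicates of one row.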

Deferred (tracked follow-ups, not in this PR)

rate() does not branch delta-vs-cumulative (treats all as max−min).
Histogram quantiles currently return the percentile of synthetic per-point means (histogram_summary = None) — misleading; no caller warning yet.
Exp-histogram / summary / exemplar Array(Tuple(...)) columns are only unit-tested, never inserted into live ClickHouse.
Frontend needs an SDK regen against a running server for the richer query params + exemplar→trace links; Observe metric kind is off until then.
Phase E — TimescaleDB parity for the new fidelity (default-install path).

Note

A pre-existing sibling has the same silent-skip testcontainer wait bug: temps-analytics-events/src/services/clickhouse_backend.rs — its ClickHouse integration tests have also never actually run. Out of scope here; fix tracked separately.

🤖 Generated with Claude Code

…decode

Add a native ClickHouse metrics path alongside the existing TimescaleDB one. When TEMPS_CLICKHOUSE_* is configured, OTLP metrics are decoded at full fidelity (temporality, monotonicity, explicit/exponential histograms, summaries, exemplars, typed labels) and stored in a native `metrics` MergeTree; query_metrics/list_metric_names and the anomaly-detector helpers run natively against it. The service_metrics alerting bridge and the default (no-ClickHouse) TimescaleDB path are unchanged.

Proven end-to-end against a live ClickHouse testcontainer: store -> query -> read-back round-trips with fidelity (4 CH storage tests + 1 decode->store fidelity test + 263 lib unit tests + 7 e2e + 15 timescale, all green).

Correctness fixes uncovered by the live integration tests:
- ReplacingMergeTree ORDER BY now includes a MATERIALIZED label fingerprint (sipHash64 of the sorted label set) and a full-precision timestamp. Previously distinct label-series sharing a coarse per-second timestamp silently collapsed into one row (data loss across series).
- Bucket expression uses toInt64(toUnixTimestamp(...))*1000; the prior toUnixTimestamp64Milli() over a DateTime was an illegal-type error.
- Aggregates wrap value in assumeNotNull so Nullable(Float64) results match the f64 row read (RowBinary type-width mismatch otherwise).

Test infrastructure: fix the ClickHouse testcontainer wait strategy (HTTP /ping via the http_wait_plain feature) and credentials so the integration tests actually execute instead of silently skipping.

Frontend: scaffold a metrics explorer page (mirrors Traces) wired to the generated SDK; the richer Phase C query params and the Observe metric kind await an SDK regen against a running server.

Deferred follow-ups: rate() delta-vs-cumulative handling, histogram quantile reconstruction, live-CH coverage for exp-histogram/summary/exemplar columns, and TimescaleDB parity for the new fidelity.

- rate() now honours aggregation temporality: DELTA series sum their per-interval increments while CUMULATIVE counters use the within-bucket (max - min). Previously every series used max-min, which undercounts the rate of delta-temporality counters.
- query_metrics now populates histogram_summary with the explicit bucket layout (bounds + element-wise-summed bucket_counts) alongside count/sum/min/max, so histogram metrics return their real distribution instead of only a misleading synthetic mean. This enables correct quantile reconstruction from the buckets.

Both behaviours are covered by new live-ClickHouse integration tests (query_metrics_rate_respects_temporality, histogram_summary_aggregates_buckets).
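
The temporality branch can be sketched in plain Rust. This is an illustration of the rule described above, not the SQL the PR ships; `Temporality` and `bucket_increase` are hypothetical names, and dividing the increase by the bucket width to get a per-second rate is left out because the source does not say how the division is done.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Temporality {
    Delta,
    Cumulative,
}

/// Increase observed inside one time bucket for a single series.
/// DELTA points are per-interval increments, so they are summed.
/// CUMULATIVE points are running totals, so the increase is max - min.
pub fn bucket_increase(values: &[f64], temporality: Temporality) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    Some(match temporality {
        Temporality::Delta => values.iter().sum(),
        Temporality::Cumulative => {
            let max = values.iter().copied().fold(f64::NEG_INFINITY, f64::max);
            let min = values.iter().copied().fold(f64::INFINITY, f64::min);
            max - min
        }
    })
}
```

For delta increments `[5, 7, 3]` the increase is 15, whereas applying max - min to the same points gives 4, which is the undercount the commit describes.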

…plar columns

Adds a decode->store->raw-read test that inserts a metric carrying the nested Array(Tuple(...)) columns (exponential-histogram bucket counts, summary quantiles, and exemplars with trace/span ids) into a live ClickHouse container and asserts they survive. These RowBinary nested-tuple codepaths were previously only unit-tested at the row-mapping tier, never against a real ClickHouse server.

…e-count)

histogram_summary previously summed raw histogram_count / bucket_counts across all rows in a window. For CUMULATIVE histograms re-exported multiple times (the OTLP default), each export is a running total, so summing them multiplied the counts by the number of exports; a live demo showed count=300 for 50 observations exported 6 times.

Compute histogram_summary from a per-series sub-aggregation: each series (attributes_hash) is collapsed first (CUMULATIVE -> latest snapshot via argMax/max; DELTA or unspecified -> sum across the window), then summed across series up to the requested grouping granularity, matched back to the scalar rows by (bucket_ms, series_values). Scalar/quantile aggregations are unchanged.

Covered by a new live-ClickHouse test: cumulative re-exports collapse to per-series latest, then sum across series.
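
The two-step collapse can be sketched in memory as follows. This is only an illustration of the rule, not the ClickHouse sub-aggregation itself; `HistogramRow` and `window_count` are hypothetical names, `series` stands in for attributes_hash, and only the count is aggregated (bucket_counts would be handled element-wise in the same way).

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Temporality {
    Delta,
    Cumulative,
    Unspecified,
}

pub struct HistogramRow {
    pub series: u64, // stands in for attributes_hash
    pub ts_ms: i64,
    pub temporality: Temporality,
    pub count: u64,
}

/// Collapse each series first, then sum across series.
pub fn window_count(rows: &[HistogramRow]) -> u64 {
    // series -> (latest timestamp seen, collapsed count)
    let mut per_series: HashMap<u64, (i64, u64)> = HashMap::new();
    for row in rows {
        let entry = per_series.entry(row.series).or_insert((i64::MIN, 0));
        match row.temporality {
            // Running totals: keep only the latest snapshot.
            Temporality::Cumulative => {
                if row.ts_ms >= entry.0 {
                    *entry = (row.ts_ms, row.count);
                }
            }
            // Per-interval increments: add them up across the window.
            Temporality::Delta | Temporality::Unspecified => {
                entry.0 = entry.0.max(row.ts_ms);
                entry.1 += row.count;
            }
        }
    }
    per_series.values().map(|&(_, count)| count).sum()
}
```

Six cumulative exports of one series, each reporting a running total of 50, collapse to 50 here instead of the naive sum of 300.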

Regenerate the SDK against the Phase C backend (typed aggregation / metric_type / label_filters params; histogram_summary, quantiles, series_key on MetricBucket).

MetricsExplorer:
- Default view now shows ALL metrics as an overview grid (one mini chart per metric with its latest value); click a card to drill into the detailed view.
- Send the real aggregation + label_filters params. For histogram metrics, percentiles are computed client-side from the histogram_summary buckets (the backend scalar quantile runs over the synthetic mean).
- Add a histogram Distribution panel (count/mean/p50-p99 + per-bucket bars).
- Fix an infinite refetch loop: end_time used new Date() every render, changing the query key each render; the time bounds are now memoized.
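
Reconstructing a percentile from explicit buckets is commonly done by linear interpolation inside the bucket that contains the target rank. The sketch below shows that technique in Rust for consistency with the backend, although the explorer itself is TypeScript. `histogram_quantile` here is a hypothetical helper, and two conventions are assumptions rather than facts from the PR: the first bucket is taken to start at 0, and a rank that lands in the unbounded overflow bucket is clamped to the last finite bound.

```rust
/// Estimate quantile `q` (0.0..=1.0) from an explicit-bucket histogram.
/// `bounds` has n upper bounds; `counts` has n + 1 entries, the last one
/// being the overflow bucket above the final bound.
pub fn histogram_quantile(q: f64, bounds: &[f64], counts: &[u64]) -> Option<f64> {
    if !(0.0..=1.0).contains(&q) || bounds.is_empty() || counts.len() != bounds.len() + 1 {
        return None;
    }
    let total: u64 = counts.iter().sum();
    if total == 0 {
        return None;
    }
    let rank = q * total as f64;
    let mut seen = 0u64;
    for (i, &count) in counts.iter().enumerate() {
        let next = seen + count;
        if count > 0 && rank <= next as f64 {
            if i == bounds.len() {
                // Overflow bucket has no upper bound: clamp (assumption).
                return Some(bounds[bounds.len() - 1]);
            }
            // Assumption: the first bucket starts at 0.
            let lower = if i == 0 { 0.0 } else { bounds[i - 1] };
            let upper = bounds[i];
            let fraction = (rank - seen as f64) / count as f64;
            return Some(lower + (upper - lower) * fraction);
        }
        seen = next;
    }
    None
}
```

With bounds `[10, 20, 30]` and counts `[10, 20, 10, 0]`, the median rank is 20 of 40, which falls halfway through the second bucket and yields 15.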

Per-project saved dashboards persisted in Postgres (metric_dashboards table) as a typed JSON layout of sections -> metric tiles. Full CRUD under /api/otel/dashboards following Handler->Service->Data:
- temps-entities: metric_dashboards entity + migration.
- temps-otel: MetricDashboardService (CRUD, typed DashboardLayout/Section/Tile, aggregation + size/length-bounds validation), handlers (utoipa, permission_guard OtelRead/OtelWrite, audit-logged writes), routes + OtelApiDoc registration, DashboardNotFound -> 404.
- get/update/delete are scoped by project_id (defense-in-depth against cross-tenant IDOR: a mismatched project_id returns 404, never another project's dashboard). A corrupt stored layout is logged, not silently swallowed.
- 8 service tests (CRUD + validation + pagination cap).

Frontend: Dashboards list, dashboard view (sections of metric-chart tiles, reusing the metrics-explorer chart + client-side histogram percentiles, with memoized time bounds), and a builder (add/rename sections, add tiles by metric + aggregation, save). SDK regenerated.

Verified end-to-end against a running server + ClickHouse: CRUD round-trips the layout losslessly, IDOR is blocked (404 on mismatched project_id), and the UI renders saved dashboards with live charts.

Follow-ups: domain-prefix the nested layout schema names (OtelDashboardLayout etc.) and add handler-layer 401/403 tests.

Drop the mx-auto/max-w container constraint on the metrics explorer, dashboards list, dashboard view, and builder so they use the full content width (e.g. ~1600px instead of a centered 1152px on wide screens) — more columns in the metric overview grid and more room for tile charts.

Collapse the separate "OTel Metrics" and "Dashboards" sidebar entries into a single Metrics surface with a route-backed segmented control:
- Explore (index, /metrics): the all-metrics overview + per-metric drill-in.
- Dashboards (/metrics/dashboards/*): the saved-dashboards list/view/builder, nested unchanged so their relative navigation is preserved.

/dashboards/* now redirects to /metrics/dashboards. One nav entry; explore and curate live in the same place.

Add first-class threshold alerting on OpenTelemetry metrics. Alert rules attach directly to a metric (not to a dashboard), so the metric is the source of truth and any surface (explorer, dashboards) merely displays them.

Backend (temps-otel):
- metric_alert_rules entity + migration (project-scoped: name, metric_name, aggregation, comparator, threshold, window/for-duration, severity, enabled, last_state/last_value for firing-state tracking).
- MetricAlertService: project-scoped CRUD with IDOR-safe by-id access (get/update/delete 404 on project mismatch), threshold finiteness + bounds validation, paginated list.
- MetricAlertEvaluator: background tokio-interval evaluator. Queries the latest closed bucket (limit 2 to skip the in-progress one), derives the rule value per aggregation (incl. client-side histogram_quantile for percentile rules), runs a for-duration state machine, and fires/resolves through the existing temps_monitoring AlarmService so alerts reuse configured notification channels. No-data ticks preserve prior state.
- CRUD handlers under /otel/alerts with audit logging.

Frontend (web):
- Alerts tab in the unified Metrics surface (Explore | Dashboards | Alerts).
- MetricAlerts list with firing-state badge + one-line rule summary, MetricAlertForm create/edit, AlertsRouter.
- Explorer overlays a rule's threshold as a reference line on the metric chart (critical=poor tone, else warn).
- SDK regen for the new /otel/alerts endpoints.

Verified end-to-end against a live ClickHouse-backed server: CRUD, IDOR (get/delete 404 on wrong project), threshold validation, and a breaching rule transitioning unknown→firing with the alarm actually fired. 18 unit tests pass; clippy -D warnings clean.
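
A for-duration state machine of the kind described for the evaluator can be sketched as below. `RuleTracker`, `RuleState`, and `Transition` are hypothetical names, and the exact semantics are assumptions: a rule fires once it has been breaching continuously for at least `for_secs`, a non-breaching tick resolves it, and a no-data tick (`None`) leaves both the state and the breach timer untouched.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum RuleState {
    Unknown,
    Ok,
    Firing,
}

#[derive(Debug, PartialEq)]
pub enum Transition {
    Unchanged,
    Fire,
    Resolve,
}

pub struct RuleTracker {
    pub state: RuleState,
    breach_start: Option<u64>,
    for_secs: u64,
}

impl RuleTracker {
    pub fn new(for_secs: u64) -> Self {
        Self { state: RuleState::Unknown, breach_start: None, for_secs }
    }

    /// `breaching` is `None` when the tick produced no data.
    pub fn tick(&mut self, now_secs: u64, breaching: Option<bool>) -> Transition {
        match breaching {
            // No data: preserve the prior state and the breach timer.
            None => Transition::Unchanged,
            Some(false) => {
                self.breach_start = None;
                let was_firing = self.state == RuleState::Firing;
                self.state = RuleState::Ok;
                if was_firing { Transition::Resolve } else { Transition::Unchanged }
            }
            Some(true) => {
                // Start the timer on the first breaching tick.
                let start = *self.breach_start.get_or_insert(now_secs);
                let held = now_secs.saturating_sub(start) >= self.for_secs;
                if held && self.state != RuleState::Firing {
                    self.state = RuleState::Firing;
                    Transition::Fire
                } else {
                    Transition::Unchanged
                }
            }
        }
    }
}
```

With `for_secs = 0` the first breaching tick moves a rule straight from unknown to firing, which matches the transition reported in the verification above. A caller would raise the alarm on `Fire` and clear it on `Resolve`.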

…ules

Reshape metric_alert_rules so future detector families (anomaly, EWMA, forecast, outlier, Watchdog-style auto-watch) ship as code-only changes, never another migration. Done now, before the table merges, so there is no backfill or transition dance.

Schema: replace the static-only `comparator`/`threshold` columns with a coarse `detection_kind` string discriminator + a typed-in-Rust, jsonb `detection_config`. The cross-cutting eval envelope (metric, aggregation, window, for-duration, severity, enabled, last_* state) stays typed columns; only the detector-specific knobs move into the blob. Folded directly into the in-flight create migration (zero rows, no ALTER).

detection_config is a serde internally-tagged enum (`DetectionConfig`) in the new `temps_otel::detectors` module, copied verbatim from the sanctioned `ProviderConfig`/`revenue_integrations.config` precedent: NO `#[schema(discriminator)]` (a compile error with serde(tag) in utoipa 5.5.0). The raw `serde_json::Value` lives only on the sea-orm column; every service and DTO layer is fully typed, so the generated SDK is a usable TS discriminated union (`(StaticParams & { kind: 'static' }) | …`), not `any`.

Today only `static` is evaluable; anomaly/forecast/outlier/auto_watch are typed, schema-present stubs that validate() rejects at create time. Enabling each later = a new validate arm + evaluator branch + openapi-ts regen. `detection_kind` is a plain string (not a PG enum) so new kinds need no ALTER TYPE either.

Evaluator now decodes the typed detector and branches on it (static = `Comparator::breaches`); the bad-input surface moves to serde (unknown kind / bad comparator / missing threshold -> 422 at deserialize) which is stronger than the old string allowlists.

Frontend maps the static form to/from detection_config and the explorer/list narrow on `kind`.

Verified live (ClickHouse-backed): create/get round-trip the typed config through jsonb; anomaly -> 400 (not yet supported); bad input -> 422; the static evaluator still fires. 25 unit tests pass; clippy -D warnings clean; SDK regenerated; frontend typechecks.

Make the `anomaly` detector evaluable end to end; it was a typed, creation-rejected stub. A rule now learns a baseline band from history and fires when the current value deviates from it, reusing the same for-duration state machine and AlarmService as static threshold rules.

Detection math (pure, unit-tested) in `detectors`:
- robust_band = median + MAD·1.4826 (consistent-with-σ scale).
- anomaly_breaches = direction-aware (above/below/both) z-score test, with a MIN_BAND_SCALE floor so a flat baseline can't divide by zero.
- season_cell buckets a timestamp into none/hourly/daily/weekly cells.
- validate() now accepts anomaly (robust/basic); agile/ewma and bad hyperparameters (deviations≤0, pct∉(0,1], lookback∉1..=90) are rejected.

Evaluator branch (`metric_alert_evaluator`):
- Baseline fetched via the SAME query_metrics aggregation path as the scored point (so counter-rate / histogram-percentile compare like-for-like; NOT get_metric_baseline, which bypasses aggregation), cached per rule for 1h.
- Seasonal-cell filter with cold-start fallback to the global band; an insufficient (<8 samples) or degenerate (flat) baseline PRESERVES state rather than firing, so there are no spurious alerts on thin history.
- fire() refactored to a detector-agnostic FireDetails (static vs anomaly message/metadata, e.g. "820ms is 4.2σ from the baseline 210 ± 90").
- run_cycle prunes breach-timer + baseline caches for disabled/deleted rules (also fixes a pre-existing breach_start leak).

Latent bug fixed (affected static AND anomaly): translate_bucket_interval only accepted space-separated forms, so the evaluator's `format!("{}s", secs)` ("300s") silently fell back to INTERVAL 1 HOUR: every windowed query was coarsened to hourly regardless of window_secs. Now also parses the compact "300s"/"5m"/"1h"/"2d"/"1w" form.

Frontend: the alert form authors anomaly rules: a Detection selector swaps the static comparator/threshold for algorithm / sensitivity (σ) / direction / seasonality; the list summary reads the typed config.

Verified live (ClickHouse-backed): anomaly create accepted (was 400); insufficient baseline preserves state; a normal value stays ok (no false positive); an injected spike (100000 vs a ~100±15 band) transitions to firing and raises an alarm. 31 unit tests pass; clippy -D warnings clean; frontend typechecks; the form renders anomaly fields.
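
The interval-parsing bug is easiest to see with a parser that accepts both spellings. The sketch below is not `translate_bucket_interval` itself: `parse_interval_secs` is a hypothetical helper that returns seconds instead of a ClickHouse INTERVAL expression, and the exact set of long unit names it accepts is an assumption.

```rust
/// Parse "300s", "5m", "1h", "2d", "1w" as well as spaced forms such as
/// "5 minutes" into a number of seconds. Returns None for anything else,
/// so the caller has to decide on a fallback explicitly.
pub fn parse_interval_secs(input: &str) -> Option<u64> {
    let s = input.trim();
    // Split at the first non-digit character; a bare number has no unit.
    let split = s.find(|c: char| !c.is_ascii_digit())?;
    let (digits, unit) = s.split_at(split);
    let amount: u64 = digits.parse().ok()?;
    if amount == 0 {
        return None;
    }
    let per_unit: u64 = match unit.trim().to_ascii_lowercase().as_str() {
        "s" | "second" | "seconds" => 1,
        "m" | "minute" | "minutes" => 60,
        "h" | "hour" | "hours" => 3_600,
        "d" | "day" | "days" => 86_400,
        "w" | "week" | "weeks" => 604_800,
        _ => return None,
    };
    amount.checked_mul(per_unit)
}
```

Here `"300s"` and `"5 minutes"` both resolve to 300 seconds. Returning `None` for unknown input, rather than a default interval, is one way to keep a mismatch like the one described above from going unnoticed.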

Anomaly detection was exposed but not explained. Three fixes:
- History/eligibility banner: when a metric is picked for an anomaly rule, the form checks how much history it has and warns if it's under ~14 days, spelling out that the rule will sit at "unknown" and not alert until a baseline can be built (the silent-inert trap). The "Unknown" badge now also carries a tooltip explaining the same.
- Sensitivity presets: the raw σ number is replaced with High/Medium/Low presets (2/3/4σ); the exact σ stays available under "Custom".
- Advanced disclosure: algorithm + seasonality (sensible defaults most users won't touch) move behind an "Advanced" details block, leaving Sensitivity + Direction as the two primary knobs.

…tion)

Editing an alert rule showed empty Aggregation + Detection selects and fell back to the static fields, regardless of the saved rule. Root cause: the form was created with placeholder defaults while the rule loaded, then updated via react-hook-form's `values` prop, and that reset drops Radix Select values that *change* during it (same-value selects like severity were unaffected, masking the bug).

Fix: load the rule in a thin parent, then mount the form body once (keyed on the rule id) with the resolved `defaultValues` from the start, so no Select is reset post-mount.

Verified: editing the anomaly rule now restores aggregation (max), detector (anomaly), sensitivity, direction, seasonality.

Add a read-only preview/backtest so an anomaly rule is legible before you save it (and tunable after). The single highest-value affordance for a feature that otherwise fails silently.

Backend:
- Extract a shared `BandModel` in `detectors` (per-seasonal-cell robust bands + global fallback, built once and queried per timestamp). The evaluator's `anomaly_eval` now uses it too, so the preview can never diverge from what production would actually do.
- New `services::anomaly_preview` + `POST /otel/alerts/preview`: replays a metric over a range against the band and returns per-bucket {value, lower, upper, breaching} + breach_count + baseline sufficiency, through the SAME query_metrics aggregation path as the evaluator.

Frontend:
- `AnomalyBacktest`: for an anomaly rule, calls the endpoint (debounced) and shows "would have fired N× in the last 7 days" plus a chart of the value against its expected band with breach markers, or an explicit "not enough history" state when the baseline is thin.
- SDK regen for the new endpoint.

Verified live: backtest of the seeded anomaly rule reports 2 breaches over 7d with band [55.5, 144.5] (median 100 ± 3·MAD); UI renders the count + chart. 31+ unit tests pass (incl. new BandModel test); clippy clean.
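
The reported band can be reproduced by hand: with a median of 100, a MAD of 10, and the 1.4826 scale factor, three deviations give 100 ± 44.478, which rounds to [55.5, 144.5]. A minimal sketch of that arithmetic follows; `Band` and this `robust_band` signature are hypothetical, the sample values are made up to produce a MAD of 10, and the seasonal cells, direction handling, minimum-sample check, and MIN_BAND_SCALE floor from the PR are not modelled.

```rust
const MAD_TO_SIGMA: f64 = 1.4826;

pub struct Band {
    pub center: f64,
    pub scale: f64,
}

fn median(sorted: &[f64]) -> f64 {
    let n = sorted.len();
    if n % 2 == 1 {
        sorted[n / 2]
    } else {
        (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0
    }
}

/// Median as the centre, MAD * 1.4826 as a sigma-consistent scale.
pub fn robust_band(samples: &[f64]) -> Option<Band> {
    if samples.is_empty() || samples.iter().any(|v| !v.is_finite()) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let center = median(&sorted);
    let mut deviations: Vec<f64> = sorted.iter().map(|v| (v - center).abs()).collect();
    deviations.sort_by(|a, b| a.total_cmp(b));
    Some(Band { center, scale: median(&deviations) * MAD_TO_SIGMA })
}

impl Band {
    /// Lower and upper bound at `deviations` scaled MADs from the centre.
    pub fn bounds(&self, deviations: f64) -> (f64, f64) {
        let half_width = deviations * self.scale;
        (self.center - half_width, self.center + half_width)
    }

    /// Two-sided test only (the "both" direction).
    pub fn breaches(&self, value: f64, deviations: f64) -> bool {
        let (lower, upper) = self.bounds(deviations);
        value < lower || value > upper
    }
}
```

For the samples `[80, 90, 90, 100, 100, 100, 110, 110, 120]` the band at three deviations is approximately [55.52, 144.48], so 120 stays inside it while a spike of 100000 breaches.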

For an enabled anomaly rule on the charted metric, backtest its band over the visible range (the same preview endpoint the form uses) and shade the expected [lower, upper] region behind the line via a recharts ReferenceArea. Only shown when the explorer's aggregation matches the rule's (so the band sits on the same scale as the line) and the baseline is sufficient. ThresholdLineChart gains an additive `bands` prop; existing threshold lines are unchanged.

Turn the metrics pages from a neutral data browser into a problem surface: the system finds what's wrong and the user reads the answer.
- Health header pinned above the tabs (Metrics.tsx): triaged status: firing alerts + active anomalies worst-first, with Alert/Warn/No-data/OK status dots and a firing count on the Alerts tab. Honest coverage states: "all systems healthy" vs "nothing is being watched yet" vs "couldn't load", never false-green when nothing is monitored.
- Status dots + toned line on overview cards (MetricsExplorer) and dashboard tiles (MetricTile): a firing metric reds out of a wall of green instead of looking healthy. Join keyed on (metric_name, aggregation) so a tile only reds for a rule that targets the series it shows. Redundant encoding (dot + tone + chip), never hue alone.
- Severity sort: overview grid and the alerts list float worst-first (alert → warn → no-data → ok), so the 24-tile cap and the alert list stop hiding the broken thing.
- Shared alert-status model (one cached listAlerts fetch) reused by all three.

Fixes two real token bugs found in the review:
- Added --success/--warning theme tokens (+ @theme mapping); badge.tsx used bg-success/bg-warning with no token, so the "OK"/healthy badge rendered with no background.
- AnomalyBacktest used hsl(var(--primary)), but the tokens are oklch(), so the band/line/breach dots painted transparent. Use bare var(--chart-1)/var(--destructive) like the working chart.

Frontend-only; typechecks clean.

"Did a deploy cause this?" is the first triage question, and Temps owns the deploy pipeline — a structural edge over Datadog. Overlay deploy events that fall inside the chart's visible window as distinct (purple, dashed) vertical markers, snapped to the nearest bucket (the categorical x-axis can't take a raw timestamp), labelled with the short commit hash. Scoped to the selected environment; timestamps normalised (sec or ms). ThresholdLineChart gains an additive `markers` prop (vertical ReferenceLine); existing lines unchanged.

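Snapping a deploy event to a categorical x-axis comes down to normalising the timestamp and picking the nearest bucket. The sketch below shows that logic in Rust for consistency with the backend, although the chart code is frontend TypeScript. `to_millis` and `snap_to_bucket` are hypothetical helpers, the seconds-versus-milliseconds cutoff of 10^11 is an assumed heuristic, and the visible window is taken to span the first to the last bucket start.

```rust
/// Normalise an epoch timestamp to milliseconds.
/// Assumption: magnitudes below 10^11 are seconds (10^11 ms is in 1973,
/// 10^11 s is thousands of years away), anything larger is already ms.
pub fn to_millis(ts: i64) -> i64 {
    if ts.abs() < 100_000_000_000 { ts * 1000 } else { ts }
}

/// Index of the bucket whose start is nearest to the deploy time, or None
/// when the deploy falls outside the visible window or there are no buckets.
/// `bucket_starts_ms` is expected to be sorted ascending.
pub fn snap_to_bucket(deploy_ts: i64, bucket_starts_ms: &[i64]) -> Option<usize> {
    let first = *bucket_starts_ms.first()?;
    let last = *bucket_starts_ms.last()?;
    let ts = to_millis(deploy_ts);
    if ts < first || ts > last {
        return None;
    }
    bucket_starts_ms
        .iter()
        .enumerate()
        .min_by_key(|&(_, &start)| (start - ts).abs())
        .map(|(index, _)| index)
}
```

The returned index can then be mapped to the bucket's axis label, which is what a vertical reference line on a categorical axis needs.
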
…(Tier 2) When a metric looks wrong the next question is "what else moved in this same window?" — and Temps owns metrics, deploys, traces, and errors, so the answer shouldn't require re-pivoting four tools by hand. Under the detail chart, a "related signals" strip: - Frames itself by live state: "This metric is firing — see what else changed" when a rule on this (metric, aggregation) is firing, else a neutral "What changed in this window". Reuses the cached listAlerts via useAlertStatus — no extra fetch. - Leads with the deploy answer ("1 deploy landed here — marked on the chart above", or "No deploys in this window — rules out a release"), the literal thing the chart's deploy markers visualise. - Deep-links to Traces and Errors pre-scoped to the SAME window: Traces gets range+env (which it already honours); Errors learns to read `?range=` so the jump actually lands on that window (it widens the metrics-only 6h to 24h rather than ignore the intent). Plus a "Live view" jump to /observe. Honesty: every link carries params the target page genuinely applies, and the strip only states what it knows (deploys) — it never bluffs a trace/error count it didn't fetch. Verified live: firing CPU-anomaly drill-in shows the firing header + "1 deploy", and /errors?range=1h lands on "the last hour". All curated-lucide icons (Network/Bug/Eye/Rocket/ArrowUpRight) confirmed in the runtime bundle — the subset excludes ListTree/Telescope.

A long project name in the breadcrumb switcher (and long crumbs on deep paths like errors/<long-title>) wrapped the header to two lines because shadcn's BreadcrumbList is flex-wrap + break-words. Force the list to flex-nowrap with a min-w-0 ancestor chain, truncate + responsively cap every crumb (switcher label, intermediate links, and the current-page crumb — not just the switcher, or removing the wrap escape-hatch would overflow the terminal crumb under the action cluster), and clip at the breadcrumb boundary (overflow-hidden) with the right-hand action cluster pinned shrink-0. Verified by measurement: with every crumb forced to a 393px label, the breadcrumb stays 1 line and the header 64px at both 1280px and 375px, each crumb ellipsizes, and the breadcrumb never overlaps the action cluster.

Make "is anything on fire here?" answerable at a glance. A dashboard's status is the worst alert-rule status across the metrics its tiles plot, derived from the same cached listAlerts the per-tile dots already use, so no extra fetch and the signals can't disagree. The dashboards list shows a pulsing status dot + "N firing" per row; the dashboard view shows a firing badge in the header and a per-section count; "All clear" appears only when tiles are actually watched, nothing when no rule covers the dashboard (no vanity green).

Hardened per adversarial review:
- rollupStatus counts DISTINCT firing rules (Set of rule id), not tiles, so three tiles plotting one firing metric read "1 firing", not "3" (the name-fallback would otherwise map all three to the same rule).
- ruleStatus + the firing/gathering lists now treat a disabled rule as not-firing: the backend freezes a disabled rule's last_state, so without this a monitor switched off mid-firing flashed a false red alarm. Fixed at the source, so the alerts/health surfaces benefit too.
- the section "N firing" carries a severity title (no color-only meaning).

Verified live: toggling the rule's `enabled` flips the dashboard between "1 firing" and "All clear" while last_state stays frozen-firing.
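
The rollup rules above can be sketched as a small pure function. This is a Rust illustration of logic that lives in the TypeScript frontend; `AlertRule`, `Rollup`, and `rollup_status` are hypothetical names, rules are matched to tiles by metric name only (the real join also considers aggregation), and treating a disabled rule as coverage that yields "All clear" is an assumption inferred from the live verification described above.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
pub enum Rollup {
    /// No rule covers any tile: show nothing rather than a vanity green.
    Unwatched,
    AllClear,
    /// Number of DISTINCT firing rules, not tiles.
    Firing(usize),
}

pub struct AlertRule<'a> {
    pub id: u64,
    pub metric_name: &'a str,
    pub enabled: bool,
    pub last_state: &'a str,
}

/// A disabled rule keeps its frozen last_state, so it must not count as firing.
fn is_firing(rule: &AlertRule<'_>) -> bool {
    rule.enabled && rule.last_state == "firing"
}

pub fn rollup_status(tile_metrics: &[&str], rules: &[AlertRule<'_>]) -> Rollup {
    let mut watched = false;
    let mut firing_ids: HashSet<u64> = HashSet::new();
    for metric in tile_metrics {
        for rule in rules.iter().filter(|r| r.metric_name == *metric) {
            watched = true;
            if is_firing(rule) {
                firing_ids.insert(rule.id);
            }
        }
    }
    if !firing_ids.is_empty() {
        Rollup::Firing(firing_ids.len())
    } else if watched {
        Rollup::AllClear
    } else {
        Rollup::Unwatched
    }
}
```

Three tiles that all plot the same firing metric yield `Firing(1)`, and switching that rule off yields `AllClear` even though its `last_state` still reads "firing".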

The project "Observe" sidebar group mixed OpenTelemetry signals with operational monitoring and carried a legacy "Metrics" (resource monitoring) entry that duplicated and added nothing over the OTel Metrics page. Split it:

OpenTelemetry: Observe · Traces · AI Traces · Metrics · Error Tracking
Monitoring: Uptime · Request Logs · AI Crawlers

- "OTel Metrics" → "Metrics" (it's the only metrics surface now); "All events" → "Observe".
- Removed the legacy project Metrics: dropped the nav entry, the `monitoring` route, and the ProjectMonitoring page component (used nowhere else).
- Command palette: repointed its dead "Metrics" → /metrics and added "Observe", so removing the route leaves no broken command.

The PR's Changelog Check requires CHANGELOG.md to carry an [Unreleased] entry. Document the OTel metrics feature set: explorer, dashboards, alert rules with anomaly detection + backtest, deploy markers, cross-signal links, Datadog-style firing status, the OpenTelemetry/Monitoring nav grouping, the one-line header fix, the disabled-rule firing fix, and the legacy Metrics page removal.

Filtering traces by service threw ClickHouse Code 184 ILLEGAL_AGGREGATION: the trace-summary SELECT aliases `argMax(service_name, …) AS service_name`, which shadows the raw column, so an unqualified `service_name = ?` in WHERE resolved to the aggregate alias. Qualify it as `spans.service_name` so it binds the per-span column. Fixes both query_trace_summaries (Traces) and query_genai_trace_summaries (AI Traces); the count mirrors are qualified too to keep their filter SQL byte-identical. The space in the filter value was not the cause: verified against live ClickHouse with "Observability Starter".

An orange line told you a metric was anomalous but the chart didn't show why. Datadog-style: overlay the detector's time-varying expected-range band and mark the points that left it.
- ThresholdLineChart gains an optional `bandSeries` (LineChart → ComposedChart): two stacked Areas draw the [lower, upper] band behind the line; a stroke-less Line with a custom dot marks only the breaching points in red (recharts' Scatter plots null points, so it can't mark a sparse subset). Existing usages (tiles, web vitals) are unchanged.
- Drill-in: backtest the rule's detector with the DISPLAYED aggregation (not the rule's) so the band always tracks the visible line and shows even when you're viewing a different aggregation than the rule alerts on. Per-bucket band values are merged onto the chart points (nearest-timestamp), replacing the old flat, aggregation-gated median band that almost never showed.

Verified live: anomtest.cpu drill-in shows the expected band + a red breach dot at the spike; non-anomaly metrics and the web-vitals charts render unchanged.

dviejokfs added 13 commits June 26, 2026 03:54

dviejokfs mentioned this pull request Jun 26, 2026

fix(providers): harden postgres major upgrades #151

Open

dviejokfs added 11 commits June 26, 2026 17:11

dviejokfs wants to merge 24 commits into main from feat/otel-metrics-clickhouse
