feat(otel): ClickHouse-first OTEL metrics storage with full-fidelity decode#158

Open
dviejokfs wants to merge 24 commits into main from feat/otel-metrics-clickhouse

@dviejokfs


Summary

Adds a ClickHouse-first, full-fidelity OpenTelemetry metrics path to temps-otel, alongside the existing TimescaleDB one. When TEMPS_CLICKHOUSE_* is configured, OTLP metrics are decoded losslessly and stored in a native metrics ReplacingMergeTree; queries, list_metric_names, and the anomaly-detector helpers run natively against ClickHouse. The default (no-ClickHouse) install and the service_metrics alerting bridge are unchanged.

Context: the premise was "we support traces/spans but not metrics" — in fact metrics were already wired on TimescaleDB but flattened. This PR is the ClickHouse-first leg (CH is far better suited to high-cardinality metric labels + native quantiles), with TimescaleDB parity deferred.

What's included

  • Decode + types (Phase A): MetricPoint carries temporality, is_monotonic, start_time, exponential-histogram buckets, summary quantiles, exemplars (trace/span), flags, description, and typed labels. Synthetic value=sum/count retained for histograms (the anomaly detector reads it).
  • Storage (Phase B): new migrations/clickhouse/0003_metrics.sql (metrics table) + native store_metrics, replacing the delegate-to-Timescale stubs. ChMetricRow field order is guarded against the RowBinary positional-serialization landmine by a DDL-parsing unit test.
  • Query (Phase C): native query_metrics (time-bucket, label filters, group-by, avg/sum/min/max/count/rate/quantile), parameterized + allowlisted (no injection surface). Store-neutral DTOs frozen so TimescaleDB can later satisfy the same contract.
  • Frontend (Phase D, scaffold): Metrics.tsx / MetricsExplorer.tsx mirroring Traces, wired to the generated SDK (no hand-rolled fetch), routed under the project + a sidebar entry.
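The RowBinary guard in Phase B can be pictured with a small sketch. RowBinary is positional, so a row struct whose field order drifts from the table's column order can write values into the wrong columns. The code below is only an illustration under stated assumptions: `SAMPLE_DDL`, `ROW_FIELDS`, `ddl_column_names`, and `field_order_matches` are hypothetical names, the DDL is a made-up fragment rather than the real 0003_metrics.sql, and the parser assumes one column per line.

```rust
/// Hypothetical DDL fragment; the real migration has many more columns.
const SAMPLE_DDL: &str = "CREATE TABLE metrics (
    project_id Int32,
    metric_name String,
    timestamp DateTime64(9),
    value Nullable(Float64),
    attributes Map(String, String)
) ENGINE = ReplacingMergeTree
ORDER BY (project_id, metric_name, timestamp)";

/// Hypothetical field order of the row struct, as it would be serialized.
const ROW_FIELDS: [&str; 5] = ["project_id", "metric_name", "timestamp", "value", "attributes"];

/// Extracts column names in declaration order from a CREATE TABLE statement.
/// Assumes one column definition per line inside the first parenthesized block.
fn ddl_column_names(ddl: &str) -> Vec<String> {
    let open = match ddl.find('(') {
        Some(index) => index,
        None => return Vec::new(),
    };
    // Find the parenthesis that closes the column list, skipping nested ones
    // such as DateTime64(9) or Map(String, String).
    let mut depth = 0usize;
    let mut close = ddl.len();
    for (offset, ch) in ddl[open..].char_indices() {
        match ch {
            '(' => depth += 1,
            ')' => {
                depth -= 1;
                if depth == 0 {
                    close = open + offset;
                    break;
                }
            }
            _ => {}
        }
    }
    ddl[open + 1..close]
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty() && !line.starts_with("--"))
        .filter_map(|line| line.split_whitespace().next())
        .map(|name| name.trim_matches('`').to_string())
        .collect()
}

/// True when the struct's field order equals the DDL's column order.
fn field_order_matches(ddl: &str, fields: &[&str]) -> bool {
    let columns = ddl_column_names(ddl);
    columns.len() == fields.len() && columns.iter().zip(fields).all(|(c, f)| c == f)
}
```

A unit test would then assert `field_order_matches(SAMPLE_DDL, &ROW_FIELDS)`, so reordering either side fails the build instead of corrupting inserts.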

Verification (live ClickHouse, not skipped)

cargo test -p temps-otel: 263 lib + 4 CH-storage + 1 decode→store fidelity + 7 e2e + 15 TimescaleDB — all pass. The ClickHouse tests genuinely start a container and assert store→query→read-back of temporality, is_monotonic, histogram buckets, labels, and multi-series grouping. cargo check --lib -p temps-otel -p temps-migrations is clean.

Bugs found by the live tests + adversarial review (all fixed)

  • Data loss (critical): ReplacingMergeTree ORDER BY excluded labels and was second-resolution → distinct label-series (e.g. http.method=GET vs POST) silently collapsed into one row. Fixed via a MATERIALIZED label-hash + full timestamp in ORDER BY.
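  • Query type error: toUnixTimestamp64Milli(toStartOfInterval(...)) is an illegal-type error because toStartOfInterval returns a DateTime here, not the DateTime64 that toUnixTimestamp64Milli requires. Fixed to toInt64(toUnixTimestamp(...))*1000.
  • RowBinary read mismatch: min/max/sum/quantile over a Nullable(Float64) column return Nullable results that do not match the f64 row read; the aggregated value is now wrapped in assumeNotNull.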
  • Test infra: the ClickHouse testcontainer wait strategy (message_on_stdout) never matched and the tests silently skipped (false green); switched to an HTTP /ping wait (http_wait_plain) + non-empty CH password.
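The data-loss bug is easiest to see by modelling the sorting key as a set key. The sketch below is a simplification under stated assumptions: `Point`, `label_fingerprint`, and the two `rows_after_merge_*` helpers are hypothetical names, and Rust's `DefaultHasher` stands in for ClickHouse's sipHash64, so the hash values will not match the ones ClickHouse computes. It only shows why a key without labels and with second resolution merges distinct series, and why a fingerprint of the sorted label set keeps them apart.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeSet;
use std::hash::{Hash, Hasher};

/// One decoded data point, reduced to the fields that matter for the key.
struct Point<'a> {
    metric: &'a str,
    ts_nanos: u64,
    labels: Vec<(&'a str, &'a str)>,
}

/// Order-independent fingerprint of a label set: sort first, then hash.
fn label_fingerprint(labels: &[(&str, &str)]) -> u64 {
    let mut sorted = labels.to_vec();
    sorted.sort();
    let mut hasher = DefaultHasher::new();
    for (key, value) in &sorted {
        key.hash(&mut hasher);
        value.hash(&mut hasher);
    }
    hasher.finish()
}

/// Old key: (metric, timestamp truncated to seconds). Labels are ignored,
/// so rows that differ only by label collapse into one after a merge.
fn rows_after_merge_old_key(points: &[Point]) -> usize {
    let keys: BTreeSet<(&str, u64)> = points
        .iter()
        .map(|p| (p.metric, p.ts_nanos / 1_000_000_000))
        .collect();
    keys.len()
}

/// New key: (metric, full-precision timestamp, label fingerprint).
fn rows_after_merge_new_key(points: &[Point]) -> usize {
    let keys: BTreeSet<(&str, u64, u64)> = points
        .iter()
        .map(|p| (p.metric, p.ts_nanos, label_fingerprint(&p.labels)))
        .collect();
    keys.len()
}
```

With two points for the same metric and timestamp, one labelled http.method=GET and one http.method=POST, the old key keeps one row and the new key keeps both.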

Deferred (tracked follow-ups, not in this PR)

  • rate() does not branch delta-vs-cumulative (treats all as max−min).
  • Histogram quantiles currently return the percentile of synthetic per-point means (histogram_summary = None) — misleading; no caller warning yet.
  • Exp-histogram / summary / exemplar Array(Tuple(...)) columns are only unit-tested, never inserted into live ClickHouse.
  • Frontend needs an SDK regen against a running server for the richer query params + exemplar→trace links; Observe metric kind is off until then.
  • Phase E — TimescaleDB parity for the new fidelity (default-install path).

Note

A pre-existing sibling has the same silent-skip testcontainer wait bug: temps-analytics-events/src/services/clickhouse_backend.rs — its ClickHouse integration tests have also never actually run. Out of scope here; fix tracked separately.

🤖 Generated with Claude Code

dviejokfs added 13 commits June 26, 2026 03:54
…decode

Add a native ClickHouse metrics path alongside the existing TimescaleDB one.
When TEMPS_CLICKHOUSE_* is configured, OTLP metrics are decoded at full
fidelity (temporality, monotonicity, explicit/exponential histograms,
summaries, exemplars, typed labels) and stored in a native `metrics`
ReplacingMergeTree; query_metrics/list_metric_names and the anomaly-detector helpers
run natively against it. The service_metrics alerting bridge and the default
(no-ClickHouse) TimescaleDB path are unchanged.

Proven end-to-end against a live ClickHouse testcontainer: store -> query ->
read-back round-trips with fidelity (4 CH storage tests + 1 decode->store
fidelity test + 263 lib unit tests + 7 e2e + 15 timescale, all green).

Correctness fixes uncovered by the live integration tests:
- ReplacingMergeTree ORDER BY now includes a MATERIALIZED label fingerprint
  (sipHash64 of the sorted label set) and a full-precision timestamp.
  Previously distinct label-series sharing a coarse per-second timestamp
  silently collapsed into one row (data loss across series).
- Bucket expression uses toInt64(toUnixTimestamp(...))*1000; the prior
  toUnixTimestamp64Milli() over a DateTime was an illegal-type error.
- Aggregates wrap value in assumeNotNull so Nullable(Float64) results match
  the f64 row read (RowBinary type-width mismatch otherwise).

Test infrastructure: fix the ClickHouse testcontainer wait strategy
(HTTP /ping via the http_wait_plain feature) and credentials so the
integration tests actually execute instead of silently skipping.

Frontend: scaffold a metrics explorer page (mirrors Traces) wired to the
generated SDK; the richer Phase C query params and the Observe metric kind
await an SDK regen against a running server.

Deferred follow-ups: rate() delta-vs-cumulative handling, histogram quantile
reconstruction, live-CH coverage for exp-histogram/summary/exemplar columns,
and TimescaleDB parity for the new fidelity.

- rate() now honours aggregation temporality: DELTA series sum their
  per-interval increments while CUMULATIVE counters use the within-bucket
  (max - min). Previously every series used max-min, which undercounts the
  rate of delta-temporality counters.
- query_metrics now populates histogram_summary with the explicit bucket
  layout (bounds + element-wise-summed bucket_counts) alongside
  count/sum/min/max, so histogram metrics return their real distribution
  instead of only a misleading synthetic mean. This enables correct quantile
  reconstruction from the buckets.

Both behaviours are covered by new live-ClickHouse integration tests
(query_metrics_rate_respects_temporality, histogram_summary_aggregates_buckets).
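The temporality branch described above can be sketched in a few lines. This is a plain-Rust illustration of the arithmetic, not the SQL that query_metrics generates; `Temporality`, `bucket_increase`, and `bucket_rate` are hypothetical names.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Temporality {
    Delta,
    Cumulative,
}

/// Increase of a counter within one time bucket.
/// DELTA points are already per-interval increments, so they are summed.
/// CUMULATIVE points are running totals, so the increase is max - min.
fn bucket_increase(values: &[f64], temporality: Temporality) -> f64 {
    if values.is_empty() {
        return 0.0;
    }
    match temporality {
        Temporality::Delta => values.iter().sum(),
        Temporality::Cumulative => {
            let max = values.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
            let min = values.iter().cloned().fold(f64::INFINITY, f64::min);
            max - min
        }
    }
}

/// Per-second rate for a bucket of `bucket_secs` seconds.
fn bucket_rate(values: &[f64], temporality: Temporality, bucket_secs: f64) -> f64 {
    bucket_increase(values, temporality) / bucket_secs
}
```

For delta increments 5, 7, 3 the increase is 15, whereas the old max-min rule would report 4. For cumulative totals 100, 105, 112, 115 the increase is also 15.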
…plar columns

Adds a decode->store->raw-read test that inserts a metric carrying the nested
Array(Tuple(...)) columns (exponential-histogram bucket counts, summary
quantiles, and exemplars with trace/span ids) into a live ClickHouse container
and asserts they survive. These RowBinary nested-tuple codepaths were
previously only unit-tested at the row-mapping tier, never against a real
ClickHouse server.
…e-count)

histogram_summary previously summed raw histogram_count / bucket_counts across
all rows in a window. For CUMULATIVE histograms re-exported multiple times (the
OTLP default), each export is a running total, so summing them multiplied the
counts by the number of exports — a live demo showed count=300 for 50
observations exported 6 times.

Compute histogram_summary from a per-series sub-aggregation: each series
(attributes_hash) is collapsed first (CUMULATIVE -> latest snapshot via
argMax/max; DELTA or unspecified -> sum across the window), then summed across
series up to the requested grouping granularity, matched back to the scalar
rows by (bucket_ms, series_values). Scalar/quantile aggregations are unchanged.

Covered by a new live-ClickHouse test: cumulative re-exports collapse to
per-series latest, then sum across series.
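A minimal in-memory sketch of the per-series sub-aggregation, assuming hypothetical names (`HistRow`, `histogram_summary`, `naive_count`) and ignoring the time-bucket and group-by dimensions that the real query carries. It reproduces the double-count from the live demo: six cumulative exports of a 50-observation histogram sum to 300 naively, but collapse to 50 when each series keeps only its latest snapshot.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Temporality {
    Delta,
    Cumulative,
}

/// One stored histogram row; `series` stands in for attributes_hash.
struct HistRow {
    series: u64,
    ts: u64,
    temporality: Temporality,
    count: u64,
    bucket_counts: Vec<u64>,
}

/// Element-wise addition, growing `acc` if `other` has more buckets.
fn add_into(acc: &mut Vec<u64>, other: &[u64]) {
    if acc.len() < other.len() {
        acc.resize(other.len(), 0);
    }
    for (a, b) in acc.iter_mut().zip(other) {
        *a += *b;
    }
}

/// The buggy behaviour: sum every row's count regardless of temporality.
fn naive_count(rows: &[HistRow]) -> u64 {
    rows.iter().map(|r| r.count).sum()
}

/// Collapse each series first, then sum across series.
/// CUMULATIVE keeps the latest snapshot; DELTA sums across the window.
fn histogram_summary(rows: &[HistRow]) -> (u64, Vec<u64>) {
    // series -> (timestamp of kept snapshot, count, bucket counts)
    let mut per_series: BTreeMap<u64, (u64, u64, Vec<u64>)> = BTreeMap::new();
    for row in rows {
        let entry = per_series.entry(row.series).or_insert((0, 0, Vec::new()));
        match row.temporality {
            Temporality::Cumulative => {
                if row.ts >= entry.0 {
                    *entry = (row.ts, row.count, row.bucket_counts.clone());
                }
            }
            Temporality::Delta => {
                entry.1 += row.count;
                add_into(&mut entry.2, &row.bucket_counts);
            }
        }
    }
    let mut total = 0u64;
    let mut summed = Vec::new();
    for (_, count, buckets) in per_series.values() {
        total += count;
        add_into(&mut summed, buckets);
    }
    (total, summed)
}
```

The sketch assumes a series does not mix temporalities inside one window; a production query would need a rule for that case.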
Regenerate the SDK against the Phase C backend (typed aggregation / metric_type
/ label_filters params; histogram_summary, quantiles, series_key on
MetricBucket).

MetricsExplorer:
- Default view now shows ALL metrics as an overview grid (one mini chart per
  metric with its latest value); click a card to drill into the detailed view.
- Send the real aggregation + label_filters params. For histogram metrics,
  percentiles are computed client-side from the histogram_summary buckets
  (the backend scalar quantile runs over the synthetic mean).
- Add a histogram Distribution panel (count/mean/p50-p99 + per-bucket bars).
- Fix an infinite refetch loop: end_time used new Date() every render, changing
  the query key each render; the time bounds are now memoized.
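The client-side percentile computation mentioned above can be sketched as follows. The frontend does this in TypeScript; the version here is written in Rust only to keep one language across the examples. It assumes the OTLP explicit-bucket layout (bucket_counts has one more entry than bounds, the last bucket being the overflow) and uses linear interpolation inside the bucket in the style of Prometheus's histogram_quantile. Whether the explorer interpolates the same way is an assumption, and `histogram_quantile` is a hypothetical name.

```rust
/// Estimates the q-quantile (0.0..=1.0) from explicit histogram buckets.
/// `bounds[i]` is the upper bound of bucket i; `counts` has bounds.len() + 1
/// entries, the last one counting observations above the highest bound.
fn histogram_quantile(q: f64, bounds: &[f64], counts: &[u64]) -> Option<f64> {
    if !(0.0..=1.0).contains(&q) || counts.len() != bounds.len() + 1 {
        return None;
    }
    let total: u64 = counts.iter().sum();
    if total == 0 {
        return None;
    }
    let rank = q * total as f64;
    let mut cumulative = 0u64;
    for (i, &count) in counts.iter().enumerate() {
        let previous = cumulative;
        cumulative += count;
        if count == 0 || (cumulative as f64) < rank {
            continue;
        }
        if i == bounds.len() {
            // Overflow bucket has no upper bound: report the highest finite one.
            return bounds.last().copied();
        }
        // Assumption: the first bucket starts at 0 unless its bound is negative.
        let lower = if i == 0 { bounds[0].min(0.0) } else { bounds[i - 1] };
        let upper = bounds[i];
        let fraction = (rank - previous as f64) / count as f64;
        return Some(lower + (upper - lower) * fraction);
    }
    None
}
```

With bounds 100, 200, 400 and counts 10, 30, 50, 10, the median falls in the third bucket (rank 50, with 40 observations below it), giving 200 + 200 · 10/50 = 240.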
Per-project saved dashboards persisted in Postgres (metric_dashboards table) as
a typed JSON layout of sections -> metric tiles. Full CRUD under
/api/otel/dashboards following Handler->Service->Data:
- temps-entities: metric_dashboards entity + migration.
- temps-otel: MetricDashboardService (CRUD, typed DashboardLayout/Section/Tile,
  aggregation + size/length-bounds validation), handlers (utoipa,
  permission_guard OtelRead/OtelWrite, audit-logged writes), routes +
  OtelApiDoc registration, DashboardNotFound -> 404.
- get/update/delete are scoped by project_id (defense-in-depth against
  cross-tenant IDOR: a mismatched project_id returns 404, never another
  project's dashboard). A corrupt stored layout is logged, not silently
  swallowed.
- 8 service tests (CRUD + validation + pagination cap).
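The project scoping on by-id access can be illustrated without a database. In the sketch below, `Dashboard`, `DashboardError`, and `get_dashboard` are hypothetical names and a slice stands in for the metric_dashboards table; the point is only that the lookup filters on both the id and the project_id, so a mismatch is indistinguishable from a missing row.

```rust
#[derive(Debug, PartialEq)]
enum DashboardError {
    /// Mapped to HTTP 404 by the handler layer.
    NotFound,
}

#[derive(Debug, PartialEq)]
struct Dashboard {
    id: i32,
    project_id: i32,
    name: String,
}

/// Looks up a dashboard by id, scoped to the caller's project.
/// A dashboard that exists under another project yields NotFound, never
/// a "forbidden" error that would confirm the id exists.
fn get_dashboard(
    store: &[Dashboard],
    project_id: i32,
    id: i32,
) -> Result<&Dashboard, DashboardError> {
    store
        .iter()
        .find(|d| d.id == id && d.project_id == project_id)
        .ok_or(DashboardError::NotFound)
}
```

Returning the same error for "wrong project" and "no such id" keeps the response from leaking which ids belong to other tenants.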

Frontend: Dashboards list, dashboard view (sections of metric-chart tiles,
reusing the metrics-explorer chart + client-side histogram percentiles, with
memoized time bounds), and a builder (add/rename sections, add tiles by metric
+ aggregation, save). SDK regenerated.

Verified end-to-end against a running server + ClickHouse: CRUD round-trips the
layout losslessly, IDOR is blocked (404 on mismatched project_id), and the UI
renders saved dashboards with live charts.

Follow-ups: domain-prefix the nested layout schema names (OtelDashboardLayout
etc.) and add handler-layer 401/403 tests.

Drop the mx-auto/max-w container constraint on the metrics explorer, dashboards
list, dashboard view, and builder so they use the full content width (e.g.
~1600px instead of a centered 1152px on wide screens) — more columns in the
metric overview grid and more room for tile charts.

Collapse the separate "OTel Metrics" and "Dashboards" sidebar entries into a
single Metrics surface with a route-backed segmented control:
- Explore (index, /metrics): the all-metrics overview + per-metric drill-in.
- Dashboards (/metrics/dashboards/*): the saved-dashboards list/view/builder,
  nested unchanged so their relative navigation is preserved.

/dashboards/* now redirects to /metrics/dashboards. One nav entry; explore and
curate live in the same place.

Add first-class threshold alerting on OpenTelemetry metrics. Alert rules
attach directly to a metric (not to a dashboard), so the metric is the
source of truth and any surface — explorer, dashboards — merely displays them.

Backend (temps-otel):
- metric_alert_rules entity + migration (project-scoped: name, metric_name,
  aggregation, comparator, threshold, window/for-duration, severity, enabled,
  last_state/last_value for firing-state tracking).
- MetricAlertService: project-scoped CRUD with IDOR-safe by-id access
  (get/update/delete 404 on project mismatch), threshold finiteness +
  bounds validation, paginated list.
- MetricAlertEvaluator: background tokio-interval evaluator. Queries the
  latest closed bucket (limit 2 to skip the in-progress one), derives the
  rule value per aggregation (incl. client-side histogram_quantile for
  percentile rules), runs a for-duration state machine, and fires/resolves
  through the existing temps_monitoring AlarmService so alerts reuse
  configured notification channels. No-data ticks preserve prior state.
- CRUD handlers under /otel/alerts with audit logging.
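A compact sketch of the evaluator's for-duration state machine, under stated assumptions: `AlertState`, `Transition`, `RuleRuntime`, `step`, and `latest_closed_bucket` are hypothetical names, time is plain seconds, and a breaching-but-not-yet-due rule simply keeps its previous state. The real evaluator may model the pending phase differently.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum AlertState {
    Unknown,
    Ok,
    Firing,
}

#[derive(Debug, PartialEq)]
enum Transition {
    Unchanged,
    Fired,
    Resolved,
}

struct RuleRuntime {
    state: AlertState,
    /// When the current uninterrupted breach began, if any.
    breach_start: Option<u64>,
}

/// With results ordered newest first and limited to 2, the first bucket is
/// assumed to be still filling, so the second one is the latest closed bucket.
fn latest_closed_bucket(newest_first: &[f64]) -> Option<f64> {
    newest_first.get(1).copied()
}

/// Advances one evaluation tick. `breaching` is None on a no-data tick.
fn step(
    rt: &mut RuleRuntime,
    now_secs: u64,
    for_secs: u64,
    breaching: Option<bool>,
) -> Transition {
    let breaching = match breaching {
        Some(b) => b,
        // No data: keep both the state and the breach timer untouched.
        None => return Transition::Unchanged,
    };
    if !breaching {
        rt.breach_start = None;
        let was_firing = rt.state == AlertState::Firing;
        rt.state = AlertState::Ok;
        return if was_firing { Transition::Resolved } else { Transition::Unchanged };
    }
    // Breaching: start the timer on the first breach, fire once it has run
    // for at least `for_secs`.
    let start = *rt.breach_start.get_or_insert(now_secs);
    if rt.state != AlertState::Firing && now_secs.saturating_sub(start) >= for_secs {
        rt.state = AlertState::Firing;
        return Transition::Fired;
    }
    Transition::Unchanged
}
```

Only `Fired` and `Resolved` would be forwarded to the alarm service, which keeps a sustained breach from notifying on every tick.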

Frontend (web):
- Alerts tab in the unified Metrics surface (Explore | Dashboards | Alerts).
- MetricAlerts list with firing-state badge + one-line rule summary,
  MetricAlertForm create/edit, AlertsRouter.
- Explorer overlays a rule's threshold as a reference line on the metric
  chart (critical=poor tone, else warn).
- SDK regen for the new /otel/alerts endpoints.

Verified end-to-end against a live ClickHouse-backed server: CRUD, IDOR
(get/delete 404 on wrong project), threshold validation, and a breaching
rule transitioning unknown→firing with the alarm actually fired. 18 unit
tests pass; clippy -D warnings clean.
…ules

Reshape metric_alert_rules so future detector families (anomaly, EWMA,
forecast, outlier, Watchdog-style auto-watch) ship as code-only changes —
never another migration. Done now, before the table merges, so there is no
backfill or transition dance.

Schema: replace the static-only `comparator`/`threshold` columns with a
coarse `detection_kind` string discriminator + a typed-in-Rust, jsonb
`detection_config`. The cross-cutting eval envelope (metric, aggregation,
window, for-duration, severity, enabled, last_* state) stays typed columns;
only the detector-specific knobs move into the blob. Folded directly into the
in-flight create migration (zero rows, no ALTER).

detection_config is a serde internally-tagged enum (`DetectionConfig`) in the
new `temps_otel::detectors` module — copied verbatim from the sanctioned
`ProviderConfig`/`revenue_integrations.config` precedent: NO
`#[schema(discriminator)]` (a compile error with serde(tag) in utoipa 5.5.0).
The raw `serde_json::Value` lives only on the sea-orm column; every service
and DTO layer is fully typed, so the generated SDK is a usable TS
discriminated union — `(StaticParams & { kind: 'static' }) | …` — not `any`.

Today only `static` is evaluable; anomaly/forecast/outlier/auto_watch are
typed, schema-present stubs that validate() rejects at create time. Enabling
each later = a new validate arm + evaluator branch + openapi-ts regen.
`detection_kind` is a plain string (not a PG enum) so new kinds need no
ALTER TYPE either.

Evaluator now decodes the typed detector and branches on it (static =
`Comparator::breaches`); the bad-input surface moves to serde (unknown kind /
bad comparator / missing threshold -> 422 at deserialize) which is stronger
than the old string allowlists. Frontend maps the static form to/from
detection_config and the explorer/list narrow on `kind`.

Verified live (ClickHouse-backed): create/get round-trip the typed config
through jsonb; anomaly -> 400 (not yet supported); bad input -> 422; the
static evaluator still fires. 25 unit tests pass; clippy -D warnings clean;
SDK regenerated; frontend typechecks.

Make the `anomaly` detector evaluable end to end; it was a typed,
creation-rejected stub. A rule now learns a baseline band from history and
fires when the current value deviates from it, reusing the same for-duration
state machine and AlarmService as static threshold rules.

Detection math (pure, unit-tested) in `detectors`:
- robust_band = median as the center, MAD·1.4826 as the scale (consistent-with-σ).
- anomaly_breaches = direction-aware (above/below/both) z-score test, with a
  MIN_BAND_SCALE floor so a flat baseline can't divide by zero.
- season_cell buckets a timestamp into none/hourly/daily/weekly cells.
- validate() now accepts anomaly (robust/basic); agile/ewma and bad
  hyperparameters (deviations≤0, pct∉(0,1], lookback∉1..=90) are rejected.
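The detection math above can be sketched with std only. The names mirror the list (`robust_band`, `anomaly_breaches`, `MIN_BAND_SCALE`), but the signatures, the `Band` and `Direction` types, and the value chosen for the floor are assumptions made for illustration, not the module's actual API.

```rust
/// Scales the MAD so it estimates σ for normally distributed data.
const MAD_TO_SIGMA: f64 = 1.4826;
/// Hypothetical floor; the real constant's value is not stated in the PR.
const MIN_BAND_SCALE: f64 = 1e-9;

#[derive(Debug, Clone, Copy)]
enum Direction {
    Above,
    Below,
    Both,
}

#[derive(Debug)]
struct Band {
    center: f64,
    scale: f64,
}

impl Band {
    /// Expected range at the given number of deviations.
    fn bounds(&self, deviations: f64) -> (f64, f64) {
        let half = deviations * self.scale.max(MIN_BAND_SCALE);
        (self.center - half, self.center + half)
    }
}

fn median(values: &[f64]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 0 {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    })
}

/// Median as center, scaled median absolute deviation as spread.
fn robust_band(history: &[f64]) -> Option<Band> {
    let center = median(history)?;
    let deviations: Vec<f64> = history.iter().map(|x| (x - center).abs()).collect();
    let mad = median(&deviations)?;
    Some(Band { center, scale: mad * MAD_TO_SIGMA })
}

/// Direction-aware z-score test; the floor avoids dividing by zero when the
/// baseline is flat (the evaluator separately treats that case as degenerate).
fn anomaly_breaches(value: f64, band: &Band, deviations: f64, direction: Direction) -> bool {
    let z = (value - band.center) / band.scale.max(MIN_BAND_SCALE);
    match direction {
        Direction::Above => z > deviations,
        Direction::Below => z < -deviations,
        Direction::Both => z.abs() > deviations,
    }
}
```

For the history 80, 90, 90, 100, 100, 110, 110, 120 the median is 100 and the MAD is 10, so the scale is 14.826 and the 3σ band is roughly [55.5, 144.5].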

Evaluator branch (`metric_alert_evaluator`):
- Baseline fetched via the SAME query_metrics aggregation path as the scored
  point (so counter-rate / histogram-percentile compare like-for-like — NOT
  get_metric_baseline, which bypasses aggregation), cached per rule for 1h.
- Seasonal-cell filter with cold-start fallback to the global band; an
  insufficient (<8 samples) or degenerate (flat) baseline PRESERVES state
  rather than firing — no spurious alerts on thin history.
- fire() refactored to a detector-agnostic FireDetails (static vs anomaly
  message/metadata, e.g. "820ms is 4.2σ from the baseline 210 ± 90").
- run_cycle prunes breach-timer + baseline caches for disabled/deleted rules
  (also fixes a pre-existing breach_start leak).

Latent bug fixed (affected static AND anomaly): translate_bucket_interval only
accepted space-separated forms, so the evaluator's `format!("{}s", secs)`
("300s") silently fell back to INTERVAL 1 HOUR — every windowed query was
coarsened to hourly regardless of window_secs. Now also parses the compact
"300s"/"5m"/"1h"/"2d"/"1w" form.
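A sketch of a parser that accepts both the compact and the space-separated forms. It is not translate_bucket_interval itself: `parse_interval_secs` is a hypothetical helper that returns seconds rather than a ClickHouse INTERVAL expression, and the set of accepted unit spellings is an assumption. The design point it illustrates is returning `None` on unrecognised input so the caller must handle it, rather than silently substituting a default such as one hour.

```rust
/// Parses "300s", "5m", "1h", "2d", "1w" as well as spaced forms such as
/// "5 minutes" or "1 hour" into a number of seconds.
fn parse_interval_secs(input: &str) -> Option<u64> {
    let normalized = input.trim().to_ascii_lowercase();
    // Split at the first character that is not an ASCII digit.
    let split = normalized.find(|c: char| !c.is_ascii_digit())?;
    let (digits, unit) = normalized.split_at(split);
    let amount: u64 = digits.parse().ok()?;
    if amount == 0 {
        return None;
    }
    let unit_secs: u64 = match unit.trim() {
        "s" | "sec" | "second" | "seconds" => 1,
        "m" | "min" | "minute" | "minutes" => 60,
        "h" | "hour" | "hours" => 3_600,
        "d" | "day" | "days" => 86_400,
        "w" | "week" | "weeks" => 604_800,
        _ => return None,
    };
    amount.checked_mul(unit_secs)
}
```

With this shape, the evaluator's "300s" resolves to 300 seconds, and a typo such as "5x" surfaces as an error at the call site.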

Frontend: the alert form authors anomaly rules — a Detection selector swaps
the static comparator/threshold for algorithm / sensitivity (σ) / direction /
seasonality; the list summary reads the typed config.

Verified live (ClickHouse-backed): anomaly create accepted (was 400);
insufficient baseline preserves state; a normal value stays ok (no false
positive); an injected spike (100000 vs a ~100±15 band) transitions to firing
and raises an alarm. 31 unit tests pass; clippy -D warnings clean; frontend
typechecks; the form renders anomaly fields.

Anomaly detection was exposed but not explained. Three fixes:
- History/eligibility banner: when a metric is picked for an anomaly rule, the
  form checks how much history it has and warns if it's under ~14 days —
  spelling out that the rule will sit at "unknown" and not alert until a
  baseline can be built (the silent-inert trap). The "Unknown" badge now also
  carries a tooltip explaining the same.
- Sensitivity presets: the raw σ number is replaced with High/Medium/Low
  presets (2/3/4σ); the exact σ stays available under "Custom".
- Advanced disclosure: algorithm + seasonality (sensible defaults most users
  won't touch) move behind an "Advanced" details block, leaving Sensitivity +
  Direction as the two primary knobs.
…tion)

Editing an alert rule showed empty Aggregation + Detection selects and fell
back to the static fields, regardless of the saved rule. Root cause: the form
was created with placeholder defaults while the rule loaded, then updated via
react-hook-form's `values` prop — and that reset drops Radix Select values
that *change* during it (same-value selects like severity were unaffected,
masking the bug).

Fix: load the rule in a thin parent, then mount the form body once (keyed on
the rule id) with the resolved `defaultValues` from the start, so no Select is
reset post-mount. Verified: editing the anomaly rule now restores aggregation
(max), detector (anomaly), sensitivity, direction, seasonality.
dviejokfs added 11 commits June 26, 2026 17:11
Add a read-only preview/backtest so an anomaly rule is legible before you save
it (and tunable after). The single highest-value affordance for a feature that
otherwise fails silently.

Backend:
- Extract a shared `BandModel` in `detectors` (per-seasonal-cell robust bands +
  global fallback, built once and queried per timestamp). The evaluator's
  `anomaly_eval` now uses it too, so the preview can never diverge from what
  production would actually do.
- New `services::anomaly_preview` + `POST /otel/alerts/preview`: replays a
  metric over a range against the band and returns per-bucket
  {value, lower, upper, breaching} + breach_count + baseline sufficiency,
  through the SAME query_metrics aggregation path as the evaluator.

Frontend:
- `AnomalyBacktest`: for an anomaly rule, calls the endpoint (debounced) and
  shows "would have fired N× in the last 7 days" plus a chart of the value
  against its expected band with breach markers, or an explicit
  "not enough history" state when the baseline is thin.
- SDK regen for the new endpoint.

Verified live: backtest of the seeded anomaly rule reports 2 breaches over 7d
with band [55.5, 144.5] (median 100 ± 3·MAD); UI renders the count + chart.
31+ unit tests pass (incl. new BandModel test); clippy clean.

For an enabled anomaly rule on the charted metric, backtest its band over the
visible range (the same preview endpoint the form uses) and shade the expected
[lower, upper] region behind the line via a recharts ReferenceArea. Only shown
when the explorer's aggregation matches the rule's (so the band sits on the
same scale as the line) and the baseline is sufficient. ThresholdLineChart
gains an additive `bands` prop; existing threshold lines are unchanged.

Turn the metrics pages from a neutral data browser into a problem surface:
the system finds what's wrong and the user reads the answer.

- Health header pinned above the tabs (Metrics.tsx): triaged status — firing
  alerts + active anomalies worst-first, with Alert/Warn/No-data/OK status
  dots and a firing count on the Alerts tab. Honest coverage states: "all
  systems healthy" vs "nothing is being watched yet" vs "couldn't load" —
  never false-green when nothing is monitored.
- Status dots + toned line on overview cards (MetricsExplorer) and dashboard
  tiles (MetricTile): a firing metric reds out of a wall of green instead of
  looking healthy. Join keyed on (metric_name, aggregation) so a tile only
  reds for a rule that targets the series it shows. Redundant encoding
  (dot + tone + chip), never hue alone.
- Severity sort: overview grid and the alerts list float worst-first
  (alert → warn → no-data → ok), so the 24-tile cap and the alert list stop
  hiding the broken thing.
- Shared alert-status model (one cached listAlerts fetch) reused by all three.

Fixes two real token bugs found in the review:
- Added --success/--warning theme tokens (+ @theme mapping); badge.tsx used
  bg-success/bg-warning with no token, so the "OK"/healthy badge rendered with
  no background.
- AnomalyBacktest used hsl(var(--primary)) — but the tokens are oklch(), so
  the band/line/breach dots painted transparent. Use bare var(--chart-1)/
  var(--destructive) like the working chart.

Frontend-only; typechecks clean.

"Did a deploy cause this?" is the first triage question, and Temps owns the
deploy pipeline — a structural edge over Datadog. Overlay deploy events that
fall inside the chart's visible window as distinct (purple, dashed) vertical
markers, snapped to the nearest bucket (the categorical x-axis can't take a
raw timestamp), labelled with the short commit hash. Scoped to the selected
environment; timestamps normalised (sec or ms). ThresholdLineChart gains an
additive `markers` prop (vertical ReferenceLine); existing lines unchanged.
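The snapping step can be sketched as below. The frontend code is TypeScript; this Rust version only illustrates the logic. `normalize_to_ms` and `snap_to_bucket` are hypothetical names, and the seconds-versus-milliseconds heuristic (anything below 10^11 is treated as seconds) is an assumption, not necessarily the check the explorer uses.

```rust
/// Treats small values as Unix seconds and converts them to milliseconds.
/// 10^11 ms is in 1973 and 10^11 s is far in the future, so the threshold
/// separates the two units for any realistic deploy timestamp.
fn normalize_to_ms(ts: i64) -> i64 {
    if ts.abs() < 100_000_000_000 {
        ts * 1_000
    } else {
        ts
    }
}

/// Returns the index of the bucket whose start is nearest to the deploy,
/// or None when the deploy falls outside the visible window.
/// `bucket_starts_ms` must be sorted ascending; `bucket_ms` is the width.
fn snap_to_bucket(deploy_ts: i64, bucket_starts_ms: &[i64], bucket_ms: i64) -> Option<usize> {
    let t = normalize_to_ms(deploy_ts);
    let first = *bucket_starts_ms.first()?;
    let last = *bucket_starts_ms.last()?;
    if t < first || t >= last + bucket_ms {
        return None;
    }
    bucket_starts_ms
        .iter()
        .enumerate()
        .min_by_key(|&(_, &start)| (start - t).abs())
        .map(|(index, _)| index)
}
```

The returned index can then be mapped to the category label of that bucket, which is what a vertical reference line on a categorical x-axis needs.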
…(Tier 2)

When a metric looks wrong the next question is "what else moved in this same
window?" — and Temps owns metrics, deploys, traces, and errors, so the answer
shouldn't require re-pivoting four tools by hand. Under the detail chart, a
"related signals" strip:

  - Frames itself by live state: "This metric is firing — see what else
    changed" when a rule on this (metric, aggregation) is firing, else a
    neutral "What changed in this window". Reuses the cached listAlerts via
    useAlertStatus — no extra fetch.
  - Leads with the deploy answer ("1 deploy landed here — marked on the chart
    above", or "No deploys in this window — rules out a release"), the literal
    thing the chart's deploy markers visualise.
  - Deep-links to Traces and Errors pre-scoped to the SAME window: Traces gets
    range+env (which it already honours); Errors learns to read `?range=` so
    the jump actually lands on that window (it widens the metrics-only 6h to
    24h rather than ignore the intent). Plus a "Live view" jump to /observe.

Honesty: every link carries params the target page genuinely applies, and the
strip only states what it knows (deploys) — it never bluffs a trace/error
count it didn't fetch. Verified live: firing CPU-anomaly drill-in shows the
firing header + "1 deploy", and /errors?range=1h lands on "the last hour".

All curated-lucide icons (Network/Bug/Eye/Rocket/ArrowUpRight) confirmed in
the runtime bundle; the subset excludes ListTree/Telescope.

A long project name in the breadcrumb switcher (and long crumbs on deep paths
like errors/<long-title>) wrapped the header to two lines because shadcn's
BreadcrumbList is flex-wrap + break-words. Force the list to flex-nowrap with a
min-w-0 ancestor chain, truncate + responsively cap every crumb (switcher
label, intermediate links, and the current-page crumb — not just the switcher,
or removing the wrap escape-hatch would overflow the terminal crumb under the
action cluster), and clip at the breadcrumb boundary (overflow-hidden) with the
right-hand action cluster pinned shrink-0.

Verified by measurement: with every crumb forced to a 393px label, the
breadcrumb stays 1 line and the header 64px at both 1280px and 375px, each
crumb ellipsizes, and the breadcrumb never overlaps the action cluster.

Make "is anything on fire here?" answerable at a glance. A dashboard's status
is the worst alert-rule status across the metrics its tiles plot — derived from
the same cached listAlerts the per-tile dots already use, so no extra fetch and
the signals can't disagree. The dashboards list shows a pulsing status dot +
"N firing" per row; the dashboard view shows a firing badge in the header and a
per-section count; "All clear" appears only when tiles are actually watched,
nothing when no rule covers the dashboard (no vanity green).

Hardened per adversarial review:
- rollupStatus counts DISTINCT firing rules (Set of rule id), not tiles, so
  three tiles plotting one firing metric read "1 firing", not "3" (the
  name-fallback would otherwise map all three to the same rule).
- ruleStatus + the firing/gathering lists now treat a disabled rule as
  not-firing: the backend freezes a disabled rule's last_state, so without this
  a monitor switched off mid-firing flashed a false red alarm. Fixed at the
  source, so the alerts/health surfaces benefit too.
- the section "N firing" carries a severity title (no color-only meaning).
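The two hardening points (count distinct rules, ignore disabled ones) can be sketched as follows. The real rollupStatus lives in the TypeScript frontend and joins on (metric_name, aggregation); this Rust version joins on the metric name only and uses hypothetical names (`Rule`, `Rollup`, `rollup_status`) purely to show the logic.

```rust
use std::collections::HashSet;

struct Rule<'a> {
    id: u32,
    metric: &'a str,
    enabled: bool,
    /// Frozen by the backend when the rule is disabled.
    last_state: &'a str,
}

#[derive(Debug, PartialEq)]
enum Rollup {
    Firing(usize),
    AllClear,
    /// No rule covers any tile: show nothing rather than a green badge.
    Unwatched,
}

/// A disabled rule never counts as firing, whatever its frozen last_state.
fn is_firing(rule: &Rule) -> bool {
    rule.enabled && rule.last_state == "firing"
}

/// Rolls tile-level status up to the dashboard, counting each firing rule
/// once even when several tiles plot the same metric.
fn rollup_status(tile_metrics: &[&str], rules: &[Rule]) -> Rollup {
    let mut covered = false;
    let mut firing: HashSet<u32> = HashSet::new();
    for metric in tile_metrics {
        for rule in rules.iter().filter(|r| r.metric == *metric) {
            covered = true;
            if is_firing(rule) {
                firing.insert(rule.id);
            }
        }
    }
    if !firing.is_empty() {
        Rollup::Firing(firing.len())
    } else if covered {
        Rollup::AllClear
    } else {
        Rollup::Unwatched
    }
}
```

Three tiles that plot one firing metric yield `Firing(1)`, and disabling that rule flips the result to `AllClear` even though its last_state still reads "firing".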

Verified live: toggling the rule's `enabled` flips the dashboard between
"1 firing" and "All clear" while last_state stays frozen-firing.

The project "Observe" sidebar group mixed OpenTelemetry signals with
operational monitoring and carried a legacy "Metrics" (resource monitoring)
entry that duplicated and added nothing over the OTel Metrics page. Split it:

  OpenTelemetry   Observe · Traces · AI Traces · Metrics · Error Tracking
  Monitoring      Uptime · Request Logs · AI Crawlers

- "OTel Metrics" → "Metrics" (it's the only metrics surface now); "All events"
  → "Observe".
- Removed the legacy project Metrics: dropped the nav entry, the `monitoring`
  route, and the ProjectMonitoring page component (used nowhere else).
- Command palette: repointed its dead "Metrics" → /metrics and added "Observe",
  so removing the route leaves no broken command.

The PR's Changelog Check requires CHANGELOG.md to carry an [Unreleased]
entry. Document the OTel metrics feature set: explorer, dashboards, alert
rules with anomaly detection + backtest, deploy markers, cross-signal links,
Datadog-style firing status, the OpenTelemetry/Monitoring nav grouping, the
one-line header fix, the disabled-rule firing fix, and the legacy Metrics
page removal.

Filtering traces by service threw ClickHouse Code 184 ILLEGAL_AGGREGATION:
the trace-summary SELECT aliases `argMax(service_name, …) AS service_name`,
which shadows the raw column, so an unqualified `service_name = ?` in WHERE
resolved to the aggregate alias. Qualify it as `spans.service_name` so it
binds the per-span column. Fixes both query_trace_summaries (Traces) and
query_genai_trace_summaries (AI Traces); the count mirrors are qualified too
to keep their filter SQL byte-identical. The space in the service name was not
the cause: verified against live ClickHouse with "Observability Starter".

An orange line told you a metric was anomalous but the chart didn't show why.
Datadog-style: overlay the detector's time-varying expected-range band and mark
the points that left it.

- ThresholdLineChart gains an optional `bandSeries` (LineChart → ComposedChart):
  two stacked Areas draw the [lower, upper] band behind the line; a stroke-less
  Line with a custom dot marks only the breaching points in red (recharts'
  Scatter plots null points, so it can't mark a sparse subset). Existing
  usages (tiles, web vitals) are unchanged.
- Drill-in: backtest the rule's detector with the DISPLAYED aggregation (not the
  rule's) so the band always tracks the visible line and shows even when you're
  viewing a different aggregation than the rule alerts on. Per-bucket band values
  are merged onto the chart points (nearest-timestamp), replacing the old flat,
  aggregation-gated median band that almost never showed.

Verified live: anomtest.cpu drill-in shows the expected band + a red breach dot
at the spike; non-anomaly metrics and the web-vitals charts render unchanged.