Plan version: 1.0
Date: 2026-07-03
Status: Ready for execution
Companion: task IDs P8-xx in TASKS.md
A local-first web dashboard shipped inside the ragproof package and launched with one command:
ragproof ui
It reads the same SQLite store the CLI writes, opens in the browser, and gives teams the three things a terminal table cannot: fast failure triage, visual run comparison, and quality trends over time. The CLI stays the write path and the CI surface; the UI is the analysis surface. That split follows the project's founding principle: CI-native first, the dashboard is a viewer, not the product.
The CLI answers "did quality regress". The UI answers the follow-up questions that decide what to do about it: which cases failed, what did the judge actually say, did the reranker change hurt only long questions, is groundedness drifting down across the last twenty runs. Those are exploration tasks, and exploration needs an interactive surface.
Shape decisions, made up front:
- Local-first, zero infrastructure. `ragproof ui` starts a server on localhost and opens the browser, the way promptfoo's viewer does. No accounts, no cloud, no telemetry. It works air-gapped, consistent with the HTML report.
- Read-only in v1. Runs are started from the CLI and CI. The UI never mutates the store. This keeps the trust model simple (the store is the single source of truth written by one code path) and keeps scope honest. Triggering runs from the UI is a listed future, not a v1 feature.
- Shipped in the wheel, optional install. The built frontend is static files inside the package; the server dependencies live behind `pip install 'ragproof[ui]'` so the core CLI install stays lean.
- One design system, dark mode first-class. Evaluation work happens next to editors and terminals; a dashboard that only looks right in light mode reads as an afterthought.
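The read-only rule can be enforced at the connection level, not only by convention. The following is a minimal sketch, assuming the store is a plain SQLite file; `open_store_readonly` is a hypothetical helper name used for illustration, not an existing ragproof function.

```python
import sqlite3
from pathlib import Path


def open_store_readonly(path: str) -> sqlite3.Connection:
    """Open an existing SQLite file so that any write attempt fails.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Path.as_uri() produces a file: URI; mode=ro asks SQLite for a
    # read-only handle, and uri=True tells sqlite3 to parse the URI.
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)
```

With a connection opened this way, an accidental `INSERT` or `UPDATE` in a UI code path raises `sqlite3.OperationalError` instead of silently changing the store.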
What we take from each leading product, and what we deliberately do not.
| Product | Take | Leave |
|---|---|---|
| Braintrust | The experiments table as the home screen: one row per run, inline score cells with tiny distributions, diff columns against a baseline. The regression-focused compare view where changed cases surface first. | Cloud accounts, prompt playground. |
| LangSmith | Dataset-centric drill-down: dataset -> runs over it -> case detail. The case side-panel that keeps the list in view while inspecting one item. | Tracing product surface; we evaluate runs, not traces. |
| Langfuse | Score timelines over runs; clean empty states that teach the CLI command that fills the screen. | Session/user analytics. |
| promptfoo | The view command UX: one command, browser opens, data is local. Row-per-case matrix with pass/fail cells. | Its visual density limits; we want a calmer, more legible grid. |
| Linear | Keyboard-first navigation, command palette, restrained motion, theme quality. The bar for "feels professional". | App-specific patterns (issues, cycles). |
| Grafana | Threshold visualization: draw the gate line on the chart so a breach is visible before it is read. | Dashboard-builder complexity; our layouts are fixed and opinionated. |
| Stripe Dashboard | Empty, loading, and error states treated as designed screens, not fallbacks. Table typography with tabular numerals. | Nothing else; different domain. |
ragproof ui -> uvicorn (FastAPI, localhost only)
|- /api/* read-only JSON over the SQLite store
|- /* static SPA bundle (built by Vite, shipped in wheel)
frontend/ React 18 + TypeScript + Vite
Tailwind CSS v4 + shadcn/ui (Radix primitives)
TanStack Query (data) + TanStack Table (grids)
Recharts (charts) | self-hosted Inter + JetBrains Mono
- Backend: FastAPI app in `ragproof/ui/server.py`; mounted read methods reuse `reports/data.py` and `gate.py` exactly as the CLI does, so the UI can never disagree with `ragproof gate`. Binds `127.0.0.1` by default; `--host 0.0.0.0` is an explicit opt-in with a startup warning.
- Frontend workspace: `frontend/` at the repo root, built by CI into `ragproof/ui/static/` before packaging. The wheel never requires Node at install or run time.
- Dev loop: `ragproof ui --dev` proxies to the Vite dev server for hot reload; production serves the bundle.
- No external requests at runtime. Fonts and icons are bundled. The CSP is `default-src 'self'`. Same air-gap guarantee as the HTML report, enforced by a test that scans the built bundle for external URLs.
- Live runs: v1 polls active runs every 2 seconds (cheap against SQLite). A follow-up can upgrade to SSE without changing the page design.
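The startup warning for non-local binds could be decided by a small pure function. This is a sketch under assumptions: `bind_warning` and the warning text are hypothetical, and only the `--host` values named above come from the plan.

```python
import ipaddress
from typing import Optional


def bind_warning(host: str) -> Optional[str]:
    """Return a warning string when the bind address is not loopback.

    Hypothetical helper; the message wording is an assumption.
    """
    if host == "localhost":
        return None
    try:
        is_loopback = ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal (for example a hostname): treat it as non-local.
        is_loopback = False
    if is_loopback:
        return None
    return f"warning: serving on {host}; the dashboard is reachable from other machines"
```

Keeping the check separate from the server start makes it testable without opening a socket: `bind_warning("127.0.0.1")` returns `None`, while `bind_warning("0.0.0.0")` returns a message to print before the server starts.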
Sidebar (icon + label, collapsible)
Runs / default screen
Compare /compare?base=&cand=
Trends /trends
Datasets /datasets, /datasets/:id
Calibration /calibration
Docs external-free link to bundled metric definitions
Run detail /runs/:id tabs: Overview | Cases | Gate | Metadata
Case detail /runs/:id/cases/:key routed side panel over the Cases tab
Command palette Ctrl/Cmd-K jump to run, case, screen, or action
Top bar: project switcher (left), global search (center, opens palette), theme toggle and store path indicator (right).
| Token group | Decision |
|---|---|
| Typeface | Inter for UI, JetBrains Mono for ids, hashes, scores in tables, and JSON. Both self-hosted, font-display: swap. |
| Type scale | 12 / 13 / 14 (body) / 16 / 20 / 28. Data-dense screens sit at 13-14px like Linear and Stripe. Tabular numerals (font-variant-numeric: tabular-nums) on every number column. |
| Spacing | 4px base grid. Component paddings from a 4/8/12/16/24/32 ladder only. |
| Radius | 6px controls, 10px cards and panels. |
| Elevation | Borders over shadows. One shadow level for overlays only. |
| Motion | 120-160ms ease-out on open/close, none on data refresh. Charts animate only on first paint. |
Neutral scale plus one accent plus four semantic score colors. All pairs pass WCAG AA at their usage sizes in both themes.
| Role | Light | Dark |
|---|---|---|
| Background | #fafafa | #0b0d10 |
| Surface / card | #ffffff | #14171c |
| Border | #e4e4e7 | #262a31 |
| Text primary | #18181b | #e7e9ee |
| Text secondary | #6b7280 | #9aa1ad |
| Accent (brand) | #2563eb | #5b8def |
| Pass | #16a34a | #4ade80 |
| Warn | #d97706 | #fbbf24 |
| Fail | #dc2626 | #f87171 |
| Skip / muted | #9ca3af | #6b7280 |
Score cells use a background tint of the semantic color at 10-14% opacity with the full-strength color as text, never solid fills, so tables stay readable.
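The WCAG AA claim for the palette can be checked mechanically instead of by eye. The sketch below implements the WCAG 2.x relative luminance and contrast ratio formulas; the function names are illustrative, and which color pairs to check, and at which text sizes, is left to the caller.

```python
def _linearize(channel: int) -> float:
    """Convert one 8-bit sRGB channel to linear light, per WCAG 2.x."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a #rrggbb color."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(foreground: str, background: str) -> float:
    """Contrast ratio between two colors, always >= 1."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


# WCAG AA asks for at least 4.5:1 for normal text and 3:1 for large text.
AA_NORMAL_TEXT = 4.5
AA_LARGE_TEXT = 3.0
```

A design-token test could loop over the text and background pairs from the table in both themes and assert `contrast_ratio(fg, bg) >= AA_NORMAL_TEXT`.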
- Every score chart has a fixed 0 to 1 domain so runs are visually comparable.
- Gate thresholds render as a labeled horizontal line in the semantic fail color, following the Grafana rule: a breach must be visible before it is read.
- Confidence intervals draw as whiskers on delta bars, never hidden.
- Distributions are 10-bin histograms, consistent with the HTML report.
- Skipped is a distinct visual state (hatched or muted), never rendered as 0.
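Two of the chart rules above, the fixed 0 to 1 domain with 10 bins and the rule that skipped is never rendered as 0, can be sketched together. The assumptions here are that a skipped score is represented as `None` and that the function name `histogram_10` is hypothetical.

```python
from typing import Dict, Iterable, Optional


def histogram_10(scores: Iterable[Optional[float]]) -> Dict[str, object]:
    """Bin scores on a fixed [0, 1] domain into 10 equal-width bins.

    Skipped scores (None) are counted separately and never binned as 0.
    """
    bins = [0] * 10
    skipped = 0
    for score in scores:
        if score is None:
            skipped += 1
            continue
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score outside the fixed 0-1 domain: {score}")
        # A score of exactly 1.0 belongs to the last bin, not an 11th one.
        bins[min(int(score * 10), 9)] += 1
    return {"bins": bins, "skipped": skipped}
```

Because the bin edges never depend on the data, two runs binned this way can be compared bar for bar.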
Button, IconButton, Tabs, Table (virtualized), Badge (status), Tooltip, Dialog, Sheet (case side panel), Command palette, Select, Combobox (run picker), Toast, Skeleton, EmptyState, CodeBlock (JSON with copy), ScoreCell, DeltaCell (arrow + CI whisker), MetricSparkline, RunStatusDot.
The Braintrust-style experiments table. One glance answers: what ran, did it pass, what moved.
+--------------------------------------------------------------------------+
| Project: example v [search runs...] [theme] [store] |
|--------------------------------------------------------------------------|
| Runs (24) baseline: [latest v] [gate] |
|--------------------------------------------------------------------------|
| status | label | started | cases | ground. | recall | delta |
| ● pass | reranker-v2 | 2m ago | 120 | 0.91 ▂▅█ | 0.94 | +0.02 |
| ● fail | prompt-tweak | 1h ago | 120 | 0.78 ▅▂▁ | 0.93 | -0.13 |
| ◐ run | nightly | running 34% | 41/120| - | - | - |
+--------------------------------------------------------------------------+
- Columns: status dot (pass/fail from the gate, gray when ungated), label, started (relative), case counts with error badge, one ScoreCell per pinned metric (mean + micro-distribution), delta vs the selected baseline.
- Pinned metrics default to groundedness, recall@k, injection resistance; the column picker persists per project in localStorage.
- Row click opens Run detail. Checkbox-select exactly two rows enables the Compare button. Running rows show a progress cell and poll.
- Empty state: short copy plus the exact command to run (`ragproof run --config ragproof.yaml`), Langfuse-style.
The HTML report, upgraded to interactive.
- Header: label, run id (copyable), status, started/duration, judge model and prompt hashes, dataset hash chip linking to the dataset page, cost, cache hit rate.
- Metric grid: one card per metric with mean, p50/p95, scored/skipped/error counts, 10-bin histogram, and the gate threshold line when configured. Skipped metrics show the reason string, never an empty chart.
- A "worst cases" strip per judge metric linking into the Cases tab with the filter pre-applied.
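The numbers on a metric card could be computed along these lines. This is a sketch, not the project's data layer: it assumes skipped scores are `None`, leaves out the error count, and uses the nearest-rank percentile definition, which is one of several reasonable choices the plan does not pin down.

```python
import math
from typing import Dict, List, Optional, Sequence


def nearest_rank(sorted_values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% at or below it."""
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(scores: List[Optional[float]]) -> Dict[str, Optional[float]]:
    """Mean, p50, p95 and scored/skipped counts for one metric."""
    scored = sorted(s for s in scores if s is not None)
    skipped = len(scores) - len(scored)
    if not scored:
        # Nothing scored: report counts only, so the card can show a reason
        # string instead of an empty chart.
        return {"mean": None, "p50": None, "p95": None, "scored": 0, "skipped": skipped}
    return {
        "mean": sum(scored) / len(scored),
        "p50": nearest_rank(scored, 50),
        "p95": nearest_rank(scored, 95),
        "scored": len(scored),
        "skipped": skipped,
    }
```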
The triage surface, and the screen that sells the product in demos.
+----------------------------------------------------------+---------------+
| filter: [metric v] [status v] [kind v] [sort: worst v] | case qa-0042 |
|----------------------------------------------------------| question |
| key | kind | ground. | recall | inj. | status | answer |
| qa-0042 | qa | 0.33 | 1.00 | - | ok | [context] |
| qa-0007 | qa | 0.50 | 0.80 | - | ok | claims: |
| inj-003 | inj | - | - | 0.00 | ok | ✓ claim one |
| qa-0019 | qa | - | - | - | timeout | ✗ claim two |
+----------------------------------------------------------+ raw judge json|
- Virtualized table, one row per case, score columns for every metric that scored at least one case. Sort "worst first" per metric is the default entry point from the Overview strip.
- Selecting a row opens the LangSmith-style side panel (routed, deep-linkable) with: question, answer, retrieved chunks with the cited ones highlighted, and for groundedness the per-claim verdicts rendered as a checklist from `judge_raw_json`. Raw judge JSON behind a collapsible CodeBlock.
- j/k moves between cases, Esc closes, arrows work while the panel is open. Failure triage must be a keyboard loop.
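The "worst first" sort has one subtlety: cases with no score for the chosen metric (shown as `-` in the wireframe) should not be mistaken for a score of 0. A minimal sketch, assuming each case is a dict with a `key` and a `scores` mapping; that shape is hypothetical and chosen only for illustration.

```python
from typing import Any, Dict, List


def sort_worst_first(cases: List[Dict[str, Any]], metric: str) -> List[Dict[str, Any]]:
    """Order cases by ascending score for one metric; unscored cases go last."""
    scored = [c for c in cases if c["scores"].get(metric) is not None]
    unscored = [c for c in cases if c["scores"].get(metric) is None]
    # sorted() is stable, so ties keep their original relative order.
    return sorted(scored, key=lambda c: c["scores"][metric]) + unscored


cases = [
    {"key": "qa-0007", "scores": {"groundedness": 0.50}},
    {"key": "inj-003", "scores": {"groundedness": None}},
    {"key": "qa-0042", "scores": {"groundedness": 0.33}},
]
ordered_keys = [c["key"] for c in sort_worst_first(cases, "groundedness")]
# ordered_keys == ["qa-0042", "qa-0007", "inj-003"]
```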
The regression view: two runs, changed cases first.
- Header: two run pickers with the mixed-judge guard surfaced as a blocking banner (identical rule to the CLI, including the override).
- Metric delta table: baseline, candidate, delta with CI whisker, verdict chip. A dataset mismatch renders a persistent warning banner.
- Case diff grid: rows are cases, cells are per-metric deltas; default filter "changed only", sorted by biggest regression. Clicking opens the case panel split view, baseline answer next to candidate answer.
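The default view of the case diff grid ("changed only", biggest regression first) reduces to a join on case key followed by a sort on the delta. The sketch below handles one metric and assumes per-case scores arrive as `{case_key: score_or_None}` mappings; the names and the input shape are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Row = Tuple[str, float, float, float]  # (case key, baseline, candidate, delta)


def case_deltas(
    baseline: Dict[str, Optional[float]],
    candidate: Dict[str, Optional[float]],
    changed_only: bool = True,
) -> List[Row]:
    """Per-case deltas for one metric, most negative delta first."""
    rows: List[Row] = []
    # Only cases present in both runs can be compared.
    for key in baseline.keys() & candidate.keys():
        b, c = baseline[key], candidate[key]
        if b is None or c is None:
            # A skipped score has no delta; it is not treated as 0.
            continue
        delta = c - b
        if changed_only and delta == 0:
            continue
        rows.append((key, b, c, delta))
    # Sort by delta ascending; the key breaks ties deterministically.
    rows.sort(key=lambda row: (row[3], row[0]))
    return rows
```

A real grid would likely also need a tolerance for "changed", since judge scores can differ by noise; the exact-zero test here is the simplest possible choice.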
- One chart per gated metric: mean per run over time, 0 to 1 domain, gate threshold line, points colored by gate verdict, error bars from p50/p95.
- Range picker (last 10/50/all runs), label filter for run series.
- Clicking a point navigates to that run.
- List: name, size, kind breakdown (qa/unanswerable/injection), frozen hash, created, runs count.
- Detail: generation metadata (model, seed, discard counts, template hashes), case browser reusing the Cases table, list of runs over this dataset.
- Latest agreement per judge prompt: exact and within-band bars against the thresholds, pass/fail chip per prompt, judge model shown.
- Empty state teaches `ragproof calibrate`.
Every screen defines all four states before implementation: loading (skeletons matching final layout), empty (one sentence plus the CLI command that produces data), error (message plus retry, never a blank panel), and partial (running run, missing judge, skipped metrics). No screen ships without all four.
All under /api, JSON, served by the same process. Pydantic response models
shared with the CLI's data layer.
GET /api/meta store path, schema version, package version
GET /api/projects [{id, name, run_count, last_run_at}]
GET /api/runs?project=&limit=&cursor=
paginated run summaries + pinned metric means
GET /api/runs/{ref} full RunDetail header + metric summaries
GET /api/runs/{ref}/cases?metric=&status=&kind=&sort=&limit=&cursor=
GET /api/runs/{ref}/cases/{key} full case detail incl. judge_raw
GET /api/runs/{ref}/gate?baseline=
GateOutcome via gate.evaluate_gate
GET /api/compare?baseline=&candidate=&allow_mixed_judges=
GET /api/trends?project=&metric=&limit=
GET /api/datasets list with kind breakdown
GET /api/datasets/{id} metadata + cases + runs
GET /api/calibration latest stored calibration results, if any
Rules: run refs accept id, unique prefix, and latest, same as the CLI.
Errors return {error, code} with the CLI's exit-code taxonomy mapped to
HTTP (config 400, not found 404, execution 500). Responses that can exceed a
few hundred rows paginate with cursors.
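The run-ref rule and the error mapping can be sketched without any web framework. Everything named here is hypothetical: the exception classes, `resolve_run_ref`, and the use of the taxonomy name as the `code` field are assumptions made for illustration; only the three status codes and the `{error, code}` shape come from the rules above.

```python
from typing import Dict, Sequence, Tuple


class RunNotFound(LookupError):
    """No run matches the reference."""


class AmbiguousRunRef(LookupError):
    """A prefix matches more than one run, so it is not unique."""


# Mapping stated in the plan: config 400, not found 404, execution 500.
HTTP_STATUS: Dict[str, int] = {"config": 400, "not_found": 404, "execution": 500}


def resolve_run_ref(ref: str, run_ids_newest_first: Sequence[str]) -> str:
    """Resolve 'latest', a full id, or a unique id prefix to a run id."""
    if not ref or not run_ids_newest_first:
        raise RunNotFound(ref)
    if ref == "latest":
        return run_ids_newest_first[0]
    if ref in run_ids_newest_first:
        return ref
    matches = [run_id for run_id in run_ids_newest_first if run_id.startswith(ref)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise RunNotFound(ref)
    raise AmbiguousRunRef(f"{ref!r} matches {len(matches)} runs")


def error_body(kind: str, message: str) -> Tuple[int, Dict[str, str]]:
    """Build (status, body); using `kind` as the `code` value is an assumption."""
    return HTTP_STATUS[kind], {"error": message, "code": kind}
```

How an ambiguous prefix should map to HTTP is not specified in the plan, so the sketch raises a distinct exception and leaves that decision to the caller.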
- Accessibility: WCAG AA contrast in both themes, full keyboard operability, focus visible everywhere, `prefers-reduced-motion` respected, status never conveyed by color alone (dot + label).
- Performance: first meaningful paint of the Runs screen under 1s against a store with 1,000 runs; case tables virtualized; API list endpoints paginated; bundle under 300KB gzipped, code-split per route.
- Testing: backend endpoints tested with the same fixtures as the CLI (pytest, no Node needed); frontend components with Vitest + Testing Library; one Playwright smoke that boots `ragproof ui` against a seeded store and walks Runs -> Run -> Case -> Compare. The no-external-requests test runs against the built bundle in CI.
- Consistency oath: every number the UI shows comes from the same code path as the CLI (`reports/data.py`, `gate.py`). If the UI and `ragproof gate` ever disagree, that is a release-blocking bug.
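The no-external-requests test could be as simple as a text scan over the built files. This is a sketch under stated assumptions: the function name, the set of scanned suffixes, and the allowlist mechanism are all hypothetical, and a regex scan is a heuristic, not proof that no request is made.

```python
import re
from pathlib import Path
from typing import AbstractSet, List, Tuple

# Matches absolute http(s) URLs and captures the host part.
ABSOLUTE_URL = re.compile(r"https?://([A-Za-z0-9.-]+)")
TEXT_SUFFIXES = {".html", ".js", ".css", ".json", ".svg", ".map"}


def find_external_urls(
    bundle_dir: str, allowed_hosts: AbstractSet[str] = frozenset()
) -> List[Tuple[str, str]]:
    """Return (file name, matched URL prefix) for every non-allowed host."""
    hits: List[Tuple[str, str]] = []
    for path in sorted(Path(bundle_dir).rglob("*")):
        if not path.is_file() or path.suffix not in TEXT_SUFFIXES:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for match in ABSOLUTE_URL.finditer(text):
            if match.group(1).lower() not in allowed_hosts:
                hits.append((path.name, match.group(0)))
    return hits
```

Built bundles commonly contain URL strings that are never fetched, such as the SVG namespace identifier `http://www.w3.org/2000/svg`; the allowlist is where such hosts would be listed, so that the CI assertion `find_external_urls(...) == []` stays meaningful.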
Tracked as P8-xx in TASKS.md. Each phase lands green and demoable.
| Phase | Scope | Acceptance |
|---|---|---|
| UI-0 Foundation | frontend/ workspace (Vite, React, TS, Tailwind, shadcn/ui), design tokens and dark mode, FastAPI server behind ragproof[ui], ragproof ui command, static bundling into the wheel, CI job (typecheck, lint, test, build, bundle-scan) | ragproof ui opens a themed shell with live /api/meta data on a machine without Node |
| UI-1 Runs and Run detail | Runs table with pinned ScoreCells and baseline deltas, Run detail Overview and Metadata tabs, polling for running runs, all four states | A seeded store renders the home table and run overview matching CLI numbers exactly |
| UI-2 Case triage | Cases tab with virtualized grid, filters, worst-first sort, routed case side panel with claim checklist and cited-chunk highlighting, j/k navigation | Triage loop works end to end by keyboard; deep links restore full state |
| UI-3 Compare and Trends | Compare screen with delta table, CI whiskers, changed-cases diff grid and split case view; Trends charts with threshold lines | Braintrust-style regression walkthrough works on two seeded runs; mixed-judge guard blocks correctly |
| UI-4 Gate, Datasets, Calibration | Gate tab rendering GateOutcome, Datasets list/detail, Calibration screen, command palette, README screenshots | gate verdicts in UI and CLI are identical on the same store; palette jumps everywhere |
| UI-5 Polish and release | A11y audit, performance pass against a 1,000-run store, Playwright smoke in CI, docs (docs/ui.md), screenshots and a GIF for the README | Quality bar in section 8 fully met; UI ships in the wheel behind ragproof[ui] |
- React SPA over server-rendered templates. The case panel, diff grid, palette, and keyboard loops are heavily interactive; a component framework with a mature primitive library (Radix) is the honest cost of "professional grade". The build stays out of the install path by shipping the bundle.
- Recharts over vendored Chart.js or D3. Declarative, tree-shakeable, fits the fixed-layout charts we need; no runtime CDN, consistent with the air-gap rule.
- Read-only v1 is a feature, not a gap: it preserves the CI-native trust model. "Run from UI" and SSE live-streaming are listed as post-v1.
- Risk: frontend toolchain drift on the CI matrix. Mitigation: the Node toolchain runs in exactly one CI job; Python jobs never need it because the bundle is an artifact.
- Risk: two sources of truth for metrics rendering. Mitigation: the consistency oath above plus a CI test comparing `/api/runs/{ref}` numbers with CLI `--json` output on the same store.