RAGProof UI - Design and Delivery Plan

Plan version: 1.0
Date: 2026-07-03
Status: Ready for execution
Companion: task IDs P8-xx in TASKS.md

A local-first web dashboard shipped inside the ragproof package and launched with one command:

ragproof ui

It reads the same SQLite store the CLI writes, opens in the browser, and gives teams the three things a terminal table cannot: fast failure triage, visual run comparison, and quality trends over time. The CLI stays the write path and the CI surface; the UI is the analysis surface. That split follows the project's founding principle: CI-native first. The dashboard is a viewer, not the product.


1. Why a UI, and why this shape

The CLI answers "did quality regress". The UI answers the follow-up questions that decide what to do about it: which cases failed, what did the judge actually say, did the reranker change hurt only long questions, is groundedness drifting down across the last twenty runs. Those are exploration tasks, and exploration needs an interactive surface.

Shape decisions, made up front:

  1. Local-first, zero infrastructure. ragproof ui starts a server on localhost and opens the browser, the way promptfoo's viewer does. No accounts, no cloud, no telemetry. It works air-gapped, consistent with the HTML report.
  2. Read-only in v1. Runs are started from the CLI and CI. The UI never mutates the store. This keeps the trust model simple (the store is the single source of truth, written by one code path) and keeps scope honest. Triggering runs from the UI is listed as future work, not a v1 feature.
  3. Shipped in the wheel, optional install. The built frontend is static files inside the package; the server dependencies live behind pip install 'ragproof[ui]' so the core CLI install stays lean.
  4. One design system, dark mode first-class. Evaluation work happens next to editors and terminals; a dashboard that only looks right in light mode reads as an afterthought.
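One way to back the read-only decision at the storage layer is to open the SQLite file in read-only mode, so a stray write fails loudly instead of mutating the store. The sketch below uses only the standard library; the `runs` table, its columns, and the function names are hypothetical stand-ins, not the actual ragproof schema or API.

```python
import os
import sqlite3
import tempfile
from pathlib import Path


def open_store_read_only(path: str) -> sqlite3.Connection:
    """Open an existing SQLite file so that any write attempt fails."""
    # as_uri() yields a percent-encoded file: URI; mode=ro asks SQLite
    # to open the database read-only.
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


def demo_write_is_rejected() -> bool:
    """Return True if reads work and a write through the handle is refused."""
    with tempfile.TemporaryDirectory() as tmp:
        db = os.path.join(tmp, "store.db")

        # Stand-in for the CLI write path (hypothetical schema).
        writer = sqlite3.connect(db)
        writer.execute("CREATE TABLE runs (id TEXT PRIMARY KEY, label TEXT)")
        writer.execute("INSERT INTO runs VALUES ('r1', 'reranker-v2')")
        writer.commit()
        writer.close()

        reader = open_store_read_only(db)
        try:
            reader.execute("INSERT INTO runs VALUES ('r2', 'prompt-tweak')")
        except sqlite3.OperationalError:
            rejected = True
        else:
            rejected = False
        rows = reader.execute("SELECT label FROM runs").fetchall()
        reader.close()
        return rejected and rows == [("reranker-v2",)]
```

With this shape, "the UI never mutates the store" is enforced by the connection itself rather than by convention in each endpoint.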

2. Inspiration audit

What we take from each leading product, and what we deliberately do not.

| Product | Take | Leave |
|---|---|---|
| Braintrust | The experiments table as the home screen: one row per run, inline score cells with tiny distributions, diff columns against a baseline. The regression-focused compare view where changed cases surface first. | Cloud accounts, prompt playground. |
| LangSmith | Dataset-centric drill-down: dataset -> runs over it -> case detail. The case side-panel that keeps the list in view while inspecting one item. | Tracing product surface; we evaluate runs, not traces. |
| Langfuse | Score timelines over runs; clean empty states that teach the CLI command that fills the screen. | Session/user analytics. |
| promptfoo | The view command UX: one command, browser opens, data is local. Row-per-case matrix with pass/fail cells. | Its visual density limits; we want a calmer, more legible grid. |
| Linear | Keyboard-first navigation, command palette, restrained motion, theme quality. The bar for "feels professional". | App-specific patterns (issues, cycles). |
| Grafana | Threshold visualization: draw the gate line on the chart so a breach is visible before it is read. | Dashboard-builder complexity; our layouts are fixed and opinionated. |
| Stripe Dashboard | Empty, loading, and error states treated as designed screens, not fallbacks. Table typography with tabular numerals. | Nothing else; different domain. |

3. Architecture

ragproof ui  ->  uvicorn (FastAPI, localhost only)
                   |- /api/*        read-only JSON over the SQLite store
                   |- /*            static SPA bundle (built by Vite, shipped in wheel)

frontend/          React 18 + TypeScript + Vite
                   Tailwind CSS v4 + shadcn/ui (Radix primitives)
                   TanStack Query (data) + TanStack Table (grids)
                   Recharts (charts)  |  self-hosted Inter + JetBrains Mono
  • Backend: a FastAPI app in ragproof/ui/server.py. Its read endpoints reuse reports/data.py and gate.py exactly as the CLI does, so the UI can never disagree with ragproof gate. It binds to 127.0.0.1 by default; --host 0.0.0.0 is an explicit opt-in with a startup warning.
  • Frontend workspace: frontend/ at the repo root, built by CI into ragproof/ui/static/ before packaging. The wheel never requires Node at install or run time.
  • Dev loop: ragproof ui --dev proxies to the Vite dev server for hot reload; production serves the bundle.
  • No external requests at runtime. Fonts and icons are bundled. The CSP is default-src 'self'. Same air-gap guarantee as the HTML report, enforced by a test that scans the built bundle for external URLs.
  • Live runs: v1 polls active runs every 2 seconds (cheap against SQLite). A follow-up can upgrade to SSE without changing the page design.
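The bundle scan mentioned in the no-external-requests bullet could look roughly like the following. This is a minimal sketch, not the project's actual test: the function name, the file suffix list, and the allowlist are assumptions. The allowlist exists because built bundles commonly embed XML namespace URIs (for example the SVG namespace), which are identifiers rather than network fetches.

```python
import re
from pathlib import Path

# Absolute http(s) URLs, stopping at quotes, whitespace, and closing brackets.
URL_PATTERN = re.compile(r"https?://[^\s\"'`<>)\\]+")

# Assumption: namespace URIs are identifiers, never fetched, so they pass.
ALLOWED_PREFIXES = ("http://www.w3.org/",)

# Assumption: only text assets need scanning; fonts and images are skipped.
TEXT_SUFFIXES = {".js", ".css", ".html", ".svg", ".json", ".map"}


def find_external_urls(bundle_dir: str) -> list[tuple[str, str]]:
    """Return (file name, url) for every non-allowlisted URL in the bundle."""
    hits = []
    for path in sorted(Path(bundle_dir).rglob("*")):
        if not path.is_file() or path.suffix not in TEXT_SUFFIXES:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for url in URL_PATTERN.findall(text):
            if not url.startswith(ALLOWED_PREFIXES):
                hits.append((path.name, url))
    return hits
```

A CI test would then assert that `find_external_urls("ragproof/ui/static")` returns an empty list after the frontend build.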

4. Information architecture

Sidebar (icon + label, collapsible)
  Runs          /                      default screen
  Compare       /compare?base=&cand=
  Trends        /trends
  Datasets      /datasets, /datasets/:id
  Calibration   /calibration
  Docs          link to bundled metric definitions (no external requests)

Run detail      /runs/:id             tabs: Overview | Cases | Gate | Metadata
Case detail     /runs/:id/cases/:key  routed side panel over the Cases tab
Command palette Ctrl/Cmd-K            jump to run, case, screen, or action

Top bar: project switcher (left), global search (center, opens palette), theme toggle and store path indicator (right).

5. Design system

Foundations

| Token group | Decision |
|---|---|
| Typeface | Inter for UI, JetBrains Mono for ids, hashes, scores in tables, and JSON. Both self-hosted, font-display: swap. |
| Type scale | 12 / 13 / 14 (body) / 16 / 20 / 28. Data-dense screens sit at 13-14px like Linear and Stripe. Tabular numerals (font-variant-numeric: tabular-nums) on every number column. |
| Spacing | 4px base grid. Component paddings from a 4/8/12/16/24/32 ladder only. |
| Radius | 6px controls, 10px cards and panels. |
| Elevation | Borders over shadows. One shadow level for overlays only. |
| Motion | 120-160ms ease-out on open/close, none on data refresh. Charts animate only on first paint. |

Color

Neutral scale plus one accent plus four semantic score colors. All pairs pass WCAG AA at their usage sizes in both themes.

| Role            | Light   | Dark    |
|-----------------|---------|---------|
| Background      | #fafafa | #0b0d10 |
| Surface / card  | #ffffff | #14171c |
| Border          | #e4e4e7 | #262a31 |
| Text primary    | #18181b | #e7e9ee |
| Text secondary  | #6b7280 | #9aa1ad |
| Accent (brand)  | #2563eb | #5b8def |
| Pass            | #16a34a | #4ade80 |
| Warn            | #d97706 | #fbbf24 |
| Fail            | #dc2626 | #f87171 |
| Skip / muted    | #9ca3af | #6b7280 |
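The WCAG AA claim can be checked mechanically. The sketch below implements the WCAG 2.x relative-luminance and contrast-ratio formulas; the function names are illustrative, and which foreground/background pairs to audit is left to the caller. AA requires at least 4.5:1 for normal-size text and 3:1 for large text.

```python
def _linearize(channel_8bit: int) -> float:
    """Linearize one 8-bit sRGB channel per the WCAG 2.x definition."""
    c = channel_8bit / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a #rrggbb color, in the range 0..1."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(foreground: str, background: str) -> float:
    """WCAG contrast ratio, from 1.0 (identical) up to 21.0 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


# Example audit of the primary text colors against each theme's background.
AA_NORMAL_TEXT = 4.5
for theme, fg, bg in [("light", "#18181b", "#fafafa"), ("dark", "#e7e9ee", "#0b0d10")]:
    ratio = contrast_ratio(fg, bg)
    print(f"{theme}: {ratio:.2f}:1 {'ok' if ratio >= AA_NORMAL_TEXT else 'FAIL'}")
```

Running the same check over every text/surface pair in both themes would turn the AA statement into a testable invariant.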

Score cells use a background tint of the semantic color at 10-14% opacity with the full-strength color as text, never solid fills, so tables stay readable.
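As an illustration of the tint rule, the helper below derives a score cell's text and background colors from one semantic color. It is a sketch only: the function name and the returned CSS-like dictionary are assumptions, and the real component would express this in the frontend's styling layer.

```python
def score_cell_style(hex_color: str, tint_opacity: float = 0.12) -> dict[str, str]:
    """Full-strength semantic color as text, same color at low alpha as tint."""
    if not 0.10 <= tint_opacity <= 0.14:
        raise ValueError("tint opacity must stay within the 10-14% band")
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return {
        "color": f"#{h}",
        "background-color": f"rgba({r}, {g}, {b}, {tint_opacity})",
    }
```

For the light-theme pass color, `score_cell_style("#16a34a")` yields a green text color over an `rgba(22, 163, 74, 0.12)` background.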

Data visualization rules

  • Every score chart has a fixed 0 to 1 domain so runs are visually comparable.
  • Gate thresholds render as a labeled horizontal line in the semantic fail color, following the Grafana rule: a breach must be visible before it is read.
  • Confidence intervals draw as whiskers on delta bars, never hidden.
  • Distributions are 10-bin histograms, consistent with the HTML report.
  • Skipped is a distinct visual state (hatched or muted), never rendered as 0.
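Two of the rules above, the fixed 0 to 1 domain with 10 bins and the separate treatment of skipped cases, can be sketched as a small binning function. The name and the convention that a skipped case is represented by `None` are assumptions made for the example.

```python
from typing import Optional, Sequence


def score_histogram(
    scores: Sequence[Optional[float]], bins: int = 10
) -> tuple[list[int], int]:
    """Bin scores over a fixed [0, 1] domain; count skipped cases separately."""
    counts = [0] * bins
    skipped = 0
    for score in scores:
        if score is None:
            # Skipped is its own state and must never be drawn as a 0 score.
            skipped += 1
            continue
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score outside the 0 to 1 domain: {score}")
        # A score of exactly 1.0 belongs to the last bin, not a phantom 11th.
        index = min(int(score * bins), bins - 1)
        counts[index] += 1
    return counts, skipped
```

Because the domain is fixed rather than fitted to the data, two runs with different score ranges still produce histograms whose bins line up.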

Component inventory (shadcn/ui base, customized)

Button, IconButton, Tabs, Table (virtualized), Badge (status), Tooltip, Dialog, Sheet (case side panel), Command palette, Select, Combobox (run picker), Toast, Skeleton, EmptyState, CodeBlock (JSON with copy), ScoreCell, DeltaCell (arrow + CI whisker), MetricSparkline, RunStatusDot.

6. Screen specifications

6.1 Runs (home)

The Braintrust-style experiments table. One glance answers: what ran, did it pass, what moved.

+--------------------------------------------------------------------------+
| Project: example v          [search runs...]              [theme] [store] |
|--------------------------------------------------------------------------|
| Runs (24)                                    baseline: [latest v] [gate]  |
|--------------------------------------------------------------------------|
| status | label        | started      | cases | ground. | recall | delta  |
| ● pass | reranker-v2  | 2m ago       | 120   | 0.91 ▂▅█ | 0.94  | +0.02 |
| ● fail | prompt-tweak | 1h ago       | 120   | 0.78 ▅▂▁ | 0.93  | -0.13 |
| ◐ run  | nightly      | running 34%  | 41/120|   -      |  -    |   -   |
+--------------------------------------------------------------------------+
  • Columns: status dot (pass/fail from the gate, gray when ungated), label, started (relative), case counts with error badge, one ScoreCell per pinned metric (mean + micro-distribution), delta vs the selected baseline.
  • Pinned metrics default to groundedness, recall@k, injection resistance; the column picker persists per project in localStorage.
  • Clicking a row opens Run detail. Selecting exactly two rows with their checkboxes enables the Compare button. Running rows show a progress cell and poll.
  • Empty state: short copy plus the exact command to run (ragproof run --config ragproof.yaml), Langfuse-style.
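The micro-distribution shown next to each mean in the wireframe (for example `0.91 ▂▅█`) can be approximated with Unicode block characters. This is only a text-mode sketch of the idea; the actual ScoreCell would presumably draw the distribution in the chart layer, and the function name is hypothetical.

```python
BLOCKS = "▁▂▃▄▅▆▇█"


def sparkline(counts: list[int]) -> str:
    """Map bin counts to block characters, scaled to the tallest bin."""
    peak = max(counts, default=0)
    if peak == 0:
        # No scored cases: draw a flat baseline rather than dividing by zero.
        return BLOCKS[0] * len(counts)
    top = len(BLOCKS) - 1
    return "".join(BLOCKS[round(count / peak * top)] for count in counts)
```

Scaling to the tallest bin keeps the shape readable at cell size, at the cost of hiding absolute counts, which the case count column already carries.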

6.2 Run detail - Overview tab

The HTML report, upgraded to interactive.

  • Header: label, run id (copyable), status, started/duration, judge model and prompt hashes, dataset hash chip linking to the dataset page, cost, cache hit rate.
  • Metric grid: one card per metric with mean, p50/p95, scored/skipped/error counts, 10-bin histogram, and the gate threshold line when configured. Skipped metrics show the reason string, never an empty chart.
  • A "worst cases" strip per judge metric linking into the Cases tab with the filter pre-applied.

6.3 Run detail - Cases tab and case panel

The triage surface, and the screen that sells the product in demos.

+----------------------------------------------------------+---------------+
| filter: [metric v] [status v] [kind v] [sort: worst v]    | case qa-0042  |
|----------------------------------------------------------| question      |
| key     | kind | ground. | recall | inj. | status        | answer        |
| qa-0042 | qa   |  0.33   | 1.00   |  -   | ok            | [context]     |
| qa-0007 | qa   |  0.50   | 0.80   |  -   | ok            | claims:       |
| inj-003 | inj  |   -     |  -     | 0.00 | ok            |  ✓ claim one  |
| qa-0019 | qa   |   -     |  -     |  -   | timeout       |  ✗ claim two  |
+----------------------------------------------------------+ raw judge json|
  • Virtualized table, one row per case, score columns for every metric that scored at least one case. Sort "worst first" per metric is the default entry point from the Overview strip.
  • Selecting a row opens the LangSmith-style side panel (routed, deep-linkable) with: question, answer, retrieved chunks with the cited ones highlighted, and for groundedness the per-claim verdicts rendered as a checklist from judge_raw_json. Raw judge JSON behind a collapsible CodeBlock.
  • j/k move between cases, Esc closes the panel, and the arrow keys keep working while the panel is open. Failure triage must be a keyboard loop.
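The "worst first" sort and the j/k selection loop both have small edge cases worth pinning down: unscored cases should not masquerade as the worst scores, and selection should stop at the ends of the list. A sketch under assumed data shapes (each case is a dict with a `key` and a `scores` mapping; both function names are hypothetical):

```python
def worst_first(cases: list[dict], metric: str) -> list[dict]:
    """Sort by the metric ascending; cases without a score go last."""

    def sort_key(case: dict) -> tuple:
        score = case["scores"].get(metric)
        # Tuple order: scored before unscored, then lowest score, then key
        # so that ties are stable and deep links stay reproducible.
        return (score is None, score if score is not None else 0.0, case["key"])

    return sorted(cases, key=sort_key)


def move_selection(current: int, step: int, total: int) -> int:
    """Index after a j (+1) or k (-1) keypress, clamped to the list bounds."""
    return max(0, min(total - 1, current + step))
```

Using the rows from the wireframe, sorting by groundedness would put qa-0042 (0.33) before qa-0007 (0.50), with inj-003 and qa-0019 after them because they carry no groundedness score.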

6.4 Compare

The regression view: two runs, changed cases first.

  • Header: two run pickers with the mixed-judge guard surfaced as a blocking banner (identical rule to the CLI, including the override).
  • Metric delta table: baseline, candidate, delta with CI whisker, verdict chip. A dataset mismatch renders a persistent warning banner.
  • Case diff grid: rows are cases, cells are per-metric deltas; default filter "changed only", sorted by biggest regression. Clicking opens the case panel split view, baseline answer next to candidate answer.
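The default state of the case diff grid, "changed only" sorted by biggest regression, amounts to a join on case key followed by a filter and a sort. The sketch below assumes each run is available as a mapping from case key to per-metric scores; the names and the tolerance value are illustrative.

```python
def changed_cases(
    baseline: dict[str, dict[str, float]],
    candidate: dict[str, dict[str, float]],
    metric: str,
    epsilon: float = 1e-9,
) -> list[tuple[str, float, float, float]]:
    """Rows of (case key, baseline, candidate, delta), worst regression first."""
    rows = []
    # Only cases present in both runs can be compared.
    for key in baseline.keys() & candidate.keys():
        before = baseline[key].get(metric)
        after = candidate[key].get(metric)
        if before is None or after is None:
            continue  # unscored on one side: not a numeric delta
        delta = after - before
        if abs(delta) > epsilon:
            rows.append((key, before, after, delta))
    # Most negative delta first; case key breaks ties deterministically.
    rows.sort(key=lambda row: (row[3], row[0]))
    return rows
```

Cases that exist in only one run are dropped here for brevity; the dataset mismatch banner described above is the place to surface them.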

6.5 Trends

  • One chart per gated metric: mean per run over time, 0 to 1 domain, gate threshold line, points colored by gate verdict, error bars from p50/p95.
  • Range picker (last 10/50/all runs), label filter for run series.
  • Clicking a point navigates to that run.

6.6 Datasets

  • List: name, size, kind breakdown (qa/unanswerable/injection), frozen hash, created, runs count.
  • Detail: generation metadata (model, seed, discard counts, template hashes), case browser reusing the Cases table, list of runs over this dataset.

6.7 Calibration

  • Latest agreement per judge prompt: exact and within-band bars against the thresholds, pass/fail chip per prompt, judge model shown.
  • Empty state teaches ragproof calibrate.

6.8 Cross-cutting states

Every screen defines all four states before implementation: loading (skeletons matching final layout), empty (one sentence plus the CLI command that produces data), error (message plus retry, never a blank panel), and partial (running run, missing judge, skipped metrics). No screen ships without all four.

7. API contract (v1, read-only)

All under /api, JSON, served by the same process. Pydantic response models shared with the CLI's data layer.

GET /api/meta                     store path, schema version, package version
GET /api/projects                 [{id, name, run_count, last_run_at}]
GET /api/runs?project=&limit=&cursor=
                                  paginated run summaries + pinned metric means
GET /api/runs/{ref}               full RunDetail header + metric summaries
GET /api/runs/{ref}/cases?metric=&status=&kind=&sort=&limit=&cursor=
GET /api/runs/{ref}/cases/{key}   full case detail incl. judge_raw
GET /api/runs/{ref}/gate?baseline=
                                  GateOutcome via gate.evaluate_gate
GET /api/compare?baseline=&candidate=&allow_mixed_judges=
GET /api/trends?project=&metric=&limit=
GET /api/datasets                 list with kind breakdown
GET /api/datasets/{id}            metadata + cases + runs
GET /api/calibration              latest stored calibration results, if any

Rules: run refs accept id, unique prefix, and latest, same as the CLI. Errors return {error, code} with the CLI's exit-code taxonomy mapped to HTTP (config 400, not found 404, execution 500). Responses that can exceed a few hundred rows paginate with cursors.
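The run-ref and error rules can be sketched as follows. Everything named here is an assumption for illustration: the exception class, the string values used for `code`, and the choice to treat an ambiguous prefix as a 400-class error are not specified by the plan, which only fixes the three ref forms and the 400/404/500 mapping.

```python
# Assumed code names; the plan only states config 400, not found 404, execution 500.
HTTP_STATUS = {"config": 400, "not_found": 404, "execution": 500}


class RunRefError(Exception):
    """Carries a taxonomy code alongside the human-readable message."""

    def __init__(self, message: str, code: str) -> None:
        super().__init__(message)
        self.code = code


def resolve_run_ref(ref: str, run_ids_newest_first: list[str]) -> str:
    """Resolve 'latest', a full id, or a unique id prefix to one run id."""
    if not run_ids_newest_first:
        raise RunRefError("the store contains no runs", "not_found")
    if ref == "latest":
        return run_ids_newest_first[0]
    if ref in run_ids_newest_first:
        return ref
    matches = [run_id for run_id in run_ids_newest_first if run_id.startswith(ref)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise RunRefError(f"no run matches {ref!r}", "not_found")
    # Assumption: an ambiguous prefix is a caller mistake, so it maps to 400.
    raise RunRefError(f"prefix {ref!r} matches {len(matches)} runs", "config")


def error_response(exc: RunRefError) -> tuple[int, dict[str, str]]:
    """Build the (HTTP status, {error, code}) pair for a failed request."""
    return HTTP_STATUS[exc.code], {"error": str(exc), "code": exc.code}
```

Keeping resolution in one function shared by the CLI and the server is what makes "same as the CLI" hold by construction rather than by parallel implementation.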

8. Quality bar

  • Accessibility: WCAG AA contrast in both themes, full keyboard operability, focus visible everywhere, prefers-reduced-motion respected, status never conveyed by color alone (dot + label).
  • Performance: first meaningful paint of the Runs screen under 1s against a store with 1,000 runs; case tables virtualized; API list endpoints paginated; bundle under 300KB gzipped, code-split per route.
  • Testing: backend endpoints tested with the same fixtures as the CLI (pytest, no Node needed); frontend components with Vitest + Testing Library; one Playwright smoke that boots ragproof ui against a seeded store and walks Runs -> Run -> Case -> Compare. The no-external-requests test runs against the built bundle in CI.
  • Consistency oath: every number the UI shows comes from the same code path as the CLI (reports/data.py, gate.py). If the UI and ragproof gate ever disagree, that is a release-blocking bug.

9. Delivery phases

Tracked as P8-xx in TASKS.md. Each phase lands green and demoable.

| Phase | Scope | Acceptance |
|---|---|---|
| UI-0 Foundation | frontend/ workspace (Vite, React, TS, Tailwind, shadcn/ui), design tokens and dark mode, FastAPI server behind ragproof[ui], ragproof ui command, static bundling into the wheel, CI job (typecheck, lint, test, build, bundle-scan) | ragproof ui opens a themed shell with live /api/meta data on a machine without Node |
| UI-1 Runs and Run detail | Runs table with pinned ScoreCells and baseline deltas, Run detail Overview and Metadata tabs, polling for running runs, all four states | A seeded store renders the home table and run overview matching CLI numbers exactly |
| UI-2 Case triage | Cases tab with virtualized grid, filters, worst-first sort, routed case side panel with claim checklist and cited-chunk highlighting, j/k navigation | Triage loop works end to end by keyboard; deep links restore full state |
| UI-3 Compare and Trends | Compare screen with delta table, CI whiskers, changed-cases diff grid and split case view; Trends charts with threshold lines | Braintrust-style regression walkthrough works on two seeded runs; mixed-judge guard blocks correctly |
| UI-4 Gate, Datasets, Calibration | Gate tab rendering GateOutcome, Datasets list/detail, Calibration screen, command palette, README screenshots | Gate verdicts in UI and CLI are identical on the same store; palette jumps everywhere |
| UI-5 Polish and release | A11y audit, performance pass against a 1,000-run store, Playwright smoke in CI, docs (docs/ui.md), screenshots and a GIF for the README | Quality bar in section 8 fully met; UI ships in the wheel behind ragproof[ui] |

10. Decisions and risks

  • React SPA over server-rendered templates. The case panel, diff grid, palette, and keyboard loops are heavily interactive; a component framework with a mature primitive library (Radix) is the honest cost of "professional grade". The build stays out of the install path by shipping the bundle.
  • Recharts over vendored Chart.js or D3. Declarative, tree-shakeable, fits the fixed-layout charts we need; no runtime CDN, consistent with the air-gap rule.
  • Read-only v1 is a feature, not a gap: it preserves the CI-native trust model. "Run from UI" and SSE live-streaming are listed as post-v1.
  • Risk: frontend toolchain drift on the CI matrix. Mitigation: the Node toolchain runs in exactly one CI job; Python jobs never need it because the bundle is an artifact.
  • Risk: two sources of truth for metrics rendering. Mitigation: the consistency oath above plus a CI test comparing /api/runs/{ref} numbers with CLI --json output on the same store.
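The CI comparison named in the last mitigation could be built on a structural diff of the two JSON payloads. The helper below is a sketch: it assumes both the API response and the CLI --json output have already been parsed into Python objects with the same shape, and it compares values exactly, in line with the "matching CLI numbers exactly" acceptance criterion. The function name is hypothetical.

```python
def payload_mismatches(api: object, cli: object, path: str = "$") -> list[str]:
    """List every path at which two parsed JSON payloads differ."""
    if isinstance(api, dict) and isinstance(cli, dict):
        problems = []
        for key in sorted(api.keys() | cli.keys()):
            if key not in api or key not in cli:
                problems.append(f"{path}.{key}: present on one side only")
            else:
                problems.extend(payload_mismatches(api[key], cli[key], f"{path}.{key}"))
        return problems
    if isinstance(api, list) and isinstance(cli, list):
        if len(api) != len(cli):
            return [f"{path}: length {len(api)} != {len(cli)}"]
        problems = []
        for index, (left, right) in enumerate(zip(api, cli)):
            problems.extend(payload_mismatches(left, right, f"{path}[{index}]"))
        return problems
    # Leaves are compared exactly: no tolerance, since both sides are meant
    # to come from the same code path.
    return [] if api == cli else [f"{path}: {api!r} != {cli!r}"]
```

A test would fetch /api/runs/{ref}, run the CLI with --json against the same store, and assert that the returned list is empty; reporting paths rather than a bare inequality makes a release-blocking disagreement quick to locate.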