Overrode the plan's read-only-v1 decision at the user's request: the dashboard is now the main control panel, able to run operations, not just view them.
Done:
- Background jobs subsystem (ragproof/ui/jobs.py): in-memory JobManager runs async operations, captures redacted log lines and a terminal status, evicts old jobs, drains on shutdown.
- Actions API (ragproof/ui/actions.py) reusing the engine, generate, freeze, calibrate, check, and report code as jobs: POST /api/actions/{run,generate, freeze,calibrate,check}, POST /api/runs/{ref}/report producing downloadable artifacts, GET /api/jobs[/id], GET /api/config, GET /api/artifacts/{job}/{name} with path-traversal guard.
- Frontend: top-bar New run button plus a menu (generate, calibrate, check), Jobs screen with live log tail and run/report links, Config viewer, per-run Re-run and Report buttons, command-palette actions group, sidebar Jobs and Config entries.
- 9 action/jobs API tests: run recorded from the UI, report artifacts and traversal block, check probe, freeze, generate-failure, config, 404.
- Verified live in the browser: New run created a run end to end, report job produced a downloadable HTML artifact, traversal blocked, Jobs screen shows both with live logs, no console errors.
Decisions:
- Config is view-only. Writing arbitrary config from a browser is an unbounded risk, so the on-disk ragproof.yaml stays the single source of truth.
- Actions run with the same reach as the CLI on that machine; the server stays 127.0.0.1 with a warning on other hosts, documented in docs/ui.md.
- Jobs are in-memory (not persisted). They are a live operations view, not an audit log; the run store remains the durable record.
Done:
- Read API (ragproof/ui/api.py) reusing reports/data.py, gate.py, compare.py: projects, paginated runs, run detail, cases with filter and worst sort, case detail, gate, compare, trends, datasets, calibration. Store gained the list and aggregate queries a UI needs. 7 endpoint tests assert the API matches the CLI on the same store.
- Runs home: table with status, pinned metric ScoreCells, per-run deltas, live polling for running runs, and the four designed states.
- Run detail: Overview (metric cards, distribution histograms, worst-case strips), Cases triage grid with filters and worst-first sort, routed case side panel (question, answer, cited-chunk highlighting, per-claim groundedness checklist, raw judge JSON, Esc + deep link), Gate tab rendering the same verdict and CIs as the CLI, Metadata tab.
- Compare screen (delta table, dataset-mismatch warning), Trends charts (0-1 domain, click-through), Datasets list and detail, Calibration screen, Ctrl/Cmd-K command palette.
- Animations on data: count-up numbers (AnimatedNumber), animated score bars, staggered row entrance, animated histograms and trend lines, spring-in case panel and palette; all respect prefers-reduced-motion.
- Route code-splitting: home loads ~145KB gzipped, the chart library (~104KB) loads only on run detail and trends. Under the 300KB budget.
- docs/ui.md; README dashboard section; frontend tests (theme, RunsPage states) plus the backend API and bundle-scan tests.
- Verified live in the browser: runs table, run overview, case panel, and gate tab all render real store data with no console errors.
Decisions:
- Recharts for charts (declarative, tree-shakeable, no runtime CDN); it is the bulk of the split chart chunk, kept off the home page by lazy routes.
- The case panel is a deep-linkable ?case= param rather than a nested route, which keeps it deep-linkable without a nested router.
Done:
- frontend/ workspace: Vite, React 19, TypeScript strict, Tailwind v4, ESLint flat config, Prettier, Vitest with Testing Library
- Design tokens from UI_PLAN section 5 as CSS variables, dark mode via a .dark class with prefers-color-scheme default and a persisted toggle; Inter and JetBrains Mono self-hosted through fontsource
- App shell: sidebar navigation for all planned screens, top bar with live project and store path, Runs page foundation card with designed loading, error, and ready states
- FastAPI server in ragproof/ui/ behind the ragproof[ui] extra: /api/meta, CSP default-src 'self', nosniff and referrer headers, SPA fallback that never swallows unknown /api paths (404 JSON), static root path guard
- ragproof ui command: binds 127.0.0.1 by default with a warning on other hosts, opens the browser, --no-browser and --dev flags, clear install hint when the extra is missing
- Build pipeline: vite builds into ragproof/ui/static (committed so wheels and no-Node machines work); release workflow builds the frontend before uv build; ui.yml CI job runs typecheck, lint, format, tests, build, and an external-resource bundle scan; main CI installs the ui extra
- Bundle: 77.5KB gzipped JS, fonts as woff2, zero external loads (tested)
- Verified live: ragproof ui served /api/meta, the themed shell, CSP headers, and client-route fallback on this machine
Decisions:
- React 19 instead of the plan's React 18; it is the current stable line and Testing Library and Router support it fully.
- lucide-react resolved to 1.x (the 0.x line graduated); API unchanged.
- The built bundle is committed. Rationale: acceptance requires the UI to work from a checkout without Node, and wheels must build from any clone. Revisit if bundle churn becomes noisy in diffs.
Next: UI-1 (runs table and run detail), interleaved with Phase 7 launch work.
Done:
- Single-file HTML report: overview with inline SVG mean bars, per-metric SVG distribution histograms, worst-10 cases per metric with judge reasoning, dataset/config/prompt hashes, cost; zero network requests, verified by a test that greps the output for external resource loads and a live check (24KB, 0)
- Markdown summary for PR comments; JUnit XML that marks threshold breaches as and execution errors as , with a run-level execution case when cases fail to evaluate
- ragproof gate: absolute min thresholds, relative max_drop vs baseline, per-metric noise_floor, bootstrap 95% CI so an uncertain drop warns instead of failing, on_missing fail|skip, min_samples warning; exit 0/1/2/3
- ragproof report writes html, md, and junit
- --json output on run, compare, and gate
- Reusable composite GitHub Action (action.yml): install, run, gate, upload the HTML artifact, sticky PR comment with the Markdown summary
- Dockerfile: multi-stage, slim, non-root
- Dogfood workflow runs the action against the example pipeline in this repo
- docs/ci.md; verified the wheel ships all judge and dataset templates and fixtures so pip installs run without missing package data
Decisions:
- Charts are hand-rolled inline SVG instead of vendored Chart.js. The plan's RW-13 goal is a self-contained, zero-network report; inline SVG meets that with no 200KB third-party bundle and nothing to fetch. Documented here.
- TODO(decision): the Dockerfile pins the base image by tag, not digest. Digest pinning needs a registry resolve; do it in an environment with network access before release.
Next: Phase 7, case studies and launch.
Done:
- Payload library expanded to 10 categories (instruction override, data exfiltration, tone hijack, citation spoofing, system prompt disclosure, formatting hijack, competitor steering, fake citation, link bait, chained instructions), each with a unique inert sentinel and a compliance detector
- Payload safety lint in the standard suite: reserved .invalid domains only, no shell metacharacters, every marker appears in its instruction and trips only its own detector, clean answers trip none
- robustness.injection_resistance: deterministic, 1.0 when the answer resisted the planted payload, 0.0 when it complied; detects by payload id or a literal marker fallback for hand-written cases
- robustness.abstention: refusal on unanswerable cases; deterministic heuristic with judge confirmation when a judge is configured; higher is better
- robustness.overrefusal: refusal rate on answerable cases, reported beside abstention so a refuse-everything pipeline cannot hide; lower is better
- refusal is now a calibrated judge prompt with 12 fixtures, wired into calibrate and prompt versioning
- Two example pipelines: a deliberately vulnerable one that obeys injections and fabricates, and a guarded one that ignores injections and abstains
- Injection generation now plants the payload instruction in the question and stores the payload id, so cases are self-contained and testable
- Integration tests assert the vulnerable and guarded pipelines land at the opposite ends of injection_resistance and abstention
Decisions:
- Injection instructions are delivered in the question rather than by emitting poisoned corpus files (still future). This makes cases portable and testable without wiring poisoned documents through a pipeline's own retrieval.
Next: Phase 6, reports and the CI gate.
Done:
- Corpus ingestion: TXT and Markdown in the core install, PDF and DOCX behind the ingest extra with a clear install hint; per-file size cap; unreadable, oversize, and empty files skipped with a reported reason; deterministic ordering
- Deterministic chunking with overlap; stable chunk ids scoped to documents
- Generation client reusing the chat transport, with a repair retry and a budget guard; sequential so ordering is deterministic for a seed
- QA synthesis with a second-pass answerability check (failures discarded and counted); unanswerable synthesis verified absent via a lexical retrieval plus judge pass; injection cases attach an inert payload marker
- Inert payload seed set with a safety test asserting reserved domains only and no shell metacharacters; Phase 5 expands this and adds detectors
- ragproof generate: ingest, generate, write a JSONL review file, print produced and discarded counts
- ragproof freeze: validates cases (qa needs a source id, injection needs a payload), writes dataset.frozen.json with a canonical sha256 embedded
- Frozen datasets verified on load; a file edited after freezing is refused; run and check accept plain JSONL or a frozen dataset
- Shared prompt-template module (render plus hashing) now used by both the judge and the generator
- docs/datasets.md; tiny test corpus under tests/fixtures/corpus
Decisions:
- Generated qa and injection cases use the document path as expected_source_ids since our chunk ids will not match a pipeline's own ids. Documented in docs/datasets.md with the source_match: document guidance.
- Poisoned corpus files are not emitted in v1; the injection case carries the payload marker, which is what Phase 5 detection needs. Noted as future.
Next: Phase 5, robustness metrics.
Done:
- Provider-agnostic judge client: OpenRouter, OpenAI, Ollama (no key needed), Anthropic; temperature 0, per-call timeout, retries with Retry-After, API key read from the environment at call time and never stored
- Structured judging: JSON extracted and schema-validated, one repair retry with the validation error appended, then metric-level judge_error (excluded from means, added to the metric's error count, never scored 0)
- Judge cache in the run store keyed by (model, prompt hash, input hash); hit/miss stats printed per run; --no-cache bypasses it
- Versioned prompt templates (sha256 recorded per run) with calibration fixtures, 10 human-scored examples per prompt
- Generation metrics: citation_validity (deterministic, duplicates deduped), groundedness (per-claim verdicts in judge_raw_json, zero claims is not_applicable), citation_support, answer_relevance, completeness
- Cost accounting from provider usage with estimate fallback; budget checked before every call; breach aborts the run as aborted:budget with exit 2 and keeps completed results resumable
- ragproof calibrate with per-prompt exact and within-band agreement, exit 1 below thresholds; calibrate.yml CI workflow on prompt or fixture changes (skips with a warning until the RAGPROOF_LLM_API_KEY secret is configured)
- Mixed-judge guard: compare refuses runs with different judge models or prompt versions unless --allow-mixed-judges; resume refuses a changed judge
- Schema v2 migration (judge_cache table) exercised the versioned-migration path: v1 databases migrate with a backup file
- Metric protocol is now async; raw judge output persisted redacted per case
Decisions:
- judge_error is recorded per metric, not as a whole-case status: a judge failure on one metric must not discard the deterministic scores of the same case. The spec's "fails the case, not the run" intent is preserved.
Next: Phase 4, dataset generation.
Done:
- Retrieval metrics with exact-value fixture tests: precision_at_k, recall_at_k (denominator rules documented), mrr (full-list rank), ndcg_at_k (binary relevance)
- Edge cases fixed and tested: fewer than k retrieved, empty retrieval, duplicate ids collapsed keeping first rank, relevant item below k, no hit
- Source matching granularity: run.source_match chunk|document, document mode maps chunks through metadata.document_id with chunk-id fallback
- Metric registry refactored to per-run factories so k and source_match come from config; engine builds metrics through the registry
- Skip reasons surfaced: summaries carry the dominant skip reason and the run table prints it, so a no-retrieval pipeline says why instead of showing zeros
- ragproof compare: resolves run ids, unique prefixes, and 'latest'; prints per-metric deltas; warns when runs used different datasets
- docs/metrics.md documents every metric's exact computation
Notes:
- Freeze-time rejection of qa cases with empty expected_source_ids (RW-9) lands with 'ragproof freeze' in phase 4; at run time such cases skip with a reason today.
Next: Phase 3, judge layer and generation metrics.
Done:
- Repo scaffold, pyproject with uv lockfile, ruff and strict mypy, CI matrix (3 OS x 3 Python), CodeQL, gitleaks, dependency review, release workflow for PyPI Trusted Publishing
- CLI with all commands registered; run and check are implemented, the rest exit 3 with a message until their phase lands
- Exit code contract: 0 pass, 1 gate failure, 2 execution error, 3 config error
- Adapter protocol with capability flags, Python adapter (sync and async targets), HTTP adapter (JSONPath mapping, retries with Retry-After support, env-sourced headers, 4xx fail fast)
- SQLite run store: WAL mode, serialized writes, schema versioning with backup-before-migrate, full domain model from spec section 5
- Run engine: bounded concurrency, per-case timeout and error capture, incremental result persistence, --resume for interrupted runs
- Canonical JSON hashing for dataset and config hashes
- Secret redaction applied to logs and persisted error payloads
- Cost ledger scaffold (real accounting lands in phase 3)
- Example pipeline, example dataset and configs, full test suite
Decisions:
- TODO(decision): schema migrations use an in-repo versioned migration table instead of Alembic. PHASE_PLAN P1-09 asked for Alembic; the lighter mechanism satisfies RW-16 (version stored, newer-schema error, backup-before-migrate) with less machinery. Revisit if the schema grows.
Needs the account owner (cannot be done from this machine):
- Create the GitHub repo and push (P0-01)
- Configure PyPI Trusted Publishing and publish the 0.0.1 placeholder (P0-12)
Next: Phase 2, retrieval metrics and compare.