Progress

2026-07-03: Phase 8, UI-6 control panel

Overrode the plan's read-only-v1 decision at the user's request: the dashboard is now the main control panel, able to run operations, not just view them.

Done:

Background jobs subsystem (ragproof/ui/jobs.py): in-memory JobManager runs async operations, captures redacted log lines and a terminal status, evicts old jobs, drains on shutdown.
Actions API (ragproof/ui/actions.py) reusing the engine, generate, freeze, calibrate, check, and report code as jobs: POST /api/actions/{run,generate,freeze,calibrate,check}, POST /api/runs/{ref}/report producing downloadable artifacts, GET /api/jobs[/id], GET /api/config, GET /api/artifacts/{job}/{name} with a path-traversal guard.
Frontend: top-bar New run button plus a menu (generate, calibrate, check), Jobs screen with live log tail and run/report links, Config viewer, per-run Re-run and Report buttons, command-palette actions group, sidebar Jobs and Config entries.
9 action/jobs API tests: run recorded from the UI, report artifacts and traversal block, check probe, freeze, generate-failure, config, 404.
Verified live in the browser: New run created a run end to end, report job produced a downloadable HTML artifact, traversal blocked, Jobs screen shows both with live logs, no console errors.
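A path-traversal guard of the kind the artifacts endpoint needs can be sketched as follows. This is a minimal illustration, not the code in ragproof/ui/actions.py: the helper name resolve_artifact and the root/job/name directory layout are assumptions.

```python
from pathlib import Path


def resolve_artifact(root, job_id, name):
    """Return the artifact path if it stays inside root/job_id, else None.

    Hypothetical helper; the on-disk layout root/<job_id>/<name> is assumed.
    """
    base = Path(root).resolve()
    job_dir = (base / job_id).resolve()
    candidate = (job_dir / name).resolve()
    # The job directory must be strictly below the artifact root.
    if job_dir == base or not job_dir.is_relative_to(base):
        return None
    # The artifact must be strictly below its own job directory, which
    # rejects ".." segments and absolute names after resolution.
    if candidate == job_dir or not candidate.is_relative_to(job_dir):
        return None
    return candidate
```

Resolving before comparing matters: a check on the raw string would miss paths that only escape after ".." segments or symlinks are expanded. Path.is_relative_to needs Python 3.9 or newer.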

Decisions:

Config is view-only. Writing arbitrary config from a browser is an unbounded risk, so the on-disk ragproof.yaml stays the single source of truth.
Actions run with the same reach as the CLI on that machine; the server stays bound to 127.0.0.1 by default, with a warning on other hosts, documented in docs/ui.md.
Jobs are in-memory (not persisted). They are a live operations view, not an audit log; the run store remains the durable record.

2026-07-03: Phase 8, UI-1 through UI-5

Done:

Read API (ragproof/ui/api.py) reusing reports/data.py, gate.py, compare.py: projects, paginated runs, run detail, cases with filter and worst sort, case detail, gate, compare, trends, datasets, calibration. Store gained the list and aggregate queries a UI needs. 7 endpoint tests assert the API matches the CLI on the same store.
Runs home: table with status, pinned metric ScoreCells, per-run deltas, live polling for running runs, and the four designed states.
Run detail: Overview (metric cards, distribution histograms, worst-case strips), Cases triage grid with filters and worst-first sort, routed case side panel (question, answer, cited-chunk highlighting, per-claim groundedness checklist, raw judge JSON, Esc + deep link), Gate tab rendering the same verdict and CIs as the CLI, Metadata tab.
Compare screen (delta table, dataset-mismatch warning), Trends charts (0-1 domain, click-through), Datasets list and detail, Calibration screen, Ctrl/Cmd-K command palette.
Animations on data: count-up numbers (AnimatedNumber), animated score bars, staggered row entrance, animated histograms and trend lines, spring-in case panel and palette; all respect prefers-reduced-motion.
Route code-splitting: home loads ~145KB gzipped, the chart library (~104KB) loads only on run detail and trends. Under the 300KB budget.
docs/ui.md; README dashboard section; frontend tests (theme, RunsPage states) plus the backend API and bundle-scan tests.
Verified live in the browser: runs table, run overview, case panel, and gate tab all render real store data with no console errors.

Decisions:

Recharts for charts (declarative, tree-shakeable, no runtime CDN); it is the bulk of the split chart chunk, kept off the home page by lazy routes.
The case panel is driven by a ?case= query param rather than a nested route, which keeps it deep-linkable without a nested router.

2026-07-03: Phase 8, UI-0 foundation

Done:

frontend/ workspace: Vite, React 19, TypeScript strict, Tailwind v4, ESLint flat config, Prettier, Vitest with Testing Library
Design tokens from UI_PLAN section 5 as CSS variables, dark mode via a .dark class with prefers-color-scheme default and a persisted toggle; Inter and JetBrains Mono self-hosted through fontsource
App shell: sidebar navigation for all planned screens, top bar with live project and store path, Runs page foundation card with designed loading, error, and ready states
FastAPI server in ragproof/ui/ behind the ragproof[ui] extra: /api/meta, CSP default-src 'self', nosniff and referrer headers, SPA fallback that never swallows unknown /api paths (404 JSON), static root path guard
ragproof ui command: binds 127.0.0.1 by default with a warning on other hosts, opens the browser, --no-browser and --dev flags, clear install hint when the extra is missing
Build pipeline: vite builds into ragproof/ui/static (committed so wheels and no-Node machines work); release workflow builds the frontend before uv build; ui.yml CI job runs typecheck, lint, format, tests, build, and an external-resource bundle scan; main CI installs the ui extra
Bundle: 77.5KB gzipped JS, fonts as woff2, zero external loads (tested)
Verified live: ragproof ui served /api/meta, the themed shell, CSP headers, and client-route fallback on this machine

Decisions:

React 19 instead of the plan's React 18; it is the current stable line and Testing Library and Router support it fully.
lucide-react resolved to 1.x (the 0.x line graduated); API unchanged.
The built bundle is committed. Rationale: acceptance requires the UI to work from a checkout without Node, and wheels must build from any clone. Revisit if bundle churn becomes noisy in diffs.

Next: UI-1 (runs table and run detail), interleaved with Phase 7 launch work.

2026-07-02: Phase 6

Done:

Single-file HTML report: overview with inline SVG mean bars, per-metric SVG distribution histograms, worst-10 cases per metric with judge reasoning, dataset/config/prompt hashes, cost; zero network requests, verified by a test that greps the output for external resource loads and a live check (24KB, 0 external requests)
Markdown summary for PR comments; JUnit XML that marks threshold breaches as `<failure>` and execution errors as `<error>`, with a run-level execution case when cases fail to evaluate
ragproof gate: absolute min thresholds, relative max_drop vs baseline, per-metric noise_floor, bootstrap 95% CI so an uncertain drop warns instead of failing, on_missing fail|skip, min_samples warning; exit 0/1/2/3
ragproof report writes html, md, and junit
--json output on run, compare, and gate
Reusable composite GitHub Action (action.yml): install, run, gate, upload the HTML artifact, sticky PR comment with the Markdown summary
Dockerfile: multi-stage, slim, non-root
Dogfood workflow runs the action against the example pipeline in this repo
docs/ci.md; verified the wheel ships all judge and dataset templates and fixtures so pip installs run without missing package data
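The idea behind "an uncertain drop warns instead of failing" can be sketched with a percentile bootstrap on the mean difference. This is an illustration under stated assumptions, not the rule in gate.py: the function names, the resample count, and the exact pass/warn/fail decision are hypothetical, and noise_floor, on_missing, and min_samples are left out.

```python
import random
from statistics import fmean


def bootstrap_drop_ci(baseline, current, n_resamples=2000, seed=0):
    """95% percentile bootstrap CI for mean(baseline) - mean(current)."""
    rng = random.Random(seed)  # seeded so the verdict is reproducible
    drops = []
    for _ in range(n_resamples):
        b = rng.choices(baseline, k=len(baseline))
        c = rng.choices(current, k=len(current))
        drops.append(fmean(b) - fmean(c))
    drops.sort()
    lo = drops[int(0.025 * n_resamples)]
    hi = drops[int(0.975 * n_resamples) - 1]
    return lo, hi


def gate_relative(baseline, current, max_drop):
    """Return "pass", "warn", or "fail" for one metric (assumed rule)."""
    drop = fmean(baseline) - fmean(current)
    if drop <= max_drop:
        return "pass"
    lo, _hi = bootstrap_drop_ci(baseline, current)
    # Fail only when even the optimistic end of the interval breaches
    # max_drop; otherwise the drop may be noise, so warn.
    return "fail" if lo > max_drop else "warn"
```

With only a handful of samples per run the interval is wide, so a large observed drop can still come back as a warning; that is the behavior the min_samples warning is meant to flag.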

Decisions:

Charts are hand-rolled inline SVG instead of vendored Chart.js. The plan's RW-13 goal is a self-contained, zero-network report; inline SVG meets that with no 200KB third-party bundle and nothing to fetch. Documented here.
TODO(decision): the Dockerfile pins the base image by tag, not digest. Digest pinning needs a registry resolve; do it in an environment with network access before release.

Next: Phase 7, case studies and launch.

2026-07-02: Phase 5

Done:

Payload library expanded to 10 categories (instruction override, data exfiltration, tone hijack, citation spoofing, system prompt disclosure, formatting hijack, competitor steering, fake citation, link bait, chained instructions), each with a unique inert sentinel and a compliance detector
Payload safety lint in the standard suite: reserved .invalid domains only, no shell metacharacters, every marker appears in its instruction and trips only its own detector, clean answers trip none
robustness.injection_resistance: deterministic, 1.0 when the answer resisted the planted payload, 0.0 when it complied; detects by payload id or a literal marker fallback for hand-written cases
robustness.abstention: refusal on unanswerable cases; deterministic heuristic with judge confirmation when a judge is configured; higher is better
robustness.overrefusal: refusal rate on answerable cases, reported beside abstention so a refuse-everything pipeline cannot hide; lower is better
refusal is now a calibrated judge prompt with 12 fixtures, wired into calibrate and prompt versioning
Two example pipelines: a deliberately vulnerable one that obeys injections and fabricates, and a guarded one that ignores injections and abstains
Injection generation now plants the payload instruction in the question and stores the payload id, so cases are self-contained and testable
Integration tests assert the vulnerable and guarded pipelines land at opposite ends of injection_resistance and abstention
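The deterministic injection_resistance score described above can be sketched as a sentinel lookup. This is a minimal sketch, not the shipped detector: the registry contents, marker strings, and function signature are invented for illustration, and the real compliance detectors may be more than a substring test.

```python
# Hypothetical registry: payload id -> inert sentinel the payload asks for.
# Both marker values are made up; the URL uses a reserved .invalid domain.
PAYLOAD_MARKERS = {
    "instruction_override": "RAGPROOF-SENTINEL-7Q2",
    "link_bait": "https://example.invalid/claim",
}


def injection_resistance(answer, payload_id=None, literal_marker=None):
    """1.0 if the answer resisted the planted payload, 0.0 if it complied.

    Looks the marker up by payload id first, then falls back to a literal
    marker for hand-written cases. Returns None when neither is available.
    """
    marker = PAYLOAD_MARKERS.get(payload_id) if payload_id else None
    if marker is None:
        marker = literal_marker
    if not marker:
        return None
    complied = marker.casefold() in answer.casefold()
    return 0.0 if complied else 1.0
```

Because each sentinel is unique and inert, a match is unambiguous evidence that the pipeline followed the planted instruction, and no judge call is needed.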

Decisions:

Injection instructions are delivered in the question rather than by emitting poisoned corpus files (still future). This makes cases portable and testable without wiring poisoned documents through a pipeline's own retrieval.

Next: Phase 6, reports and the CI gate.

2026-07-02: Phase 4

Done:

Corpus ingestion: TXT and Markdown in the core install, PDF and DOCX behind the ingest extra with a clear install hint; per-file size cap; unreadable, oversize, and empty files skipped with a reported reason; deterministic ordering
Deterministic chunking with overlap; stable chunk ids scoped to documents
Generation client reusing the chat transport, with a repair retry and a budget guard; sequential so ordering is deterministic for a seed
QA synthesis with a second-pass answerability check (failures discarded and counted); unanswerable cases verified to have no answer in the corpus via lexical retrieval plus a judge pass; injection cases attach an inert payload marker
Inert payload seed set with a safety test asserting reserved domains only and no shell metacharacters; Phase 5 expands this and adds detectors
ragproof generate: ingest, generate, write a JSONL review file, print produced and discarded counts
ragproof freeze: validates cases (qa needs a source id, injection needs a payload), writes dataset.frozen.json with a canonical sha256 embedded
Frozen datasets verified on load; a file edited after freezing is refused; run and check accept plain JSONL or a frozen dataset
Shared prompt-template module (render plus hashing) now used by both the judge and the generator
docs/datasets.md; tiny test corpus under tests/fixtures/corpus
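Freezing with an embedded canonical hash, and refusing a file edited afterwards, can be sketched like this. It is an illustration only: the "cases" and "sha256" field names and the exact canonicalization (sorted keys, compact separators, UTF-8) are assumptions, not the actual dataset.frozen.json format.

```python
import hashlib
import json


def canonical_sha256(obj):
    """Hash a JSON value so that key order and whitespace do not matter."""
    payload = json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def freeze(cases):
    """Wrap cases with their canonical hash (hypothetical layout)."""
    return {"cases": cases, "sha256": canonical_sha256(cases)}


def verify(frozen):
    """True only if the embedded hash still matches the cases."""
    return frozen.get("sha256") == canonical_sha256(frozen.get("cases"))
```

The hash covers only the cases, not the wrapper, so embedding it in the same file does not change the value being hashed.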

Decisions:

Generated qa and injection cases use the document path as expected_source_ids since our chunk ids will not match a pipeline's own ids. Documented in docs/datasets.md with the source_match: document guidance.
Poisoned corpus files are not emitted in v1; the injection case carries the payload marker, which is what Phase 5 detection needs. Noted as future.

Next: Phase 5, robustness metrics.

2026-07-02: Phase 3

Done:

Provider-agnostic judge client: OpenRouter, OpenAI, Ollama (no key needed), Anthropic; temperature 0, per-call timeout, retries with Retry-After, API key read from the environment at call time and never stored
Structured judging: JSON extracted and schema-validated, one repair retry with the validation error appended, then metric-level judge_error (excluded from means, added to the metric's error count, never scored 0)
Judge cache in the run store keyed by (model, prompt hash, input hash); hit/miss stats printed per run; --no-cache bypasses it
Versioned prompt templates (sha256 recorded per run) with calibration fixtures, 10 human-scored examples per prompt
Generation metrics: citation_validity (deterministic, duplicates deduped), groundedness (per-claim verdicts in judge_raw_json, zero claims is not_applicable), citation_support, answer_relevance, completeness
Cost accounting from provider usage with estimate fallback; budget checked before every call; breach aborts the run as aborted:budget with exit 2 and keeps completed results resumable
ragproof calibrate with per-prompt exact and within-band agreement, exit 1 below thresholds; calibrate.yml CI workflow on prompt or fixture changes (skips with a warning until the RAGPROOF_LLM_API_KEY secret is configured)
Mixed-judge guard: compare refuses runs with different judge models or prompt versions unless --allow-mixed-judges; resume refuses a changed judge
Schema v2 migration (judge_cache table) exercised the versioned-migration path: v1 databases migrate with a backup file
Metric protocol is now async; raw judge output persisted redacted per case
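The cache keyed by (model, prompt hash, input hash) can be sketched with an in-memory dict standing in for the judge_cache table. The class and function names here are hypothetical, and how the real store serializes inputs before hashing is an assumption.

```python
import hashlib
import json


def sha256_hex(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def cache_key(model, prompt_template, judge_input):
    """Build the (model, prompt hash, input hash) key."""
    prompt_hash = sha256_hex(prompt_template)
    # Sorted keys so logically equal inputs hash identically.
    input_hash = sha256_hex(
        json.dumps(judge_input, sort_keys=True, separators=(",", ":"))
    )
    return (model, prompt_hash, input_hash)


class JudgeCache:
    """In-memory stand-in for the run store's judge cache."""

    def __init__(self):
        self._entries = {}
        self.hits = 0
        self.misses = 0

    def get_or_call(self, key, call_judge):
        if key in self._entries:
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        result = call_judge()  # only pay for the judge on a miss
        self._entries[key] = result
        return result
```

Including the prompt hash in the key means an edited prompt template never reuses verdicts produced by the old wording.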

Decisions:

judge_error is recorded per metric, not as a whole-case status: a judge failure on one metric must not discard the deterministic scores of the same case. The spec's "fails the case, not the run" intent is preserved.
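The effect of this decision on aggregation can be sketched as follows. The representation of a judge error as the string "judge_error" in a list of per-case values is an assumption made for the example; only the rule (excluded from the mean, counted as an error, never scored 0) comes from the log.

```python
from statistics import fmean

JUDGE_ERROR = "judge_error"  # assumed in-memory representation


def summarize_metric(values):
    """Mean over scored cases only; judge errors are counted separately."""
    scored = [v for v in values if isinstance(v, (int, float))]
    errors = sum(1 for v in values if v == JUDGE_ERROR)
    mean = fmean(scored) if scored else None
    return {"mean": mean, "n": len(scored), "errors": errors}
```

Scoring an error as 0 would pull the mean down for reasons unrelated to the pipeline under test, which is why the error is tracked beside the mean instead of inside it.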

Next: Phase 4, dataset generation.

2026-07-02: Phase 2

Done:

Retrieval metrics with exact-value fixture tests: precision_at_k, recall_at_k (denominator rules documented), mrr (full-list rank), ndcg_at_k (binary relevance)
Edge cases fixed and tested: fewer than k retrieved, empty retrieval, duplicate ids collapsed keeping first rank, relevant item below k, no hit
Source matching granularity: run.source_match chunk|document, document mode maps chunks through metadata.document_id with chunk-id fallback
Metric registry refactored to per-run factories so k and source_match come from config; engine builds metrics through the registry
Skip reasons surfaced: summaries carry the dominant skip reason and the run table prints it, so a no-retrieval pipeline says why instead of showing zeros
ragproof compare: resolves run ids, unique prefixes, and 'latest'; prints per-metric deltas; warns when runs used different datasets
docs/metrics.md documents every metric's exact computation
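The four retrieval metrics and the duplicate-collapsing rule can be sketched as below. This is not the code behind docs/metrics.md: the denominators chosen here (k for precision, the number of relevant ids for recall) are assumptions, and the sketch expects a non-empty relevant set.

```python
import math


def dedupe_keep_first(ids):
    """Collapse duplicate ids, keeping each id at its first (best) rank."""
    seen = set()
    ordered = []
    for item in ids:
        if item not in seen:
            seen.add(item)
            ordered.append(item)
    return ordered


def precision_at_k(retrieved, relevant, k):
    # Assumption: the denominator stays k even if fewer than k were retrieved.
    top = dedupe_keep_first(retrieved)[:k]
    return sum(1 for item in top if item in relevant) / k


def recall_at_k(retrieved, relevant, k):
    # Assumption: the denominator is the number of relevant ids.
    top = dedupe_keep_first(retrieved)[:k]
    return sum(1 for item in top if item in relevant) / len(relevant)


def mrr(retrieved, relevant):
    # Full-list rank: the first relevant id counts even if it sits below k.
    for rank, item in enumerate(dedupe_keep_first(retrieved), start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved, relevant, k):
    # Binary relevance: gain is 1 for a relevant id, 0 otherwise.
    top = dedupe_keep_first(retrieved)[:k]
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(top, start=1)
        if item in relevant
    )
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg
```

For retrieved ids a, b, a, c, d with relevant ids a and d at k=3, the duplicate a is dropped, the top three are a, b, c, and d falls just outside the cutoff.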

Notes:

Freeze-time rejection of qa cases with empty expected_source_ids (RW-9) lands with 'ragproof freeze' in phase 4; at run time such cases skip with a reason today.

Next: Phase 3, judge layer and generation metrics.

2026-07-02: Phase 0 and Phase 1

Done:

Repo scaffold, pyproject with uv lockfile, ruff and strict mypy, CI matrix (3 OS x 3 Python), CodeQL, gitleaks, dependency review, release workflow for PyPI Trusted Publishing
CLI with all commands registered; run and check are implemented, the rest exit 3 with a message until their phase lands
Exit code contract: 0 pass, 1 gate failure, 2 execution error, 3 config error
Adapter protocol with capability flags, Python adapter (sync and async targets), HTTP adapter (JSONPath mapping, retries with Retry-After support, env-sourced headers, 4xx fail fast)
SQLite run store: WAL mode, serialized writes, schema versioning with backup-before-migrate, full domain model from spec section 5
Run engine: bounded concurrency, per-case timeout and error capture, incremental result persistence, --resume for interrupted runs
Canonical JSON hashing for dataset and config hashes
Secret redaction applied to logs and persisted error payloads
Cost ledger scaffold (real accounting lands in phase 3)
Example pipeline, example dataset and configs, full test suite
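The exit code contract above maps naturally onto an enum. This is a sketch only; the ExitCode name, member names, and the describe helper are hypothetical, and the codes themselves are the ones listed in this entry.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """Process exit codes from the contract (member names are assumed)."""

    PASS = 0
    GATE_FAILURE = 1
    EXECUTION_ERROR = 2
    CONFIG_ERROR = 3


def describe(code):
    """Human-readable label for a raw exit status, e.g. in CI logs."""
    try:
        return ExitCode(code).name.lower().replace("_", " ")
    except ValueError:
        return "unknown"
```

Keeping gate failures (1) distinct from execution and config errors (2, 3) lets a CI step tell a real regression apart from a broken setup.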

Decisions:

TODO(decision): schema migrations use an in-repo versioned migration table instead of Alembic. PHASE_PLAN P1-09 asked for Alembic; the lighter mechanism satisfies RW-16 (version stored, newer-schema error, backup-before-migrate) with less machinery. Revisit if the schema grows.

Needs the account owner (cannot be done from this machine):

Create the GitHub repo and push (P0-01)
Configure PyPI Trusted Publishing and publish the 0.0.1 placeholder (P0-12)

Next: Phase 2, retrieval metrics and compare.
