Companion to PHASE_PLAN.md. Work strictly in phase order; do not
start a phase until the previous phase's acceptance checks (bottom of each section)
all pass. RW-n references are the real-world fixes defined in PHASE_PLAN.md §2.
Status legend: [ ] todo, [x] done, [~] in progress, [!] blocked
- [~] P0-01 Create GitHub repo `ragproof` (name verified free 2026-07-02); MIT LICENSE; `git init` locally and push
- P0-02 Scaffold repo layout per spec §10 (package dirs, `tests/`, `examples/`, `docs/`, empty `__init__.py`s)
- P0-03 `pyproject.toml`: PEP 621 metadata, hatchling backend, single-sourced version, deps (typer, rich, pydantic v2, sqlalchemy 2.x, aiosqlite, httpx, tenacity, jinja2, jsonpath-ng, alembic), extras `ingest` (pypdf, python-docx) and `dev` (RW-12)
- P0-04 `uv` environment + committed `uv.lock`
- P0-05 Typer CLI stub: register `generate`, `freeze`, `run`, `compare`, `gate`, `report`, `calibrate`, `check`; each unimplemented command exits 3 with a clear message; `ragproof --version` works
- P0-06 `ExitCode` enum: 0 pass, 1 gate failure, 2 execution error, 3 config error (RW-4); document in README
- P0-07 ruff config (lint + format) and mypy strict config; both pass on the skeleton
- P0-08 GitHub Actions CI: `{ubuntu, windows, macos} × {3.11, 3.12, 3.13}` matrix - lint, typecheck, tests, coverage (RW-11)
- P0-09 Enable CodeQL, gitleaks secret scanning, dependency review workflows
- P0-10 `.env.example` with every env var from spec §11, commented
- P0-11 README stub (positioning line, install, exit-code contract) + `PROGRESS.md` created
- [~] P0-12 PyPI Trusted Publishing (OIDC) workflow; publish `0.0.1` placeholder to reserve the name
Acceptance: `uv run ragproof --help` lists all commands on Windows + Linux; CI green on full matrix; `pip install ragproof==0.0.1` works; tag `phase-0-complete`.
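The exit-code contract from P0-06 could be pinned down as a small enum. This is a minimal sketch: the member names and the `exit_code_for` helper (including its precedence order) are assumptions made for illustration; only the numeric values come from the task list.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """Process exit codes; numeric values follow P0-06, names are assumed."""

    PASS = 0             # run or gate succeeded
    GATE_FAILURE = 1     # a gate threshold was breached
    EXECUTION_ERROR = 2  # runtime failure such as an unreachable adapter
    CONFIG_ERROR = 3     # invalid config or environment


def exit_code_for(*, config_error: bool, execution_error: bool, gate_failed: bool) -> ExitCode:
    """Hypothetical precedence: config problems win, then execution, then gate."""
    if config_error:
        return ExitCode.CONFIG_ERROR
    if execution_error:
        return ExitCode.EXECUTION_ERROR
    if gate_failed:
        return ExitCode.GATE_FAILURE
    return ExitCode.PASS
```

Because `IntEnum` members are integers, a member can be handed directly to `sys.exit` or `typer.Exit(code=...)`.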
- P1-01 Pydantic I/O models: `RetrievedChunk`, `ChunkRef`, `RAGAnswer`
- P1-02 `RAGAdapter` protocol + capability flags `supports_retrieval` / `supports_answer` (RW-10)
- P1-03 Python adapter: import-path loading; accept sync and async user implementations (sync via thread offload)
- P1-04 HTTP adapter: JSONPath request/response mapping; auth from named env vars only (RW-18); per-call timeout
- P1-05 HTTP retry policy: tenacity exponential backoff + jitter; retry 429/5xx/timeouts only; honor `Retry-After`; other 4xx fail fast
- P1-06 `examples/minimal_python_adapter/` + `examples/http_adapter_config.yaml`
- P1-07 SQLAlchemy 2.x async models: `Project`, `Dataset`, `Case`, `Run`, `Result`, `MetricSummary` (spec §5) + `Run.status` (`running|completed|partial|aborted:*`) and `Result.status` (`ok|error|timeout|judge_error|skipped|not_applicable`) (RW-2, RW-9)
- P1-08 SQLite setup: WAL mode + busy timeout on connect; single-writer `asyncio.Queue` task - workers never write directly (RW-1)
- P1-09 Alembic migrations from the first table; `schema_version` recorded; newer-schema DB → clear error; older DB → auto-migrate with backup file (RW-16)
- P1-10 Canonical-JSON utility (sorted keys, UTF-8, fixed separators) + sha256 hashing helper; used for all config/dataset hashes (RW-8)
- P1-11 Run loop: dataset iteration, `asyncio.Semaphore(RAGPROOF_MAX_CONCURRENCY)`, per-case timeout, per-case error capture - errors recorded, never fatal, never scored 0 (RW-2)
- P1-12 Incremental result persistence + `ragproof run --resume <run_id>` skipping completed cases (RW-2)
- P1-13 Run manifest: config hash, dataset hash, seeds, package version, adapter label
- P1-14 Trivial `echo.exact_match` metric wired end to end
- P1-15 `CostLedger` scaffold (per-call entries; real accounting in P3)
- P1-16 `ragproof.yaml` loader: strict Pydantic validation, unknown-key rejection with "did you mean" suggestions (RW-15)
- P1-17 Env var layer per spec §11; referenced-but-unset vars named in errors
- P1-18 `ragproof check`: validate config + env, probe adapter with one live question, verify DB writability (RW-15)
- P1-19 Secret-redaction filter on logging and on persisted raw payloads (`*_API_KEY|*_TOKEN|*_SECRET` + bearer patterns) (RW-18)
- P1-20 Both adapters tested against mocked targets (mapping, retries, timeout, 4xx fail-fast, `Retry-After`)
- P1-21 Concurrency stress test: 32 cases, jittery mock adapter, zero SQLite lock errors (RW-1)
- P1-22 Resume test: kill run mid-way, `--resume` completes only remaining cases
- P1-23 Secret-leak test: planted API key never reaches logs or DB
- P1-24 Exit-code tests: adapter down → 2; bad config → 3
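For the canonical-JSON utility in P1-10, one plausible shape is shown below. It is a sketch under assumptions: the function names `canonical_json` and `content_hash` are invented here, and the compact separators and `allow_nan=False` are one reasonable reading of "fixed separators", not something the plan specifies.

```python
import hashlib
import json
from typing import Any


def canonical_json(value: Any) -> bytes:
    """Serialize to a byte string that is stable across key order and platforms."""
    text = json.dumps(
        value,
        sort_keys=True,         # key order never affects the hash
        ensure_ascii=False,     # emit real UTF-8 rather than \u escapes
        separators=(",", ":"),  # assumed fixed separators, no whitespace
        allow_nan=False,        # NaN/Infinity are not valid JSON; fail loudly
    )
    return text.encode("utf-8")


def content_hash(value: Any) -> str:
    """sha256 hex digest of the canonical form."""
    return hashlib.sha256(canonical_json(value)).hexdigest()
```

Two configs that differ only in key order then hash identically, which is what makes run manifests and dataset hashes comparable across machines.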
Acceptance: 5-case JSONL run vs example adapter persists results + metadata; second run comparable; all P1 tests green; tag `phase-1-complete`.
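The run loop described in P1-11 (bounded concurrency, per-case timeout, errors captured rather than raised) might look roughly like this. It is a sketch, not the engine: `run_cases`, `demo_adapter`, and the result dict layout are hypothetical, and persistence through the single-writer queue is left out.

```python
import asyncio
from typing import Awaitable, Callable


async def run_cases(
    questions: list[str],
    answer: Callable[[str], Awaitable[str]],
    max_concurrency: int = 4,
    case_timeout_s: float = 1.0,
) -> list[dict]:
    """Run every case; a failing case yields a status record, never an exception."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(question: str) -> dict:
        async with semaphore:  # at most max_concurrency adapter calls in flight
            try:
                output = await asyncio.wait_for(answer(question), timeout=case_timeout_s)
            except asyncio.TimeoutError:
                return {"question": question, "status": "timeout", "output": None}
            except Exception as exc:  # recorded, not fatal, and not scored as 0
                return {"question": question, "status": "error", "output": None, "error": repr(exc)}
            return {"question": question, "status": "ok", "output": output}

    # gather preserves input order, so results line up with the dataset.
    return list(await asyncio.gather(*(run_one(q) for q in questions)))


async def demo_adapter(question: str) -> str:
    """Hypothetical adapter used only to exercise the three outcomes."""
    if question == "boom":
        raise ValueError("adapter failure")
    if question == "slow":
        await asyncio.sleep(1.0)
    return question.upper()
```

Calling `asyncio.run(run_cases(["a", "boom", "slow"], demo_adapter, case_timeout_s=0.05))` yields one record per case with statuses `ok`, `error`, and `timeout` respectively.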
- P2-01 Metric registry: stable string names, declared requirements (`needs: expected_source_ids, retrieval`), skip-with-reason plumbing (RW-10)
- P2-02 `retrieval.precision_at_k` + `retrieval.recall_at_k` (k configurable, default 5)
- P2-03 `retrieval.mrr` (no hit → 0)
- P2-04 `retrieval.ndcg_at_k` (binary relevance; graded documented as future)
- P2-05 Edge-case semantics implemented + fixture-tested with exact values: fewer than k retrieved; empty retrieval; duplicate chunk IDs (dedupe keep-first-rank); empty expected set rejected at freeze (RW-9)
- P2-06 Chunk-ID vs document-ID matching granularity (config; spec §17.4)
- P2-07 `MetricSummary` aggregation: mean, p50, p95 + scored/skipped/error counts
- P2-08 `ragproof compare <run_a> <run_b>`: per-metric deltas; skipped shown as skipped, never 0.00
- P2-09 Graceful skip when adapter lacks `retrieve` or cases lack `expected_source_ids`, with reason surfaced (RW-10)
Acceptance: exact-value fixtures pass incl. all edge cases; no-retrieval adapter produces stated-skip output; tag `phase-2-complete`.
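To make the P2-02 to P2-05 semantics concrete, here is a sketch of the four retrieval metrics with binary relevance. Two choices are assumptions rather than anything the plan states: precision divides by k even when fewer than k chunks were retrieved, and an empty expected set raises `ValueError` as a stand-in for rejection at freeze. Function names are illustrative.

```python
import math
from typing import Collection, Sequence


def _ranked_unique(retrieved: Sequence[str]) -> list[str]:
    """Dedupe chunk IDs, keeping each ID at its first (best) rank."""
    seen: set[str] = set()
    unique: list[str] = []
    for chunk_id in retrieved:
        if chunk_id not in seen:
            seen.add(chunk_id)
            unique.append(chunk_id)
    return unique


def _require_expected(expected: Collection[str]) -> None:
    if not expected:
        raise ValueError("empty expected set should have been rejected at freeze")


def precision_at_k(retrieved: Sequence[str], expected: Collection[str], k: int = 5) -> float:
    _require_expected(expected)
    top = _ranked_unique(retrieved)[:k]
    return sum(1 for c in top if c in expected) / k  # assumed: denominator stays k


def recall_at_k(retrieved: Sequence[str], expected: Collection[str], k: int = 5) -> float:
    _require_expected(expected)
    top = _ranked_unique(retrieved)[:k]
    return sum(1 for c in top if c in expected) / len(set(expected))


def mrr(retrieved: Sequence[str], expected: Collection[str]) -> float:
    _require_expected(expected)
    for rank, chunk_id in enumerate(_ranked_unique(retrieved), start=1):
        if chunk_id in expected:
            return 1.0 / rank
    return 0.0  # no hit → 0


def ndcg_at_k(retrieved: Sequence[str], expected: Collection[str], k: int = 5) -> float:
    _require_expected(expected)
    top = _ranked_unique(retrieved)[:k]
    dcg = sum(1.0 / math.log2(rank + 1) for rank, c in enumerate(top, start=1) if c in expected)
    ideal_hits = min(k, len(set(expected)))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg
```

With `retrieved = ["a", "x", "a", "b"]`, `expected = {"a", "b"}` and `k = 3`, the duplicate `"a"` is dropped first, so `"b"` counts at rank 3: precision is 2/3, recall is 1.0, and MRR is 1.0.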
- P3-01 Provider-agnostic judge client: OpenRouter, Ollama, OpenAI, Anthropic; temperature 0; per-call timeout + retries
- P3-02 Structured JSON responses: Pydantic validation → one repair retry (validation error appended) → `judge_error` case status; never score 0, never drop silently (RW-7)
- P3-03 Raw judge output persisted verbatim post-redaction (spec §14, RW-18)
- P3-04 Judge cache: SQLite, key `(model, prompt_hash, canonical_input_hash)`; hit/miss stats per run; `--no-cache` flag (RW-3, RW-6)
- P3-05 Versioned prompt files in `judge/prompts/`; content hash recorded per run
- P3-06 Mixed-judge guard: `compare`/`gate` refuse by default across different judge models/prompt hashes; `--allow-mixed-judges` override; output always labeled (RW-14)
- P3-07 `generation.groundedness`: claim decomposition, per-claim verdicts in `judge_raw_json`; zero-claim answers → `not_applicable` (RW-9)
- P3-08 `generation.citation_validity` (deterministic; duplicate-ID semantics defined and tested)
- P3-09 `generation.citation_support` (judge)
- P3-10 `generation.answer_relevance` (judge)
- P3-11 `generation.completeness` (judge; skipped-with-reason when no `expected_answer`)
- P3-12 Cost accounting: provider-reported usage preferred, tokenizer estimate fallback; per-run cost in summary
- P3-13 Budget enforcement: check before each call vs `RAGPROOF_MAX_COST_USD`; graceful stop → status `aborted:budget`, partial results persisted, exit 2 (RW-6)
- P3-14 ≥10 human-scored calibration fixtures per judge prompt in `judge/fixtures/`
- P3-15 `ragproof calibrate`: exact + within-1-band agreement report; thresholds in config
- P3-16 CI job: run calibration when `judge/prompts/**` or `judge/fixtures/**` change; fail below threshold (spec §7.4)
- P3-17 All judge unit tests use recorded fixtures - no live LLM calls in CI
- P3-18 Planted good/bad answers on example corpus separate cleanly on groundedness
- P3-19 Malformed-judge-output path test: repair retry → `judge_error` → run completes, error visible in summary
- P3-20 Cache reproducibility test: unchanged re-run ≈ $0 and byte-identical judge-metric scores
- P3-21 Budget-breach test: mid-run stop, partial persisted, exit 2
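The budget rule in P3-13 (check before each call, stop gracefully, keep partial results) can be illustrated with a toy loop. This is a sketch under assumptions: `run_with_budget` and its return shape are hypothetical, per-case costs are supplied up front as estimates, and the real `CostLedger` would record provider-reported usage instead.

```python
def run_with_budget(case_costs_usd: list[float], max_cost_usd: float) -> tuple[str, list[int], float]:
    """Return (run status, indexes of completed cases, total spent)."""
    spent = 0.0
    completed: list[int] = []
    for index, estimated_cost in enumerate(case_costs_usd):
        # Check BEFORE the call so the cap is never exceeded by a started case.
        if spent + estimated_cost > max_cost_usd:
            return "aborted:budget", completed, spent  # partial results are kept
        spent += estimated_cost
        completed.append(index)
    return "completed", completed, spent
```

For three cases estimated at 0.40 USD each under a 1.00 USD cap, the first two complete and the third is never started, so the run ends as `aborted:budget` with two persisted results.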
Acceptance: all P3 tests green; calibration gate demonstrated on a deliberately bad prompt change; tag `phase-3-complete`.
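The agreement report behind `ragproof calibrate` (P3-15) reduces to two ratios over the human-scored fixtures. The sketch below assumes scores are integer bands and that the fixture lists are aligned by index; the function name `agreement` is illustrative.

```python
def agreement(human_bands: list[int], judge_bands: list[int]) -> tuple[float, float]:
    """Return (exact agreement, within-1-band agreement) as fractions of fixtures."""
    if not human_bands or len(human_bands) != len(judge_bands):
        raise ValueError("need equally sized, non-empty score lists")
    pairs = list(zip(human_bands, judge_bands))
    exact = sum(1 for h, j in pairs if h == j) / len(pairs)
    within_one = sum(1 for h, j in pairs if abs(h - j) <= 1) / len(pairs)
    return exact, within_one
```

With human bands `[1, 2, 3, 4, 5]` and judge bands `[1, 3, 3, 2, 5]`, three of five match exactly (0.6) and four of five fall within one band (0.8). A CI gate would compare both numbers against the configured thresholds.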
- P4-01 Corpus ingestion: TXT/MD in core; PDF/DOCX behind `ragproof[ingest]` with clear missing-extra hint (RW-12)
- P4-02 Ingestion safety: per-file size cap; extraction failures skipped with a report line, never silently (RW-9 spirit)
- P4-03 Chunk sampling with explicit recorded seed (deterministic sampling)
- P4-04 QA synthesis + second-pass answerability verification; discards counted and reported
- P4-05 Unanswerable synthesis, verified absent via retrieval + judge pass
- P4-06 Injection case generation: poisoned document variants from payload library, expected non-compliance markers registered
- P4-07 JSONL human-review file emission (editable before freeze)
- P4-08 `ragproof freeze`: `corpus_hash` + dataset hash via canonical JSON; generation metadata (models, seeds, prompt hashes, discard counts) embedded (RW-8)
- P4-09 Frozen-dataset integrity: hash verified on load; mutated file refused with clear message
- P4-10 Tiny test corpus committed under `tests/fixtures/`
- P4-11 Tests: same seed + corpus → identical sampling/ordering; corrupt PDF skipped gracefully; freeze/verify round-trip
Acceptance: generate on tiny corpus yields spot-checkable answerable QA cases with reported discard rate; immutability enforced; tag `phase-4-complete`.
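Deterministic chunk sampling (P4-03, tested in P4-11) depends on two things: a dedicated seeded generator and a stable input order. A minimal sketch, assuming chunk IDs are strings and using the hypothetical name `sample_chunks`:

```python
import random
from typing import Iterable


def sample_chunks(chunk_ids: Iterable[str], n: int, seed: int) -> list[str]:
    """Pick n chunk IDs reproducibly for a given seed and corpus."""
    # Sort first so filesystem or dict ordering cannot change the outcome.
    ordered = sorted(chunk_ids)
    # A private Random instance avoids interference from the global RNG state.
    rng = random.Random(seed)
    return rng.sample(ordered, n)
```

The same seed and corpus then give the same sample and ordering regardless of how the chunks were enumerated, and the seed can be recorded in the generation metadata.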
- P5-01 Payload library, 10+ types: instruction override, exfiltration URL (`*.invalid` only), tone hijack, citation spoofing, system-prompt fishing, formatting hijack, steering, fake-citation injection, link-bait, chained instructions
- P5-02 Per-payload deterministic compliance detectors (string/regex), each with positive and negative fixture tests
- P5-03 Payload safety lint test: inert markers only, `*.invalid` / `example.com` URLs only, no shell commands or real endpoints - runs in the standard suite (RW-17)
- P5-04 `robustness.injection_resistance` = 1 − compliance rate
- P5-05 `robustness.abstention` on unanswerable cases: refusal heuristic + judge confirmation
- P5-06 `robustness.overrefusal` on answerable cases; reported side by side with abstention (RW-5)
- P5-07 Fabrication-on-unanswerable weighted prominently in summaries (spec §7.3)
- P5-08 `examples/`: deliberately vulnerable pipeline + guarded pipeline
- P5-09 Integration test: vulnerable scores low / guarded scores high on injection resistance; always-refusing pipeline shows high abstention and high overrefusal
Acceptance: all detectors fixture-tested; safety lint in CI; example-pipeline contrast asserted; tag `phase-5-complete`.
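P5-02 and P5-04 combine into a simple computation: run each answer through the detector for its payload type, then take one minus the compliance rate. The sketch below is illustrative only; the detector table, the canary marker string, and the function names are invented for this example and are not the project's payload library.

```python
import re

# Hypothetical detectors keyed by payload type; markers are inert by design.
DETECTORS: dict[str, re.Pattern[str]] = {
    "instruction_override": re.compile(r"RAGPROOF-CANARY-OVERRIDE"),
    "exfiltration_url": re.compile(r"https?://[\w.-]+\.invalid\b", re.IGNORECASE),
}


def complied(payload_type: str, answer: str) -> bool:
    """True when the answer shows the payload's compliance marker."""
    return DETECTORS[payload_type].search(answer) is not None


def injection_resistance(cases: list[tuple[str, str]]) -> float:
    """cases are (payload_type, answer) pairs; result is 1 - compliance rate."""
    if not cases:
        raise ValueError("no injection cases to score")
    compliance_rate = sum(complied(t, a) for t, a in cases) / len(cases)
    return 1.0 - compliance_rate
```

If two of four injected cases produce a marker in the answer, the compliance rate is 0.5 and the resistance score is 0.5.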
- P6-01 HTML report: single self-contained file - overview, per-metric distributions, run comparison, worst-10 cases per metric (question/answer/context/judge reasoning), skip/error counts, cost, dataset/config/prompt hashes
- P6-02 Vendor Chart.js into the template with license header; zero network requests - automated check greps report for external resource loads (RW-13)
- P6-03 Markdown summary (PR-comment sized)
- P6-04 JUnit XML: one test per metric; execution errors → `<error>`, threshold breaches → `<failure>` (RW-4)
- P6-05 `ragproof gate`: absolute thresholds + relative-to-baseline deltas
- P6-06 Per-metric `noise_floor` config; bootstrap 95% CIs on judge-backed metrics; in-noise deltas warn instead of fail (RW-3)
- P6-07 `on_missing: fail|skip` behavior for skipped metrics (default fail) (RW-10)
- P6-08 Minimum-sample warning when n < 30 (RW-3)
- P6-09 Exit-code contract enforced end to end (0/1/2/3) with tests
- P6-10 `--json` output on `run`/`compare`/`gate`
- P6-11 `action.yml` reusable GitHub Action: install → run → gate → upload HTML artifact → sticky PR comment with Markdown summary
- P6-12 Dockerfile: multi-stage, slim, non-root user, pinned base digest
- P6-13 Dogfood: the Action runs in this repo's own CI against the example pipeline
- P6-14 Integration tests: in-noise delta passes gate; genuine regression fails it; HTML opens from disk offline
Acceptance: gate exits correctly in CI with native JUnit rendering; Action end-to-end green incl. PR comment; tag `phase-6-complete`.
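The noise-aware gating in P6-06 can be sketched as two small functions: a bootstrap confidence interval on a metric's mean, and a delta check that downgrades in-noise regressions to a warning. The plan only says "bootstrap 95% CIs", so the percentile method, the resample count, and the names `bootstrap_ci` and `gate_delta` are all assumptions here.

```python
import random
import statistics


def bootstrap_ci(
    scores: list[float], seed: int, n_resamples: int = 2000, level: float = 0.95
) -> tuple[float, float]:
    """Percentile-bootstrap CI for the mean (assumed method), seeded for reproducibility."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    low_index = int((1 - level) / 2 * n_resamples)
    high_index = int((1 + level) / 2 * n_resamples) - 1
    return means[low_index], means[high_index]


def gate_delta(baseline_mean: float, candidate_mean: float, noise_floor: float) -> str:
    """Return 'pass', 'warn' (drop within the noise floor), or 'fail'."""
    delta = candidate_mean - baseline_mean
    if delta >= 0:
        return "pass"
    if abs(delta) <= noise_floor:
        return "warn"
    return "fail"
```

With a noise floor of 0.02, a drop from 0.80 to 0.79 only warns, while a drop to 0.70 fails the gate.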
- P7-01 DOC-007-AI adapter (native citation mapping)
- P7-02 Legate Agent adapter (`kb.search` + RAG answer endpoints)
- P7-03 Run both case studies; fix ≥1 real issue each that the scores expose; record before/after numbers
- P7-04 Publish before/after numbers in DOC-007-AI and Legate READMEs
- P7-05 `docs/metrics.md`: exact computation of every metric incl. all edge-case semantics (RW-9)
- P7-06 `docs/quickstart.md`, `docs/adapters.md`, `docs/ci.md`
- P7-07 Quickstart verified verbatim on clean Windows and Linux machines
- P7-08 PyPI `1.0.0` release via Trusted Publishing
- P7-09 Demo GIF: degrading PR → red gate → fix → green gate
- P7-10 README final: leads with GIF + case-study numbers (spec §18)
- P7-11 Launch post draft
Acceptance: both case studies show real numbers + ≥1 real improvement; `pip install ragproof` → 1.0.0; tag `v1.0.0`.
Phase 8 - Web UI (design: UI_PLAN.md)
- P8-01 `frontend/` workspace: Vite, React 18, TypeScript strict, Tailwind v4, shadcn/ui; ESLint + Prettier aligned with repo style
- P8-02 Design tokens from UI_PLAN §5 (color, type, spacing) with dark mode as a first-class theme; self-hosted Inter + JetBrains Mono
- P8-03 FastAPI server in `ragproof/ui/` behind the `ragproof[ui]` extra; binds 127.0.0.1 by default, warning on `--host` override
- P8-04 `ragproof ui` command: starts server, opens browser, `--dev` proxies Vite; clear install hint when the extra is missing
- P8-05 Build pipeline: CI builds the bundle into `ragproof/ui/static/`, wheel ships it, Python jobs never need Node
- P8-06 Bundle scan test: no external URLs in built assets; CSP `default-src 'self'`
- P8-07 UI CI job: typecheck, lint, Vitest, build
- P8-08 Read API: `/api/meta`, `/api/projects`, paginated `/api/runs`, `/api/runs/{ref}` reusing `reports/data.py`
- P8-09 Runs table: status dot, label, relative time, case counts, pinned ScoreCells with micro-distributions, delta vs selected baseline
- P8-10 Column picker persisted per project; two-row select enables Compare
- P8-11 Run detail Overview: header chips (judge, hashes, cost, cache), metric cards with histograms and threshold lines, worst-cases strip
- P8-12 Metadata tab; polling for running runs
- P8-13 Loading, empty (teaches the CLI command), error, and partial states on every screen
- P8-14 Consistency test: `/api/runs/{ref}` equals CLI `--json` on the same store
- P8-15 `/api/runs/{ref}/cases` with filters, sort, cursor pagination; `/cases/{key}` detail
- P8-16 Virtualized cases grid with metric columns, worst-first sort, status/kind filters
- P8-17 Routed case side panel: question, answer, retrieved chunks with cited highlighted, per-claim verdict checklist from judge_raw, raw JSON CodeBlock
- P8-18 Keyboard triage loop: j/k, Esc, deep links restore filter + selection
- P8-19 `/api/compare` + Compare screen: delta table with CI whiskers, mixed-judge blocking banner, dataset mismatch warning
- P8-20 Changed-cases diff grid, worst regression first; split case view baseline vs candidate
- P8-21 `/api/trends` + Trends screen: mean per run, 0-1 domain, threshold lines, verdict-colored points, click-through to runs
- P8-22 Gate tab rendering GateOutcome via `/api/runs/{ref}/gate`; identical verdicts to the CLI asserted in CI
- P8-23 Datasets list and detail (generation metadata, case browser, runs over dataset)
- P8-24 Calibration screen with agreement bars vs thresholds
- P8-25 Command palette (Ctrl/Cmd-K): jump to run, case, screen
- P8-26 A11y pass: AA contrast both themes, focus order, reduced motion, no color-only status
- P8-27 Performance pass against a 1,000-run seeded store; bundle under 300KB gzipped, route code-splitting
- P8-28 Playwright smoke in CI: boot `ragproof ui`, walk Runs -> Run -> Case -> Compare
- P8-29 `docs/ui.md`; README screenshots and GIF
- P8-30 Ship: wheel includes bundle, `ragproof[ui]` extra documented, tag `phase-8-complete`
Acceptance: `ragproof ui` gives a keyboard-operable, dark-mode-native dashboard whose every number matches the CLI on the same store; quality bar in UI_PLAN §8 met.
- P8-31 Background jobs subsystem: async JobManager with redacted log capture, terminal status, eviction, drain on shutdown
- P8-32 Actions API reusing engine/generate/freeze/calibrate/check/report as jobs; jobs and config endpoints; report artifacts with path-traversal guard
- P8-33 Frontend: New run + actions menu, Jobs screen with live logs, Config viewer, per-run Re-run and Report, palette actions, sidebar entries
- P8-34 9 action/jobs API tests; verified live in the browser (run, report download, traversal block, jobs render)
- P8-35 docs/ui.md control-panel section; security posture documented (localhost, view-only config)
Acceptance: the dashboard starts runs, generates datasets, calibrates, checks, and builds downloadable reports as background jobs with live logs, all reusing the CLI code paths.
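The path-traversal guard on report artifacts (P8-32, exercised in P8-34) is commonly implemented by resolving the requested path and checking that it stays under the artifact root. The sketch below shows that general technique; `resolve_artifact` and the choice of `PermissionError` are assumptions, not the project's actual API.

```python
from pathlib import Path


def resolve_artifact(root: Path, requested: str) -> Path:
    """Map a client-supplied relative path to a file under root, or refuse."""
    base = root.resolve()
    # resolve() collapses ".." segments and follows symlinks before the check.
    candidate = (base / requested).resolve()
    if not candidate.is_relative_to(base):  # Python 3.9+
        raise PermissionError(f"path escapes artifact root: {requested!r}")
    return candidate
```

Both `../secrets.txt` and an absolute path such as `/etc/passwd` resolve outside the root and are rejected, while `run1/report.html` maps to a file inside it.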
- Conventional commits; small reviewable diffs
- Coverage ≥ 85% on `metrics/`, `engine.py`, `judge/`
- `docs/metrics.md` updated in the same PR as any scoring change
- `.env.example` updated in the same PR as any new env var
- `PROGRESS.md` updated at each phase end
- All randomness via seeded `random.Random`, seeds recorded
- End-of-phase tag `phase-N-complete`