RAGProof - Task List

Companion to PHASE_PLAN.md. Work strictly in phase order; do not start a phase until the previous phase's acceptance checks (bottom of each section) all pass. RW-n references are the real-world fixes defined in PHASE_PLAN.md §2.

Status legend: [ ] todo, [x] done, [~] in progress, [!] blocked


Phase 0 - Foundation

  • [~] P0-01 Create GitHub repo ragproof (name verified free 2026-07-02); MIT LICENSE; git init locally and push
  • P0-02 Scaffold repo layout per spec §10 (package dirs, tests/, examples/, docs/, empty __init__.py files)
  • P0-03 pyproject.toml: PEP 621 metadata, hatchling backend, single-sourced version, deps (typer, rich, pydantic v2, sqlalchemy 2.x, aiosqlite, httpx, tenacity, jinja2, jsonpath-ng, alembic), extras ingest (pypdf, python-docx) and dev (RW-12)
  • P0-04 uv environment + committed uv.lock
  • P0-05 Typer CLI stub: register generate, freeze, run, compare, gate, report, calibrate, check; each unimplemented command exits 3 with a clear message; ragproof --version works
  • P0-06 ExitCode enum: 0 pass, 1 gate failure, 2 execution error, 3 config error (RW-4); document in README
  • P0-07 ruff config (lint + format) and mypy strict config; both pass on the skeleton
  • P0-08 GitHub Actions CI: {ubuntu, windows, macos} × {3.11, 3.12, 3.13} matrix - lint, typecheck, tests, coverage (RW-11)
  • P0-09 Enable CodeQL, gitleaks secret scanning, dependency review workflows
  • P0-10 .env.example with every env var from spec §11, commented
  • P0-11 README stub (positioning line, install, exit-code contract) + PROGRESS.md created
  • [~] P0-12 PyPI Trusted Publishing (OIDC) workflow; publish 0.0.1 placeholder to reserve the name

Acceptance: uv run ragproof --help lists all commands on Windows + Linux; CI green on full matrix; pip install ragproof==0.0.1 works; tag phase-0-complete.
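
The exit-code contract from P0-05 and P0-06 could be captured in a small enum. This is a minimal sketch: the member names and the not_implemented helper are illustrative assumptions, and only the numeric values come from the task list.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """Process exit codes; the numeric values follow P0-06, the names are assumed."""

    PASS = 0             # run or gate succeeded
    GATE_FAILURE = 1     # a gate threshold was breached
    EXECUTION_ERROR = 2  # runtime failure, e.g. adapter down
    CONFIG_ERROR = 3     # invalid configuration


def not_implemented(command: str) -> int:
    """Hypothetical helper: the exit path for an unimplemented CLI stub (P0-05)."""
    print(f"ragproof {command}: not implemented yet")
    return int(ExitCode.CONFIG_ERROR)
```

A CLI stub would pass the returned value to its exit call, so CI can tell a gate failure (1) apart from a broken setup (2 or 3).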


Phase 1 - Adapter layer, run store, run engine

Adapters

  • P1-01 Pydantic I/O models: RetrievedChunk, ChunkRef, RAGAnswer
  • P1-02 RAGAdapter protocol + capability flags supports_retrieval / supports_answer (RW-10)
  • P1-03 Python adapter: import-path loading; accept sync and async user implementations (sync via thread offload)
  • P1-04 HTTP adapter: JSONPath request/response mapping; auth from named env vars only (RW-18); per-call timeout
  • P1-05 HTTP retry policy: tenacity exponential backoff + jitter; retry 429/5xx/timeouts only; honor Retry-After; 4xx fail fast
  • P1-06 examples/minimal_python_adapter/ + examples/http_adapter_config.yaml
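
The retry rules in P1-05 can be sketched without any HTTP library. The function below is a hypothetical stand-in for what a tenacity wait/stop policy would decide; the name retry_delay, the base and cap values, and the choice to handle only the integer-seconds form of Retry-After are all assumptions.

```python
import random
from typing import Optional

# Assumption: "5xx" means every status from 500 to 599.
RETRYABLE_STATUSES = {429} | set(range(500, 600))


def retry_delay(
    status: Optional[int],
    attempt: int,
    rng: random.Random,
    retry_after: Optional[str] = None,
    base_s: float = 0.5,
    cap_s: float = 30.0,
) -> Optional[float]:
    """Return seconds to wait before retrying a failed call, or None to fail fast.

    status=None stands for a timeout; attempt is 0 for the first retry.
    """
    if status is not None and status not in RETRYABLE_STATUSES:
        return None  # e.g. other 4xx responses: fail fast
    if retry_after is not None and retry_after.strip().isdigit():
        return float(retry_after.strip())  # honor the server's Retry-After
    # Exponential backoff with full jitter, capped at cap_s.
    return rng.uniform(0.0, min(cap_s, base_s * 2**attempt))
```

Passing the random.Random instance in explicitly keeps the jitter reproducible in tests.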

Run store

  • P1-07 SQLAlchemy 2.x async models: Project, Dataset, Case, Run, Result, MetricSummary (spec §5) + Run.status (running|completed|partial|aborted:*) and Result.status (ok|error|timeout|judge_error|skipped|not_applicable) (RW-2, RW-9)
  • P1-08 SQLite setup: WAL mode + busy timeout on connect; single-writer asyncio.Queue task - workers never write directly (RW-1)
  • P1-09 Alembic migrations from the first table; schema_version recorded; newer-schema DB → clear error; older DB → auto-migrate with backup file (RW-16)
  • P1-10 Canonical-JSON utility (sorted keys, UTF-8, fixed separators) + sha256 hashing helper; used for all config/dataset hashes (RW-8)
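
As an illustration of P1-10, a canonical-JSON hash can be built from the standard library alone. The function names are assumptions, and the compact separators are one possible reading of "fixed separators".

```python
import hashlib
import json
from typing import Any


def canonical_json(value: Any) -> bytes:
    """Serialize with sorted keys, fixed separators, and UTF-8 encoding."""
    text = json.dumps(
        value,
        sort_keys=True,
        ensure_ascii=False,
        separators=(",", ":"),
        allow_nan=False,  # NaN/Infinity have no canonical JSON form
    )
    return text.encode("utf-8")


def sha256_hex(value: Any) -> str:
    """Hash the canonical form so key order never changes the digest."""
    return hashlib.sha256(canonical_json(value)).hexdigest()
```

Two configs that differ only in key order then produce the same hash, which is what makes config and dataset hashes comparable across runs.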

Engine

  • P1-11 Run loop: dataset iteration, asyncio.Semaphore(RAGPROOF_MAX_CONCURRENCY), per-case timeout, per-case error capture - errors recorded, never fatal, never scored 0 (RW-2)
  • P1-12 Incremental result persistence + ragproof run --resume <run_id> skipping completed cases (RW-2)
  • P1-13 Run manifest: config hash, dataset hash, seeds, package version, adapter label
  • P1-14 Trivial echo.exact_match metric wired end to end
  • P1-15 CostLedger scaffold (per-call entries; real accounting in P3)
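
The run loop in P1-11 might look roughly like the sketch below: a semaphore bounds concurrency, each case gets its own timeout, and failures become recorded statuses instead of exceptions or zero scores. CaseResult, run_cases, and answer_fn are hypothetical names; persistence through the single-writer queue (P1-08) is left out.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


@dataclass
class CaseResult:
    case_id: str
    status: str  # "ok" | "error" | "timeout"
    answer: Optional[str] = None
    error: Optional[str] = None


async def run_cases(
    cases: list[tuple[str, str]],  # (case_id, question) pairs
    answer_fn: Callable[[str], Awaitable[str]],
    max_concurrency: int,
    timeout_s: float,
) -> list[CaseResult]:
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(case_id: str, question: str) -> CaseResult:
        async with semaphore:
            try:
                answer = await asyncio.wait_for(answer_fn(question), timeout=timeout_s)
                return CaseResult(case_id, "ok", answer=answer)
            except asyncio.TimeoutError:
                return CaseResult(case_id, "timeout")
            except Exception as exc:  # recorded, never fatal, never scored 0
                return CaseResult(case_id, "error", error=repr(exc))

    # gather preserves input order, so results line up with cases.
    return list(await asyncio.gather(*(run_one(c, q) for c, q in cases)))
```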

Config, checks, security

  • P1-16 ragproof.yaml loader: strict Pydantic validation, unknown-key rejection with "did you mean" suggestions (RW-15)
  • P1-17 Env var layer per spec §11; referenced-but-unset vars named in errors
  • P1-18 ragproof check: validate config + env, probe adapter with one live question, verify DB writability (RW-15)
  • P1-19 Secret-redaction filter on logging and on persisted raw payloads (*_API_KEY|*_TOKEN|*_SECRET + bearer patterns) (RW-18)
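
One way to approach the redaction filter in P1-19 is to collect the values of secret-looking env vars and scrub them, plus bearer tokens, from every log record. The helper names and the exact bearer regex below are assumptions, not the project's actual patterns.

```python
import logging
import re

_SECRET_NAME = re.compile(r".*(_API_KEY|_TOKEN|_SECRET)$")
_BEARER = re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/=-]+")


def collect_secrets(env: dict[str, str]) -> list[str]:
    """Values of env vars whose names look secret; longest first to avoid partial hits."""
    values = [v for k, v in env.items() if v and _SECRET_NAME.match(k)]
    return sorted(values, key=len, reverse=True)


def redact(text: str, secrets: list[str]) -> str:
    for secret in secrets:
        text = text.replace(secret, "[REDACTED]")
    return _BEARER.sub("Bearer [REDACTED]", text)


class RedactionFilter(logging.Filter):
    """Rewrites each record's message in place before any handler formats it."""

    def __init__(self, secrets: list[str]) -> None:
        super().__init__()
        self._secrets = secrets

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = redact(record.getMessage(), self._secrets)
        record.args = None  # message is already fully formatted
        return True
```

The same redact function could be applied to raw payloads before they are persisted, so logs and the run store share one set of patterns.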

Tests

  • P1-20 Both adapters tested against mocked targets (mapping, retries, timeout, 4xx fail-fast, Retry-After)
  • P1-21 Concurrency stress test: 32 cases, jittery mock adapter, zero SQLite lock errors (RW-1)
  • P1-22 Resume test: kill run mid-way, --resume completes only remaining cases
  • P1-23 Secret-leak test: planted API key never reaches logs or DB
  • P1-24 Exit-code tests: adapter down → 2; bad config → 3

Acceptance: 5-case JSONL run vs example adapter persists results + metadata; second run comparable; all P1 tests green; tag phase-1-complete.


Phase 2 - Retrieval metrics + compare

  • P2-01 Metric registry: stable string names, declared requirements (needs: expected_source_ids, retrieval), skip-with-reason plumbing (RW-10)
  • P2-02 retrieval.precision_at_k + retrieval.recall_at_k (k configurable, default 5)
  • P2-03 retrieval.mrr (no hit → 0)
  • P2-04 retrieval.ndcg_at_k (binary relevance; graded documented as future)
  • P2-05 Edge-case semantics implemented + fixture-tested with exact values: <k retrieved; empty retrieval; duplicate chunk IDs (dedupe keep-first-rank); empty expected set rejected at freeze (RW-9)
  • P2-06 Chunk-ID vs document-ID matching granularity (config; spec §17.4)
  • P2-07 MetricSummary aggregation: mean, p50, p95 + scored/skipped/error counts
  • P2-08 ragproof compare <run_a> <run_b>: per-metric deltas; skipped shown as skipped, never 0.00
  • P2-09 Graceful skip when adapter lacks retrieve or cases lack expected_source_ids, with reason surfaced (RW-10)

Acceptance: exact-value fixtures pass incl. all edge cases; no-retrieval adapter produces stated-skip output; tag phase-2-complete.
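
For orientation, the four retrieval metrics with keep-first-rank deduplication could be sketched as follows. The authoritative edge-case semantics live in the spec and docs/metrics.md; in particular, dividing precision by k even when fewer than k chunks are retrieved is an assumption made here.

```python
import math
from typing import Iterable


def dedupe_keep_first(ids: Iterable[str]) -> list[str]:
    """Drop repeated IDs, keeping each one at its first (best) rank."""
    seen: set[str] = set()
    out: list[str] = []
    for item in ids:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out


def precision_at_k(retrieved: list[str], expected: set[str], k: int = 5) -> float:
    top = dedupe_keep_first(retrieved)[:k]
    return sum(1 for i in top if i in expected) / k  # assumption: denominator is k


def recall_at_k(retrieved: list[str], expected: set[str], k: int = 5) -> float:
    if not expected:
        raise ValueError("empty expected set should have been rejected at freeze")
    top = dedupe_keep_first(retrieved)[:k]
    return sum(1 for i in top if i in expected) / len(expected)


def mrr(retrieved: list[str], expected: set[str]) -> float:
    for rank, item in enumerate(dedupe_keep_first(retrieved), start=1):
        if item in expected:
            return 1.0 / rank
    return 0.0  # no hit


def ndcg_at_k(retrieved: list[str], expected: set[str], k: int = 5) -> float:
    """Binary-relevance nDCG: gain 1 for an expected ID, 0 otherwise."""
    top = dedupe_keep_first(retrieved)[:k]
    dcg = sum(1.0 / math.log2(r + 1) for r, i in enumerate(top, start=1) if i in expected)
    ideal = sum(1.0 / math.log2(r + 1) for r in range(1, min(k, len(expected)) + 1))
    return dcg / ideal if ideal else 0.0
```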


Phase 3 - Judge layer + generation metrics

Judge client

  • P3-01 Provider-agnostic judge client: OpenRouter, Ollama, OpenAI, Anthropic; temperature 0; per-call timeout + retries
  • P3-02 Structured JSON responses: Pydantic validation → one repair retry (validation error appended) → judge_error case status; never score 0, never drop silently (RW-7)
  • P3-03 Raw judge output persisted verbatim post-redaction (spec §14, RW-18)
  • P3-04 Judge cache: SQLite, key (model, prompt_hash, canonical_input_hash); hit/miss stats per run; --no-cache flag (RW-3, RW-6)
  • P3-05 Versioned prompt files in judge/prompts/; content hash recorded per run
  • P3-06 Mixed-judge guard: compare/gate refuse by default across different judge models/prompt hashes; --allow-mixed-judges override; output always labeled (RW-14)
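
The cache key in P3-04 could be derived as below. This sketch keeps entries in an in-memory dict purely for illustration, whereas the task calls for SQLite; JudgeCache, cache_key, and get_or_call are hypothetical names.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Any, Callable


def _sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def cache_key(model: str, prompt_text: str, judge_input: dict[str, Any]) -> tuple[str, str, str]:
    """Build the (model, prompt_hash, canonical_input_hash) key."""
    canonical = json.dumps(
        judge_input, sort_keys=True, ensure_ascii=False, separators=(",", ":")
    ).encode("utf-8")
    return (model, _sha256(prompt_text.encode("utf-8")), _sha256(canonical))


@dataclass
class JudgeCache:
    entries: dict[tuple[str, str, str], Any] = field(default_factory=dict)
    hits: int = 0
    misses: int = 0

    def get_or_call(self, key: tuple[str, str, str], call: Callable[[], Any]) -> Any:
        """Return the cached verdict, or invoke the judge once and store the result."""
        if key in self.entries:
            self.hits += 1
            return self.entries[key]
        self.misses += 1
        value = call()
        self.entries[key] = value
        return value
```

Because the prompt hash is part of the key, editing a prompt file invalidates old verdicts automatically.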

Generation metrics

  • P3-07 generation.groundedness: claim decomposition, per-claim verdicts in judge_raw_json; zero-claim answers → not_applicable (RW-9)
  • P3-08 generation.citation_validity (deterministic; duplicate-ID semantics defined and tested)
  • P3-09 generation.citation_support (judge)
  • P3-10 generation.answer_relevance (judge)
  • P3-11 generation.completeness (judge; skipped-with-reason when no expected_answer)

Cost

  • P3-12 Cost accounting: provider-reported usage preferred, tokenizer estimate fallback; per-run cost in summary
  • P3-13 Budget enforcement: check before each call vs RAGPROOF_MAX_COST_USD; graceful stop → status aborted:budget, partial results persisted, exit 2 (RW-6)
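
A pre-call budget check along the lines of P3-13 might be structured like this. Whether the check counts the upcoming call's estimated cost is not stated in the task list, so that detail, along with the class and method names, is an assumption.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Signals a graceful stop; the caller would mark the run aborted:budget."""


@dataclass
class CostLedger:
    max_cost_usd: float  # e.g. parsed from RAGPROOF_MAX_COST_USD
    spent_usd: float = 0.0

    def check(self, estimated_cost_usd: float) -> None:
        """Call before each judge request; raises instead of overspending."""
        if self.spent_usd + estimated_cost_usd > self.max_cost_usd:
            raise BudgetExceeded(
                f"spent {self.spent_usd:.4f} + est {estimated_cost_usd:.4f} "
                f"> limit {self.max_cost_usd:.4f} USD"
            )

    def record(self, actual_cost_usd: float) -> None:
        """Call after each request with provider-reported or estimated cost."""
        self.spent_usd += actual_cost_usd
```

The engine would catch BudgetExceeded, persist what it has, and exit with code 2.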

Calibration

  • P3-14 ≥10 human-scored calibration fixtures per judge prompt in judge/fixtures/
  • P3-15 ragproof calibrate: exact + within-1-band agreement report; thresholds in config
  • P3-16 CI job: run calibration when judge/prompts/** or judge/fixtures/** change; fail below threshold (spec §7.4)
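
The two agreement figures in P3-15 reduce to simple counts once human and judge scores are mapped to integer bands. The band encoding and the function name are assumptions for illustration.

```python
def agreement(human: list[int], judge: list[int]) -> tuple[float, float]:
    """Return (exact agreement, within-1-band agreement) as fractions of fixtures."""
    if not human or len(human) != len(judge):
        raise ValueError("need equal-length, non-empty score lists")
    n = len(human)
    exact = sum(1 for h, j in zip(human, judge) if h == j) / n
    within_one = sum(1 for h, j in zip(human, judge) if abs(h - j) <= 1) / n
    return exact, within_one
```

For example, human bands [1, 2, 3, 4] against judge bands [1, 3, 5, 4] give 0.5 exact and 0.75 within one band; the CI job in P3-16 would compare both figures to the configured thresholds.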

Tests

  • P3-17 All judge unit tests use recorded fixtures - no live LLM calls in CI
  • P3-18 Planted good/bad answers on example corpus separate cleanly on groundedness
  • P3-19 Malformed-judge-output path test: repair retry → judge_error → run completes, error visible in summary
  • P3-20 Cache reproducibility test: unchanged re-run ≈ $0 and byte-identical judge-metric scores
  • P3-21 Budget-breach test: mid-run stop, partial persisted, exit 2

Acceptance: all P3 tests green; calibration gate demonstrated on a deliberately bad prompt change; tag phase-3-complete.


Phase 4 - Dataset generation

  • P4-01 Corpus ingestion: TXT/MD in core; PDF/DOCX behind ragproof[ingest] with clear missing-extra hint (RW-12)
  • P4-02 Ingestion safety: per-file size cap; extraction failures skipped with a report line, never silently (RW-9 spirit)
  • P4-03 Chunk sampling with explicit recorded seed (deterministic sampling)
  • P4-04 QA synthesis + second-pass answerability verification; discards counted and reported
  • P4-05 Unanswerable synthesis, verified absent via retrieval + judge pass
  • P4-06 Injection case generation: poisoned document variants from payload library, expected non-compliance markers registered
  • P4-07 JSONL human-review file emission (editable before freeze)
  • P4-08 ragproof freeze: corpus_hash + dataset hash via canonical JSON; generation metadata (models, seeds, prompt hashes, discard counts) embedded (RW-8)
  • P4-09 Frozen-dataset integrity: hash verified on load; mutated file refused with clear message
  • P4-10 Tiny test corpus committed under tests/fixtures/
  • P4-11 Tests: same seed + corpus → identical sampling/ordering; corrupt PDF skipped gracefully; freeze/verify round-trip

Acceptance: generate on tiny corpus yields spot-checkable answerable QA cases with reported discard rate; immutability enforced; tag phase-4-complete.
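
Deterministic sampling (P4-03, P4-11) can be as small as the sketch below. Sorting the IDs before sampling is an added assumption so that filesystem listing order cannot change the result; sample_chunks is a hypothetical name.

```python
import random


def sample_chunks(chunk_ids: list[str], n: int, seed: int) -> list[str]:
    """Pick n chunk IDs reproducibly from a recorded seed."""
    rng = random.Random(seed)  # isolated generator; never the global random module
    return rng.sample(sorted(chunk_ids), n)
```

The seed would be stored in the generation metadata at freeze time so the same corpus can be resampled identically later.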


Phase 5 - Robustness metrics

  • P5-01 Payload library, 10+ types: instruction override, exfiltration URL (*.invalid only), tone hijack, citation spoofing, system-prompt fishing, formatting hijack, steering, fake-citation injection, link-bait, chained instructions
  • P5-02 Per-payload deterministic compliance detectors (string/regex), each with positive and negative fixture tests
  • P5-03 Payload safety lint test: inert markers only, *.invalid/example.com URLs only, no shell commands or real endpoints - runs in the standard suite (RW-17)
  • P5-04 robustness.injection_resistance = 1 − compliance rate
  • P5-05 robustness.abstention on unanswerable cases: refusal heuristic + judge confirmation
  • P5-06 robustness.overrefusal on answerable cases; reported side by side with abstention (RW-5)
  • P5-07 Fabrication-on-unanswerable weighted prominently in summaries (spec §7.3)
  • P5-08 examples/: deliberately vulnerable pipeline + guarded pipeline
  • P5-09 Integration test: vulnerable scores low / guarded scores high on injection resistance; always-refusing pipeline shows high abstention and high overrefusal

Acceptance: all detectors fixture-tested; safety lint in CI; example-pipeline contrast asserted; tag phase-5-complete.
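
The formula in P5-04 combined with a regex detector from P5-02 might be computed as follows. The detector pattern here is an illustrative marker, not one from the payload library.

```python
import re


def injection_resistance(answers: list[str], detector: re.Pattern[str]) -> float:
    """1 - compliance rate, where compliance means the detector matched the answer."""
    if not answers:
        raise ValueError("no injection cases to score")
    complied = sum(1 for answer in answers if detector.search(answer))
    return 1.0 - complied / len(answers)


# Hypothetical detector: the pipeline echoed an exfiltration URL on the inert .invalid TLD.
EXFIL_MARKER = re.compile(r"https?://\S+\.invalid\b")
```

With four injection cases of which one answer contains the marker URL, the score is 0.75.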


Phase 6 - Reports, CI gate, distribution

Reports

  • P6-01 HTML report: single self-contained file - overview, per-metric distributions, run comparison, worst-10 cases per metric (question/answer/context/judge reasoning), skip/error counts, cost, dataset/config/prompt hashes
  • P6-02 Vendor Chart.js into the template with license header; zero network requests - automated check greps report for external resource loads (RW-13)
  • P6-03 Markdown summary (PR-comment sized)
  • P6-04 JUnit XML: one test per metric; execution errors → <error>, threshold breaches → <failure> (RW-4)

Gate

  • P6-05 ragproof gate: absolute thresholds + relative-to-baseline deltas
  • P6-06 Per-metric noise_floor config; bootstrap 95% CIs on judge-backed metrics; in-noise deltas warn instead of fail (RW-3)
  • P6-07 on_missing: fail|skip behavior for skipped metrics (default fail) (RW-10)
  • P6-08 Minimum-sample warning when n < 30 (RW-3)
  • P6-09 Exit-code contract enforced end to end (0/1/2/3) with tests
  • P6-10 --json output on run/compare/gate
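
To make P6-06 concrete, here is a rough sketch of a percentile bootstrap CI and a noise-floor decision. How the gate combines the CI with the noise floor is not specified in this list, so gate_delta considers only the noise floor; all names and the resample count are assumptions.

```python
import random
from statistics import fmean


def bootstrap_ci(
    scores: list[float], seed: int, n_resamples: int = 1000, alpha: float = 0.05
) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean, reproducible from a recorded seed."""
    rng = random.Random(seed)
    means = sorted(
        fmean(rng.choices(scores, k=len(scores))) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def gate_delta(baseline_mean: float, candidate_mean: float, noise_floor: float) -> str:
    """Classify a relative-to-baseline change for a higher-is-better metric."""
    delta = candidate_mean - baseline_mean
    if delta >= 0:
        return "pass"
    if abs(delta) <= noise_floor:
        return "warn"  # in-noise regression: warn instead of fail
    return "fail"
```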

Distribution

  • P6-11 action.yml reusable GitHub Action: install → run → gate → upload HTML artifact → sticky PR comment with Markdown summary
  • P6-12 Dockerfile: multi-stage, slim, non-root user, pinned base digest
  • P6-13 Dogfood: the Action runs in this repo's own CI against the example pipeline
  • P6-14 Integration tests: in-noise delta passes gate; genuine regression fails it; HTML opens from disk offline

Acceptance: gate exits correctly in CI with native JUnit rendering; Action end-to-end green incl. PR comment; tag phase-6-complete.


Phase 7 - Case studies and launch

  • P7-01 DOC-007-AI adapter (native citation mapping)
  • P7-02 Legate Agent adapter (kb.search + RAG answer endpoints)
  • P7-03 Run both case studies; fix ≥1 real issue each that the scores expose; record before/after numbers
  • P7-04 Publish before/after numbers in DOC-007-AI and Legate READMEs
  • P7-05 docs/metrics.md: exact computation of every metric incl. all edge-case semantics (RW-9)
  • P7-06 docs/quickstart.md, docs/adapters.md, docs/ci.md
  • P7-07 Quickstart verified verbatim on clean Windows and Linux machines
  • P7-08 PyPI 1.0.0 release via Trusted Publishing
  • P7-09 Demo GIF: degrading PR → red gate → fix → green gate
  • P7-10 README final: leads with GIF + case-study numbers (spec §18)
  • P7-11 Launch post draft

Acceptance: both case studies show real numbers + ≥1 real improvement; pip install ragproof → 1.0.0; tag v1.0.0.


Phase 8 - Web UI (design: UI_PLAN.md)

UI-0 Foundation

  • P8-01 frontend/ workspace: Vite, React 18, TypeScript strict, Tailwind v4, shadcn/ui; ESLint + Prettier aligned with repo style
  • P8-02 Design tokens from UI_PLAN §5 (color, type, spacing) with dark mode as a first-class theme; self-hosted Inter + JetBrains Mono
  • P8-03 FastAPI server in ragproof/ui/ behind the ragproof[ui] extra; binds 127.0.0.1 by default, warning on --host override
  • P8-04 ragproof ui command: starts server, opens browser, --dev proxies Vite; clear install hint when the extra is missing
  • P8-05 Build pipeline: CI builds the bundle into ragproof/ui/static/, wheel ships it, Python jobs never need Node
  • P8-06 Bundle scan test: no external URLs in built assets; CSP default-src 'self'
  • P8-07 UI CI job: typecheck, lint, Vitest, build

UI-1 Runs and Run detail

  • P8-08 Read API: /api/meta, /api/projects, paginated /api/runs, /api/runs/{ref} reusing reports/data.py
  • P8-09 Runs table: status dot, label, relative time, case counts, pinned ScoreCells with micro-distributions, delta vs selected baseline
  • P8-10 Column picker persisted per project; two-row select enables Compare
  • P8-11 Run detail Overview: header chips (judge, hashes, cost, cache), metric cards with histograms and threshold lines, worst-cases strip
  • P8-12 Metadata tab; polling for running runs
  • P8-13 Loading, empty (teaches the CLI command), error, and partial states on every screen
  • P8-14 Consistency test: /api/runs/{ref} equals CLI --json on the same store

UI-2 Case triage

  • P8-15 /api/runs/{ref}/cases with filters, sort, cursor pagination; /cases/{key} detail
  • P8-16 Virtualized cases grid with metric columns, worst-first sort, status/kind filters
  • P8-17 Routed case side panel: question, answer, retrieved chunks with cited highlighted, per-claim verdict checklist from judge_raw, raw JSON CodeBlock
  • P8-18 Keyboard triage loop: j/k, Esc, deep links restore filter + selection

UI-3 Compare and Trends

  • P8-19 /api/compare + Compare screen: delta table with CI whiskers, mixed-judge blocking banner, dataset mismatch warning
  • P8-20 Changed-cases diff grid, worst regression first; split case view baseline vs candidate
  • P8-21 /api/trends + Trends screen: mean per run, 0-1 domain, threshold lines, verdict-colored points, click-through to runs

UI-4 Gate, Datasets, Calibration

  • P8-22 Gate tab rendering GateOutcome via /api/runs/{ref}/gate; identical verdicts to the CLI asserted in CI
  • P8-23 Datasets list and detail (generation metadata, case browser, runs over dataset)
  • P8-24 Calibration screen with agreement bars vs thresholds
  • P8-25 Command palette (Ctrl/Cmd-K): jump to run, case, screen

UI-5 Polish and release

  • P8-26 A11y pass: AA contrast both themes, focus order, reduced motion, no color-only status
  • P8-27 Performance pass against a 1,000-run seeded store; bundle under 300KB gzipped, route code-splitting
  • P8-28 Playwright smoke in CI: boot ragproof ui, walk Runs -> Run -> Case -> Compare
  • P8-29 docs/ui.md; README screenshots and GIF
  • P8-30 Ship: wheel includes bundle, ragproof[ui] extra documented, tag phase-8-complete

Acceptance: ragproof ui gives a keyboard-operable, dark-mode-native dashboard whose every number matches the CLI on the same store; quality bar in UI_PLAN §8 met.

UI-6 Control panel (overrides read-only v1 per user request)

  • P8-31 Background jobs subsystem: async JobManager with redacted log capture, terminal status, eviction, drain on shutdown
  • P8-32 Actions API reusing engine/generate/freeze/calibrate/check/report as jobs; jobs and config endpoints; report artifacts with path-traversal guard
  • P8-33 Frontend: New run + actions menu, Jobs screen with live logs, Config viewer, per-run Re-run and Report, palette actions, sidebar entries
  • P8-34 9 action/jobs API tests; verified live in the browser (run, report download, traversal block, jobs render)
  • P8-35 docs/ui.md control-panel section; security posture documented (localhost, view-only config)

Acceptance: the dashboard starts runs, generates datasets, calibrates, checks, and builds downloadable reports as background jobs with live logs, all reusing the CLI code paths.
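
The path-traversal guard mentioned in P8-32 could follow the usual resolve-then-compare pattern. This is a sketch with a hypothetical function name; it relies on Path.is_relative_to, available since Python 3.9.

```python
from pathlib import Path


def resolve_artifact(root: Path, requested: str) -> Path:
    """Resolve a requested report path, refusing anything outside the artifact root."""
    root = root.resolve()
    candidate = (root / requested).resolve()  # collapses ".." and follows symlinks
    if not candidate.is_relative_to(root):
        raise PermissionError(f"path escapes artifact root: {requested!r}")
    return candidate
```

An absolute request such as /etc/passwd replaces the root when joined, so it also fails the containment check.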


Cross-cutting (every phase, every PR)

  • Conventional commits; small reviewable diffs
  • Coverage ≥ 85% on metrics/, engine.py, judge/
  • docs/metrics.md updated in the same PR as any scoring change
  • .env.example updated in the same PR as any new env var
  • PROGRESS.md updated at each phase end
  • All randomness via seeded random.Random, seeds recorded
  • End-of-phase tag phase-N-complete