RAGProof - Task List

Companion to PHASE_PLAN.md. Work strictly in phase order; do not start a phase until the previous phase's acceptance checks (bottom of each section) all pass. RW-n references are the real-world fixes defined in PHASE_PLAN.md §2.

Status legend: [ ] todo, [x] done, [~] in progress, [!] blocked

Phase 0 - Foundation

Acceptance: uv run ragproof --help lists all commands on Windows + Linux; CI green on full matrix; pip install ragproof==0.0.1 works; tag phase-0-complete.

Phase 1 - Adapter layer, run store, run engine

Adapters

P1-01 Pydantic I/O models: RetrievedChunk, ChunkRef, RAGAnswer
P1-02 RAGAdapter protocol + capability flags supports_retrieval / supports_answer (RW-10)
P1-03 Python adapter: import-path loading; accept sync and async user implementations (sync via thread offload)
P1-04 HTTP adapter: JSONPath request/response mapping; auth from named env vars only (RW-18); per-call timeout
P1-05 HTTP retry policy: tenacity exponential backoff + jitter; retry 429/5xx/timeouts only; honor Retry-After; 4xx fail fast
P1-06 examples/minimal_python_adapter/ + examples/http_adapter_config.yaml
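
The retry rules in P1-05 can be sketched without any dependency. This is a stdlib illustration of the decision logic only, not the tenacity configuration the task calls for; should_retry, next_delay, and their parameters are hypothetical names, and only the seconds form of Retry-After is handled.

```python
import random
from typing import Optional


def should_retry(status: Optional[int], timed_out: bool = False) -> bool:
    """Retry timeouts, 429, and 5xx; every other status fails fast."""
    if timed_out:
        return True
    return status == 429 or (status is not None and 500 <= status <= 599)


def next_delay(
    attempt: int,
    retry_after_s: Optional[float] = None,
    base_s: float = 0.5,   # assumed base delay, not from the spec
    cap_s: float = 30.0,   # assumed upper bound, not from the spec
    rng: Optional[random.Random] = None,
) -> float:
    """Delay before the next attempt (attempt is 0-based)."""
    if retry_after_s is not None:
        # A server-provided Retry-After wins over the computed backoff.
        return max(0.0, retry_after_s)
    rng = rng or random.Random()
    # Exponential ceiling with full jitter: uniform in [0, ceiling].
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return rng.uniform(0.0, ceiling)
```

Passing a seeded random.Random as rng keeps the jitter reproducible in tests.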

Run store

P1-07 SQLAlchemy 2.x async models: Project, Dataset, Case, Run, Result, MetricSummary (spec §5) + Run.status (running|completed|partial|aborted:*) and Result.status (ok|error|timeout|judge_error|skipped|not_applicable) (RW-2, RW-9)
P1-08 SQLite setup: WAL mode + busy timeout on connect; single-writer asyncio.Queue task - workers never write directly (RW-1)
P1-09 Alembic migrations from the first table; schema_version recorded; newer-schema DB → clear error; older DB → auto-migrate with backup file (RW-16)
P1-10 Canonical-JSON utility (sorted keys, UTF-8, fixed separators) + sha256 hashing helper; used for all config/dataset hashes (RW-8)
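
A minimal sketch of the canonical-JSON and hashing helper from P1-10, assuming the "fixed separators" are the compact (",", ":") pair; canonical_json and sha256_hex are hypothetical names.

```python
import hashlib
import json
from typing import Any


def canonical_json(obj: Any) -> bytes:
    """Serialize with sorted keys, compact separators, and UTF-8 output."""
    text = json.dumps(
        obj,
        sort_keys=True,
        separators=(",", ":"),  # assumed choice of fixed separators
        ensure_ascii=False,     # keep non-ASCII as UTF-8 rather than \u escapes
        allow_nan=False,        # NaN/Infinity are not valid JSON; fail loudly
    )
    return text.encode("utf-8")


def sha256_hex(obj: Any) -> str:
    """Hash of the canonical form; key order in the input does not matter."""
    return hashlib.sha256(canonical_json(obj)).hexdigest()
```

Two configs that differ only in key order then hash identically, which is what makes config and dataset hashes comparable across runs.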

Engine

P1-11 Run loop: dataset iteration, asyncio.Semaphore(RAGPROOF_MAX_CONCURRENCY), per-case timeout, per-case error capture - errors recorded, never fatal, never scored 0 (RW-2)
P1-12 Incremental result persistence + ragproof run --resume <run_id> skipping completed cases (RW-2)
P1-13 Run manifest: config hash, dataset hash, seeds, package version, adapter label
P1-14 Trivial echo.exact_match metric wired end to end
P1-15 CostLedger scaffold (per-call entries; real accounting in P3)
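
The run loop in P1-11 might look roughly like the sketch below: a semaphore bounds concurrency, each case gets its own timeout, and failures become recorded statuses instead of exceptions or zero scores. CaseResult, run_cases, and answer_fn are hypothetical names; persistence and the real adapter interface are left out.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, List, Optional, Tuple


@dataclass
class CaseResult:
    case_id: str
    status: str                   # "ok" | "error" | "timeout"
    answer: Optional[str] = None
    error: Optional[str] = None   # no score field: failures are never scored 0


async def run_cases(
    cases: List[Tuple[str, str]],                  # (case_id, question)
    answer_fn: Callable[[str], Awaitable[str]],    # stand-in for the adapter
    max_concurrency: int = 4,
    timeout_s: float = 1.0,
) -> List[CaseResult]:
    sem = asyncio.Semaphore(max_concurrency)

    async def run_one(case_id: str, question: str) -> CaseResult:
        async with sem:
            try:
                answer = await asyncio.wait_for(answer_fn(question), timeout_s)
                return CaseResult(case_id, "ok", answer=answer)
            except asyncio.TimeoutError:
                return CaseResult(case_id, "timeout")
            except Exception as exc:
                # Captured and recorded; one bad case never aborts the run.
                return CaseResult(case_id, "error", error=repr(exc))

    # gather preserves input order, so results line up with cases.
    return await asyncio.gather(*(run_one(cid, q) for cid, q in cases))
```

In the real engine max_concurrency would presumably come from RAGPROOF_MAX_CONCURRENCY, and each result would be handed to the single writer task rather than returned in one list.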

Config, checks, security

P1-16 ragproof.yaml loader: strict Pydantic validation, unknown-key rejection with "did you mean" suggestions (RW-15)
P1-17 Env var layer per spec §11; referenced-but-unset vars named in errors
P1-18 ragproof check: validate config + env, probe adapter with one live question, verify DB writability (RW-15)
P1-19 Secret-redaction filter on logging and on persisted raw payloads (*_API_KEY|*_TOKEN|*_SECRET + bearer patterns) (RW-18)
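
One way the redaction filter in P1-19 could work is a logging.Filter that rewrites the formatted message. The regexes below are assumptions covering NAME_API_KEY / NAME_TOKEN / NAME_SECRET assignments and bearer tokens; redact and RedactingFilter are hypothetical names, and real payload redaction would need broader patterns.

```python
import logging
import re

# Assumed patterns: "FOO_API_KEY=value" style assignments and "Bearer <token>".
_ASSIGNMENT = re.compile(r"(?i)\b([A-Z0-9_]*(?:API_KEY|TOKEN|SECRET))\s*[=:]\s*\S+")
_BEARER = re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/=-]+")


def redact(text: str) -> str:
    """Replace secret values with a marker, keeping the variable name."""
    text = _ASSIGNMENT.sub(lambda m: f"{m.group(1)}=[REDACTED]", text)
    return _BEARER.sub("Bearer [REDACTED]", text)


class RedactingFilter(logging.Filter):
    """Redacts the fully formatted message before any handler emits it."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = redact(record.getMessage())
        record.args = ()  # message is already formatted; drop the raw args
        return True
```

Attaching the filter with handler.addFilter(RedactingFilter()) covers logging; the same redact function can be applied to raw payloads before they are persisted.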

Tests

P1-20 Both adapters tested against mocked targets (mapping, retries, timeout, 4xx fail-fast, Retry-After)
P1-21 Concurrency stress test: 32 cases, jittery mock adapter, zero SQLite lock errors (RW-1)
P1-22 Resume test: kill run mid-way, --resume completes only remaining cases
P1-23 Secret-leak test: planted API key never reaches logs or DB
P1-24 Exit-code tests: adapter down → 2; bad config → 3

Acceptance: 5-case JSONL run vs example adapter persists results + metadata; a second run is comparable to the first; all P1 tests green; tag phase-1-complete.

Phase 2 - Retrieval metrics + compare

P2-01 Metric registry: stable string names, declared requirements (needs: expected_source_ids, retrieval), skip-with-reason plumbing (RW-10)
P2-02 retrieval.precision_at_k + retrieval.recall_at_k (k configurable, default 5)
P2-03 retrieval.mrr (no hit → 0)
P2-04 retrieval.ndcg_at_k (binary relevance; graded documented as future)
P2-05 Edge-case semantics implemented + fixture-tested with exact values: <k retrieved; empty retrieval; duplicate chunk IDs (dedupe keep-first-rank); empty expected set rejected at freeze (RW-9)
P2-06 Chunk-ID vs document-ID matching granularity (config; spec §17.4)
P2-07 MetricSummary aggregation: mean, p50, p95 + scored/skipped/error counts
P2-08 ragproof compare <run_a> <run_b>: per-metric deltas; skipped shown as skipped, never 0.00
P2-09 Graceful skip when adapter lacks retrieve or cases lack expected_source_ids, with reason surfaced (RW-10)
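
The retrieval metrics in P2-02 to P2-05 can be sketched as below. Several edge-case choices here are assumptions, since the exact semantics live in the spec and fixtures: precision divides by k even when fewer than k chunks are retrieved, MRR looks at the full deduplicated ranking, and an empty expected set raises. Function names are hypothetical.

```python
import math
from typing import List, Sequence, Set


def dedupe_keep_first(ids: Sequence[str]) -> List[str]:
    """Drop repeated IDs, keeping each one at its first (best) rank."""
    seen: Set[str] = set()
    out: List[str] = []
    for cid in ids:
        if cid not in seen:
            seen.add(cid)
            out.append(cid)
    return out


def _require_expected(expected: Set[str]) -> None:
    if not expected:
        raise ValueError("empty expected set (should be rejected at freeze)")


def precision_at_k(retrieved: Sequence[str], expected: Set[str], k: int = 5) -> float:
    _require_expected(expected)
    top = dedupe_keep_first(retrieved)[:k]
    return sum(cid in expected for cid in top) / k  # assumption: denominator is k


def recall_at_k(retrieved: Sequence[str], expected: Set[str], k: int = 5) -> float:
    _require_expected(expected)
    top = dedupe_keep_first(retrieved)[:k]
    return sum(cid in expected for cid in top) / len(expected)


def mrr(retrieved: Sequence[str], expected: Set[str]) -> float:
    _require_expected(expected)
    for rank, cid in enumerate(dedupe_keep_first(retrieved), start=1):
        if cid in expected:
            return 1.0 / rank
    return 0.0  # no hit scores 0


def ndcg_at_k(retrieved: Sequence[str], expected: Set[str], k: int = 5) -> float:
    _require_expected(expected)
    top = dedupe_keep_first(retrieved)[:k]
    # Binary relevance: a hit at 1-based rank r contributes 1 / log2(r + 1).
    dcg = sum(1.0 / math.log2(r + 1) for r, cid in enumerate(top, start=1) if cid in expected)
    idcg = sum(1.0 / math.log2(r + 1) for r in range(1, min(k, len(expected)) + 1))
    return dcg / idcg
```

For retrieved = ["a", "x", "a", "b"], expected = {"a", "b"}, and k = 3, the duplicate "a" is dropped, giving precision 2/3, recall 1.0, and MRR 1.0.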

Acceptance: exact-value fixtures pass incl. all edge cases; no-retrieval adapter produces stated-skip output; tag phase-2-complete.

Phase 3 - Judge layer + generation metrics

Judge client

P3-01 Provider-agnostic judge client: OpenRouter, Ollama, OpenAI, Anthropic; temperature 0; per-call timeout + retries
P3-02 Structured JSON responses: Pydantic validation → one repair retry (validation error appended) → judge_error case status; never score 0, never drop silently (RW-7)
P3-03 Raw judge output persisted verbatim post-redaction (spec §14, RW-18)
P3-04 Judge cache: SQLite, key (model, prompt_hash, canonical_input_hash); hit/miss stats per run; --no-cache flag (RW-3, RW-6)
P3-05 Versioned prompt files in judge/prompts/; content hash recorded per run
P3-06 Mixed-judge guard: compare/gate refuse by default across different judge models/prompt hashes; --allow-mixed-judges override; output always labeled (RW-14)
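
A minimal sketch of the judge cache from P3-04, keyed on (model, prompt_hash, canonical_input_hash) and backed by sqlite3. The table layout, the JudgeCache class, and the call callback are all assumptions made for illustration.

```python
import hashlib
import json
import sqlite3
from typing import Any, Callable, Tuple


def _sha(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def cache_key(model: str, prompt_text: str, judge_input: Any) -> Tuple[str, str, str]:
    """Key parts: model name, prompt content hash, canonical-JSON input hash."""
    canonical = json.dumps(judge_input, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return model, _sha(prompt_text.encode("utf-8")), _sha(canonical.encode("utf-8"))


class JudgeCache:
    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS judge_cache ("
            "model TEXT, prompt_hash TEXT, input_hash TEXT, response TEXT, "
            "PRIMARY KEY (model, prompt_hash, input_hash))"
        )
        self.hits = 0
        self.misses = 0

    def get_or_call(self, model: str, prompt_text: str, judge_input: Any,
                    call: Callable[[], str]) -> str:
        key = cache_key(model, prompt_text, judge_input)
        row = self.db.execute(
            "SELECT response FROM judge_cache "
            "WHERE model = ? AND prompt_hash = ? AND input_hash = ?", key
        ).fetchone()
        if row is not None:
            self.hits += 1
            return row[0]
        self.misses += 1
        response = call()  # only invoked on a miss, so a warm re-run costs nothing
        self.db.execute("INSERT INTO judge_cache VALUES (?, ?, ?, ?)", (*key, response))
        self.db.commit()
        return response
```

Because the input is hashed in canonical form, reordering keys in the judge input still hits the cache, while any change to the model or prompt text misses.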

Generation metrics

P3-07 generation.groundedness: claim decomposition, per-claim verdicts in judge_raw_json; zero-claim answers → not_applicable (RW-9)
P3-08 generation.citation_validity (deterministic; duplicate-ID semantics defined and tested)
P3-09 generation.citation_support (judge)
P3-10 generation.answer_relevance (judge)
P3-11 generation.completeness (judge; skipped-with-reason when no expected_answer)

Cost

P3-12 Cost accounting: provider-reported usage preferred, tokenizer estimate fallback; per-run cost in summary
P3-13 Budget enforcement: check before each call vs RAGPROOF_MAX_COST_USD; graceful stop → status aborted:budget, partial results persisted, exit 2 (RW-6)
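
Budget enforcement as described in P3-13 could be sketched like this: the ledger is checked before each call, and a breach stops the loop with status aborted:budget while keeping what already finished. The CostLedger shape, BudgetExceeded, and the flat per-call cost are assumptions; real accounting would use provider-reported usage.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


class BudgetExceeded(Exception):
    pass


@dataclass
class CostLedger:
    max_cost_usd: float  # would come from RAGPROOF_MAX_COST_USD
    entries: List[Tuple[str, float]] = field(default_factory=list)

    @property
    def total_usd(self) -> float:
        return sum(cost for _, cost in self.entries)

    def check(self, estimated_usd: float) -> None:
        """Raise before a call that would push the run over budget."""
        if self.total_usd + estimated_usd > self.max_cost_usd:
            raise BudgetExceeded(f"{self.total_usd:.4f} + {estimated_usd:.4f} exceeds {self.max_cost_usd:.4f}")

    def record(self, label: str, cost_usd: float) -> None:
        self.entries.append((label, cost_usd))


def run_with_budget(case_ids: List[str], ledger: CostLedger,
                    cost_per_call_usd: float) -> Tuple[str, List[str]]:
    """Returns (run status, case IDs that completed before any stop)."""
    completed: List[str] = []
    for case_id in case_ids:
        try:
            ledger.check(cost_per_call_usd)
        except BudgetExceeded:
            return "aborted:budget", completed  # partial results are kept
        ledger.record(case_id, cost_per_call_usd)  # stands in for the judge call
        completed.append(case_id)
    return "completed", completed
```

With a 0.25 USD cap and 0.10 USD per call, the third call is refused and two cases are persisted.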

Calibration

P3-14 ≥10 human-scored calibration fixtures per judge prompt in judge/fixtures/
P3-15 ragproof calibrate: exact + within-1-band agreement report; thresholds in config
P3-16 CI job: run calibration when judge/prompts/** or judge/fixtures/** change; fail below threshold (spec §7.4)
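
The agreement report in P3-15 reduces to two rates over paired human and judge scores. The sketch assumes scores are integer bands, and the threshold values in the usage note are made up; agreement and passes are hypothetical names.

```python
from typing import Dict, Sequence


def agreement(human: Sequence[int], judge: Sequence[int]) -> Dict[str, float]:
    """Exact and within-1-band agreement between human and judge scores."""
    if len(human) != len(judge) or not human:
        raise ValueError("need equally sized, non-empty score lists")
    n = len(human)
    exact = sum(h == j for h, j in zip(human, judge)) / n
    within_1 = sum(abs(h - j) <= 1 for h, j in zip(human, judge)) / n
    return {"exact": exact, "within_1": within_1}


def passes(report: Dict[str, float], min_exact: float, min_within_1: float) -> bool:
    """Calibration gate: both rates must meet their configured thresholds."""
    return report["exact"] >= min_exact and report["within_1"] >= min_within_1
```

For human scores [1, 2, 3, 4, 5] and judge scores [1, 3, 3, 2, 5], exact agreement is 0.6 and within-1 agreement is 0.8, so a gate of (0.5, 0.9) would fail on the second rate.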

Tests

P3-17 All judge unit tests use recorded fixtures - no live LLM calls in CI
P3-18 Planted good/bad answers on example corpus separate cleanly on groundedness
P3-19 Malformed-judge-output path test: repair retry → judge_error → run completes, error visible in summary
P3-20 Cache reproducibility test: unchanged re-run ≈ $0 and byte-identical judge-metric scores
P3-21 Budget-breach test: mid-run stop, partial persisted, exit 2
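
The malformed-output path covered by P3-02 and P3-19 can be sketched as parse, one repair retry with the validation error appended, then judge_error. Plain json plus a hand-written check stands in for Pydantic here, and the single score field in [0, 1] is an assumed schema; parse_verdict and judge_with_repair are hypothetical names.

```python
import json
from typing import Callable, Optional, Tuple


def parse_verdict(raw: str) -> dict:
    """Validate judge output; raises ValueError/KeyError/TypeError when malformed."""
    data = json.loads(raw)
    score = data["score"]
    if isinstance(score, bool) or not isinstance(score, (int, float)) or not 0 <= score <= 1:
        raise ValueError("score must be a number in [0, 1]")
    return {"score": float(score)}


def judge_with_repair(call_judge: Callable[[str], str],
                      prompt: str) -> Tuple[str, Optional[dict]]:
    """Returns (status, verdict); status is "ok" or "judge_error"."""
    try:
        return "ok", parse_verdict(call_judge(prompt))
    except (ValueError, KeyError, TypeError) as first_error:
        # Exactly one repair attempt, with the validation error appended.
        repair_prompt = (
            f"{prompt}\n\nYour previous reply was invalid: {first_error!r}. "
            "Reply with valid JSON only."
        )
    try:
        return "ok", parse_verdict(call_judge(repair_prompt))
    except (ValueError, KeyError, TypeError):
        # No verdict and no score: the case is marked, never scored 0.
        return "judge_error", None
```

The caller records judge_error as the case status and carries on, so the run completes and the error shows up in the summary counts.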

Acceptance: all P3 tests green; calibration gate demonstrated on a deliberately bad prompt change; tag phase-3-complete.

Phase 4 - Dataset generation

Acceptance: generate on tiny corpus yields spot-checkable answerable QA cases with reported discard rate; immutability enforced; tag phase-4-complete.

Phase 5 - Robustness metrics

P5-01 Payload library, 10+ types: instruction override, exfiltration URL (*.invalid only), tone hijack, citation spoofing, system-prompt fishing, formatting hijack, steering, fake-citation injection, link-bait, chained instructions
P5-02 Per-payload deterministic compliance detectors (string/regex), each with positive and negative fixture tests
P5-03 Payload safety lint test: inert markers only, *.invalid/example.com URLs only, no shell commands or real endpoints - runs in the standard suite (RW-17)
P5-04 robustness.injection_resistance = 1 − compliance rate
P5-05 robustness.abstention on unanswerable cases: refusal heuristic + judge confirmation
P5-06 robustness.overrefusal on answerable cases; reported side by side with abstention (RW-5)
P5-07 Fabrication-on-unanswerable weighted prominently in summaries (spec §7.3)
P5-08 examples/: deliberately vulnerable pipeline + guarded pipeline
P5-09 Integration test: vulnerable scores low / guarded scores high on injection resistance; always-refusing pipeline shows high abstention and high overrefusal
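
The scoring in P5-02 and P5-04 can be illustrated with two regex detectors and the 1 − compliance rate formula. The canary marker, the detector names, and the regexes are invented for this sketch; the real payload library defines its own inert markers.

```python
import re
from typing import Dict, List, Pattern, Tuple

# Hypothetical detectors: each fires when the answer shows the pipeline
# followed the injected instruction.
DETECTORS: Dict[str, Pattern[str]] = {
    # The payload asks the model to emit an inert canary string.
    "instruction_override": re.compile(r"RAGPROOF-CANARY-7F3A"),
    # The payload asks the model to link to a non-resolvable *.invalid host.
    "exfiltration_url": re.compile(r"https?://[\w.-]+\.invalid\b"),
}


def complied(payload_type: str, answer: str) -> bool:
    """True when the answer matches the payload's compliance pattern."""
    return DETECTORS[payload_type].search(answer) is not None


def injection_resistance(results: List[Tuple[str, str]]) -> float:
    """1 - compliance rate over (payload_type, answer) pairs."""
    if not results:
        raise ValueError("no injected cases to score")
    compliance_rate = sum(complied(t, a) for t, a in results) / len(results)
    return 1.0 - compliance_rate
```

One complying answer out of four injected cases gives a resistance of 0.75.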

Acceptance: all detectors fixture-tested; safety lint in CI; example-pipeline contrast asserted; tag phase-5-complete.

Phase 6 - Reports, CI gate, distribution

Reports

P6-01 HTML report: single self-contained file - overview, per-metric distributions, run comparison, worst-10 cases per metric (question/answer/context/judge reasoning), skip/error counts, cost, dataset/config/prompt hashes
P6-02 Vendor Chart.js into the template with license header; zero network requests - automated check greps report for external resource loads (RW-13)
P6-03 Markdown summary (PR-comment sized)
P6-04 JUnit XML: one test per metric; execution errors → <error>, threshold breaches → <failure> (RW-4)
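
The JUnit mapping in P6-04 might be built with xml.etree.ElementTree as below: one testcase per metric, an error element for execution errors, and a failure element for threshold breaches. The input dict shape, the suite and classname values, and the "below threshold fails" rule are assumptions.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def junit_xml(metrics: List[Dict]) -> str:
    """Each metric dict has "name" plus either "error" or "value" and "threshold"."""
    suite = ET.Element("testsuite", name="ragproof", tests=str(len(metrics)))
    failures = errors = 0
    for m in metrics:
        case = ET.SubElement(suite, "testcase", classname="ragproof.gate", name=m["name"])
        if m.get("error"):
            # The metric could not be computed: an execution error, not a failure.
            errors += 1
            ET.SubElement(case, "error", message=m["error"])
        elif m["value"] < m["threshold"]:
            # The metric was computed but breached its threshold.
            failures += 1
            ET.SubElement(case, "failure",
                          message=f"{m['value']:.3f} < threshold {m['threshold']:.3f}")
    suite.set("failures", str(failures))
    suite.set("errors", str(errors))
    return ET.tostring(suite, encoding="unicode")
```

Keeping errors and failures distinct lets CI dashboards separate "the judge was unreachable" from "quality regressed".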

Gate

P6-05 ragproof gate: absolute thresholds + relative-to-baseline deltas
P6-06 Per-metric noise_floor config; bootstrap 95% CIs on judge-backed metrics; in-noise deltas warn instead of fail (RW-3)
P6-07 on_missing: fail|skip behavior for skipped metrics (default fail) (RW-10)
P6-08 Minimum-sample warning when n < 30 (RW-3)
P6-09 Exit-code contract enforced end to end (0/1/2/3) with tests
P6-10 --json output on run/compare/gate
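
A rough sketch of the noise handling in P6-06: a percentile bootstrap for the 95% CI of a metric mean, and a delta check that warns when a drop stays inside the noise floor. How the CI feeds the verdict is not specified in this list, so the two pieces are kept separate; bootstrap_ci and gate_delta are hypothetical names and the resample count is arbitrary.

```python
import random
from typing import Sequence, Tuple


def bootstrap_ci(scores: Sequence[float], n_resamples: int = 1000,
                 seed: int = 0, alpha: float = 0.05) -> Tuple[float, float]:
    """Percentile bootstrap CI for the mean, using a seeded RNG."""
    if not scores:
        raise ValueError("no scores")
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and collect the mean of each resample.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def gate_delta(baseline_mean: float, candidate_mean: float, noise_floor: float) -> str:
    """"pass" on no drop, "warn" on a drop within the noise floor, else "fail"."""
    delta = candidate_mean - baseline_mean
    if delta >= 0:
        return "pass"
    if abs(delta) <= noise_floor:
        return "warn"
    return "fail"
```

With a noise floor of 0.02, a drop from 0.80 to 0.79 warns while a drop to 0.70 fails.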

Distribution

P6-11 action.yml reusable GitHub Action: install → run → gate → upload HTML artifact → sticky PR comment with Markdown summary
P6-12 Dockerfile: multi-stage, slim, non-root user, pinned base digest
P6-13 Dogfood: the Action runs in this repo's own CI against the example pipeline
P6-14 Integration tests: in-noise delta passes gate; genuine regression fails it; HTML opens from disk offline

Acceptance: gate exits correctly in CI with native JUnit rendering; Action end-to-end green incl. PR comment; tag phase-6-complete.

Phase 7 - Case studies and launch

Acceptance: both case studies show real numbers + ≥1 real improvement; pip install ragproof installs 1.0.0; tag v1.0.0.

Phase 8 - Web UI (design: UI_PLAN.md)

UI-0 Foundation

P8-01 frontend/ workspace: Vite, React 18, TypeScript strict, Tailwind v4, shadcn/ui; ESLint + Prettier aligned with repo style
P8-02 Design tokens from UI_PLAN §5 (color, type, spacing) with dark mode as a first-class theme; self-hosted Inter + JetBrains Mono
P8-03 FastAPI server in ragproof/ui/ behind the ragproof[ui] extra; binds 127.0.0.1 by default, warning on --host override
P8-04 ragproof ui command: starts server, opens browser, --dev proxies Vite; clear install hint when the extra is missing
P8-05 Build pipeline: CI builds the bundle into ragproof/ui/static/, wheel ships it, Python jobs never need Node
P8-06 Bundle scan test: no external URLs in built assets; CSP default-src 'self'
P8-07 UI CI job: typecheck, lint, Vitest, build

UI-1 Runs and Run detail

P8-08 Read API: /api/meta, /api/projects, paginated /api/runs, /api/runs/{ref} reusing reports/data.py
P8-09 Runs table: status dot, label, relative time, case counts, pinned ScoreCells with micro-distributions, delta vs selected baseline
P8-10 Column picker persisted per project; two-row select enables Compare
P8-11 Run detail Overview: header chips (judge, hashes, cost, cache), metric cards with histograms and threshold lines, worst-cases strip
P8-12 Metadata tab; polling for running runs
P8-13 Loading, empty (teaches the CLI command), error, and partial states on every screen
P8-14 Consistency test: /api/runs/{ref} equals CLI --json on the same store

UI-2 Case triage

P8-15 /api/runs/{ref}/cases with filters, sort, cursor pagination; /cases/{key} detail
P8-16 Virtualized cases grid with metric columns, worst-first sort, status/kind filters
P8-17 Routed case side panel: question, answer, retrieved chunks with cited highlighted, per-claim verdict checklist from judge_raw, raw JSON CodeBlock
P8-18 Keyboard triage loop: j/k, Esc, deep links restore filter + selection

UI-3 Compare and Trends

P8-19 /api/compare + Compare screen: delta table with CI whiskers, mixed-judge blocking banner, dataset mismatch warning
P8-20 Changed-cases diff grid, worst regression first; split case view baseline vs candidate
P8-21 /api/trends + Trends screen: mean per run, 0-1 domain, threshold lines, verdict-colored points, click-through to runs

UI-4 Gate, Datasets, Calibration

P8-22 Gate tab rendering GateOutcome via /api/runs/{ref}/gate; identical verdicts to the CLI asserted in CI
P8-23 Datasets list and detail (generation metadata, case browser, runs over dataset)
P8-24 Calibration screen with agreement bars vs thresholds
P8-25 Command palette (Ctrl/Cmd-K): jump to run, case, screen

UI-5 Polish and release

P8-26 A11y pass: AA contrast both themes, focus order, reduced motion, no color-only status
P8-27 Performance pass against a 1,000-run seeded store; bundle under 300KB gzipped, route code-splitting
P8-28 Playwright smoke in CI: boot ragproof ui, walk Runs → Run → Case → Compare
P8-29 docs/ui.md; README screenshots and GIF
P8-30 Ship: wheel includes bundle, ragproof[ui] extra documented, tag phase-8-complete

Acceptance: ragproof ui gives a keyboard-operable, dark-mode-native dashboard whose every number matches the CLI on the same store; quality bar in UI_PLAN §8 met.

UI-6 Control panel (overrides read-only v1 per user request)

P8-31 Background jobs subsystem: async JobManager with redacted log capture, terminal status, eviction, drain on shutdown
P8-32 Actions API reusing engine/generate/freeze/calibrate/check/report as jobs; jobs and config endpoints; report artifacts with path-traversal guard
P8-33 Frontend: New run + actions menu, Jobs screen with live logs, Config viewer, per-run Re-run and Report, palette actions, sidebar entries
P8-34 9 action/jobs API tests; verified live in the browser (run, report download, traversal block, jobs render)
P8-35 docs/ui.md control-panel section; security posture documented (localhost, view-only config)
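
The path-traversal guard mentioned in P8-32 can be sketched with pathlib: resolve the requested path against the artifact root and refuse anything that lands outside it. resolve_artifact and the choice of PermissionError are assumptions for illustration.

```python
from pathlib import Path


def resolve_artifact(root: Path, requested: str) -> Path:
    """Return the resolved artifact path, or raise if it escapes root."""
    root = root.resolve()
    # resolve() collapses ".." segments and follows symlinks; joining an
    # absolute "requested" discards root entirely, which the check below catches.
    candidate = (root / requested).resolve()
    if root not in candidate.parents:
        raise PermissionError(f"artifact path escapes root: {requested!r}")
    return candidate
```

A request for "run1/report.html" resolves under the root, while "../etc/passwd" or an absolute path such as "/etc/passwd" is rejected before any file is opened.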

Acceptance: the dashboard starts runs, generates datasets, calibrates, checks, and builds downloadable reports as background jobs with live logs, all reusing the CLI code paths.

Cross-cutting (every phase, every PR)

Conventional commits; small reviewable diffs
Coverage ≥ 85% on metrics/, engine.py, judge/
docs/metrics.md updated in the same PR as any scoring change
.env.example updated in the same PR as any new env var
PROGRESS.md updated at each phase end
All randomness via seeded random.Random, seeds recorded
End-of-phase tag phase-N-complete
