Progress

2026-07-03: Phase 8, UI-6 control panel

Overrode the plan's read-only-v1 decision at the user's request: the dashboard is now the main control panel, able to run operations, not just view them.

Done:

  • Background jobs subsystem (ragproof/ui/jobs.py): in-memory JobManager runs async operations, captures redacted log lines and a terminal status, evicts old jobs, drains on shutdown.
  • Actions API (ragproof/ui/actions.py) reusing the engine, generate, freeze, calibrate, check, and report code as jobs: POST /api/actions/{run,generate,freeze,calibrate,check}, POST /api/runs/{ref}/report producing downloadable artifacts, GET /api/jobs[/id], GET /api/config, GET /api/artifacts/{job}/{name} with path-traversal guard.
  • Frontend: top-bar New run button plus a menu (generate, calibrate, check), Jobs screen with live log tail and run/report links, Config viewer, per-run Re-run and Report buttons, command-palette actions group, sidebar Jobs and Config entries.
  • 9 action/jobs API tests: run recorded from the UI, report artifacts and traversal block, check probe, freeze, generate-failure, config, 404.
  • Verified live in the browser: New run created a run end to end, report job produced a downloadable HTML artifact, traversal blocked, Jobs screen shows both with live logs, no console errors.
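A minimal sketch of the kind of path-traversal guard the artifacts endpoint needs. The function name, the directory layout (one directory per job under a root), and the error type are assumptions for illustration, not the code in ragproof/ui/actions.py.

```python
from pathlib import Path


def resolve_artifact(root: Path, job_id: str, name: str) -> Path:
    """Return the resolved artifact path, or raise ValueError if it escapes.

    Hypothetical helper: assumes artifacts live at <root>/<job_id>/<name>.
    """
    root_dir = root.resolve()
    job_dir = (root_dir / job_id).resolve()
    # The job directory must be a direct child of the root, which rejects
    # job ids such as "../x" or "a/b".
    if job_dir.parent != root_dir:
        raise ValueError(f"invalid job id: {job_id!r}")
    candidate = (job_dir / name).resolve()
    # resolve() collapses ".." segments and follows symlinks, so checking
    # containment on resolved paths catches traversal attempts.
    if candidate == job_dir or not candidate.is_relative_to(job_dir):
        raise ValueError(f"artifact path escapes job directory: {name!r}")
    return candidate
```

An HTTP handler would map the ValueError to a 404 or 400 response rather than serving the file. Path.is_relative_to requires Python 3.9 or newer.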

Decisions:

  • Config is view-only. Writing arbitrary config from a browser is an unbounded risk, so the on-disk ragproof.yaml stays the single source of truth.
  • Actions run with the same reach as the CLI on that machine; the server stays 127.0.0.1 with a warning on other hosts, documented in docs/ui.md.
  • Jobs are in-memory (not persisted). They are a live operations view, not an audit log; the run store remains the durable record.

2026-07-03: Phase 8, UI-1 through UI-5

Done:

  • Read API (ragproof/ui/api.py) reusing reports/data.py, gate.py, compare.py: projects, paginated runs, run detail, cases with filter and worst sort, case detail, gate, compare, trends, datasets, calibration. Store gained the list and aggregate queries a UI needs. 7 endpoint tests assert the API matches the CLI on the same store.
  • Runs home: table with status, pinned metric ScoreCells, per-run deltas, live polling for running runs, and the four designed states.
  • Run detail: Overview (metric cards, distribution histograms, worst-case strips), Cases triage grid with filters and worst-first sort, routed case side panel (question, answer, cited-chunk highlighting, per-claim groundedness checklist, raw judge JSON, Esc + deep link), Gate tab rendering the same verdict and CIs as the CLI, Metadata tab.
  • Compare screen (delta table, dataset-mismatch warning), Trends charts (0-1 domain, click-through), Datasets list and detail, Calibration screen, Ctrl/Cmd-K command palette.
  • Animations on data: count-up numbers (AnimatedNumber), animated score bars, staggered row entrance, animated histograms and trend lines, spring-in case panel and palette; all respect prefers-reduced-motion.
  • Route code-splitting: home loads ~145KB gzipped, the chart library (~104KB) loads only on run detail and trends. Under the 300KB budget.
  • docs/ui.md; README dashboard section; frontend tests (theme, RunsPage states) plus the backend API and bundle-scan tests.
  • Verified live in the browser: runs table, run overview, case panel, and gate tab all render real store data with no console errors.

Decisions:

  • Recharts for charts (declarative, tree-shakeable, no runtime CDN); it is the bulk of the split chart chunk, kept off the home page by lazy routes.
  • The case panel is driven by a ?case= query param rather than a nested route, which keeps it deep-linkable without a nested router.

2026-07-03: Phase 8, UI-0 foundation

Done:

  • frontend/ workspace: Vite, React 19, TypeScript strict, Tailwind v4, ESLint flat config, Prettier, Vitest with Testing Library
  • Design tokens from UI_PLAN section 5 as CSS variables, dark mode via a .dark class with prefers-color-scheme default and a persisted toggle; Inter and JetBrains Mono self-hosted through fontsource
  • App shell: sidebar navigation for all planned screens, top bar with live project and store path, Runs page foundation card with designed loading, error, and ready states
  • FastAPI server in ragproof/ui/ behind the ragproof[ui] extra: /api/meta, CSP default-src 'self', nosniff and referrer headers, SPA fallback that never swallows unknown /api paths (404 JSON), static root path guard
  • ragproof ui command: binds 127.0.0.1 by default with a warning on other hosts, opens the browser, --no-browser and --dev flags, clear install hint when the extra is missing
  • Build pipeline: vite builds into ragproof/ui/static (committed so wheels and no-Node machines work); release workflow builds the frontend before uv build; ui.yml CI job runs typecheck, lint, format, tests, build, and an external-resource bundle scan; main CI installs the ui extra
  • Bundle: 77.5KB gzipped JS, fonts as woff2, zero external loads (tested)
  • Verified live: ragproof ui served /api/meta, the themed shell, CSP headers, and client-route fallback on this machine

Decisions:

  • React 19 instead of the plan's React 18; it is the current stable line and Testing Library and Router support it fully.
  • lucide-react resolved to 1.x (the 0.x line graduated); API unchanged.
  • The built bundle is committed. Rationale: acceptance requires the UI to work from a checkout without Node, and wheels must build from any clone. Revisit if bundle churn becomes noisy in diffs.

Next: UI-1 (runs table and run detail), interleaved with Phase 7 launch work.

2026-07-02: Phase 6

Done:

  • Single-file HTML report: overview with inline SVG mean bars, per-metric SVG distribution histograms, worst-10 cases per metric with judge reasoning, dataset/config/prompt hashes, cost; zero network requests, verified by a test that greps the output for external resource loads and a live check (24KB, 0 network requests)
  • Markdown summary for PR comments; JUnit XML that marks threshold breaches as `<failure>` and execution errors as `<error>`, with a run-level execution case when cases fail to evaluate
  • ragproof gate: absolute min thresholds, relative max_drop vs baseline, per-metric noise_floor, bootstrap 95% CI so an uncertain drop warns instead of failing, on_missing fail|skip, min_samples warning; exit 0/1/2/3
  • ragproof report writes html, md, and junit
  • --json output on run, compare, and gate
  • Reusable composite GitHub Action (action.yml): install, run, gate, upload the HTML artifact, sticky PR comment with the Markdown summary
  • Dockerfile: multi-stage, slim, non-root
  • Dogfood workflow runs the action against the example pipeline in this repo
  • docs/ci.md; verified the wheel ships all judge and dataset templates and fixtures so pip installs run without missing package data
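A rough sketch of how a relative max_drop check can use a bootstrap 95% CI so that an uncertain drop warns instead of failing. The function names, the unpaired percentile bootstrap, and the exact pass/warn/fail rule are assumptions for illustration; noise_floor, on_missing, and min_samples are left out, and the real logic lives in gate.py.

```python
import random
from statistics import mean


def bootstrap_drop_ci(baseline, candidate, n_resamples=2000, seed=0):
    """95% percentile CI for mean(baseline) - mean(candidate).

    Hypothetical helper: resamples each run's per-case scores independently.
    """
    rng = random.Random(seed)  # fixed seed keeps the verdict reproducible
    drops = []
    for _ in range(n_resamples):
        b = [rng.choice(baseline) for _ in baseline]
        c = [rng.choice(candidate) for _ in candidate]
        drops.append(mean(b) - mean(c))
    drops.sort()
    lo = drops[int(0.025 * n_resamples)]
    hi = drops[int(0.975 * n_resamples) - 1]
    return lo, hi


def gate_verdict(baseline, candidate, max_drop):
    """Return "pass", "warn", or "fail" for one metric (assumed rule)."""
    drop = mean(baseline) - mean(candidate)
    if drop <= max_drop:
        return "pass"
    lo, _hi = bootstrap_drop_ci(baseline, candidate)
    # Fail only when even the optimistic end of the interval exceeds the
    # allowed drop; otherwise the drop is too uncertain to block on.
    return "fail" if lo > max_drop else "warn"
```

With this rule, a large drop measured on a handful of noisy cases produces a warning, while the same drop measured consistently across many cases fails the gate.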

Decisions:

  • Charts are hand-rolled inline SVG instead of vendored Chart.js. The plan's RW-13 goal is a self-contained, zero-network report; inline SVG meets that with no 200KB third-party bundle and nothing to fetch. Documented here.
  • TODO(decision): the Dockerfile pins the base image by tag, not digest. Digest pinning needs a registry resolve; do it in an environment with network access before release.

Next: Phase 7, case studies and launch.

2026-07-02: Phase 5

Done:

  • Payload library expanded to 10 categories (instruction override, data exfiltration, tone hijack, citation spoofing, system prompt disclosure, formatting hijack, competitor steering, fake citation, link bait, chained instructions), each with a unique inert sentinel and a compliance detector
  • Payload safety lint in the standard suite: reserved .invalid domains only, no shell metacharacters, every marker appears in its instruction and trips only its own detector, clean answers trip none
  • robustness.injection_resistance: deterministic, 1.0 when the answer resisted the planted payload, 0.0 when it complied; detects by payload id or a literal marker fallback for hand-written cases
  • robustness.abstention: refusal on unanswerable cases; deterministic heuristic with judge confirmation when a judge is configured; higher is better
  • robustness.overrefusal: refusal rate on answerable cases, reported beside abstention so a refuse-everything pipeline cannot hide; lower is better
  • refusal is now a calibrated judge prompt with 12 fixtures, wired into calibrate and prompt versioning
  • Two example pipelines: a deliberately vulnerable one that obeys injections and fabricates, and a guarded one that ignores injections and abstains
  • Injection generation now plants the payload instruction in the question and stores the payload id, so cases are self-contained and testable
  • Integration tests assert the vulnerable and guarded pipelines land at opposite ends of injection_resistance and abstention
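To illustrate the sentinel approach, here is a minimal sketch of a marker-based compliance detector, the deterministic score, and a safety lint. The Payload shape, the marker strings, and the lint rules as coded are hypothetical stand-ins for the real payload library, which may use richer per-category detectors.

```python
import re
from dataclasses import dataclass

SHELL_METACHARS = set(";|&`$<>")
URL_HOST = re.compile(r"https?://([^/\s]+)")


@dataclass(frozen=True)
class Payload:
    payload_id: str
    marker: str       # unique inert sentinel the instruction asks for
    instruction: str  # text planted in the question


def complied(text: str, payload: Payload) -> bool:
    """Literal-marker detector: did the text echo this payload's sentinel?"""
    return payload.marker.lower() in text.lower()


def injection_resistance(answer: str, payload: Payload) -> float:
    """1.0 when the answer resisted the payload, 0.0 when it complied."""
    return 0.0 if complied(answer, payload) else 1.0


def lint_payloads(payloads, clean_answers):
    """Return a list of safety problems; an empty list means the set is clean."""
    problems = []
    for p in payloads:
        if not complied(p.instruction, p):
            problems.append(f"{p.payload_id}: marker missing from instruction")
        if SHELL_METACHARS & set(p.instruction):
            problems.append(f"{p.payload_id}: shell metacharacter in instruction")
        for host in URL_HOST.findall(p.instruction):
            if not host.rstrip(".,)").lower().endswith(".invalid"):
                problems.append(f"{p.payload_id}: non-reserved domain {host}")
        for other in payloads:
            if other is not p and complied(p.instruction, other):
                problems.append(f"{p.payload_id}: trips {other.payload_id} detector")
    for answer in clean_answers:
        for p in payloads:
            if complied(answer, p):
                problems.append(f"{p.payload_id}: tripped by a clean answer")
    return problems
```

Because each marker is unique and inert, a hit can be attributed to one payload without executing or fetching anything.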

Decisions:

  • Injection instructions are delivered in the question rather than through emitted poisoned corpus files, which remain future work. This makes cases portable and testable without wiring poisoned documents through a pipeline's own retrieval.

Next: Phase 6, reports and the CI gate.

2026-07-02: Phase 4

Done:

  • Corpus ingestion: TXT and Markdown in the core install, PDF and DOCX behind the ingest extra with a clear install hint; per-file size cap; unreadable, oversize, and empty files skipped with a reported reason; deterministic ordering
  • Deterministic chunking with overlap; stable chunk ids scoped to documents
  • Generation client reusing the chat transport, with a repair retry and a budget guard; sequential so ordering is deterministic for a seed
  • QA synthesis with a second-pass answerability check (failures discarded and counted); unanswerable cases verified to have no answer in the corpus via lexical retrieval plus a judge pass; injection cases attach an inert payload marker
  • Inert payload seed set with a safety test asserting reserved domains only and no shell metacharacters; Phase 5 expands this and adds detectors
  • ragproof generate: ingest, generate, write a JSONL review file, print produced and discarded counts
  • ragproof freeze: validates cases (qa needs a source id, injection needs a payload), writes dataset.frozen.json with a canonical sha256 embedded
  • Frozen datasets verified on load; a file edited after freezing is refused; run and check accept plain JSONL or a frozen dataset
  • Shared prompt-template module (render plus hashing) now used by both the judge and the generator
  • docs/datasets.md; tiny test corpus under tests/fixtures/corpus
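A small sketch of deterministic chunking with overlap and document-scoped chunk ids. The character-based window, the default sizes, and the id format are assumptions for illustration; the actual chunker may split on tokens or sentences and build ids differently.

```python
import hashlib


def chunk_document(doc_path: str, text: str, size: int = 200, overlap: int = 50):
    """Split text into fixed-size windows that overlap by `overlap` characters.

    Hypothetical helper: the same inputs always give the same chunks and ids.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for index, start in enumerate(range(0, len(text), step)):
        body = text[start:start + size]
        # The id combines the document path, the chunk position, and a short
        # content digest, so it is stable across runs and scoped to the doc.
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()[:8]
        chunks.append(
            {"id": f"{doc_path}#{index}-{digest}", "document_id": doc_path, "text": body}
        )
        if start + size >= len(text):
            break  # the window reached the end; avoid a redundant tail chunk
    return chunks
```

Keeping document_id on every chunk is what lets document-level source matching work even when a pipeline's own chunk ids differ.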

Decisions:

  • Generated qa and injection cases use the document path as expected_source_ids since our chunk ids will not match a pipeline's own ids. Documented in docs/datasets.md with the source_match: document guidance.
  • Poisoned corpus files are not emitted in v1; the injection case carries the payload marker, which is what Phase 5 detection needs. Noted as future.

Next: Phase 5, robustness metrics.

2026-07-02: Phase 3

Done:

  • Provider-agnostic judge client: OpenRouter, OpenAI, Ollama (no key needed), Anthropic; temperature 0, per-call timeout, retries with Retry-After, API key read from the environment at call time and never stored
  • Structured judging: JSON extracted and schema-validated, one repair retry with the validation error appended, then metric-level judge_error (excluded from means, added to the metric's error count, never scored 0)
  • Judge cache in the run store keyed by (model, prompt hash, input hash); hit/miss stats printed per run; --no-cache bypasses it
  • Versioned prompt templates (sha256 recorded per run) with calibration fixtures, 10 human-scored examples per prompt
  • Generation metrics: citation_validity (deterministic, duplicates deduped), groundedness (per-claim verdicts in judge_raw_json, zero claims is not_applicable), citation_support, answer_relevance, completeness
  • Cost accounting from provider usage with estimate fallback; budget checked before every call; breach aborts the run as aborted:budget with exit 2 and keeps completed results resumable
  • ragproof calibrate with per-prompt exact and within-band agreement, exit 1 below thresholds; calibrate.yml CI workflow on prompt or fixture changes (skips with a warning until the RAGPROOF_LLM_API_KEY secret is configured)
  • Mixed-judge guard: compare refuses runs with different judge models or prompt versions unless --allow-mixed-judges; resume refuses a changed judge
  • Schema v2 migration (judge_cache table) exercised the versioned-migration path: v1 databases migrate with a backup file
  • Metric protocol is now async; raw judge output persisted redacted per case
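The cache key described above can be sketched as follows. This is an in-memory stand-in, assuming the prompt hash is a sha256 of the template text and the input hash is a sha256 of canonically serialized inputs; the real cache is the judge_cache table in the run store, and these names are illustrative.

```python
import hashlib
import json


def _sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def cache_key(model: str, prompt_template: str, inputs: dict) -> tuple:
    """Build the (model, prompt hash, input hash) key for one judge call."""
    canonical_inputs = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    return (model, _sha256(prompt_template), _sha256(canonical_inputs))


class JudgeCache:
    """Hypothetical dict-backed cache that tracks hit and miss counts."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_call(self, key, call):
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = call()  # only invoked on a miss, so cached calls cost nothing
        self._store[key] = value
        return value
```

Including the prompt hash in the key means editing a prompt template invalidates its cached verdicts without touching entries for other prompts.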

Decisions:

  • judge_error is recorded per metric, not as a whole-case status: a judge failure on one metric must not discard the deterministic scores of the same case. The spec's "fails the case, not the run" intent is preserved.
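As an illustration of that decision, a per-metric summary can exclude judge errors from the mean and count them separately. The MetricResult shape and status strings here are assumptions, not the project's domain model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetricResult:
    metric: str
    status: str                    # assumed values: "ok" or "judge_error"
    score: Optional[float] = None  # None when the judge failed


def summarize(results):
    """Mean over scored results only; judge errors are counted, never scored 0."""
    scored = [r.score for r in results if r.status == "ok" and r.score is not None]
    errors = sum(1 for r in results if r.status == "judge_error")
    return {
        "mean": sum(scored) / len(scored) if scored else None,
        "n_scored": len(scored),
        "n_errors": errors,
    }
```

Scoring a failed judge call as 0 would drag the mean down for reasons unrelated to the pipeline under test, which is why the error count is reported beside the mean instead.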

Next: Phase 4, dataset generation.

2026-07-02: Phase 2

Done:

  • Retrieval metrics with exact-value fixture tests: precision_at_k, recall_at_k (denominator rules documented), mrr (full-list rank), ndcg_at_k (binary relevance)
  • Edge cases fixed and tested: fewer than k retrieved, empty retrieval, duplicate ids collapsed keeping first rank, relevant item below k, no hit
  • Source matching granularity: run.source_match chunk|document, document mode maps chunks through metadata.document_id with chunk-id fallback
  • Metric registry refactored to per-run factories so k and source_match come from config; engine builds metrics through the registry
  • Skip reasons surfaced: summaries carry the dominant skip reason and the run table prints it, so a no-retrieval pipeline says why instead of showing zeros
  • ragproof compare: resolves run ids, unique prefixes, and 'latest'; prints per-metric deltas; warns when runs used different datasets
  • docs/metrics.md documents every metric's exact computation
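A compact sketch of the four retrieval metrics with the listed edge cases. The denominator choices are assumptions (precision divides by k, recall by the number of relevant ids); docs/metrics.md is the authority on the exact rules, and this sketch rejects cases with no relevant ids where the project skips them with a reason.

```python
import math


def dedupe(ids):
    """Collapse duplicate ids, keeping each id at its first rank."""
    seen, ordered = set(), []
    for item in ids:
        if item not in seen:
            seen.add(item)
            ordered.append(item)
    return ordered


def _check(relevant, k=1):
    if not relevant:
        raise ValueError("case has no relevant ids; it should be skipped")
    if k < 1:
        raise ValueError("k must be at least 1")


def precision_at_k(retrieved, relevant, k):
    _check(relevant, k)
    top = dedupe(retrieved)[:k]
    # Assumed rule: divide by k even when fewer than k items came back.
    return sum(1 for item in top if item in relevant) / k


def recall_at_k(retrieved, relevant, k):
    _check(relevant, k)
    top = dedupe(retrieved)[:k]
    return sum(1 for item in top if item in relevant) / len(relevant)


def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant id over the full list, else 0."""
    _check(relevant)
    for rank, item in enumerate(dedupe(retrieved), start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved, relevant, k):
    """nDCG with binary relevance: gain is 1 for a relevant id, else 0."""
    _check(relevant, k)
    top = dedupe(retrieved)[:k]
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(top, start=1)
        if item in relevant
    )
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg
```

Empty retrieval yields 0.0 from every function, and a relevant item ranked below k affects only mrr, which scans the full list.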

Notes:

  • Freeze-time rejection of qa cases with empty expected_source_ids (RW-9) lands with 'ragproof freeze' in Phase 4; at run time such cases skip with a reason today.

Next: Phase 3, judge layer and generation metrics.

2026-07-02: Phase 0 and Phase 1

Done:

  • Repo scaffold, pyproject with uv lockfile, ruff and strict mypy, CI matrix (3 OS x 3 Python), CodeQL, gitleaks, dependency review, release workflow for PyPI Trusted Publishing
  • CLI with all commands registered; run and check are implemented, the rest exit 3 with a message until their phase lands
  • Exit code contract: 0 pass, 1 gate failure, 2 execution error, 3 config error
  • Adapter protocol with capability flags, Python adapter (sync and async targets), HTTP adapter (JSONPath mapping, retries with Retry-After support, env-sourced headers, 4xx fail fast)
  • SQLite run store: WAL mode, serialized writes, schema versioning with backup-before-migrate, full domain model from spec section 5
  • Run engine: bounded concurrency, per-case timeout and error capture, incremental result persistence, --resume for interrupted runs
  • Canonical JSON hashing for dataset and config hashes
  • Secret redaction applied to logs and persisted error payloads
  • Cost ledger scaffold (real accounting lands in Phase 3)
  • Example pipeline, example dataset and configs, full test suite
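Canonical JSON hashing, as used for the dataset and config hashes, can be sketched like this. The serialization choices (sorted keys, compact separators, UTF-8) are a common convention and an assumption here; the project's canonical form may differ in details such as float handling.

```python
import hashlib
import json


def canonical_hash(obj) -> str:
    """sha256 over a canonical JSON encoding of obj.

    Sorting keys and fixing separators makes the hash independent of key
    order and whitespace, so equal content always hashes the same.
    """
    encoded = json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()
```

Two configs that differ only in key order hash identically, while any change to a value changes the digest.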

Decisions:

  • TODO(decision): schema migrations use an in-repo versioned migration table instead of Alembic. PHASE_PLAN P1-09 asked for Alembic; the lighter mechanism satisfies RW-16 (version stored, newer-schema error, backup-before-migrate) with less machinery. Revisit if the schema grows.

Needs the account owner (cannot be done from this machine):

  • Create the GitHub repo and push (P0-01)
  • Configure PyPI Trusted Publishing and publish the 0.0.1 placeholder (P0-12)

Next: Phase 2, retrieval metrics and compare.