
Conversation

bradleyshep commented Oct 25, 2025

Description of Changes

Introduce a new LLM benchmarking app and supporting code.

  • CLI: llm with subcommands run, routes list, diff, ci-check.
  • Runner: executes globally numbered tasks; filters by --lang, --categories, --tasks, --providers, --models.
  • Providers/clients: route layer (provider:model; sketched below) with HTTP LLM vendor clients; env-driven keys/base URLs.
  • Evaluation: deterministic scorers (hash/equality, JSON shape/count, light schema/reducer parity) with clear failure messages.
  • Results: stable JSON schema; single-file HTML viewer to inspect/filter/export CSV.
  • Build & guards: build script for compile-time setup, plus the spacetime guard module.
  • Docs: DEVELOP.md includes cargo llm … usage.

This PR is the initial addition of the app and its modules (runner, config, routes, prompt/segmentation, scorers, schema/types, defaults/constants/paths/hashing/combine, publishers, spacetime guard, HTML stats viewer).
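
For illustration, here is the route sketch promised in the Providers/clients bullet: parsing a provider:model string. The Route type and its field names are hypothetical, not the PR's actual API:

```rust
/// Hypothetical route type for a provider:model pair such as openai:gpt-5.
#[derive(Debug, Clone, PartialEq)]
struct Route {
    provider: String,
    model: String,
}

impl std::str::FromStr for Route {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.split_once(':') {
            Some((provider, model)) if !provider.is_empty() && !model.is_empty() => Ok(Route {
                provider: provider.to_owned(),
                model: model.to_owned(),
            }),
            _ => Err(format!("expected provider:model, got `{s}`")),
        }
    }
}

fn main() {
    let route: Route = "openai:gpt-5".parse().unwrap();
    assert_eq!(route.provider, "openai");
}
```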

How it works

  1. Pick what to run

    • Choose tasks (--tasks 0,7,12), or a language (--lang rust|csharp), or categories (--categories basics,schema).
    • Optionally limit vendors/models (--providers …, --models …); flag parsing is sketched after this list.
  2. Resolve routes

    • Read env (API keys + base URLs) and build the active set (e.g., openai:gpt-5); route resolution is sketched after this list.
  3. Build context

    • Start Spacetime.
    • Publish golden answer modules.
    • Prepare prompts and send them to the LLM.
    • Attempt to publish the LLM-generated module (publishing is sketched after this list).
  4. Execute calls

    • Run each selected task against every selected route and language.
  5. Score outputs

    • Apply deterministic scorers (hash/equality, JSON shape/count, simple schema/reducer checks); example scorers are sketched after this list.
    • Record the score and any short failure reason.
  6. Update results file

    • Write/update the single results JSON with task/route outcomes, timings, and summaries (shape sketched after this list).
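
For step 1, a minimal sketch of what the run subcommand's filter flags could look like with clap's derive API. The struct and its defaults are illustrative, not the PR's actual definitions:

```rust
use clap::Parser;

/// Illustrative mirror of the `run` filters; field names follow the flags above.
#[derive(Parser, Debug)]
struct RunArgs {
    /// Globally numbered tasks, e.g. --tasks 0,7,12.
    #[arg(long, value_delimiter = ',')]
    tasks: Option<Vec<usize>>,
    /// Module language: rust or csharp.
    #[arg(long)]
    lang: Option<String>,
    /// Task categories, e.g. --categories basics,schema.
    #[arg(long, value_delimiter = ',')]
    categories: Option<Vec<String>>,
    /// Restrict to these providers.
    #[arg(long, value_delimiter = ',')]
    providers: Option<Vec<String>>,
    /// Restrict to these provider:model routes (space-separated, as in the examples).
    #[arg(long, value_delimiter = ' ')]
    models: Option<Vec<String>>,
}

fn main() {
    println!("{:?}", RunArgs::parse());
}
```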
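
For step 2, a sketch of env-driven route activation, assuming each provider is gated on an API-key variable; the <PROVIDER>_API_KEY naming convention is an assumption, not taken from the PR:

```rust
use std::env;

/// Assumed convention: a provider is active only if <PROVIDER>_API_KEY is set.
fn provider_enabled(provider: &str) -> bool {
    env::var(format!("{}_API_KEY", provider.to_uppercase()))
        .map(|v| !v.is_empty())
        .unwrap_or(false)
}

/// Build the active provider:model set from candidate pairs.
fn active_routes(candidates: &[(&str, &str)]) -> Vec<String> {
    candidates
        .iter()
        .filter(|(provider, _)| provider_enabled(provider))
        .map(|(provider, model)| format!("{provider}:{model}"))
        .collect()
}

fn main() {
    let routes = active_routes(&[("openai", "gpt-5"), ("anthropic", "claude-sonnet-4-5")]);
    println!("{routes:?}"); // e.g. ["openai:gpt-5"] if only OPENAI_API_KEY is set
}
```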
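
For step 3, one way the runner could shell out to the spacetime CLI to publish a golden or LLM-generated module; the exact invocation and flags the PR uses may differ:

```rust
use std::process::Command;

/// Illustrative only: publish a module directory to a named database.
/// A failed publish can then be treated as a task failure rather than a crash.
fn publish_module(project_dir: &str, db_name: &str) -> std::io::Result<bool> {
    let status = Command::new("spacetime")
        .args(["publish", "--project-path", project_dir, db_name])
        .status()?;
    Ok(status.success())
}
```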
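
For step 5, minimal sketches of two deterministic scorers in the spirit of the ones listed above; the real scorers and their failure messages live in the PR:

```rust
use serde_json::Value;

/// Equality check: pass iff the canonical outputs match exactly.
fn score_equality(expected: &str, actual: &str) -> Result<(), String> {
    if expected == actual {
        Ok(())
    } else {
        Err(format!(
            "output mismatch: expected {} bytes, got {} bytes",
            expected.len(),
            actual.len()
        ))
    }
}

/// JSON count check: pass iff the output parses and has the expected element count.
fn score_json_count(output: &str, expected: usize) -> Result<(), String> {
    let value: Value = serde_json::from_str(output).map_err(|e| format!("invalid JSON: {e}"))?;
    match value.as_array() {
        Some(rows) if rows.len() == expected => Ok(()),
        Some(rows) => Err(format!("expected {expected} elements, got {}", rows.len())),
        None => Err("expected a JSON array".into()),
    }
}
```

Returning Result<(), String> keeps the short failure reason from step 5's second bullet attached to the score.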
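
For step 6, a hypothetical shape for one entry in the results file; the PR defines the actual stable schema, so treat these field names as placeholders:

```rust
use serde::{Deserialize, Serialize};

/// Placeholder result record; the PR's schema/types module is authoritative.
#[derive(Serialize, Deserialize, Debug)]
struct TaskResult {
    task: usize,                    // globally numbered task id
    route: String,                  // e.g. "openai:gpt-5"
    lang: String,                   // "rust" or "csharp"
    passed: bool,
    failure_reason: Option<String>, // short reason recorded by the scorer
    duration_ms: u64,               // timing rolled into summaries
}
```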

API and ABI breaking changes

None. New application and modules; no existing public APIs/ABIs altered.

Expected complexity level and risk

4/5. New CLI, routing, evaluation, and artifact format.

  • External model APIs may rate-limit/timeout; concurrency tunable via LLM_BENCH_CONCURRENCY / LLM_BENCH_ROUTE_CONCURRENCY.
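
A sketch of how those knobs might be read; only the variable names come from the PR, and the defaults are assumptions:

```rust
use std::env;

/// Parse a concurrency knob from the environment, falling back to a default.
fn concurrency_from_env(var: &str, default: usize) -> usize {
    env::var(var).ok().and_then(|v| v.parse().ok()).unwrap_or(default)
}

fn main() {
    // Variable names from the PR; the defaults (4 and 2) are assumptions.
    let task_conc = concurrency_from_env("LLM_BENCH_CONCURRENCY", 4);
    let route_conc = concurrency_from_env("LLM_BENCH_ROUTE_CONCURRENCY", 2);
    println!("tasks: {task_conc}, routes: {route_conc}");
}
```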

Testing

I ran the full test matrix and generated results for every task against every vendor, model, and language (Rust and C#). I also tested the CI check locally using act.

Please verify

  • llm run --tasks 0,1,2 (explicit run)
  • llm run --lang rust --categories basics (filters)
  • llm run --categories basics,schema (multiple categories)
  • llm run --lang csharp (language switch)
  • llm run --providers openai,anthropic --models "openai:gpt-5 anthropic:claude-sonnet-4-5" (provider/model limits)
  • llm run --hash-only (dry integrity)
  • llm run --goldens-only (test goldens only)
  • llm run --force (skip hash check)
  • llm ci-check
  • Stats viewer loads the JSON; filtering and CSV export work
  • CI works as intended

bfops added the release-any (To be landed in any release window) label Oct 27, 2025
bradleyshep and others added 6 commits November 3, 2025 13:11
…ain permissions

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Signed-off-by: bradleyshep <[email protected]>