ïdea Bench

ïdea Bench is a self-hosted tool for running blind head-to-head evaluations of LLM output with real voters or simulated personas. Build a campaign, compare models, system prompts, or prompt variants, and turn the votes into Bradley-Terry ratings you can defend in a meeting. Works with any OpenRouter-backed model.

Status: v0.1.0 — public alpha. The single-operator self-hosted loop is real: create campaigns, generate or paste contestant outputs, collect blind votes, and compute ratings. Team workspaces, billing, and hosted SaaS operations are deliberately out of scope.

The voting surface proves the central trust promise: voters can compare outputs without seeing the model, prompt, or contestant identity.

Watch the 30-second trailer · Read the changelog · Reproduce the rating proof · Read the failure modes

What this is not

Not a hosted evaluation SaaS or public benchmark leaderboard. You run it against your own Postgres database and model provider.
Not a replacement for human judgment. It gives you structured preference evidence; you still decide what the evidence means.
Not multi-tenant team software yet. One operator per deployment is the current product shape.

What is real vs. simulated

Area	Status	What is real	What is simulated or absent
Campaign loop	Real	Create campaigns, add contestants, activate voting links, collect votes, recompute ratings.	Demo seed data is synthetic.
Blindness	Real in app/API	Voters do not receive model, prompt, or contestant identity before reveal.	Operator-provided text can still leak identity if it names the model.
Ratings	Real	Pairwise votes run through the Bradley-Terry solver with confidence intervals.	Early samples can still be directional; see failure modes.
Simulated personas	Real feature	Operators can run model-judged simulated votes when AI spend is enabled.	Simulated votes are not human preference evidence.
Hosted/team SaaS	Out of scope	Self-hosted single-operator deployments.	Workspaces, billing, RBAC, and shared-team governance.

Choose your path

Run a private model evaluation: start with the quickstart and seed data.
Check the trust model: read up on blind voting, operator auth, the AI spend gate, and the failure modes.
Verify the math: run the deterministic rating proof.
Work on the code: read AGENTS.md and the command registry before changing server or database paths.

What it does

Blind A/B voting. Voters compare two generations side by side. The model name, system prompt, and any contestant metadata are hidden until the campaign closes, and neither the DOM nor the API exposes them before then. The one leak the app cannot prevent is contestant text that names its own model.
Three kinds of contestants in one engine. Compare models against each other, compare system prompts on a fixed model, or compare prompt variants. Same blind-voting UI, same rating math.
Bradley-Terry ratings + group alignment. Pairwise votes feed a Bradley-Terry maximum-likelihood model. The campaign view shows ratings, confidence intervals, and how each voter group aligned with the overall result.
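
One way to picture the blind-voting guarantee: the server builds the voter payload from contestant output only and keeps the slot-to-contestant mapping to itself. This is a minimal sketch, not the project's code; `Contestant`, `BlindBallot`, `BallotKey`, and `toBlindBallot` are hypothetical names chosen for illustration.

```typescript
// Hypothetical shapes for illustration; the real schema lives in src/server/.
interface Contestant {
  id: string;
  model: string;
  systemPrompt: string;
  output: string;
}

// What the voter's browser receives: anonymous slots and text, nothing else.
interface BlindBallot {
  slots: { slot: "A" | "B"; output: string }[];
}

// Stays on the server: maps each anonymous slot back to a contestant id.
type BallotKey = Record<"A" | "B", string>;

function toBlindBallot(
  first: Contestant,
  second: Contestant,
  swap: boolean, // e.g. Math.random() < 0.5, so position does not hint at identity
): { ballot: BlindBallot; key: BallotKey } {
  const a = swap ? second : first;
  const b = swap ? first : second;
  return {
    ballot: {
      slots: [
        { slot: "A", output: a.output },
        { slot: "B", output: b.output },
      ],
    },
    key: { A: a.id, B: b.id },
  };
}
```

Only `ballot` would be serialized to the voter; `key` is what a vote for slot A or B gets resolved against. A sketch like this covers metadata only, so output text that mentions its own model still leaks.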

The campaign dashboard turns pairwise votes into ratings with confidence intervals and voter-group alignment, so the result can survive a real decision meeting.
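
For readers who want to see the rating step concretely: Bradley-Terry assumes P(i beats j) = p_i / (p_i + p_j) and picks the strengths p that maximize the likelihood of the observed votes. The sketch below uses the classic iterative update p_i = wins_i / Σ_j n_ij / (p_i + p_j). It is an illustration under stated assumptions, not the project's solver: `Vote` and `fitBradleyTerry` are hypothetical names, and confidence intervals are omitted.

```typescript
interface Vote {
  winner: string;
  loser: string;
}

// Returns strengths normalized to sum to 1; P(i beats j) = p[i] / (p[i] + p[j]).
function fitBradleyTerry(votes: Vote[], iterations = 200): Map<string, number> {
  const wins = new Map<string, number>();
  // games.get(i).get(j) = number of comparisons between i and j (symmetric).
  const games = new Map<string, Map<string, number>>();

  const addGame = (i: string, j: string): void => {
    const row = games.get(i) ?? new Map<string, number>();
    row.set(j, (row.get(j) ?? 0) + 1);
    games.set(i, row);
  };

  for (const { winner, loser } of votes) {
    wins.set(winner, (wins.get(winner) ?? 0) + 1);
    if (!wins.has(loser)) wins.set(loser, 0);
    addGame(winner, loser);
    addGame(loser, winner);
  }

  // Start every contestant at the same strength.
  let strength = new Map<string, number>();
  for (const id of wins.keys()) strength.set(id, 1);

  for (let step = 0; step < iterations; step++) {
    const next = new Map<string, number>();
    let total = 0;
    for (const [i, winCount] of wins) {
      let denominator = 0;
      for (const [j, n] of games.get(i)!) {
        denominator += n / (strength.get(i)! + strength.get(j)!);
      }
      const updated = denominator > 0 ? winCount / denominator : 0;
      next.set(i, updated);
      total += updated;
    }
    // Renormalize so the strengths stay on a fixed scale between steps.
    for (const [i, value] of next) next.set(i, value / total);
    strength = next;
  }
  return strength;
}
```

With three votes for A over B and one for B over A, this converges to strengths 0.75 and 0.25, matching the observed 3:1 win ratio. A contestant with zero wins ends up at strength 0 in this sketch, which is one reason small samples should be read as directional.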

How the loop works

flowchart LR
  A["Prompt + contestants"] --> B["Blind ballot"]
  B --> C["Human or simulated votes"]
  C --> D["Bradley-Terry rating"]
  D --> E["Decision-ready evidence"]
  E --> F["Reveal model identities"]

Quickstart

git clone https://github.com/Christian-Katzmann/idea-bench.git
cd idea-bench
npm install
cp .env.example .env.local
# Fill in DATABASE_URL, OPERATOR_PASSWORD, AUTH_SECRET.
# Generate AUTH_SECRET with:  openssl rand -hex 32

npm run db:migrate     # apply schema
npm run db:seed        # load demo campaigns (destructive; dev DB only)
npm run dev            # http://localhost:3000

db:seed prints the share slugs it created — jump straight into the participant flow at http://localhost:3000/vote/<slug>.

Database. Any normal Postgres works: local Docker, Supabase, Neon, RDS, or another managed host with a postgres:// URL. Neon is a good Vercel-friendly option, but it is not required.

Models. OpenRouter is the default provider — one API key, any model. $5 of credit goes a long way for evaluation work.

Prerequisites. Node.js 20+ (24 LTS recommended), Postgres, OpenRouter API key (for AI features).
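
A small preflight check can catch a half-filled .env.local before the first migration. The sketch below is illustrative only: `missingEnv` is a hypothetical helper, not a script shipped in this repo, and it checks just the three variables the quickstart names.

```typescript
// Variables the quickstart says must be filled in before running migrations.
const REQUIRED_ENV = ["DATABASE_URL", "OPERATOR_PASSWORD", "AUTH_SECRET"] as const;

// Returns the names of required variables that are unset or blank.
function missingEnv(env: Record<string, string | undefined>): string[] {
  return REQUIRED_ENV.filter((name) => !env[name]?.trim());
}

// Possible usage in a Node script (assumes the env file is already loaded):
//   const missing = missingEnv(process.env);
//   if (missing.length > 0) throw new Error(`Missing env vars: ${missing.join(", ")}`);
```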

Operator auth

Three sign-in methods, all issuing the same operator_session cookie (HMAC-signed, 30-day expiry). Enable the ones you want by populating the relevant env vars; anything unset stays hidden in the UI.

Method	Env vars	Notes
Password	`OPERATOR_PASSWORD`	Always available. Constant-time compare + 400ms delay on mismatch.
GitHub OAuth	`GITHUB_OAUTH_CLIENT_ID`, `GITHUB_OAUTH_CLIENT_SECRET`, `OPERATOR_GITHUB_LOGINS`	Register an OAuth App at `github.com/settings/developers` with callback `${origin}/api/auth/github-callback`. The allowlist matches either the GitHub login or any verified email on the account.
Email magic link	`OPERATOR_EMAILS`, `RESEND_API_KEY`, optional `RESEND_SENDER_ADDRESS`	Resend-backed. 15-min single-use tokens; only `sha256(token)` is stored. Sender defaults to the Resend sandbox; set `RESEND_SENDER_ADDRESS=auth@your-domain` once your domain is verified in Resend.
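
To make the shared session cookie concrete, here is a minimal sketch of an HMAC-signed value with a 30-day expiry, using Node's built-in `crypto` module. The cookie layout (`base64url(payload).signature`) and the names `SessionPayload`, `issueSession`, and `verifySession` are assumptions for illustration; the project's actual format may differ.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical payload shape; only `identity` is named in this README.
interface SessionPayload {
  identity: string;
  expiresAt: number; // epoch milliseconds
}

const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000;

function sign(data: string, secret: string): string {
  return createHmac("sha256", secret).update(data).digest("base64url");
}

function issueSession(identity: string, secret: string, now: number = Date.now()): string {
  const payload: SessionPayload = { identity, expiresAt: now + THIRTY_DAYS_MS };
  const body = Buffer.from(JSON.stringify(payload)).toString("base64url");
  return `${body}.${sign(body, secret)}`;
}

function verifySession(
  cookie: string,
  secret: string,
  now: number = Date.now(),
): SessionPayload | null {
  const [body, signature] = cookie.split(".");
  if (!body || !signature) return null;

  // Compare signatures in constant time; lengths must match first.
  const expected = Buffer.from(sign(body, secret));
  const actual = Buffer.from(signature);
  if (expected.length !== actual.length || !timingSafeEqual(expected, actual)) return null;

  // Only bodies with a valid signature are parsed.
  const payload = JSON.parse(Buffer.from(body, "base64url").toString("utf8")) as SessionPayload;
  return payload.expiresAt > now ? payload : null;
}
```

Because the signature depends on the secret, a cookie issued under one secret fails verification under another, which is why rotating the secret signs everyone out at once.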

Per-IP rate limiting (5 attempts / 15 min) guards the OAuth callback and the magic-link send/verify endpoints. Rotating AUTH_SECRET invalidates every outstanding cookie.
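
The 5 attempts / 15 min rule can be pictured as a sliding window per IP. The sketch below keeps counters in process memory purely for illustration; this README does not say where the project stores them, and `createAttemptLimiter` is a hypothetical name.

```typescript
// Returns a function that reports whether a new attempt from `ip` is allowed.
function createAttemptLimiter(maxAttempts = 5, windowMs = 15 * 60 * 1000) {
  const attemptsByIp = new Map<string, number[]>();

  return function allow(ip: string, now: number = Date.now()): boolean {
    // Keep only the attempts that still fall inside the window.
    const recent = (attemptsByIp.get(ip) ?? []).filter((t) => now - t < windowMs);
    const allowed = recent.length < maxAttempts;
    if (allowed) recent.push(now); // rejected attempts are not recorded
    attemptsByIp.set(ip, recent);
    return allowed;
  };
}
```

In-memory state like this does not carry across separate function instances, so a deployed limiter would likely need shared storage.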

AI spend gate

Login and AI spend are two separate allowlists:

OPERATOR_* decides who can sign in.
AI_ALLOWED_IDENTITIES decides which of those signed-in operators can trigger OpenRouter calls (generate, simulated-runs/run, personas/test).

AI_ALLOWED_IDENTITIES is a comma-separated list, matched case-insensitively against the session's identity field. An empty or unset list fails closed: AI endpoints return 503 ai_not_configured instead of opening up. Password sessions have the identity 'operator' (a shared literal, not a person), so password logins are implicitly blocked from AI; sign in with GitHub or email when you need to spend.
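
The fail-closed behaviour can be sketched in a few lines. Only the 503 ai_not_configured response comes from this README; the 403 status, the `ai_forbidden` code, and the names `parseAllowlist` and `checkAiAccess` are assumptions made for the example.

```typescript
type AiGateResult =
  | { ok: true }
  | { ok: false; status: 403 | 503; error: string };

// Splits the comma-separated env value into a normalized, lowercased set.
function parseAllowlist(raw: string | undefined): Set<string> {
  const entries = (raw ?? "")
    .split(",")
    .map((entry) => entry.trim().toLowerCase())
    .filter((entry) => entry.length > 0);
  return new Set(entries);
}

function checkAiAccess(identity: string, rawAllowlist: string | undefined): AiGateResult {
  const allowed = parseAllowlist(rawAllowlist);
  // Fail closed: no configured identities means nobody may spend.
  if (allowed.size === 0) {
    return { ok: false, status: 503, error: "ai_not_configured" };
  }
  if (!allowed.has(identity.trim().toLowerCase())) {
    // Status and error code here are hypothetical.
    return { ok: false, status: 403, error: "ai_forbidden" };
  }
  return { ok: true };
}
```

In this sketch a password session, whose identity is the literal 'operator', is rejected whenever that literal is absent from the list.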

Architecture

Frontend. Vite SPA in src/ (React + TypeScript). Tailwind + a small in-repo design system; see docs/design-system/DESIGN-SYSTEM.md.
API. Vercel Functions in api/, deployable as Fluid Compute. Most routes flow through a single dispatcher.
Database. Postgres via the postgres npm package + Drizzle ORM. Schema in src/server/db/schema.ts, migrations in drizzle/.
Server boundary. Domain logic, auth, OpenRouter integration, and rating math all live in src/server/ — strictly server-only. Client code must not import from src/server/**. See src/server/README.md for the contract.
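
As a rough picture of what "most routes flow through a single dispatcher" can mean, here is a generic sketch of a route table with one entry point. Everything in it is hypothetical: the `Route` shape, `matchPath`, `dispatch`, and the example path are not taken from api/ and only illustrate the pattern.

```typescript
type Handler = (params: Record<string, string>) => { status: number; body: unknown };

interface Route {
  method: string;
  pattern: string; // e.g. "/api/campaigns/:id" (hypothetical path)
  handler: Handler;
}

// Matches "/a/:id" against "/a/42" and returns { id: "42" }, or null on mismatch.
function matchPath(pattern: string, path: string): Record<string, string> | null {
  const patternParts = pattern.split("/").filter(Boolean);
  const pathParts = path.split("/").filter(Boolean);
  if (patternParts.length !== pathParts.length) return null;

  const params: Record<string, string> = {};
  for (let i = 0; i < patternParts.length; i++) {
    const expected = patternParts[i];
    const actual = pathParts[i];
    if (expected.startsWith(":")) {
      params[expected.slice(1)] = actual;
    } else if (expected !== actual) {
      return null;
    }
  }
  return params;
}

// Single entry point: the first route whose method and pattern match wins.
function dispatch(
  routes: Route[],
  method: string,
  path: string,
): { status: number; body: unknown } {
  for (const route of routes) {
    if (route.method !== method) continue;
    const params = matchPath(route.pattern, path);
    if (params) return route.handler(params);
  }
  return { status: 404, body: { error: "not_found" } };
}
```

Funnelling requests through one function like this gives a single place to apply cross-cutting checks such as session verification before any handler runs.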

Scripts

Script	Purpose
`npm run dev`	Start Vite dev server.
`npm run verify`	Run typecheck, Vitest, and production build.
`npm run build`	Production build.
`npm run lint`	Type-check with `tsc --noEmit`.
`npm run test:run`	Run the Vitest suite once.
`npm run db:generate`	Diff schema → new SQL migration in `drizzle/`.
`npm run db:migrate`	Apply pending migrations to `DATABASE_URL`.
`npm run db:push`	Push schema directly to `DATABASE_URL` (dev only — skips migration history).
`npm run db:studio`	Launch Drizzle Studio.
`npm run db:seed`	Wipe and re-seed demo data. Refuses to run with `NODE_ENV=production` unless `ALLOW_PROD_SEED=1`.
`npm run db:seed-starter-personas`	Idempotently seed the curated starter persona library from `data/starter-personas.json`.

Roadmap

Where ïdea Bench is going next: docs/roadmap/.

Optional

Mac Dock launcher. docs/desktop-launcher.md — wrap the dev server as a clickable .app.

Contributing & license

See CONTRIBUTING.md for the dev loop, lint/test expectations, and branch hygiene.
Security reports go to the address in SECURITY.md.
Licensed under the terms in LICENSE.
