Which AI models write the best nonfiction?
ZinsserBench is a benchmark that tests how well language models write the kinds of nonfiction real people actually read: memos, explainers, profiles, how-to guides, opinion pieces, and personal essays. A panel of frontier AI judges scores every response on clarity, simplicity, brevity, structure, specificity, voice, and overall effectiveness.
The most recent completed run (2026-03-08-openrouter-v0-2-clean-3, benchmark version v0.2) tested 12 candidate models across 20 prompts. Scores are on a 1-to-5 scale. Higher is better.
The primary writing metric is now criteria average: for each judged response, ZinsserBench averages the six rubric criteria other than overall, then averages those item-level means across the benchmark for each model. The judges' explicit overall score is still reported as overall average, but it is now treated as a secondary diagnostic.
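The two-stage averaging can be sketched in a few lines of Python (the criterion keys and function names here are illustrative, not the repo's actual API):

```python
from statistics import mean

# Six rubric criteria per judged response, excluding the separate "overall"
# field. Key names are illustrative stand-ins for the real rubric IDs.
CRITERIA = ["clarity", "simplicity", "brevity", "structure",
            "specificity", "voice"]

def criteria_average(items):
    """Mean of per-item criteria means: first average the six criteria
    within each judged response, then average those item-level means."""
    item_means = [mean(item[c] for c in CRITERIA) for item in items]
    return mean(item_means)

items = [
    {"clarity": 5, "simplicity": 4, "brevity": 5, "structure": 5,
     "specificity": 4, "voice": 5},
    {"clarity": 4, "simplicity": 4, "brevity": 4, "structure": 5,
     "specificity": 5, "voice": 4},
]
print(round(criteria_average(items), 2))  # mean of 4.67 and 4.33 -> 4.5
```

Averaging within each item first keeps a single unusually long or short response from dominating one criterion's benchmark-wide mean.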
Top three by criteria average:
| Rank | Model | Criteria average |
|---|---|---|
| 1 | anthropic/claude-sonnet-4.6 | 4.81 |
| 2 | anthropic/claude-opus-4.6 | 4.79 |
| 3 | openai/gpt-5.3-chat | 4.79 |
Full results are in the detailed report.
The explicit overall score is still useful, but mostly as a check on how much judges' holistic impressions differ from the six scored criteria.
In the current v0.2 run, most models receive slightly higher overall scores than their criteria-based averages. That makes overall a useful diagnostic signal, but a weak choice for the headline leaderboard if the goal is to anchor rankings in the explicit rubric dimensions.
To see where each model is strong or weak, the analysis also includes an axis heatmap based on the averaged rubric criteria.
This is one of the most useful views in the report because it shows strengths and weaknesses by criterion, not just a single rolled-up score. The heatmap uses one global score-to-color scale across all cells, so the same numeric score always appears with the same color.
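A single global scale just means normalizing every cell against the same fixed bounds before coloring. This hand-rolled sketch uses a made-up white-to-blue palette rather than whatever plotting library the repo actually uses:

```python
def score_to_rgb(score, lo=1.0, hi=5.0):
    """Map a score on the fixed 1-5 scale to an RGB triple.

    Using the same (lo, hi) bounds for every cell is what makes colors
    comparable across models and criteria in the heatmap.
    """
    t = (min(max(score, lo), hi) - lo) / (hi - lo)  # clamp, then normalize
    # interpolate white (low) -> dark blue (high); palette is arbitrary
    return (int(255 * (1 - t)), int(255 * (1 - t)), int(255 - 100 * t))

print(score_to_rgb(1.0))  # (255, 255, 255): low end of the scale
print(score_to_rgb(5.0))  # (0, 0, 155): high end of the scale
```

The common alternative, normalizing each row or column to its own min and max, makes every row look dramatic but destroys cross-cell comparability.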
ZinsserBench also measures how closely each judge matches the rest of the panel. In this run, z-ai/glm-5 had the highest agreement, essentially tied with google/gemini-3.1-pro-preview and just ahead of openai/gpt-5.4.
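The exact agreement statistic lives in the repo; a plausible minimal version scores each judge by its mean absolute distance from the other judges' mean on the same items (lower distance, higher agreement). Names here are illustrative:

```python
from statistics import mean

def panel_distance(scores_by_judge):
    """scores_by_judge: {judge: [score per item]}, lists aligned by item.

    Returns {judge: mean |own score - mean of the other judges|}.
    A lower value means the judge tracks the rest of the panel more closely.
    """
    judges = list(scores_by_judge)
    n_items = len(next(iter(scores_by_judge.values())))
    result = {}
    for j in judges:
        diffs = []
        for i in range(n_items):
            others = [scores_by_judge[k][i] for k in judges if k != j]
            diffs.append(abs(scores_by_judge[j][i] - mean(others)))
        result[j] = mean(diffs)
    return result

panel = {"judge_a": [4, 5, 4], "judge_b": [4, 5, 5], "judge_c": [2, 3, 4]}
print(panel_distance(panel))  # judge_c is the outlier here
```

Note this is a leave-one-out comparison: each judge is measured against the panel *without* itself, so a judge is never rewarded for agreeing with its own scores.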
This is the first clean v0.2 run with a four-judge panel and same-company judgments excluded from scoring. It is much more reliable than the earlier salvage run, but it is still an early benchmark and should be treated as directional rather than definitive.
- Clean run. This leaderboard comes from a full fresh run at `max_output_tokens=10000`, not a repaired salvage copy.
- No outputs were excluded from scoring. There were 0 quarantined outputs in the final analysis.
- Judge panel. The panel is `openai/gpt-5.4`, `anthropic/claude-opus-4.6`, `google/gemini-3.1-pro-preview`, and `z-ai/glm-5`.
- Same-company judgments are skipped. The analysis records 140 skipped same-company judgments, and no candidate/prompt item was excluded for insufficient remaining judges.
- A small amount of sanitization still happens. `qwen/qwen3.5-35b-a3b` triggered 6 light sanitization warnings for leaked thinking prefixes. These were stripped before judging.
- Prompt count. 20 prompts across 6 categories. Enough to show meaningful patterns, not enough to claim high statistical precision.
The benchmark is named for William Zinsser (1922-2015), journalist, editor, teacher at Yale, and author of On Writing Well, one of the most widely read books on the craft of nonfiction. Zinsser argued that good nonfiction should be lucid, economical, concrete, and alive on the page. Strip the clutter. Use plain words. Be specific. Sound human.
ZinsserBench does not ask models to imitate Zinsser's voice. It uses his recurring principles as a practical standard for judging modern AI writing. The rubric scores seven dimensions:
| Dimension | What it means |
|---|---|
| Clarity | Easy to understand on a first read |
| Simplicity | Plain, direct language without puffed-up wording |
| Brevity and economy | No wasted space, repetition, or throat-clearing |
| Structure and flow | Ideas arrive in a logical, readable order |
| Specificity and precision | Concrete details instead of vague abstraction |
| Humanity and voice | Sounds written for people, not by committee |
| Overall effectiveness | Overall nonfiction quality for the task |
The goal is not literary imitation. The goal is strong public-facing prose.
A great deal of real-world AI writing is nonfiction: memos, consumer explainers, civic guides, service writing, profiles, and opinion pieces. These are everyday forms that people read to understand work, institutions, money, health, policy, and one another.
Nonfiction is also where weak writing habits become obvious. Models can hide behind flourish in fiction; they have less room to hide when they need to explain, persuade, or inform plainly.
- ZinsserBench gives every candidate model the same set of nonfiction writing prompts.
- The prompts cover several common forms: explainers, internal memos, profiles, practical how-to guidance, opinion writing, and personal nonfiction.
- A separate judge panel scores every response on the rubric above.
- The repo aggregates those scores into three headline views:
  - Criteria average -- the primary writing score, based on the six explicit rubric criteria and excluding the separate `overall` field.
  - Overall average -- a secondary diagnostic based on judges' explicit overall scores.
  - Judge quality -- how closely a judge agrees with the rest of the panel.
This makes the benchmark useful for two questions: Which models write the strongest nonfiction? and Which models are the most reliable judges of nonfiction quality?
Version v0.1 uses 20 prompts across six categories:
| Category | Count |
|---|---|
| Memo | 4 |
| Explainer | 4 |
| Profile | 3 |
| Service / how-to | 3 |
| Opinion / op-ed | 3 |
| Personal nonfiction | 3 |
Prompts are intentionally short and direct. They do not ask for Zinsser pastiche or elaborate stylistic role-play. The benchmark measures writing quality, not prompt-following on a baroque instruction set.
Everything below is for people who want to run ZinsserBench themselves.
```
benchmark_versions/<version>/
  prompts.json
  rubric.json
  models.json
  judges.json
runs/<run_name>/
  manifest.json
  outputs/
  judgments/
  analysis/
```
Benchmark versions are intended to be immutable. If you materially change prompts, rubric, or scoring policy, create a new version directory and rerun.
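Forking a version is just a directory copy. This throwaway sketch works in a temp directory rather than the real repo, with a stand-in `prompts.json`:

```shell
set -e
root=$(mktemp -d)
# stand-in for an existing version directory
mkdir -p "$root/benchmark_versions/v0.1"
echo '{}' > "$root/benchmark_versions/v0.1/prompts.json"
# fork a new version instead of mutating v0.1 in place
cp -R "$root/benchmark_versions/v0.1" "$root/benchmark_versions/v0.2"
# then edit prompts.json / rubric.json under v0.2 and rerun against v0.2
ls "$root/benchmark_versions"
```

Keeping v0.1 untouched means old run directories stay interpretable: every run can always be traced back to the exact prompts and rubric it was scored against.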
```
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
```

Or run directly from source:

```
PYTHONPATH=src python3 -m zinsserbench --help
```

Model selection is versioned:
- `benchmark_versions/<version>/models.json` -- candidate models
- `benchmark_versions/<version>/judges.json` -- judge panel
OpenRouter is the default backend.
```
cp .env.example .env
# edit .env and set OPENROUTER_API_KEY=...
```

```
zinsserbench run \
  --root . \
  --benchmark-version v0.1 \
  --run-name my-run \
  --backend openrouter \
  --generation-concurrency 4 \
  --judge-concurrency 4 \
  --reasoning-effort medium \
  --max-output-tokens 10000
```

The CLI loads `.env` and `.env.local` from `--root` on startup. Shell environment variables take precedence.
Runs are resumable. If work is partially complete, rerun with the same `--run-name` and only the missing artifacts are generated.
If a run is partly valid but contaminated by provider failures, copy it to a new run name before repairing it.
```
cp -R runs/old-run runs/new-run
```

Then remove the bad outputs and stale judgments from `runs/new-run/`, keeping unaffected outputs in place. Resume with the same benchmark version and the new `--run-name`; ZinsserBench will regenerate only the missing artifacts.
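The file names below are purely illustrative (real output and judgment names depend on the run); the shape of the repair is delete, then resume:

```shell
set -e
run=$(mktemp -d)                      # stand-in for runs/new-run
mkdir -p "$run/outputs" "$run/judgments"
# hypothetical artifacts: one healthy model, one contaminated one
touch "$run/outputs/good-model.json" "$run/outputs/bad-model.json"
touch "$run/judgments/bad-model__judge.json"
# drop the contaminated output and any judgment that references it
rm "$run/outputs/bad-model.json" "$run/judgments/"*bad-model*
ls "$run/outputs"
```

Deleting the stale judgments matters as much as deleting the bad outputs: a resumed run only fills gaps, so a judgment of a truncated response would otherwise survive into the final analysis.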
Each stage skips artifacts that already exist.
```
zinsserbench generate --root . --benchmark-version v0.1 --run-name my-run --backend openrouter
zinsserbench judge --root . --benchmark-version v0.1 --run-name my-run --backend openrouter
zinsserbench analyze --root . --run-name my-run
```

The repo includes defensive handling for provider quirks observed in live runs:
- Some providers return `content: null` after spending tokens on reasoning.
- Some providers need a retry with reasoning disabled and a larger token budget.
- `429` responses with `retry_after_seconds` are honored, not treated as fatal.
- OpenRouter requests require provider parameter support so routing does not silently ignore requested controls.
- Judge calls stay in JSON mode.
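A minimal sketch of that retry policy, using a stand-in `call` function and simplified response dicts rather than the repo's real client:

```python
import time

def call_with_retries(call, max_attempts=4):
    """Retry transient failures, honoring server-suggested delays.

    `call` stands in for one provider request. In this sketch it returns a
    dict that may signal a rate limit as
    {"status": 429, "retry_after_seconds": n}
    or an empty completion as {"content": None}.
    """
    reasoning = True
    for _ in range(max_attempts):
        resp = call(reasoning=reasoning)
        if resp.get("status") == 429:
            # honor the server's suggested delay instead of failing
            time.sleep(resp.get("retry_after_seconds", 1))
            continue
        if resp.get("content") is None:
            # provider burned the budget on reasoning; retry without it
            reasoning = False
            continue
        return resp["content"]
    raise RuntimeError("gave up after retries")

# simulated provider: rate-limited once, empty once, then a real answer
responses = iter([
    {"status": 429, "retry_after_seconds": 0},
    {"content": None},
    {"content": "ok"},
])
print(call_with_retries(lambda reasoning: next(responses)))  # prints "ok"
```

The key design choice is that an empty completion flips a flag for the *next* attempt rather than failing outright, which matches the "retry with reasoning disabled" behavior described above.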
When OpenRouter supports reasoning controls for a model, ZinsserBench sends reasoning effort while excluding returned reasoning blocks by default.
After a run, `runs/<run_name>/analysis/` contains:

- `REPORT.md` -- human-readable summary
- `summary.json`
- `quarantined_outputs.csv`
- `exact_cap_hits.csv`
- `truncation_warnings.csv`
- `sanitization_warnings.csv`
- `skipped_same_company_judgments.csv`
- `excluded_for_insufficient_judges.csv`
- `response_lengths_by_model.csv`
- `writing_by_model.csv` -- the leaderboard with `criteria_average`, `overall_average`, and the gap between them
- `writing_by_model_axis.csv`, `writing_by_model_category.csv`, `writing_by_model_prompt.csv`
- `writing_by_prompt_axis.csv`
- `judge_quality.csv`
- `model_prompt_details.csv` -- the main drill-down table for a specific model + prompt
- Headline SVG charts plus comparison/drill-down SVGs such as `criteria_average.svg`, `overall_average.svg`, `overall_vs_criteria.svg`, `criteria_minus_overall.svg`, and `axis_heatmap.svg`
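The leaderboard CSV is easy to post-process. This self-contained sketch sorts by `criteria_average` and reports each model's overall-minus-criteria gap; the inline data stands in for a real `writing_by_model.csv`, whose exact columns may differ:

```python
import csv
import io

# inline stand-in for runs/<run_name>/analysis/writing_by_model.csv
raw = """model,criteria_average,overall_average
model-a,4.81,4.85
model-b,4.79,4.84
model-c,4.62,4.70
"""

rows = list(csv.DictReader(io.StringIO(raw)))
rows.sort(key=lambda r: float(r["criteria_average"]), reverse=True)
for rank, row in enumerate(rows, start=1):
    gap = float(row["overall_average"]) - float(row["criteria_average"])
    print(rank, row["model"], row["criteria_average"], f"gap={gap:+.2f}")
```

A consistently positive gap across models is exactly the pattern described earlier: judges' holistic impressions running slightly ahead of their criterion-by-criterion scores.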
```
PYTHONPATH=src python3 -m unittest discover -s tests -v
```