Harness engineering for AI-assisted extended research work.
In February 2026, Mitchell Hashimoto gave a name to a practice that production AI teams had been converging on independently: harness engineering, the idea that anytime an AI agent makes a mistake, you engineer a solution so that mistake never happens again. Days later, OpenAI described how three engineers used a harness to produce over a million lines of code without manually typing a single line. The term spread because it described something real.
This repository contains a research harness: thirteen protocols and templates developed over 130+ collaborative sessions across a five-chapter doctoral thesis, plus a systematic stress-testing campaign in April 2026 that surfaced failure modes the earlier protocols could not see. Every protocol exists because something went wrong in practice — a rejected draft, a fabricated citation, a figure plotting wrong values, a session that lost track of what the previous session decided, a review pass that passed a draft which a different model immediately shredded.
Originally developed for doctoral thesis writing, the underlying patterns apply to any extended AI-assisted intellectual work — literature reviews, grant proposals, long-form reports, technical documentation, book chapters — anywhere a researcher collaborates with an LLM across many sessions on a single structured artifact.
The harness engineering conversation is currently dominated by coding agents. But the core problems are domain-agnostic:
| Problem | Coding agents | Research agents |
|---|---|---|
| No memory between sessions | Agent forgets codebase context | Agent forgets what was written, decided, and verified |
| Plausible but wrong output | Code compiles but has bugs | Prose reads well but contains fabricated claims |
| Systematic, recurring errors | Same architectural violations | Same epistemic failures (unverified comparisons, agent-action confusion) |
| Verification requires structure | Linters, tests, CI | Data verification tables, three-stage review, citation pipelines |
| Self-review fails | Agent misses bugs in its own code | Agent misses errors in its own prose (self-preference bias) |
| Context window limits | Large codebases exceed context | Multi-chapter documents exceed context |
Coding agents have linters, test suites, and cross-model code review in production pipelines. Research agents have nothing comparable — until now. This toolkit is the linter, the test suite, the CI pipeline, and the adversarial reviewer for extended research work with AI.
Thirteen protocols organized by the phase of work they govern.

**Session and state management**
| File | Role |
|---|---|
| Session_Management_Protocol.md | The INDEX file is the AGENTS.md of a research project. Handoff documents persist state across sessions. Startup/shutdown protocols enforce that the agent reads before it writes (sketch after this table). |
| Chapter_Operations_Template.md | Chapter-level coordination: structure, decisions, conventions, primary data file references, and error triggers. Read at every session. |
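Both halves of the session protocol can be scripted. A minimal sketch, assuming a `Knowledge_Base/` directory containing `INDEX.md` and `HANDOFF_*.md` files (hypothetical names; Session_Management_Protocol.md defines the real layout):

```python
from datetime import date
from pathlib import Path

KB = Path("Knowledge_Base")  # hypothetical layout; the protocol file defines the real one

def write_handoff(session_id: str, decisions: list[str], next_steps: list[str]) -> Path:
    """Shutdown half: persist session state so the next session resumes without guessing."""
    KB.mkdir(exist_ok=True)
    path = KB / f"HANDOFF_{session_id}.md"
    lines = [f"# Handoff: session {session_id} ({date.today().isoformat()})", "", "## Decisions made"]
    lines += [f"- {d}" for d in decisions]
    lines += ["", "## Next steps"]
    lines += [f"- {s}" for s in next_steps]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path

def startup_context() -> str:
    """Startup half: raises if INDEX.md is missing, so the agent cannot write before it reads."""
    index = (KB / "INDEX.md").read_text(encoding="utf-8")
    handoffs = sorted(KB.glob("HANDOFF_*.md"))
    latest = handoffs[-1].read_text(encoding="utf-8") if handoffs else ""
    return index + "\n\n" + latest  # feed this to the agent before any writing
```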
**Pre-writing and planning**

| File | Role |
|---|---|
| Pre_Writing_Discussion_Protocol.md | Structured three-phase interview to extract the researcher's knowledge before writing. Free-form input, committee-member Q&A, decision recording. |
| Writing_Brief_Template.md | A 9-step plan for each section: boundary rules, upstream commitments, internalization check, paragraph-level outline with argumentative purposes, verified data table. A completeness check is sketched after this table. |
| Paper_Adaptation_Protocol.md | When the source is a published paper, the agent adapts rather than generates. Side-by-side verification classifies every change. |
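A brief's completeness can be checked mechanically before writing starts. A minimal sketch, matching on the section names listed above (illustrative; the template defines the full nine steps):

```python
from pathlib import Path

# Sections named in Writing_Brief_Template.md; the template defines the full nine-step list.
REQUIRED = ["boundary rules", "upstream commitments", "internalization check",
            "paragraph", "verified data"]

def check_brief(path: str) -> list[str]:
    """Return any required brief sections missing from the file."""
    text = Path(path).read_text(encoding="utf-8").lower()
    return [name for name in REQUIRED if name not in text]

missing = check_brief("brief_section_4_2.md")  # hypothetical filename
if missing:
    raise SystemExit(f"Brief incomplete; missing: {', '.join(missing)}")
```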
**Error prevention**

| File | Role |
|---|---|
| Error_Pattern_Library.md | 33 error patterns with a pre-write checklist the agent executes before producing any output. Wrong/right sentence pairs for every high-risk pattern. |
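Wiring the checklist into every call takes a few lines. A minimal sketch, assuming HTML-comment markers placed around the checklist section of your copy (the markers are a hypothetical convention, not part of the library file):

```python
from pathlib import Path

def build_system_prompt(base_prompt: str) -> str:
    """Prepend the pre-write checklist so the model executes it before drafting."""
    library = Path("Error_Pattern_Library.md").read_text(encoding="utf-8")
    # Hypothetical markers; add them around the checklist section in your copy.
    start = library.index("<!-- checklist:start -->")
    end = library.index("<!-- checklist:end -->")
    return library[start:end] + "\n\n" + base_prompt
```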
**Review**

| File | Role |
|---|---|
| Three_Stage_Review_Protocol.md | Post-draft review in three independent passes: cold read, researcher annotations, committee-member challenge. |
| Cross_Model_Review_Protocol.md | New (April 2026). Writer and reviewer must be different models from different families. Defeats self-preference bias (see Empirical Findings below; a review-call sketch follows this table). |
| Programmatic_Style_Audit_Protocol.md | New (April 2026). Full-file scan of style constraints, forbidden phrases, undefined abbreviations, inline arithmetic, and hedge density. Reference implementation: audit.py. |
| Postmortem_Protocol.md | When a draft is rejected entirely: 5-part investigation (chronology, diagnosis, root cause, reflection, action items). Feeds new patterns into the error library. |
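The cross-model protocol reduces to one hard constraint: the reviewing model must come from a different family than the writing model. A minimal sketch with the provider call left abstract (`call_model` is a hypothetical wrapper; substitute your Anthropic or Gemini SDK):

```python
WRITER_FAMILY = "claude"    # whichever family drafted the text
REVIEWER_FAMILY = "gemini"  # must differ from the writer's family

def call_model(family: str, prompt: str) -> str:
    """Hypothetical wrapper around your provider SDKs (Anthropic, Google, ...)."""
    raise NotImplementedError

def cross_model_review(draft: str) -> str:
    """Ask a model from a different family for an adversarial read of the draft."""
    assert REVIEWER_FAMILY != WRITER_FAMILY, "same-family review is a rubber stamp"
    prompt = (
        "You are a hostile committee member. List every factual, logical, and "
        "stylistic problem in the draft below. Do not praise.\n\n" + draft
    )
    return call_model(REVIEWER_FAMILY, prompt)
```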
**Citations and figures**

| File | Role |
|---|---|
| Citation_Management_System.md | Three-pipeline system: gap analysis with dual-AI cross-verification, citation verification against source papers, programmatic insertion (sketched below). |
| Figure_Generation_System.md | Raw data to publication-quality figures: data verification before plotting, saved reproducible scripts, researcher review gates, programmatic document integration (verification gate sketched below). |
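The insertion pipeline can be as simple as expanding verified placeholder keys. A minimal sketch, assuming `[CITE: key]` markers in the draft and a LaTeX target (both hypothetical conventions; the system file defines the real pipeline):

```python
import re

# Keys that already passed source verification (hypothetical examples).
VERIFIED_KEYS = {"smith2020", "lee2023"}

def insert_citations(text: str) -> str:
    """Replace [CITE: key] markers with \\cite{key}, but only for verified keys."""
    def repl(m: re.Match) -> str:
        key = m.group(1)
        if key not in VERIFIED_KEYS:
            raise ValueError(f"unverified citation key: {key}")
        return rf"\cite{{{key}}}"
    return re.sub(r"\[CITE:\s*([\w:-]+)\]", repl, text)
```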
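The first gate of the figure pipeline, verifying the data before any plotting call, might look like this (file name, column, and spot-check value are hypothetical):

```python
import pandas as pd

def verify_before_plot(csv_path: str, expected: dict[str, float], tol: float = 1e-6) -> pd.DataFrame:
    """Refuse to plot until spot-check values match the verified data table."""
    df = pd.read_csv(csv_path)
    for column, value in expected.items():
        actual = df[column].iloc[0]
        if abs(actual - value) > tol:
            raise ValueError(f"{column}: file has {actual}, verified table says {value}")
    return df

# Spot-check values copied by hand from the Writing Brief's verified data table.
df = verify_before_plot("results.csv", {"yield_pct": 87.3})  # hypothetical file and value
```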
**Task delegation**

| File | Role |
|---|---|
| TASK_File_Template.md | Self-contained instruction files for AI coding agents. Goal-oriented with context and decision frameworks, not cell-address-level micromanagement. |
Adopt the full workflow:
- Copy all thirteen files into your project directory.
- Start with Session_Management_Protocol.md: create a Knowledge Base directory and INDEX file.
- Before writing, conduct a discussion (Pre_Writing_Discussion_Protocol.md) and create a Writing Brief (Writing_Brief_Template.md).
- Copy the pre-write checklist from Error_Pattern_Library.md into your AI's system prompt.
- Write.
- Run Programmatic_Style_Audit_Protocol.md via audit.py on the draft (script sketch after this list).
- Run Cross_Model_Review_Protocol.md: the cross-model critic catches what your writing model will not.
- Run Three_Stage_Review_Protocol.md on the cross-model-reviewed draft.
- Revise and hand to the researcher.
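In script form, the mechanical step of this pipeline looks like the following; the review steps stay interactive because they need a model and a human in the loop (the draft path is hypothetical):

```python
import subprocess

DRAFT = "chapter4_draft.md"  # hypothetical path

# Mechanical style audit first: it is the cheapest gate and needs no model call.
subprocess.run(["python", "audit.py", DRAFT], check=True)

# Cross-model review, three-stage review, and revision follow interactively,
# per the protocols above.
```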
Adopt individual files where you see failures:
- Sessions feel disconnected → Session_Management_Protocol.md
- AI keeps making the same mistakes → Error_Pattern_Library.md
- Drafts need heavy rewriting → Writing_Brief_Template.md + Pre_Writing_Discussion_Protocol.md
- Citations are unreliable → Citation_Management_System.md
- Adapting published papers → Paper_Adaptation_Protocol.md
- Review passes miss things → Cross_Model_Review_Protocol.md
- Style consistency degrades across long sections → Programmatic_Style_Audit_Protocol.md
In April 2026 I stress-tested the original eleven protocols against a corpus of 527 prior conversations, 159 archived drafts, and 254 finalized thesis paragraphs. Two findings drove the addition of the two new protocols in this release.
Self-preference bias is not hypothetical. When the same language model writes and reviews, the review reports near-zero issues. In one test, Claude Opus reviewing 51 paragraphs it had itself written found zero issues; Claude Sonnet reviewing the identical paragraphs found 11; Claude Haiku found four. When Opus reviewed a paragraph written by Sonnet, it found 10 substantive issues. Gemini 3.1 Pro, reviewing the same paragraph independently, surfaced the four most consequential of those 10 findings despite sharing no training data with Claude. This is the motivation for Cross_Model_Review_Protocol.md.
Long-session style degradation is invisible to attention-bound review. The Error Pattern Library documented Pattern #31 (style abandonment after 5+ paragraphs) over a year before the stress test. What the stress test added was confirmation that Pattern #29 (lazy audit) combines with it to form an unreviewable failure: a Claude auditing its own 25-paragraph section will count roughly a third of the actual violations. A ten-line Python script counts all of them. This is the motivation for Programmatic_Style_Audit_Protocol.md.
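The ten-line script is essentially a phrase counter over the whole file. A minimal sketch (the forbidden-phrase list here is illustrative; Programmatic_Style_Audit_Protocol.md and audit.py define the real checks):

```python
import re, sys
from pathlib import Path

FORBIDDEN = ["delve", "it is important to note", "moreover"]  # illustrative only

text = Path(sys.argv[1]).read_text(encoding="utf-8").lower()
total = 0
for phrase in FORBIDDEN:
    hits = len(re.findall(re.escape(phrase), text))
    if hits:
        print(f"{phrase!r}: {hits}")
    total += hits
print(f"{total} violation(s) found")
```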
Both new protocols are generalizations from specific findings, not novel ideas. The self-preference bias phenomenon is documented in Zheng et al. (2023) and Panickssery et al. (2024); the programmatic-audit pattern is standard in software engineering. The contribution of this release is porting them into an integrated research-writing workflow alongside the eleven protocols that already existed.
Human-in-the-loop at every stage. The researcher is the intellectual driver. The AI structures, drafts, searches, and checks. The researcher decides, judges, verifies, and approves. This matches what the harness engineering community has converged on: humans steer, agents execute.
Earned through failure, not designed in advance. Every protocol exists because something went wrong. The 33 error patterns are not hypothetical. The postmortem protocol exists because drafts were rejected and rewritten from scratch. The cross-model review protocol exists because a same-model review reported zero issues on a paragraph a different model could shred in seconds.
Verification over trust. Pre-write checklists, data verification tables, side-by-side comparisons, three-stage reviews, dual-AI citation cross-verification, cross-model prose review, programmatic style audits. These exist because AI output cannot be trusted without independent checks — the same insight that drives linters, tests, and CI in coding harnesses.
Different model families for review. A Claude reviewing a Claude is not a review; it is a rubber stamp. A Claude reviewing a different Claude is a partial review. A Gemini reviewing a Claude (or vice versa) is the closest thing to an independent review available without involving a human. The Cross-Model Review Protocol operationalizes this.
- An AI assistant that can read and write files (or receive file contents in conversation).
- A text editor for markdown files.
- For citation, figure, and style-audit pipelines: basic familiarity with Python.
- For cross-model review: API access to at least one model outside the family of your primary writing model. Gemini API is the tested pairing when Claude is the writer; the principle applies in any cross-family direction.
- Platform-agnostic. Tested primarily with Claude and Gemini but applicable to any capable LLMs.
- Adds overhead. Trades speed for correctness. For short, low-stakes documents, the full workflow is unnecessary.
- Biased toward STEM dissertation writing in English. Protocols for humanities writing, non-English writing, or non-dissertation long-form work would need adaptation.
- Assumes the researcher has deep domain expertise. The harness cannot substitute for subject-matter judgment; it prevents AI-specific failure modes but does not certify scientific correctness.
- Cross-model review costs API calls on both sides. Budget accordingly.
- Mitchell Hashimoto, "My AI Adoption Journey" (February 2026) — origin of the term "harness engineering"
- OpenAI, "Harness engineering: leveraging Codex in an agent-first world" (February 2026) — the coding harness that sparked the conversation
- Birgitta Böckeler / Martin Fowler, "Harness Engineering" (February 2026) — guides and sensors framework
- Anthropic, "Effective harnesses for long-running agents" (November 2025) — state externalization and session management
- Zheng, L., et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS.
- Panickssery, A., Bowman, S. R., and Feng, S. (2024). LLM Evaluators Recognize and Favor Their Own Generations. NeurIPS.
- Wataoka, K., Takahashi, T., and Ri, R. (2024). Self-Preference Bias in LLM-as-a-Judge. arXiv:2410.21819.
MIT License. See LICENSE for details.
Li, Z. (2026). Research Harness: Harness Engineering for AI-Assisted Extended Research Work.
GitHub. https://github.com/AlbanLi0314/research-harness