Research Harness

Harness engineering for AI-assisted extended research work.

In February 2026, Mitchell Hashimoto gave a name to a practice that production AI teams had been converging on independently: harness engineering, the idea that any time an AI agent makes a mistake, you engineer a solution so that mistake never happens again. Days later, OpenAI described how three engineers used a harness to produce over a million lines of code without a single manually typed line. The term spread because it described something real.

This repository contains a research harness: thirteen protocols and templates developed over 130+ collaborative sessions across a five-chapter doctoral thesis, plus a systematic stress-testing campaign in April 2026 that surfaced failure modes the earlier protocols could not see. Every protocol exists because something went wrong in practice: a rejected draft, a fabricated citation, a figure plotting the wrong values, a session that lost track of what the previous session decided, a review pass that approved a draft a different model immediately shredded.

Originally developed for doctoral thesis writing, the underlying patterns apply to any extended AI-assisted intellectual work — literature reviews, grant proposals, long-form reports, technical documentation, book chapters — anywhere a researcher collaborates with an LLM across many sessions on a single structured artifact.

Why a Research Harness?

The harness engineering conversation is currently dominated by coding agents. But the core problems are domain-agnostic:

| Problem | Coding agents | Research agents |
| --- | --- | --- |
| No memory between sessions | Agent forgets codebase context | Agent forgets what was written, decided, and verified |
| Plausible but wrong output | Code compiles but has bugs | Prose reads well but contains fabricated claims |
| Systematic, recurring errors | Same architectural violations | Same epistemic failures (unverified comparisons, agent-action confusion) |
| Verification requires structure | Linters, tests, CI | Data verification tables, three-stage review, citation pipelines |
| Self-review fails | Agent misses bugs in its own code | Agent misses errors in its own prose (self-preference bias) |
| Context window limits | Large codebases exceed context | Multi-chapter documents exceed context |

Coding agents have linters, test suites, and cross-model code review in production pipelines. Research agents have nothing comparable — until now. This toolkit is the linter, the test suite, the CI pipeline, and the adversarial reviewer for extended research work with AI.

The Components

Thirteen protocols organized by the phase of work they govern.

A. Context Engineering (cross-session state)

| File | Role |
| --- | --- |
| Session_Management_Protocol.md | The INDEX file is the AGENTS.md of a research project. Handoff documents persist state across sessions. Startup/shutdown protocols enforce that the agent reads before it writes. |
| Chapter_Operations_Template.md | Chapter-level coordination: structure, decisions, conventions, primary data file references, and error triggers. Read at every session. |
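The handoff mechanism can be illustrated with a minimal sketch. The file layout and field names below are assumptions for illustration, not the protocol's actual schema:

```python
import json
from datetime import date
from pathlib import Path

def write_handoff(kb_dir, session_id, decisions, open_questions):
    """Shutdown step: persist end-of-session state so the next session
    starts from recorded facts, not from the model's (empty) memory."""
    kb = Path(kb_dir)
    kb.mkdir(parents=True, exist_ok=True)
    handoff = {
        "session": session_id,
        "date": date.today().isoformat(),
        "decisions": decisions,            # what was settled this session
        "open_questions": open_questions,  # what the next session must resolve
    }
    (kb / "HANDOFF.json").write_text(json.dumps(handoff, indent=2))

def read_handoff(kb_dir):
    """Startup step: the agent reads before it writes."""
    path = Path(kb_dir) / "HANDOFF.json"
    if not path.exists():
        raise FileNotFoundError("No handoff found: read the INDEX file and start fresh.")
    return json.loads(path.read_text())
```

The point is not the format but the discipline: state lives in files the next session is forced to load, never in the conversation alone.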

B. Pre-Writing Planning (before any prose)

| File | Role |
| --- | --- |
| Pre_Writing_Discussion_Protocol.md | Structured three-phase interview to extract the researcher's knowledge before writing. Free-form input, committee-member Q&A, decision recording. |
| Writing_Brief_Template.md | A 9-step plan for each section: boundary rules, upstream commitments, internalization check, paragraph-level outline with argumentative purposes, verified data table. |
| Paper_Adaptation_Protocol.md | When the source is a published paper, the agent adapts rather than generates. Side-by-side verification classifies every change. |

C. Error Prevention (during writing)

| File | Role |
| --- | --- |
| Error_Pattern_Library.md | 33 error patterns with a pre-write checklist the agent executes before producing any output. Wrong/right sentence pairs for every high-risk pattern. |

D. Review Pipeline (after writing)

| File | Role |
| --- | --- |
| Three_Stage_Review_Protocol.md | Post-draft review in three independent passes: cold read, researcher annotations, committee-member challenge. |
| Cross_Model_Review_Protocol.md | New (April 2026). Writer and reviewer must be different models from different families. Defeats self-preference bias — see Empirical Findings below. |
| Programmatic_Style_Audit_Protocol.md | New (April 2026). Full-file scan of style constraints, forbidden phrases, undefined abbreviations, inline arithmetic, and hedge density. Reference implementation: audit.py. |
| Postmortem_Protocol.md | When a draft is rejected entirely: 5-part investigation (chronology, diagnosis, root cause, reflection, action items). Feeds new patterns into the error library. |
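The kind of checks audit.py performs can be sketched in a few lines. The phrase lists and density threshold here are illustrative assumptions, not the repository's actual configuration:

```python
import re

FORBIDDEN = ["delve into", "it is worth noting", "in conclusion"]  # illustrative list
HEDGES = {"may", "might", "could", "possibly", "perhaps"}          # illustrative list

def audit(text, max_hedge_density=0.05):
    """Full-file scan: forbidden phrases, undefined abbreviations, hedge density."""
    issues = []
    low = text.lower()
    for phrase in FORBIDDEN:
        for _ in range(low.count(phrase)):
            issues.append(f"forbidden phrase: {phrase!r}")
    # Abbreviations used but never expanded as "Full Name (ABBR)"
    used = set(re.findall(r"\b[A-Z]{2,5}\b", text))
    defined = set(re.findall(r"\(([A-Z]{2,5})\)", text))
    for abbr in sorted(used - defined):
        issues.append(f"undefined abbreviation: {abbr}")
    words = re.findall(r"[a-z']+", low)
    hedge_count = sum(w in HEDGES for w in words)
    if words and hedge_count / len(words) > max_hedge_density:
        issues.append(f"hedge density {hedge_count}/{len(words)} exceeds threshold")
    return issues
```

Because the scan is a program rather than an attention-bound reading pass, it finds every occurrence in a 25-paragraph section as reliably as in a single paragraph.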

E. Specialized Pipelines (domain-specific)

| File | Role |
| --- | --- |
| Citation_Management_System.md | Three-pipeline system: gap analysis with dual-AI cross-verification, citation verification against source papers, programmatic insertion. |
| Figure_Generation_System.md | Raw data to publication-quality figures: data verification before plotting, saved reproducible scripts, researcher review gates, programmatic document integration. |
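At its core, dual-AI cross-verification reduces to intersecting two independent models' findings and escalating disagreements rather than silently resolving them. The sketch below uses hypothetical callables in place of real API calls:

```python
def cross_verify_gaps(claims, ask_model_a, ask_model_b):
    """Run the same gap analysis through two independent models.
    A claim is flagged as needing a citation only when both agree;
    one-model findings go to the researcher instead of being dropped
    or accepted on a single model's say-so."""
    flagged_a = set(ask_model_a(claims))
    flagged_b = set(ask_model_b(claims))
    return {
        "needs_citation": sorted(flagged_a & flagged_b),          # both agree
        "escalate_to_researcher": sorted(flagged_a ^ flagged_b),  # one model only
    }
```

For example, if model A flags claims 1 and 2 while model B flags claims 2 and 3, only claim 2 is auto-accepted as a gap; claims 1 and 3 become researcher decisions.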

F. Autonomous Agent Tasks

| File | Role |
| --- | --- |
| TASK_File_Template.md | Self-contained instruction files for AI coding agents. Goal-oriented with context and decision frameworks, not cell-address-level micromanagement. |

How to Use

Quick Start

  1. Copy all thirteen files into your project directory.
  2. Start with Session_Management_Protocol.md: create a Knowledge Base directory and INDEX file.
  3. Before writing, conduct a discussion (Pre_Writing_Discussion_Protocol.md) and create a Writing Brief (Writing_Brief_Template.md).
  4. Copy the pre-write checklist from Error_Pattern_Library.md into your AI's system prompt.
  5. Write.
  6. Run Programmatic_Style_Audit_Protocol.md via audit.py on the draft.
  7. Run Cross_Model_Review_Protocol.md — the cross-model critic catches what your writing model will not.
  8. Run Three_Stage_Review_Protocol.md on the cross-model-reviewed draft.
  9. Revise and hand to the researcher.
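Steps 5 through 9 form a loop that can be sketched as a small driver. Every function name here is a placeholder for one of the protocol steps above, not an API this repository provides:

```python
def draft_until_clean(write, style_audit, cross_model_review, max_rounds=3):
    """Write, then run the programmatic audit and cross-model review;
    revise until both pass, then hand the clean draft on to
    three-stage review and, finally, the researcher."""
    draft = write(feedback=None)
    for _ in range(max_rounds):
        issues = style_audit(draft) + cross_model_review(draft)
        if not issues:
            return draft  # ready for Three_Stage_Review_Protocol.md
        draft = write(feedback=issues)
    raise RuntimeError("Draft did not converge; escalate to the researcher.")
```

The bound on rounds matters: a draft that cannot pass mechanical checks after a few revisions is a signal for the postmortem protocol, not for more automated retries.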

If You Already Have a Workflow

Adopt individual files where you see failures:

  • Sessions feel disconnected → Session_Management_Protocol.md
  • AI keeps making the same mistakes → Error_Pattern_Library.md
  • Drafts need heavy rewriting → Writing_Brief_Template.md + Pre_Writing_Discussion_Protocol.md
  • Citations are unreliable → Citation_Management_System.md
  • Adapting published papers → Paper_Adaptation_Protocol.md
  • Review passes miss things → Cross_Model_Review_Protocol.md
  • Style consistency degrades across long sections → Programmatic_Style_Audit_Protocol.md

Empirical Findings

In April 2026 I stress-tested the original eleven protocols against a corpus of 527 prior conversations, 159 archived drafts, and 254 finalized thesis paragraphs. Two findings drove the addition of the two new protocols in this release.

Self-preference bias is not hypothetical. When the same language model writes and reviews, the review reports near-zero issues. In one test, Claude Opus reviewing its own prior Claude-written paragraphs found zero issues in 51 paragraphs; Claude Sonnet reviewing the identical paragraphs found 11 issues; Claude Haiku found four. When Opus reviewed a paragraph written by Sonnet (not Opus), it found 10 substantive issues. Gemini 3.1 Pro reviewing the same paragraph independently surfaced the four most consequential of those 10 findings, despite coming from an entirely different model family. This is the motivation for Cross_Model_Review_Protocol.md.
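The protocol's core rule, writer and reviewer from different families, is mechanically checkable before any review runs. The family mapping below is an illustrative assumption, not the protocol's actual configuration:

```python
FAMILY_PREFIXES = {"claude": "anthropic", "gemini": "google", "gpt": "openai"}  # illustrative

def model_family(model_name):
    """Map a model name to its family by prefix."""
    for prefix, family in FAMILY_PREFIXES.items():
        if model_name.lower().startswith(prefix):
            return family
    raise ValueError(f"unknown model family: {model_name}")

def check_reviewer(writer_model, reviewer_model):
    """Refuse a review pairing that self-preference bias would hollow out."""
    if model_family(writer_model) == model_family(reviewer_model):
        raise ValueError(
            f"{reviewer_model} shares a family with writer {writer_model}: "
            "same-family review is a rubber stamp; pick a cross-family reviewer"
        )
    return True
```

Running this guard at the top of a review script turns the protocol's rule into something a pipeline can enforce rather than a convention a session can forget.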

Long-session style degradation is invisible to attention-bound review. The Error Pattern Library documented Pattern #31 (style abandonment after 5+ paragraphs) over a year before the stress test. What the stress test added was confirmation that Pattern #29 (lazy audit) combines with it to form an unreviewable failure: a Claude auditing its own 25-paragraph section will count roughly a third of the actual violations. A ten-line Python script counts all of them. This is the motivation for Programmatic_Style_Audit_Protocol.md.

Both new protocols are generalizations from specific findings, not novel ideas. The self-preference bias phenomenon is documented in Zheng et al. (2024) and Panickssery et al. (2024); the programmatic-audit pattern is standard in software engineering. The contribution of this release is porting them into an integrated research-writing workflow alongside the eleven protocols that already existed.

Design Principles

Human-in-the-loop at every stage. The researcher is the intellectual driver. The AI structures, drafts, searches, and checks. The researcher decides, judges, verifies, and approves. This matches what the harness engineering community has converged on: humans steer, agents execute.

Earned through failure, not designed in advance. Every protocol exists because something went wrong. The 33 error patterns are not hypothetical. The postmortem protocol exists because drafts were rejected and rewritten from scratch. The cross-model review protocol exists because a same-model review reported zero issues on a paragraph a different model could shred in seconds.

Verification over trust. Pre-write checklists, data verification tables, side-by-side comparisons, three-stage reviews, dual-AI citation cross-verification, cross-model prose review, programmatic style audits. These exist because AI output cannot be trusted without independent checks — the same insight that drives linters, tests, and CI in coding harnesses.

Different model families for review. A Claude reviewing a Claude is not a review; it is a rubber stamp. A Claude reviewing a different Claude is a partial review. A Gemini reviewing a Claude (or vice versa) is the closest thing to an independent review available without involving a human. The Cross-Model Review Protocol operationalizes this.

Prerequisites

  • An AI assistant that can read and write files (or receive file contents in conversation).
  • A text editor for markdown files.
  • For citation, figure, and style-audit pipelines: basic familiarity with Python.
  • For cross-model review: API access to at least one model outside the family of your primary writing model. Gemini API is the tested pairing when Claude is the writer; the principle applies in any cross-family direction.
  • Platform-agnostic. Tested primarily with Claude and Gemini but applicable to any capable LLM.

Limitations

  • Adds overhead. Trades speed for correctness. For short, low-stakes documents, the full workflow is unnecessary.
  • Biased toward STEM dissertation writing in English. Protocols for humanities writing, non-English writing, or non-dissertation long-form work would need adaptation.
  • Assumes the researcher has deep domain expertise. The harness cannot substitute for subject-matter judgment; it prevents AI-specific failure modes but does not certify scientific correctness.
  • Cross-model review costs API calls on both sides. Budget accordingly.

Related Work

  • Mitchell Hashimoto, "My AI Adoption Journey" (February 2026) — origin of the term "harness engineering"
  • OpenAI, "Harness engineering: leveraging Codex in an agent-first world" (February 2026) — the coding harness that sparked the conversation
  • Birgitta Böckeler / Martin Fowler, "Harness Engineering" (February 2026) — guides and sensors framework
  • Anthropic, "Effective harnesses for long-running agents" (November 2025) — state externalization and session management
  • Zheng, L., et al. (2024). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS.
  • Panickssery, A., Bowman, S. R., and Feng, S. (2024). LLM Evaluators Recognize and Favor Their Own Generations. NeurIPS.
  • Wataoka, K., Takahashi, T., and Ri, R. (2024). Self-Preference Bias in LLM-as-a-Judge. arXiv:2410.21819.

License

MIT License. See LICENSE for details.

Citation

Li, Z. (2026). Research Harness: Harness Engineering for AI-Assisted Extended Research Work.
GitHub. https://github.com/AlbanLi0314/research-harness
