Teller

An open-source grounded reasoning agent for enterprise document QA.

Built by Dolores Research. Validated on the OfficeQA benchmark by Databricks.

Metric	Result
Accuracy	71.5% (176/246) on OfficeQA — highest in the competition
Rank	#1 on Sentient Arena Cohort 0 leaderboard
Score	187.823 (peak: 192.046)
Cost	$1.85 total ($0.0075/question)
Model	MiniMax M2.5 (10B active params)
vs. Claude Opus 4.5	+3.7% accuracy at 1/500th the cost

The performance gap on enterprise document reasoning is behavioral, not capability-based. A 10B-parameter model, properly orchestrated, outperforms models 20x its size.

What This Agent Does

Teller is a document-to-answer system — it retrieves, extracts, computes, and validates answers from large document archives in a single pipeline:

  Question (natural language)
         │
         ▼
  ┌─────────────────┐
  │  1. RETRIEVE     │  grep corpus → locate relevant files
  │  2. EXTRACT      │  parse tables → trace column hierarchies → get raw values
  │  3. COMPUTE      │  Python/scipy → 20+ statistical formulas → precise results
  │  4. VALIDATE     │  unit check → fiscal year → plausibility → write answer
  └─────────────────┘
         │
         ▼
  Answer (bare number, 1% tolerance)

No existing product does all four steps. Document search tools (Elastic, AlphaSense) find documents but don't compute. Parsing tools (Textract, Unstructured.io) extract data but don't reason. Analytics tools (Python, Excel) compute but require manual data entry. Teller goes from raw documents to validated numerical answers.

Quick Start

1. Install Goose

Goose is the open-source AI agent framework by Block Inc. that powers the runtime.

# macOS/Linux
brew install block-goose-cli

# or via install script
curl -fsSL https://github.com/block/goose/releases/download/stable/download_cli.sh | bash

2. Configure

git clone https://github.com/Leonwenhao/teller.git
cd teller

# Set your OpenRouter API key
cp .env.example .env
# Edit .env: OPENROUTER_API_KEY=sk-or-v1-...

# Configure goose to use OpenRouter
export GOOSE_PROVIDER="openrouter"
export GOOSE_MODEL="minimax/minimax-m2.5"
source .env

3. Run

Interactive mode — ask questions about documents in a conversation:

goose run --recipe recipe.yaml

Headless mode — answer a specific question:

goose run --recipe recipe.yaml --params question="What was the total public debt in FY2020 in nominal dollars?"

Benchmark mode — evaluate against OfficeQA (requires Docker):

# Pull the Treasury Bulletin corpus
docker pull ghcr.io/sentient-agi/harbor/officeqa-corpus:latest

# Run evaluation on 20 questions
python3 scripts/standalone_eval.py --limit 20

# Full benchmark (246 questions, ~4 hours, ~$12)
python3 scripts/standalone_eval.py

Architecture

┌──────────────────────────────────────────────────┐
│                  Goose Runtime                    │
│         (auto-compaction at 80% context)          │
│                                                   │
│  ┌─────────────────────────────────────────────┐  │
│  │          Prompt Template (84 lines)          │  │
│  │  3 iron rules + workflow + formulas + traps  │  │
│  └─────────────────────────────────────────────┘  │
│  ┌─────────────────────────────────────────────┐  │
│  │         Domain Skills (138 lines)            │  │
│  │  retrieval strategy + table parsing + CPI    │  │
│  └─────────────────────────────────────────────┘  │
│                       │                           │
│              Model: MiniMax M2.5                  │
│              (via OpenRouter)                     │
└──────────────────┬───────────────────────────────┘
                   │ grep/sed/python
          ┌────────▼────────┐
          │  Document Corpus │
          │  (697 TXT files) │
          └─────────────────┘

Why Goose? The agent harness was the single biggest performance breakthrough. Goose's auto-compaction at 80% context utilization preserves instruction fidelity across long multi-step reasoning chains. Without it, accuracy degrades from 80% early to 60% late in a run as tool outputs dilute the original instructions. With it, accuracy holds at 71-73% throughout. This also reduced cost by 87% ($14.50 to $1.85).

How It Works

The Three Iron Rules

These account for more accuracy gain than any other component:

Write answer in every code block. Every Python block ends with open('/app/answer.txt','w').write(str(result)). A rough answer beats an empty file. Removing this rule caused a 22-answer regression.
All math in Python. Uses scipy, numpy, statsmodels. Never computes in natural language. Prevents precision errors beyond the 1% tolerance.
Stop after writing final answer. No re-verification, no second-guessing. Prevents "self-sabotage" — overwriting a correct answer during an unnecessary check.

20+ Named Statistical Formulas

The prompt embeds exact implementations for: linear regression, t-statistic, Theil index, CAGR, YoY growth, continuously compounded growth, Expected Shortfall (CVaR), arc elasticity, HHI, Gini coefficient, coefficient of variation, hazard rate, Winsorized statistics, Box-Cox transform, HP filter, ARIMA, polynomial regression, and more.

Domain Knowledge Injection

Unit conversion rules — the #1 error pattern. Check four places: table title, header row, column header, footnotes.
Fiscal year conventions — pre-1976 (Jul-Jun) vs post-1976 (Oct-Sep).
Treasury terminology traps — "gross debt" vs "debt held by public", "receipts" vs "receipt instruments", "reported IN" vs "reported FOR".
CPI-U reference data — embedded annual averages (1938-2020) for inflation adjustments.
Retrospective table strategy — for multi-year questions, search the latest year first for a historical summary table.

Methodology: EvoSkill, Done By Hand

Our optimization methodology parallels Sentient's EvoSkill framework, which auto-discovers agent skills from failed trajectories. EvoSkill improved OfficeQA accuracy from 60.6% to 67.9%. We applied the same core loop manually and achieved 71.5%:

                    ┌───────────────────┐
                    │  Submit agent run  │
                    └────────┬──────────┘
                             │
                    ┌────────▼──────────┐
                    │  Classify failures │
                    │  retrieval │       │
                    │  extraction │      │
                    │  computation │     │
                    │  behavioral        │
                    └────────┬──────────┘
                             │
                    ┌────────▼──────────┐
                    │  Prioritize by EV  │
                    │  (questions × P ×  │
                    │   score impact)    │
                    └────────┬──────────┘
                             │
                    ┌────────▼──────────┐
                    │  Create new skill  │
                    │  (formula, trap,   │
                    │   reference data)  │
                    └────────┬──────────┘
                             │
                    ┌────────▼──────────┐
                    │  Evaluate + keep/  │
                    │  revert (>5 var.)  │
                    └────────┬──────────┘
                             │
                             └──────────► repeat

Cross-submission variance analysis across 6 runs revealed:

114 always-pass questions (46.3%) — reliable base, don't regress these
43 always-fail questions (17.5%) — structural gaps, target these with new skills
89 swing questions (36.2%) — variance-driven, not directly addressable

The always-fail analysis identified specific missing formulas and reference data, which we added as targeted skills. This is exactly the EvoSkill hypothesis: agent skills can be systematically discovered from failure patterns.

See FINAL_REPORT.md for the full research paper with methodology, ablations, and results.

Enterprise Use Cases

The capabilities validated on OfficeQA directly transfer to:

Financial Services — SEC filing analysis, credit research, regulatory reporting. The agent parses hierarchical financial tables, handles unit conversions, and computes ratios across multi-year filings.

Government & Public Sector — Budget analysis, GAO auditing, FOIA response. Validated on 86 years of actual government documents with fiscal year reasoning built in.

Audit & Compliance — Cross-document verification, variance analysis, financial statement reconciliation. Full reasoning traces provide auditability.

Insurance & Actuarial — Loss reserving, risk assessment. Native computation of CVaR, hazard rates, HHI, regression, and ARIMA from extracted document data.

Key Findings

Finding	Evidence
Prompt density > prompt volume	84 lines outperformed 543 lines by +2.4% accuracy
Agent harness > prompt engineering	Goose vs OpenHands SDK = +5 correct answers, zero prompt changes
"Write answer in every block" is load-bearing	Removing it = -22 answers (8.9% regression)
MoE models need domain knowledge, not tutorials	MiniMax M2.5: #1 tool calling (76.8%), but 38% on FinanceAgent
Context management is a first-order variable	Without compaction: 80% early, 60% late. With: 71-73% throughout
Infrastructure load affects benchmarks	2 AM: 171s latency, 69.5% accuracy. Daytime: 210s, 66.1%

Repository Structure

dolores-docagent/
├── recipe.yaml                  # Goose recipe — run this
├── arena.yaml                   # Sentient Arena config
├── prompts/
│   └── goose_prompt.j2          # 84-line prompt template (the core IP)
├── skills_consolidated/
│   └── core/SKILL.md            # 138-line domain knowledge package
├── skills/                      # Extended skills (542 lines, 5 files)
├── scripts/
│   ├── standalone_eval.py       # Docker-based evaluation
│   └── aggregate_results.py     # Results analysis
├── officeqa_full.csv            # OfficeQA ground truth (246 questions)
├── FINAL_REPORT.md              # Full research paper
├── SUBMISSION_REPORT.md         # Competition report
└── program.md                   # AutoResearch self-improvement loop

Sentient Arena Context

This agent was developed during Sentient Arena Cohort 0 — a two-week competition stress-testing enterprise AI agents on the OfficeQA benchmark. All participants used the same model (MiniMax M2.5) and corpus, isolating behavioral engineering as the sole variable.

Sentient's ecosystem includes:

Arena — Production-grade evaluation for enterprise AI agents
ROMA — Recursive Open Meta-Agent framework (5K+ stars)
EvoSkill — Auto-discovers agent skills from failed trajectories
OpenDeepSearch — Open-source search that outperforms Perplexity
GRID — Planetary-scale network connecting 40+ agents

Our methodology validates EvoSkill's core thesis and demonstrates it can be applied to achieve state-of-the-art results on enterprise document reasoning.

Adapting to Your Own Documents

To use this agent on your own document corpus:

Prepare your corpus — convert documents to text files in a directory
Edit the prompt template — replace Treasury-specific traps with your domain knowledge
Keep the iron rules — the three behavioral rules are domain-agnostic
Keep the formulas — the statistical formula library is reusable
Run the EvoSkill loop — submit, classify failures, add targeted skills, repeat

The domain-specific content (fiscal year rules, Treasury terminology, CPI data) is clearly sectioned in the prompt and skills files for easy replacement.

Citation

@misc{liu2026behavioral,
  title={Behavioral Engineering for Grounded Reasoning: A Systematic Approach to Enterprise Document QA},
  author={Liu, Leon (柳文浩)},
  year={2026},
  note={Dolores Research. Sentient Arena Cohort 0, \#1 on leaderboard.
        70.7\% accuracy on OfficeQA with MiniMax M2.5 (10B active params).}
}

License

MIT License. See LICENSE.

Dolores Research — Behavioral engineering for enterprise AI agents.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Teller

What This Agent Does

Quick Start

1. Install Goose

2. Configure

3. Run

Architecture

How It Works

The Three Iron Rules

20+ Named Statistical Formulas

Domain Knowledge Injection

Methodology: EvoSkill, Done By Hand

Enterprise Use Cases

Key Findings

Repository Structure

Sentient Arena Context

Adapting to Your Own Documents

Citation

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
assets		assets
prompts		prompts
scripts		scripts
skills		skills
skills_consolidated/core		skills_consolidated/core
.env.example		.env.example
.gitignore		.gitignore
FINAL_REPORT.md		FINAL_REPORT.md
LICENSE		LICENSE
README.md		README.md
SUBMISSION_REPORT.md		SUBMISSION_REPORT.md
arena.yaml		arena.yaml
officeqa_full.csv		officeqa_full.csv
program.md		program.md
pyproject.toml		pyproject.toml
recipe.yaml		recipe.yaml

Folders and files

Latest commit

History

Repository files navigation

Teller

What This Agent Does

Quick Start

1. Install Goose

2. Configure

3. Run

Architecture

How It Works

The Three Iron Rules

20+ Named Statistical Formulas

Domain Knowledge Injection

Methodology: EvoSkill, Done By Hand

Enterprise Use Cases

Key Findings

Repository Structure

Sentient Arena Context

Adapting to Your Own Documents

Citation

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages