Autonomous Research and Evaluation Agent

Overview

This project is a command-line research pipeline that turns a natural-language query and a local evidence corpus into a structured markdown report, critiques that report with a structured JSON evaluation, and optionally revises the draft based on that feedback. Retrieval is fully local (no vector DB required for the baseline); generation, evaluation, and revision use the Groq API (llama-3.3-70b-versatile).

You can run it from the CLI, a Streamlit upload UI (app.py), or a Next.js web app under web/ that calls the pipeline either via a local Next API route or a deployed Python API (e.g. Render + Vercel; see Deploy: Render (API) + Vercel (frontend)).

Live demo

A hosted instance is available for trying the web UI against a remote API (same stack as in Deploy: Render (API) + Vercel (frontend)):

Service	Link
Web app (Vercel)	autonomous-research-eval-agent.vercel.app
API health (Render)	research-eval-agent-api.onrender.com/health

The API may be slow on the first request after idle time on a free tier (cold start). Replace these URLs in the README if you use different deployment domains.

Tech stack

Layer	Technologies
Language (backend)	Python 3
LLM / API	Groq (`groq` SDK, HTTP via `httpx`) — research, evaluation, revision, and grounding agents
Validation & schemas	Pydantic v2 (`pydantic`, `pydantic_core`)
CLI	Typer + Rich
Env	`python-dotenv` (`.env` at repo root)
Documents	`pypdf` for PDFs; plain text / markdown loaders in `src/tools/`
Optional UI (Python)	Streamlit (`streamlit run app.py`) — single-file upload + query
Web UI	Next.js 14 (App Router), React 18, TypeScript, Tailwind CSS 3
Web → Python bridge	Local: Next.js route `web/app/api/evaluate` spawns the repo venv Python and `scripts/pipeline_json_stdout.py`. Production: FastAPI app `render_api/main.py` exposes `POST /api/evaluate`; set `NEXT_PUBLIC_API_URL` on Vercel to the Render service URL
Other Python deps	See `requirements.txt` (e.g. `tenacity`, `requests`; some packages are shared/transitive dependencies)

Model note: The default Groq chat model is set as MODEL in src/agents/groq_client.py (e.g. llama-3.3-70b-versatile); change it there to switch models.

Prerequisites: Python 3 with pip (venv recommended). For the web app, Node.js 18+ and npm (or compatible package manager).

The design separates retrieval, research (drafting), evaluation, and revision into explicit stages so each step can be inspected, logged, and extended independently.

OpenClaw-Oriented Mode

In addition to the standard CLI pipeline, the project includes an OpenClaw-oriented runner that presents the workflow as explicit agent roles:

OpenClaw Research Agent
OpenClaw Evaluation Agent
OpenClaw Revision Agent
OpenClaw Grounding Agent

Run it with:

python src/openclaw_runner.py "Analyze EV adoption trends in Nepal"

Features

Local retrieval — Loads .txt, .md, and .pdf from the default data/ folder (recursive), or from --file / --data-dir; splits text into paragraph chunks; ranks by keyword overlap with the query; returns the top passages for grounding. PDFs use pypdf; encoding issues and bad PDFs are handled with warnings, not crashes.
Research agent — Produces a structured markdown draft (executive summary, findings, evidence, gaps, conclusion) using only the retrieved chunks.
Evaluation agent — Returns structured JSON aligned with a Pydantic schema: overall score plus subscores (relevance, completeness, clarity, evidence usage on a 1–10 scale), issues, and suggested fixes. The overall score is normalized to match the mean of the subscores for internal consistency.
Revision agent — Rewrites the draft using the evaluation JSON when revision runs.
Conditional revision — If the evaluation score is ≥ 8.0 (configurable threshold in code), the revision step is skipped and the final report is the draft; otherwise the reviser runs.
Traceable outputs — Each run writes numbered artifacts under outputs/, including retrieved chunks, draft, evaluation, final report, grounding audit JSON, and a run summary (with optional grounding_score).
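
The local retrieval feature above (paragraph chunks ranked by keyword overlap) can be sketched in a few lines. This is an illustration only, not the code in src/tools/retrieval_tool.py: the function names, the tokenizer, and the scoring rule (number of distinct query terms found in a chunk) are assumptions.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase alphanumeric tokens (a deliberately simple, hypothetical tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def split_paragraphs(text: str) -> list[str]:
    """Split on blank lines and drop empty chunks."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def score_chunk(query: str, chunk: str) -> int:
    """Keyword overlap: how many distinct query terms appear in the chunk."""
    return len(tokenize(query) & tokenize(chunk))


def top_chunks(query: str, text: str, top_k: int = 3) -> list[tuple[int, str]]:
    """Return up to top_k (score, chunk) pairs with a non-zero score, best first."""
    scored = [(score_chunk(query, c), c) for c in split_paragraphs(text)]
    scored = [pair for pair in scored if pair[0] > 0]
    # sorted() is stable, so chunks with equal scores keep document order.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


if __name__ == "__main__":
    # Placeholder text, not real corpus content.
    corpus = (
        "Paragraph one mentions EV adoption in Nepal.\n\n"
        "Paragraph two is about something unrelated.\n\n"
        "Paragraph three mentions charging in Nepal."
    )
    for score, chunk in top_chunks("Nepal EV adoption", corpus):
        print(score, chunk)
```

Pure keyword overlap is cheap and needs no index, which fits the "no vector DB" baseline, but it will miss paraphrases; that trade-off is what the embedding idea under Future Improvements would address.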

Architecture


┌──────────────────┐     ┌─────────────────┐     ┌──────────────────┐
│ Local            │     │ Research        │     │ Evaluation       │
│ retrieval        │────▶│ agent (Groq)    │────▶│ agent (Groq)     │
│ (.txt/.md/.pdf)  │     │ draft MD        │     │ JSON + schema    │
└──────────────────┘     └────────┬────────┘     └────────┬─────────┘
                                  │                       │
                                  │            score < 8? │
                                  │                       ▼
                                  │               ┌───────────────┐
                                  └───────────────│ Revision      │
                                      (skip)      │ agent (Groq)  │
                                                  └───────┬───────┘
                                                          ▼
                                     Final markdown + grounding audit + summary

src/main.py / src/openclaw_runner.py — CLI entrypoints; call src/pipeline.py for orchestration.
src/pipeline.py — Retrieval → draft → evaluate → conditional revise → grounding audit → numbered outputs.
src/agents/ — Research, evaluation, revision, and grounding agents; shared groq_client and prompt_loader.
src/tools/retrieval_tool.py & document_loaders.py — Load .txt / .md / .pdf, paragraph chunking, keyword scoring.
src/schemas/ — evaluation_schema, grounding_audit_schema (Pydantic validation).
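
To make the evaluation contract concrete, here is a rough sketch of an evaluation record whose overall score is recomputed as the mean of its subscores. The project validates with Pydantic; this sketch swaps in standard-library dataclasses so it runs without dependencies. The field names (evidence_usage, suggested_fixes, overall_score) and the rounding are assumptions, not the actual contents of evaluation_schema.

```python
import json
from dataclasses import dataclass, field
from statistics import mean

# Assumed subscore field names; the real schema may name them differently.
SUBSCORE_NAMES = ("relevance", "completeness", "clarity", "evidence_usage")


@dataclass
class Evaluation:
    relevance: float
    completeness: float
    clarity: float
    evidence_usage: float
    issues: list[str] = field(default_factory=list)
    suggested_fixes: list[str] = field(default_factory=list)
    overall_score: float = 0.0

    def __post_init__(self) -> None:
        for name in SUBSCORE_NAMES:
            value = getattr(self, name)
            if not 1 <= value <= 10:
                raise ValueError(f"{name} must be between 1 and 10, got {value}")
        # Normalize: ignore any model-reported overall score and use the mean.
        self.overall_score = round(mean(getattr(self, n) for n in SUBSCORE_NAMES), 2)


def parse_evaluation(raw_json: str) -> Evaluation:
    """Parse the evaluator's JSON text into a validated Evaluation."""
    data = json.loads(raw_json)
    return Evaluation(
        **{name: data[name] for name in SUBSCORE_NAMES},
        issues=data.get("issues", []),
        suggested_fixes=data.get("suggested_fixes", []),
    )
```

Recomputing the overall score on the client side means a model that reports, say, 9.5 alongside middling subscores cannot skip revision by accident.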

Folder Structure


research-eval-agent/
├── app.py                   # Streamlit UI (upload + query → pipeline)
├── data/                    # Default corpus (.txt, .md, .pdf)
├── outputs/                 # Run artifacts (gitignored)
├── render_api/
│   └── main.py              # FastAPI server for Render (POST /api/evaluate)
├── scripts/
│   └── pipeline_json_stdout.py   # JSON-on-stdout entry for the local Next.js API
├── web/                     # Next.js app (research evaluation results UI)
│   ├── app/                 # App Router: page + /api/evaluate
│   ├── components/
│   ├── lib/
│   └── package.json
├── src/
│   ├── main.py              # CLI
│   ├── openclaw_runner.py   # OpenClaw-styled CLI
│   ├── pipeline.py          # Shared orchestration (+ upload path for API/Streamlit)
│   ├── agents/
│   ├── prompts/
│   ├── schemas/
│   └── tools/
│       ├── retrieval_tool.py
│       └── document_loaders.py
├── .env
├── requirements.txt
└── README.md

Setup

Clone the repository and open a terminal at the project root.
Create a virtual environment (recommended):
```
python -m venv .venv
```
Activate the environment (examples):
- Windows (PowerShell): .venv\Scripts\Activate.ps1
- macOS/Linux: source .venv/bin/activate
Install dependencies:
```
pip install -r requirements.txt
```
Configure environment variables (see next section).

Next.js web UI (optional)

From the repo root, with the same Python venv and .env as above:

cd web
npm install
npm run dev

Open the URL shown in the terminal (usually http://localhost:3000). The UI uploads a document together with a query to POST /api/evaluate, then displays evaluation scores, report tabs, and actions.

Path resolution: The API looks for the repository root as the parent of web/ by default. If you run Next from a different layout, set RESEARCH_AGENT_ROOT to the absolute path of this repo so Python can find scripts/pipeline_json_stdout.py and .venv.
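
The lookup rule above is implemented in the Next.js route, so the following Python is only a mirror of the described behavior for readers who want to check a layout by hand. The helper names are hypothetical; only RESEARCH_AGENT_ROOT and scripts/pipeline_json_stdout.py come from this README.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_repo_root(web_dir: Path, env: Optional[Mapping[str, str]] = None) -> Path:
    """Use RESEARCH_AGENT_ROOT when set, otherwise the parent of web/."""
    env = os.environ if env is None else env
    override = env.get("RESEARCH_AGENT_ROOT", "").strip()
    if override:
        return Path(override)
    return Path(web_dir).resolve().parent


def pipeline_script(root: Path) -> Path:
    """Location of the JSON-on-stdout entry point under the repo root."""
    return root / "scripts" / "pipeline_json_stdout.py"


if __name__ == "__main__":
    root = resolve_repo_root(Path.cwd())  # assumes the current directory is web/
    script = pipeline_script(root)
    print(f"repo root: {root}")
    print(f"pipeline script exists: {script.exists()}")
```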

Streamlit UI (optional)

From the project root with venv activated:

streamlit run app.py

Upload a .txt, .md, or .pdf, enter a query, and run the same pipeline as the CLI (single-document flow).

Environment Variables

Variable	Where	Required	Description
`GROQ_API_KEY`	Repo `.env`, Render	Yes (for LLM)	API key for Groq.
`ALLOWED_ORIGINS`	Render only	No	Comma-separated origins for CORS (e.g. your Vercel URL). Empty = allow any origin.
`NEXT_PUBLIC_API_URL` or `NEXT_PUBLIC_API_BASE_URL`	Vercel, `web/.env.local`	No	Render service origin, no trailing slash (either name works). If unset, the Next app uses `/api/evaluate` locally.
`RESEARCH_AGENT_ROOT`	Local Next only	No	Absolute path to this repo when the default parent-of-`web/` root is wrong.

Create a .env file in the project root (same directory as requirements.txt):

GROQ_API_KEY=your_key_here

The CLI loads .env automatically via python-dotenv.

Deploy: Render (API) + Vercel (frontend)

Split deployment uses the FastAPI app in render_api/main.py on Render and the Next.js app in web/ on Vercel. The browser calls your Render URL directly (NEXT_PUBLIC_API_URL); the Next.js route web/app/api/evaluate is only used when that variable is unset (local development).

Render (Python web service)

Create a Web Service, connect this repo, root directory = repository root.
Build command: pip install -r requirements.txt
Start command: uvicorn render_api.main:app --host 0.0.0.0 --port $PORT
Do not use uvicorn app:app: the root app.py is the Streamlit UI, not an ASGI module, and the FastAPI instance is app inside render_api/main.py. If deploy logs say Attribute "app" not found in module "app", fix the start command here and redeploy.
Environment variables (dashboard):
- GROQ_API_KEY — required (same as local).
- ALLOWED_ORIGINS — comma-separated origins allowed for CORS, e.g. https://your-app.vercel.app and http://localhost:3000 for testing. If omitted, the API allows any origin (convenient for experiments; set explicitly for production).
Optional: use render.yaml in the repo as a Blueprint. Health check path: /health.
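
The ALLOWED_ORIGINS rule described above (comma-separated list, empty means any origin) could be parsed roughly as follows. This is a sketch under assumptions: the function name is hypothetical, stripping a trailing slash is a convenience added here, and render_api/main.py may handle the variable differently. The `["*"]` result follows the wildcard convention that Starlette's CORSMiddleware accepts for `allow_origins`.

```python
import os
from typing import Optional


def parse_allowed_origins(raw: Optional[str]) -> list[str]:
    """Turn a comma-separated env value into a list; unset or empty allows any origin."""
    origins = [item.strip().rstrip("/") for item in (raw or "").split(",") if item.strip()]
    return origins or ["*"]


if __name__ == "__main__":
    print(parse_allowed_origins(os.environ.get("ALLOWED_ORIGINS")))
```

In a FastAPI app the resulting list would typically be passed to `CORSMiddleware(allow_origins=...)`; for production, set the variable explicitly rather than relying on the wildcard.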

Vercel (Next.js)

Import the repo; set Root Directory to web.
Environment variable: NEXT_PUBLIC_API_URL or NEXT_PUBLIC_API_BASE_URL = your Render service origin with no trailing slash, e.g. https://research-eval-api.onrender.com. The client calls {origin}/api/evaluate.
Redeploy after changing env vars.

See web/.env.example for the frontend variable. Keep GROQ_API_KEY only on Render (server-side), not in NEXT_PUBLIC_* on Vercel.

Note: Free Render instances may spin down when idle; the first request after sleep can take tens of seconds.
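
Because of that cold start, a script that calls the hosted API may want to poll /health before sending real work. The sketch below is one way to do that with exponential backoff; the function names, attempt count, and delays are assumptions, and the check is injected as a callable so the retry logic can be exercised without a network.

```python
import time
import urllib.request
from typing import Callable


def http_health_check(url: str, timeout: float = 30.0) -> bool:
    """Return True when the URL answers with HTTP 200."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status == 200


def wait_until_healthy(
    check: Callable[[], bool],
    attempts: int = 5,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call check() up to `attempts` times, doubling the pause after each failure."""
    for attempt in range(attempts):
        try:
            if check():
                return True
        except OSError:
            # urllib's URLError and socket timeouts are OSError subclasses;
            # treat them as "still waking up".
            pass
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return False
```

Example usage, assuming your service is reachable over HTTPS: `wait_until_healthy(lambda: http_health_check("https://<your-service>.onrender.com/health"))`.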

How to Run

From the project root, pass the research question as positional arguments (words are joined into one query string). Quote the question if it contains characters your shell treats specially, such as ? or *.

Default corpus (recursive scan of data/ for .txt, .md, .pdf):

python src/main.py "What are the main factors in Nepal electric vehicle adoption?"

Single file (mutually exclusive with --data-dir):

python src/main.py --file path/to/report.pdf Your question here

Custom directory (recursive scan; mutually exclusive with --file):

python src/main.py --data-dir path/to/corpus Your question here

The pipeline prints five stages ([1/5] … [5/5]) and writes files under outputs/. If the evaluation score is below the revision threshold, the log shows whether the reviser changed the draft. openclaw_runner.py accepts the same --file / --data-dir options.
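
The five-stage flow and the revision policy can be summarized in code. This is a simplified sketch, not src/pipeline.py: each stage is passed in as a callable so the control flow is visible on its own, and the stage labels, the REVISION_THRESHOLD name, and the keys of the returned dict are assumptions.

```python
from typing import Any, Callable

REVISION_THRESHOLD = 8.0  # cutoff stated in this README; the constant name is assumed


def run_pipeline(
    query: str,
    retrieve: Callable[[str], list],
    draft: Callable[[str, list], str],
    evaluate: Callable[[str], float],
    revise: Callable[[str, float], str],
    audit: Callable[[str, list], Any],
    threshold: float = REVISION_THRESHOLD,
) -> dict:
    """Run the five stages in order; revision is skipped for high-scoring drafts."""
    print("[1/5] Retrieving evidence")
    chunks = retrieve(query)

    print("[2/5] Drafting report")
    draft_md = draft(query, chunks)

    print("[3/5] Evaluating draft")
    score = evaluate(draft_md)

    print("[4/5] Revision")
    if score >= threshold:
        # Policy: a draft at or above the threshold becomes the final report.
        final_md, skipped = draft_md, True
        print(f"      skipped (score {score} >= {threshold})")
    else:
        final_md, skipped = revise(draft_md, score), False
        verdict = "changed" if final_md != draft_md else "did not change"
        print(f"      reviser {verdict} the draft")

    print("[5/5] Grounding audit")
    grounding = audit(final_md, chunks)

    return {
        "final_report": final_md,
        "evaluation_score": score,
        "revision_skipped": skipped,
        "draft_changed": final_md != draft_md,
        "grounding": grounding,
    }
```

Injecting the stages also makes the policy easy to test with stub functions, which lines up with the mocked-LLM integration tests listed under Future Improvements.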

Example Output Files

After a successful run, outputs/ typically contains:

File	Description
`00_retrieved_chunks.json`	Query plus ranked chunks (filename, text, retrieval score).
`01_draft_report.md`	Markdown draft from the research agent.
`02_evaluation.json`	Structured JSON evaluation (scores, issues, suggested fixes).
`03_final_report.md`	Final report: either the revised draft or a copy of the draft if revision was skipped.
`04_run_summary.json`	Query, chunk counts, unique sources, evaluation score, revision flags, `grounding_score` (when audit ran), ISO timestamp.
`05_grounding_audit.json`	Grounding audit: supported/unsupported points, notes, `grounding_score` (1–10).

Older filenames (draft_report.md, evaluation.json, etc.) are legacy; the numbered convention is what the current orchestrator writes.
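
For orientation, the numbered artifact convention could be produced by a helper along these lines. The file names match the table above; everything else (the function name, its parameters, and the JSON keys inside each file) is an assumption rather than the orchestrator's actual output format.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def write_artifacts(
    out_dir: str,
    query: str,
    chunks: list,
    draft_md: str,
    evaluation: dict,
    final_md: str,
    grounding_audit: Optional[dict] = None,
) -> Path:
    """Write one run's artifacts using the numbered file names."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    def dump(name: str, payload: dict) -> None:
        text = json.dumps(payload, indent=2, ensure_ascii=False)
        (out / name).write_text(text, encoding="utf-8")

    dump("00_retrieved_chunks.json", {"query": query, "chunks": chunks})
    (out / "01_draft_report.md").write_text(draft_md, encoding="utf-8")
    dump("02_evaluation.json", evaluation)
    (out / "03_final_report.md").write_text(final_md, encoding="utf-8")

    summary = {
        "query": query,
        "chunk_count": len(chunks),
        "evaluation_score": evaluation.get("overall_score"),
        "revised": final_md != draft_md,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    if grounding_audit is not None:
        # The audit file and the summary's grounding_score exist only when the audit ran.
        dump("05_grounding_audit.json", grounding_audit)
        summary["grounding_score"] = grounding_audit.get("grounding_score")
    dump("04_run_summary.json", summary)
    return out
```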

Why This Is Agentic

Role separation — Distinct agents with dedicated prompts and responsibilities (research vs. critique vs. edit), rather than a single monolithic prompt.
Tool-augmented grounding — A retrieval step supplies evidence before generation, reducing unsupported generation relative to a bare LLM call.
Structured critique — The evaluation agent emits machine-readable JSON (schema-validated), which can drive downstream automation or UI, not only human reading.
Closed-loop improvement — The revision agent consumes evaluation feedback to refine the draft when scores indicate room for improvement.
Governed autonomy — Conditional revision implements a policy: high-quality drafts skip an extra model call; lower scores trigger deliberate revision.

Together, these properties match a practical notion of agentic systems: modular actors, explicit state (artifacts), tools, and policy-controlled loops.

Future Improvements

Better retrieval — Embeddings, hybrid search, or citation-span extraction for finer-grained evidence.
Configurable thresholds and models — CLI flags or config file for revision cutoff, top_k, temperature, and model name.
Tests — Unit tests for retrieval scoring and evaluation JSON validation; integration tests with mocked LLM responses.
Observability — Structured logging, optional export of token usage, and run IDs for comparing experiments.
More formats — HTML or DOCX ingestion with preprocessing while keeping the same agent contract.

Portfolio note: This README describes the intended behavior of the repository as implemented in src/; extend the “Future Improvements” section as you ship features.
