discovery-agents

A production-grade, LLM-powered multi-agent workflow for enterprise product discovery — with RAG, a ReAct tool-use loop, a rigorous evaluation harness, guardrails, full observability, and an MCP server. Keyless by default, real on demand.

Given a product brief and a corpus of customer evidence, a graph of nine specialized agents clusters the evidence into insights, generates several evidence-grounded product directions, critiques and scores them, selects a winner, and emits a coding-agent-ready handoff packet — every step traced, evaluated, and guarded.

Why it's built this way. You can run the whole thing in 30 seconds with no API key (a deterministic mock provider + in-memory retrieval). Set ANTHROPIC_API_KEY (or --provider cohere/openai) and the same graph runs on a frontier model. CI, tests, and the eval regression gate all run keyless and deterministically.

30-second quickstart (no API key)

pip install -e ".[dev]"
discovery-agents --output outputs/demo        # runs on the deterministic mock provider
discovery-agents --eval                        # run the evaluation harness + regression gate
pytest -q                                      # deterministic, keyless test suite

A run writes these files to outputs/demo/: run_summary.md, coding_agent_handoff.md, canvas.html (visual canvas, eval dashboard, and trace timeline), agent_run.json, and agent_trace.md.

Run it on a real frontier model

pip install -e ".[anthropic]"      # or .[cohere] / .[openai] / .[gemini]
export ANTHROPIC_API_KEY=...
discovery-agents --provider anthropic --model claude-sonnet-4-6 --output outputs/live

Providers are pluggable behind one LLMClient protocol; if a key or SDK is missing the factory logs a warning and falls back to the mock, so the workflow always runs. The reasoning agents (EvidenceInsight, Strategy, Ideation, Critique, Handoff, Memory) call the model and parse structured JSON; Canvas layout and Selection stay deterministic.
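The fallback behavior described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual factory: the `complete` method, the `make_client` function, the `FakeRemoteClient` placeholder, and the `<PROVIDER>_API_KEY` naming rule are all assumptions made for the example.

```python
import logging
import os
from typing import Mapping, Optional, Protocol

logger = logging.getLogger("llm_factory_sketch")


class LLMClient(Protocol):
    """Assumed shape of the provider protocol; the real signature may differ."""

    def complete(self, prompt: str) -> str: ...


class MockLLMClient:
    """Deterministic, keyless client: the same prompt always yields the same text."""

    def complete(self, prompt: str) -> str:
        return f"[mock] {len(prompt)} chars"


class FakeRemoteClient:
    """Hypothetical stand-in for a real adapter that needs an API key."""

    def __init__(self, api_key: str) -> None:
        self.api_key = api_key

    def complete(self, prompt: str) -> str:
        raise NotImplementedError("a real adapter would call the provider SDK here")


def make_client(provider: str, env: Optional[Mapping[str, str]] = None) -> LLMClient:
    """Return a client for `provider`, falling back to the mock when no key is set."""
    env = os.environ if env is None else env
    if provider == "mock":
        return MockLLMClient()
    key = env.get(f"{provider.upper()}_API_KEY")  # assumed naming convention
    if not key:
        logger.warning("no API key for %r; falling back to the mock provider", provider)
        return MockLLMClient()
    return FakeRemoteClient(key)
```

Because the fallback returns an object that satisfies the same protocol, callers never need to branch on which provider they received.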

Live demo (FastAPI)

A deployable web app runs the whole pipeline, with the reasoning agents on a real LLM (Gemini by default) when a key is available. It is keyless by default: bring your own key per request, or set a host key to offer a rate-limited number of free runs.

pip install -e ".[web,gemini]"
uvicorn discovery_agents.webapp:app --reload          # open http://localhost:8000
# optional: a host key enables free server-side runs (visitors can also paste their own)
export GEMINI_API_KEY=...                              # never commit this

GET / — form: pick a dataset (enterprise sample / Banking77-style support), a provider, an optional API key, and a goal override; renders the selected direction, all directions with scores, the eval scorecard, decision memory, and trace totals.
POST /api/run — JSON in {provider?, api_key?, dataset?, brief?, evidence?}, JSON out {selected, directions, evals, decision_log, trace, provider_used}.
GET /healthz — liveness.

Keys are read from the request or the host env only — never logged, stored, or echoed. Container: docker build -f deploy/Dockerfile.web -t discovery-agents-web . then docker run --rm -p 8000:8000 -e GEMINI_API_KEY=$GEMINI_API_KEY discovery-agents-web. See docs/webapp.md for deploy notes (Render/Fly/Cloud Run).
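As a rough sketch of calling POST /api/run from Python, the helper below builds the JSON request using only the standard library. The field names come from the endpoint description above; the helper names, the `"mock"` provider value, and the `"banking77"` dataset value are assumptions for illustration, so check docs/webapp.md for the values the server actually accepts.

```python
import json
import urllib.request
from typing import Any, Dict, Optional


def build_run_request(
    base_url: str,
    dataset: str,
    provider: str = "mock",
    api_key: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST /api/run request."""
    payload: Dict[str, Any] = {"provider": provider, "dataset": dataset}
    if api_key:
        # Sent per request only; never hard-code or commit a real key.
        payload["api_key"] = api_key
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/run",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_discovery(base_url: str, dataset: str, **kwargs: Any) -> Dict[str, Any]:
    """Send the request to a running server and decode the JSON response."""
    request = build_run_request(base_url, dataset, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.load(response)
```

With the server from the uvicorn command above running, `run_discovery("http://localhost:8000", "banking77")` should return a dict containing the keys listed for the response, such as `selected` and `provider_used`.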

Architecture

flowchart LR
  subgraph Interfaces
    CLI[CLI] ; MCP[MCP server - stdio]
  end
  subgraph Runtime["Runtime: typed state machine + ReAct loop"]
    SM[StateMachine] --- LG[optional LangGraph adapter]
  end
  subgraph Agents["9 agents - LLM-backed"]
    EI[EvidenceInsight] --> ST[Strategy] --> ID[Ideation] --> CR[Critique] --> CA[Canvas] --> SE[Selection] --> HO[Handoff] --> ME[Memory]
  end
  subgraph Capabilities
    LLM[LLM adapter: Claude/Cohere/GPT/Gemini/Mock]
    RAG[Retrieval: embeddings + vector store]
    TOOLS[Tools: evidence_search, web_search, calculator]
    GR[Guardrails: input/output]
    OBS[Observability: latency/tokens/cost]
  end
  EVAL[Eval harness: LLM-judge + metrics + regression gate]
  CLI --> Runtime ; MCP --> Runtime
  Runtime --> Agents
  Agents -. uses .-> LLM ; Agents -. uses .-> RAG ; ID -. ReAct .-> TOOLS
  Agents -. wrapped by .-> GR ; Agents -. emit .-> OBS
  Runtime --> EVAL

See docs/architecture.md for the module-by-module breakdown.
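To illustrate how a graph runtime can order nodes and reject cycles, here is a minimal sketch using Kahn's algorithm. It is not the project's StateMachine; the `topological_order` function and the edge-dict representation are assumptions, and only the agent names are taken from the diagram.

```python
from collections import deque
from typing import Dict, List


def topological_order(edges: Dict[str, List[str]]) -> List[str]:
    """Return nodes so that every node appears after all of its upstream nodes.

    `edges` maps a node to the nodes that depend on it.
    Raises ValueError if the graph contains a cycle.
    """
    nodes = set(edges)
    for targets in edges.values():
        nodes.update(targets)

    indegree = {node: 0 for node in nodes}
    for targets in edges.values():
        for target in targets:
            indegree[target] += 1

    # Start from nodes with no upstream dependency (sorted for determinism).
    ready = deque(sorted(node for node, degree in indegree.items() if degree == 0))
    order: List[str] = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for target in edges.get(node, []):
            indegree[target] -= 1
            if indegree[target] == 0:
                ready.append(target)

    if len(order) != len(nodes):
        stuck = sorted(nodes - set(order))
        raise ValueError(f"cycle detected among: {stuck}")
    return order


# The agent chain drawn in the diagram, expressed as edges.
CHAIN = ["EvidenceInsight", "Strategy", "Ideation", "Critique",
         "Canvas", "Selection", "Handoff", "Memory"]
GRAPH = {a: [b] for a, b in zip(CHAIN, CHAIN[1:])}
```

For the linear chain, `topological_order(GRAPH)` returns the agents in the order drawn; adding an edge from Memory back to EvidenceInsight makes it raise `ValueError`.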

What's inside

Provider-agnostic LLM layer (llm/): one LLMClient protocol; a deterministic, keyless MockLLMClient (the default) plus lazy Anthropic / Cohere / OpenAI / Gemini adapters with tool-calling and structured-output support.
RAG (retrieval/) — Embedder + deterministic HashingEmbedder, a VectorStore (in-memory cosine + lazy Pinecone adapter), and an EvidenceIndex that returns cited passages.
ReAct runtime (runtime/) — a typed StateMachine (topological ordering, cycle detection, a trace span per node) and an LLMAgent plan-execute loop with tool use, step budgets, and full step tracing; plus an optional LangGraph adapter that runs the same graph.
Tools (tools/) — a Tool protocol + registry, a RAG evidence_search, a safe AST calculator, and a mockable web_search. See docs/adding-a-tool.md.
Guardrails (guardrails/) — input (PII redaction, prompt-injection block) and output (citation-required, groundedness, schema) checks, each recorded to the trace.
Observability (observability/) — a Trace of spans carrying latency, token usage, and cost, rendered to Markdown and an HTML timeline.
Evaluation (eval/) — deterministic quality metrics + an LLM-as-judge (faithfulness / relevance / helpfulness), aggregated into a scorecard and gated against a committed baseline.json in CI. See docs/evals.md.
MCP server (mcp_server.py) — exposes discovery_run, evidence_search, and eval_run over stdio for Claude Desktop / Claude Code. See docs/mcp.md.
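The keyless retrieval path (a deterministic hashing embedder plus in-memory cosine search) can be sketched in a few lines of standard-library Python. This shows the general technique only: `hashing_embed`, `InMemoryStore`, the 256-dimension default, and the signed-bucket scheme are assumptions, and the project's HashingEmbedder and VectorStore may work differently.

```python
import hashlib
import math
import re
from typing import List, Tuple


def hashing_embed(text: str, dim: int = 256) -> List[float]:
    """Map text to a unit-length vector by hashing each token into a signed bucket."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vec[bucket] += sign
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


class InMemoryStore:
    """Brute-force cosine search over unit vectors (dot product equals cosine)."""

    def __init__(self) -> None:
        self._items: List[Tuple[str, str, List[float]]] = []

    def add(self, doc_id: str, text: str) -> None:
        self._items.append((doc_id, text, hashing_embed(text)))

    def search(self, query: str, k: int = 3) -> List[Tuple[str, float]]:
        q = hashing_embed(query)
        scored = [
            (doc_id, sum(a * b for a, b in zip(q, vec)))
            for doc_id, _text, vec in self._items
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


store = InMemoryStore()
store.add("a", "export dashboard to csv")
store.add("b", "login fails after password reset")
top_hits = store.search("csv export", k=1)
```

Since the embedding depends only on a cryptographic hash of each token, the same text always produces the same vector, which is what makes keyless tests repeatable. Returning the document id alongside the score is also what lets a caller cite the passage it retrieved.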

ML-infrastructure platform (`mlinfra/`)

The retrieval embedder isn't a toy — it can be trained by a production ML-infra platform and served back through the same Embedder protocol:

Distributed PyTorch training — a custom SimCSE/InfoNCE training loop with DDP (gloo on CPU, nccl on GPU) and fault-tolerant, resumable checkpointing (atomic, SIGTERM-safe; resume reproduces the trajectory exactly — a tested invariant).
GPU-native data I/O — a multi-backend tensor archive (Numpy / Zarr / HDF5) with a GPU-native DataLoader, plus an I/O benchmark (samples/s, MB/s, p50/p95) and a resource profiler.
Distributed data curation — a pluggable Executor (Local / Dask) that tokenizes and shards a corpus into the archive.
MLOps — experiment/artifact tracking (MLflow or a keyless JSON tracker), a Dockerfile, docker-compose, and a Kubernetes Indexed-Job training manifest.
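The atomic part of resumable checkpointing usually comes down to writing to a temporary file and renaming it over the target. The sketch below shows that pattern with JSON state to stay dependency-free; a real training loop would more likely serialize model and optimizer state with `torch.save`, and the function names here are assumptions. Signal handling for SIGTERM is not covered.

```python
import json
import os
import tempfile
from typing import Any, Dict


def save_checkpoint(path: str, state: Dict[str, Any]) -> None:
    """Write `state` so that readers see either the old file or the new one, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())  # make sure bytes reach disk before the rename
        os.replace(tmp_path, path)  # atomic when source and target share a filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path: str, default: Dict[str, Any]) -> Dict[str, Any]:
    """Return the saved state, or `default` when no checkpoint exists yet."""
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        return default
```

Creating the temporary file in the same directory as the target matters: `os.replace` is only atomic within a single filesystem. Resuming then means loading the last complete state and continuing from its step counter.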

Benchmark result — it actually works

On a real task (Banking77 intent retrieval, 9,993 train / 3,076 test, 77 intents; relevant = same intent), the trained embedder beats the lexical baseline — a measured, reproducible result, not a demo (benchmark/RESULTS.md):

| embedder | hit@1 | hit@5 | MRR | mAP |
| --- | --- | --- | --- | --- |
| hashing (lexical baseline) | 0.769 | 0.922 | 0.835 | 0.503 |
| torch (supervised contrastive) | 0.830 | 0.911 | 0.865 | 0.775 |
| sentence-transformers (reference) | 0.921 | 0.970 | 0.942 | 0.842 |

hit@k is the fraction of queries with at least one same-intent neighbor in the top k (success@k). Against the lexical baseline, the trained model wins on hit@1 (+6 pts), MRR (+3 pts), and mAP (+27 pts), while the baseline edges it out on hit@5 (and on hit@10, which is not shown in the table above). A pretrained sentence-transformers model serves as the reference upper bound. Reproduce in ~80s on CPU:

pip install -e ".[ml,dask,benchmark,st]"
python -m discovery_agents.mlinfra.cli benchmark --full --with-st   # writes benchmark/RESULTS.md
python -m discovery_agents.mlinfra.cli train --smoke                # curate -> train -> export, CPU
DISCOVERY_EMBEDDER=torch discovery-agents                           # the agent RAG uses the trained embedder
discovery-agents --dataset banking77                                # run the agents on real support messages

Training runs CPU-first, but the code is written for multi-GPU, petabyte-scale workloads. Full details are in docs/ml-platform.md.
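For readers unfamiliar with the retrieval metrics in the benchmark table, hit@k and MRR can be computed as in the sketch below. It follows the definition given above (a neighbor is relevant when it shares the query's intent); the function names and the toy labels are made up for illustration and are not taken from the benchmark code.

```python
from typing import List, Sequence, Tuple

# Each result pairs a query's intent label with the labels of its
# retrieved neighbors, best match first.
Result = Tuple[str, Sequence[str]]


def hit_at_k(results: List[Result], k: int) -> float:
    """Fraction of queries with at least one same-intent neighbor in the top k."""
    hits = sum(1 for label, ranked in results if label in ranked[:k])
    return hits / len(results)


def mean_reciprocal_rank(results: List[Result]) -> float:
    """Average of 1/rank of the first same-intent neighbor (0 when there is none)."""
    total = 0.0
    for label, ranked in results:
        for rank, neighbor in enumerate(ranked, start=1):
            if neighbor == label:
                total += 1.0 / rank
                break
    return total / len(results)


toy_results: List[Result] = [
    ("a", ["a", "b", "b"]),  # first relevant neighbor at rank 1
    ("b", ["a", "b", "a"]),  # first relevant neighbor at rank 2
    ("c", ["b", "a", "a"]),  # no relevant neighbor
]
```

On the toy data, hit@1 is 1/3, hit@3 is 2/3, and MRR is (1 + 1/2 + 0) / 3 = 0.5. This also shows why a model can lead on hit@1 and MRR while trailing slightly on hit@5: those metrics reward placing the right neighbor first, not merely somewhere in the top k.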

Capabilities at a glance

| Capability | Where |
| --- | --- |
| Production engineering (typed, tested, observable, CI) | strict `mypy`, `ruff`, `pytest`, GitHub Actions, `observability/` |
| Agentic architectures (ReAct / plan-execute, tools/APIs) | `runtime/agent.py`, `tools/` |
| Provider-agnostic LLM layer + RAG + vector store + LangGraph | `llm/`, `retrieval/`, `runtime/langgraph_adapter.py` |
| Rigorous evaluation (accuracy / safety / latency) | `eval/` harness + LLM-judge + CI regression gate |
| Reliable, observable, safe, auditable | `guardrails/`, `observability/`, decision memory |

Project layout

src/discovery_agents/
  llm/            # LLMClient protocol, mock + Anthropic/Cohere/OpenAI/Gemini adapters, factory
  retrieval/      # embeddings, vector store, evidence index (RAG)
  tools/          # Tool protocol + registry; evidence_search, calculator, web_search
  runtime/        # state machine, ReAct agent loop, discovery graph, LangGraph adapter
  agents/         # the 9 discovery agents (LLM-backed, with deterministic baselines)
  guardrails/     # input/output safety checks + pipeline
  observability/  # trace spans (latency/tokens/cost), cost table, HTML report
  eval/           # metrics, LLM-as-judge, harness, committed baseline.json
  config.py       # RunConfig (provider/model/flags), env-driven
  pipeline.py     # builds the graph + capabilities and runs it
  cli.py          # discovery-agents entry point
  webapp.py       # FastAPI live demo (UI + /api/run + /healthz); webdata.py = demo datasets
  mcp_server.py   # MCP server (discovery-agents-mcp)
tests/            # deterministic, keyless tests (+ an ml-infra suite under tests/mlinfra)
docs/             # architecture, evals, adding-a-tool, mcp + design specs

Development

ruff check src tests && ruff format src tests   # lint + format
mypy                                            # strict type check
pytest -q                                       # tests (keyless, deterministic)

CI runs all of the above plus the eval regression gate on Python 3.9–3.12. See CONTRIBUTING.md.
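A regression gate of the kind CI runs can be as simple as comparing the current scorecard to the committed baseline and failing when any metric drops by more than a tolerance. The sketch below assumes higher-is-better metrics stored as a flat name-to-score mapping; the function name, the tolerance value, and the scorecard shape are assumptions, and docs/evals.md describes the real harness.

```python
from typing import Dict, List


def regression_failures(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.02,
) -> List[str]:
    """List every baseline metric that is missing or has dropped beyond `tolerance`."""
    failures: List[str] = []
    for metric, base in sorted(baseline.items()):
        value = current.get(metric)
        if value is None:
            failures.append(f"{metric}: missing from current scorecard")
        elif value < base - tolerance:
            failures.append(
                f"{metric}: {value:.3f} is below baseline {base:.3f} (tolerance {tolerance})"
            )
    return failures


baseline_scores = {"faithfulness": 0.90, "relevance": 0.80}
current_scores = {"faithfulness": 0.85, "relevance": 0.81}
problems = regression_failures(baseline_scores, current_scores)
```

Here `problems` contains one entry, for faithfulness, because 0.85 falls more than 0.02 below 0.90. A CI step would exit non-zero whenever the list is non-empty, which is only meaningful because the keyless run is deterministic.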

License

MIT — see LICENSE.
