A production-grade, LLM-powered multi-agent workflow for enterprise product discovery — with RAG, a ReAct tool-use loop, a rigorous evaluation harness, guardrails, full observability, and an MCP server. Keyless by default, real on demand.
Given a product brief and a corpus of customer evidence, a graph of nine specialized agents clusters the evidence into insights, generates several evidence-grounded product directions, critiques and scores them, selects a winner, and emits a coding-agent-ready handoff packet — every step traced, evaluated, and guarded.
Why it's built this way. You can run the whole thing in 30 seconds with no
API key (a deterministic mock provider + in-memory retrieval). Set
ANTHROPIC_API_KEY (or --provider cohere/openai) and the same graph runs on a
frontier model. CI, tests, and the eval regression gate all run keyless and
deterministically.
pip install -e ".[dev]"
discovery-agents --output outputs/demo # runs on the deterministic mock provider
discovery-agents --eval # run the evaluation harness + regression gate
pytest -q # deterministic, keyless test suite

outputs/demo/ gets: run_summary.md, coding_agent_handoff.md, canvas.html (visual
canvas + eval dashboard + trace timeline), agent_run.json, and agent_trace.md.
pip install -e ".[anthropic]" # or .[cohere] / .[openai] / .[gemini]
export ANTHROPIC_API_KEY=...
discovery-agents --provider anthropic --model claude-sonnet-4-6 --output outputs/live

Providers are pluggable behind one LLMClient protocol; if a key or SDK is missing, the
factory logs a warning and falls back to the mock, so the workflow always runs. The
reasoning agents (EvidenceInsight, Strategy, Ideation, Critique, Handoff, Memory) call the
model and parse structured JSON; Canvas layout and Selection stay deterministic.
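The fallback behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual factory: the `complete` method, the `ADAPTERS` registry, `make_client`, and the `<PROVIDER>_API_KEY` naming rule are assumptions made for the example.

```python
import logging
import os
from typing import Callable, Dict, Mapping, Optional, Protocol

logger = logging.getLogger("discovery_agents.sketch")


class LLMClient(Protocol):
    """Structural type: any object with a matching complete() qualifies."""

    def complete(self, prompt: str) -> str: ...


class MockLLMClient:
    """Deterministic stand-in: same prompt in, same text out, no network."""

    def complete(self, prompt: str) -> str:
        return f"[mock] received {len(prompt)} characters"


# Hypothetical registry. A real adapter would be registered here only if its
# SDK imported successfully; it is left empty so the sketch stays keyless.
ADAPTERS: Dict[str, Callable[[str], LLMClient]] = {}


def make_client(
    provider: str = "mock", env: Optional[Mapping[str, str]] = None
) -> LLMClient:
    """Return the requested client, or the mock if a key or adapter is missing."""
    env = os.environ if env is None else env
    if provider == "mock":
        return MockLLMClient()
    api_key = env.get(f"{provider.upper()}_API_KEY")  # assumed naming rule
    build = ADAPTERS.get(provider)
    if build is None or not api_key:
        logger.warning("provider %r unavailable; falling back to mock", provider)
        return MockLLMClient()
    return build(api_key)
```

Because callers depend only on the protocol, the same calling code works whether `make_client` returned a real adapter or the mock.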
A deployable web app runs the whole pipeline with the reasoning agents on a real LLM (default Gemini). Keyless by default; bring your own key per request, or set a host key for a rate-limited number of free runs.
pip install -e ".[web,gemini]"
uvicorn discovery_agents.webapp:app --reload # open http://localhost:8000
# optional: a host key enables free server-side runs (visitors can also paste their own)
export GEMINI_API_KEY=... # never commit this

- `GET /`: form. Pick a dataset (enterprise sample / Banking77-style support), a provider, an optional API key, and a goal override; renders the selected direction, all directions with scores, the eval scorecard, decision memory, and trace totals.
- `POST /api/run`: JSON in `{provider?, api_key?, dataset?, brief?, evidence?}`, JSON out `{selected, directions, evals, decision_log, trace, provider_used}`.
- `GET /healthz`: liveness.
Keys are read from the request or the host env only — never logged, stored, or echoed.
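As a rough client-side illustration of the `POST /api/run` contract, the sketch below builds a request with only the standard library. The field names come from the endpoint description; the helper name `build_run_request` and the value types are assumptions, since the request schema is not spelled out here.

```python
import json
from typing import Any, Optional
from urllib import request


def build_run_request(
    base_url: str,
    provider: Optional[str] = None,
    api_key: Optional[str] = None,
    dataset: Optional[str] = None,
    brief: Any = None,      # shape assumed; check the web app docs
    evidence: Any = None,   # shape assumed; check the web app docs
) -> request.Request:
    """Build a POST /api/run request, omitting every field left as None."""
    fields = {
        "provider": provider,
        "api_key": api_key,
        "dataset": dataset,
        "brief": brief,
        "evidence": evidence,
    }
    payload = {name: value for name, value in fields.items() if value is not None}
    return request.Request(
        url=f"{base_url.rstrip('/')}/api/run",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Against a locally running server, the call would look like:
#   req = build_run_request("http://localhost:8000", provider="mock")
#   with request.urlopen(req) as resp:
#       result = json.load(resp)
```

Leaving optional fields out of the payload, rather than sending nulls, keeps a per-request key from appearing in the body unless the caller actually supplied one.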
Container: docker build -f deploy/Dockerfile.web -t discovery-agents-web . then
docker run --rm -p 8000:8000 -e GEMINI_API_KEY=$GEMINI_API_KEY discovery-agents-web.
See docs/webapp.md for deploy notes (Render/Fly/Cloud Run).
flowchart LR
subgraph Interfaces
CLI[CLI] ; MCP[MCP server - stdio]
end
subgraph Runtime["Runtime: typed state machine + ReAct loop"]
SM[StateMachine] --- LG[optional LangGraph adapter]
end
subgraph Agents["9 agents - LLM-backed"]
EI[EvidenceInsight] --> ST[Strategy] --> ID[Ideation] --> CR[Critique] --> CA[Canvas] --> SE[Selection] --> HO[Handoff] --> ME[Memory]
end
subgraph Capabilities
LLM[LLM adapter: Claude/Cohere/GPT/Gemini/Mock]
RAG[Retrieval: embeddings + vector store]
TOOLS[Tools: evidence_search, web_search, calculator]
GR[Guardrails: input/output]
OBS[Observability: latency/tokens/cost]
end
EVAL[Eval harness: LLM-judge + metrics + regression gate]
CLI --> Runtime ; MCP --> Runtime
Runtime --> Agents
Agents -. uses .-> LLM ; Agents -. uses .-> RAG ; ID -. ReAct .-> TOOLS
Agents -. wrapped by .-> GR ; Agents -. emit .-> OBS
Runtime --> EVAL
See docs/architecture.md for the module-by-module breakdown.
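To make the runtime's ordering step concrete, here is a small sketch of topological ordering with cycle detection over the agent chain drawn in the flowchart. It uses the standard-library `graphlib` (Python 3.9+) rather than the project's own `StateMachine`, whose internals are not shown here; `DEPENDENCIES` and `execution_order` are illustrative names.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set

# Each node maps to the set of nodes that must finish before it runs,
# mirroring the arrows between agents in the flowchart.
DEPENDENCIES: Dict[str, Set[str]] = {
    "EvidenceInsight": set(),
    "Strategy": {"EvidenceInsight"},
    "Ideation": {"Strategy"},
    "Critique": {"Ideation"},
    "Canvas": {"Critique"},
    "Selection": {"Canvas"},
    "Handoff": {"Selection"},
    "Memory": {"Handoff"},
}


def execution_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return a valid run order, or raise ValueError if the graph has a cycle."""
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # graphlib reports the offending cycle as the second element of args
        raise ValueError(f"graph contains a cycle: {exc.args[1]}") from exc
```

Rejecting cycles before any agent runs means a misconfigured graph fails fast instead of looping during a paid model call.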
- Provider-agnostic LLM layer (`llm/`): one `LLMClient` protocol; deterministic keyless `MockLLMClient` (default) + lazy Anthropic / Cohere / OpenAI adapters with tool-calling and structured-output support.
- RAG (`retrieval/`): `Embedder` + deterministic `HashingEmbedder`, a `VectorStore` (in-memory cosine + lazy Pinecone adapter), and an `EvidenceIndex` that returns cited passages.
- ReAct runtime (`runtime/`): a typed `StateMachine` (topological ordering, cycle detection, a trace span per node) and an `LLMAgent` plan-execute loop with tool use, step budgets, and full step tracing; plus an optional LangGraph adapter that runs the same graph.
- Tools (`tools/`): a `Tool` protocol + registry, a RAG `evidence_search`, a safe AST `calculator`, and a mockable `web_search`. See `docs/adding-a-tool.md`.
- Guardrails (`guardrails/`): input (PII redaction, prompt-injection block) and output (citation-required, groundedness, schema) checks, each recorded to the trace.
- Observability (`observability/`): a `Trace` of spans carrying latency, token usage, and cost, rendered to Markdown and an HTML timeline.
- Evaluation (`eval/`): deterministic quality metrics + an LLM-as-judge (faithfulness / relevance / helpfulness), aggregated into a scorecard and gated against a committed `baseline.json` in CI. See `docs/evals.md`.
- MCP server (`mcp_server.py`): exposes `discovery_run`, `evidence_search`, and `eval_run` over stdio for Claude Desktop / Claude Code. See `docs/mcp.md`.
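The keyless retrieval path listed above (a deterministic hashing embedder plus in-memory cosine search) might look something like the sketch below. It is an illustration of the general technique only: `hash_embed`, `search`, the 256-dimension default, and the whitespace tokenizer are assumptions, not the project's `HashingEmbedder` or `VectorStore`.

```python
import hashlib
import math
from typing import List, Sequence, Tuple


def hash_embed(text: str, dim: int = 256) -> List[float]:
    """Bag-of-words vector: each token increments one hashed bucket."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    # Unit-normalize so a dot product equals cosine similarity
    return [v / norm for v in vec] if norm else vec


def search(
    query: str, passages: Sequence[str], top_k: int = 2
) -> List[Tuple[float, str]]:
    """Rank passages by cosine similarity to the query, best first."""
    q = hash_embed(query)
    scored = [
        (sum(a * b for a, b in zip(q, hash_embed(p))), p) for p in passages
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

Since the hash is a pure function of the token, the same corpus always yields the same vectors, which is what lets tests and CI run without a model or an API key.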
The retrieval embedder isn't a toy — it can be trained by a production ML-infra
platform and served back through the same Embedder protocol:
- Distributed PyTorch training: a custom SimCSE/InfoNCE training loop with DDP (`gloo` on CPU, `nccl` on GPU) and fault-tolerant, resumable checkpointing (atomic, SIGTERM-safe; resume reproduces the trajectory exactly, which is a tested invariant).
- GPU-native data I/O: a multi-backend tensor archive (NumPy / Zarr / HDF5) with a GPU-native `DataLoader`, plus an I/O benchmark (samples/s, MB/s, p50/p95) and a resource profiler.
- Distributed data curation: a pluggable `Executor` (Local / Dask) that tokenizes and shards a corpus into the archive.
- MLOps: experiment/artifact tracking (MLflow or a keyless JSON tracker), a Dockerfile, docker-compose, and a Kubernetes Indexed-Job training manifest.
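The atomic, resumable checkpointing mentioned in the list can be illustrated with a toy loop. This sketch uses JSON and a made-up integer update instead of PyTorch state, and omits signal handling; `save_checkpoint_atomic`, `load_checkpoint`, and `train` are hypothetical names, not the platform's API.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def save_checkpoint_atomic(state: dict, path: Path) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace is atomic on one filesystem: readers see old or new, never half
        os.replace(tmp_name, path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def load_checkpoint(path: Path, default: dict) -> dict:
    if not path.exists():
        return dict(default)
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


def train(total_steps: int, path: Path, stop_after: Optional[int] = None) -> dict:
    """Toy deterministic loop; stop_after simulates an interruption."""
    state = load_checkpoint(path, {"step": 0, "value": 1})
    ran = 0
    while state["step"] < total_steps:
        if stop_after is not None and ran >= stop_after:
            break
        state = {
            "step": state["step"] + 1,
            "value": (state["value"] * 31 + state["step"]) % 1_000_003,
        }
        save_checkpoint_atomic(state, path)
        ran += 1
    return state
```

The invariant to test is that an interrupted run followed by a resume ends in the same state as an uninterrupted run, which holds here because every step depends only on the checkpointed state.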
On a real task (Banking77 intent retrieval, 9,993 train / 3,076 test, 77 intents; relevant
= same intent), the trained embedder beats the lexical baseline — a measured, reproducible
result, not a demo (benchmark/RESULTS.md):
| embedder | hit@1 | hit@5 | MRR | mAP |
|---|---|---|---|---|
| hashing (lexical baseline) | 0.769 | 0.922 | 0.835 | 0.503 |
| torch (supervised contrastive) | 0.830 | 0.911 | 0.865 | 0.775 |
| sentence-transformers (reference) | 0.921 | 0.970 | 0.942 | 0.842 |
hit@k = fraction of queries with ≥1 same-intent neighbor in top-k (success@k). The trained
model wins hit@1 (+6pts), MRR, and mAP (+27pts) over lexical (which edges it out on
hit@5/@10); a pretrained sentence-transformers model is the reference upper bound. Reproduce
in ~80s on CPU:
pip install -e ".[ml,dask,benchmark,st]"
python -m discovery_agents.mlinfra.cli benchmark --full --with-st # writes benchmark/RESULTS.md
python -m discovery_agents.mlinfra.cli train --smoke # curate -> train -> export, CPU
DISCOVERY_EMBEDDER=torch discovery-agents # the agent RAG uses the trained embedder
discovery-agents --dataset banking77 # run the agents on real support messages

Training is CPU-first, but the code is written for multi-GPU / petabyte scale. Full details in
docs/ml-platform.md.
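For readers unfamiliar with the retrieval metrics in the results table, hit@k and MRR can be computed as in this sketch. It follows the definition given above (a neighbor is relevant if it shares the query's intent); the function names and the toy labels are illustrative and are not taken from the benchmark code.

```python
from typing import Sequence


def hit_at_k(
    ranked_labels: Sequence[Sequence[str]], query_labels: Sequence[str], k: int
) -> float:
    """Fraction of queries with at least one same-label neighbor in the top k."""
    hits = sum(
        1
        for ranked, label in zip(ranked_labels, query_labels)
        if label in ranked[:k]
    )
    return hits / len(query_labels)


def mean_reciprocal_rank(
    ranked_labels: Sequence[Sequence[str]], query_labels: Sequence[str]
) -> float:
    """Average of 1/rank of the first same-label neighbor (0 if none is found)."""
    total = 0.0
    for ranked, label in zip(ranked_labels, query_labels):
        for rank, candidate in enumerate(ranked, start=1):
            if candidate == label:
                total += 1.0 / rank
                break
    return total / len(query_labels)


# Toy data: three queries, each with the labels of its two nearest neighbors
queries = ["card", "loan", "fees"]
neighbors = [["card", "loan"], ["fees", "loan"], ["card", "loan"]]
print(hit_at_k(neighbors, queries, k=1))         # 1 of 3 queries hit at rank 1
print(mean_reciprocal_rank(neighbors, queries))  # (1 + 1/2 + 0) / 3 = 0.5
```

hit@k only asks whether any relevant neighbor appears, while MRR also rewards placing it higher, which is why a model can lead on one and trail on the other.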
| Capability | Where |
|---|---|
| Production engineering (typed, tested, observable, CI) | strict mypy, ruff, pytest, GitHub Actions, observability/ |
| Agentic architectures (ReAct / plan-execute, tools/APIs) | runtime/agent.py, tools/ |
| Provider-agnostic LLM layer + RAG + vector store + LangGraph | llm/, retrieval/, runtime/langgraph_adapter.py |
| Rigorous evaluation (accuracy / safety / latency) | eval/ harness + LLM-judge + CI regression gate |
| Reliable, observable, safe, auditable | guardrails/, observability/, decision memory |
src/discovery_agents/
    llm/ # LLMClient protocol, mock + Anthropic/Cohere/OpenAI/Gemini adapters, factory
    retrieval/ # embeddings, vector store, evidence index (RAG)
    tools/ # Tool protocol + registry; evidence_search, calculator, web_search
    runtime/ # state machine, ReAct agent loop, discovery graph, LangGraph adapter
    agents/ # the 9 discovery agents (LLM-backed, with deterministic baselines)
    guardrails/ # input/output safety checks + pipeline
    observability/ # trace spans (latency/tokens/cost), cost table, HTML report
    eval/ # metrics, LLM-as-judge, harness, committed baseline.json
    config.py # RunConfig (provider/model/flags), env-driven
    pipeline.py # builds the graph + capabilities and runs it
    cli.py # discovery-agents entry point
    webapp.py # FastAPI live demo (UI + /api/run + /healthz); webdata.py = demo datasets
    mcp_server.py # MCP server (discovery-agents-mcp)
tests/ # deterministic, keyless tests (+ an ml-infra suite under tests/mlinfra)
docs/ # architecture, evals, adding-a-tool, mcp + design specs
ruff check src tests && ruff format src tests # lint + format
mypy # strict type check
pytest -q # tests (keyless, deterministic)

CI runs all of the above plus the eval regression gate on Python 3.9–3.12. See
CONTRIBUTING.md.
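The idea behind an eval regression gate can be sketched in a few lines. This is an assumption-laden illustration: it treats the baseline as a flat mapping of metric name to score where higher is better, and the `regression_failures` helper and its tolerance are hypothetical rather than the harness's real interface.

```python
from typing import Dict, List


def regression_failures(
    scorecard: Dict[str, float],
    baseline: Dict[str, float],
    tolerance: float = 0.02,
) -> List[str]:
    """List every baseline metric that is missing or dropped past the tolerance."""
    failures: List[str] = []
    for metric, expected in sorted(baseline.items()):
        actual = scorecard.get(metric)
        if actual is None:
            failures.append(f"{metric}: missing from scorecard")
        elif actual < expected - tolerance:
            failures.append(
                f"{metric}: {actual:.3f} is below baseline {expected:.3f}"
            )
    return failures


# Made-up numbers for illustration only
baseline = {"faithfulness": 0.90, "relevance": 0.80}
scorecard = {"faithfulness": 0.89, "relevance": 0.70}
for line in regression_failures(scorecard, baseline):
    print(line)  # only relevance is reported; faithfulness is within tolerance
```

In CI, a non-empty failure list would translate into a non-zero exit code, so a quality drop blocks the merge the same way a failing unit test does.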
MIT — see LICENSE.