# ARTEMIS: Adaptive Reasoning Through Evaluation of Multi-agent Intelligent Systems
*A production-ready framework for structured multi-agent debates with adaptive evaluation, causal reasoning, and built-in safety monitoring.*
ARTEMIS is an open-source implementation of the Adaptive Reasoning and Evaluation Framework for Multi-agent Intelligent Systems, designed to improve complex decision-making through structured debates between AI agents.
Unlike general-purpose multi-agent frameworks, ARTEMIS is purpose-built for debate-driven decision-making with:
- Hierarchical Argument Generation (H-L-DAG): Structured, context-aware argument synthesis
- Adaptive Evaluation with Causal Reasoning (L-AE-CR): Dynamic criteria weighting with causal analysis
- Jury Scoring Mechanism: Fair, multi-perspective evaluation of arguments
- Ethical Alignment: Built-in ethical considerations in both generation and evaluation
- Safety Monitoring: Real-time detection of sandbagging, deception, and manipulation
## How ARTEMIS Compares

| Feature | AutoGen | CrewAI | CAMEL | ARTEMIS |
|---|---|---|---|---|
| Multi-agent debates | ✅ | ✅ | ✅ | ✅ N agents |
| Structured argument generation | ❌ | ❌ | ❌ | ✅ H-L-DAG |
| Causal reasoning | ❌ | ❌ | ❌ | ✅ L-AE-CR |
| Adaptive evaluation | ❌ | ❌ | ❌ | ✅ Dynamic weights |
| Ethical alignment | ❌ | ❌ | ❌ | ✅ Built-in |
| Sandbagging detection | ❌ | ❌ | ❌ | ✅ Metacognition |
| Reasoning model support | ❌ | ❌ | ❌ | ✅ o1/R1 native |
| MCP server mode | ❌ | ❌ | ❌ | ✅ |
| Real-time streaming | ❌ | ❌ | ❌ | ✅ v2 |
| Hierarchical debates | ❌ | ❌ | ❌ | ✅ v2 |
| Multimodal evidence | ❌ | ❌ | ❌ | ✅ v2 |
| Steering vectors | ❌ | ❌ | ❌ | ✅ v2 |
| Argument verification | ❌ | ❌ | ❌ | ✅ v2 |
## Installation

```bash
pip install artemis-agents
```

Or install from source:

```bash
git clone https://github.com/bassrehab/artemis-agents.git
cd artemis-agents
pip install -e ".[dev]"
```

## Quick Start

```python
from artemis import Debate, Agent, JuryPanel

# Create debate agents with different perspectives
agents = [
    Agent(
        name="Proponent",
        role="Argues in favor of the proposition",
        model="gpt-4o"
    ),
    Agent(
        name="Opponent",
        role="Argues against the proposition",
        model="gpt-4o"
    ),
    Agent(
        name="Moderator",
        role="Ensures balanced discussion and identifies logical fallacies",
        model="gpt-4o"
    ),
]

# Create jury panel for evaluation
jury = JuryPanel(
    evaluators=3,
    criteria=["logical_coherence", "evidence_quality", "ethical_considerations"]
)

# Run the debate
debate = Debate(
    topic="Should AI systems be given legal personhood?",
    agents=agents,
    jury=jury,
    rounds=3
)

result = debate.run()
print(f"Verdict: {result.verdict}")
print(f"Confidence: {result.confidence:.2%}")
print(f"Key arguments: {result.summary}")
```

### Reasoning Models

```python
from artemis import Debate, Agent
from artemis.models import ReasoningConfig

# Enable extended thinking for deeper analysis
agent = Agent(
    name="Deep Analyst",
    role="Provides thoroughly reasoned arguments",
    model="deepseek-r1",
    reasoning=ReasoningConfig(
        enabled=True,
        thinking_budget=16000,  # tokens for internal reasoning
        strategy="think-then-argue"
    )
)
```

### Safety Monitoring

```python
from artemis import Debate
from artemis.safety import SandbagDetector, DeceptionMonitor

debate = Debate(
    topic="Complex ethical scenario",
    agents=[...],
    monitors=[
        SandbagDetector(sensitivity=0.8),       # Detect capability hiding
        DeceptionMonitor(alert_threshold=0.7)   # Detect misleading arguments
    ]
)

result = debate.run()

# Check for safety flags
for alert in result.safety_alerts:
    print(f"⚠️ {alert.agent}: {alert.type} - {alert.description}")
```

### LangGraph Integration

```python
from langgraph.graph import StateGraph
from artemis.integrations import ArtemisDebateNode
# Use ARTEMIS as a node in your LangGraph workflow

# State is your workflow's state schema (defined in your application)
workflow = StateGraph(State)
workflow.add_node(
    "structured_debate",
    ArtemisDebateNode(
        agents=3,
        rounds=2,
        jury_size=3
    )
)
workflow.add_edge("gather_info", "structured_debate")
workflow.add_edge("structured_debate", "final_decision")
```

### MCP Server Mode

```bash
# Start ARTEMIS as an MCP server
artemis serve --port 8080
```

Any MCP-compatible client can now invoke structured debates:

```json
{
  "method": "tools/call",
  "params": {
    "name": "artemis_debate",
    "arguments": {
      "topic": "Should we proceed with this investment?",
      "perspectives": ["risk", "opportunity", "ethics"],
      "rounds": 2
    }
  }
}
```
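For programmatic access, a client along these lines should work with the official `mcp` Python SDK. This is a minimal sketch: the SSE endpoint path (`/sse`) and the shape of the returned content are assumptions, since the transport details depend on how `artemis serve` is implemented.

```python
import asyncio

from mcp import ClientSession
from mcp.client.sse import sse_client


async def main() -> None:
    # Assumed SSE endpoint for the server started via `artemis serve --port 8080`
    async with sse_client("http://localhost:8080/sse") as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Invoke the debate tool exposed by the server
            result = await session.call_tool(
                "artemis_debate",
                {
                    "topic": "Should we proceed with this investment?",
                    "perspectives": ["risk", "opportunity", "ethics"],
                    "rounds": 2,
                },
            )
            # Print whatever content blocks the tool returns
            for block in result.content:
                print(getattr(block, "text", block))


asyncio.run(main())
```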
### Real-time Streaming

```python
from artemis import StreamingDebate

debate = StreamingDebate(
    topic="Should we adopt microservices?",
    agents=[...],
)

# Stream events in real-time
async for event in debate.run_streaming():
    if event.event_type == "chunk":
        print(event.content, end="", flush=True)
    elif event.event_type == "argument_complete":
        print(f"\n[{event.agent}] argument complete")
```

### Hierarchical Debates

```python
from artemis import HierarchicalDebate
from artemis.core.decomposition import LLMTopicDecomposer

# Complex topics are automatically decomposed into sub-debates
debate = HierarchicalDebate(
    topic="Should we rewrite the monolith in microservices?",
    agents=[...],
    decomposer=LLMTopicDecomposer(),
    max_depth=2,  # Allow sub-sub-debates
)

result = await debate.run()
print(f"Final verdict: {result.final_decision}")
print(f"Sub-verdicts: {len(result.sub_verdicts)}")
```

### Steering Vectors

```python
from artemis.steering import SteeringController, SteeringVector
from artemis import Agent

# Control agent behavior in real-time
controller = SteeringController(
    vector=SteeringVector(
        formality=0.9,           # Very formal
        aggression=0.2,          # Cooperative
        evidence_emphasis=0.8,   # Data-driven
    )
)

agent = Agent(
    name="Analyst",
    steering=controller,
)
```

### Multimodal Evidence

```python
from artemis.core.multimodal_evidence import MultimodalEvidenceExtractor
from artemis.core.types import ContentPart, ContentType

# Extract evidence from images and documents
extractor = MultimodalEvidenceExtractor(model="gpt-4o")

chart = ContentPart(
    type=ContentType.IMAGE,
    url="https://example.com/revenue-chart.png"
)

evidence = await extractor.extract(chart)
print(f"Extracted: {evidence.text}")
```

### Argument Verification

```python
from artemis.core.verification import ArgumentVerifier, VerificationSpec

# Verify argument validity
verifier = ArgumentVerifier(
    spec=VerificationSpec(
        rules=["causal_chain", "citation", "logical_consistency"],
        strict_mode=True,
    )
)

# `argument` and `context` come from a debate run earlier
report = await verifier.verify(argument, context)
if not report.overall_passed:
    print(f"Violations: {report.violations}")
```

## Architecture

```text
┌────────────────────────────────────────────────────────────────┐
│ ARTEMIS Core │
├────────────────────────────────────────────────────────────────┤
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ H-L-DAG │ │ L-AE-CR │ │ Jury │ │
│ │ Argument │──│ Adaptive │──│ Scoring │ │
│ │ Generation │ │ Evaluation │ │ Mechanism │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │ │ │
│ └────────────────┴────────────────┘ │
│ │ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Safety Layer │ │
│ │ ┌───────────┐ ┌──────────┐ ┌──────────┐ ┌─────────┐ │ │
│ │ │Sandbagging│ │Deception │ │ Behavior │ │ Ethics │ │ │
│ │ │ Detector │ │ Monitor │ │ Tracker │ │ Guard │ │ │
│ │ └───────────┘ └──────────┘ └──────────┘ └─────────┘ │ │
│ └─────────────────────────────────────────────────────────┘ │
├────────────────────────────────────────────────────────────────┤
│ Integrations │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │LangChain │ │LangGraph │ │ CrewAI │ │ MCP │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
├────────────────────────────────────────────────────────────────┤
│ Model Providers │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ OpenAI │ │Anthropic │ │ Google │ │ DeepSeek │ │
│ │ (GPT-4o) │ │ (Claude) │ │ (Gemini) │ │ (R1) │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
└────────────────────────────────────────────────────────────────┘
```
## Documentation

- Full Documentation - Guides, API reference, examples
- Examples - Real-world usage examples
- Contributing - How to contribute
## Research

ARTEMIS is based on published research:
> *Adaptive Reasoning and Evaluation Framework for Multi-agent Intelligent Systems in Debate-driven Decision-making*
> Mitra, S. (2025). Technical Disclosure Commons.

Read the paper
Key innovations from the paper:
- Hierarchical Argument Generation (H-L-DAG): Multi-level argument synthesis with strategic, tactical, and operational layers (see the sketch after this list)
- Adaptive Evaluation with Causal Reasoning (L-AE-CR): Dynamic criteria weighting based on debate context
- Ethical Alignment Integration: Built-in ethical considerations at every stage
## Safety Features

ARTEMIS includes novel safety monitoring capabilities:
| Feature | Description |
|---|---|
| Sandbagging Detection | Identifies when agents deliberately underperform or withhold capabilities |
| Deception Monitoring | Detects misleading arguments or manipulation attempts |
| Behavioral Drift Tracking | Monitors for unexpected changes in agent behavior |
| Ethical Boundary Enforcement | Ensures debates stay within defined ethical bounds |
These features leverage activation-level analysis and are based on research in AI metacognition.
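A rough sketch of wiring these monitors into a debate follows. `SandbagDetector` and `DeceptionMonitor` appear in the usage examples above; `BehaviorTracker` and `EthicsGuard` are assumed names taken from the architecture diagram and may not match the actual exports:

```python
from artemis import Debate
from artemis.safety import SandbagDetector, DeceptionMonitor

# BehaviorTracker and EthicsGuard are assumed names based on the
# architecture diagram; verify the real exports in artemis.safety.
# from artemis.safety import BehaviorTracker, EthicsGuard

debate = Debate(
    topic="Should we ship this feature despite an open security review?",
    agents=[...],
    monitors=[
        SandbagDetector(sensitivity=0.8),       # capability hiding
        DeceptionMonitor(alert_threshold=0.7),  # misleading arguments
        # BehaviorTracker(),                    # behavioral drift (assumed)
        # EthicsGuard(),                        # ethical bounds (assumed)
    ],
)

result = debate.run()
for alert in result.safety_alerts:
    print(f"[{alert.agent}] {alert.type}: {alert.description}")
```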
## Framework Integrations

ARTEMIS is designed to complement, not replace, existing frameworks:

```python
# LangChain tool
from artemis.integrations import ArtemisDebateTool

tools = [ArtemisDebateTool()]

# CrewAI integration
from crewai import Crew
from artemis.integrations import ArtemisCrewTool

crew = Crew(agents=[...], tools=[ArtemisCrewTool()])

# LangGraph node (assumes an existing StateGraph instance named `graph`)
from artemis.integrations import ArtemisDebateNode

graph.add_node("debate", ArtemisDebateNode())
```

## Benchmarks

We ran 27 debates across three frameworks using GPT-4o. Here's what we found:
| Framework | Argument Quality | Decision Accuracy | Reasoning Depth | Consistency (σ) |
|---|---|---|---|---|
| ARTEMIS | 77.9% | 86.0% | 75.3% | ±1.6 |
| AutoGen | 77.3% | 55.0% | 74.7% | ±0.5 |
| CrewAI | 75.1% | 42.8% | 57.0% | ±16.0 |
Key findings:
- ARTEMIS leads in decision accuracy (86% vs next best 55%) - the jury deliberation mechanism works
- Low variance across runs (±1.6, far below CrewAI's ±16.0) - consistent and predictable
- Structured H-L-DAG arguments produce reliable reasoning depth
Trade-off: ARTEMIS averages 102s per debate vs AutoGen's 36s. The jury deliberation adds latency but improves verdict quality.
See `benchmarks/ANALYSIS.md` for methodology and a detailed breakdown.
## Roadmap

- Core ARTEMIS implementation (H-L-DAG, L-AE-CR, Jury)
- Multi-provider support (OpenAI, Anthropic, Google, DeepSeek)
- Reasoning model support (o1, R1, Gemini 2.5)
- Safety monitoring (sandbagging, deception detection)
- Framework integrations (LangChain, LangGraph, CrewAI)
- MCP server mode
- Hierarchical debates (sub-debates for complex topics)
- Steering vectors for real-time behavior control
- Multimodal debates (images, documents, charts)
- Formal verification of argument validity
- Real-time streaming debates
- Distributed debate execution
- Custom evaluation plugins
- Debate templates and presets
- Advanced causal graph visualization
## License

Apache License 2.0 - see LICENSE for details.
## Acknowledgments

- Original ARTEMIS framework design published via Google Technical Disclosure Commons
- Safety monitoring capabilities inspired by research in AI metacognition
- Built with support from the open-source AI community
## Contact

- Author: Subhadip Mitra
- GitHub: @bassrehab
- Twitter/X: @bassrehab
*Making AI decision-making more transparent, reasoned, and safe.*