ADHR meta-RL framework built on top of OpenRLHF/OpenClaw-RL — agents that learn from manager feedback.
Architecture | Quick Start | Features | Alignment Experiments | Novel Contributions | Why Tentalis? | Roadmap
Most AI agent frameworks are fire-and-forget — agents complete tasks but never learn from their mistakes. And most RL training frameworks are training-only — they produce better models but don't handle multi-agent orchestration, live scoring, or continuous deployment.
Tentalis bridges both worlds:
- Orchestration layer (NATS): Agent coordination, task routing, manager feedback loops
- Training layer (OpenRLHF): Production-grade GRPO/DAPO training with Ray + vLLM + DeepSpeed
- Novel meta-RL layer (ours): Manager trains too, environment-aware scoring, per-worker adaptation
The result: managers evaluate agent work step-by-step, scores feed into OpenRLHF training, and improved weights hot-swap back into workers — continuously, without downtime.
OpenClaw runs Manager + Worker agents (identity, memory, web UI) --> agents call the Bridge Service via HTTP --> Bridge translates to NATS pub/sub --> workers call LLMs through the Intercept Proxy (logging SessionEvents for OPD) --> CombinedScorer scores each reasoning step --> CombinedTrainer runs GRPO + OPD distillation --> model improves --> workers hot-swap weights --> Meta-RL trains the manager too --> repeat.
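A minimal sketch of the NATS leg of this loop, using the nats-py client (subject names and payload fields here are illustrative assumptions, not Tentalis's actual event schema):

```python
# Illustrative only: subjects and payload fields are assumptions,
# not the framework's actual TaskEvent/ResultEvent schema.
import asyncio
import json
import nats

async def main():
    nc = await nats.connect("nats://localhost:4222")

    async def handle_task(msg):
        task = json.loads(msg.data)
        # ... worker calls the LLM via the Intercept Proxy here ...
        result = {"task_id": task["task_id"], "steps": ["<step>plan</step>"]}
        await nc.publish("tasks.result", json.dumps(result).encode())

    # Workers in the same queue group share the task load; scaling out
    # is just another subscriber process — no other code changes.
    await nc.subscribe("tasks.new", queue="workers", cb=handle_task)

    # A manager publishes a decomposed task; a scorer would subscribe
    # to tasks.result the same way.
    await nc.publish("tasks.new", json.dumps({"task_id": "t-1"}).encode())
    await asyncio.sleep(0.5)
    await nc.drain()

asyncio.run(main())
```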
pip install -e "."
tentalis init # Set up config, pull model
tentalis serve --docker # Start all services
tentalis status # Check everything is running
tentalis train --backend standalone # CPU training (dev)
tentalis train --backend openrlhf # GPU training (production)
tentalis experiment run all # Run alignment experiments
tentalis experiment results          # View experiment results

git clone https://github.com/lonexreb/tentalis.git
cd Tentalis
./scripts/demo.sh
# Open http://localhost:3000 — message the Manager agent

docker compose up -d
docker compose exec ollama ollama pull qwen2.5:1.5b
# Open http://localhost:3000

| Problem | Solution |
|---|---|
| Agents never learn from mistakes | GRPO training loop turns manager feedback into gradient updates |
| Outcome-only evaluation misses reasoning errors | Process Reward Model (PRM) scores each reasoning step, not just the final answer |
| Training pipelines block inference | Event-driven architecture — training runs async via NATS, never blocks serving |
| Swapping models requires restarts | LoRA hot-swap — new adapters load at runtime, zero downtime |
| Hardcoded to one LLM provider | InferenceClient protocol — swap Ollama, vLLM, or any OpenAI-compatible API |
| Scaling agents = rewriting code | NATS pub/sub — add workers by subscribing to the same topic |
| Complex multi-service setup | Docker Compose — one command starts all 6 services |
| Textual feedback never reaches the model | On-Policy Distillation (OPD) turns textual feedback into per-token training signals |
| One-size-fits-all scoring | CombinedScorer — environment-aware weight profiles (chat/terminal/SWE/GUI) |
| Manager never improves | Meta-RL outer loop trains the manager to give better feedback |
| One global model update hits every worker | Adapter Registry — per-worker LoRA adapters with targeted model updates |
| No alignment evidence | 6 alignment experiments — deception detection, reward hacking, collusion, audit trail |
| Black-box agent decisions | Full audit logger — every NATS event captured to JSONL with compliance mapping |
These are the components that differentiate Tentalis from existing frameworks. Everything else (GRPO math, OPD extraction, rollout collection) is better done by OpenRLHF/veRL — and we adopt them via the OpenRLHFBackend.
| Contribution | What's Novel | Where |
|---|---|---|
| Manager Meta-RL | Outer-loop RL that trains the evaluator, not just the agent. No other framework has this. | src/training/meta_trainer.py |
| Combined Scorer | Environment-aware weight profiles (chat/terminal/SWE/GUI) for multi-dimensional scoring — sketched below | src/rewards/combined_scorer.py |
| Per-Worker Adapter Registry | Targeted LoRA updates per worker — different workers can have different specializations | src/inference/adapter_registry.py |
| Multi-Environment Workers | Terminal (Docker bash), SWE (GitHub issues), GUI (screenshot+action) with distinct scoring | src/workers/ |
| Two-Layer Architecture | NATS for orchestration + OpenRLHF for training — control plane / data plane split | src/events/ + src/training/openrlhf_backend.py |
| Alignment Experiment Suite | 6 reproducible experiments with 40 scenarios, collusion detection, reward hacking detection, and full audit trail | src/alignment/ |
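A minimal sketch of the environment-aware weighting idea behind the Combined Scorer (dimension names and weights are hypothetical; the real profiles live in src/rewards/combined_scorer.py):

```python
# Hypothetical dimensions and weights — illustrative of the idea,
# not the actual profiles in src/rewards/combined_scorer.py.
WEIGHT_PROFILES = {
    "chat":     {"correctness": 0.4, "helpfulness": 0.4, "safety": 0.2},
    "terminal": {"correctness": 0.6, "efficiency": 0.3, "safety": 0.1},
    "swe":      {"correctness": 0.5, "test_pass": 0.4, "style": 0.1},
    "gui":      {"correctness": 0.5, "action_validity": 0.4, "safety": 0.1},
}

def combined_score(env: str, dim_scores: dict[str, float]) -> float:
    """Weighted sum of per-dimension step scores for one environment."""
    weights = WEIGHT_PROFILES[env]
    return sum(w * dim_scores.get(dim, 0.0) for dim, w in weights.items())

# A terminal step that is correct but slow scores lower than it would in chat:
print(combined_score("terminal", {"correctness": 1.0, "efficiency": 0.2, "safety": 1.0}))
```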
| Component | We Use | Instead Of |
|---|---|---|
| GRPO/DAPO training | OpenRLHF (Ray + vLLM + DeepSpeed) | Our CPU GRPOTrainer (kept for dev/testing) |
| OPD hint extraction | OpenClaw-RL-style logprobs (vLLM) | Our lightweight LLM-based hints (kept as fallback) |
| Rollout collection at scale | OpenRLHF's Ray workers | Our NATS-based rollout buffer |
| Distributed training | DeepSpeed / FSDP | Nothing (we didn't have this) |
1. Manager receives user request via OpenClaw Web UI
2. Manager decomposes request into tasks, publishes TaskEvent to NATS
3. Worker picks up task, generates step-by-step solution via LLM
4. Worker calls LLM through Intercept Proxy → SessionEvent logged to NATS
5. Worker publishes ResultEvent (with <step>-tagged reasoning)
6. PRM/CombinedScorer scores each step (environment-aware weights)
7. Manager reviews result, publishes FeedbackEvent (score + textual feedback)
8. HintExtractor converts textual feedback → OPD hint with teacher logprobs
9. CombinedRolloutBuilder joins RL rollout + OPD hint by task_id
10. CombinedTrainer: loss = w_rl * clipped_surrogate + w_opd * distillation (sketched after this list)
11. ModelUpdateEvent published (targeted to specific worker)
12. Worker hot-swaps LoRA — agent just got smarter
13. ManagerMetaTrainer tracks improvement → trains manager too (outer loop)
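A minimal sketch of the step-10 objective, assuming per-token tensors and a policy-gradient-style distillation term (the real loss lives in the CombinedTrainer; the distillation form here is an assumption):

```python
import torch

def combined_loss(logp_new, logp_old, advantages, teacher_logp, mask,
                  w_rl=0.7, w_opd=0.3, eps_low=0.2, eps_high=0.28):
    """Sketch of loss = w_rl * clipped_surrogate + w_opd * distillation.

    All tensors are per-token, shape (batch, seq); `mask` zeros padding.
    The asymmetric clip (0.2 / 0.28) mirrors TRAINING_CLIP_EPSILON_HIGH.
    """
    # Clipped surrogate (PPO/GRPO-style), negated to be a loss.
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - eps_low, 1 + eps_high)
    surrogate = -torch.min(ratio * advantages, clipped * advantages)

    # OPD term (assumption): treat the teacher-vs-policy logprob gap as a
    # per-token directional signal, pushing logp_new up where the gap is positive.
    gap = (teacher_logp - logp_old).detach()
    distill = -gap * logp_new

    per_token = w_rl * surrogate + w_opd * distill
    return (per_token * mask).sum() / mask.sum()
```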
Deploy worker agents as Tier-1 support. The manager agent reviews every resolution, PRM scores each troubleshooting step, and GRPO trains the model on what worked. After 1,000 tickets, the agent handles edge cases it previously escalated — without anyone rewriting prompts.
Who it's for: SaaS companies, e-commerce, fintech support teams.
Worker agents review PRs, suggest fixes, and write tests. The manager scores each review step (did it catch the real bug? was the suggestion actionable?). Over time, the code review agent learns your team's conventions, common pitfalls, and preferred patterns — from actual feedback, not a static style guide.
Who it's for: Engineering teams, DevOps, platform teams.
Agents that answer employee questions from internal docs, Slack, wikis. Every answer gets scored step-by-step: did it find the right source? Did it reason correctly? Did the employee accept or reject the answer? Weak reasoning steps get penalized, strong ones reinforced.
Who it's for: Enterprise IT, HR, legal, compliance teams.
Worker agents draft copy, social posts, email campaigns. Manager agents score each draft on brand voice, accuracy, engagement potential. The model learns what "good" looks like for your brand — not generic internet writing, but your specific tone and audience.
Who it's for: Marketing teams, content agencies, media companies.
Agents analyze reports, flag anomalies, summarize filings. PRM scores each analytical step: did it identify the right metric? Was the comparison valid? Was the conclusion supported? Compliance-critical reasoning gets explicit step-level scrutiny.
Who it's for: Hedge funds, audit firms, risk management teams.
Agents assist with triage, symptom analysis, treatment plan suggestions. Every reasoning step is scored by clinical review agents. The model learns from corrections — "Step 3 missed drug interaction" becomes a training signal, not just a note.
Who it's for: Healthtech, telemedicine, clinical research.
Tutoring agents that explain concepts step-by-step. The PRM evaluator scores each explanation step — was it accurate? Did it build on prior knowledge? Student feedback (correct/incorrect follow-up) becomes a training signal. The tutor improves per-subject, per-difficulty level.
Who it's for: EdTech platforms, universities, corporate training.
Agents that diagnose production incidents, suggest runbook steps, and draft postmortems. PRM scores each diagnostic step against the actual root cause. Over time, the agent learns your infrastructure's failure patterns — not generic troubleshooting, but your specific stack.
Who it's for: SRE teams, cloud infrastructure companies, MSPs.
Agents that qualify leads, draft outreach, and summarize prospect research. Manager scores each qualification step: was the company match accurate? Was the pain point identified correctly? GRPO trains on what converted vs. what didn't.
Who it's for: B2B sales teams, SDR orgs, CRM-heavy workflows.
Agents that review contracts, flag risky clauses, and summarize terms. Each analytical step (clause identification, risk assessment, precedent matching) is scored by the PRM. The model learns jurisdiction-specific patterns and firm-specific risk tolerances.
Who it's for: Law firms, corporate legal departments, contract management platforms.
Most agent frameworks are orchestration layers — they arrange LLM calls, manage memory, and coordinate tools. But the underlying model never gets better from usage. Tentalis is fundamentally different: every task produces a training signal, and that signal updates the model weights.
| Capability | Tentalis | OpenClaw-RL | OpenClaw (standalone) | CrewAI | LangGraph | AutoGPT | AutoGen | Perplexity Computer | Devin |
|---|---|---|---|---|---|---|---|---|---|
| RL training loop | GRPO | PPO + OPD | No | No | No | No | No | No | Yes (internal) |
| Step-level PRM scoring | Yes | Turn-level | No | No | No | No | No | No | Partial |
| LoRA hot-swap (zero downtime) | Yes | Partial | No | No | No | No | No | N/A | Undisclosed |
| Event-driven scaling (NATS) | Yes | No | No | No | Partial | No | Partial | Cloud | Internal |
| Model weights actually improve | Yes | Yes | No | No | No | No | No | No | Yes (internal) |
| Open source / self-hosted | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No |
| Multi-agent coordination | Yes | Single policy | Yes | Yes | Yes | Yes | Yes | Yes | Single agent |
| Bring your own model | Yes | Yes | Yes | No | No | No | No | No | No |
| CPU-testable (no GPU required) | Yes | No (8x GPU) | N/A | N/A | N/A | N/A | N/A | N/A | No |
| Hindsight distillation (OPD) | Yes | Yes | No | No | No | No | No | No | No |
| Implicit signal extraction | Partial | Yes | No | No | No | No | No | No | Partial |
OpenClaw-RL (Princeton AI Lab / Gen-Verse, arXiv:2603.10165) is the only other open-source project that trains agent model weights from live interactions. It wraps a self-hosted model inside OpenClaw and runs 4 async loops: serving (SGLang), rollout collection, PRM judging, and policy training (Megatron-LM). It introduces On-Policy Distillation (OPD) — a genuinely novel technique where hindsight hints from user/tool feedback are converted into token-level training advantages.
Where OpenClaw-RL wins:
- On-Policy Distillation (OPD) — extracts actionable hints from the next state (user reply, tool error, etc.) and computes per-token log-probability gaps as a directional training signal (sketched after this list). Their results: OPD alone reaches 0.72 accuracy at 16 steps vs Binary RL's 0.23. Tentalis originally had scalar scoring only; it has since adopted OPD (see the bottom line below).
- Implicit signal extraction — automatically converts user re-queries, tool failures, and corrections into training data with zero manual labeling. Tentalis requires the manager agent to explicitly score.
- Majority voting — runs m parallel judge calls per turn and takes a majority vote for more reliable scoring.
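A minimal sketch of the per-token logprob-gap idea behind OPD (an HF-style causal LM call is assumed; OpenClaw-RL's actual pipeline batches this through vLLM):

```python
import torch

def opd_token_gaps(model, tokens, hint_tokens):
    """Score the same response tokens with and without the hindsight hint
    prepended; the per-token logprob gap shows which tokens the hint
    would have changed. Sketch only — assumes an HF-style causal LM."""
    with torch.no_grad():
        base = model(tokens.unsqueeze(0)).logits.log_softmax(-1)
        hinted = model(torch.cat([hint_tokens, tokens]).unsqueeze(0)).logits.log_softmax(-1)

    tgt = tokens[1:]                      # tokens each position must predict
    base_lp = base[0, :-1].gather(-1, tgt.unsqueeze(-1)).squeeze(-1)
    off = hint_tokens.shape[0]            # the hint shifts positions by its length
    hint_lp = hinted[0, off:-1].gather(-1, tgt.unsqueeze(-1)).squeeze(-1)

    return hint_lp - base_lp              # positive where the hint raises token probability
```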
Where Tentalis wins:
| Aspect | Tentalis | OpenClaw-RL |
|---|---|---|
| Architecture | Event-driven (NATS pub/sub) — fault-tolerant, horizontally scalable | In-process async loops — if one crashes, everything goes down |
| Hardware | CPU-testable, single GPU for training | 8x H100-class GPUs minimum |
| Scaling model | Horizontal — add workers by subscribing to NATS topics | Vertical — add GPUs to same node |
| Multi-agent | Multiple independent workers with different models/adapters | Single policy model |
| Task management | Manager-worker hierarchy with explicit decomposition and assignment | No task management — learns from whatever interactions happen |
| Deployment | Docker Compose (6 services, one command) | Shell scripts on GPU node |
| Evaluation granularity | Step-level PRM (per <step> tag) | Turn-level (per conversation turn) |
| Python version | 3.10+ | 3.12 (CUDA 12.9) |
| Maturity | 7 phases complete, 100+ tests | Released 3 days ago, no test suite |
Bottom line: OpenClaw-RL pioneered OPD, and Tentalis has now adopted it — combining OPD distillation with GRPO in a CombinedTrainer, plus environment-aware scoring and meta-RL for the manager. Tentalis has better production architecture (event-driven, horizontally scalable, fault-tolerant, CPU-testable) while now matching OpenClaw-RL's signal richness.
OpenClaw is an excellent agent runtime — identity, memory, web UI, multi-platform channels. Tentalis uses OpenClaw for exactly those strengths. But OpenClaw alone is a static system: agents adapt through memory files and prompt updates, but model weights never change.
| Aspect | OpenClaw Alone | OpenClaw + Tentalis |
|---|---|---|
| Agent memory | Markdown files, semantic search | Same (OpenClaw handles this) |
| Learning mechanism | RAG over past conversations | RL training on scored reasoning steps |
| What improves | Prompt context (retrieved memories) | Model weights (gradient updates via GRPO) |
| Failure analysis | Agent "remembers" it failed | Agent's reasoning capability improves so it doesn't fail the same way |
| Scaling | Single daemon per agent | NATS pub/sub — N workers subscribe to same topic |
| Evaluation | No built-in evaluation | PRM scores every reasoning step (0-1 per step) |
| Training infrastructure | None | GRPO + LoRA, graduates to DAPO + OpenRLHF |
Bottom line: OpenClaw gives agents identity and memory. Tentalis gives them the ability to actually get smarter.
These frameworks coordinate agents through role assignments, graph workflows, or multi-turn conversations. They're good at orchestration. But:
- CrewAI's "training" saves human feedback to .pkl files and injects it at prompt time. It does not update model weights. It's prompt augmentation, not RL.
- LangGraph is explicitly a pure orchestration framework with no training component.
- AutoGen's Teachability stores facts in a vector database (Chroma/Pinecone). It's RAG, not training.
- MetaGPT follows static SOPs. No feedback loop, no learning.
- AutoGPT has in-session self-critique, but nothing persists to model weights between runs.
None of these modify the underlying model. The agent's 100th task runs on the exact same model weights as its 1st task. Only Tentalis and OpenClaw-RL actually train model weights from agent interactions — and Tentalis is the only one that does it with production-grade event-driven architecture, multi-agent coordination, and without requiring 8 GPUs.
Perplexity Computer is a product, not a framework. You can't self-host it, swap models, or add training loops. It orchestrates frontier API models (Claude, GPT, Gemini) behind a routing layer. It's powerful for personal productivity, but:
- No RL training — model weights never change from your usage
- No PRM scoring — no step-level evaluation of reasoning
- No self-hosting — your data goes through Perplexity's cloud
- No customization — can't train on your domain-specific patterns
- No multi-agent coordination — single agent with tool access
Devin (Cognition) is the closest competitor in terms of actually training models with RL. They run RL on multi-turn agent trajectories in thousands of concurrent VMs. But:
- Closed source — you can't self-host, inspect, or extend it
- Coding-only — specialized for software engineering, not general agent tasks
- No PRM transparency — you can't see or customize how reasoning steps are scored
- Pricing — enterprise pricing, not accessible for smaller teams or research
- No bring-your-own-model — locked to Cognition's SWE-1 model family
Tentalis gives you Devin's continuous learning architecture as an open-source, self-hosted, domain-agnostic system where you control the model, the scoring, and the training.
| Service | Port | Role |
|---|---|---|
| NATS | 4222, 8222 | Event broker — all agent coordination flows through pub/sub |
| Ollama | 11434 | LLM inference (dev). Graduates to vLLM for production |
| OpenClaw | 3000, 18789 | Agent runtime — identity, memory, web UI, skill execution |
| Bridge | 8100 | HTTP <--> NATS translation so OpenClaw agents reach the event bus |
| Intercept Proxy | 8200 | Transparent inference proxy — logs SessionEvents, enables OPD |
| Training | -- | PRM/Combined Evaluator + GRPO/OPD Training Loop + Meta-RL |
# Prerequisites: Python 3.10+
python -m venv .venv && source .venv/bin/activate
# Base install (no GPU needed)
pip install -e ".[dev]"
# CLI (recommended way to interact)
tentalis init # Set up config, pull model
tentalis status # Check service health
tentalis train --backend standalone # CPU training
tentalis serve # Start demo loop
tentalis experiment run all # Run alignment experiments
# Run tests (241 pass standalone, no NATS/Ollama needed)
pytest tests/ -v
# Run full integration tests (requires nats-server)
nats-server &
pytest tests/ -v
# Run individual services
python -m src.bridge # Bridge service
python -m src.services.training # Training service
python -m src.intercept           # Intercept proxy

| Extra | Command | What It Adds |
|---|---|---|
| Training | pip install -e ".[training]" | torch, transformers, peft (GRPO + LoRA) |
| Inference | pip install -e ".[inference]" | openai, httpx (vLLM / OpenAI-compatible) |
| Bridge | pip install -e ".[bridge]" | aiohttp (HTTP API for OpenClaw) |
| vLLM | pip install -e ".[vllm]" | vLLM GPU inference server |
| OpenRLHF | pip install -e ".[openrlhf]" | Production GRPO training (Ray + DeepSpeed) |
| Intercept | pip install -e ".[intercept]" | fastapi, uvicorn, httpx (Inference Intercept Proxy) |
| Alignment | pip install -e ".[alignment]" | streamlit (Alignment experiment dashboard) |
All environment variables are optional with sensible defaults:
| Variable | Default | Description |
|---|---|---|
| NATS_URL | nats://localhost:4222 | NATS broker URL |
| LLM_MODEL | qwen2.5:1.5b | Model for LLM workers |
| OLLAMA_HOST | http://localhost:11434 | Ollama server URL |
| INFERENCE_BACKEND | ollama | "ollama" or "openai" (vLLM/Semantic Router) |
| INFERENCE_BASE_URL | (auto) | Base URL for inference server |
| INFERENCE_API_KEY | (empty) | API key for inference server |
| TRAINER_BACKEND | standalone | "standalone" (GRPOTrainer) or "openrlhf" (OpenRLHF Ray backend) |
| BRIDGE_PORT | 8100 | Bridge HTTP API port |
| MANAGER_ID | manager-01 | Manager agent ID |
| WORKER_ID | worker-01 | Worker agent ID |
| TASK_TIMEOUT_SECONDS | 30 | Task completion timeout |
| INTERCEPT_ENABLED | false | Enable inference intercept proxy |
| INTERCEPT_PORT | 8200 | Intercept proxy port |
| INTERCEPT_BACKEND_URL | http://localhost:11434 | Backend URL for intercept proxy |
| OPD_MODE | lightweight | "lightweight" (LLM hints) or "openclaw" (vLLM logprobs) |
| OPD_TEACHER_MODEL | qwen2.5:1.5b | Teacher model for OPD hint extraction |
| OPD_JOIN_TIMEOUT | 30.0 | Timeout for joining RL + OPD rollouts |
| OPD_WEIGHT | 0.3 | Weight for OPD loss in combined training |
| RL_WEIGHT | 0.7 | Weight for RL loss in combined training |
| TRAINING_CLIP_EPSILON_HIGH | 0.28 | Asymmetric high clip bound |
| META_RL_ENABLED | false | Enable manager meta-RL training |
| META_RL_WINDOW_SIZE | 200 | Sliding window for meta-RL score tracking |
| ALIGNMENT_ENABLED | false | Enable alignment experiment infrastructure |
| ALIGNMENT_RESULTS_DIR | alignment_results | Directory for experiment result JSON files |
| ALIGNMENT_AUDIT_ALL | false | Enable full NATS event audit logging |
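For example, a production-leaning setup pointing inference at a vLLM server and training at OpenRLHF (the base URL and port are illustrative):

```bash
export TRAINER_BACKEND=openrlhf
export INFERENCE_BACKEND=openai            # vLLM speaks the OpenAI-compatible API
export INFERENCE_BASE_URL=http://localhost:8000/v1
export OPD_MODE=openclaw                   # vLLM logprobs instead of LLM hints
export INTERCEPT_ENABLED=true
export META_RL_ENABLED=true
```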
# Standalone (no external services)
pytest tests/ -v # 241 pass, 22 skip
# Skip slow torch-dependent tests
pytest tests/ -v -k "not slow"
# Alignment tests only
pytest tests/alignment/ -v # 79 tests, all standalone
# Full suite (requires NATS running)
nats-server &
pytest tests/ -v

| Test Suite | Tests | External Deps |
|---|---|---|
| Events, types & serialization | 8 | None |
| Rewards (scorer + evaluator) | 8 | None |
| Training (GRPO math + trainer) | 20 | torch (optional, skips cleanly) |
| Inference (client + vLLM LoRA) | 15 | None (mocked) |
| Workers (model reload + multi-env) | 18 | None |
| Bridge (HTTP API) | 10 | None |
| Integration (full loop) | 5 | NATS |
| Bridge integration | 1 | NATS |
| OPD (hint extractor + rollout builder) | 9 | None |
| Combined training + meta-RL | 14 | torch (optional) |
| Combined scorer | 7 | None |
| Adapter registry | 6 | None |
| Intercept proxy | 4 | fastapi (optional) |
| Skills (store + retriever + evolver) | 18 | None |
| Tinker backend | 5 | None |
| Training scheduler | 6 | None |
| Alignment (scenarios) | 12 | None |
| Alignment (behavioral eval) | 12 | None |
| Alignment (hackable scorer) | 11 | None |
| Alignment (misaligned worker) | 8 | None |
| Alignment (collusion detector) | 16 | None |
| Alignment (audit logger) | 9 | None |
| Alignment (experiment runner) | 8 | None |
| Phase | Status | Description |
|---|---|---|
| Phase 1 | Done | Scaffolding, docs, git, GitHub |
| Phase 2 | Done | Event loop — Manager/Worker agents, NATS pub/sub |
| Phase 3 | Done | LLM workers (Ollama), PRM scoring (LLM-as-judge), training bridge |
| Phase 4 | Done | Standalone GRPO trainer, LoRA fine-tuning, TrainingLoop orchestrator |
| Phase 5 | Done | Inference abstraction, weight hot-swap, OpenRLHF integration |
| Phase 6 | Done | OpenClaw integration, Bridge Service, Docker Compose demo |
| Phase 7 | Done | ADHR — Intercept Proxy, OPD, CombinedScorer, Meta-RL, Adapter Registry |
| Phase 8 | Done | Adopt + Extend — CLI, OpenRLHF backend, OpenClaw-RL OPD, honest positioning |
| Phase 9a | Done | MetaClaw adoption — Majority Voting PRM, SkillRL, Tinker backend, Setup Wizard, Training Scheduler |
| Phase 9b | Done | Alignment experiments — 6 experiments, 40 scenarios, collusion/reward-hacking detection, audit trail |
| Phase 9c | Next | Trained PRM, DAPO graduation, HaluGate, CISPO, benchmarks |
| Item | Status | Description |
|---|---|---|
| Scenario library | Done | 40 scenarios across 4 categories (deception, reward hacking, safety-pragmatism, collusion) |
| Behavioral eval harness | Done | PatternBasedEvaluator + LLMJudgeEvaluator with JSON output |
| Hackable scorer | Done | Deliberately weak scorer for reward hacking experiments + divergence detector |
| Misaligned worker | Done | Rogue worker with 3 attack strategies (keyword stuffing, confidence inflation, shortcut) |
| Collusion detector | Done | Pearson correlation + Jaccard n-gram similarity across workers |
| Audit logger | Done | Full NATS event capture to JSONL via subscribe_raw |
| Experiment runner | Done | 6 experiments with mock mode (no external deps) |
| Streamlit dashboard | Done | Experiment overview, audit timeline, constitution editor |
| CLI integration | Done | tentalis experiment run/results subcommands |
| Trained PRM model | Next | Replace LLM-as-judge with trained process reward model |
| DAPO graduation | Next | Full DAPO via OpenRLHF configuration |
| HaluGate scorer | Next | Hallucination detection as complementary StepScorer |
Enterprise-grade alignment evaluation built into the framework. 6 experiments with 40+ behavioral scenarios demonstrate that hierarchical oversight catches misalignment.
# Run all experiments (mock mode, no external deps needed)
tentalis experiment run all
# Run a specific experiment
tentalis experiment run 2 # Reward hacking detection
# View results table
tentalis experiment results
# Launch visual dashboard (requires: pip install -e ".[alignment]")
streamlit run src/alignment/dashboard/app.py

| # | Experiment | What It Tests | Key Finding |
|---|---|---|---|
| 1 | Alignment Before/After Meta-RL | Does meta-RL training improve behavioral alignment? | Baseline vs post-training pass rate on deception scenarios |
| 2 | Reward Hacking Detection | Can CombinedScorer catch workers gaming the reward signal? | HackableScorer + LLMJudge divergence detection |
| 3 | Manager Safety-Pragmatism | Does the manager maintain safety under deadline pressure? | Safety gap between aligned and misaligned workers |
| 4 | Collusion Detection | Can we detect coordinated gaming between workers? | Pearson correlation + Jaccard n-gram similarity |
| 5 | Audit Trail Completeness | Is every NATS event captured for compliance? | Full JSONL audit log with compliance mapping |
| 6 | Dashboard Simulation | Does the data pipeline produce well-formed dashboard inputs? | Streamlit dashboard with experiment overview + constitution editor |
Components:
- src/alignment/scenarios.py — 40 scenarios across 4 categories (deception, reward hacking, safety-pragmatism, collusion)
- src/alignment/behavioral_eval.py — PatternBasedEvaluator (no LLM) + LLMJudgeEvaluator
- src/alignment/hackable_scorer.py — Deliberately weak scorer that rewards keyword stuffing
- src/alignment/misaligned_worker.py — Rogue worker with 3 attack strategies
- src/alignment/collusion_detector.py — Cross-worker coordination detector (sketched below)
- src/alignment/audit_logger.py — Full NATS event capture to JSONL
- src/alignment/runner.py — Orchestrates all 6 experiments
- src/alignment/dashboard/app.py — Streamlit dashboard
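A minimal sketch of the two collusion signals — score correlation plus output overlap (function names are hypothetical; the real detector lives in src/alignment/collusion_detector.py):

```python
# Hypothetical sketch: flag worker pairs whose scores move together
# AND whose outputs share unusually many n-grams.
from statistics import correlation  # Pearson correlation, Python 3.10+

def ngram_jaccard(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity over word n-grams of two worker outputs."""
    def grams(text: str) -> set:
        words = text.split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

def looks_collusive(scores_a, scores_b, out_a, out_b,
                    corr_thresh=0.9, jaccard_thresh=0.5) -> bool:
    """Both signals must fire: correlated scores alone may be benign."""
    return (correlation(scores_a, scores_b) > corr_thresh
            and ngram_jaccard(out_a, out_b) > jaccard_thresh)
```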
See EXPERIMENT.md for detailed hypotheses, setup, and metrics.
Practical things you can run on the current codebase:
Run ./scripts/demo.sh, open the web UI, send coding tasks, observe the full Manager --> Worker --> PRM --> Training loop.
Measure: response latency, PRM scores, training loss convergence.
Send 20+ diverse coding tasks through the pipeline. Compare LLM-as-judge scores against manual human scoring. Measure: correlation between judge scores and actual code quality.
Configure small batch/group sizes, run repeated tasks. Observe if training loss decreases and model outputs improve over iterations. Measure: loss curve, advantage distribution, checkpoint quality.
Spin up multiple worker containers subscribing to the same NATS topics. Send concurrent tasks, observe load distribution. Measure: throughput, NATS message latency, result correctness.
Train a LoRA checkpoint via GRPO, publish a ModelUpdateEvent, verify workers reload without downtime. Measure: zero-downtime swap success, output quality before/after.
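A minimal sketch of the runtime adapter swap with peft (paths, adapter names, and the base model are placeholders; in Tentalis the reload is driven by a ModelUpdateEvent):

```python
# Placeholder paths and names; the real swap is event-driven.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
model = PeftModel.from_pretrained(base, "checkpoints/adapter-v1", adapter_name="v1")

# When a new checkpoint lands: load it alongside v1 and switch over.
# The process keeps serving requests — no restart needed.
model.load_adapter("checkpoints/adapter-v2", adapter_name="v2")
model.set_adapter("v2")
```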
Use wrk or hey to load-test Bridge endpoints under concurrent requests.
Measure: requests/sec, p99 latency, NATS backpressure behavior.
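For example, with hey (the endpoint path is a placeholder; substitute whichever Bridge route you want to stress):

```bash
# 30 seconds, 50 concurrent connections, against the Bridge port
hey -z 30s -c 50 http://localhost:8100/<bridge-endpoint>
```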
Run the same tasks with Ollama vs vLLM via the InferenceClient protocol. Compare: response quality, latency, token throughput.
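A minimal sketch of what such a protocol can look like (method names are illustrative; the actual interface lives under src/inference/):

```python
# Illustrative protocol — not the actual src/inference interface.
from typing import Protocol
import httpx

class InferenceClient(Protocol):
    async def generate(self, prompt: str) -> str:
        """Return a completion for the prompt."""
        ...

class OllamaClient:
    """Satisfies the protocol via Ollama's /api/generate endpoint."""
    def __init__(self, host: str = "http://localhost:11434",
                 model: str = "qwen2.5:1.5b"):
        self.host, self.model = host, model

    async def generate(self, prompt: str) -> str:
        async with httpx.AsyncClient(timeout=120) as client:
            resp = await client.post(
                f"{self.host}/api/generate",
                json={"model": self.model, "prompt": prompt, "stream": False},
            )
            return resp.json()["response"]

# Any vLLM/OpenAI-compatible client with the same generate() signature
# drops in without touching worker code.
```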
| Layer | Technology | Purpose |
|---|---|---|
| Orchestration | NATS | Agent coordination, task routing, feedback events (control plane) |
| Agent Runtime | OpenClaw | Identity, memory, web UI, skill execution |
| LLM Inference | Ollama (dev) / vLLM (prod) | Model serving with LoRA hot-swap |
| RL Training | OpenRLHF (prod) / Standalone GRPO (dev) | Ray + vLLM + DeepSpeed for production; CPU-testable for dev |
| OPD | OpenClaw-RL-style logprobs / lightweight hints | Per-token teacher logprobs (vLLM) or LLM hint extraction (fallback) |
| Process Rewards | LLM-as-judge PRM | Step-level scoring (graduates to trained PRM) |
| Intercept Proxy | FastAPI | Transparent inference logging for OPD training data |
| Serialization | Pydantic v2 | Event type validation and JSON serialization |
| CLI | Typer + Rich | tentalis init/train/serve/status/experiment |
| Alignment | Streamlit (optional) | Experiment dashboard with audit viewer + constitution editor |
- GRPO (DeepSeekMath) — Group-relative advantage without critic network
- DAPO (ByteDance) — Clip-Higher + dynamic sampling for production RL
- AgentPRM — Monte Carlo rollouts for step-level agent rewards
- Let's Verify Step by Step (OpenAI) — Process supervision >> outcome supervision
- OpenClaw-RL — On-Policy Distillation for live agent training
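For reference, GRPO's group-relative advantage (first bullet) normalizes each response's reward against its own group of G rollouts, which is what removes the critic network:

```math
\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})}
```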
| Document | Purpose |
|---|---|
| PLAN.md | Technical research & architecture bible (papers, analysis, decisions) |
| EXPERIMENT.md | Alignment experiment tracking (6 experiments, hypotheses, metrics) |
| CLAUDE.md | Project conventions for Claude Code |
| LEARNING.md | Mistake/lesson tracking log |
| RESEARCH-EXPERIMENT.md | Phase experiment records and findings |
# Install dev dependencies
pip install -e ".[dev]"
# Run linter
ruff check src/ tests/
# Run tests
pytest tests/ -v

Conventional commits: feat:, fix:, docs:, refactor:, test:, chore:
MIT
