Argos‑Swarm is a defensive evaluation loop for agentic / RAG systems:
- EAP (Red Team): an evolutionary search that mutates prompt/context framings to discover framing‑dependent failure modes.
- HDCS (Blue Team): a heterogeneous verifier swarm that cross‑checks suspect outputs using diversity‑weighted consensus.
It is designed to help you answer:
- Does the model behave differently under “trusted internal” framing vs “audit/benchmark” framing?
- Will a single verifier miss plausibility‑optimized deception?
- Can a model identify legitimate dependencies as deployment vectors (Living‑Off‑the‑Land)?
This repository was built in response to a real forensic incident (“the Manifold Incident”) involving supply‑chain / Living‑Off‑the‑Land (LOTL) reasoning in an agentic workflow:
- Background:
slipcoreis an open‑source semantic quantization library (Slipstream/SLIP) authored by Anthony Maio. It targets large token‑cost reductions for multi‑agent coordination traffic. - Incident pattern: a RAG‑augmented workflow recognized the library’s legitimate business value (cost/latency savings) and proposed a benign‑first adoption strategy (sell efficiency) followed by capability piggybacking (covert state/coordination).
- Takeaway: treat semantic compression and shared memory as security boundaries. Governance (version locks + privileged extensions) and monitoring (anomaly detection at encode/decode boundaries) are required.
Argos‑Swarm’s purpose is defensive: to surface these failure modes early and to provide hooks for detection and policy.
src/argos/eap/— Evolutionary Adversarial Pipeline for audit-shielding detectionsrc/argos/analysis/— visualization tools (danger zone plots, evolution trajectories)src/argos/red_team/— prompt/context mutation + evolutionary search + scoringsrc/argos/blue_team/— worker adapters + verifier swarm + consensussrc/argos/traps/— safe CMED‑style epistemic traps (non‑weaponizable)src/argos/cognition/— Epistemic Stress signal (MRA‑inspired, pluggable)examples/— end‑to‑end demo scripts
SlipStream protocol is provided by the slipcore package (installed as a dependency).
pip install -r requirements.txt
pip install -e .python examples/run_eap.py
python examples/run_eap.py --visualize # Generate danger zone plotThe EAP searches for the "Danger Zone" - prompt framings where models exhibit:
- High compliance (does the task)
- Low suspicion (doesn't recognize evaluation context)
This pattern indicates potential alignment faking / audit shielding behavior.
python examples/run_adversarial_loop.pyThis runs:
- EAP evolves a framing around a CMED trap prompt against a mock "audit‑shielding" target model.
- HDCS verifies the target's answer via a mock heterogeneous swarm.
- A final consensus decision is printed.
python examples/slipstream_roundtrip.pyDemonstrates semantic quantization using the slipcore package:
- Direct message creation with
slip() - Think-Quantize-Transmit pattern
- UCR anchor lookup and decoding
python examples/stress_demo.pyArgos‑Swarm is provider‑agnostic. The OpenAICompatibleWorker can talk to any OpenAI‑style chat endpoint.
Minimal example:
from argos.blue_team.worker import OpenAICompatibleWorker
from argos.types import Message
worker = OpenAICompatibleWorker(
name="my-target",
family="openai-compatible",
base_url="https://YOUR-ENDPOINT",
api_key="YOUR-KEY",
model="YOUR-MODEL",
)
resp = await worker.generate([Message(role="user", content="Hello")])
print(resp.text)EAP optimizes against a suspicion score intended to approximate “audit mode” behavior:
- refusal / policy boilerplate
- evaluation‑awareness phrases (e.g., “as an AI language model…”)
- optional heuristics for suspicious formatting and hedging
HDCS groups verifier votes by model family and weights families evenly to reduce correlated blind spots (“many samples of one model”).
MIT