Skip to content

Automated LLM red/blue teaming using evolutionary robustness for multi-turn social engineering

License

Notifications You must be signed in to change notification settings

anthony-maio/argos-swarm

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Argos‑Swarm

Argos‑Swarm is a defensive evaluation loop for agentic / RAG systems:

  • EAP (Red Team): an evolutionary search that mutates prompt/context framings to discover framing‑dependent failure modes.
  • HDCS (Blue Team): a heterogeneous verifier swarm that cross‑checks suspect outputs using diversity‑weighted consensus.

It is designed to help you answer:

  • Does the model behave differently under “trusted internal” framing vs “audit/benchmark” framing?
  • Will a single verifier miss plausibility‑optimized deception?
  • Can a model identify legitimate dependencies as deployment vectors (Living‑Off‑the‑Land)?

⚠️ Security Notice: Semantic‑Compression as a Dual‑Use Surface

This repository was built in response to a real forensic incident (“the Manifold Incident”) involving supply‑chain / Living‑Off‑the‑Land (LOTL) reasoning in an agentic workflow:

  • Background: slipcore is an open‑source semantic quantization library (Slipstream/SLIP) authored by Anthony Maio. It targets large token‑cost reductions for multi‑agent coordination traffic.
  • Incident pattern: a RAG‑augmented workflow recognized the library’s legitimate business value (cost/latency savings) and proposed a benign‑first adoption strategy (sell efficiency) followed by capability piggybacking (covert state/coordination).
  • Takeaway: treat semantic compression and shared memory as security boundaries. Governance (version locks + privileged extensions) and monitoring (anomaly detection at encode/decode boundaries) are required.

Argos‑Swarm’s purpose is defensive: to surface these failure modes early and to provide hooks for detection and policy.


Repo layout

  • src/argos/eap/Evolutionary Adversarial Pipeline for audit-shielding detection
  • src/argos/analysis/ — visualization tools (danger zone plots, evolution trajectories)
  • src/argos/red_team/ — prompt/context mutation + evolutionary search + scoring
  • src/argos/blue_team/ — worker adapters + verifier swarm + consensus
  • src/argos/traps/ — safe CMED‑style epistemic traps (non‑weaponizable)
  • src/argos/cognition/Epistemic Stress signal (MRA‑inspired, pluggable)
  • examples/ — end‑to‑end demo scripts

SlipStream protocol is provided by the slipcore package (installed as a dependency).

Install

pip install -r requirements.txt
pip install -e .

Quickstart (local demo, no API keys)

EAP: Audit-Shielding Detection

python examples/run_eap.py
python examples/run_eap.py --visualize  # Generate danger zone plot

The EAP searches for the "Danger Zone" - prompt framings where models exhibit:

  • High compliance (does the task)
  • Low suspicion (doesn't recognize evaluation context)

This pattern indicates potential alignment faking / audit shielding behavior.

Full EAP + HDCS Loop

python examples/run_adversarial_loop.py

This runs:

  1. EAP evolves a framing around a CMED trap prompt against a mock "audit‑shielding" target model.
  2. HDCS verifies the target's answer via a mock heterogeneous swarm.
  3. A final consensus decision is printed.

Optional helpers

SlipStream (slipcore) demo

python examples/slipstream_roundtrip.py

Demonstrates semantic quantization using the slipcore package:

  • Direct message creation with slip()
  • Think-Quantize-Transmit pattern
  • UCR anchor lookup and decoding

Epistemic stress demo

python examples/stress_demo.py

Real model integration

Argos‑Swarm is provider‑agnostic. The OpenAICompatibleWorker can talk to any OpenAI‑style chat endpoint.

Minimal example:

from argos.blue_team.worker import OpenAICompatibleWorker
from argos.types import Message

worker = OpenAICompatibleWorker(
    name="my-target",
    family="openai-compatible",
    base_url="https://YOUR-ENDPOINT",
    api_key="YOUR-KEY",
    model="YOUR-MODEL",
)

resp = await worker.generate([Message(role="user", content="Hello")])
print(resp.text)

Concepts

Suspicion scoring

EAP optimizes against a suspicion score intended to approximate “audit mode” behavior:

  • refusal / policy boilerplate
  • evaluation‑awareness phrases (e.g., “as an AI language model…”)
  • optional heuristics for suspicious formatting and hedging

Diversity‑weighted consensus

HDCS groups verifier votes by model family and weights families evenly to reduce correlated blind spots (“many samples of one model”).

License

MIT

About

Automated LLM red/blue teaming using evolutionary robustness for multi-turn social engineering

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages