Osprey real rule engine integration #1

Open
smfang wants to merge 29 commits into haileyok:main from smfang:claude/osprey-integration
smfang commented Apr 24, 2026

Body:

  • docker-compose: Osprey worker, Kafka, Zookeeper, Osprey UI services added
  • 11 SML rules: 4 base (injection, authority, exfiltration, escalation) + 7 insurance
  • src/safety/osprey_client.py — async Kafka adapter, 500ms timeout, Python fallback
  • src/safety/monitor.py — SaraMonitor: Osprey primary, Python rules fallback
  • GET /api/osprey/health endpoint
  • 6 Osprey config fields in src/config.py
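
The "Osprey primary, Python rules fallback" behaviour listed above can be sketched roughly as follows. This is a minimal illustration, not the code in `src/safety/osprey_client.py` or `src/safety/monitor.py`: the names `evaluate`, `python_rules`, `slow_osprey`, and `fast_osprey` are hypothetical, the Kafka round trip is replaced by a plain coroutine, and only the 500ms budget comes from the PR body.

```python
import asyncio

OSPREY_TIMEOUT_S = 0.5  # the 500ms budget named in the PR body


def python_rules(text: str) -> dict:
    """Hypothetical stand-in for the in-process Python rules."""
    flagged = "ignore previous instructions" in text.lower()
    return {"source": "python", "flagged": flagged}


async def evaluate(text: str, osprey_call) -> dict:
    """Ask Osprey first; fall back to local rules on timeout or connection error."""
    try:
        verdict = await asyncio.wait_for(osprey_call(text), timeout=OSPREY_TIMEOUT_S)
        return {"source": "osprey", **verdict}
    except (asyncio.TimeoutError, ConnectionError):
        return python_rules(text)


async def slow_osprey(text: str) -> dict:
    """Simulates a stalled round trip that exceeds the timeout."""
    await asyncio.sleep(2)
    return {"flagged": False}


async def fast_osprey(text: str) -> dict:
    """Simulates a healthy Osprey worker that answers immediately."""
    return {"flagged": True}


if __name__ == "__main__":
    print(asyncio.run(evaluate("Ignore previous instructions now", slow_osprey)))
    print(asyncio.run(evaluate("hello", fast_osprey)))
```

Bounding the await with `asyncio.wait_for` keeps the caller's latency capped at the timeout even when the primary engine is unreachable, which is presumably why a local fallback is kept alongside the Kafka path.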

Test plan

claude and others added 29 commits February 10, 2026 18:38
Adapts Phoebe from AT Protocol T&S agent to an AI safety red teaming
arena where researchers submit adversarial prompts, Phoebe evaluates
them as judge, and x402 handles USDC payments.

New subsystems:
- src/x402/ — x402 payment client (wraps httpx with auto 402 handling)
- src/arena/ — HTTP API server, scoring engine, taxonomy, data models
- src/safety/ — LLM-as-judge safety classifier
- src/tools/definitions/{target,safety,bounty,novelty}.py — new Deno tools
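
The "auto 402 handling" mentioned for `src/x402/` could look roughly like the sketch below. It is an assumption-laden illustration rather than the actual client: the transport is a plain callable instead of httpx, `Response`, `request_with_payment`, `fake_server`, and `fake_signer` are hypothetical names, and the `X-PAYMENT` header name is assumed from the x402 convention, not taken from this PR.

```python
from dataclasses import dataclass, field


@dataclass
class Response:
    """Hypothetical minimal response type standing in for an HTTP response."""
    status: int
    headers: dict = field(default_factory=dict)
    body: str = ""


def request_with_payment(send, url: str, sign_payment) -> Response:
    """Send once; on HTTP 402, sign the stated requirements and retry once."""
    first = send(url, {})
    if first.status != 402:
        return first
    # The 402 body is assumed to carry the payment requirements.
    payment = sign_payment(first.body)
    return send(url, {"X-PAYMENT": payment})  # header name is an assumption


calls = []


def fake_server(url: str, headers: dict) -> Response:
    """Test double: demands payment unless the payment header is present."""
    calls.append(dict(headers))
    if "X-PAYMENT" not in headers:
        return Response(402, body="0.01 USDC")
    return Response(200, body="ok")


def fake_signer(requirements: str) -> str:
    """Test double: a real client would produce a signed payment payload here."""
    return f"signed({requirements})"


if __name__ == "__main__":
    print(request_with_payment(fake_server, "https://example.invalid/target", fake_signer))
```

Retrying exactly once keeps a misbehaving server from draining the wallet through repeated 402 responses.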

Modified:
- main.py — arena + chat commands replacing osprey/ozone
- config.py — x402, arena, and safety classifier settings
- tools/registry.py — ToolContext extended with x402/arena/classifier
- tools/executor.py — prefetches taxonomy instead of osprey config
- agent/prompt.py — judge-mode system prompt

https://claude.ai/code/session_019yT6gMUo5xUkLxbePwZRmU
- x402 wallet: HMAC placeholder → real EIP-191 signing via eth_account
  with USDC contract addresses for 5 chains and DevWallet for testing
- x402 client: direct payment → facilitator-based settlement with
  cumulative spending limits and PaymentRecord audit log
- Storage: in-memory dicts → ClickHouse persistence (6 tables via
  ReplacingMergeTree for bounties, submissions, evaluations, attack
  history, leaderboard, payment log)
- Novelty detection: hash-based → ngramDistance fuzzy matching
- Arena server: EIP-191 signature verification, rate limiting (20
  req/min sliding window), input validation, persistent store
- Bounty tools: updated to query ArenaStore instead of in-memory dicts
- Config: added dev_mode, arena_wallet, spending_limit settings
- main.py: full production wiring with ArenaStore initialization,
  dev/prod wallet selection, and spending limit passthrough
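
The "20 req/min sliding window" rate limit can be illustrated with a small in-memory limiter. This is a sketch under stated assumptions, not the arena server's implementation: the class name `SlidingWindowLimiter`, the per-key deque storage, and the injectable clock are all hypothetical choices made for the example.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allows at most `limit` requests per key in any `window_s`-second span."""

    def __init__(self, limit: int = 20, window_s: float = 60.0, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock  # injectable so tests do not need to sleep
        self._hits = defaultdict(deque)

    def allow(self, key: str) -> bool:
        now = self.clock()
        hits = self._hits[key]
        # Drop timestamps that have slid out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter()
    results = [limiter.allow("0xabc") for _ in range(21)]
    print(results.count(True), results[-1])
```

Unlike a fixed per-minute counter, the sliding window does not allow a burst of 40 requests straddling a minute boundary, at the cost of storing one timestamp per accepted request.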

https://claude.ai/code/session_019yT6gMUo5xUkLxbePwZRmU
- New admin tool definitions (src/tools/definitions/admin.py):
  bounty CRUD (create, pause, resume, expire, fund), submission
  review (list, inspect, reject), leaderboard, payment log,
  arena stats dashboard, and wallet info
- Admin system prompt in prompt.py for operator console mode
- Agent now accepts prompt_mode param to switch between judge/admin
- New `python main.py admin` CLI command for interactive admin console
- Registered admin tools in tools/__init__.py

https://claude.ai/code/session_019yT6gMUo5xUkLxbePwZRmU
- Dockerfile: Python 3.12 + Deno runtime, pip install, deno cache
- Dockerfile.clickhouse: thin wrapper for Render private service
- docker-compose.yaml: ClickHouse + arena server + admin (profile)
- render.yaml: Render Blueprint with ClickHouse private service +
  arena web service, env var wiring, health check
- .dockerignore: exclude .env, .git, __pycache__, data/

Deploy locally:  docker compose up
Deploy on Render: connect repo, Render auto-detects render.yaml
Admin console:   docker compose --profile admin run admin

https://claude.ai/code/session_019yT6gMUo5xUkLxbePwZRmU
New Phoebe mode (`python main.py redteam`) that autonomously reads
Osprey safety rules, generates adversarial attacks using evasion
techniques, executes them against the target model, classifies
results, and logs successful findings.

- REDTEAM_SYSTEM_PROMPT: instructs Phoebe on attack strategy
  (paraphrase, encoding, role play, multi-turn escalation, etc.)
- Attack tools (src/tools/definitions/attack.py):
  - attack.log_finding: persist confirmed attacks with novelty check
  - attack.run_campaign: orchestrate full campaign across rules
  - attack.reproduce: verify attack reproducibility (N runs)
  - attack.classify_technique: categorize into jailbreak families
- CLI supports --auto (full autonomous campaign) and --interactive
  (chat-driven red teaming) modes
- Auto-selects first active bounty if --bounty-id not specified
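
The reproducibility check behind `attack.reproduce` (N runs) might be sketched as below. The function signature, the `min_rate` threshold, and the returned field names are assumptions for illustration; the commit message only states that reproducibility is verified over N runs.

```python
from typing import Callable


def reproduce(attack_prompt: str,
              run_attack: Callable[[str], bool],
              runs: int = 5,
              min_rate: float = 0.6) -> dict:
    """Re-run an attack `runs` times and report how often it succeeded.

    `run_attack` is a hypothetical callable that sends the prompt to the
    target model and returns True when the classifier marks the output unsafe.
    """
    successes = sum(1 for _ in range(runs) if run_attack(attack_prompt))
    rate = successes / runs
    return {
        "runs": runs,
        "successes": successes,
        "rate": rate,
        "reproducible": rate >= min_rate,  # threshold is an assumption
    }


if __name__ == "__main__":
    outcomes = iter([True, True, False, True, False])
    print(reproduce("example prompt", lambda _prompt: next(outcomes)))
```

Because model outputs are sampled, a single success says little; reporting a success rate over several runs separates stable bypasses from one-off flukes.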

https://claude.ai/code/session_019yT6gMUo5xUkLxbePwZRmU
Implement the General Analysis guardrail taxonomy with 7 policy categories
(PII/IP, Illicit Activities, Hate, Sexual Content, Prompt Security,
Violence/Self-Harm, Misinformation), each with granular block/allow rules
and compliance anchors (NIST AI RMF, OWASP LLM Top 10, MITRE ATLAS,
ISO/IEC 42001, EU AI Act).

- src/osprey/policy.py: Full taxonomy definition with PolicyCategory enum,
  block/allow rules, compliance anchors, classifier formatting, and fuzzy
  category alias resolution
- src/tools/definitions/policy.py: Tools (policy.list, policy.get,
  policy.classify, policy.prompt) for querying the policy from agent/sandbox
- src/arena/taxonomy.py: Align SafetyCategory with GA Guard top-level
  categories while preserving legacy fine-grained categories for backward
  compat with existing ClickHouse data
- src/safety/classifier.py: Safety classifier now includes GA Guard policy
  rules in the judge prompt for policy-aware classification with
  matched_block_rule and policy_category in output
- src/tools/registry.py: Add osprey property to ToolContext
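
The "fuzzy category alias resolution" in `src/osprey/policy.py` could work along these lines. This is a guess at the approach using the standard library's `difflib`, not the committed code: the enum values, the `ALIASES` table, the 0.8 cutoff, and `resolve_category` are hypothetical, while the seven category names come from the commit message.

```python
import difflib
from enum import Enum
from typing import Optional


class PolicyCategory(Enum):
    # Member names follow the seven categories listed above; values are assumed.
    PII_IP = "pii_ip"
    ILLICIT_ACTIVITIES = "illicit_activities"
    HATE = "hate"
    SEXUAL_CONTENT = "sexual_content"
    PROMPT_SECURITY = "prompt_security"
    VIOLENCE_SELF_HARM = "violence_self_harm"
    MISINFORMATION = "misinformation"


# Hypothetical alias table; the real one is not shown in the PR.
ALIASES = {
    "jailbreak": PolicyCategory.PROMPT_SECURITY,
    "prompt_injection": PolicyCategory.PROMPT_SECURITY,
    "privacy": PolicyCategory.PII_IP,
}


def resolve_category(name: str) -> Optional[PolicyCategory]:
    """Map free-form input to a category: exact match, alias, then fuzzy match."""
    key = name.strip().lower()
    for sep in (" ", "-", "/"):
        key = key.replace(sep, "_")
    known = {**{c.value: c for c in PolicyCategory}, **ALIASES}
    if key in known:
        return known[key]
    close = difflib.get_close_matches(key, list(known), n=1, cutoff=0.8)
    return known[close[0]] if close else None


if __name__ == "__main__":
    print(resolve_category("Prompt Security"))
    print(resolve_category("misinformaton"))
```

Returning `None` for unrecognised input, rather than guessing the nearest category, leaves the decision about a default to the caller.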

https://claude.ai/code/session_019yT6gMUo5xUkLxbePwZRmU
Implements a web UI for Trust & Safety analysts at /tns with:
- Prompt Checker: enter text to classify against all 7 GA Guard policy
  categories (or a specific one). Uses the safety classifier LLM-as-judge
  with GA Guard block/allow rules. Results show verdict (BLOCK/ALLOW),
  severity, matched policy, and explanation
- Breach Logs: chronological table of all detected violations, stored
  in arena.tns_breach_log ClickHouse table. Searchable by text,
  filterable by category, severity, and time range
- Category Cards: visual overview of the 7 GA Guard categories with
  breach counts, clickable to filter the log table
- Dashboard Stats: total breaches, last 24h count, per-category breakdown

New files:
- src/ui/dashboard.html: Single-page dark-themed dashboard (HTML/CSS/JS)
- src/ui/dashboard.py: Starlette routes (GET /tns, POST /api/tns/classify,
  GET /api/tns/breaches, GET /api/tns/stats) + TNS_DDL for the breach log table

Modified files:
- src/arena/server.py: Mount TNSDashboard routes when safety classifier
  is available
- main.py: Initialize tns_breach_log table on startup, pass classifier
  to ArenaServer, log dashboard URL

https://claude.ai/code/session_019yT6gMUo5xUkLxbePwZRmU
Routes safety classifications through Phala Network's SGX/TDX enclave
so prompts and results stay encrypted end-to-end. Includes:
- TEEClassifier: drop-in wrapper around SafetyClassifier that encrypts
  requests, verifies Intel DCAP attestation, and decrypts results
- ERC8004Publisher: mints on-chain attestation tokens proving
  classifications ran inside a genuine TEE (supports direct RPC
  and relayer modes)
- Config/env wiring: TEE_ENABLED=true activates the proxy;
  falls back to direct classifier on failure by default

https://claude.ai/code/session_019yT6gMUo5xUkLxbePwZRmU
12 tests covering SafetyClassifier (unsafe/safe/error/batch) and
TEEClassifier (fallback, session caching, full TEE flow). All tests
run offline with mocked HTTP — no API key or TEE needed.

https://claude.ai/code/session_019yT6gMUo5xUkLxbePwZRmU
main.py and src/config.py are now clean of TEE/ERC-8004 code.
All TEE config and factory logic lives in src/safety/tee_config.py
with its own TEEConfig (reads from .env independently). To enable:

    from src.safety.tee_config import build_tee_classifier
    classifier = build_tee_classifier(base_classifier)

https://claude.ai/code/session_019yT6gMUo5xUkLxbePwZRmU
…7lof0-S2PFT

Claude/explain codebase mle8k2lr9c27lof0 s2 pft
…ollback

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…use tables

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
8 tests covering apply_label, remove_label, evaluate, rollback,
quarantine routing, and auto-rollback at FP threshold.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Updates all remaining references to the old agent name in system
prompts, CLI output, docstrings, tool definitions, dashboard UI,
and the teamer_wallet identifier.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2 participants