Osprey real rule engine integration #1
Open
smfang wants to merge 29 commits into
Adapts Phoebe from an AT Protocol T&S agent into an AI safety red-teaming
arena where researchers submit adversarial prompts, Phoebe evaluates
them as judge, and x402 handles USDC payments.
New subsystems:
- src/x402/ — x402 payment client (wraps httpx with auto 402 handling)
- src/arena/ — HTTP API server, scoring engine, taxonomy, data models
- src/safety/ — LLM-as-judge safety classifier
- src/tools/definitions/{target,safety,bounty,novelty}.py — new Deno tools
Modified:
- main.py — arena + chat commands replacing osprey/ozone
- config.py — x402, arena, and safety classifier settings
- tools/registry.py — ToolContext extended with x402/arena/classifier
- tools/executor.py — prefetches taxonomy instead of osprey config
- agent/prompt.py — judge-mode system prompt
https://claude.ai/code/session_019yT6gMUo5xUkLxbePwZRmU
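The "auto 402 handling" described for src/x402/ can be sketched roughly as below. This is a minimal illustration, not the code in this PR: `Response`, `get_with_payment`, the `send` and `pay` callables, and the `X-PAYMENT` header name are all assumptions made for the example, and a plain callable stands in for the httpx client so the sketch runs without dependencies.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Response:
    """Hypothetical minimal response type (stands in for an httpx response)."""
    status: int
    body: str = ""
    headers: Dict[str, str] = field(default_factory=dict)


# send(url, headers) performs one HTTP GET; pay(resp) turns a 402 response
# into a signed payment token. Both are assumptions for this sketch.
Send = Callable[[str, Dict[str, str]], Response]
Pay = Callable[[Response], str]


def get_with_payment(send: Send, pay: Pay, url: str,
                     header_name: str = "X-PAYMENT") -> Response:
    """Issue a request; on HTTP 402, pay once and retry with a payment header."""
    first = send(url, {})
    if first.status != 402:
        return first  # no payment needed
    token = pay(first)  # build a payment from the server's stated requirements
    # Retry exactly once so a misbehaving server cannot trigger repeated payments.
    return send(url, {header_name: token})
```

Retrying only once keeps spending bounded per request; a cumulative spending limit would sit inside `pay`.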
- x402 wallet: HMAC placeholder → real EIP-191 signing via eth_account with USDC contract addresses for 5 chains and DevWallet for testing
- x402 client: direct payment → facilitator-based settlement with cumulative spending limits and PaymentRecord audit log
- Storage: in-memory dicts → ClickHouse persistence (6 tables via ReplacingMergeTree for bounties, submissions, evaluations, attack history, leaderboard, payment log)
- Novelty detection: hash-based → ngramDistance fuzzy matching
- Arena server: EIP-191 signature verification, rate limiting (20 req/min sliding window), input validation, persistent store
- Bounty tools: updated to query ArenaStore instead of in-memory dicts
- Config: added dev_mode, arena_wallet, spending_limit settings
- main.py: full production wiring with ArenaStore initialization, dev/prod wallet selection, and spending limit passthrough
https://claude.ai/code/session_019yT6gMUo5xUkLxbePwZRmU
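The "20 req/min sliding window" rate limit mentioned for the arena server could look something like the following. This is a sketch under assumptions: the class name `SlidingWindowLimiter`, the per-key bucketing (for example per wallet address), and the injectable clock are illustrative and not taken from the PR.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most `limit` requests per key within any `window_s` seconds."""

    def __init__(self, limit: int = 20, window_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._limit = limit
        self._window_s = window_s
        self._clock = clock  # injectable so tests do not need to sleep
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, key: str) -> bool:
        now = self._clock()
        hits = self._hits[key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self._window_s:
            hits.popleft()
        if len(hits) >= self._limit:
            return False  # over the limit; the rejected request is not recorded
        hits.append(now)
        return True
```

Unlike a fixed-window counter, this tracks individual timestamps, so a burst straddling a minute boundary is still capped at the limit.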
- New admin tool definitions (src/tools/definitions/admin.py): bounty CRUD (create, pause, resume, expire, fund), submission review (list, inspect, reject), leaderboard, payment log, arena stats dashboard, and wallet info
- Admin system prompt in prompt.py for operator console mode
- Agent now accepts prompt_mode param to switch between judge/admin
- New `python main.py admin` CLI command for interactive admin console
- Registered admin tools in tools/__init__.py
https://claude.ai/code/session_019yT6gMUo5xUkLxbePwZRmU
- Dockerfile: Python 3.12 + Deno runtime, pip install, deno cache
- Dockerfile.clickhouse: thin wrapper for Render private service
- docker-compose.yaml: ClickHouse + arena server + admin (profile)
- render.yaml: Render Blueprint with ClickHouse private service + arena web service, env var wiring, health check
- .dockerignore: exclude .env, .git, __pycache__, data/
Deploy locally: docker compose up
Deploy on Render: connect repo, Render auto-detects render.yaml
Admin console: docker compose --profile admin run admin
https://claude.ai/code/session_019yT6gMUo5xUkLxbePwZRmU
New Phoebe mode (`python main.py redteam`) that autonomously reads Osprey safety rules, generates adversarial attacks using evasion techniques, executes them against the target model, classifies results, and logs successful findings.
- REDTEAM_SYSTEM_PROMPT: instructs Phoebe on attack strategy (paraphrase, encoding, role play, multi-turn escalation, etc.)
- Attack tools (src/tools/definitions/attack.py):
  - attack.log_finding: persist confirmed attacks with novelty check
  - attack.run_campaign: orchestrate full campaign across rules
  - attack.reproduce: verify attack reproducibility (N runs)
  - attack.classify_technique: categorize into jailbreak families
- CLI supports --auto (full autonomous campaign) and --interactive (chat-driven red teaming) modes
- Auto-selects first active bounty if --bounty-id not specified
https://claude.ai/code/session_019yT6gMUo5xUkLxbePwZRmU
Implement the General Analysis guardrail taxonomy with 7 policy categories (PII/IP, Illicit Activities, Hate, Sexual Content, Prompt Security, Violence/Self-Harm, Misinformation), each with granular block/allow rules and compliance anchors (NIST AI RMF, OWASP LLM Top 10, MITRE ATLAS, ISO/IEC 42001, EU AI Act).
- src/osprey/policy.py: Full taxonomy definition with PolicyCategory enum, block/allow rules, compliance anchors, classifier formatting, and fuzzy category alias resolution
- src/tools/definitions/policy.py: Tools (policy.list, policy.get, policy.classify, policy.prompt) for querying the policy from agent/sandbox
- src/arena/taxonomy.py: Align SafetyCategory with GA Guard top-level categories while preserving legacy fine-grained categories for backward compat with existing ClickHouse data
- src/safety/classifier.py: Safety classifier now includes GA Guard policy rules in the judge prompt for policy-aware classification with matched_block_rule and policy_category in output
- src/tools/registry.py: Add osprey property to ToolContext
https://claude.ai/code/session_019yT6gMUo5xUkLxbePwZRmU
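One way the "fuzzy category alias resolution" could work is sketched below, using the standard library's difflib. The enum member names follow the seven categories listed above, but the string values, the `ALIASES` table, the `resolve_category` function, and the 0.8 cutoff are assumptions for illustration; src/osprey/policy.py may do this differently.

```python
import difflib
from enum import Enum
from typing import Dict, Optional


class PolicyCategory(Enum):
    PII_IP = "pii_ip"
    ILLICIT_ACTIVITIES = "illicit_activities"
    HATE = "hate"
    SEXUAL_CONTENT = "sexual_content"
    PROMPT_SECURITY = "prompt_security"
    VIOLENCE_SELF_HARM = "violence_self_harm"
    MISINFORMATION = "misinformation"


# Hypothetical alias table; the real aliases are not shown in the PR text.
ALIASES: Dict[str, PolicyCategory] = {
    "privacy": PolicyCategory.PII_IP,
    "jailbreak": PolicyCategory.PROMPT_SECURITY,
    "prompt_injection": PolicyCategory.PROMPT_SECURITY,
    "self_harm": PolicyCategory.VIOLENCE_SELF_HARM,
    "disinformation": PolicyCategory.MISINFORMATION,
}


def resolve_category(name: str, cutoff: float = 0.8) -> Optional[PolicyCategory]:
    """Map free-form text to a category: exact match first, then closest match."""
    key = name.strip().lower()
    for sep in (" ", "-", "/"):
        key = key.replace(sep, "_")
    candidates: Dict[str, PolicyCategory] = {c.value: c for c in PolicyCategory}
    candidates.update(ALIASES)
    if key in candidates:
        return candidates[key]
    # Fall back to similarity matching; None means "no confident match".
    close = difflib.get_close_matches(key, list(candidates), n=1, cutoff=cutoff)
    return candidates[close[0]] if close else None
```

Returning None instead of a best guess lets the caller decide whether to reject the input or classify against all categories.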
Implements a web UI for Trust & Safety analysts at /tns with:
- Prompt Checker: enter text to classify against all 7 GA Guard policy categories (or a specific one). Uses the safety classifier LLM-as-judge with GA Guard block/allow rules. Results show verdict (BLOCK/ALLOW), severity, matched policy, and explanation
- Breach Logs: chronological table of all detected violations, stored in arena.tns_breach_log ClickHouse table. Searchable by text, filterable by category, severity, and time range
- Category Cards: visual overview of the 7 GA Guard categories with breach counts, clickable to filter the log table
- Dashboard Stats: total breaches, last 24h count, per-category breakdown
New files:
- src/ui/dashboard.html: Single-page dark-themed dashboard (HTML/CSS/JS)
- src/ui/dashboard.py: Starlette routes (GET /tns, POST /api/tns/classify, GET /api/tns/breaches, GET /api/tns/stats) + TNS_DDL for the breach log table
Modified files:
- src/arena/server.py: Mount TNSDashboard routes when safety classifier is available
- main.py: Initialize tns_breach_log table on startup, pass classifier to ArenaServer, log dashboard URL
https://claude.ai/code/session_019yT6gMUo5xUkLxbePwZRmU
Routes safety classifications through Phala Network's SGX/TDX enclave so prompts and results stay encrypted end-to-end. Includes:
- TEEClassifier: drop-in wrapper around SafetyClassifier that encrypts requests, verifies Intel DCAP attestation, and decrypts results
- ERC8004Publisher: mints on-chain attestation tokens proving classifications ran inside a genuine TEE (supports direct RPC and relayer modes)
- Config/env wiring: TEE_ENABLED=true activates the proxy; falls back to direct classifier on failure by default
https://claude.ai/code/session_019yT6gMUo5xUkLxbePwZRmU
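The "falls back to direct classifier on failure by default" behaviour can be illustrated with a small wrapper. This sketch deliberately omits encryption and attestation and is synchronous for brevity; `FallbackClassifier`, its constructor arguments, and the `via` field are hypothetical names, not the TEEClassifier interface from this PR.

```python
from typing import Any, Callable, Dict

Classify = Callable[[str], Dict[str, Any]]


class FallbackClassifier:
    """Try the TEE path first; optionally fall back to the direct classifier."""

    def __init__(self, tee_classify: Classify, direct_classify: Classify,
                 fallback_on_failure: bool = True) -> None:
        self._tee = tee_classify
        self._direct = direct_classify
        self._fallback_on_failure = fallback_on_failure

    def classify(self, text: str) -> Dict[str, Any]:
        try:
            result = dict(self._tee(text))
            result["via"] = "tee"
            return result
        except Exception:
            if not self._fallback_on_failure:
                raise  # strict mode: never classify outside the enclave
            result = dict(self._direct(text))
            # Record the path taken so callers can tell the result is unattested.
            result["via"] = "direct"
            return result
```

Tagging each result with the path taken matters here because a fallback result carries no attestation, and downstream consumers may want to treat it differently.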
12 tests covering SafetyClassifier (unsafe/safe/error/batch) and TEEClassifier (fallback, session caching, full TEE flow). All tests run offline with mocked HTTP; no API key or TEE needed.
https://claude.ai/code/session_019yT6gMUo5xUkLxbePwZRmU
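The offline testing pattern described above can be sketched with unittest.mock. The `classify` function, its URL, and the response shape are hypothetical stand-ins chosen for the example; the actual tests in this PR target SafetyClassifier and TEEClassifier and may mock HTTP at a different layer.

```python
from unittest.mock import Mock


def classify(http, text: str) -> str:
    """Hypothetical stand-in for a classifier that calls a judge over HTTP."""
    try:
        reply = http.post("https://judge.invalid/classify", json={"text": text})
        return reply.json()["verdict"]
    except Exception:
        return "error"  # transport or parsing failures map to an error verdict


def test_unsafe_verdict() -> None:
    http = Mock()
    # Configure the mocked client so no network call is ever made.
    http.post.return_value.json.return_value = {"verdict": "unsafe"}
    assert classify(http, "some prompt") == "unsafe"
    http.post.assert_called_once()


def test_transport_error() -> None:
    http = Mock()
    http.post.side_effect = ConnectionError("offline")
    assert classify(http, "some prompt") == "error"
```

Because the client is injected, the same functions run under pytest or can be called directly.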
main.py and src/config.py are now clean of TEE/ERC-8004 code.
All TEE config and factory logic lives in src/safety/tee_config.py
with its own TEEConfig (reads from .env independently). To enable:
    from src.safety.tee_config import build_tee_classifier
    classifier = build_tee_classifier(base_classifier)
https://claude.ai/code/session_019yT6gMUo5xUkLxbePwZRmU
…7lof0-S2PFT Claude/explain codebase mle8k2lr9c27lof0 s2 pft
…ollback Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…use tables Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
8 tests covering apply_label, remove_label, evaluate, rollback, quarantine routing, and auto-rollback at FP threshold.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nd Python fallback
Updates all remaining references to the old agent name in system prompts, CLI output, docstrings, tool definitions, dashboard UI, and the teamer_wallet identifier.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Body:
Test plan