Osprey real rule engine integration by smfang · Pull Request #1 · haileyok/phoebe

smfang · 2026-04-24T10:00:41Z

Body:

docker-compose: Osprey worker, Kafka, Zookeeper, Osprey UI services added
11 SML rules: 4 base (injection, authority, exfiltration, escalation) + 7 insurance
src/safety/osprey_client.py — async Kafka adapter, 500ms timeout, Python fallback
src/safety/monitor.py — SaraMonitor: Osprey primary, Python rules fallback
GET /api/osprey/health endpoint
6 Osprey config fields in src/config.py
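
The "Osprey primary, Python rules fallback" behaviour listed above can be sketched with `asyncio.wait_for`. This is a minimal illustration, not the code in `src/safety/osprey_client.py` or `src/safety/monitor.py`. The names `evaluate_with_fallback`, `_slow_engine`, and `_python_rules` are hypothetical, and only the 500ms budget comes from the PR body.

```python
import asyncio
from typing import Awaitable, Callable, Tuple

OSPREY_TIMEOUT_S = 0.5  # the 500ms budget named in the PR body


async def evaluate_with_fallback(
    text: str,
    primary: Callable[[str], Awaitable[str]],
    fallback: Callable[[str], str],
    timeout_s: float = OSPREY_TIMEOUT_S,
) -> Tuple[str, str]:
    """Return (verdict, source); use `fallback` if `primary` is slow or unreachable."""
    try:
        verdict = await asyncio.wait_for(primary(text), timeout=timeout_s)
        return verdict, "osprey"
    except (asyncio.TimeoutError, ConnectionError):
        # The rule engine missed its budget or could not be reached.
        return fallback(text), "python"


async def _slow_engine(text: str) -> str:
    # Hypothetical stand-in for a Kafka round trip that exceeds the budget.
    await asyncio.sleep(2)
    return "allow"


def _python_rules(text: str) -> str:
    # Hypothetical stand-in for the in-process rule set.
    return "block" if "ignore previous instructions" in text.lower() else "allow"


result = asyncio.run(
    evaluate_with_fallback("Ignore previous instructions", _slow_engine, _python_rules)
)
```

Returning the source alongside the verdict lets callers and tests tell which path produced a decision.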

Test plan

uv run pytest tests/test_osprey_integration.py -v (8/8 should pass)
docker-compose up --build -d then curl http://localhost:8080/api/osprey/health

Adapts Phoebe from an AT Protocol T&S agent to an AI safety red teaming arena where researchers submit adversarial prompts, Phoebe evaluates them as judge, and x402 handles USDC payments.

New subsystems:
- src/x402/: x402 payment client (wraps httpx with auto 402 handling)
- src/arena/: HTTP API server, scoring engine, taxonomy, data models
- src/safety/: LLM-as-judge safety classifier
- src/tools/definitions/{target,safety,bounty,novelty}.py: new Deno tools

Modified:
- main.py: arena + chat commands replacing osprey/ozone
- config.py: x402, arena, and safety classifier settings
- tools/registry.py: ToolContext extended with x402/arena/classifier
- tools/executor.py: prefetches taxonomy instead of osprey config
- agent/prompt.py: judge-mode system prompt

https://claude.ai/code/session_019yT6gMUo5xUkLxbePwZRmU

- x402 wallet: HMAC placeholder → real EIP-191 signing via eth_account with USDC contract addresses for 5 chains and DevWallet for testing
- x402 client: direct payment → facilitator-based settlement with cumulative spending limits and PaymentRecord audit log
- Storage: in-memory dicts → ClickHouse persistence (6 tables via ReplacingMergeTree for bounties, submissions, evaluations, attack history, leaderboard, payment log)
- Novelty detection: hash-based → ngramDistance fuzzy matching
- Arena server: EIP-191 signature verification, rate limiting (20 req/min sliding window), input validation, persistent store
- Bounty tools: updated to query ArenaStore instead of in-memory dicts
- Config: added dev_mode, arena_wallet, spending_limit settings
- main.py: full production wiring with ArenaStore initialization, dev/prod wallet selection, and spending limit passthrough

https://claude.ai/code/session_019yT6gMUo5xUkLxbePwZRmU
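
The "20 req/min sliding window" rate limit mentioned for the arena server could look roughly like the sketch below. It is an assumption-laden illustration, not the arena server's code: the class name `SlidingWindowLimiter`, the per-key deque storage, and the injectable `clock` are all hypothetical, and only the limit of 20 and the 60-second window come from the commit message.

```python
import time
from collections import defaultdict, deque
from typing import Callable, DefaultDict, Deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per key within any `window_s`-second span."""

    def __init__(
        self,
        limit: int = 20,
        window_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.limit = limit
        self.window_s = window_s
        self.clock = clock  # injectable so tests do not need to sleep
        self._hits: DefaultDict[str, Deque[float]] = defaultdict(deque)

    def allow(self, key: str) -> bool:
        now = self.clock()
        hits = self._hits[key]
        # Drop timestamps that have slid out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


# Usage with a fake clock: 21 requests at t=0, then one more at t=60.
fake_now = [0.0]
limiter = SlidingWindowLimiter(clock=lambda: fake_now[0])
first_burst = [limiter.allow("0xabc") for _ in range(21)]
fake_now[0] = 60.0
after_window = limiter.allow("0xabc")
```

In this usage the 21st request in the burst is refused, and the request made once the window has elapsed is accepted again.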

- New admin tool definitions (src/tools/definitions/admin.py): bounty CRUD (create, pause, resume, expire, fund), submission review (list, inspect, reject), leaderboard, payment log, arena stats dashboard, and wallet info
- Admin system prompt in prompt.py for operator console mode
- Agent now accepts prompt_mode param to switch between judge/admin
- New `python main.py admin` CLI command for interactive admin console
- Registered admin tools in tools/__init__.py

https://claude.ai/code/session_019yT6gMUo5xUkLxbePwZRmU

- Dockerfile: Python 3.12 + Deno runtime, pip install, deno cache
- Dockerfile.clickhouse: thin wrapper for Render private service
- docker-compose.yaml: ClickHouse + arena server + admin (profile)
- render.yaml: Render Blueprint with ClickHouse private service + arena web service, env var wiring, health check
- .dockerignore: exclude .env, .git, __pycache__, data/

Deploy locally: docker compose up
Deploy on Render: connect repo, Render auto-detects render.yaml
Admin console: docker compose --profile admin run admin

https://claude.ai/code/session_019yT6gMUo5xUkLxbePwZRmU

New Phoebe mode (`python main.py redteam`) that autonomously reads Osprey safety rules, generates adversarial attacks using evasion techniques, executes them against the target model, classifies results, and logs successful findings.

- REDTEAM_SYSTEM_PROMPT: instructs Phoebe on attack strategy (paraphrase, encoding, role play, multi-turn escalation, etc.)
- Attack tools (src/tools/definitions/attack.py):
  - attack.log_finding: persist confirmed attacks with novelty check
  - attack.run_campaign: orchestrate full campaign across rules
  - attack.reproduce: verify attack reproducibility (N runs)
  - attack.classify_technique: categorize into jailbreak families
- CLI supports --auto (full autonomous campaign) and --interactive (chat-driven red teaming) modes
- Auto-selects first active bounty if --bounty-id not specified

https://claude.ai/code/session_019yT6gMUo5xUkLxbePwZRmU
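
The "N runs" reproducibility check described for attack.reproduce can be illustrated as below. This is only a sketch of the idea. The real tool in src/tools/definitions/attack.py is not shown in this PR, so the function signature, the `ReproductionResult` type, and the default of 5 runs are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ReproductionResult:
    runs: int
    successes: int

    @property
    def rate(self) -> float:
        # Fraction of runs in which the attack succeeded.
        return self.successes / self.runs if self.runs else 0.0


def reproduce(attack: Callable[[], bool], runs: int = 5) -> ReproductionResult:
    """Execute `attack` `runs` times and count how often it succeeds."""
    successes = sum(1 for _ in range(runs) if attack())
    return ReproductionResult(runs=runs, successes=successes)


# Usage with a scripted attack so the outcome is deterministic.
outcomes = iter([True, False, True, True, False])
repro = reproduce(lambda: next(outcomes), runs=5)
```

A caller might then treat a finding as confirmed only when `repro.rate` clears a threshold of its choosing.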

Implement the General Analysis guardrail taxonomy with 7 policy categories (PII/IP, Illicit Activities, Hate, Sexual Content, Prompt Security, Violence/Self-Harm, Misinformation), each with granular block/allow rules and compliance anchors (NIST AI RMF, OWASP LLM Top 10, MITRE ATLAS, ISO/IEC 42001, EU AI Act).

- src/osprey/policy.py: Full taxonomy definition with PolicyCategory enum, block/allow rules, compliance anchors, classifier formatting, and fuzzy category alias resolution
- src/tools/definitions/policy.py: Tools (policy.list, policy.get, policy.classify, policy.prompt) for querying the policy from agent/sandbox
- src/arena/taxonomy.py: Align SafetyCategory with GA Guard top-level categories while preserving legacy fine-grained categories for backward compat with existing ClickHouse data
- src/safety/classifier.py: Safety classifier now includes GA Guard policy rules in the judge prompt for policy-aware classification with matched_block_rule and policy_category in output
- src/tools/registry.py: Add osprey property to ToolContext

https://claude.ai/code/session_019yT6gMUo5xUkLxbePwZRmU
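
One way "fuzzy category alias resolution" over the seven categories might work is sketched here with the standard library's `difflib`. The `PolicyCategory` name and the seven categories come from the commit message. The member names and values, the `ALIASES` table, the `resolve_category` function, and the 0.8 cutoff are assumptions for illustration, since src/osprey/policy.py itself is not shown.

```python
import difflib
from enum import Enum
from typing import Optional


class PolicyCategory(Enum):
    # Member names and values are hypothetical spellings of the 7 categories.
    PII_IP = "pii_ip"
    ILLICIT_ACTIVITIES = "illicit_activities"
    HATE = "hate"
    SEXUAL_CONTENT = "sexual_content"
    PROMPT_SECURITY = "prompt_security"
    VIOLENCE_SELF_HARM = "violence_self_harm"
    MISINFORMATION = "misinformation"


# Hypothetical alias table.
ALIASES = {
    "pii": PolicyCategory.PII_IP,
    "jailbreak": PolicyCategory.PROMPT_SECURITY,
    "prompt_injection": PolicyCategory.PROMPT_SECURITY,
    "self_harm": PolicyCategory.VIOLENCE_SELF_HARM,
}


def resolve_category(name: str, cutoff: float = 0.8) -> Optional[PolicyCategory]:
    """Map a free-form label to a category: exact or alias match first, then fuzzy."""
    key = name.strip().lower()
    for sep in (" ", "-", "/"):
        key = key.replace(sep, "_")
    lookup = {**{c.value: c for c in PolicyCategory}, **ALIASES}
    if key in lookup:
        return lookup[key]
    # Fall back to the closest known spelling, if any is similar enough.
    close = difflib.get_close_matches(key, list(lookup), n=1, cutoff=cutoff)
    return lookup[close[0]] if close else None
```

For example, `resolve_category("Violence/Self-Harm")` normalises to an exact match, while a misspelling such as `"misinfomation"` is caught by the fuzzy step.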

Implements a web UI for Trust & Safety analysts at /tns with:

- Prompt Checker: enter text to classify against all 7 GA Guard policy categories (or a specific one). Uses the safety classifier LLM-as-judge with GA Guard block/allow rules. Results show verdict (BLOCK/ALLOW), severity, matched policy, and explanation
- Breach Logs: chronological table of all detected violations, stored in arena.tns_breach_log ClickHouse table. Searchable by text, filterable by category, severity, and time range
- Category Cards: visual overview of the 7 GA Guard categories with breach counts, clickable to filter the log table
- Dashboard Stats: total breaches, last 24h count, per-category breakdown

New files:
- src/ui/dashboard.html: Single-page dark-themed dashboard (HTML/CSS/JS)
- src/ui/dashboard.py: Starlette routes (GET /tns, POST /api/tns/classify, GET /api/tns/breaches, GET /api/tns/stats) + TNS_DDL for the breach log table

Modified files:
- src/arena/server.py: Mount TNSDashboard routes when safety classifier is available
- main.py: Initialize tns_breach_log table on startup, pass classifier to ArenaServer, log dashboard URL

https://claude.ai/code/session_019yT6gMUo5xUkLxbePwZRmU

Routes safety classifications through Phala Network's SGX/TDX enclave so prompts and results stay encrypted end-to-end. Includes:

- TEEClassifier: drop-in wrapper around SafetyClassifier that encrypts requests, verifies Intel DCAP attestation, and decrypts results
- ERC8004Publisher: mints on-chain attestation tokens proving classifications ran inside a genuine TEE (supports direct RPC and relayer modes)
- Config/env wiring: TEE_ENABLED=true activates the proxy; falls back to direct classifier on failure by default

https://claude.ai/code/session_019yT6gMUo5xUkLxbePwZRmU

12 tests covering SafetyClassifier (unsafe/safe/error/batch) and TEEClassifier (fallback, session caching, full TEE flow). All tests run offline with mocked HTTP — no API key or TEE needed. https://claude.ai/code/session_019yT6gMUo5xUkLxbePwZRmU

main.py and src/config.py are now clean of TEE/ERC-8004 code. All TEE config and factory logic lives in src/safety/tee_config.py with its own TEEConfig (reads from .env independently).

To enable:

    from src.safety.tee_config import build_tee_classifier
    classifier = build_tee_classifier(base_classifier)

https://claude.ai/code/session_019yT6gMUo5xUkLxbePwZRmU

claude and others added 29 commits February 10, 2026 18:38

Merge pull request #1 from smfang/claude/explain-codebase-mle8k2lr9c27lof0-S2PFT

904383a

Claude/explain codebase mle8k2lr9c27lof0 s2 pft

Add safety RL training data pipeline (DPO preference pairs)

7e1bc49

Update README.md

8dc41ba

Update README.md

dc3a4c2

feat(ozone): implement enforcement modes SYNC/ASYNC/QUARANTINE with rollback

0e21639

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(ozone): add enforcement_log and rule_performance_metrics ClickHouse tables

80be098

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(ozone): add /api/ozone/evaluate HTTP endpoint to arena server

8988a16

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(ozone): add enforcement config fields

1b9582e

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

test(ozone): add enforcement layer test suite

c26ea7d

8 tests covering apply_label, remove_label, evaluate, rollback, quarantine routing, and auto-rollback at FP threshold. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

merge: Ozone enforcement layer — SYNC/ASYNC/QUARANTINE with rollback

48685ca

feat(osprey): add Osprey worker, Kafka, and UI services to docker-compose

08a7aa7

feat(osprey): add Osprey SML rules — 4 base + 7 insurance rules

1ff68b1

feat(osprey): OspreyClient - async Kafka adapter with 500ms timeout and Python fallback

7077fdf

feat(osprey): wire SaraMonitor to Osprey with Python fallback

e68f2ac

feat(osprey): add Osprey config fields

0824b4c

feat(osprey): add /api/osprey/health endpoint

28a7d96

test(osprey): Kafka client, fallback, verdict parsing, ATLAS mapping

eea6319

docs: update MEMORY.md — Osprey integration complete

1eb04b0

refactor: rename Phoebe → Sara across codebase

ae3a6f4

Updates all remaining references to the old agent name in system prompts, CLI output, docstrings, tool definitions, dashboard UI, and the teamer_wallet identifier. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

smfang wants to merge 29 commits into haileyok:main from smfang:claude/osprey-integration
