## Summary
I built an open-source OPA-gated MCP server (Claw) that adds a formal conflict resolution layer to Claude's tool use pipeline. When multiple policy rules produce contradictory decisions (e.g., "deny — PHI detected" vs. "allow — authorized research domain"), the system resolves conflicts using Dung's Abstract Argumentation Framework (1995) instead of hardcoded priority ordering.
I'm sharing implementation notes in case they're useful to others building governance layers for MCP tool use, and to see if there's interest in a cookbook recipe covering this pattern.
## The problem this addresses
MCP servers that enforce behavioral constraints typically stack multiple checks: PII detection, domain reputation, policy packs (HIPAA, financial compliance), and contextual signals. These checks regularly conflict. Most implementations resolve conflicts with priority numbers or "deny wins" logic. This works until:
- A trusted research domain sends content containing patient identifiers — does the trust record override the PII block, or does the PII block override the trust record?
- A healthcare policy pack denies content that a research policy pack would allow — which pack wins when the user has both contexts?
- A domain was previously flagged suspicious, but the OPA policy says the content type is safe — which signal dominates?
These aren't edge cases. They're the normal operating mode when you have more than 3-4 policy rules.
## How the argumentation approach works
Instead of priority ordering, each policy signal becomes a formal argument with a source, strength, and decision claim. Arguments that contradict each other attack each other in a directed graph. The engine then computes which arguments survive all attacks using Dung's characteristic function (iterative fixpoint):
```
S₀ = ∅
Sₙ₊₁ = F(Sₙ) = { a ∈ Args | Sₙ defends a, i.e. every attacker of a is attacked by some member of Sₙ }
Stop when Sₙ₊₁ = Sₙ
```
The surviving arguments (the "grounded extension") determine the decision. The full attack graph is returned as part of the API response, so every decision is traceable.
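The fixpoint iteration above can be sketched in a few lines of Python. This is an illustrative sketch, not the repo's actual `engine.py` API; names are chosen for readability.

```python
def grounded_extension(args, attacks):
    """Compute the grounded extension by iterating Dung's characteristic
    function F until a fixpoint is reached.

    args    -- set of argument IDs
    attacks -- set of (attacker, target) pairs
    """
    def defended(a, s):
        # a is defended w.r.t. s if every attacker of a is itself
        # attacked by some member of s
        attackers = {x for (x, t) in attacks if t == a}
        return all(any((d, x) in attacks for d in s) for x in attackers)

    s = set()
    while True:
        s_next = {a for a in args if defended(a, s)}
        if s_next == s:  # Sₙ₊₁ = Sₙ → fixpoint reached
            return s
        s = s_next
```

Note that the first iteration admits exactly the unattacked arguments (the `all` over an empty attacker set is vacuously true), and each later iteration adds arguments that the already-admitted arguments defend.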
## Concrete example
Input: content containing 2 email addresses from a domain previously marked trusted.
Arguments generated:
| ID | Claim | Source | Strength | Decision |
|---|---|---|---|---|
| baseline_allow | Content should be allowed | OPA default | 0.3 | allow |
| pii_moderate | 2 non-critical PII items — masking required | PII Scanner | 0.6 | modify |
| knowledge_trust_0 | Domain previously marked trusted | Knowledge Hub | 0.8 | allow |
Attack relations:
- pii_moderate → baseline_allow (UNDERCUT: modification overrides baseline)
- knowledge_trust_0 → pii_moderate (UNDERMINE: trust record challenges PII block)
Grounded extension: {baseline_allow, knowledge_trust_0} — the trust record defends the allow decision because it attacks the only argument attacking the baseline.
Decision: allow (no modification needed)
If the same content came from an untrusted domain, the knowledge_trust_0 argument wouldn't exist, the grounded extension would be {pii_moderate}, and the decision would be allow_with_modifications (PII masking applied).
The point: the same engine handles both cases through graph computation rather than conditional branching, and the full attack graph remains available for audit.
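As a sanity check, both scenarios can be run through a minimal, self-contained grounded-extension helper (a sketch for illustration only; the repo's actual engine lives in `server/argumentation/engine.py`):

```python
def grounded(args, attacks):
    # iterate F(S) = {a | every attacker of a is attacked by some member of S}
    s = set()
    while True:
        nxt = {a for a in args
               if all(any((d, x) in attacks for d in s)
                      for x in {x for (x, t) in attacks if t == a})}
        if nxt == s:
            return s
        s = nxt

# Trusted-domain scenario: all three arguments from the table above.
args = {"baseline_allow", "pii_moderate", "knowledge_trust_0"}
attacks = {("pii_moderate", "baseline_allow"),
           ("knowledge_trust_0", "pii_moderate")}
print(sorted(grounded(args, attacks)))
# → ['baseline_allow', 'knowledge_trust_0']  (decision: allow)

# Untrusted-domain scenario: the trust argument never exists.
args2 = args - {"knowledge_trust_0"}
attacks2 = {("pii_moderate", "baseline_allow")}
print(sorted(grounded(args2, attacks2)))
# → ['pii_moderate']  (decision: allow_with_modifications)
```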
## Implementation details
The system runs as a 6-stage pipeline:
PII Scan → OPA Policy Gate → Knowledge Hub → Argumentation → Context Assembly → Model
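The staging can be sketched as a simple fold over stage functions. The stage bodies below are placeholders to show the shape of the chain, not the repo's implementations:

```python
def run_pipeline(ctx, stages):
    """Thread a request context through each named stage in order; every
    stage returns the (possibly annotated) context for the next one."""
    for name, stage in stages:
        ctx = stage(ctx)
        ctx.setdefault("trace", []).append(name)  # record the path for audit
    return ctx

# Placeholder stages mirroring the diagram above.
stages = [
    ("pii_scan",         lambda c: {**c, "pii_items": 0}),
    ("opa_policy_gate",  lambda c: {**c, "opa_decision": "allow"}),
    ("knowledge_hub",    lambda c: c),
    ("argumentation",    lambda c: {**c, "decision": c["opa_decision"]}),
    ("context_assembly", lambda c: c),
    ("model",            lambda c: c),
]

out = run_pipeline({"content": "hello"}, stages)
print(out["trace"], out["decision"])
```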
Key files in the repo:
- `server/argumentation/models.py` — Dung's AF data structures (Argument, Attack, Extension, Framework)
- `server/argumentation/engine.py` — extension computation (grounded, preferred, stable semantics)
- `server/argumentation/rego_bridge.py` — converts OPA decisions + Knowledge Hub entries into formal arguments
- `opa/policies/main.rego` — 12 deny rules, 5 modification rules across 6 policy packs
- `server/sdam_model.py` — Sequential Decision Analytics (Powell's framework) for modeling decision chains
Numbers: 65 Python tests, 20 OPA tests, 9 API endpoints, Docker deployment.
## Why this might be useful for the cookbook
The MCP ecosystem is growing fast, and governance is becoming a bottleneck. A cookbook recipe could cover:
- Pattern: How to add a formal conflict resolution layer between OPA and the model
- When to use it: Any MCP server with >3 policy rules that can produce contradictory outputs
- The bridge pattern: Converting declarative policy outputs (Rego) into formal arguments
- Auditability: Returning the attack graph as part of the API response for compliance
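The bridge pattern can be illustrated with a sketch that maps fired policy rules to Argument objects and derives attack edges from contradictory decisions. The field names in `opa_result` and the contradiction table are assumptions for illustration, not the repo's actual `rego_bridge.py` schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Argument:
    id: str
    claim: str
    source: str
    strength: float
    decision: str  # "allow" | "deny" | "modify"

# Which decision pairs contradict each other (illustrative table).
CONTRADICTS = {("allow", "deny"), ("deny", "allow"),
               ("modify", "allow"), ("allow", "modify")}

def bridge(opa_result):
    """Map each fired Rego rule to an Argument; contradictory decisions
    attack each other, with the stronger argument attacking the weaker."""
    args = [Argument(r["id"], r["msg"], "OPA", r.get("strength", 0.5),
                     r["decision"])
            for r in opa_result.get("rules", [])]
    attacks = {(a.id, b.id)
               for a in args for b in args
               if (a.decision, b.decision) in CONTRADICTS
               and a.strength >= b.strength}
    return args, attacks
```

One design note: deriving attack direction from relative strength keeps the Rego policies purely declarative; the argumentation layer alone decides which signal dominates.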
I'm happy to write this up as a proper cookbook recipe if there's interest, or to adapt any part of the existing code for inclusion.
## Related work
- Dung, P.M. (1995). "On the acceptability of arguments and its fundamental role in nonmonotonic reasoning, logic programming and n-person games." Artificial Intelligence, 77(2), 321-357.
- Bai et al. (2022). "Constitutional AI: Harmlessness from AI feedback." — the system extends this by formalizing how constitutional principles resolve when they contradict.
- Kirchner et al. (2024). "Prover-Verifier Games improve legibility of LLM outputs." — structurally related: both formalize the idea that AI reasoning should be checkable by a less capable agent.
## Links
- Repo: github.com/Leeladitya/claw
- Decision Arena (playable governance scenarios): github.com/Leeladitya/agora
- Research paper (multi-constitutional extension): leed.guru
- Extended technical writeup with formal analysis: lesswrong.com/posts/89bYQbNrRN9a8htpr
Feedback and pointers to related MCP governance work are welcome.
— Leela Aditya Annam (@Leeladitya)