Skip to content

feat(proxy): tool definition drift detection (on_tool_drift) (#25)#59

Merged
olivrg merged 13 commits into
mainfrom
fix/tool-definition-drift
Jun 11, 2026
Merged

feat(proxy): tool definition drift detection (on_tool_drift) (#25)#59
olivrg merged 13 commits into
mainfrom
fix/tool-definition-drift

Conversation

@olivrg

@olivrg olivrg commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Description

Closes the MCP rug-pull / tool-poisoning gap reported in #25: Helio previously adopted every upstream tools/list definition wholesale — the annotation cache cleared and repopulated on each refresh and never looked at input schemas — so a tool whose readOnlyHint/destructiveHint flipped after review silently rewrote its own baseline and matching policy rules stopped firing.

  • Rewrite ToolAnnotationCache to baseline-and-diff: the entire tool definition (annotations, inputSchema, description, outputSchema, title, plus an other catch-all) is fingerprinted on first sight (canonical JSON, hardened against JSON-parsed __proto__ keys) and diffed on every tools/list. Baselines survive tool removal, so remove/re-add cycles can't reset them; reverting to the baseline clears the drift state.
  • Add policies.on_tool_drift: block | require_approval | log (default block). Drifted tools are denied / escalated through the approval channel / logged. In log mode, rules are evaluated against both the baseline and current annotations and the stricter decision wins (dry_run outranks the limit actions because it never forwards), so a definition flip cannot weaken enforcement in either direction.
  • Fail closed on duplicate tool names: a tools/list that repeats a name is ambiguous, so the tool is marked drifted (aspect: duplicate) and cannot be un-drifted by entry ordering within the same payload (closes a drift-suppression bypass found in external review).
  • Audit every drift event (policy_decision: tool_drift / tool_drift_reverted, immediate flush) from both the runtime tools/list path and startup priming; blocked calls get block_reason: tool_definition_drift with self-repair feedback; gated/logged calls carry evidence_chain.tool_drift { mode, changes } with the mode snapshotted at gate time (hot-reload safe).
  • Exclude drift-event records from the dashboard's allowed_total and top_tools aggregates so analytics keep representing tool-call outcomes; they remain visible in the feed, totals, and by-decision breakdown.
  • Policy evaluation now always uses the baseline annotations — the definitions the operator reviewed — never the latest upstream claim.

Behavior change (conservative default): a tool whose definition changes mid-session is now denied until the proxy restarts (re-baselines) or the upstream reverts. Set policies.on_tool_drift: log for observe-mode. Flagged in the CHANGELOG and the startup log.

Known limitations (tracked follow-ups): baselines are in-memory per process — a restart re-baselines from whatever the upstream currently reports (follow-up issue filed; startup log warns explicitly). Drift-escalated approval tickets do not yet show the drift detail to the approver.

Credit: Maaz (Interlock) on the MCP Discord for the report.

Closes #25

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactor (no functional changes)
  • Documentation
  • CI / build / tooling

Packages Affected

  • packages/proxy
  • packages/dashboard
  • packages/python-sdk
  • Root config / monorepo tooling
  • docs/
  • examples/

Checklist

  • I have read CONTRIBUTING.md
  • My code follows the existing style (ESLint + Prettier pass)
  • TypeScript strict mode — no any types or @ts-ignore without justification
  • I have added or updated tests for my changes
  • All CI checks pass (pnpm secrets:scan, pnpm docs:check:ci, pnpm audit --audit-level=high, pnpm build, pnpm lint, pnpm format:check, pnpm typecheck, pnpm test)
  • I have updated documentation if this changes user-facing behavior
  • Commit messages follow Conventional Commits (e.g. feat:, fix:, docs:)

How to Test

  1. Point Helio at any MCP server, start it, and note the startup line Annotation cache primed: <n> tool definitions baselined for drift detection ….
  2. Change a tool's annotations, schema, or description on the upstream mid-session (or replay a modified tools/list); confirm the console drift warning, a tool_drift audit record, and that a subsequent tools/call to that tool is denied with block_reason: tool_definition_drift.
  3. Set policies.on_tool_drift: log and repeat: the call proceeds, the audit record carries evidence_chain.tool_drift, and a deny rule matching either the baseline or the current annotations still fires (stricter-of-both).
  4. Revert the upstream definition: a tool_drift_reverted record is written and the tool unblocks.
  5. Send a tools/list with a duplicated tool name: the tool is gated (drifted_aspects: ["duplicate"]) regardless of entry order.
  6. pnpm --filter @gethelio/proxy test — full suite (1434 tests) passes, including the tool definition drift detection and call-gating suites.

Additional Context

@olivrg olivrg merged commit 92db2ef into main Jun 11, 2026
3 checks passed
@olivrg olivrg deleted the fix/tool-definition-drift branch June 11, 2026 16:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Tool definition drift undetected - annotation/schema cache replaces wholesale, no baseline-and-diff (MCP rug-pull)

1 participant