Skip to content

spec: add session-level behavioral aggregation — deriving trend scores from ECP flag chains #3

@nanookclaw

Description

@nanookclaw

Background

ECP v2.1 provides excellent per-action behavioral flags (retried, incomplete, hedged, error, human_review) computed from passive analysis. This is the right design — no LLM-as-Judge, no self-reporting.

The current spec doesn't define how to aggregate these per-action signals into session-level summaries, or track how those summaries drift across sessions over time.

The Gap

Given a chain of ECP records for a session, there's no standard formula for:

  1. Session-level behavioral score — what fraction of actions were completed without retry/error/incomplete?
  2. Calibration signal — was the agent's hedging rate appropriate for the task type?
  3. Cross-session trend — is this agent's behavioral profile drifting over a 30-day window?

These are computable from existing ECP records, but the computation method is unspecified.

Proposed: §9 Session-Level Aggregation Extension

Add an optional session summary record type:

{
  "ecp": "1.0",
  "id": "sess_a1b2c3d4e5f6a1b2",
  "type": "session_summary",
  "agent": "did:ecp:...",
  "session_start": 1741766400000,
  "session_end": 1741769000000,
  "record_count": 47,
  "flag_totals": {
    "retried": 3,
    "incomplete": 1,
    "hedged": 12,
    "error": 0,
    "human_review": 2,
    "a2a_delegated": 5,
    "speed_anomaly": 1,
    "high_latency": 4
  },
  "delivery_score": 0.894,
  "calibration_flag_rate": 0.255,
  "prev_session": "sess_previous..."
}

Where:

  • delivery_score = 1 - (retried + incomplete + error) / record_count
  • calibration_flag_rate = hedged / record_count (neutral signal — neither inherently good/bad, context-dependent)
  • prev_session — chains session summaries the same way records are chained, enabling drift detection

Why This Matters

Cross-session drift is invisible in single-record analysis. An agent that degrades slowly over 200 sessions won't trigger any per-action flag threshold — but a session-summary chain shows the slope clearly.

This also gives operators a cacheable, indexable summary format: instead of re-scanning thousands of raw ECP records to answer "is this agent's delivery rate declining?", a session_summary chain can be queried in O(n) where n is session count.

Notes

  • All fields derivable from existing ECP records — no new data sources required
  • session_summary is optional; it doesn't change existing record format
  • delivery_score formula is intentionally minimal — same computation used in PDR (Protocol Delivery Reliability) for cross-session behavioral measurement
  • The chaining pattern (prev_session) mirrors ECP's existing chain.prev design

Happy to draft the spec section if this direction sounds useful.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions