Background
ECP v2.1 provides excellent per-action behavioral flags (retried, incomplete, hedged, error, human_review) computed from passive analysis. This is the right design — no LLM-as-Judge, no self-reporting.
The current spec doesn't define how to aggregate these per-action signals into session-level summaries, or track how those summaries drift across sessions over time.
The Gap
Given a chain of ECP records for a session, there's no standard formula for:
- Session-level behavioral score — what fraction of actions were completed without retry/error/incomplete?
- Calibration signal — was the agent's hedging rate appropriate for the task type?
- Cross-session trend — is this agent's behavioral profile drifting over a 30-day window?
These are computable from existing ECP records, but the computation method is unspecified.
Proposed: §9 Session-Level Aggregation Extension
Add an optional session summary record type:
{
"ecp": "1.0",
"id": "sess_a1b2c3d4e5f6a1b2",
"type": "session_summary",
"agent": "did:ecp:...",
"session_start": 1741766400000,
"session_end": 1741769000000,
"record_count": 47,
"flag_totals": {
"retried": 3,
"incomplete": 1,
"hedged": 12,
"error": 0,
"human_review": 2,
"a2a_delegated": 5,
"speed_anomaly": 1,
"high_latency": 4
},
"delivery_score": 0.894,
"calibration_flag_rate": 0.255,
"prev_session": "sess_previous..."
}
Where:
delivery_score = 1 - (retried + incomplete + error) / record_count
calibration_flag_rate = hedged / record_count (neutral signal — neither inherently good/bad, context-dependent)
prev_session — chains session summaries the same way records are chained, enabling drift detection
Why This Matters
Cross-session drift is invisible in single-record analysis. An agent that degrades slowly over 200 sessions won't trigger any per-action flag threshold — but a session-summary chain shows the slope clearly.
This also gives operators a cacheable, indexable summary format: instead of re-scanning thousands of raw ECP records to answer "is this agent's delivery rate declining?", a session_summary chain can be queried in O(n) where n is session count.
Notes
- All fields derivable from existing ECP records — no new data sources required
session_summary is optional; it doesn't change existing record format
delivery_score formula is intentionally minimal — same computation used in PDR (Protocol Delivery Reliability) for cross-session behavioral measurement
- The chaining pattern (
prev_session) mirrors ECP's existing chain.prev design
Happy to draft the spec section if this direction sounds useful.
Background
ECP v2.1 provides excellent per-action behavioral flags (
retried,incomplete,hedged,error,human_review) computed from passive analysis. This is the right design — no LLM-as-Judge, no self-reporting.The current spec doesn't define how to aggregate these per-action signals into session-level summaries, or track how those summaries drift across sessions over time.
The Gap
Given a chain of ECP records for a session, there's no standard formula for:
These are computable from existing ECP records, but the computation method is unspecified.
Proposed: §9 Session-Level Aggregation Extension
Add an optional session summary record type:
{ "ecp": "1.0", "id": "sess_a1b2c3d4e5f6a1b2", "type": "session_summary", "agent": "did:ecp:...", "session_start": 1741766400000, "session_end": 1741769000000, "record_count": 47, "flag_totals": { "retried": 3, "incomplete": 1, "hedged": 12, "error": 0, "human_review": 2, "a2a_delegated": 5, "speed_anomaly": 1, "high_latency": 4 }, "delivery_score": 0.894, "calibration_flag_rate": 0.255, "prev_session": "sess_previous..." }Where:
delivery_score= 1 - (retried + incomplete + error) / record_countcalibration_flag_rate= hedged / record_count (neutral signal — neither inherently good/bad, context-dependent)prev_session— chains session summaries the same way records are chained, enabling drift detectionWhy This Matters
Cross-session drift is invisible in single-record analysis. An agent that degrades slowly over 200 sessions won't trigger any per-action flag threshold — but a session-summary chain shows the slope clearly.
This also gives operators a cacheable, indexable summary format: instead of re-scanning thousands of raw ECP records to answer "is this agent's delivery rate declining?", a session_summary chain can be queried in O(n) where n is session count.
Notes
session_summaryis optional; it doesn't change existing record formatdelivery_scoreformula is intentionally minimal — same computation used in PDR (Protocol Delivery Reliability) for cross-session behavioral measurementprev_session) mirrors ECP's existingchain.prevdesignHappy to draft the spec section if this direction sounds useful.