spec: add session-level behavioral aggregation — deriving trend scores from ECP flag chains

## Background

ECP v2.1 provides excellent per-action behavioral flags (`retried`, `incomplete`, `hedged`, `error`, `human_review`) computed from passive analysis. This is the right design — no LLM-as-Judge, no self-reporting.

The current spec doesn't define how to aggregate these per-action signals into session-level summaries, or track how those summaries drift across sessions over time.

## The Gap

Given a chain of ECP records for a session, there's no standard formula for:
1. **Session-level behavioral score** — what fraction of actions were completed without retry/error/incomplete?
2. **Calibration signal** — was the agent's hedging rate appropriate for the task type?
3. **Cross-session trend** — is this agent's behavioral profile drifting over a 30-day window?

These are computable from existing ECP records, but the computation method is unspecified.

## Proposed: §9 Session-Level Aggregation Extension

Add an optional session summary record type:

```json
{
  "ecp": "1.0",
  "id": "sess_a1b2c3d4e5f6a1b2",
  "type": "session_summary",
  "agent": "did:ecp:...",
  "session_start": 1741766400000,
  "session_end": 1741769000000,
  "record_count": 47,
  "flag_totals": {
    "retried": 3,
    "incomplete": 1,
    "hedged": 12,
    "error": 0,
    "human_review": 2,
    "a2a_delegated": 5,
    "speed_anomaly": 1,
    "high_latency": 4
  },
  "delivery_score": 0.894,
  "calibration_flag_rate": 0.255,
  "prev_session": "sess_previous..."
}
```

Where:
- `delivery_score` = 1 - (retried + incomplete + error) / record_count
- `calibration_flag_rate` = hedged / record_count (neutral signal — neither inherently good/bad, context-dependent)
- `prev_session` — chains session summaries the same way records are chained, enabling drift detection

## Why This Matters

Cross-session drift is invisible in single-record analysis. An agent that degrades slowly over 200 sessions won't trigger any per-action flag threshold — but a session-summary chain shows the slope clearly.

This also gives operators a cacheable, indexable summary format: instead of re-scanning thousands of raw ECP records to answer "is this agent's delivery rate declining?", a session_summary chain can be queried in O(n) where n is session count.

## Notes

- All fields derivable from existing ECP records — no new data sources required
- `session_summary` is optional; it doesn't change existing record format
- `delivery_score` formula is intentionally minimal — same computation used in [PDR (Protocol Delivery Reliability)](https://doi.org/10.5281/zenodo.19339671) for cross-session behavioral measurement
- The chaining pattern (`prev_session`) mirrors ECP's existing `chain.prev` design

Happy to draft the spec section if this direction sounds useful.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

spec: add session-level behavioral aggregation — deriving trend scores from ECP flag chains #3

Background

The Gap

Proposed: §9 Session-Level Aggregation Extension

Why This Matters

Notes

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

spec: add session-level behavioral aggregation — deriving trend scores from ECP flag chains #3

Description

Background

The Gap

Proposed: §9 Session-Level Aggregation Extension

Why This Matters

Notes

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions