Merged
3 changes: 2 additions & 1 deletion docs/.index
@@ -4,11 +4,12 @@ nav:
- Agent Directory Service: dir
- Messaging SDK: messaging
- Identity: identity
- Observability and Evaluation: obs-and-eval
- CSIT: csit
- CoffeeAGNTCY: coffee-agntcy
- Semantic SDK: semantic
- Syntactic SDK: syntactic
- Agent Workflow Server: agws
- Agent Manifest: manifest
- How-To Guides: how-to-guides
- How to Contribute: contributing.md
- How to Contribute: contributing.md
4 changes: 4 additions & 0 deletions docs/obs-and-eval/.index
@@ -0,0 +1,4 @@
nav:
- Introduction: observe-and-eval.md
- Observe SDK: observe-sdk.md
- Evaluations: evaluation.md
39 changes: 39 additions & 0 deletions docs/obs-and-eval/evaluation.md
@@ -0,0 +1,39 @@
# Evaluation

The Observe SDK emits raw telemetry (traces, metrics, and optional logs) to an OpenTelemetry (OTel) Collector. The collector exports that telemetry to a storage backend where it becomes a durable, queryable substrate. On top of this substrate we layer **evaluation**: deriving higher‑order insights (quality, efficiency, reliability, safety) from factual execution data. Two core building blocks enable this: the **API Layer** and the **Metrics Computation Engine (MCE)**.
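As an illustrative sketch of this pipeline, a minimal Collector configuration could receive OTLP from the SDK and export traces to ClickHouse via the contrib `clickhouse` exporter (endpoints and database name here are assumptions, not a prescribed deployment):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  clickhouse:
    endpoint: tcp://clickhouse:9000
    database: otel

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [clickhouse]
```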

## API Layer

The API Layer abstracts the underlying database so that we are not locked into a single storage technology. While ClickHouse is the initial implementation, the interface is intentionally narrow (session lookup, span search, metric fetch, write‑back of computed artifacts) to keep alternative backends (e.g., PostgreSQL, BigQuery, Elastic, Parquet lake) easily pluggable.

Capabilities:
- Retrieve raw session traces (for replay, graph reconstruction, diffing).
- Fetch primary metrics emitted directly by the SDK (e.g., latency histograms, token usage counters).
- Provide bounded, pagination‑friendly queries for the MCE (avoids heavy ad‑hoc joins inside the engine).
- Enforce schema normalization/translation so upstream differences (SDK versions, framework variants) do not leak into evaluator logic.

Design goals:
- **Decoupling:** Evaluation logic never embeds vendor‑specific SQL.
- **Portability:** Swap storage by implementing a small provider contract.
- **Consistency:** Uniform response envelopes (metadata + data + pagination cursor).

## Metrics Computation Engine (MCE)

The Metrics Computation Engine computes **derived** metrics at multiple aggregation levels:
- **Span level:** e.g., answer relevance for a single LLM call or tool invocation success rate.
- **Session level:** overall MAS completion status, agent collaboration success rate, delegation accuracy.
- **Population level:** metrics computed over a set of sessions, like graph determinism score.

The MCE uses a plugin architecture, making it straightforward to add new metrics.
It can run either as:
- A stand‑alone service exposed through a REST API, or
- An embeddable SDK integrated into a third‑party data or analytics platform.

Computed metrics can be:
- Stored back into the database via the API Layer for longitudinal analysis, and/or
- Returned directly to the caller for immediate consumption.
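To make the plugin idea concrete, here is a hedged sketch of what a derived, session-level metric plugin might look like (the base class, span shape, and metric name are hypothetical, not the actual MCE plugin API):

```python
from abc import ABC, abstractmethod
from statistics import mean
from typing import Any

# One telemetry span as fetched through the API Layer (shape is illustrative).
Span = dict[str, Any]


class MetricPlugin(ABC):
    """Base class for a derived metric computed by the MCE."""
    name: str
    level: str  # "span" | "session" | "population"

    @abstractmethod
    def compute(self, spans: list[Span]) -> float: ...


class ToolSuccessRate(MetricPlugin):
    """Session-level metric: fraction of tool invocations that succeeded."""
    name = "tool_success_rate"
    level = "session"

    def compute(self, spans: list[Span]) -> float:
        tool_spans = [s for s in spans if s.get("kind") == "tool"]
        if not tool_spans:
            return 1.0  # no tool calls: vacuously successful
        return mean(1.0 if s.get("status") == "ok" else 0.0
                    for s in tool_spans)
```

Registering such a class with the engine would make the metric available both via the REST API and via the embeddable SDK.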

## Detailed Usage

For detailed usage and examples, see the [telemetry-hub repository](https://github.com/agntcy/telemetry-hub).
19 changes: 19 additions & 0 deletions docs/obs-and-eval/observe-and-eval.md
@@ -0,0 +1,19 @@
# Observability and Evaluation

## Introduction

In the Internet of Agents (IoA) vision, multiple agents collaborate — making sequential and sometimes parallel decisions — forming a multi-agent system (MAS). As the number and complexity of the agents involved increase, it becomes increasingly challenging to understand how the system arrived at a particular state. Unlike traditional software, a MAS is made up of many independent agents interacting in unpredictable ways, which makes it hard to see how information is shared, how decisions are made, and how the system works as a whole. This is where *observability* comes in: it provides visibility into the execution of a MAS by recording each action taken by the agents, along with its outcome. This yields a complete view of a run: what each agent did, which agents interacted, which tools were used, and so on. Observability must extend beyond traditional monitoring to include instrumentation of the communication and reasoning layers, giving insight into message flows, interaction patterns, and the decision-making behind agent collaboration.

While observability provides *factual information* about a MAS, *evaluation* judges its *quality and effectiveness*. Using the data produced by observability, evaluation surfaces performance through metrics that measure dimensions important to the user. For example, metrics can measure the relevance of the final output relative to the initial input, or the overall efficiency of the workflow (were there unnecessary loops during execution?). In effect, evaluation transforms raw telemetry into performance, reliability, cost-efficiency, and security signals. Evaluation can also be applied across multiple sessions (executions) over time to identify trends or performance decay.

## AGNTCY Observability and Evaluation Solution

At AGNTCY, we provide a set of components that, when combined, deliver an observability and evaluation solution for multi-agent systems. These components are shown in the diagram below:

![Observability and Evaluation architecture](../assets/obs-and-eval/observe-and-eval-arch.png)

In summary, we provide:
- An AGNTCY observability data schema that provides comprehensive coverage for MAS telemetry.
- An Observability SDK to instrument agents as well as agentic protocols (e.g., SLIM and A2A).
- A Metrics Computation Engine to derive higher-level metrics from raw telemetry.
- An Observability API to query traces and metrics produced by the SDK.
34 changes: 34 additions & 0 deletions docs/obs-and-eval/observe-sdk.md
@@ -0,0 +1,34 @@
# Observability SDK

## AGNTCY Observability Data Schema

We provide the AGNTCY Observability Data Schema, an extension of OpenTelemetry (OTel) and aligned with LLM semantic conventions for generative AI systems. The schema is designed to deliver comprehensive observability for multi‑agent systems (MAS), enabling detailed monitoring, analysis, and evaluation of performance and behavior.

Its goal is to standardize telemetry across diverse agent frameworks, enriching core OTel structures with MAS‑specific concepts such as agent collaboration success rate, MAS response time, and task delegation accuracy.

For more information, see the [schema directory in the observe repository](https://github.com/agntcy/observe/tree/main/schema).

## Observe SDK

We provide a framework‑agnostic, OTel‑compliant observability SDK for multi‑agent systems. Each agent in the MAS can be instrumented by applying lightweight decorators to key functions or by using native OpenTelemetry primitives directly. The SDK exports metrics and traces using the [OpenTelemetry (OTel)](https://opentelemetry.io/) standard.

OTel is central to effective MAS observability as it provides:
- **Unified telemetry collection:** A single, vendor‑neutral API/SDK for metrics, traces, and (optionally) logs across heterogeneous agent components.
- **Consistent instrumentation & standardization:** Semantic conventions and stable APIs ensure uniform telemetry shape, enabling backend flexibility without re‑instrumentation.
- **Distributed tracing for complex workflows:** OTel’s tracing links agent spans, LLM calls, and tool interactions into an end‑to‑end execution graph—critical for understanding non‑deterministic decision paths and debugging across agent boundaries.

### Agentic Protocol Instrumentation

The Observe SDK also instruments major agent communication and coordination protocols. Currently supported: Agent‑to‑Agent (A2A), Secure Low‑latency Messaging (SLIM), and the Model Context Protocol (MCP). This yields traceability for agent‑to‑agent and agent‑to‑tool interactions within a MAS.

### End‑to‑End Trace Recomposition

A core MAS challenge is decentralization: independent autonomous agents, when instrumented in isolation, produce fragmented trace trees (one per agent). To avoid stitching telemetry manually, we leverage OTel context propagation to carry identifiers (e.g., session, user, workflow) across boundaries. This preserves causal linkage and yields a coherent, recomposed execution trace spanning all participating agents.

## Translator

In heterogeneous environments, different OTel‑compliant SDKs may emit telemetry following divergent schemas (there is no universal MAS schema standard yet). A translator layer can normalize these differences—mapping attributes, renaming fields, or redacting sensitive keys. The OpenTelemetry Collector supports this via configurable [processors](https://opentelemetry.io/docs/collector/transforming-telemetry/) that transform telemetry in transit (e.g., attribute rename, drop, redaction), enabling schema convergence without rewriting upstream code.
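For illustration, a Collector `transform` processor could rename a framework-specific attribute to a schema name and redact a sensitive key (the attribute keys below are hypothetical):

```yaml
processors:
  transform:
    trace_statements:
      - context: span
        statements:
          # Rename a framework-specific attribute to the target schema name.
          - set(attributes["agent.name"], attributes["llm.agent"]) where attributes["llm.agent"] != nil
          - delete_key(attributes, "llm.agent")
          # Redact a sensitive key before export.
          - delete_key(attributes, "user.email")
```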

## Links

To get started with the Observe SDK, visit the [repository](https://github.com/agntcy/observe/).