| 1 | +# GossipSub Metrics Specification |
| 2 | + |
| 3 | +> Standardized optional metrics for GossipSub implementations to enable consistent and comparable performance monitoring |
| 4 | + |
| 5 | +| Lifecycle Stage | Maturity | Status | Latest Revision | |
| 6 | +|-----------------|---------------|--------|-----------------| |
| 7 | +| 1A | Working Draft | Active | r0, 2025-01-25 | |
| 8 | + |
| 9 | +Authors: [@dennis-tra] |
| 10 | + |
| 11 | +Interest Group: TBD |
| 12 | + |
| 13 | +[@dennis-tra]: https://github.com/dennis-tra |
| 14 | + |
| 15 | +See the [lifecycle document][lifecycle-spec] for context about the maturity level and spec status. |
| 16 | + |
| 17 | +[lifecycle-spec]: https://github.com/libp2p/specs/blob/master/00-framework-01-spec-lifecycle.md |
| 18 | + |
| 19 | + |
| 20 | +## Table of Contents |
| 21 | + |
| 22 | +- [GossipSub Metrics Specification](#gossipsub-metrics-specification) |
| 23 | + - [Table of Contents](#table-of-contents) |
| 24 | + - [Motivation](#motivation) |
| 25 | + - [Metric Categories](#metric-categories) |
| 26 | + - [1. Peer Management Metrics](#1-peer-management-metrics) |
| 27 | + - [2. Message Flow Metrics](#2-message-flow-metrics) |
| 28 | + - [3. Protocol Control Metrics](#3-protocol-control-metrics) |
| 29 | + - [4. Performance \& Health Metrics](#4-performance--health-metrics) |
| 30 | + - [Metric Specifications](#metric-specifications) |
| 31 | + - [Naming Conventions](#naming-conventions) |
| 32 | + - [Metric Types](#metric-types) |
| 33 | + - [Standard Labels](#standard-labels) |
| 34 | + - [Metric Definitions](#metric-definitions) |
| 35 | + - [Metric Update Semantics](#metric-update-semantics) |
| 36 | + - [Implementation Guidelines](#implementation-guidelines) |
| 37 | + - [Metric Collection Performance](#metric-collection-performance) |
| 38 | + - [Configuration Options](#configuration-options) |
| 39 | + - [Backward Compatibility](#backward-compatibility) |
| 40 | + - [Prometheus Export Format](#prometheus-export-format) |
| 41 | + - [Example Prometheus Output](#example-prometheus-output) |
| 42 | + - [Security Considerations](#security-considerations) |
| 43 | + |
| 44 | + |
| 45 | +## Motivation |
| 46 | + |
| 47 | +GossipSub implementations in different programming languages currently expose different sets of metrics for observability and performance monitoring. This inconsistency makes it hard for node operators, among others, to deploy unified monitoring dashboards across heterogeneous deployments, compare performance characteristics between implementations, diagnose network health issues using standardized indicators, and create portable alerting rules and runbooks. |
| 48 | + |
| 49 | +This specification defines a standardized set of **optional Prometheus-style metrics** that GossipSub implementations MAY support to enable consistent observability. These metrics are designed to be: |
| 50 | + |
| 51 | +1. **Opt-in**: Implementations can continue using their existing metrics without breaking changes |
| 52 | +2. **Diagnostic-focused**: Help operators identify when GossipSub is experiencing issues or performing suboptimally |
| 53 | +3. **Cross-implementation compatible**: Enable meaningful performance comparisons across Go, Rust, JavaScript, and other implementations |
| 54 | + |
| 55 | + |
| 56 | +The scope of this specification covers: |
| 57 | + |
| 58 | +- Definition of standardized metric names, types, and labels |
| 59 | +- Semantic specifications for when metrics should be updated |
| 60 | +- Export format recommendations for interoperability |
| 61 | +- Implementation guidelines for performance-conscious metric collection |
| 62 | + |
| 63 | + |
| 64 | +## Metric Categories |
| 65 | + |
| 66 | +This specification organizes metrics into four primary categories: |
| 67 | + |
| 68 | +### 1. Peer Management Metrics |
| 69 | +Metrics related to peer relationships, mesh topology, and peer lifecycle events. |
| 70 | + |
| 71 | +### 2. Message Flow Metrics |
| 72 | +Metrics tracking message processing, validation, delivery, and publishing operations. |
| 73 | + |
| 74 | +### 3. Protocol Control Metrics |
| 75 | +Metrics for RPC messages, gossip control operations, and protocol-specific events. |
| 76 | + |
| 77 | +### 4. Performance & Health Metrics |
| 78 | +Metrics indicating system performance, resource utilization, and network health indicators. |
| 79 | + |
| 80 | +## Metric Specifications |
| 81 | + |
| 82 | +### Naming Conventions |
| 83 | + |
| 84 | +All standardized metric names MUST follow these conventions: |
| 85 | +- Use `gossipsub_` prefix for all metrics |
| 86 | +- Use snake_case for metric names and label names |
| 87 | +- Use descriptive names that clearly indicate what is being measured |
| 88 | +- Include units in metric names where applicable (e.g., `_duration_seconds`, `_bytes_total`) |
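The prefix and snake_case rules above are mechanical enough to check automatically. The following is a minimal sketch of such a check; the helper name `is_valid_metric_name` and the exact regular expression are assumptions made for illustration and are not part of this specification.

```python
import re

# Hypothetical pattern: `gossipsub_` prefix followed by lowercase
# alphanumeric segments separated by single underscores (snake_case).
_NAME_RE = re.compile(r"gossipsub_[a-z0-9]+(_[a-z0-9]+)*")


def is_valid_metric_name(name: str) -> bool:
    """Return True if `name` carries the prefix and is snake_case."""
    return _NAME_RE.fullmatch(name) is not None
```

A check like this could run in a unit test over every metric an implementation registers. It does not verify that names are descriptive or that unit suffixes are present; those conventions still need human review.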
| 89 | + |
| 90 | +### Metric Types |
| 91 | + |
| 92 | +This specification defines Prometheus-style metrics using three standard types: |
| 93 | + |
| 94 | +- **Counter**: Monotonically increasing values (e.g., message counts, error counts) |
| 95 | +- **Gauge**: Values that can increase or decrease (e.g., peer counts, mesh size) |
| 96 | +- **Histogram**: Distribution of values with configurable buckets (e.g., latency, scores) |
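To make the difference between the first two types concrete, here is a small in-memory sketch of a counter and a gauge. The class and method names are illustrative assumptions; real implementations would normally rely on the metric types of their Prometheus client library instead.

```python
class Counter:
    """Monotonically increasing value; negative increments are rejected."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """Value that may move in either direction or be set outright."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value

    def inc(self, amount: float = 1.0) -> None:
        self.value += amount

    def dec(self, amount: float = 1.0) -> None:
        self.value -= amount
```

Under this sketch, an event count such as messages received would use `Counter.inc()`, while a mesh size would use `Gauge.inc()` and `Gauge.dec()` as peers join and leave.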
| 97 | + |
| 98 | +### Standard Labels |
| 99 | + |
| 100 | +The following labels MAY be applied to metrics where semantically appropriate: |
| 101 | + |
| 102 | +- `topic`: The pubsub topic name (when metric is topic-specific) |
| 103 | +- `peer_id`: Peer identifier (when metric is peer-specific; use sparingly, since every distinct peer ID creates a new time series) |
| 104 | +- `reason`: Categorization of why an event occurred (e.g., rejection reason, penalty type) |
| 105 | +- `message_type`: Type of RPC or control message |
| 106 | +- `validation_result`: Result of message validation (accept, reject, ignore) |
| 107 | + |
| 108 | +## Metric Definitions |
| 109 | + |
| 110 | +All metrics follow Prometheus naming conventions and use the `gossipsub_` prefix. The following table defines the complete set of standardized metrics: |
| 111 | + |
| 112 | +| Metric Name | Type | Labels | Description | |
| 113 | +|-------------|------|--------|--------------| |
| 114 | +| **Peer Management** | |
| 115 | +| `gossipsub_peers_total` | Gauge | `topic` (optional) | Current number of known peers, optionally segmented by topic | |
| 116 | +| `gossipsub_mesh_peers_total` | Gauge | `topic` (required) | Current number of peers in the mesh for each topic | |
| 117 | +| `gossipsub_peer_graft_total` | Counter | `topic` (required) | Total number of GRAFT messages sent, by topic | |
| 118 | +| `gossipsub_peer_prune_total` | Counter | `topic` (required), `reason` (optional) | Total number of PRUNE messages sent, by topic and optional reason | |
| 119 | +| `gossipsub_peer_score` | Histogram | `topic` (optional) | Distribution of peer scores | |
| 120 | +| **Message Flow** | |
| 121 | +| `gossipsub_message_received_total` | Counter | `topic` (required), `validation_result` (optional) | Total messages received for processing, optionally by validation result | |
| 122 | +| `gossipsub_message_delivered_total` | Counter | `topic` (required) | Total messages successfully delivered to local subscribers | |
| 123 | +| `gossipsub_message_rejected_total` | Counter | `topic` (required), `reason` (optional) | Total messages rejected during validation, optionally by reason | |
| 124 | +| `gossipsub_message_duplicate_total` | Counter | `topic` (required) | Total duplicate messages detected and discarded | |
| 125 | +| `gossipsub_message_published_total` | Counter | `topic` (required) | Total messages published by local node | |
| 126 | +| `gossipsub_message_latency_seconds` | Histogram | `topic` (optional) | End-to-end message delivery latency in seconds | |
| 127 | +| **Protocol Control** | |
| 128 | +| `gossipsub_rpc_received_total` | Counter | `message_type` (required) | Total RPC messages received by type (publish, subscribe, unsubscribe, graft, prune, ihave, iwant, idontwant) | |
| 129 | +| `gossipsub_rpc_sent_total` | Counter | `message_type` (required) | Total RPC messages sent by type | |
| 130 | +| `gossipsub_ihave_sent_total` | Counter | `topic` (required) | Total IHAVE control messages sent per topic | |
| 131 | +| `gossipsub_iwant_sent_total` | Counter | `topic` (required) | Total IWANT control messages sent per topic | |
| 132 | +| `gossipsub_idontwant_sent_total` | Counter | `topic` (required) | Total IDONTWANT control messages sent per topic | |
| 133 | +| **Performance & Health** | |
| 134 | +| `gossipsub_heartbeat_duration_seconds` | Histogram | None | Time spent processing each heartbeat operation | |
| 135 | +| `gossipsub_peer_throttled_total` | Counter | `reason` (optional) | Total number of times peers have been throttled | |
| 136 | +| `gossipsub_backoff_violations_total` | Counter | None | Total GRAFT attempts received from peers before their backoff period had elapsed | |
| 137 | +| `gossipsub_score_penalty_total` | Counter | `penalty_type` (required), `topic` (optional) | Total peer scoring penalties applied by type | |
| 138 | + |
| 139 | +### Metric Update Semantics |
| 140 | + |
| 141 | +**Counters** are incremented when: |
| 142 | +- `*_total` metrics: Each time the corresponding event occurs (message sent/received, peer action, etc.) |
| 143 | +- Events are counted at the protocol level, not application level |
| 144 | + |
| 145 | +**Gauges** are updated when: |
| 146 | +- `*_peers_total`: Peers are added/removed from peer tracking or topic meshes |
| 147 | +- Values reflect current state at time of observation |
| 148 | + |
| 149 | +**Histograms** are updated when: |
| 150 | +- `gossipsub_peer_score`: During peer scoring operations (recommended buckets: `[-100, -10, -1, 0, 1, 10, 100, +Inf]`) |
| 151 | +- `*_latency_seconds`: When latency measurements are available (recommended buckets: `[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0, +Inf]`) |
| 152 | +- `*_duration_seconds`: When timing operations complete (recommended buckets: `[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, +Inf]`) |
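The bucket lists above use Prometheus `le` (less than or equal) semantics, and exported bucket counts are cumulative. The sketch below shows one way an observation could be assigned to a bucket under those semantics. The `Histogram` class here is a hypothetical illustration, not an API defined by this specification or by any client library.

```python
import bisect
import math


class Histogram:
    """Minimal histogram with upper-inclusive (`le`) bucket bounds."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        # Always keep a catch-all +Inf bucket as the last bound.
        if not self.bounds or self.bounds[-1] != math.inf:
            self.bounds.append(math.inf)
        self.counts = [0] * len(self.bounds)  # per-bucket, not cumulative
        self.sum = 0.0
        self.count = 0

    def observe(self, value: float) -> None:
        # bisect_left returns the index of the first bound >= value,
        # which is the smallest bucket satisfying value <= bound.
        idx = bisect.bisect_left(self.bounds, value)
        self.counts[idx] += 1
        self.sum += value
        self.count += 1

    def cumulative(self):
        """Return (bound, cumulative_count) pairs as exported via `_bucket`."""
        total = 0
        pairs = []
        for bound, n in zip(self.bounds, self.counts):
            total += n
            pairs.append((bound, total))
        return pairs
```

Storing per-bucket counts and accumulating only at export time keeps `observe` cheap. The last cumulative value always equals `count`, which matches the relationship between the `+Inf` bucket and `_count` in the exposition format.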
| 153 | + |
| 154 | +## Implementation Guidelines |
| 155 | + |
| 156 | +### Metric Collection Performance |
| 157 | + |
| 158 | +Implementations SHOULD: |
| 159 | +- Use efficient metric collection mechanisms that minimize impact on message processing latency |
| 160 | +- Implement metric updates asynchronously where possible |
| 161 | +- Provide configuration options to disable metric collection entirely |
| 162 | +- Consider metric cardinality implications, especially for topic-specific metrics in high-topic-count environments |
| 163 | + |
| 164 | +### Configuration Options |
| 165 | + |
| 166 | +Implementations SHOULD provide configuration to: |
| 167 | +- Enable/disable entire metric categories |
| 168 | +- Configure histogram bucket boundaries based on expected value distributions |
| 169 | +- Set maximum cardinality limits for high-cardinality labels like `topic` |
| 170 | +- Control metric export formats and destinations |
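A maximum cardinality limit for a label such as `topic` could be enforced roughly as follows. This is a sketch under stated assumptions: the `LabelLimiter` name, the first-come policy, and the `"other"` overflow value are all hypothetical choices, and an implementation may prefer an allowlist or a different eviction strategy.

```python
class LabelLimiter:
    """Caps the number of distinct values recorded for one label.

    The first `max_values` distinct values are kept as-is; any later
    unseen value is folded into a shared overflow value.
    """

    def __init__(self, max_values: int, overflow: str = "other") -> None:
        self.max_values = max_values
        self.overflow = overflow
        self._seen = set()

    def normalize(self, value: str) -> str:
        if value in self._seen:
            return value
        if len(self._seen) < self.max_values:
            self._seen.add(value)
            return value
        return self.overflow
```

With `LabelLimiter(2)`, the topics `"a"` and `"b"` keep their own series while a third topic `"c"` is reported under `"other"`. The trade-off is that per-topic detail is lost for topics seen after the limit is reached.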
| 171 | + |
| 172 | +### Backward Compatibility |
| 173 | + |
| 174 | +Implementations adopting this specification: |
| 175 | +- MUST NOT remove or modify existing metrics without appropriate deprecation periods |
| 176 | +- MAY implement these metrics alongside existing metrics systems |
| 177 | +- SHOULD clearly document which standardized metrics are supported |
| 178 | + |
| 179 | +## Prometheus Export Format |
| 180 | + |
| 181 | +Implementations that expose these metrics MUST export them in Prometheus format following these conventions: |
| 182 | + |
| 183 | +- Use standard Prometheus metric types (counter, gauge, histogram) |
| 184 | +- Include `_total` suffix for counters following Prometheus conventions |
| 185 | +- Use `_seconds` suffix for time-based metrics |
| 186 | +- Provide help text describing each metric's purpose |
| 187 | +- Follow Prometheus naming best practices for metric and label names |
| 188 | + |
| 189 | +### Example Prometheus Output |
| 190 | + |
| 191 | +``` |
| 192 | +# HELP gossipsub_mesh_peers_total Current number of peers in the mesh for each topic |
| 193 | +# TYPE gossipsub_mesh_peers_total gauge |
| 194 | +gossipsub_mesh_peers_total{topic="ipfs-dht"} 8 |
| 195 | +gossipsub_mesh_peers_total{topic="libp2p-announce"} 12 |
| 196 | + |
| 197 | +# HELP gossipsub_message_received_total Total messages received for processing |
| 198 | +# TYPE gossipsub_message_received_total counter |
| 199 | +gossipsub_message_received_total{topic="ipfs-dht",validation_result="accept"} 1543 |
| 200 | +gossipsub_message_received_total{topic="ipfs-dht",validation_result="reject"} 23 |
| 201 | + |
| 202 | +# HELP gossipsub_heartbeat_duration_seconds Time spent processing each heartbeat operation |
| 203 | +# TYPE gossipsub_heartbeat_duration_seconds histogram |
| 204 | +gossipsub_heartbeat_duration_seconds_bucket{le="0.001"} 45 |
| 205 | +gossipsub_heartbeat_duration_seconds_bucket{le="0.005"} 123 |
| 206 | +gossipsub_heartbeat_duration_seconds_bucket{le="+Inf"} 150 |
| 207 | +gossipsub_heartbeat_duration_seconds_sum 0.456 |
| 208 | +gossipsub_heartbeat_duration_seconds_count 150 |
| 209 | +``` |
| 210 | + |
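The output above could be produced by a small renderer along the following lines. This is only a sketch: the function names are hypothetical, it handles counter and gauge samples but not histograms, and it omits escaping of the help text. In practice an implementation would normally delegate this to its Prometheus client library.

```python
def escape_label_value(value: str) -> str:
    # The text exposition format requires escaping backslash,
    # double quote, and line feed inside label values.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_samples(name, metric_type, help_text, samples):
    """Render one metric family.

    `samples` is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            body = ",".join(
                f'{key}="{escape_label_value(val)}"' for key, val in labels.items()
            )
            lines.append(f"{name}{{{body}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"
```

Calling `render_samples("gossipsub_mesh_peers_total", "gauge", "Current number of peers in the mesh for each topic", [({"topic": "ipfs-dht"}, 8)])` yields the HELP and TYPE lines followed by the `ipfs-dht` sample from the first family in the example.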
| 211 | +## Security Considerations |
| 212 | + |
| 213 | +When implementing these metrics, consider: |
| 214 | + |
| 215 | +- **Information Disclosure**: Topic names in metrics may reveal sensitive information about network usage patterns |
| 216 | +- **Cardinality Attacks**: Malicious peers could potentially cause high cardinality by creating many topics or using diverse peer IDs |
| 217 | +- **Resource Consumption**: Metric collection itself consumes memory and CPU resources that should be bounded |
| 218 | + |
| 219 | +Implementations SHOULD provide mechanisms to: |
| 220 | +- Hash or obfuscate sensitive label values |
| 221 | +- Limit the number of unique label combinations |
| 222 | +- Monitor and alert on metric collection resource usage |
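One possible way to obfuscate sensitive label values is a keyed hash, sketched below. The helper name `obfuscate_label`, the choice of HMAC-SHA256, and the 16-character truncation are assumptions for illustration; this specification does not mandate a particular scheme.

```python
import hashlib
import hmac


def obfuscate_label(value: str, key: bytes, length: int = 16) -> str:
    """Return a stable, truncated keyed hash of a label value.

    A secret key makes it harder to confirm a guessed topic name by
    hashing candidates; truncation keeps label values short.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]
```

The same input and key always map to the same label value, so time series stay continuous across restarts as long as the key is retained. Operators who hold the key can still map known topic names to their hashed labels when reading dashboards.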