Commit 8251acf

add: metrics spec draft
1 parent e5af917 commit 8251acf

1 file changed

+222
-0
lines changed

metrics/README.md

Lines changed: 222 additions & 0 deletions
@@ -0,0 +1,222 @@
# GossipSub Metrics Specification

> Standardized optional metrics for GossipSub implementations to enable consistent and comparable performance monitoring

| Lifecycle Stage | Maturity      | Status | Latest Revision |
|-----------------|---------------|--------|-----------------|
| 1A              | Working Draft | Active | r0, 2025-01-25  |

Authors: [@dennis-tra]

Interest Group: TBD

[@dennis-tra]: https://github.com/dennis-tra

See the [lifecycle document][lifecycle-spec] for context about the maturity level and spec status.

[lifecycle-spec]: https://github.com/libp2p/specs/blob/master/00-framework-01-spec-lifecycle.md

## Table of Contents

- [GossipSub Metrics Specification](#gossipsub-metrics-specification)
  - [Table of Contents](#table-of-contents)
  - [Motivation](#motivation)
  - [Metric Categories](#metric-categories)
    - [1. Peer Management Metrics](#1-peer-management-metrics)
    - [2. Message Flow Metrics](#2-message-flow-metrics)
    - [3. Protocol Control Metrics](#3-protocol-control-metrics)
    - [4. Performance \& Health Metrics](#4-performance--health-metrics)
  - [Metric Specifications](#metric-specifications)
    - [Naming Conventions](#naming-conventions)
    - [Metric Types](#metric-types)
    - [Standard Labels](#standard-labels)
  - [Metric Definitions](#metric-definitions)
    - [Metric Update Semantics](#metric-update-semantics)
  - [Implementation Guidelines](#implementation-guidelines)
    - [Metric Collection Performance](#metric-collection-performance)
    - [Configuration Options](#configuration-options)
    - [Backward Compatibility](#backward-compatibility)
  - [Prometheus Export Format](#prometheus-export-format)
    - [Example Prometheus Output](#example-prometheus-output)
  - [Security Considerations](#security-considerations)

## Motivation

GossipSub implementations across different programming languages currently expose varying sets of metrics for observability and performance monitoring. This inconsistency makes it difficult for node operators, among others, to:

- deploy unified monitoring dashboards across heterogeneous deployments
- compare performance characteristics between different implementations
- diagnose network health issues using standardized indicators
- create portable alerting rules and runbooks

This specification defines a standardized set of **optional Prometheus-style metrics** that GossipSub implementations MAY support to enable consistent observability. These metrics are designed to be:

1. **Opt-in**: Implementations can continue using their existing metrics without breaking changes
2. **Diagnostic-focused**: Help operators identify when GossipSub is experiencing issues or performing suboptimally
3. **Cross-implementation compatible**: Enable meaningful performance comparisons across Go, Rust, JavaScript, and other implementations

Scope and goals:

- Definition of standardized metric names, types, and labels
- Semantic specifications for when metrics should be updated
- Export format recommendations for interoperability
- Implementation guidelines for performance-conscious metric collection

## Metric Categories

This specification organizes metrics into four primary categories:

### 1. Peer Management Metrics
Metrics related to peer relationships, mesh topology, and peer lifecycle events.

### 2. Message Flow Metrics
Metrics tracking message processing, validation, delivery, and publishing operations.

### 3. Protocol Control Metrics
Metrics for RPC messages, gossip control operations, and protocol-specific events.

### 4. Performance & Health Metrics
Metrics indicating system performance, resource utilization, and network health.

## Metric Specifications

### Naming Conventions

All standardized metric names MUST follow these conventions:
- Use `gossipsub_` prefix for all metrics
- Use snake_case for metric names and label names
- Use descriptive names that clearly indicate what is being measured
- Include units in metric names where applicable (e.g., `_duration_seconds`, `_bytes_total`)
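
The prefix and snake_case rules above are mechanical enough to check automatically. The sketch below is one possible check, assuming a hypothetical helper named `follows_naming_conventions`; it is not part of this specification and does not attempt to judge whether a name is descriptive or carries the right unit suffix.

```python
import re

# Hypothetical helper, not defined by the specification: it checks only the
# `gossipsub_` prefix and snake_case rules from the list above.
_NAME_RE = re.compile(r"^gossipsub_[a-z][a-z0-9]*(?:_[a-z0-9]+)*$")


def follows_naming_conventions(name: str) -> bool:
    """Return True if `name` has the gossipsub_ prefix and is snake_case."""
    return _NAME_RE.fullmatch(name) is not None


print(follows_naming_conventions("gossipsub_heartbeat_duration_seconds"))
print(follows_naming_conventions("gossipsubMeshPeers"))
print(follows_naming_conventions("mesh_peers_total"))
```

Only the first name passes: the second is camelCase without the prefix separator, and the third lacks the prefix entirely.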

### Metric Types

This specification defines Prometheus-style metrics using three standard types:

- **Counter**: Monotonically increasing values (e.g., message counts, error counts)
- **Gauge**: Values that can increase or decrease (e.g., peer counts, mesh size)
- **Histogram**: Distribution of values with configurable buckets (e.g., latency, scores)

### Standard Labels

The following labels MAY be applied to metrics where semantically appropriate:

- `topic`: The pubsub topic name (when the metric is topic-specific)
- `peer_id`: Peer identifier (when the metric is peer-specific; use sparingly, because every distinct peer adds a label value and raises cardinality)
- `reason`: Categorization of why an event occurred (e.g., rejection reason, penalty type)
- `penalty_type`: Type of scoring penalty applied (used by `gossipsub_score_penalty_total`)
- `message_type`: Type of RPC or control message
- `validation_result`: Result of message validation (accept, reject, ignore)

## Metric Definitions

All metrics follow Prometheus naming conventions and use the `gossipsub_` prefix. The following table defines the complete set of standardized metrics:

| Metric Name | Type | Labels | Description |
|-------------|------|--------|-------------|
| **Peer Management** | | | |
| `gossipsub_peers_total` | Gauge | `topic` (optional) | Current number of known peers, optionally segmented by topic |
| `gossipsub_mesh_peers_total` | Gauge | `topic` (required) | Current number of peers in the mesh for each topic |
| `gossipsub_peer_graft_total` | Counter | `topic` (required) | Total number of GRAFT messages sent, by topic |
| `gossipsub_peer_prune_total` | Counter | `topic` (required), `reason` (optional) | Total number of PRUNE messages sent, by topic and optional reason |
| `gossipsub_peer_score` | Histogram | `topic` (optional) | Distribution of peer scores |
| **Message Flow** | | | |
| `gossipsub_message_received_total` | Counter | `topic` (required), `validation_result` (optional) | Total messages received for processing, optionally by validation result |
| `gossipsub_message_delivered_total` | Counter | `topic` (required) | Total messages successfully delivered to local subscribers |
| `gossipsub_message_rejected_total` | Counter | `topic` (required), `reason` (optional) | Total messages rejected during validation, optionally by reason |
| `gossipsub_message_duplicate_total` | Counter | `topic` (required) | Total duplicate messages detected and discarded |
| `gossipsub_message_published_total` | Counter | `topic` (required) | Total messages published by local node |
| `gossipsub_message_latency_seconds` | Histogram | `topic` (optional) | End-to-end message delivery latency in seconds |
| **Protocol Control** | | | |
| `gossipsub_rpc_received_total` | Counter | `message_type` (required) | Total RPC messages received by type (publish, subscribe, unsubscribe, graft, prune, ihave, iwant, idontwant) |
| `gossipsub_rpc_sent_total` | Counter | `message_type` (required) | Total RPC messages sent by type |
| `gossipsub_ihave_sent_total` | Counter | `topic` (required) | Total IHAVE control messages sent per topic |
| `gossipsub_iwant_sent_total` | Counter | `topic` (required) | Total IWANT control messages sent per topic |
| `gossipsub_idontwant_sent_total` | Counter | `topic` (required) | Total IDONTWANT control messages sent per topic |
| **Performance & Health** | | | |
| `gossipsub_heartbeat_duration_seconds` | Histogram | None | Time spent processing each heartbeat operation |
| `gossipsub_peer_throttled_total` | Counter | `reason` (optional) | Total number of times peers have been throttled |
| `gossipsub_backoff_violations_total` | Counter | None | Total attempts to reconnect before backoff period completion |
| `gossipsub_score_penalty_total` | Counter | `penalty_type` (required), `topic` (optional) | Total peer scoring penalties applied by type |
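
The required/optional label columns in the table can also be carried as data, so that an implementation or a conformance test can reject a sample with a missing or unexpected label. The following is a minimal sketch under that assumption; `MetricSpec`, `SPECS`, and `check_labels` are hypothetical names, and only four rows of the table are transcribed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricSpec:
    kind: str  # "counter", "gauge", or "histogram"
    required: frozenset = frozenset()
    optional: frozenset = frozenset()


# A subset of the table above; the remaining rows would follow the same shape.
SPECS = {
    "gossipsub_peers_total": MetricSpec("gauge", optional=frozenset({"topic"})),
    "gossipsub_mesh_peers_total": MetricSpec("gauge", required=frozenset({"topic"})),
    "gossipsub_peer_prune_total": MetricSpec(
        "counter", required=frozenset({"topic"}), optional=frozenset({"reason"})
    ),
    "gossipsub_heartbeat_duration_seconds": MetricSpec("histogram"),
}


def check_labels(name: str, labels: dict) -> list:
    """Return a list of problems; an empty list means the label set is valid."""
    spec = SPECS[name]
    given = set(labels)
    problems = [f"missing required label: {l}" for l in sorted(spec.required - given)]
    problems += [
        f"unexpected label: {l}"
        for l in sorted(given - spec.required - spec.optional)
    ]
    return problems


print(check_labels("gossipsub_mesh_peers_total", {}))
print(check_labels("gossipsub_peer_prune_total", {"topic": "t", "reason": "score"}))
```

The first call reports the missing `topic` label; the second returns an empty list because `reason` is an allowed optional label.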

### Metric Update Semantics

**Counters** are incremented when:
- Counter metrics (`*_total`, other than the `*_peers_total` gauges): Each time the corresponding event occurs (message sent/received, peer action, etc.)
- Events are counted at the protocol level, not the application level

**Gauges** are updated when:
- `*_peers_total`: Peers are added/removed from peer tracking or topic meshes
- Values reflect current state at time of observation

**Histograms** are updated when:
- `gossipsub_peer_score`: During peer scoring operations (recommended buckets: `[-100, -10, -1, 0, 1, 10, 100, +Inf]`)
- `*_latency_seconds`: When latency measurements are available (recommended buckets: `[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0, +Inf]`)
- `*_duration_seconds`: When timing operations complete (recommended buckets: `[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, +Inf]`)
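
To illustrate how a histogram observation maps onto the recommended buckets, here is a toy sketch using the `*_duration_seconds` boundaries listed above. The `Histogram` class is hypothetical and stands in for whatever client library an implementation uses; it assumes the usual Prometheus convention that a bucket's upper bound is inclusive and that exported bucket counts are cumulative.

```python
import bisect
import math

# Recommended *_duration_seconds buckets from the list above.
DURATION_BUCKETS = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, math.inf]


class Histogram:
    """Toy histogram (hypothetical class, not a real client library)."""

    def __init__(self, upper_bounds):
        self.upper_bounds = list(upper_bounds)
        self.bucket_counts = [0] * len(self.upper_bounds)  # stored non-cumulative
        self.sum = 0.0
        self.count = 0

    def observe(self, value: float) -> None:
        # Index of the first bucket whose upper bound is >= value ("le" semantics).
        idx = bisect.bisect_left(self.upper_bounds, value)
        self.bucket_counts[idx] += 1
        self.sum += value
        self.count += 1

    def cumulative(self):
        """Return (upper_bound, cumulative_count) pairs, as they would be exported."""
        total, out = 0, []
        for bound, n in zip(self.upper_bounds, self.bucket_counts):
            total += n
            out.append((bound, total))
        return out


h = Histogram(DURATION_BUCKETS)
for seconds in (0.0004, 0.001, 0.003, 2.5):
    h.observe(seconds)
print(h.cumulative()[:2], h.count)
```

An observation of exactly `0.001` lands in the `le="0.001"` bucket, and the `+Inf` bucket always equals the total observation count.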

## Implementation Guidelines

### Metric Collection Performance

Implementations SHOULD:
- Use efficient metric collection mechanisms that minimize impact on message processing latency
- Implement metric updates asynchronously where possible
- Provide configuration options to disable metric collection entirely
- Consider metric cardinality implications, especially for topic-specific metrics in high-topic-count environments

### Configuration Options

Implementations SHOULD provide configuration to:
- Enable/disable entire metric categories
- Configure histogram bucket boundaries based on expected value distributions
- Set maximum cardinality limits for high-cardinality labels like `topic`
- Control metric export formats and destinations
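
One way to picture these options is as a single configuration object. The sketch below is illustrative only: `MetricsConfig`, its field names, the category keys, and the default of 100 topics are all assumptions, not values defined by this specification.

```python
from dataclasses import dataclass, field


@dataclass
class MetricsConfig:
    """Hypothetical configuration object; all field names are illustrative."""

    enabled: bool = True  # master switch for metric collection
    # Per-category switches, keyed by the four categories in this document.
    categories: dict = field(default_factory=lambda: {
        "peer_management": True,
        "message_flow": True,
        "protocol_control": True,
        "performance_health": True,
    })
    # Optional histogram bucket overrides, keyed by metric name.
    buckets: dict = field(default_factory=dict)
    max_topics: int = 100  # assumed default cap on distinct `topic` label values

    def is_enabled(self, category: str) -> bool:
        return self.enabled and self.categories.get(category, False)


cfg = MetricsConfig(max_topics=50)
cfg.categories["protocol_control"] = False
cfg.buckets["gossipsub_heartbeat_duration_seconds"] = [0.01, 0.1, 1.0]
print(cfg.is_enabled("message_flow"), cfg.is_enabled("protocol_control"))
```

Checking `is_enabled` once per category at registration time, rather than on every update, keeps the disabled path off the message-processing hot path.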

### Backward Compatibility

Implementations adopting this specification:
- MUST NOT remove or modify existing metrics without appropriate deprecation periods
- MAY implement these metrics alongside existing metrics systems
- SHOULD clearly document which standardized metrics are supported

## Prometheus Export Format

Implementations that expose these metrics MUST export them in Prometheus format, following these conventions:

- Use standard Prometheus metric types (counter, gauge, histogram)
- Include `_total` suffix for counters following Prometheus conventions
- Use `_seconds` suffix for time-based metrics
- Provide help text describing each metric's purpose
- Follow Prometheus naming best practices for metric and label names

### Example Prometheus Output

```
# HELP gossipsub_mesh_peers_total Current number of peers in the mesh for each topic
# TYPE gossipsub_mesh_peers_total gauge
gossipsub_mesh_peers_total{topic="ipfs-dht"} 8
gossipsub_mesh_peers_total{topic="libp2p-announce"} 12

# HELP gossipsub_message_received_total Total messages received for processing
# TYPE gossipsub_message_received_total counter
gossipsub_message_received_total{topic="ipfs-dht",validation_result="accept"} 1543
gossipsub_message_received_total{topic="ipfs-dht",validation_result="reject"} 23

# HELP gossipsub_heartbeat_duration_seconds Time spent processing each heartbeat operation
# TYPE gossipsub_heartbeat_duration_seconds histogram
gossipsub_heartbeat_duration_seconds_bucket{le="0.001"} 45
gossipsub_heartbeat_duration_seconds_bucket{le="0.005"} 123
gossipsub_heartbeat_duration_seconds_bucket{le="+Inf"} 150
gossipsub_heartbeat_duration_seconds_sum 0.456
gossipsub_heartbeat_duration_seconds_count 150
```
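
For implementations without a Prometheus client library, the counter and gauge lines above are simple to produce by hand. The following sketch renders one metric family in the text exposition format; `render` and `escape_label_value` are hypothetical helper names, and histogram families and help-text escaping are left out for brevity.

```python
def escape_label_value(value: str) -> str:
    # The text exposition format escapes backslash, double quote, and newline
    # inside label values.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render(name, kind, help_text, samples):
    """Render one counter or gauge family.

    `samples` is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {kind}"]
    for labels, value in samples:
        if labels:
            body = ",".join(
                f'{k}="{escape_label_value(v)}"' for k, v in labels.items()
            )
            lines.append(f"{name}{{{body}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


text = render(
    "gossipsub_mesh_peers_total",
    "gauge",
    "Current number of peers in the mesh for each topic",
    [({"topic": "ipfs-dht"}, 8), ({"topic": "libp2p-announce"}, 12)],
)
print(text, end="")
```

With these inputs the output matches the first block of the example above line for line.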

## Security Considerations

When implementing these metrics, consider:

- **Information Disclosure**: Topic names in metrics may reveal sensitive information about network usage patterns
- **Cardinality Attacks**: Malicious peers could potentially cause high cardinality by creating many topics or using diverse peer IDs
- **Resource Consumption**: Metric collection itself consumes memory and CPU resources that should be bounded

Implementations SHOULD provide mechanisms to:
- Hash or obfuscate sensitive label values
- Limit the number of unique label combinations
- Monitor and alert on metric collection resource usage
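
The first two mechanisms can be combined in a small wrapper that decides which `topic` label value a sample receives. This is a sketch under stated assumptions: `TopicLabeler` is a hypothetical helper, the overflow value `"other"` and the 16-character truncated SHA-256 digest are illustrative choices, and none of them are mandated by this specification.

```python
import hashlib


class TopicLabeler:
    """Hypothetical helper that bounds and optionally obfuscates `topic` values."""

    def __init__(self, max_topics: int, hash_values: bool = False):
        self.max_topics = max_topics
        self.hash_values = hash_values
        self._seen = set()

    def label_for(self, topic: str) -> str:
        if topic not in self._seen:
            if len(self._seen) >= self.max_topics:
                # Cap reached: fold every further topic into one overflow value.
                return "other"
            self._seen.add(topic)
        if self.hash_values:
            # Truncated SHA-256: stable per topic without exposing the raw name.
            return hashlib.sha256(topic.encode("utf-8")).hexdigest()[:16]
        return topic


labeler = TopicLabeler(max_topics=2)
print([labeler.label_for(t) for t in ("blocks", "txs", "spam-1", "blocks")])
```

Once the cap is reached, new topics share the overflow value while already-tracked topics keep their own label. Note that an unsalted hash only obscures topic names: anyone who can guess a name can recompute its digest, so a keyed hash may be preferable where topic names are genuinely sensitive.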
