
Commit 000ec03

markmc, simon-mo, and atalhens authored and committed
[docs] Update v1 metrics design doc (vllm-project#27332)
Signed-off-by: Simon Mo <[email protected]>
Signed-off-by: Mark McLoughlin <[email protected]>
Signed-off-by: atalhens <[email protected]>
Co-authored-by: Simon Mo <[email protected]>
Co-authored-by: atalhens <[email protected]>
Signed-off-by: Alberto Perdomo <[email protected]>
1 parent f1a1b90 commit 000ec03

File tree: 1 file changed, +69 -83 lines changed


docs/design/metrics.md

Lines changed: 69 additions & 83 deletions
@@ -1,12 +1,12 @@
 # Metrics
 
-Ensure the v1 LLM Engine exposes a superset of the metrics available in v0.
+vLLM exposes a rich set of metrics to support observability and capacity planning for the V1 engine.
 
 ## Objectives
 
-- Achieve parity of metrics between v0 and v1.
-- The priority use case is accessing these metrics via Prometheus, as this is what we expect to be used in production environments.
-- Logging support (i.e. printing metrics to the info log) is provided for more ad-hoc testing, debugging, development, and exploratory use cases.
+- Provide comprehensive coverage of engine and request level metrics to aid production monitoring.
+- Prioritize Prometheus integrations, as this is what we expect to be used in production environments.
+- Offer logging support (i.e. printing metrics to the info log) for ad-hoc testing, debugging, development, and exploratory use cases.
 
 ## Background
 
@@ -17,45 +17,36 @@ Metrics in vLLM can be categorized as follows:
 
 The mental model is that server-level metrics help explain the values of request-level metrics.
 
-### v0 Metrics
-
-In v0, the following metrics are exposed via a Prometheus-compatible `/metrics` endpoint using the `vllm:` prefix:
-
-- `vllm:num_requests_running` (Gauge)
-- `vllm:num_requests_swapped` (Gauge)
-- `vllm:num_requests_waiting` (Gauge)
-- `vllm:gpu_cache_usage_perc` (Gauge)
-- `vllm:cpu_cache_usage_perc` (Gauge)
-- `vllm:gpu_prefix_cache_hit_rate` (Gauge)
-- `vllm:cpu_prefix_cache_hit_rate` (Gauge)
-- `vllm:prompt_tokens_total` (Counter)
-- `vllm:generation_tokens_total` (Counter)
-- `vllm:request_success_total` (Counter)
-- `vllm:request_prompt_tokens` (Histogram)
-- `vllm:request_generation_tokens` (Histogram)
-- `vllm:time_to_first_token_seconds` (Histogram)
-- `vllm:time_per_output_token_seconds` (Histogram)
-- `vllm:e2e_request_latency_seconds` (Histogram)
-- `vllm:request_queue_time_seconds` (Histogram)
-- `vllm:request_inference_time_seconds` (Histogram)
-- `vllm:request_prefill_time_seconds` (Histogram)
-- `vllm:request_decode_time_seconds` (Histogram)
-- `vllm:request_max_num_generation_tokens` (Histogram)
-- `vllm:num_preemptions_total` (Counter)
-- `vllm:cache_config_info` (Gauge)
-- `vllm:lora_requests_info` (Gauge)
-- `vllm:tokens_total` (Counter)
-- `vllm:iteration_tokens_total` (Histogram)
-- `vllm:time_in_queue_requests` (Histogram)
-- `vllm:model_forward_time_milliseconds` (Histogram)
-- `vllm:model_execute_time_milliseconds` (Histogram)
-- `vllm:request_params_n` (Histogram)
-- `vllm:request_params_max_tokens` (Histogram)
-- `vllm:spec_decode_draft_acceptance_rate` (Gauge)
-- `vllm:spec_decode_efficiency` (Gauge)
-- `vllm:spec_decode_num_accepted_tokens_total` (Counter)
-- `vllm:spec_decode_num_draft_tokens_total` (Counter)
-- `vllm:spec_decode_num_emitted_tokens_total` (Counter)
+### Metrics Overview
+
+### v1 Metrics
+
+In v1, the following metrics are exposed via a Prometheus-compatible `/metrics` endpoint using the `vllm:` prefix:
+
+- `vllm:num_requests_running` (Gauge) - Number of requests currently running.
+- `vllm:num_requests_waiting` (Gauge) - Number of requests currently waiting.
+- `vllm:kv_cache_usage_perc` (Gauge) - Fraction of used KV cache blocks (0–1).
+- `vllm:prefix_cache_queries` (Counter) - Number of prefix cache queries.
+- `vllm:prefix_cache_hits` (Counter) - Number of prefix cache hits.
+- `vllm:mm_cache_queries` (Counter) - (For multimodal models) Number of multimodal cache queries.
+- `vllm:mm_cache_hits` (Counter) - (For multimodal models) Number of multimodal cache hits.
+- `vllm:num_preemptions_total` (Counter) - Number of preemptions.
+- `vllm:prompt_tokens_total` (Counter) - Total number of prompt tokens processed.
+- `vllm:generation_tokens_total` (Counter) - Total number of generated tokens.
+- `vllm:iteration_tokens_total` (Histogram) - Histogram of tokens processed in each engine step.
+- `vllm:cache_config_info` (Gauge) - Information about the cache configuration.
+- `vllm:request_success_total` (Counter) - Number of finished requests (by finish reason).
+- `vllm:request_prompt_tokens` (Histogram) - Histogram of input prompt token counts.
+- `vllm:request_generation_tokens` (Histogram) - Histogram of generation token counts.
+- `vllm:request_params_n` (Histogram) - Histogram of request parameter n.
+- `vllm:request_params_max_tokens` (Histogram) - Histogram of max_tokens parameter in requests.
+- `vllm:time_to_first_token_seconds` (Histogram) - Time to first token (TTFT).
+- `vllm:inter_token_latency_seconds` (Histogram) - Inter-token latency.
+- `vllm:e2e_request_latency_seconds` (Histogram) - End-to-end request latency.
+- `vllm:request_queue_time_seconds` (Histogram) - Time spent in the queue.
+- `vllm:request_inference_time_seconds` (Histogram) - Request inference time.
+- `vllm:request_prefill_time_seconds` (Histogram) - Request prefill time.
+- `vllm:request_decode_time_seconds` (Histogram) - Request decode time.
 
 These are documented under [Inferencing and Serving -> Production Metrics](../usage/metrics.md).
 
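For example, the prefix cache counters above can be combined into a hit rate at scrape time. The sketch below is illustrative only and assumes a vLLM server listening on `localhost:8000`:

```python
# Illustrative sketch only (not part of vLLM): scrape /metrics from a server
# assumed to be running on localhost:8000 and derive a prefix cache hit rate
# from the vllm:prefix_cache_queries / vllm:prefix_cache_hits counters.
import urllib.request


def counter_total(text: str, metric: str) -> float:
    """Sum every sample of a counter, with or without the `_total` suffix."""
    total = 0.0
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name in (metric, metric + "_total"):
            total += float(line.rsplit(" ", 1)[-1])
    return total


metrics_text = urllib.request.urlopen("http://localhost:8000/metrics").read().decode()
queries = counter_total(metrics_text, "vllm:prefix_cache_queries")
hits = counter_total(metrics_text, "vllm:prefix_cache_hits")
if queries:
    print(f"prefix cache hit rate: {hits / queries:.2%}")
```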

@@ -86,7 +77,7 @@ See [the PR which added this Dashboard](https://github.com/vllm-project/vllm/pul
 
 Prometheus support was initially added [using the aioprometheus library](https://github.com/vllm-project/vllm/pull/1890), but a switch was made quickly to [prometheus_client](https://github.com/vllm-project/vllm/pull/2730). The rationale is discussed in both linked PRs.
 
-With the switch to `aioprometheus`, we lost a `MetricsMiddleware` to track HTTP metrics, but this was reinstated [using prometheus_fastapi_instrumentator](https://github.com/vllm-project/vllm/pull/15657):
+During those migrations we briefly lost a `MetricsMiddleware` to track HTTP metrics, but this was reinstated [using prometheus_fastapi_instrumentator](https://github.com/vllm-project/vllm/pull/15657):
 
 ```bash
 $ curl http://0.0.0.0:8000/metrics 2>/dev/null | grep -P '^http_(?!.*(_bucket|_created|_sum)).*'
@@ -99,7 +90,9 @@ http_request_duration_seconds_count{handler="/v1/completions",method="POST"} 201
 
 ### Multi-process Mode
 
-In v0, metrics are collected in the engine core process and we use multiprocess mode to make them available in the API server process. See <https://github.com/vllm-project/vllm/pull/7279>.
+Historically, metrics were collected in the engine core process and multiprocess mode was used to make them available in the API server process. See <https://github.com/vllm-project/vllm/pull/7279>.
+
+More recently, metrics are collected in the API server process and multiprocess mode is only used when `--api-server-count > 1`. See <https://github.com/vllm-project/vllm/pull/17546> and details on [API server scale-out](../serving/data_parallel_deployment.md#internal-load-balancing).
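As background on the mechanism referenced here, a minimal sketch of `prometheus_client` multiprocess aggregation is shown below; it is illustrative only, the directory path is an assumption, and it is not vLLM's actual wiring:

```python
# Illustrative sketch of prometheus_client's multiprocess mode. Worker
# processes write per-process sample files into the directory named by
# PROMETHEUS_MULTIPROC_DIR; the collector merges them when /metrics is scraped.
import os

os.environ.setdefault("PROMETHEUS_MULTIPROC_DIR", "/tmp/prometheus_multiproc")
os.makedirs(os.environ["PROMETHEUS_MULTIPROC_DIR"], exist_ok=True)

from prometheus_client import CollectorRegistry, generate_latest, multiprocess

registry = CollectorRegistry()
multiprocess.MultiProcessCollector(registry)  # merges the per-process files
print(generate_latest(registry).decode())
```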
 
 ### Built in Python/Process Metrics
 

@@ -116,29 +109,25 @@ The following metrics are supported by default by `prometheus_client`, but they
 - `process_open_fds`
 - `process_max_fds`
 
-This is relevant because if we move away from multiprocess mode in v1,
-we get these back. However, it's questionable how relevant these are
-if they don't aggregate these stats for all processes that make up a
-vLLM instance.
+Therefore, these metrics are unavailable when `--api-server-count > 1`. It's questionable how relevant these are since they do not aggregate these stats for all processes that make up a vLLM instance.
+
+## Metrics Design
 
-### v0 PRs and Issues
+The ["Even Better Observability"](https://github.com/vllm-project/vllm/issues/3616) feature was where much of the metrics design was planned. For example, see where [a detailed roadmap was laid out](https://github.com/vllm-project/vllm/issues/3616#issuecomment-2030858781).
 
-For background, these are some of the relevant PRs which added the v0 metrics:
+### Legacy PRs
+
+To help understand the background to the metrics design, here are some of the relevant PRs which added the original, now legacy, metrics:
 
 - <https://github.com/vllm-project/vllm/pull/1890>
 - <https://github.com/vllm-project/vllm/pull/2316>
 - <https://github.com/vllm-project/vllm/pull/2730>
 - <https://github.com/vllm-project/vllm/pull/4464>
 - <https://github.com/vllm-project/vllm/pull/7279>
 
-Also note the ["Even Better Observability"](https://github.com/vllm-project/vllm/issues/3616) feature where e.g. [a detailed roadmap was laid out](https://github.com/vllm-project/vllm/issues/3616#issuecomment-2030858781).
-
-## v1 Design
+### Metrics Implementation PRs
 
-### v1 PRs
-
-For background, here are the relevant v1 PRs relating to the v1
-metrics issue <https://github.com/vllm-project/vllm/issues/10582>:
+For background, here are the relevant PRs relating to the metrics implementation <https://github.com/vllm-project/vllm/issues/10582>:
 
 - <https://github.com/vllm-project/vllm/pull/11962>
 - <https://github.com/vllm-project/vllm/pull/11973>
@@ -369,7 +358,7 @@ vllm:cache_config_info{block_size="16",cache_dtype="auto",calculate_kv_scales="F
 
 However, `prometheus_client` has
 [never supported Info metrics in multiprocessing mode](https://github.com/prometheus/client_python/pull/300) -
-for [unclear reasons](https://github.com/vllm-project/vllm/pull/7279#discussion_r1710417152). We
+for [unclear reasons](gh-pr:7279#discussion_r1710417152). We
 simply use a `Gauge` metric set to 1 and
 `multiprocess_mode="mostrecent"` instead.
 
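A sketch of that Gauge-as-Info workaround using `prometheus_client` directly; the label names and values shown are an illustrative subset, not the exact set vLLM exports:

```python
# Illustrative sketch of the Gauge-as-Info pattern described above: the value
# is fixed at 1 and the configuration is carried in the labels.
from prometheus_client import Gauge

cache_config_info = Gauge(
    "vllm:cache_config_info",
    "Information about the cache configuration",
    labelnames=["block_size", "cache_dtype"],  # illustrative subset of labels
    multiprocess_mode="mostrecent",  # keep only the most recently written sample
)
cache_config_info.labels(block_size="16", cache_dtype="auto").set(1)
```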

@@ -396,9 +385,8 @@ recent metric is used, but only from currently running processes.
 
 This was added in <https://github.com/vllm-project/vllm/pull/9477> and there is
 [at least one known user](https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/54).
-If we revisit this design and deprecate the old metric, we should reduce
-the need for a significant deprecation period by making the change in
-v0 also and asking this project to move to the new metric.
+If we revisit this design and deprecate the old metric, we should
+coordinate with downstream users so they can migrate before the removal.
 
 ### Prefix Cache metrics
 
@@ -478,22 +466,20 @@ us with:
 
 ```python
 if seq_group.is_finished():
-    if (
-        seq_group.metrics.first_scheduled_time is not None
-        and seq_group.metrics.first_token_time is not None
-    ):
+    if (seq_group.metrics.first_scheduled_time is not None and
+            seq_group.metrics.first_token_time is not None):
         time_queue_requests.append(
             seq_group.metrics.first_scheduled_time -
-            seq_group.metrics.arrival_time
-        )
+            seq_group.metrics.arrival_time)
     ...
     if seq_group.metrics.time_in_queue is not None:
-        time_in_queue_requests.append(seq_group.metrics.time_in_queue)
+        time_in_queue_requests.append(
+            seq_group.metrics.time_in_queue)
 ```
 
 This seems duplicative, and one of them should be removed. The latter
 is used by the Grafana dashboard, so we should deprecate or remove the
-former from v0.
+former.
 
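To spell out the duplication, both measurements reduce to the same interval. Here is a simplified sketch with hypothetical values, not the actual vLLM classes:

```python
# Hypothetical, simplified stand-in for the request metrics referenced in the
# snippet above; the real object carries many more fields.
from dataclasses import dataclass


@dataclass
class RequestTimes:
    arrival_time: float          # when the request arrived
    first_scheduled_time: float  # when the scheduler first picked it up
    time_in_queue: float         # recorded separately at scheduling time


req = RequestTimes(arrival_time=100.0, first_scheduled_time=100.25, time_in_queue=0.25)

# Both expressions measure the same queueing interval, which is why one of
# the two histograms is redundant.
assert req.first_scheduled_time - req.arrival_time == req.time_in_queue
```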

 ### Prefix Cache Hit Rate
 
@@ -502,7 +488,7 @@ See above - we now expose 'queries' and 'hits' counters rather than a
 
 ### KV Cache Offloading
 
-Two v0 metrics relate to a "swapped" preemption mode that is no
+Two legacy metrics relate to a "swapped" preemption mode that is no
 longer relevant in v1:
 
 - `vllm:num_requests_swapped`
@@ -513,7 +499,7 @@ cache to complete other requests), we swap kv cache blocks out to CPU
 memory. This is also known as "KV cache offloading" and is configured
 with `--swap-space` and `--preemption-mode`.
 
-In v0, [vLLM has long supported beam search](https://github.com/vllm-project/vllm/issues/6226). The
+Historically, [vLLM has long supported beam search](https://github.com/vllm-project/vllm/issues/6226). The
 SequenceGroup encapsulated the idea of N Sequences which
 all shared the same prompt kv blocks. This enabled KV cache block
 sharing between requests, and copy-on-write to do branching. CPU
@@ -526,7 +512,7 @@ and the part of the prompt that was evicted can be recomputed.
 
 SequenceGroup was removed in V1, although a replacement will be
 required for "parallel sampling" (`n>1`).
-[Beam search was moved out of the core (in V0)](https://github.com/vllm-project/vllm/issues/8306). There was a
+[Beam search was moved out of the core](https://github.com/vllm-project/vllm/issues/8306). There was a
 lot of complex code for a very uncommon feature.
 
 In V1, with prefix caching being better (zero overhead) and therefore
@@ -537,7 +523,7 @@ better.
 
 ### Parallel Sampling
 
-Some v0 metrics are only relevant in the context of "parallel
+Some legacy metrics are only relevant in the context of "parallel
 sampling". This is where the `n` parameter in a request is used to
 request multiple completions from the same prompt.
 
@@ -556,7 +542,7 @@ also add these metrics.
 
 ### Speculative Decoding
 
-Some v0 metrics are specific to "speculative decoding". This is where
+Some legacy metrics are specific to "speculative decoding". This is where
 we generate candidate tokens using a faster, approximate method or
 model and then validate those tokens with the larger model.
 
@@ -568,7 +554,7 @@ model and then validate those tokens with the larger model.
 
 There is a PR under review (<https://github.com/vllm-project/vllm/pull/12193>) to add "prompt lookup (ngram)"
 speculative decoding to v1. Other techniques will follow. We should
-revisit the v0 metrics in this context.
+revisit these metrics in this context.
 
 !!! note
 We should probably expose acceptance rate as separate accepted
@@ -641,7 +627,7 @@ metrics are often relatively straightforward to add:
 metrics are usually of very limited use unless they can be enabled
 by default and in production.
 3. They have an impact on development and maintenance of the
-project. Every metric added to v0 has made this v1 effort more
+project. Every metric added over time has made this effort more
 time-consuming, and perhaps not all metrics justify this ongoing
 investment in their maintenance.
 
@@ -652,24 +638,24 @@ performance and health. Tracing, on the other hand, tracks individual
 requests as they move through different services and components. Both
 fall under the more general heading of "Observability".
 
-v0 has support for OpenTelemetry tracing:
+vLLM has support for OpenTelemetry tracing:
 
-- Added by <https://github.com/vllm-project/vllm/pull/4687>
+- Added by <https://github.com/vllm-project/vllm/pull/4687> and reinstated by <https://github.com/vllm-project/vllm/pull/20372>
 - Configured with `--otlp-traces-endpoint` and `--collect-detailed-traces`
 - [OpenTelemetry blog post](https://opentelemetry.io/blog/2024/llm-observability/)
 - [User-facing docs](../examples/online_serving/opentelemetry.md)
 - [Blog post](https://medium.com/@ronen.schaffer/follow-the-trail-supercharging-vllm-with-opentelemetry-distributed-tracing-aa655229b46f)
 - [IBM product docs](https://www.ibm.com/docs/en/instana-observability/current?topic=mgaa-monitoring-large-language-models-llms-vllm-public-preview)
-
+
 OpenTelemetry has a
 [Gen AI Working Group](https://github.com/open-telemetry/community/blob/main/projects/gen-ai.md).
 
-Since metrics is a big enough topic on its own, we are going to tackle
-the topic of tracing in v1 separately.
+Since metrics is a big enough topic on its own, we consider the topic
+of tracing to be quite separate from metrics.
 
 ### OpenTelemetry Model Forward vs Execute Time
 
-In v0, we have the following two metrics:
+The current implementation exposes the following two metrics:
 
 - `vllm:model_forward_time_milliseconds` (Histogram) - The time spent
 in the model forward pass when this request was in the batch.