11# Metrics
22
3- Ensure the v1 LLM Engine exposes a superset of the metrics available in v0 .
3+ vLLM exposes a rich set of metrics to support observability and capacity planning for the V1 engine.
44
55## Objectives
66
7- - Achieve parity of metrics between v0 and v1 .
8- - The priority use case is accessing these metrics via Prometheus, as this is what we expect to be used in production environments.
9- - Logging support (i.e. printing metrics to the info log) is provided for more ad-hoc testing, debugging, development, and exploratory use cases.
7+ - Provide comprehensive coverage of engine and request-level metrics to aid production monitoring.
8+ - Prioritize Prometheus integrations, as this is what we expect to be used in production environments.
9+ - Offer logging support (i.e. printing metrics to the info log) for ad-hoc testing, debugging, development, and exploratory use cases.
1010
1111## Background
1212
@@ -17,45 +17,36 @@ Metrics in vLLM can be categorized as follows:
1717
1818The mental model is that server-level metrics help explain the values of request-level metrics.
1919
20- ### v0 Metrics
21-
22- In v0, the following metrics are exposed via a Prometheus-compatible ` /metrics ` endpoint using the ` vllm: ` prefix:
23-
24- - ` vllm:num_requests_running ` (Gauge)
25- - ` vllm:num_requests_swapped ` (Gauge)
26- - ` vllm:num_requests_waiting ` (Gauge)
27- - ` vllm:gpu_cache_usage_perc ` (Gauge)
28- - ` vllm:cpu_cache_usage_perc ` (Gauge)
29- - ` vllm:gpu_prefix_cache_hit_rate ` (Gauge)
30- - ` vllm:cpu_prefix_cache_hit_rate ` (Gauge)
31- - ` vllm:prompt_tokens_total ` (Counter)
32- - ` vllm:generation_tokens_total ` (Counter)
33- - ` vllm:request_success_total ` (Counter)
34- - ` vllm:request_prompt_tokens ` (Histogram)
35- - ` vllm:request_generation_tokens ` (Histogram)
36- - ` vllm:time_to_first_token_seconds ` (Histogram)
37- - ` vllm:time_per_output_token_seconds ` (Histogram)
38- - ` vllm:e2e_request_latency_seconds ` (Histogram)
39- - ` vllm:request_queue_time_seconds ` (Histogram)
40- - ` vllm:request_inference_time_seconds ` (Histogram)
41- - ` vllm:request_prefill_time_seconds ` (Histogram)
42- - ` vllm:request_decode_time_seconds ` (Histogram)
43- - ` vllm:request_max_num_generation_tokens ` (Histogram)
44- - ` vllm:num_preemptions_total ` (Counter)
45- - ` vllm:cache_config_info ` (Gauge)
46- - ` vllm:lora_requests_info ` (Gauge)
47- - ` vllm:tokens_total ` (Counter)
48- - ` vllm:iteration_tokens_total ` (Histogram)
49- - ` vllm:time_in_queue_requests ` (Histogram)
50- - ` vllm:model_forward_time_milliseconds ` (Histogram)
51- - ` vllm:model_execute_time_milliseconds ` (Histogram)
52- - ` vllm:request_params_n ` (Histogram)
53- - ` vllm:request_params_max_tokens ` (Histogram)
54- - ` vllm:spec_decode_draft_acceptance_rate ` (Gauge)
55- - ` vllm:spec_decode_efficiency ` (Gauge)
56- - ` vllm:spec_decode_num_accepted_tokens_total ` (Counter)
57- - ` vllm:spec_decode_num_draft_tokens_total ` (Counter)
58- - ` vllm:spec_decode_num_emitted_tokens_total ` (Counter)
20+ ### v1 Metrics
23+
24+ In v1, the following metrics are exposed via a Prometheus-compatible ` /metrics ` endpoint using the ` vllm: ` prefix:
25+
26+ - ` vllm:num_requests_running ` (Gauge) - Number of requests currently running.
27+ - ` vllm:num_requests_waiting ` (Gauge) - Number of requests currently waiting.
28+ - ` vllm:kv_cache_usage_perc ` (Gauge) - Fraction of used KV cache blocks (0–1).
29+ - ` vllm:prefix_cache_queries ` (Counter) - Number of prefix cache queries.
30+ - ` vllm:prefix_cache_hits ` (Counter) - Number of prefix cache hits.
31+ - ` vllm:mm_cache_queries ` (Counter) - (For multimodal models) Number of multimodal cache queries.
32+ - ` vllm:mm_cache_hits ` (Counter) - (For multimodal models) Number of multimodal cache hits.
33+ - ` vllm:num_preemptions_total ` (Counter) - Number of preemptions.
34+ - ` vllm:prompt_tokens_total ` (Counter) - Total number of prompt tokens processed.
35+ - ` vllm:generation_tokens_total ` (Counter) - Total number of generated tokens.
36+ - ` vllm:iteration_tokens_total ` (Histogram) - Histogram of tokens processed in each engine step.
37+ - ` vllm:cache_config_info ` (Gauge) - Information about the cache configuration.
38+ - ` vllm:request_success_total ` (Counter) - Number of finished requests (by finish reason).
39+ - ` vllm:request_prompt_tokens ` (Histogram) - Histogram of input prompt token counts.
40+ - ` vllm:request_generation_tokens ` (Histogram) - Histogram of generation token counts.
41+ - ` vllm:request_params_n ` (Histogram) - Histogram of request parameter n.
42+ - ` vllm:request_params_max_tokens ` (Histogram) - Histogram of max_tokens parameter in requests.
43+ - ` vllm:time_to_first_token_seconds ` (Histogram) - Time to first token (TTFT).
44+ - ` vllm:inter_token_latency_seconds ` (Histogram) - Inter-token latency.
45+ - ` vllm:e2e_request_latency_seconds ` (Histogram) - End-to-end request latency.
46+ - ` vllm:request_queue_time_seconds ` (Histogram) - Time spent in the queue.
47+ - ` vllm:request_inference_time_seconds ` (Histogram) - Request inference time.
48+ - ` vllm:request_prefill_time_seconds ` (Histogram) - Request prefill time.
49+ - ` vllm:request_decode_time_seconds ` (Histogram) - Request decode time.
5950
6051These are documented under [ Inferencing and Serving -> Production Metrics] ( ../usage/metrics.md ) .
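
As a quick way to explore the metrics above, the `/metrics` endpoint can be scraped and parsed programmatically. The following is a minimal sketch (not part of vLLM) that assumes a vLLM server is listening on `localhost:8000` and uses the text parser shipped with `prometheus_client`:

```python
# Minimal sketch (assumes a vLLM server on localhost:8000): scrape /metrics
# and print every vllm:* sample with its labels and value.
from urllib.request import urlopen

from prometheus_client.parser import text_string_to_metric_families

raw = urlopen("http://localhost:8000/metrics").read().decode("utf-8")

for family in text_string_to_metric_families(raw):
    if not family.name.startswith("vllm:"):
        continue  # skip http_*, python_*, process_* metrics
    for sample in family.samples:
        print(sample.name, dict(sample.labels), sample.value)
```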
6152
@@ -86,7 +77,7 @@ See [the PR which added this Dashboard](https://github.com/vllm-project/vllm/pul
8677
8778Prometheus support was initially added [ using the aioprometheus library] ( https://github.com/vllm-project/vllm/pull/1890 ) , but a switch was made quickly to [ prometheus_client] ( https://github.com/vllm-project/vllm/pull/2730 ) . The rationale is discussed in both linked PRs.
8879
89- With the switch to ` aioprometheus ` , we lost a ` MetricsMiddleware ` to track HTTP metrics, but this was reinstated [ using prometheus_fastapi_instrumentator] ( https://github.com/vllm-project/vllm/pull/15657 ) :
80+ During those migrations we lost a ` MetricsMiddleware ` to track HTTP metrics, but this was reinstated [ using prometheus_fastapi_instrumentator] ( https://github.com/vllm-project/vllm/pull/15657 ) :
9081
9182``` bash
9283$ curl http://0.0.0.0:8000/metrics 2> /dev/null | grep -P '^http_(?!.*(_bucket|_created|_sum)).*'
@@ -99,7 +90,9 @@ http_request_duration_seconds_count{handler="/v1/completions",method="POST"} 201
9990
10091### Multi-process Mode
10192
102- In v0, metrics are collected in the engine core process and we use multiprocess mode to make them available in the API server process. See < https://github.com/vllm-project/vllm/pull/7279 > .
93+ Historically, metrics were collected in the engine core process and multiprocess mode was used to make them available in the API server process. See < https://github.com/vllm-project/vllm/pull/7279 > .
94+
95+ More recently, metrics are collected in the API server process and multiprocess mode is only used when ` --api-server-count > 1 ` . See < https://github.com/vllm-project/vllm/pull/17546 > and details on [ API server scale-out] ( ../serving/data_parallel_deployment.md#internal-load-balancing ) .
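
For reference, this is roughly how `prometheus_client`'s multiprocess mode is wired up in general: samples are written to files in a shared directory and aggregated at scrape time. This is only a sketch under assumed paths, not vLLM's exact code:

```python
# Sketch of prometheus_client multiprocess mode (not vLLM's exact code).
# PROMETHEUS_MULTIPROC_DIR must point at a writable directory shared by all
# worker processes; the path used here is an arbitrary assumption.
import os

os.environ.setdefault("PROMETHEUS_MULTIPROC_DIR", "/tmp/prometheus_multiproc")
os.makedirs(os.environ["PROMETHEUS_MULTIPROC_DIR"], exist_ok=True)

from prometheus_client import CollectorRegistry, generate_latest, multiprocess

# Aggregate samples written by all processes into a single registry at scrape time.
registry = CollectorRegistry()
multiprocess.MultiProcessCollector(registry)

print(generate_latest(registry).decode("utf-8"))
```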
10396
10497### Built in Python/Process Metrics
10598
@@ -116,29 +109,25 @@ The following metrics are supported by default by `prometheus_client`, but they
116109- ` process_open_fds `
117110- ` process_max_fds `
118111
119- This is relevant because if we move away from multiprocess mode in v1,
120- we get these back. However, it's questionable how relevant these are
121- if they don't aggregate these stats for all processes that make up a
122- vLLM instance.
112+ Therefore, these metrics are unavailable when ` --api-server-count > 1 ` . It's questionable how relevant these are since they do not aggregate these stats for all processes that make up a vLLM instance.
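
To illustrate why they do not aggregate: these metrics come from `prometheus_client`'s built-in per-process collectors, which only report on the process they run in. A small sketch (illustrative only):

```python
# Sketch: the process_* metrics come from prometheus_client's ProcessCollector,
# which reads /proc for the *current* process only (Linux), so values from
# multiple API server processes are never summed together.
from prometheus_client import CollectorRegistry, ProcessCollector, generate_latest

registry = CollectorRegistry()
ProcessCollector(registry=registry)  # process_cpu_seconds_total, process_open_fds, ...

print(generate_latest(registry).decode("utf-8"))
```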
113+
114+ ## Metrics Design
123115
124- ### v0 PRs and Issues
116+ Much of the metrics design was planned in the [ "Even Better Observability" ] ( https://github.com/vllm-project/vllm/issues/3616 ) feature issue, where, for example, [ a detailed roadmap was laid out ] ( https://github.com/vllm-project/vllm/issues/3616#issuecomment-2030858781 ) .
125117
126- For background, these are some of the relevant PRs which added the v0 metrics:
118+ ### Legacy PRs
119+
120+ To help understand the background to the metrics design, here are some of the relevant PRs which added the original, now legacy, metrics:
127121
128122- < https://github.com/vllm-project/vllm/pull/1890 >
129123- < https://github.com/vllm-project/vllm/pull/2316 >
130124- < https://github.com/vllm-project/vllm/pull/2730 >
131125- < https://github.com/vllm-project/vllm/pull/4464 >
132126- < https://github.com/vllm-project/vllm/pull/7279 >
133127
134- Also note the [ "Even Better Observability"] ( https://github.com/vllm-project/vllm/issues/3616 ) feature where e.g. [ a detailed roadmap was laid out] ( https://github.com/vllm-project/vllm/issues/3616#issuecomment-2030858781 ) .
135-
136- ## v1 Design
128+ ### Metrics Implementation PRs
137129
138- ### v1 PRs
139-
140- For background, here are the relevant v1 PRs relating to the v1
141- metrics issue < https://github.com/vllm-project/vllm/issues/10582 > :
130+ For background, here are the relevant PRs relating to the metrics implementation issue < https://github.com/vllm-project/vllm/issues/10582 > :
142131
143132- < https://github.com/vllm-project/vllm/pull/11962 >
144133- < https://github.com/vllm-project/vllm/pull/11973 >
@@ -369,7 +358,7 @@ vllm:cache_config_info{block_size="16",cache_dtype="auto",calculate_kv_scales="F
369358
370359However, ` prometheus_client ` has
371360[ never supported Info metrics in multiprocessing mode] ( https://github.com/prometheus/client_python/pull/300 ) -
372- for [ unclear reasons] ( https://github.com/vllm-project/vllm/pull/ 7279#discussion_r1710417152) . We
361+ for [ unclear reasons] ( gh-pr:7279#discussion_r1710417152) . We
373362simply use a ` Gauge ` metric set to 1 and
374363` multiprocess_mode="mostrecent" ` instead.
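
A minimal sketch of that workaround (illustrative only; the label set is abbreviated, not vLLM's full configuration):

```python
# Sketch of the Info-metric workaround: a Gauge pinned to 1 whose labels carry
# the configuration. "mostrecent" requires a recent prometheus_client release.
from prometheus_client import Gauge

cache_config_info = Gauge(
    "vllm:cache_config_info",
    "Information about the cache configuration",
    labelnames=["block_size", "cache_dtype"],  # abbreviated label set
    multiprocess_mode="mostrecent",
)
cache_config_info.labels(block_size="16", cache_dtype="auto").set(1)
```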
375364
@@ -396,9 +385,8 @@ recent metric is used, but only from currently running processes.
396385
397386This was added in < https://github.com/vllm-project/vllm/pull/9477 > and there is
398387[ at least one known user] ( https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/54 ) .
399- If we revisit this design and deprecate the old metric, we should reduce
400- the need for a significant deprecation period by making the change in
401- v0 also and asking this project to move to the new metric.
388+ If we revisit this design and deprecate the old metric, we should
389+ coordinate with downstream users so they can migrate before the removal.
402390
403391### Prefix Cache metrics
404392
@@ -478,22 +466,20 @@ us with:
478466
479467``` python
480468if seq_group.is_finished():
481-    if (
482-        seq_group.metrics.first_scheduled_time is not None
483-        and seq_group.metrics.first_token_time is not None
484-    ):
469+    if (seq_group.metrics.first_scheduled_time is not None and
470+            seq_group.metrics.first_token_time is not None):
485471        time_queue_requests.append(
486472            seq_group.metrics.first_scheduled_time -
487-            seq_group.metrics.arrival_time
488-        )
473+            seq_group.metrics.arrival_time)
489474    ...
490475    if seq_group.metrics.time_in_queue is not None:
491-        time_in_queue_requests.append(seq_group.metrics.time_in_queue)
476+        time_in_queue_requests.append(
477+            seq_group.metrics.time_in_queue)
492478```
493479
494480This seems duplicative, and one of them should be removed. The latter
495481is used by the Grafana dashboard, so we should deprecate or remove the
496- former from v0 .
482+ former.
497483
498484### Prefix Cache Hit Rate
499485
@@ -502,7 +488,7 @@ See above - we now expose 'queries' and 'hits' counters rather than a
502488
503489### KV Cache Offloading
504490
505- Two v0 metrics relate to a "swapped" preemption mode that is no
491+ Two legacy metrics relate to a "swapped" preemption mode that is no
506492longer relevant in v1:
507493
508494- ` vllm:num_requests_swapped `
@@ -513,7 +499,7 @@ cache to complete other requests), we swap kv cache blocks out to CPU
513499memory. This is also known as "KV cache offloading" and is configured
514500with ` --swap-space ` and ` --preemption-mode ` .
515501
516- In v0 , [ vLLM has long supported beam search] ( https://github.com/vllm-project/vllm/issues/6226 ) . The
502+ Historically , [ vLLM has long supported beam search] ( https://github.com/vllm-project/vllm/issues/6226 ) . The
517503SequenceGroup encapsulated the idea of N Sequences which
518504all shared the same prompt kv blocks. This enabled KV cache block
519505sharing between requests, and copy-on-write to do branching. CPU
@@ -526,7 +512,7 @@ and the part of the prompt that was evicted can be recomputed.
526512
527513SequenceGroup was removed in V1, although a replacement will be
528514required for "parallel sampling" (` n>1 ` ).
529- [ Beam search was moved out of the core (in V0) ] ( https://github.com/vllm-project/vllm/issues/8306 ) . There was a
515+ [ Beam search was moved out of the core] ( https://github.com/vllm-project/vllm/issues/8306 ) . There was a
530516lot of complex code for a very uncommon feature.
531517
532518In V1, with prefix caching being better (zero overhead) and therefore
@@ -537,7 +523,7 @@ better.
537523
538524### Parallel Sampling
539525
540- Some v0 metrics are only relevant in the context of "parallel
526+ Some legacy metrics are only relevant in the context of "parallel
541527sampling". This is where the ` n ` parameter in a request is used to
542528request multiple completions from the same prompt.
543529
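For illustration, a parallel sampling request against the OpenAI-compatible server might look like the sketch below; the endpoint and model name are assumptions, not part of the original document:

```python
# Hypothetical example: request two completions (n=2) that share one prompt.
# Assumes a vLLM OpenAI-compatible server on localhost:8000 serving this model.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed model name
    prompt="The capital of France is",
    n=2,           # parallel sampling: two completions from the same prompt
    max_tokens=16,
)
for choice in completion.choices:
    print(choice.index, choice.text)
```
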
@@ -556,7 +542,7 @@ also add these metrics.
556542
557543### Speculative Decoding
558544
559- Some v0 metrics are specific to "speculative decoding". This is where
545+ Some legacy metrics are specific to "speculative decoding". This is where
560546we generate candidate tokens using a faster, approximate method or
561547model and then validate those tokens with the larger model.
562548
@@ -568,7 +554,7 @@ model and then validate those tokens with the larger model.
568554
569555There is a PR under review (< https://github.com/vllm-project/vllm/pull/12193 > ) to add "prompt lookup (ngram)"
570556speculative decoding to v1. Other techniques will follow. We should
571- revisit the v0 metrics in this context.
557+ revisit these metrics in this context.
572558
573559!!! note
574560 We should probably expose acceptance rate as separate accepted
@@ -641,7 +627,7 @@ metrics are often relatively straightforward to add:
641627 metrics are usually of very limited use unless they can be enabled
642628 by default and in production.
6436293 . They have an impact on development and maintenance of the
644- project. Every metric added to v0 has made this v1 effort more
630+ project. Every metric added over time has made this effort more
645631 time-consuming, and perhaps not all metrics justify this ongoing
646632 investment in their maintenance.
647633
@@ -652,24 +638,24 @@ performance and health. Tracing, on the other hand, tracks individual
652638requests as they move through different services and components. Both
653639fall under the more general heading of "Observability".
654640
655- v0 has support for OpenTelemetry tracing:
641+ vLLM has support for OpenTelemetry tracing:
656642
657- - Added by < https://github.com/vllm-project/vllm/pull/4687 >
643+ - Added by < https://github.com/vllm-project/vllm/pull/4687 > and reinstated by < https://github.com/vllm-project/vllm/pull/20372 >
658644- Configured with ` --otlp-traces-endpoint ` and ` --collect-detailed-traces `
659645- [ OpenTelemetry blog post] ( https://opentelemetry.io/blog/2024/llm-observability/ )
660646- [ User-facing docs] ( ../examples/online_serving/opentelemetry.md )
661647- [ Blog post] ( https://medium.com/@ronen.schaffer/follow-the-trail-supercharging-vllm-with-opentelemetry-distributed-tracing-aa655229b46f )
662648- [ IBM product docs] ( https://www.ibm.com/docs/en/instana-observability/current?topic=mgaa-monitoring-large-language-models-llms-vllm-public-preview )
663-
649+
664650OpenTelemetry has a
665651[ Gen AI Working Group] ( https://github.com/open-telemetry/community/blob/main/projects/gen-ai.md ) .
666652
667- Since metrics is a big enough topic on its own, we are going to tackle
668- the topic of tracing in v1 separately .
653+ Since metrics is a big enough topic on its own, we consider tracing
654+ to be a separate topic, to be tackled in its own right.
669655
670656### OpenTelemetry Model Forward vs Execute Time
671657
672- In v0, we have the following two metrics:
658+ The current implementation exposes the following two metrics:
673659
674660- ` vllm:model_forward_time_milliseconds ` (Histogram) - The time spent
675661 in the model forward pass when this request was in the batch.