
Commit 000ec03

markmc, simon-mo, and atalhens authored and committed
[docs] Update v1 metrics design doc (vllm-project#27332)
Signed-off-by: Simon Mo <[email protected]>
Signed-off-by: Mark McLoughlin <[email protected]>
Signed-off-by: atalhens <[email protected]>
Co-authored-by: Simon Mo <[email protected]>
Co-authored-by: atalhens <[email protected]>
Signed-off-by: Alberto Perdomo <[email protected]>
1 parent f1a1b90 commit 000ec03

File tree: 1 file changed, +69 -83 lines changed


docs/design/metrics.md

Lines changed: 69 additions & 83 deletions
@@ -1,12 +1,12 @@
 # Metrics
 
-Ensure the v1 LLM Engine exposes a superset of the metrics available in v0.
+vLLM exposes a rich set of metrics to support observability and capacity planning for the V1 engine.
 
 ## Objectives
 
-- Achieve parity of metrics between v0 and v1.
-- The priority use case is accessing these metrics via Prometheus, as this is what we expect to be used in production environments.
-- Logging support (i.e. printing metrics to the info log) is provided for more ad-hoc testing, debugging, development, and exploratory use cases.
+- Provide comprehensive coverage of engine and request level metrics to aid production monitoring.
+- Prioritize Prometheus integrations, as this is what we expect to be used in production environments.
+- Offer logging support (i.e. printing metrics to the info log) for ad-hoc testing, debugging, development, and exploratory use cases.
 
 ## Background
 
@@ -17,45 +17,36 @@ Metrics in vLLM can be categorized as follows:
 
 The mental model is that server-level metrics help explain the values of request-level metrics.
 
-### v0 Metrics
-
-In v0, the following metrics are exposed via a Prometheus-compatible `/metrics` endpoint using the `vllm:` prefix:
-
-- `vllm:num_requests_running` (Gauge)
-- `vllm:num_requests_swapped` (Gauge)
-- `vllm:num_requests_waiting` (Gauge)
-- `vllm:gpu_cache_usage_perc` (Gauge)
-- `vllm:cpu_cache_usage_perc` (Gauge)
-- `vllm:gpu_prefix_cache_hit_rate` (Gauge)
-- `vllm:cpu_prefix_cache_hit_rate` (Gauge)
-- `vllm:prompt_tokens_total` (Counter)
-- `vllm:generation_tokens_total` (Counter)
-- `vllm:request_success_total` (Counter)
-- `vllm:request_prompt_tokens` (Histogram)
-- `vllm:request_generation_tokens` (Histogram)
-- `vllm:time_to_first_token_seconds` (Histogram)
-- `vllm:time_per_output_token_seconds` (Histogram)
-- `vllm:e2e_request_latency_seconds` (Histogram)
-- `vllm:request_queue_time_seconds` (Histogram)
-- `vllm:request_inference_time_seconds` (Histogram)
-- `vllm:request_prefill_time_seconds` (Histogram)
-- `vllm:request_decode_time_seconds` (Histogram)
-- `vllm:request_max_num_generation_tokens` (Histogram)
-- `vllm:num_preemptions_total` (Counter)
-- `vllm:cache_config_info` (Gauge)
-- `vllm:lora_requests_info` (Gauge)
-- `vllm:tokens_total` (Counter)
-- `vllm:iteration_tokens_total` (Histogram)
-- `vllm:time_in_queue_requests` (Histogram)
-- `vllm:model_forward_time_milliseconds` (Histogram)
-- `vllm:model_execute_time_milliseconds` (Histogram)
-- `vllm:request_params_n` (Histogram)
-- `vllm:request_params_max_tokens` (Histogram)
-- `vllm:spec_decode_draft_acceptance_rate` (Gauge)
-- `vllm:spec_decode_efficiency` (Gauge)
-- `vllm:spec_decode_num_accepted_tokens_total` (Counter)
-- `vllm:spec_decode_num_draft_tokens_total` (Counter)
-- `vllm:spec_decode_num_emitted_tokens_total` (Counter)
+### Metrics Overview
+
+### v1 Metrics
+
+In v1, the following metrics are exposed via a Prometheus-compatible `/metrics` endpoint using the `vllm:` prefix:
+
+- `vllm:num_requests_running` (Gauge) - Number of requests currently running.
+- `vllm:num_requests_waiting` (Gauge) - Number of requests currently waiting.
+- `vllm:kv_cache_usage_perc` (Gauge) - Fraction of used KV cache blocks (0–1).
+- `vllm:prefix_cache_queries` (Counter) - Number of prefix cache queries.
+- `vllm:prefix_cache_hits` (Counter) - Number of prefix cache hits.
+- `vllm:mm_cache_queries` (Counter) - (For multimodal models) Number of multimodal cache queries.
+- `vllm:mm_cache_hits` (Counter) - (For multimodal models) Number of multimodal cache hits.
+- `vllm:num_preemptions_total` (Counter) - Number of preemptions.
+- `vllm:prompt_tokens_total` (Counter) - Total number of prompt tokens processed.
+- `vllm:generation_tokens_total` (Counter) - Total number of generated tokens.
+- `vllm:iteration_tokens_total` (Histogram) - Histogram of tokens processed in each engine step.
+- `vllm:cache_config_info` (Gauge) - Information about the cache configuration.
+- `vllm:request_success_total` (Counter) - Number of finished requests (by finish reason).
+- `vllm:request_prompt_tokens` (Histogram) - Histogram of input prompt token counts.
+- `vllm:request_generation_tokens` (Histogram) - Histogram of generation token counts.
+- `vllm:request_params_n` (Histogram) - Histogram of request parameter n.
+- `vllm:request_params_max_tokens` (Histogram) - Histogram of max_tokens parameter in requests.
+- `vllm:time_to_first_token_seconds` (Histogram) - Time to first token (TTFT).
+- `vllm:inter_token_latency_seconds` (Histogram) - Inter-token latency.
+- `vllm:e2e_request_latency_seconds` (Histogram) - End-to-end request latency.
+- `vllm:request_queue_time_seconds` (Histogram) - Time spent in the queue.
+- `vllm:request_inference_time_seconds` (Histogram) - Request inference time.
+- `vllm:request_prefill_time_seconds` (Histogram) - Request prefill time.
+- `vllm:request_decode_time_seconds` (Histogram) - Request decode time.
 
 These are documented under [Inferencing and Serving -> Production Metrics](../usage/metrics.md).
 
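For example, the prefix cache counters above can be combined into a hit rate at scrape time. The sketch below is illustrative only and assumes a vLLM server listening on `localhost:8000`:

```python
# Illustrative sketch only (not part of vLLM): scrape /metrics from a server
# assumed to be running on localhost:8000 and derive a prefix cache hit rate
# from the vllm:prefix_cache_queries / vllm:prefix_cache_hits counters.
import urllib.request


def counter_total(text: str, metric: str) -> float:
    """Sum every sample of a counter, with or without the `_total` suffix."""
    total = 0.0
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name in (metric, metric + "_total"):
            total += float(line.rsplit(" ", 1)[-1])
    return total


metrics_text = urllib.request.urlopen("http://localhost:8000/metrics").read().decode()
queries = counter_total(metrics_text, "vllm:prefix_cache_queries")
hits = counter_total(metrics_text, "vllm:prefix_cache_hits")
if queries:
    print(f"prefix cache hit rate: {hits / queries:.2%}")
```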

@@ -86,7 +77,7 @@ See [the PR which added this Dashboard](https://github.com/vllm-project/vllm/pul
 
 Prometheus support was initially added [using the aioprometheus library](https://github.com/vllm-project/vllm/pull/1890), but a switch was made quickly to [prometheus_client](https://github.com/vllm-project/vllm/pull/2730). The rationale is discussed in both linked PRs.
 
-With the switch to `aioprometheus`, we lost a `MetricsMiddleware` to track HTTP metrics, but this was reinstated [using prometheus_fastapi_instrumentator](https://github.com/vllm-project/vllm/pull/15657):
+During those migrations we briefly lost a `MetricsMiddleware` to track HTTP metrics, but this was reinstated [using prometheus_fastapi_instrumentator](https://github.com/vllm-project/vllm/pull/15657):
 
 ```bash
 $ curl http://0.0.0.0:8000/metrics 2>/dev/null | grep -P '^http_(?!.*(_bucket|_created|_sum)).*'
@@ -99,7 +90,9 @@ http_request_duration_seconds_count{handler="/v1/completions",method="POST"} 201
 
 ### Multi-process Mode
 
-In v0, metrics are collected in the engine core process and we use multiprocess mode to make them available in the API server process. See <https://github.com/vllm-project/vllm/pull/7279>.
+Historically, metrics were collected in the engine core process and multiprocess mode was used to make them available in the API server process. See <https://github.com/vllm-project/vllm/pull/7279>.
+
+More recently, metrics are collected in the API server process and multiprocess mode is only used when `--api-server-count > 1`. See <https://github.com/vllm-project/vllm/pull/17546> and details on [API server scale-out](../serving/data_parallel_deployment.md#internal-load-balancing).
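As background on the mechanism referenced here, a minimal sketch of `prometheus_client` multiprocess aggregation is shown below; it is illustrative only, the directory path is an assumption, and it is not vLLM's actual wiring:

```python
# Illustrative sketch of prometheus_client's multiprocess mode. Worker
# processes write per-process sample files into the directory named by
# PROMETHEUS_MULTIPROC_DIR; the collector merges them when /metrics is scraped.
import os

os.environ.setdefault("PROMETHEUS_MULTIPROC_DIR", "/tmp/prometheus_multiproc")
os.makedirs(os.environ["PROMETHEUS_MULTIPROC_DIR"], exist_ok=True)

from prometheus_client import CollectorRegistry, generate_latest, multiprocess

registry = CollectorRegistry()
multiprocess.MultiProcessCollector(registry)  # merges the per-process files
print(generate_latest(registry).decode())
```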
 
 ### Built in Python/Process Metrics
 

@@ -116,29 +109,25 @@ The following metrics are supported by default by `prometheus_client`, but they
 - `process_open_fds`
 - `process_max_fds`
 
-This is relevant because if we move away from multiprocess mode in v1,
-we get these back. However, it's questionable how relevant these are
-if they don't aggregate these stats for all processes that make up a
-vLLM instance.
+Therefore, these metrics are unavailable when `--api-server-count > 1`. It's questionable how relevant these are since they do not aggregate these stats for all processes that make up a vLLM instance.
+
+## Metrics Design
 
-### v0 PRs and Issues
+The ["Even Better Observability"](https://github.com/vllm-project/vllm/issues/3616) feature was where much of the metrics design was planned. For example, see where [a detailed roadmap was laid out](https://github.com/vllm-project/vllm/issues/3616#issuecomment-2030858781).
 
-For background, these are some of the relevant PRs which added the v0 metrics:
+### Legacy PRs
+
+To help understand the background to the metrics design, here are some of the relevant PRs which added the original, now legacy, metrics:
 
 - <https://github.com/vllm-project/vllm/pull/1890>
 - <https://github.com/vllm-project/vllm/pull/2316>
 - <https://github.com/vllm-project/vllm/pull/2730>
 - <https://github.com/vllm-project/vllm/pull/4464>
 - <https://github.com/vllm-project/vllm/pull/7279>
 
-Also note the ["Even Better Observability"](https://github.com/vllm-project/vllm/issues/3616) feature where e.g. [a detailed roadmap was laid out](https://github.com/vllm-project/vllm/issues/3616#issuecomment-2030858781).
-
-## v1 Design
+### Metrics Implementation PRs
 
-### v1 PRs
-
-For background, here are the relevant v1 PRs relating to the v1
-metrics issue <https://github.com/vllm-project/vllm/issues/10582>:
+For background, here are the relevant PRs relating to the metrics implementation <https://github.com/vllm-project/vllm/issues/10582>:
 
 - <https://github.com/vllm-project/vllm/pull/11962>
 - <https://github.com/vllm-project/vllm/pull/11973>
@@ -369,7 +358,7 @@ vllm:cache_config_info{block_size="16",cache_dtype="auto",calculate_kv_scales="F
 
 However, `prometheus_client` has
 [never supported Info metrics in multiprocessing mode](https://github.com/prometheus/client_python/pull/300) -
-for [unclear reasons](https://github.com/vllm-project/vllm/pull/7279#discussion_r1710417152). We
+for [unclear reasons](gh-pr:7279#discussion_r1710417152). We
 simply use a `Gauge` metric set to 1 and
 `multiprocess_mode="mostrecent"` instead.
 
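A sketch of that Gauge-as-Info workaround using `prometheus_client` directly; the label names and values shown are an illustrative subset, not the exact set vLLM exports:

```python
# Illustrative sketch of the Gauge-as-Info pattern described above: the value
# is fixed at 1 and the configuration is carried in the labels.
from prometheus_client import Gauge

cache_config_info = Gauge(
    "vllm:cache_config_info",
    "Information about the cache configuration",
    labelnames=["block_size", "cache_dtype"],  # illustrative subset of labels
    multiprocess_mode="mostrecent",  # keep only the most recently written sample
)
cache_config_info.labels(block_size="16", cache_dtype="auto").set(1)
```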

@@ -396,9 +385,8 @@ recent metric is used, but only from currently running processes.
 
 This was added in <https://github.com/vllm-project/vllm/pull/9477> and there is
 [at least one known user](https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/54).
-If we revisit this design and deprecate the old metric, we should reduce
-the need for a significant deprecation period by making the change in
-v0 also and asking this project to move to the new metric.
+If we revisit this design and deprecate the old metric, we should
+coordinate with downstream users so they can migrate before the removal.
 
 ### Prefix Cache metrics
 
@@ -478,22 +466,20 @@ us with:
 
 ```python
 if seq_group.is_finished():
-    if (
-        seq_group.metrics.first_scheduled_time is not None
-        and seq_group.metrics.first_token_time is not None
-    ):
+    if (seq_group.metrics.first_scheduled_time is not None and
+            seq_group.metrics.first_token_time is not None):
         time_queue_requests.append(
             seq_group.metrics.first_scheduled_time -
-            seq_group.metrics.arrival_time
-        )
+            seq_group.metrics.arrival_time)
     ...
     if seq_group.metrics.time_in_queue is not None:
-        time_in_queue_requests.append(seq_group.metrics.time_in_queue)
+        time_in_queue_requests.append(
+            seq_group.metrics.time_in_queue)
 ```
 
 This seems duplicative, and one of them should be removed. The latter
 is used by the Grafana dashboard, so we should deprecate or remove the
-former from v0.
+former.
 
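To spell out the duplication, both measurements reduce to the same interval. Here is a simplified sketch with hypothetical values, not the actual vLLM classes:

```python
# Hypothetical, simplified stand-in for the request metrics referenced in the
# snippet above; the real object carries many more fields.
from dataclasses import dataclass


@dataclass
class RequestTimes:
    arrival_time: float          # when the request arrived
    first_scheduled_time: float  # when the scheduler first picked it up
    time_in_queue: float         # recorded separately at scheduling time


req = RequestTimes(arrival_time=100.0, first_scheduled_time=100.25, time_in_queue=0.25)

# Both expressions measure the same queueing interval, which is why one of
# the two histograms is redundant.
assert req.first_scheduled_time - req.arrival_time == req.time_in_queue
```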

 ### Prefix Cache Hit Rate
 
@@ -502,7 +488,7 @@ See above - we now expose 'queries' and 'hits' counters rather than a
 
 ### KV Cache Offloading
 
-Two v0 metrics relate to a "swapped" preemption mode that is no
+Two legacy metrics relate to a "swapped" preemption mode that is no
 longer relevant in v1:
 
 - `vllm:num_requests_swapped`
@@ -513,7 +499,7 @@ cache to complete other requests), we swap kv cache blocks out to CPU
 memory. This is also known as "KV cache offloading" and is configured
 with `--swap-space` and `--preemption-mode`.
 
-In v0, [vLLM has long supported beam search](https://github.com/vllm-project/vllm/issues/6226). The
+Historically, [vLLM has long supported beam search](https://github.com/vllm-project/vllm/issues/6226). The
 SequenceGroup encapsulated the idea of N Sequences which
 all shared the same prompt kv blocks. This enabled KV cache block
 sharing between requests, and copy-on-write to do branching. CPU
@@ -526,7 +512,7 @@ and the part of the prompt that was evicted can be recomputed.
 
 SequenceGroup was removed in V1, although a replacement will be
 required for "parallel sampling" (`n>1`).
-[Beam search was moved out of the core (in V0)](https://github.com/vllm-project/vllm/issues/8306). There was a
+[Beam search was moved out of the core](https://github.com/vllm-project/vllm/issues/8306). There was a
 lot of complex code for a very uncommon feature.
 
 In V1, with prefix caching being better (zero overhead) and therefore
@@ -537,7 +523,7 @@ better.
 
 ### Parallel Sampling
 
-Some v0 metrics are only relevant in the context of "parallel
+Some legacy metrics are only relevant in the context of "parallel
 sampling". This is where the `n` parameter in a request is used to
 request multiple completions from the same prompt.
 
@@ -556,7 +542,7 @@ also add these metrics.
 
 ### Speculative Decoding
 
-Some v0 metrics are specific to "speculative decoding". This is where
+Some legacy metrics are specific to "speculative decoding". This is where
 we generate candidate tokens using a faster, approximate method or
 model and then validate those tokens with the larger model.
 
@@ -568,7 +554,7 @@ model and then validate those tokens with the larger model.
 
 There is a PR under review (<https://github.com/vllm-project/vllm/pull/12193>) to add "prompt lookup (ngram)"
 speculative decoding to v1. Other techniques will follow. We should
-revisit the v0 metrics in this context.
+revisit these metrics in this context.
 
 !!! note
 We should probably expose acceptance rate as separate accepted
@@ -641,7 +627,7 @@ metrics are often relatively straightforward to add:
 metrics are usually of very limited use unless they can be enabled
 by default and in production.
 3. They have an impact on development and maintenance of the
-project. Every metric added to v0 has made this v1 effort more
+project. Every metric added over time has made this effort more
 time-consuming, and perhaps not all metrics justify this ongoing
 investment in their maintenance.
 
@@ -652,24 +638,24 @@ performance and health. Tracing, on the other hand, tracks individual
 requests as they move through different services and components. Both
 fall under the more general heading of "Observability".
 
-v0 has support for OpenTelemetry tracing:
+vLLM has support for OpenTelemetry tracing:
 
-- Added by <https://github.com/vllm-project/vllm/pull/4687>
+- Added by <https://github.com/vllm-project/vllm/pull/4687> and reinstated by <https://github.com/vllm-project/vllm/pull/20372>
 - Configured with `--otlp-traces-endpoint` and `--collect-detailed-traces`
 - [OpenTelemetry blog post](https://opentelemetry.io/blog/2024/llm-observability/)
 - [User-facing docs](../examples/online_serving/opentelemetry.md)
 - [Blog post](https://medium.com/@ronen.schaffer/follow-the-trail-supercharging-vllm-with-opentelemetry-distributed-tracing-aa655229b46f)
 - [IBM product docs](https://www.ibm.com/docs/en/instana-observability/current?topic=mgaa-monitoring-large-language-models-llms-vllm-public-preview)
-
+
 OpenTelemetry has a
 [Gen AI Working Group](https://github.com/open-telemetry/community/blob/main/projects/gen-ai.md).
 
-Since metrics is a big enough topic on its own, we are going to tackle
-the topic of tracing in v1 separately.
+Since metrics is a big enough topic on its own, we consider the topic
+of tracing to be quite separate from metrics.
 
 ### OpenTelemetry Model Forward vs Execute Time
 
-In v0, we have the following two metrics:
+The current implementation exposes the following two metrics:
 
 - `vllm:model_forward_time_milliseconds` (Histogram) - The time spent
 in the model forward pass when this request was in the batch.