
Conversation


@Chasingdreams6 Chasingdreams6 commented Aug 5, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

PR Description

Nowadays, cloud providers typically use a unified serving engine deployed on GPUs to serve all request types (text, image, file, agent calls, etc.) for better resource utilization. However, these workloads have different mean response times, which leads to different KVCache reuse intervals. For example, humans respond faster to image/audio content than to the complex text or file-analysis results generated by the LLM. Based on our analysis of real-world LLM traffic from the top cloud provider Aliyun Bailian, we found that a general-purpose KVCache eviction policy (such as LRU) may not be optimal.

This PR provides a new feature, the Workload-Aware KVCache policy (WA), which extends the `FreeKVCacheBlockQueue` data structure into `WorkloadAwareFreeKVCacheBlockQueue`. It leverages extra information (i.e., the workload type) attached to each KVCache block's request to perform better cache eviction than the default LRU policy used by `FreeKVCacheBlockQueue`.
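As a rough illustration of the idea (a simplified sketch, not the PR's actual `WorkloadAwareFreeKVCacheBlockQueue` implementation; all names below are made up), the free-block queue can rank blocks by the predicted likelihood that their workload will reuse them, and evict the least likely block first instead of the least recently used one:

```python
import heapq
import math
import time


class WorkloadAwareEvictionSketch:
    """Illustrative only: rank free blocks by predicted reuse likelihood.

    Each workload type has a fitted reuse rate (reuses per second). Under a
    simple exponential reuse-time model, the chance that a block freed `age`
    seconds ago will still be reused decays with age, and it decays faster
    for workloads whose reuses usually come quickly.
    """

    def __init__(self, reuse_rate_per_workload: dict[str, float]):
        self.reuse_rates = reuse_rate_per_workload
        self._heap: list[tuple[float, int]] = []  # (reuse_score, block_id)

    def on_block_freed(self, block_id: int, workload: str, freed_at: float) -> None:
        rate = self.reuse_rates.get(workload, 1.0)
        age = time.time() - freed_at
        # Survival probability of an exponential reuse time at this age.
        reuse_score = math.exp(-rate * age)
        heapq.heappush(self._heap, (reuse_score, block_id))

    def pop_eviction_candidate(self) -> int:
        # Evict the free block least likely to be reused (lowest score).
        _, block_id = heapq.heappop(self._heap)
        return block_id
```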

This PR introduces a new optional per-request parameter, `type_info`, which carries the workload type of the request as set by the frontend client. For example, a client can set a request's workload type to `text_1`, meaning the request is the first turn of a chat conversation, or `file_2`, meaning the request is the second turn of a file-analysis session. Using this workload tag, cloud providers can classify requests from different business scenarios and guide the vLLM engine's cache eviction.
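For example, if the tag is exposed through the OpenAI-compatible frontend, a client might attach it via `extra_body` (a sketch under that assumption; the exact request field is defined by this PR, so check the documentation mentioned at the end for the real name and placement):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# "file_2" tags this request as the second turn of a file-analysis session.
# Passing it through extra_body is an assumption for illustration, not
# necessarily the PR's actual API surface.
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Summarize the attached report."}],
    extra_body={"type_info": "file_2"},
)
print(response.choices[0].message.content)
```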

Note that the WA policy is not tied to the Aliyun Bailian traces. It can be useful in any deployment where one vLLM serving engine serves multiple frontend workloads (chat, multimodal, reasoning, etc.). As long as the client provides the workload tag in the request, the WA policy can leverage it to perform better cache eviction than LRU.

A more detailed analysis of the production trace and the formula for our probability prediction model can be found in our paper (appeared at USENIX ATC '25).
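As a rough intuition (inferred from the exponential fitting used in profiler_utils.py below, not necessarily the paper's exact formulation): if each workload $w$ has an estimated reuse rate $\hat{\lambda}_w$ (the reciprocal of its mean observed reuse interval), the chance that a block freed $t$ seconds ago will still be reused can be approximated by the exponential survival function

$$P_w(\text{reuse after } t) \approx e^{-\hat{\lambda}_w t},$$

so the policy can preferentially evict the free blocks whose workload makes near-term reuse least likely.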

Test Plan

We evaluate the effectiveness of the WA policy on 7B and 70B models under different GPU cache space budgets.

Setup

  • Model: Qwen/Qwen2.5-7B-Instruct, meta-llama/Llama-3.3-70B-Instruct

  • GPU: 1~4 x Nvidia A800 80GB, TP=4 when testing the 70B model.

  • Trace: Aliyun Bailian Trace

  • QPS: first hour 6 qps, second hour 6 qps.

  • Total elements: 43195

  • Average input length: 2337.99

  • Average output length: 430.34

Demo

The benchmark/benchmark_wa.py script demonstrates a basic implementation of the workload-aware policy's profiling and prediction workflow. This specially designed client simulates multi-turn dialogues by generating requests based on the previous turn's output.
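To make the multi-turn structure concrete, a sketch of such a client loop is shown below (illustrative only; `send_request` is a hypothetical helper, not part of benchmark_wa.py):

```python
def run_dialogue(send_request, first_prompt: str, workload: str, num_turns: int) -> list[str]:
    """Simulate one multi-turn session: each turn extends the previous context."""
    outputs = []
    context = first_prompt
    for turn in range(1, num_turns + 1):
        # Tag the request with its workload type and turn index, e.g. "text_1".
        output = send_request(prompt=context, type_info=f"{workload}_{turn}")
        outputs.append(output)
        # The next turn's prompt contains this turn's output, so its prefix
        # (and therefore its KVCache blocks) overlaps with the current turn.
        context = context + output
    return outputs
```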

The benchmark/profiler_utils.py module provides a cache simulator to profile KVCache reuse patterns across different workloads.
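A minimal sketch of what such profiling can look like, assuming the same scipy exponential fit that appears in the review comments below (function and field names here are illustrative, not the actual profiler_utils.py API):

```python
from collections import defaultdict

from scipy.stats import expon


def fit_reuse_rates(events: list[tuple[str, str, float]]) -> dict[str, float]:
    """events: (workload_type, block_hash, timestamp) for each block access."""
    last_seen: dict[str, float] = {}
    intervals: dict[str, list[float]] = defaultdict(list)
    for workload, block_hash, ts in sorted(events, key=lambda e: e[2]):
        if block_hash in last_seen:
            intervals[workload].append(ts - last_seen[block_hash])
        last_seen[block_hash] = ts

    rates = {}
    for workload, data in intervals.items():
        _, scale = expon.fit(data)  # fitted scale is the mean reuse interval
        rates[workload] = 1.0 / scale if scale > 0 else 0.0
    return rates
```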

The Bailian Traces dataset contains a two-hour trace at 6 queries per second (QPS). We utilize the first hour's trace to:

  1. Profile KVCache reuse patterns for various workloads

  2. Generate and export a hyperparameter configuration file

Subsequently, we launch a vLLM engine that loads this hyperparameter file to serve the second hour's trace.
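For reference, an offline launch for the serving step could look like the sketch below. `num_gpu_blocks_override`, `gpu_memory_utilization`, and `max_num_batched_tokens` are existing vLLM engine arguments; the option that loads the exported hyperparameter file is introduced by this PR, so the name in the comment is only a placeholder (see docs/features/workload_aware_policy.md for the real one):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    max_num_batched_tokens=16384,
    gpu_memory_utilization=0.9,
    num_gpu_blocks_override=4096,  # caps the KVCache space for the sweep
    # wa_hyperparam_path="wa_hyperparams.json",  # placeholder name for the PR's option
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
```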

Additionally, benchmark_wa.py generates detailed metrics files for analyzing both Queued Time to First Token (QTTFT) and Time Per Output Token (TPOT) performance.

Performance Improvement

Since KVCache hits primarily reduce Time to First Token (TTFT) latency, and Prefill-Decoding (PD) disaggregation has become prevalent in modern cloud provider deployments, we tested the prefill-only component (representing the prefill node in PD disaggregation) using the 6 QPS trace data. These tests were conducted across varying GPU KVCache block allocations. The reported queued TTFT (QTTFT) metric includes request queuing time, which is particularly important for evaluating user experience.

Qwen 7B model

`max_num_batched_tokens` is set to 16384 to improve GPU utilization, and the GPU memory utilization is 0.9. We use the `num-gpu-blocks-override` parameter to vary the cache space.
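To put the block counts in perspective (assuming vLLM's default block size of 16 tokens, which is an assumption here), the sweep below spans roughly 16K to 131K cached tokens, i.e. from about 7 average prompts' worth of cache up to about 56:

```python
block_size = 16          # assumed default vLLM block size, in tokens
avg_input_len = 2337.99  # from the trace statistics above
for num_blocks in (1024, 4096, 8192):
    tokens = num_blocks * block_size
    print(num_blocks, tokens, round(tokens / avg_input_len, 1), "avg prompts")
```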

| num_gpu_blocks | WA_mean_qttft | LRU_mean_qttft | QTTFT_Improvement (%) | WA_hit_rate | LRU_hit_rate | Hit_Rate_Improvement (%) |
|---|---|---|---|---|---|---|
| 1024 | 14016.4 | 22322.4 | 37.21 | 0.1381 | 0.1175 | 17.53 |
| 2048 | 13458.6 | 23545 | 42.84 | 0.1586 | 0.1281 | 23.81 |
| 3072 | 10594.5 | 21969.9 | 51.78 | 0.1753 | 0.1407 | 24.59 |
| 4096 | 8544.2 | 13710.8 | 37.68 | 0.1934 | 0.1566 | 23.5 |
| 5120 | 6003.9 | 10271.6 | 41.55 | 0.2054 | 0.1786 | 15.01 |
| 6144 | 5283.4 | 7877.8 | 32.93 | 0.2245 | 0.2068 | 8.56 |
| 7168 | 2945.9 | 4963 | 40.63 | 0.2392 | 0.2299 | 4.05 |
| 8192 | 2264.1 | 2280.6 | 0.72 | 0.256 | 0.2498 | 2.48 |

We can see that the WA policy improves the cache hit rate by 2.5% to 24.6% over LRU and reduces the mean QTTFT by 0.7% to 52% compared to LRU. The WA policy does better when the cache space is relatively limited.

Llama 70B model

Since the system throughput is only 1~2 qps when serving the 70B model, we downsample the second hour's 6 qps trace to 2 qps, and we verify that the ratio of different turn types remains the same after sampling.
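One straightforward way to do such ratio-preserving downsampling (a sketch; not necessarily the exact sampling procedure used here, and the field names are assumed) is to sample the same fraction within each turn-type group:

```python
import random
from collections import defaultdict


def stratified_sample(requests: list[dict], keep_ratio: float = 2 / 6, seed: int = 0) -> list[dict]:
    """Downsample a trace while preserving the mix of turn types."""
    random.seed(seed)
    by_type: dict[str, list[dict]] = defaultdict(list)
    for req in requests:
        by_type[req["type_info"]].append(req)

    sampled = []
    for group in by_type.values():
        k = max(1, round(len(group) * keep_ratio))
        sampled.extend(random.sample(group, k))
    # Replay in timestamp order so the sampled trace stays realistic.
    return sorted(sampled, key=lambda r: r["timestamp"])
```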

| num_gpu_blocks | WA_mean_qttft | LRU_mean_qttft | QTTFT_Improvement (%) | WA_hit_rate | LRU_hit_rate | Hit_Rate_Improvement (%) |
|---|---|---|---|---|---|---|
| 512 | 6948.15 | 9064.9 | 23.351 | 0.131199 | 0.109314 | 20.0207 |
| 1024 | 4231.16 | 7808.79 | 45.8154 | 0.166392 | 0.12963 | 28.3594 |
| 2048 | 3299.04 | 4589.6 | 28.1191 | 0.215587 | 0.201457 | 7.01393 |
| 3072 | 2672.74 | 2798.33 | 4.48785 | 0.261666 | 0.259961 | 0.655852 |

We can see that the WA policy improves the cache hit rate by 0.7% to 28% over LRU and reduces the mean QTTFT by 4.5% to 46% compared to LRU.

(Optional) Documentation Update

The documentation for the WA policy is at docs/features/workload_aware_policy.md; it covers the implementation details.

@mergify mergify bot added the documentation, performance, and v1 labels Aug 5, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a workload-aware KVCache eviction policy, enhancing the cache eviction mechanism by leveraging workload type information. The changes include a new benchmark script (benchmark_wa.py) and a profiling utility (profiler_utils.py). The review identified potential ZeroDivisionError and incorrect rate parameter calculation issues, which have been addressed in the review comments.


metrics["extras"] = {
"total_hit_tokens": sum(hit_tokens),
"hit_rate": sum(hit_tokens) / sum(input_lengths),
Contributor


high

This line can cause a ZeroDivisionError if sum(input_lengths) is 0. This can happen if no requests were processed (chosens is empty), which would cause the benchmark to crash at the end and lose all results.

Suggested change
"hit_rate": sum(hit_tokens) / sum(input_lengths),
"hit_rate": sum(hit_tokens) / sum(input_lengths) if sum(input_lengths) > 0 else 0.0,

Comment on lines 124 to 128
    params = expon.fit(data)
    if params[1] == 0:
        return 0
    lambda_hat = 1 / params[1]
    return lambda_hat
Contributor


high

The current implementation returns a lambda of 0 when the scale parameter from expon.fit is 0. This typically happens when all data points in data are identical. A lambda of 0 implies an infinite mean reuse time, which is incorrect if the constant reuse time is a small positive number. This can lead to suboptimal hyperparameter generation for the WA policy.

A more correct approach is to calculate the mean of the data directly in this case and return its reciprocal as the rate parameter lambda.

Suggested change
    params = expon.fit(data)
    if params[1] == 0:
        return 0
    lambda_hat = 1 / params[1]
    return lambda_hat

    if params[1] == 0:
        mean_reuse_time = np.mean(data)
        return 1.0 / mean_reuse_time if mean_reuse_time > 0 else 1e9


github-actions bot commented Aug 5, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀


mergify bot commented Aug 11, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Chasingdreams6.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Aug 11, 2025
@mergify mergify bot removed the needs-rebase label Aug 12, 2025

mergify bot commented Aug 15, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Chasingdreams6.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Aug 15, 2025
@mergify mergify bot removed the needs-rebase label Aug 15, 2025

mergify bot commented Aug 27, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Chasingdreams6.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Aug 27, 2025
@mergify mergify bot removed the needs-rebase label Sep 1, 2025