[Perf][Feat][Core] Workload-Aware KVCache Eviction Policy #22236
Conversation
Code Review
This pull request introduces a workload-aware KVCache eviction policy, enhancing the cache eviction mechanism by leveraging workload type information. The changes include a new benchmark script (`benchmark_wa.py`) and a profiling utility (`profiler_utils.py`). The review identified a potential `ZeroDivisionError` and an incorrect rate-parameter calculation, both of which are addressed in the review comments below.
benchmarks/benchmark_wa.py
Outdated
```python
metrics["extras"] = {
    "total_hit_tokens": sum(hit_tokens),
    "hit_rate": sum(hit_tokens) / sum(input_lengths),
```
This line can cause a `ZeroDivisionError` if `sum(input_lengths)` is 0. This can happen if no requests were processed (`chosens` is empty), which would cause the benchmark to crash at the end and lose all results.
"hit_rate": sum(hit_tokens) / sum(input_lengths), | |
"hit_rate": sum(hit_tokens) / sum(input_lengths) if sum(input_lengths) > 0 else 0.0, |
benchmarks/profiler_utils.py
Outdated
```python
params = expon.fit(data)
if params[1] == 0:
    return 0
lambda_hat = 1 / params[1]
return lambda_hat
```
The current implementation returns a lambda of 0 when the scale parameter from `expon.fit` is 0. This typically happens when all data points in `data` are identical. A lambda of 0 implies an infinite mean reuse time, which is incorrect if the constant reuse time is a small positive number, and can lead to suboptimal hyperparameter generation for the WA policy.

A more correct approach is to calculate the mean of the data directly in this case and return its reciprocal as the rate parameter lambda.
Suggested change:
```diff
  params = expon.fit(data)
  if params[1] == 0:
-     return 0
+     mean_reuse_time = np.mean(data)
+     return 1.0 / mean_reuse_time if mean_reuse_time > 0 else 1e9
  lambda_hat = 1 / params[1]
  return lambda_hat
```
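For context, here is a small self-contained sketch of the failure mode and the suggested fallback. It assumes the reuse times are fitted with `scipy.stats.expon`, as in the snippet above; the function and variable names are illustrative only.

```python
# Demonstrates why the fallback matters: expon.fit returns scale == 0 when all
# samples are identical, and the original code would then report a rate of 0,
# i.e. an infinite mean reuse time.
import numpy as np
from scipy.stats import expon


def fit_lambda(data):
    loc, scale = expon.fit(data)
    if scale == 0:  # degenerate case: all reuse times identical
        mean_reuse_time = np.mean(data)
        return 1.0 / mean_reuse_time if mean_reuse_time > 0 else 1e9
    return 1.0 / scale


print(fit_lambda([2.0, 2.0, 2.0]))  # constant reuse time of 2 s -> lambda 0.5
print(fit_lambda([1.0, 3.0, 5.0]))  # regular exponential fit
```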
Purpose
PR Description
Nowadays, cloud providers typically use a unified serving engine deployed on GPUs to serve all request types (text, image, file, agent calls, etc.) for better resource utilization. However, the mean response time of these workloads differs, causing differences in KVCache reuse time. For example, humans respond faster to image/audio outputs than to the complex text or file results generated by the LLM. Based on our analysis of real-world LLM traffic from Aliyun Bailian, a top cloud provider, we found that a generic KVCache eviction policy (such as LRU) may not be optimal.
This PR provides a new feature, the Workload-Aware KVCache policy (WA), extending the `FreeKVCacheBlockQueue` data structure to `WorkloadAwareFreeKVCacheBlockQueue`. It leverages extra information (i.e., the workload type) attached to each KVCache block's request to perform better cache eviction than the default LRU policy used by `FreeKVCacheBlockQueue`.
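To make the idea concrete, below is a minimal, illustrative sketch of how per-workload free queues could be combined with a profiled reuse-rate model to pick eviction victims. The class, method names, default horizon, and exponential reuse-time model are assumptions for illustration, not the actual `WorkloadAwareFreeKVCacheBlockQueue` implementation in this PR.

```python
# Toy workload-aware free-block queue (hypothetical, for illustration only).
import math
from collections import OrderedDict


class WorkloadAwareFreeQueueSketch:
    def __init__(self, workload_lambdas: dict[str, float]):
        # Profiled reuse rate (lambda, in 1/seconds) per workload type.
        self.lambdas = workload_lambdas
        # One LRU-ordered map of free block id -> time freed, per workload.
        self.queues: dict[str, OrderedDict] = {w: OrderedDict() for w in workload_lambdas}

    def free(self, block_id: int, workload: str, now: float) -> None:
        # A block enters its workload's queue when its request finishes.
        self.queues.setdefault(workload, OrderedDict())[block_id] = now

    def _reuse_prob(self, workload: str, horizon: float) -> float:
        # Probability the workload's blocks are reused within `horizon` seconds
        # under an exponential reuse-time model: 1 - exp(-lambda * horizon).
        lam = self.lambdas.get(workload, 0.0)
        return 1.0 - math.exp(-lam * horizon)

    def evict(self, horizon: float = 30.0) -> int:
        # Pick the workload least likely to reuse its blocks soon, then evict
        # its oldest free block (plain LRU within that workload).
        candidates = [(w, q) for w, q in self.queues.items() if q]
        if not candidates:
            raise RuntimeError("no free blocks to evict")
        workload, queue = min(candidates, key=lambda wq: self._reuse_prob(wq[0], horizon))
        block_id, _freed_at = queue.popitem(last=False)
        return block_id
```

The real implementation operates on `KVCacheBlock` objects inside the block pool; the sketch only conveys the eviction-ordering idea.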
This PR introduces a new optional per-request parameter, `type_info`, which carries the workload type of the request as set by the frontend client. For example, a client can set a request's workload type to `text_1`, meaning the request is the first turn of a chat dialogue, or `file_2`, meaning the request is the second turn of a file-analysis session. Using this workload tag, cloud providers can classify requests from different business scenarios and guide the vLLM engine's cache eviction.

Note that the WA policy is not limited to the Aliyun Bailian traces. It can be useful in any deployment where a single vLLM serving engine serves multiple frontend workloads (chat, multimodal, reasoning, etc.). As long as the client provides the workload tag in the request, the WA policy can leverage it to perform better cache eviction than LRU.
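As an illustration, a frontend client could tag its requests as follows. Passing the tag through `extra_body` of the OpenAI-compatible API is an assumption made for this sketch; the exact field placement is defined by this PR.

```python
# Hypothetical client-side usage: tag each request with its workload type so
# the engine's WA policy can group KVCache blocks by workload. The field name
# `type_info` comes from this PR; sending it via `extra_body` is an assumption.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Summarize the attached report."}],
    extra_body={"type_info": "file_2"},  # second turn of a file-analysis session
)
print(response.choices[0].message.content)
```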
A more detailed analysis of the production trace and the formula of our probability prediction model can be found in our paper (appeared at USENIX ATC '25).
Test Plan
We evaluate the effectiveness of the WA policy on 7B and 70B models under different GPU cache space budgets.
Setup
Model: Qwen/Qwen2.5-7B-Instruct, meta/Llama-3.3-70B-Instruct
GPU: 1~4 x Nvidia A800 80GB, TP=4 when testing the 70B model.
Trace: Aliyun Bailian Trace
QPS: first hour 6 qps, second hour 6 qps.
Total elements: 43195
Average input length: 2337.99
Average output length: 430.34
Demo
The `benchmarks/benchmark_wa.py` script demonstrates a basic implementation of the workload-aware policy's profiling and prediction workflow. This specially designed client simulates multi-turn dialogues by generating requests based on the previous turn's output. The `benchmarks/profiler_utils.py` module provides a cache simulator to profile KVCache reuse patterns across different workloads.

The Bailian trace dataset contains a two-hour trace at 6 queries per second (QPS). We use the first hour's trace to:
1. Profile KVCache reuse patterns for various workloads
2. Generate and export a hyperparameter configuration file
Subsequently, we launch a vLLM engine that loads this hyperparameter file to serve the second hour's trace.
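A minimal sketch of this two-phase workflow is shown below. The file name, JSON schema, rate values, and comments are assumptions for illustration; the actual profiling logic lives in `benchmarks/profiler_utils.py`.

```python
# Sketch of the profile-then-serve flow described above (illustrative only).
import json

# Phase 1: profile the first hour's trace with the cache simulator and fit a
# reuse rate (lambda, in 1/seconds) per workload type, then export the result.
profiled_rates = {"text_1": 0.45, "text_2": 0.30, "file_1": 0.05, "file_2": 0.03}
with open("wa_hyperparams.json", "w") as f:
    json.dump(profiled_rates, f, indent=2)

# Phase 2: the serving engine loads this file at startup and uses the rates to
# rank free blocks by predicted reuse probability when serving the second hour.
with open("wa_hyperparams.json") as f:
    rates = json.load(f)
print(rates["file_2"])  # low rate -> long expected reuse time -> evict earlier
```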
Additionally, `benchmark_wa.py` generates detailed metrics files for analyzing both Query Time to First Token (QTTFT) and Time Per Output Token (TPOT) performance.

Performance Improvement
Since KVCache hits primarily reduce Time to First Token (TTFT) latency, and Prefill-Decoding (PD) disaggregation has become prevalent in modern cloud provider deployments, we tested the prefill-only component (representing the prefill node in PD disaggregation) using the 6 QPS trace data. These tests were conducted across varying GPU KVCache block allocations. The reported queued TTFT metric includes request queuing time, which is particularly critical for user-experience evaluation.
Qwen 7B model
The `max_num_batched_tokens` is set to 16384 to improve GPU utilization, and the GPU memory utilization is 0.9. We use the `--num-gpu-blocks-override` option to vary the cache space.
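For reference, here is a sketch of how these knobs could be set when constructing the engine offline; the override value below is a placeholder rather than one of the benchmarked settings.

```python
# Illustrative engine configuration with the knobs mentioned above.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    max_num_batched_tokens=16384,   # larger prefill batches for GPU utilization
    gpu_memory_utilization=0.9,
    num_gpu_blocks_override=8192,   # shrink/grow the KVCache space under test
)
```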
We can see that the WA policy improves the cache hit rate by 2.5% to 24.6% over LRU and reduces the queued TTFT by 0.7% to 52% compared with LRU. The WA policy performs better when the cache space is relatively limited.
Llama 70B model
Since the system throughput is 1~2 qps when serving the 70B model, we downsample the second hour's 6 qps trace to 2 qps; we verified that the ratio of different turns remains the same.
We can see that the WA policy improves the cache hit rate by 0.7% to 28% over LRU and reduces the queued TTFT by 4.5% to 46% compared with LRU.
(Optional) Documentation Update
The documentation for the WA policy is at `docs/features/workload_aware_policy.md`; see it for implementation details.