[Perf][Feat][Core] Workload-Aware KVCache Eviction Policy #22236
Conversation
Code Review
This pull request introduces a workload-aware KVCache eviction policy, enhancing the cache eviction mechanism by leveraging workload type information. The changes include a new benchmark script (`benchmark_wa.py`) and a profiling utility (`profiler_utils.py`). The review identified a potential `ZeroDivisionError` and an incorrect rate-parameter calculation, both of which are addressed in the review comments below.
benchmarks/benchmark_wa.py
Outdated
```python
metrics["extras"] = {
    "total_hit_tokens": sum(hit_tokens),
    "hit_rate": sum(hit_tokens) / sum(input_lengths),
```
This line can cause a `ZeroDivisionError` if `sum(input_lengths)` is 0. This can happen if no requests were processed (`chosens` is empty), which would cause the benchmark to crash at the end and lose all results.
"hit_rate": sum(hit_tokens) / sum(input_lengths), | |
"hit_rate": sum(hit_tokens) / sum(input_lengths) if sum(input_lengths) > 0 else 0.0, |
benchmarks/profiler_utils.py
Outdated
```python
params = expon.fit(data)
if params[1] == 0:
    return 0
lambda_hat = 1 / params[1]
return lambda_hat
```
The current implementation returns a lambda of 0 when the scale parameter from `expon.fit` is 0. This typically happens when all data points in `data` are identical. A lambda of 0 implies an infinite mean reuse time, which is incorrect if the constant reuse time is a small positive number, and can lead to suboptimal hyperparameter generation for the WA policy.

A more correct approach is to calculate the mean of the data directly in this case and return its reciprocal as the rate parameter lambda.
Suggested change:
```diff
  params = expon.fit(data)
  if params[1] == 0:
-     return 0
+     mean_reuse_time = np.mean(data)
+     return 1.0 / mean_reuse_time if mean_reuse_time > 0 else 1e9
  lambda_hat = 1 / params[1]
  return lambda_hat
```
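For context, here is a small self-contained sketch of the failure mode and the suggested fallback. It assumes the reuse times are fitted with `scipy.stats.expon`, as in the snippet above; the function and variable names are illustrative only.

```python
# Demonstrates why the fallback matters: expon.fit returns scale == 0 when all
# samples are identical, and the original code would then report a rate of 0,
# i.e. an infinite mean reuse time.
import numpy as np
from scipy.stats import expon


def fit_lambda(data):
    loc, scale = expon.fit(data)
    if scale == 0:  # degenerate case: all reuse times identical
        mean_reuse_time = np.mean(data)
        return 1.0 / mean_reuse_time if mean_reuse_time > 0 else 1e9
    return 1.0 / scale


print(fit_lambda([2.0, 2.0, 2.0]))  # constant reuse time of 2 s -> lambda 0.5
print(fit_lambda([1.0, 3.0, 5.0]))  # regular exponential fit
```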
Purpose
PR Description
Nowadays, cloud providers typically use a unified serving engine deployed on GPUs to serve all request types (text, image, file, agent calls, etc.) for better resource utilization. However, the mean response time of these workloads differs, causing differences in KVCache reuse time. For example, humans respond faster to image/audio outputs than to the complex text or file results generated by the LLM. Based on our analysis of real-world LLM traffic from Aliyun Bailian, a top cloud provider, we found that a generic KVCache eviction policy (such as LRU) may not be optimal.
This PR provides a new feature, the Workload-Aware KVCache policy (WA), extending the `FreeKVCacheBlockQueue` data structure to `WorkloadAwareFreeKVCacheBlockQueue`. It leverages extra information (i.e., the workload type) attached to each KVCache block's request to perform better cache eviction than the default LRU policy used by `FreeKVCacheBlockQueue`.
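To make the idea concrete, below is a minimal, illustrative sketch of how per-workload free queues could be combined with a profiled reuse-rate model to pick eviction victims. The class, method names, default horizon, and exponential reuse-time model are assumptions for illustration, not the actual `WorkloadAwareFreeKVCacheBlockQueue` implementation in this PR.

```python
# Toy workload-aware free-block queue (hypothetical, for illustration only).
import math
from collections import OrderedDict


class WorkloadAwareFreeQueueSketch:
    def __init__(self, workload_lambdas: dict[str, float]):
        # Profiled reuse rate (lambda, in 1/seconds) per workload type.
        self.lambdas = workload_lambdas
        # One LRU-ordered map of free block id -> time freed, per workload.
        self.queues: dict[str, OrderedDict] = {w: OrderedDict() for w in workload_lambdas}

    def free(self, block_id: int, workload: str, now: float) -> None:
        # A block enters its workload's queue when its request finishes.
        self.queues.setdefault(workload, OrderedDict())[block_id] = now

    def _reuse_prob(self, workload: str, horizon: float) -> float:
        # Probability the workload's blocks are reused within `horizon` seconds
        # under an exponential reuse-time model: 1 - exp(-lambda * horizon).
        lam = self.lambdas.get(workload, 0.0)
        return 1.0 - math.exp(-lam * horizon)

    def evict(self, horizon: float = 30.0) -> int:
        # Pick the workload least likely to reuse its blocks soon, then evict
        # its oldest free block (plain LRU within that workload).
        candidates = [(w, q) for w, q in self.queues.items() if q]
        if not candidates:
            raise RuntimeError("no free blocks to evict")
        workload, queue = min(candidates, key=lambda wq: self._reuse_prob(wq[0], horizon))
        block_id, _freed_at = queue.popitem(last=False)
        return block_id
```

The real implementation operates on `KVCacheBlock` objects inside the block pool; the sketch only conveys the eviction-ordering idea.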
This PR introduces a new optional per-request parameter, `type_info`, which carries the workload type of the request as set by the frontend client. For example, a client can set a request's workload type to `text_1`, meaning the request is the first turn of a chat dialogue, or `file_2`, meaning the request is the second turn of a file-analysis session. Using this workload tag, cloud providers can classify requests from different business scenarios and guide the vLLM engine's cache eviction.

Note that the WA policy is not limited to the Aliyun Bailian traces. It can be useful in any deployment where a single vLLM serving engine serves multiple frontend workloads (chat, multimodal, reasoning, etc.). As long as the client provides the workload tag in the request, the WA policy can leverage it to perform better cache eviction than LRU.
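As an illustration, a frontend client could tag its requests as follows. Passing the tag through `extra_body` of the OpenAI-compatible API is an assumption made for this sketch; the exact field placement is defined by this PR.

```python
# Hypothetical client-side usage: tag each request with its workload type so
# the engine's WA policy can group KVCache blocks by workload. The field name
# `type_info` comes from this PR; sending it via `extra_body` is an assumption.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Summarize the attached report."}],
    extra_body={"type_info": "file_2"},  # second turn of a file-analysis session
)
print(response.choices[0].message.content)
```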
A more detailed analysis of the production trace and the formula of our probability prediction model can be found in our paper (appeared at USENIX ATC '25).
Test Plan
We evaluate the effectiveness of the WA policy on 7B and 70B models under different GPU cache space budgets.
Setup
Model: Qwen/Qwen2.5-7B-Instruct, meta/Llama-3.3-70B-Instruct
GPU: 1~4 x Nvidia A800 80GB, TP=4 when testing the 70B model.
Trace: Aliyun Bailian Trace
QPS: first hour 6 qps, second hour 6 qps.
Total elements: 43195
Average input length: 2337.99
Average output length: 430.34
Demo
The `benchmarks/benchmark_wa.py` script demonstrates a basic implementation of the workload-aware policy's profiling and prediction workflow. This specially designed client simulates multi-turn dialogues by generating requests based on the previous turn's output. The `benchmarks/profiler_utils.py` module provides a cache simulator to profile KVCache reuse patterns across different workloads.

The Bailian trace dataset contains a two-hour trace at 6 queries per second (QPS). We use the first hour's trace to:
1. Profile KVCache reuse patterns for various workloads
2. Generate and export a hyperparameter configuration file
Subsequently, we launch a vLLM engine that loads this hyperparameter file to serve the second hour's trace.
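A minimal sketch of this two-phase workflow is shown below. The file name, JSON schema, rate values, and comments are assumptions for illustration; the actual profiling logic lives in `benchmarks/profiler_utils.py`.

```python
# Sketch of the profile-then-serve flow described above (illustrative only).
import json

# Phase 1: profile the first hour's trace with the cache simulator and fit a
# reuse rate (lambda, in 1/seconds) per workload type, then export the result.
profiled_rates = {"text_1": 0.45, "text_2": 0.30, "file_1": 0.05, "file_2": 0.03}
with open("wa_hyperparams.json", "w") as f:
    json.dump(profiled_rates, f, indent=2)

# Phase 2: the serving engine loads this file at startup and uses the rates to
# rank free blocks by predicted reuse probability when serving the second hour.
with open("wa_hyperparams.json") as f:
    rates = json.load(f)
print(rates["file_2"])  # low rate -> long expected reuse time -> evict earlier
```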
Additionally, `benchmark_wa.py` generates detailed metrics files for analyzing both Query Time to First Token (QTTFT) and Time Per Output Token (TPOT) performance.

Performance Improvement
Since KVCache hits primarily reduce Time to First Token (TTFT) latency, and Prefill-Decoding (PD) disaggregation has become prevalent in modern cloud provider deployments, we tested the prefill-only component (representing the prefill node in PD disaggregation) using the 6 QPS trace data. These tests were conducted across varying GPU KVCache block allocations. The reported queued TTFT metric includes request queuing time, which is particularly critical for user-experience evaluation.
Qwen 7B model
The `max_num_batched_tokens` is set to 16384 to improve GPU utilization, and the GPU memory utilization is 0.9. We use the `--num-gpu-blocks-override` option to vary the cache space.
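For reference, here is a sketch of how these knobs could be set when constructing the engine offline; the override value below is a placeholder rather than one of the benchmarked settings.

```python
# Illustrative engine configuration with the knobs mentioned above.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    max_num_batched_tokens=16384,   # larger prefill batches for GPU utilization
    gpu_memory_utilization=0.9,
    num_gpu_blocks_override=8192,   # shrink/grow the KVCache space under test
)
```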
We can see that the WA policy improves the cache hit rate by 2.5% to 24.6% over LRU and reduces the queued TTFT by 0.7% to 52% compared with LRU. The WA policy performs better when the cache space is relatively limited.
Llama 70B model
Since the system throughput is 1~2 qps when serving the 70B model, we downsample the second hour's 6 qps trace to 2 qps; we verified that the ratio of different turns remains the same.
We can see that the WA policy improves the cache hit rate by 0.7% to 28% over LRU and reduces the queued TTFT by 4.5% to 46% compared with LRU.
(Optional) Documentation Update
The documentation for the WA policy is at `docs/features/workload_aware_policy.md`; see it for implementation details.