[TRTLLM-9782][feat] Skip memory estimation process and calculate kv cache memory directly from fraction #11102
Conversation
227bbff to 3ee143d

/bot run --disable-fail-fast
📝 Walkthrough

This pull request removes the KV cache estimation/profiling infrastructure across multiple components.

Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks: ✅ 1 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
PR_Github #34064 [ run ] triggered by Bot. Commit:
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@tensorrt_llm/_torch/pyexecutor/_util.py`:
- Around lines 114-128: `_cal_max_memory` can return a negative `available_kv_mem`, which leads to an invalid `max_tokens`. Clamp `available_kv_mem` to a minimum of 0 before returning (e.g., if `available_kv_mem < 0`, set it to 0) and update the method's docstring/note to state that negative available KV-cache memory is treated as zero, so downstream sizing does not break. The relevant variables are `kv_size_per_token`, `available_kv_mem`, `peak_memory`, `total_gpu_memory`, and `fraction`. (A sketch follows these items.)
- Around lines 232-236: Move the docstring for `build_managers` so it is the first statement in the function, before the call to `self.configure_kv_cache_capacity()`; the triple-quoted string currently placed after that call is a no-op and is not picked up as the function docstring. Put the docstring immediately after `def build_managers(...):`, and keep the call to `self.configure_kv_cache_capacity()` and the subsequent creation of `kv_cache_manager` via `self._create_kv_cache_manager(self._model_engine)` unchanged. (See the second sketch below.)
- Around lines 169-173: The local assignment `is_vswa = is_vswa(self._kv_cache_config)` shadows the module-level function `is_vswa` and causes an UnboundLocalError. Rename the boolean local (for example to `vswa_enabled` or `is_vswa_enabled`), keep the right-hand side call `is_vswa(self._kv_cache_config)` intact, and update the subsequent references (the `if` check and the `logger.warning` usage) to the new name so the function is no longer shadowed. (See the third sketch below.)
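For the first item, a minimal sketch of the suggested clamp, assuming `_cal_max_memory` derives the budget as `total_gpu_memory * fraction - peak_memory` (the signature and exact arithmetic are assumptions, not the PR's actual code):

```python
def _cal_max_memory(kv_size_per_token: int, peak_memory: int,
                    total_gpu_memory: int, fraction: float) -> int:
    """Compute the KV-cache token budget from the configured memory fraction.

    Note:
        Negative available KV-cache memory is treated as zero so downstream
        sizing never receives a negative ``max_tokens``.
    """
    available_kv_mem = int(total_gpu_memory * fraction) - peak_memory
    # A tight fraction or a high activation peak can make this negative;
    # clamp to zero instead of propagating an invalid budget.
    available_kv_mem = max(available_kv_mem, 0)
    return available_kv_mem // kv_size_per_token
```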
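For the second item, a sketch of the intended statement ordering; the signature, the `resources` handling, and the docstring text are illustrative assumptions, while the two method calls come from the comment:

```python
def build_managers(self, resources: dict) -> None:
    """Configure KV-cache capacity and build the KV-cache manager.

    This docstring is the first statement in the body, so it is the real
    function docstring; a string literal placed after
    configure_kv_cache_capacity() would be an unused expression.
    """
    self.configure_kv_cache_capacity()
    kv_cache_manager = self._create_kv_cache_manager(self._model_engine)
    resources["kv_cache_manager"] = kv_cache_manager  # illustrative key
```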
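For the third item, a sketch of the rename that removes the shadowing; the warning text is illustrative, while the `is_vswa(...)` call and the `logger.warning` usage are taken from the comment:

```python
# Before: `is_vswa = is_vswa(self._kv_cache_config)` turns `is_vswa` into a
# local variable for the whole function body, so the call itself raises
# UnboundLocalError. Use a distinct name for the flag instead.
vswa_enabled = is_vswa(self._kv_cache_config)
if vswa_enabled:
    logger.warning(
        "Variable sliding window attention is enabled; computing KV-cache "
        "capacity per attention window.")  # illustrative message
```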
In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py`:
- Around lines 273-300: The VSWA branch sets `blocks_per_window` via `calculate_max_num_blocks_from_cpp` but never initializes the legacy attributes `blocks_in_primary_pool` and `blocks_in_secondary_pool`, which causes an AttributeError downstream. Compute conservative minimum values from the `blocks_per_window` mapping returned by `calculate_max_num_blocks_from_cpp` (e.g., take the minimum primary and secondary block counts across all windows) and assign them to `self.blocks_in_primary_pool` and `self.blocks_in_secondary_pool` inside the `is_vswa` branch, so the code paths that still expect these legacy attributes (`calculate_max_num_blocks_from_cpp`, `max_attention_window_vec`, flashinfer.py, demollm.py, interface.py, rocket.py, dsa.py) continue to work; a sketch follows.
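A hedged sketch of one way to keep the legacy attributes populated, assuming `blocks_per_window` maps each attention-window size to a `(primary, secondary)` block-count pair; the mapping shape and the `...` arguments are assumptions:

```python
if is_vswa(self._kv_cache_config):
    # window_size -> (num_primary_blocks, num_secondary_blocks), assumed shape.
    blocks_per_window = self.calculate_max_num_blocks_from_cpp(...)
    self.blocks_per_window = blocks_per_window

    # Callers that predate VSWA (flashinfer.py, demollm.py, interface.py,
    # rocket.py, dsa.py) still read these attributes, so derive conservative
    # values by taking the minimum across all windows.
    self.blocks_in_primary_pool = min(
        primary for primary, _ in blocks_per_window.values())
    self.blocks_in_secondary_pool = min(
        secondary for _, secondary in blocks_per_window.values())
```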
🧹 Nitpick comments (3)
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1)
36-38: Remove the shadowing `is_mla` import.

`is_mla` is imported from both `._util` and `.config_utils`; the latter shadows the former and makes intent unclear. Keep one source (likely `.config_utils`) to avoid ambiguity. As per coding guidelines, avoid shadowing variables declared in an outer scope in Python.

♻️ Minimal fix

```diff
-from ._util import (KvCacheCreator, create_py_executor_instance,
-                    instantiate_sampler, is_mla, validate_feature_combination)
+from ._util import (KvCacheCreator, create_py_executor_instance,
+                    instantiate_sampler, validate_feature_combination)
```

tensorrt_llm/_torch/pyexecutor/_util.py (2)
2-2: Use module namespace imports for typing/speculative utilities

The new `from typing import ...` and `from ..speculative import get_spec_decoder` imports break the repo's namespace-import rule. Please switch to module imports and qualify usages (e.g., `typing.Dict`, `typing.Optional`, `speculative.get_spec_decoder`).

♻️ Suggested update

```diff
-import os
-from typing import Dict, Optional
+import os
+import typing
@@
-from ..speculative import get_spec_decoder
+from .. import speculative
@@
-    return get_spec_decoder(sampler_args, engine.spec_config)
+    return speculative.get_spec_decoder(sampler_args, engine.spec_config)
```

As per coding guidelines, always maintain the namespace when importing Python modules, even if only one class or function from a module is used.
Also applies to: 26-27, 681-683
56-58: Add a docstring and return a strict bool in `is_vswa`

This helper can return an empty list/None instead of a bool. Add a brief Google-style docstring and cast to `bool` for clarity and type stability.

♻️ Suggested update

```diff
-def is_vswa(kv_cache_config):
-    max_attention_window = kv_cache_config.max_attention_window
-    return max_attention_window and len(set(max_attention_window)) > 1
+def is_vswa(kv_cache_config) -> bool:
+    """Return True when VSWA is enabled via heterogeneous attention windows."""
+    max_attention_window = kv_cache_config.max_attention_window
+    return bool(max_attention_window) and len(set(max_attention_window)) > 1
```

As per coding guidelines, use Google-style docstrings for Python classes and functions, which can be parsed by Sphinx.
Signed-off-by: Hui Gao <[email protected]>
3ee143d to 8a83e02

/bot run --disable-fail-fast

PR_Github #34076 [ run ] triggered by Bot. Commit:

PR_Github #34064 [ run ] completed with state

PR_Github #34076 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #34159 [ run ] triggered by Bot. Commit:
66feb7f to 6b3feac

/bot run

PR_Github #34159 [ run ] completed with state

PR_Github #34189 [ run ] triggered by Bot. Commit:
Signed-off-by: Hui Gao <[email protected]>
6b3feac to df5de48

/bot run

PR_Github #34205 [ run ] triggered by Bot. Commit:

PR_Github #34189 [ run ] completed with state

PR_Github #34205 [ run ] completed with state
Signed-off-by: Hui Gao <[email protected]>
/bot run --stage-list "RTX5090-PyTorch-1"

PR_Github #34222 [ run ] triggered by Bot. Commit:

PR_Github #34222 [ run ] completed with state
Summary by CodeRabbit
Bug Fixes
Refactor
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update the tava architecture diagram if there is a significant design change in the PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
Details
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.
--disable-fail-fast (OPTIONAL): Disable fail fast on build/tests/infra failures.
--skip-test (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx" (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe" (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp" (OPTIONAL): Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
--post-merge (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL): Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill
kill: Kill all running builds associated with the pull request.

skip

skip --comment COMMENT: Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline: Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
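For example, a pre-merge run restricted to a single stage with fail-fast disabled, combining the flags documented above (the stage name is the example value from the help text):

/bot run --disable-fail-fast --stage-list "A10-PyTorch-1"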