[TRTLLM-9782][feat] Skip memory estimation process and calculate kv cache memory directly from fraction #11102
Conversation
227bbff to 3ee143d

/bot run --disable-fail-fast
📝 Walkthrough

This pull request removes the KV cache estimation/profiling infrastructure across multiple components.

Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks: ✅ 1 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
PR_Github #34064 [ run ] triggered by Bot. Commit:
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@tensorrt_llm/_torch/pyexecutor/_util.py`:
- Around lines 114-128: `_cal_max_memory` can return a negative `available_kv_mem`, which leads to an invalid `max_tokens`. Clamp `available_kv_mem` to a minimum of 0 before returning (e.g., if `available_kv_mem < 0`, set it to 0) and update the method's docstring/note to state that negative available KV-cache memory is treated as zero, so downstream sizing does not break. The relevant variables are `kv_size_per_token`, `available_kv_mem`, `peak_memory`, `total_gpu_memory`, and `fraction`. (A sketch follows these items.)
- Around lines 232-236: Move the docstring for `build_managers` so it is the first statement in the function, before the call to `self.configure_kv_cache_capacity()`; the triple-quoted string currently placed after that call is a no-op and is not picked up as the function docstring. Put the docstring immediately after `def build_managers(...):`, and keep the call to `self.configure_kv_cache_capacity()` and the subsequent creation of `kv_cache_manager` via `self._create_kv_cache_manager(self._model_engine)` unchanged. (See the second sketch below.)
- Around lines 169-173: The local assignment `is_vswa = is_vswa(self._kv_cache_config)` shadows the module-level function `is_vswa` and causes an UnboundLocalError. Rename the boolean local (for example to `vswa_enabled` or `is_vswa_enabled`), keep the right-hand side call `is_vswa(self._kv_cache_config)` intact, and update the subsequent references (the `if` check and the `logger.warning` usage) to the new name so the function is no longer shadowed. (See the third sketch below.)
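For the first item, a minimal sketch of the suggested clamp, assuming `_cal_max_memory` derives the budget as `total_gpu_memory * fraction - peak_memory` (the signature and exact arithmetic are assumptions, not the PR's actual code):

```python
def _cal_max_memory(kv_size_per_token: int, peak_memory: int,
                    total_gpu_memory: int, fraction: float) -> int:
    """Compute the KV-cache token budget from the configured memory fraction.

    Note:
        Negative available KV-cache memory is treated as zero so downstream
        sizing never receives a negative ``max_tokens``.
    """
    available_kv_mem = int(total_gpu_memory * fraction) - peak_memory
    # A tight fraction or a high activation peak can make this negative;
    # clamp to zero instead of propagating an invalid budget.
    available_kv_mem = max(available_kv_mem, 0)
    return available_kv_mem // kv_size_per_token
```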
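For the second item, a sketch of the intended statement ordering; the signature, the `resources` handling, and the docstring text are illustrative assumptions, while the two method calls come from the comment:

```python
def build_managers(self, resources: dict) -> None:
    """Configure KV-cache capacity and build the KV-cache manager.

    This docstring is the first statement in the body, so it is the real
    function docstring; a string literal placed after
    configure_kv_cache_capacity() would be an unused expression.
    """
    self.configure_kv_cache_capacity()
    kv_cache_manager = self._create_kv_cache_manager(self._model_engine)
    resources["kv_cache_manager"] = kv_cache_manager  # illustrative key
```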
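For the third item, a sketch of the rename that removes the shadowing; the warning text is illustrative, while the `is_vswa(...)` call and the `logger.warning` usage are taken from the comment:

```python
# Before: `is_vswa = is_vswa(self._kv_cache_config)` turns `is_vswa` into a
# local variable for the whole function body, so the call itself raises
# UnboundLocalError. Use a distinct name for the flag instead.
vswa_enabled = is_vswa(self._kv_cache_config)
if vswa_enabled:
    logger.warning(
        "Variable sliding window attention is enabled; computing KV-cache "
        "capacity per attention window.")  # illustrative message
```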
In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py`:
- Around lines 273-300: The VSWA branch sets `blocks_per_window` via `calculate_max_num_blocks_from_cpp` but never initializes the legacy attributes `blocks_in_primary_pool` and `blocks_in_secondary_pool`, which causes an AttributeError downstream. Compute conservative minimum values from the `blocks_per_window` mapping returned by `calculate_max_num_blocks_from_cpp` (e.g., take the minimum primary and secondary block counts across all windows) and assign them to `self.blocks_in_primary_pool` and `self.blocks_in_secondary_pool` inside the `is_vswa` branch, so the code paths that still expect these legacy attributes (`calculate_max_num_blocks_from_cpp`, `max_attention_window_vec`, flashinfer.py, demollm.py, interface.py, rocket.py, dsa.py) continue to work; a sketch follows.
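A hedged sketch of one way to keep the legacy attributes populated, assuming `blocks_per_window` maps each attention-window size to a `(primary, secondary)` block-count pair; the mapping shape and the `...` arguments are assumptions:

```python
if is_vswa(self._kv_cache_config):
    # window_size -> (num_primary_blocks, num_secondary_blocks), assumed shape.
    blocks_per_window = self.calculate_max_num_blocks_from_cpp(...)
    self.blocks_per_window = blocks_per_window

    # Callers that predate VSWA (flashinfer.py, demollm.py, interface.py,
    # rocket.py, dsa.py) still read these attributes, so derive conservative
    # values by taking the minimum across all windows.
    self.blocks_in_primary_pool = min(
        primary for primary, _ in blocks_per_window.values())
    self.blocks_in_secondary_pool = min(
        secondary for _, secondary in blocks_per_window.values())
```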
🧹 Nitpick comments (3)
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1)
36-38: Remove the shadowing `is_mla` import.

`is_mla` is imported from both `._util` and `.config_utils`; the latter shadows the former and makes intent unclear. Keep one source (likely `.config_utils`) to avoid ambiguity. As per coding guidelines, avoid shadowing variables declared in an outer scope in Python.

♻️ Minimal fix

```diff
-from ._util import (KvCacheCreator, create_py_executor_instance,
-                    instantiate_sampler, is_mla, validate_feature_combination)
+from ._util import (KvCacheCreator, create_py_executor_instance,
+                    instantiate_sampler, validate_feature_combination)
```

tensorrt_llm/_torch/pyexecutor/_util.py (2)
2-2: Use module namespace imports for typing/speculative utilities

The new `from typing import ...` and `from ..speculative import get_spec_decoder` imports break the repo's namespace-import rule. Please switch to module imports and qualify usages (e.g., `typing.Dict`, `typing.Optional`, `speculative.get_spec_decoder`).

♻️ Suggested update

```diff
-import os
-from typing import Dict, Optional
+import os
+import typing
@@
-from ..speculative import get_spec_decoder
+from .. import speculative
@@
-    return get_spec_decoder(sampler_args, engine.spec_config)
+    return speculative.get_spec_decoder(sampler_args, engine.spec_config)
```

As per coding guidelines, always maintain the namespace when importing Python modules, even if only one class or function from a module is used.
Also applies to: 26-27, 681-683
56-58: Add a docstring and return a strict bool in `is_vswa`

This helper can return an empty list/None instead of a bool. Add a brief Google-style docstring and cast to `bool` for clarity and type stability.

♻️ Suggested update

```diff
-def is_vswa(kv_cache_config):
-    max_attention_window = kv_cache_config.max_attention_window
-    return max_attention_window and len(set(max_attention_window)) > 1
+def is_vswa(kv_cache_config) -> bool:
+    """Return True when VSWA is enabled via heterogeneous attention windows."""
+    max_attention_window = kv_cache_config.max_attention_window
+    return bool(max_attention_window) and len(set(max_attention_window)) > 1
```

As per coding guidelines, use Google-style docstrings for Python classes and functions, which can be parsed by Sphinx.
Signed-off-by: Hui Gao <[email protected]>
3ee143d to 8a83e02

/bot run --disable-fail-fast

PR_Github #34076 [ run ] triggered by Bot. Commit:

PR_Github #34064 [ run ] completed with state

PR_Github #34076 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #34159 [ run ] triggered by Bot. Commit:
66feb7f to 6b3feac

/bot run

PR_Github #34159 [ run ] completed with state

PR_Github #34189 [ run ] triggered by Bot. Commit:
Signed-off-by: Hui Gao <[email protected]>
6b3feac to df5de48

/bot run

PR_Github #34205 [ run ] triggered by Bot. Commit:

PR_Github #34189 [ run ] completed with state

PR_Github #34205 [ run ] completed with state
Signed-off-by: Hui Gao <[email protected]>
/bot run --stage-list "RTX5090-PyTorch-1"

PR_Github #34222 [ run ] triggered by Bot. Commit:

PR_Github #34222 [ run ] completed with state
Summary by CodeRabbit
Bug Fixes
Refactor
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update the tava architecture diagram if there is a significant design change in the PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
Details
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.
--disable-fail-fast (OPTIONAL): Disable fail fast on build/tests/infra failures.
--skip-test (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx" (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe" (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp" (OPTIONAL): Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
--post-merge (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL): Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill
kill: Kill all running builds associated with the pull request.

skip

skip --comment COMMENT: Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline: Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
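For example, a pre-merge run restricted to a single stage with fail-fast disabled, combining the flags documented above (the stage name is the example value from the help text):

/bot run --disable-fail-fast --stage-list "A10-PyTorch-1"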