
Conversation

@HuiGao-NV
Collaborator

@HuiGao-NV HuiGao-NV commented Jan 29, 2026

Summary by CodeRabbit

  • Bug Fixes

    • Fixed potential crash in CUDA graph cleanup when graph runner is not initialized
  • Refactor

    • Simplified KV cache initialization and memory allocation workflow
    • Streamlined executor setup and resource management initialization
    • Removed redundant estimation and profiling steps to improve startup efficiency

✏️ Tip: You can customize this high-level summary in your review settings.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option is always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only [pytorch, cpp, tensorrt, triton] are supported. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous; skipping tests without careful validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous; reusing results without careful validation can break the top of tree.

@HuiGao-NV
Collaborator Author

/bot run --disable-fail-fast

@coderabbitai
Contributor

coderabbitai bot commented Jan 29, 2026

📝 Walkthrough

Walkthrough

This pull request removes the KV cache estimation/profiling infrastructure across multiple components. The estimating_kv_cache parameter is eliminated from cache managers, dummy request scaffolding and profiling code are deleted, and initialization flows are simplified with a new VSWA detection helper.

Changes

  • KV Cache Manager Parameter Cleanup
    Files: tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py, tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
    Summary: Removed the is_estimating_kv_cache parameter from KV cache manager constructors and related call sites; no functional logic changes in these modules.
  • KV Cache Estimation Infrastructure Refactor
    Files: tensorrt_llm/_torch/pyexecutor/_util.py
    Summary: Introduced an is_vswa() helper for VSWA detection. Removed profiling scaffolding (_create_dummy_* functions, profiling state), simplified the memory estimation logic, removed the allocated_bytes parameter from the memory calculation, and refactored KV cache manager construction to drop estimating_kv_cache and unconditionally pass kv_connector_manager.
  • Executor Creation Simplification
    Files: tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
    Summary: Removed the _adjust_torch_mem_fraction import and calls, eliminated the two-stage KV cache estimation flow in favor of a single unconditional creation path, added gc.collect() calls, removed the start_worker parameter, and streamlined the memory/resource allocation blocks.
  • Resource Manager VSWA Routing
    Files: tensorrt_llm/_torch/pyexecutor/resource_manager.py
    Summary: Removed the is_estimating_kv_cache parameter and replaced the estimation dry-run branch with a VSWA-aware path that uses C++ or Python-based block calculations depending on the is_vswa state.
  • Safety Enhancement
    Files: tensorrt_llm/_torch/pyexecutor/model_engine.py
    Summary: Added a null check on self.cuda_graph_runner before calling clear() in _release_cuda_graphs().
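
For reference, a minimal sketch of the guard described in the model_engine.py entry above, assuming only the attribute and method names mentioned there (the surrounding class is hypothetical):

class ModelEngineSketch:
    def __init__(self):
        # The runner may legitimately be None if CUDA graphs were never set up.
        self.cuda_graph_runner = None

    def _release_cuda_graphs(self):
        # Guard against an uninitialized runner so cleanup no longer fails
        # by calling clear() on None.
        if self.cuda_graph_runner is not None:
            self.cuda_graph_runner.clear()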

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 1 passed | ❌ 2 failed

❌ Failed checks (2 warnings)

  • Description check: ⚠️ Warning
    Explanation: The PR description is entirely template boilerplate, with all required sections left blank (Description, Test Coverage, PR Checklist items unchecked), providing no actual information about the changes.
    Resolution: Fill in the Description section explaining what was changed and why, add a Test Coverage section listing the relevant tests, and complete the PR Checklist items.
  • Docstring Coverage: ⚠️ Warning
    Explanation: Docstring coverage is 27.78%, below the required threshold of 80.00%.
    Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (1 passed)

  • Title check: ✅ Passed
    Explanation: The title clearly summarizes the main change: removing memory estimation and using a fraction-based KV cache memory calculation, which is the primary objective of the PR.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

Comment @coderabbitai help to get the list of available commands and usage tips.

@tensorrt-cicd
Collaborator

PR_Github #34064 [ run ] triggered by Bot. Commit: 3ee143d

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@tensorrt_llm/_torch/pyexecutor/_util.py`:
- Around lines 114-128: The _cal_max_memory method can return a negative available_kv_mem, which produces an invalid max_tokens. Clamp available_kv_mem to a minimum of 0 before returning (e.g., if available_kv_mem < 0, set it to 0) and note in the method docstring that negative available KV-cache memory is treated as zero to avoid downstream sizing errors. The relevant variables are kv_size_per_token, available_kv_mem, peak_memory, total_gpu_memory, and fraction; a sketch of the clamping appears after this list.
- Around lines 232-236: Move the docstring for build_managers so it is the first statement in the function. The triple-quoted string currently placed after the call to self.configure_kv_cache_capacity() is a no-op expression and will not be picked up as the function docstring. Put the docstring immediately after def build_managers(...):, and keep the call to self.configure_kv_cache_capacity() and the subsequent creation of kv_cache_manager via self._create_kv_cache_manager(self._model_engine) unchanged.
- Around lines 169-173: The local assignment is_vswa = is_vswa(self._kv_cache_config) shadows the module-level function is_vswa and raises UnboundLocalError. Rename the boolean local (for example to vswa_enabled or is_vswa_enabled) while keeping the right-hand-side call to is_vswa(self._kv_cache_config) intact, and update the subsequent references (the if check and the logger.warning usage) to the new name so the function name is no longer shadowed; a minimal reproduction is sketched after this list.

In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py`:
- Around lines 273-300: The VSWA branch sets blocks_per_window via calculate_max_num_blocks_from_cpp but never initializes the legacy attributes blocks_in_primary_pool and blocks_in_secondary_pool, causing an AttributeError downstream. Fix this by computing conservative (minimum) values from the blocks_per_window mapping returned by calculate_max_num_blocks_from_cpp, i.e. take the minimum primary and secondary block counts across all windows, and assign them to self.blocks_in_primary_pool and self.blocks_in_secondary_pool inside the is_vswa branch so the code paths that still expect these legacy attributes (calculate_max_num_blocks_from_cpp, max_attention_window_vec, flashinfer.py, demollm.py, interface.py, rocket.py, dsa.py) continue to work; a sketch of this derivation also follows the list.
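
As a rough illustration of the clamping suggested in the first _util.py comment above, here is a sketch; the signature and the exact budget formula are assumptions for illustration, and only the variable names come from the comment:

def _cal_max_memory_sketch(total_gpu_memory: int, fraction: float,
                           peak_memory: int, kv_size_per_token: int) -> int:
    # Fraction-based KV-cache budget, minus the memory already consumed at peak.
    available_kv_mem = int(total_gpu_memory * fraction) - peak_memory
    # Clamp to zero so a heavily utilized GPU cannot produce a negative budget
    # and, in turn, a negative or otherwise invalid max_tokens.
    available_kv_mem = max(available_kv_mem, 0)
    return available_kv_mem // kv_size_per_token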
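
A minimal reproduction of the shadowing problem from the third _util.py comment; the class and method here are hypothetical, while the is_vswa helper matches the one discussed in this review:

def is_vswa(kv_cache_config) -> bool:
    max_attention_window = kv_cache_config.max_attention_window
    return bool(max_attention_window) and len(set(max_attention_window)) > 1

class CapacityConfiguratorSketch:
    def __init__(self, kv_cache_config):
        self._kv_cache_config = kv_cache_config

    def configure(self) -> bool:
        # Buggy form: `is_vswa = is_vswa(self._kv_cache_config)` makes is_vswa a
        # local name for the whole method, so the call on the right-hand side
        # raises UnboundLocalError. Renaming the local keeps the module-level
        # helper reachable:
        vswa_enabled = is_vswa(self._kv_cache_config)
        return vswa_enabled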
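
And a sketch of the conservative derivation suggested in the resource_manager.py comment; the shape of blocks_per_window (window size mapped to primary/secondary block counts) is an assumption made for illustration:

from typing import Dict, Tuple

def derive_legacy_pool_sizes(
        blocks_per_window: Dict[int, Tuple[int, int]]) -> Tuple[int, int]:
    # Take the minimum primary and secondary block counts across all windows
    # so the legacy attributes remain a safe lower bound for every window.
    blocks_in_primary_pool = min(primary for primary, _ in blocks_per_window.values())
    blocks_in_secondary_pool = min(secondary for _, secondary in blocks_per_window.values())
    return blocks_in_primary_pool, blocks_in_secondary_pool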
🧹 Nitpick comments (3)
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1)

36-38: Remove the shadowing is_mla import.
is_mla is imported from both ._util and .config_utils; the latter shadows the former and makes intent unclear. Keep one source (likely config_utils) to avoid ambiguity. As per coding guidelines, avoid shadowing variables declared in an outer scope in Python.

♻️ Minimal fix
-from ._util import (KvCacheCreator, create_py_executor_instance,
-                    instantiate_sampler, is_mla, validate_feature_combination)
+from ._util import (KvCacheCreator, create_py_executor_instance,
+                    instantiate_sampler, validate_feature_combination)
tensorrt_llm/_torch/pyexecutor/_util.py (2)

2-2: Use module namespace imports for typing/speculative utilities

The new from typing import ... and from ..speculative import get_spec_decoder imports break the repo’s namespace-import rule. Please switch to module imports and qualify usages (e.g., typing.Dict, typing.Optional, speculative.get_spec_decoder).

♻️ Suggested update
-import os
-from typing import Dict, Optional
+import os
+import typing
@@
-from ..speculative import get_spec_decoder
+from .. import speculative
@@
-        return get_spec_decoder(sampler_args, engine.spec_config)
+        return speculative.get_spec_decoder(sampler_args, engine.spec_config)

As per coding guidelines, Always maintain the namespace when importing Python modules, even if only one class or function from a module is used.

Also applies to: 26-27, 681-683


56-58: Add a docstring and return a strict bool in is_vswa

This helper can return an empty list/None instead of a bool. Add a brief Google‑style docstring and cast to bool for clarity and type stability.

♻️ Suggested update
-def is_vswa(kv_cache_config):
-    max_attention_window = kv_cache_config.max_attention_window
-    return max_attention_window and len(set(max_attention_window)) > 1
+def is_vswa(kv_cache_config) -> bool:
+    """Return True when VSWA is enabled via heterogeneous attention windows."""
+    max_attention_window = kv_cache_config.max_attention_window
+    return bool(max_attention_window) and len(set(max_attention_window)) > 1

As per coding guidelines, Use Google-style docstrings for Python classes and functions, which can be parsed by Sphinx.

@HuiGao-NV HuiGao-NV marked this pull request as draft January 29, 2026 14:02
@HuiGao-NV
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #34076 [ run ] triggered by Bot. Commit: 8a83e02

@tensorrt-cicd
Collaborator

PR_Github #34064 [ run ] completed with state ABORTED. Commit: 3ee143d
/LLM/main/L0_MergeRequest_PR pipeline #26280 completed with status: 'ABORTED'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@tensorrt-cicd
Collaborator

PR_Github #34076 [ run ] completed with state SUCCESS. Commit: 8a83e02
/LLM/main/L0_MergeRequest_PR pipeline #26290 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@HuiGao-NV
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #34159 [ run ] triggered by Bot. Commit: 66feb7f

@HuiGao-NV
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #34159 [ run ] completed with state SUCCESS. Commit: 66feb7f
/LLM/main/L0_MergeRequest_PR pipeline #26356 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@tensorrt-cicd
Collaborator

PR_Github #34189 [ run ] triggered by Bot. Commit: 6b3feac

Signed-off-by: Hui Gao <[email protected]>
@HuiGao-NV
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #34205 [ run ] triggered by Bot. Commit: df5de48

@tensorrt-cicd
Collaborator

PR_Github #34189 [ run ] completed with state ABORTED. Commit: 6b3feac
LLM/main/L0_MergeRequest_PR #26382 (Blue Ocean) completed with status: ABORTED

@tensorrt-cicd
Collaborator

PR_Github #34205 [ run ] completed with state SUCCESS. Commit: df5de48
/LLM/main/L0_MergeRequest_PR pipeline #26392 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Signed-off-by: Hui Gao <[email protected]>
@HuiGao-NV
Collaborator Author

/bot run --stage-list "RTX5090-PyTorch-1"

@tensorrt-cicd
Collaborator

PR_Github #34222 [ run ] triggered by Bot. Commit: 8675b1c

@tensorrt-cicd
Collaborator

PR_Github #34222 [ run ] completed with state DISABLED
CI server is currently disabled for scheduled maintenance. Estimated completion time: 7 AM PST on 1/30.
