
Conversation

@liji-nv
Collaborator

@liji-nv liji-nv commented Jan 30, 2026

…samples of GSM8K

Summary by CodeRabbit

Release Notes

  • Tests
    • Enhanced test infrastructure with new execution modes (eager and torch_compile_fast) for improved performance testing coverage.
    • Added support for configurable sampling parameters in accuracy evaluations, enabling flexible test iteration control.
    • Expanded test parameterization across multiple GPU configurations for comprehensive compatibility validation.


Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR follows the TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update the tava architecture diagram if there is a significant design change in the PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Supported backends: pytorch, cpp, tensorrt, triton. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
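
For example, to run only one test stage, restrict a run to certain GPU types, or reuse the last pipeline's results (flags as documented above; stage and GPU names are illustrative):

/bot run --stage-list "A10-PyTorch-1"
/bot run --gpu-type "A30, H100_PCIe" --disable-fail-fast
/bot run --reuse-test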

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause the top of tree to break.

@coderabbitai
Contributor

coderabbitai bot commented Jan 30, 2026

📝 Walkthrough

This pull request refactors test execution modes and parametrization across the accuracy testing framework. The changes introduce a fast_mode parameter and a num_samples propagation mechanism to support fast test execution paths, shift test configurations from explicit torch_compile flags to standardized modes (eager, torch_compile_fast), and reorganize test duration mappings and test list configurations to reflect the new parametrization scheme.
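
As a rough illustration of the mode-based parametrization described above, the pytest side could look like the following minimal sketch (the exact decorator shape, test names, and fixture wiring in the PR may differ):

import pytest

# Three node-ID variants per test: eager, torch_compile, torch_compile_fast.
# fast_mode is the flag that later selects a reduced GSM8K sample count.
@pytest.mark.parametrize(
    "torch_compile,fast_mode",
    [
        pytest.param(False, False, id="eager"),
        pytest.param(True, False, id="torch_compile"),
        pytest.param(True, True, id="torch_compile_fast"),
    ],
)
def test_bfloat16(torch_compile, fast_mode):
    ...  # build the LLM with or without torch.compile and run the accuracy task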

Changes

  • Test Duration Mappings (tests/integration/defs/.test_durations): Reorganized duration entry keys from parameter-based identifiers to new forms with explicit mode flags (eager, torch_compile_fast, chunked_prefill variants), keeping the numerical values while updating the mapping schema.
  • Accuracy Evaluation Core (tests/integration/defs/accuracy/accuracy_core.py): Added an optional num_samples parameter to get_hypothesis_testing_params() and AccuracyTask.evaluate() to enable external control of the hypothesis-testing sample size, with fallback to per-entry or default values (see the sketch after this list).
  • PyTorch Accuracy Tests (tests/integration/defs/accuracy/test_llm_api_pytorch.py): Extended test signatures with a fast_mode parameter across numerous test methods; updated _get_default_torch_compile_config() to accept an enable_piecewise_cuda_graph parameter; added conditional logic to pass a reduced sample count (50 samples) during fast-mode evaluation.
  • Accuracy Test Lists (tests/integration/test_lists/qa/llm_function_core.txt, llm_function_rtx6k.txt): Replaced per-flag test case identifiers with new parametrized variants using eager and torch_compile modes, altering test naming patterns for the Llama, DeepSeek, and Qwen model blocks.
  • GPU Test Database Configurations (tests/integration/test_lists/test-db/l0_b200.yml, l0_b300.yml, l0_dgx_b200.yml, l0_dgx_b300.yml, l0_dgx_h100.yml, l0_dgx_h200.yml, l0_gb200_multi_gpus.yml, l0_gb300_multi_gpus.yml, l0_h100.yml, l0_rtx_pro_6000.yml): Systematically replaced test entries using explicit torch_compile boolean flags with new standardized mode-based configurations (eager, torch_compile_fast); reorganized backend/precision/parallelism combinations; added ISOLATION markers to select test variants.
  • Test Waiver List (tests/integration/test_lists/waives.txt): Updated waived test entries to the new parametrization scheme, replacing torch_compile=False/True with eager or implicit torch_compile flags.
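
A minimal sketch of the num_samples propagation described for accuracy_core.py (the class default and the spec lookup are assumptions here; only the parameter names and the fallback behavior come from the summary and the review excerpts below):

from typing import Optional

class AccuracyTask:
    NUM_SAMPLES = 256  # illustrative class default, not the real value

    def get_hypothesis_testing_params(self, num_samples: Optional[int] = None,
                                      **acc_specs):
        entry = acc_specs  # stand-in for the real per-entry accuracy spec lookup
        # Caller override first, then the per-entry value, then the class default.
        # (The review below suggests an explicit None check plus a positive-value
        # guard instead of a bare `or`, which would mask an accidental 0.)
        return num_samples or entry.get("num_samples", self.NUM_SAMPLES)

    def evaluate(self, llm, num_samples: Optional[int] = None, **acc_specs):
        n = self.get_hypothesis_testing_params(num_samples=num_samples, **acc_specs)
        print(f"evaluating with {n} samples")  # the real method runs the benchmark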

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ❌ 3
❌ Failed checks (2 warnings, 1 inconclusive)
  • Description check: ⚠️ Warning. Explanation: The PR description is almost entirely blank - it only contains the repository template with no filled-in details about what was changed, why, what tests cover the changes, or confirmation of the PR checklist items. Resolution: Fill in the Description, Test Coverage, and PR Checklist sections to explain the changes, list affected tests, and confirm adherence to coding guidelines and testing requirements.
  • Docstring Coverage: ⚠️ Warning. Explanation: Docstring coverage is 6.45%, which is insufficient; the required threshold is 80.00%. Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title check: ❓ Inconclusive. Explanation: The title is partially related to the changeset - it mentions accelerating L0 torch compile tests by reducing samples, which aligns with adding a fast_mode parameter and num_samples override to tests, but it is truncated and doesn't fully convey the parametrization changes. Resolution: Complete the truncated title (appears to end with an ellipsis) and clarify that it also involves adding fast_mode test parametrization beyond just reducing samples.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
tests/integration/defs/accuracy/accuracy_core.py (2)

176-182: ⚠️ Potential issue | 🟠 Major

Validate num_samples overrides before computing hypothesis thresholds.
Negative values will produce invalid math (and num_samples or ... also masks accidental 0). Consider an explicit None check plus a positive-value guard.

🛡️ Suggested fix
-        return HypothesisTestingParams(
-            ref_accuracy=entry.get("accuracy"),
-            alpha=entry.get("alpha", self.ALPHA),
-            beta=entry.get("beta", self.BETA),
-            sigma=entry.get("sigma", self.SIGMA),
-            num_samples=num_samples
-            or entry.get("num_samples", self.NUM_SAMPLES),
-            higher_is_better=entry.get("higher_is_better",
-                                       self.HIGHER_IS_BETTER))
+        resolved_num_samples = (entry.get("num_samples", self.NUM_SAMPLES)
+                                if num_samples is None else num_samples)
+        if resolved_num_samples is None or resolved_num_samples <= 0:
+            raise ValueError("num_samples must be a positive integer")
+        return HypothesisTestingParams(
+            ref_accuracy=entry.get("accuracy"),
+            alpha=entry.get("alpha", self.ALPHA),
+            beta=entry.get("beta", self.BETA),
+            sigma=entry.get("sigma", self.SIGMA),
+            num_samples=resolved_num_samples,
+            higher_is_better=entry.get("higher_is_better",
+                                       self.HIGHER_IS_BETTER))

148-160: ⚠️ Potential issue | 🟡 Minor

Document the new num_samples parameter in the docstring.

📝 Docstring update
 def get_hypothesis_testing_params(self,
                                   num_samples: Optional[int] = None,
                                   **acc_specs) -> HypothesisTestingParams:
     """Get hypothesis testing parameters via accuracy specifications.

     Args:
+        num_samples: Optional override for the number of samples.
         acc_specs: Accuracy specifications, currently including:
As per coding guidelines, use Google-style docstrings for Python classes and functions, which can be parsed by Sphinx.
tests/integration/defs/accuracy/test_llm_api_pytorch.py (3)

1-1: ⚠️ Potential issue | 🟡 Minor

Update SPDX copyright year to 2026.

The file was modified in 2026 but the header still cites 2025.

🧾 Proposed fix
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

As per coding guidelines, all TensorRT-LLM source files should contain an NVIDIA copyright header with the year of latest meaningful modification.


1513-1569: ⚠️ Potential issue | 🟡 Minor

Wire fast_mode into evaluation (or drop it here).

fast_mode is parametrized but unused (Ruff ARG002), so fast variants still run full GSM8K evaluation.

💡 Suggested fix
-            task.evaluate(llm)
+            task.evaluate(llm, num_samples=50 if fast_mode else None)
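
Put together, a wired-up fast variant would look roughly like the sketch below (the import path, model constants, and test class are assumptions for illustration; only the num_samples expression comes from the suggestion above):

import pytest
from tensorrt_llm import LLM
from defs.accuracy.accuracy_core import GSM8K  # import path is an assumption

class TestSomeModel:
    MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative
    MODEL_PATH = "/models/Llama-3.1-8B-Instruct"     # illustrative

    @pytest.mark.parametrize("fast_mode", [False, True], ids=["full", "fast"])
    def test_auto_dtype(self, fast_mode):
        with LLM(self.MODEL_PATH) as llm:
            task = GSM8K(self.MODEL_NAME)
            # 50 samples in fast mode, otherwise the task default.
            task.evaluate(llm, num_samples=50 if fast_mode else None)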

742-779: ⚠️ Potential issue | 🔴 Critical

Add fast_mode parameter to test_fp4_tp2pp2 function signature.

The @pytest.mark.parametrize decorator at lines 743–744 provides fast_mode, but the function signature at line 747 doesn't accept it. The function body uses fast_mode at lines 769 and 778, which will raise a pytest argument error and trigger Ruff F821.

Fix
-def test_fp4_tp2pp2(self, enable_gemm_allreduce_fusion, torch_compile):
+def test_fp4_tp2pp2(self, enable_gemm_allreduce_fusion, torch_compile,
+                    fast_mode):
🤖 Fix all issues with AI agents
In `@tests/integration/defs/.test_durations`:
- Around line 306-321: The durations file is missing 42 entries for the
torch_compile_fast parameter variants (e.g., keys similar to
"accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_bfloat16[attn_backend=FLASHINFER-torch_compile]");
add corresponding keys that replace "torch_compile" with "torch_compile_fast"
for each existing test variant and set their durations by copying or deriving
from the matching torch_compile entries so tests in l0_b200.yml,
l0_dgx_b200.yml, l0_dgx_h100.yml, l0_gb200_multi_gpus.yml, l0_h100.yml and
l0_rtx_pro_6000.yml have explicit timeouts.

In `@tests/integration/test_lists/qa/llm_function_rtx6k.txt`:
- Around line 1-40: The RTX6K test list is missing the parametrized
"torch_compile_fast" node variant produced by the test functions test_bfloat16
and test_nvfp4; either add the corresponding "torch_compile_fast" entries for
each parametrization to this list or explicitly document/guard the
parametrization in the tests to skip/disable the fast compile variant on this
GPU. Locate the parametrization for TestDeepSeekV3Lite::test_bfloat16 and
test_nvfp4 and either (A) append the generated torch_compile_fast cases to the
RTX6K list, or (B) change the test parametrization or add a GPU-target check to
prevent generating torch_compile_fast for RTX6K and update the test list to
match that behavior.

In `@tests/integration/test_lists/test-db/l0_h100.yml`:
- Around line 52-63: The .test_durations file is missing entries for the new
torch_compile_fast parametrizations; add duration estimates for each listed test
variant (or copy/derive the value from the corresponding eager/torch_compile
baseline) so they don't use default timeouts: add entries for
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_bfloat16[attn_backend=TRTLLM-torch_compile_fast],
::TestLlama3_1_8BInstruct::test_fp8[fp8kv=False-attn_backend=TRTLLM-torch_compile_fast],
::TestLlama3_1_8BInstruct::test_fp8[fp8kv=True-attn_backend=TRTLLM-torch_compile_fast],
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_fp8_block_scales[mtp=disable-fp8kv=True-attention_dp=False-cuda_graph=True-overlap_scheduler=True-torch_compile_fast],
::TestDeepSeekV3Lite::test_fp8_block_scales[mtp=vanilla-fp8kv=True-attention_dp=False-cuda_graph=True-overlap_scheduler=True-torch_compile_fast],
and ::TestQwen3_30B_A3B::test_fp8[latency-torch_compile_fast]; ensure naming in
.test_durations exactly matches these test ids and use the same units/format as
existing entries.

Comment on lines +306 to +321
"accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_bfloat16[attn_backend=FLASHINFER-eager]": 307.12596721109,
"accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_bfloat16[attn_backend=FLASHINFER-torch_compile]": 443.91388061689213,
"accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_bfloat16[attn_backend=TRTLLM-eager]": 191.10617867391557,
"accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_bfloat16[attn_backend=TRTLLM-torch_compile]": 166.85348949534819,
"accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_chunked_prefill[attn_backend=FLASHINFER]": 167.15153613401344,
"accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_chunked_prefill[attn_backend=TRTLLM]": 90.12104846700095,
"accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_eagle3[eagle3_one_model=False-overlap_scheduler=False]": 1112.0988524899585,
"accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_eagle3[eagle3_one_model=True-overlap_scheduler=True]": 979.2759481471148,
"accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=False-attn_backend=FLASHINFER-torch_compile=False]": 237.24446990108117,
"accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=False-attn_backend=FLASHINFER-torch_compile=True]": 226.39608797896653,
"accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=False-attn_backend=TRTLLM-torch_compile=False]": 174.38962662010454,
"accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=False-attn_backend=TRTLLM-torch_compile=True]": 313.69273760309443,
"accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=True-attn_backend=FLASHINFER-torch_compile=False]": 409.8932851999998,
"accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=True-attn_backend=FLASHINFER-torch_compile=True]": 344.8807112099603,
"accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=True-attn_backend=TRTLLM-torch_compile=False]": 103.82129427790642,
"accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=True-attn_backend=TRTLLM-torch_compile=True]": 164.91815144987777,
"accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=False-attn_backend=FLASHINFER-eager]": 237.24446990108117,
"accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=False-attn_backend=FLASHINFER-torch_compile]": 226.39608797896653,
"accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=False-attn_backend=TRTLLM-eager]": 174.38962662010454,
"accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=False-attn_backend=TRTLLM-torch_compile]": 313.69273760309443,
"accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=True-attn_backend=FLASHINFER-eager]": 409.8932851999998,
"accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=True-attn_backend=FLASHINFER-torch_compile]": 344.8807112099603,
"accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=True-attn_backend=TRTLLM-eager]": 103.82129427790642,
"accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=True-attn_backend=TRTLLM-torch_compile]": 164.91815144987777,
Contributor


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
python - <<'PY'
import json, pathlib
dur = json.load(open("tests/integration/defs/.test_durations"))
missing = []
for path in pathlib.Path("tests/integration/test_lists").rglob("*"):
    if path.suffix not in {".txt", ".yml"}:
        continue
    for raw in path.read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("- "):
            line = line[2:].strip()
        if "torch_compile_fast" in line and line not in dur:
            missing.append(f"{path}:{line}")
print("missing:", len(missing))
print("\n".join(missing[:50]))
PY

Repository: NVIDIA/TensorRT-LLM

Length of output: 9086


🏁 Script executed:

# Check test_lists directory structure
find tests/integration/test_lists -type f \( -name "*.txt" -o -name "*.yml" \) | head -20

Repository: NVIDIA/TensorRT-LLM

Length of output: 1142


🏁 Script executed:

# Check a sample of test_lists content to see torch_compile_fast usage
find tests/integration/test_lists -type f \( -name "*.txt" -o -name "*.yml" \) -exec grep -l "torch_compile_fast" {} \; | head -5

Repository: NVIDIA/TensorRT-LLM

Length of output: 381


🏁 Script executed:

# Get count of torch_compile_fast entries in test_lists
find tests/integration/test_lists -type f \( -name "*.txt" -o -name "*.yml" \) -exec grep -c "torch_compile_fast" {} + | awk '{sum+=$1} END {print "Total torch_compile_fast mentions:", sum}'

Repository: NVIDIA/TensorRT-LLM

Length of output: 99


🏁 Script executed:

# Check the durations file structure and existing keys
python - <<'PY'
import json
dur = json.load(open("tests/integration/defs/.test_durations"))
print(f"Total duration entries: {len(dur)}")
torch_compile_keys = [k for k in dur.keys() if "torch_compile" in k]
torch_compile_fast_keys = [k for k in dur.keys() if "torch_compile_fast" in k]
print(f"Keys with 'torch_compile': {len(torch_compile_keys)}")
print(f"Keys with 'torch_compile_fast': {len(torch_compile_fast_keys)}")
if torch_compile_fast_keys:
    print("Sample torch_compile_fast keys:")
    for key in torch_compile_fast_keys[:3]:
        print(f"  {key}")
PY

Repository: NVIDIA/TensorRT-LLM

Length of output: 154


Add duration entries for 42 missing torch_compile_fast test cases.

The durations file currently has no entries for torch_compile_fast parameter variants, but 42 test cases with this parameter exist across test list files (l0_b200.yml, l0_dgx_b200.yml, l0_dgx_h100.yml, l0_gb200_multi_gpus.yml, l0_h100.yml, l0_rtx_pro_6000.yml). Without these duration keys, tests will fall back to default timeouts, risking failures or excessive delays.

🤖 Prompt for AI Agents
In `@tests/integration/defs/.test_durations` around lines 306 - 321, The durations
file is missing 42 entries for the torch_compile_fast parameter variants (e.g.,
keys similar to
"accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_bfloat16[attn_backend=FLASHINFER-torch_compile]");
add corresponding keys that replace "torch_compile" with "torch_compile_fast"
for each existing test variant and set their durations by copying or deriving
from the matching torch_compile entries so tests in l0_b200.yml,
l0_dgx_b200.yml, l0_dgx_h100.yml, l0_gb200_multi_gpus.yml, l0_h100.yml and
l0_rtx_pro_6000.yml have explicit timeouts.
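
One way to seed the missing entries is to copy each matching torch_compile duration, along the lines of this sketch (the key-rewriting rule and the output formatting are assumptions; real fast-mode timings should replace the copied values once measured):

import json

path = "tests/integration/defs/.test_durations"
with open(path) as f:
    durations = json.load(f)

# Derive a torch_compile_fast key from every torch_compile key that lacks one.
derived = {}
for key, seconds in durations.items():
    if "-torch_compile]" in key or "-torch_compile-" in key:
        fast_key = key.replace("-torch_compile", "-torch_compile_fast", 1)
        if fast_key not in durations:
            derived[fast_key] = seconds

durations.update(derived)
with open(path, "w") as f:
    json.dump(durations, f, indent=4, sort_keys=True)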

Comment on lines +1 to +40
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=0-attention_dp=False-cuda_graph=False-overlap_scheduler=False-eager-enable_chunked_prefill=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=0-attention_dp=False-cuda_graph=False-overlap_scheduler=False-torch_compile-enable_chunked_prefill=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=0-attention_dp=True-cuda_graph=False-overlap_scheduler=False-eager-enable_chunked_prefill=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=0-attention_dp=True-cuda_graph=False-overlap_scheduler=False-torch_compile-enable_chunked_prefill=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=0-attention_dp=False-cuda_graph=True-overlap_scheduler=False-eager-enable_chunked_prefill=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=0-attention_dp=False-cuda_graph=True-overlap_scheduler=False-torch_compile-enable_chunked_prefill=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=0-attention_dp=False-cuda_graph=False-overlap_scheduler=True-eager-enable_chunked_prefill=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=0-attention_dp=False-cuda_graph=False-overlap_scheduler=True-torch_compile-enable_chunked_prefill=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=0-attention_dp=True-cuda_graph=True-overlap_scheduler=True-eager-enable_chunked_prefill=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=0-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile-enable_chunked_prefill=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=2-attention_dp=False-cuda_graph=False-overlap_scheduler=False-eager-enable_chunked_prefill=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=2-attention_dp=False-cuda_graph=False-overlap_scheduler=False-torch_compile-enable_chunked_prefill=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=2-attention_dp=True-cuda_graph=False-overlap_scheduler=False-eager-enable_chunked_prefill=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=2-attention_dp=True-cuda_graph=False-overlap_scheduler=False-torch_compile-enable_chunked_prefill=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=2-attention_dp=False-cuda_graph=True-overlap_scheduler=False-eager-enable_chunked_prefill=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=2-attention_dp=False-cuda_graph=True-overlap_scheduler=False-torch_compile-enable_chunked_prefill=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=2-attention_dp=False-cuda_graph=False-overlap_scheduler=True-eager-enable_chunked_prefill=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=2-attention_dp=False-cuda_graph=False-overlap_scheduler=True-torch_compile-enable_chunked_prefill=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=2-attention_dp=True-cuda_graph=True-overlap_scheduler=True-eager-enable_chunked_prefill=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=2-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile-enable_chunked_prefill=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[moe_backend=CUTLASS-mtp_nextn=0-fp8kv=False-attention_dp=False-cuda_graph=False-overlap_scheduler=False-eager]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[moe_backend=CUTLASS-mtp_nextn=0-fp8kv=False-attention_dp=False-cuda_graph=False-overlap_scheduler=False-torch_compile]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[moe_backend=CUTLASS-mtp_nextn=0-fp8kv=True-attention_dp=False-cuda_graph=False-overlap_scheduler=False-eager]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[moe_backend=CUTLASS-mtp_nextn=0-fp8kv=True-attention_dp=False-cuda_graph=False-overlap_scheduler=False-torch_compile]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[moe_backend=CUTLASS-mtp_nextn=0-fp8kv=False-attention_dp=True-cuda_graph=False-overlap_scheduler=False-eager]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[moe_backend=CUTLASS-mtp_nextn=0-fp8kv=False-attention_dp=True-cuda_graph=False-overlap_scheduler=False-torch_compile]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[moe_backend=CUTLASS-mtp_nextn=0-fp8kv=False-attention_dp=False-cuda_graph=True-overlap_scheduler=False-eager]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[moe_backend=CUTLASS-mtp_nextn=0-fp8kv=False-attention_dp=False-cuda_graph=True-overlap_scheduler=False-torch_compile]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[moe_backend=CUTLASS-mtp_nextn=0-fp8kv=False-attention_dp=False-cuda_graph=False-overlap_scheduler=True-eager]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[moe_backend=CUTLASS-mtp_nextn=0-fp8kv=False-attention_dp=False-cuda_graph=False-overlap_scheduler=True-torch_compile]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[moe_backend=CUTLASS-mtp_nextn=0-fp8kv=True-attention_dp=True-cuda_graph=True-overlap_scheduler=True-eager]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[moe_backend=CUTLASS-mtp_nextn=0-fp8kv=True-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[moe_backend=CUTLASS-mtp_nextn=2-fp8kv=False-attention_dp=False-cuda_graph=False-overlap_scheduler=False-eager]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[moe_backend=CUTLASS-mtp_nextn=2-fp8kv=False-attention_dp=False-cuda_graph=False-overlap_scheduler=False-torch_compile]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[moe_backend=CUTLASS-mtp_nextn=2-fp8kv=True-attention_dp=False-cuda_graph=False-overlap_scheduler=False-eager]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[moe_backend=CUTLASS-mtp_nextn=2-fp8kv=True-attention_dp=False-cuda_graph=False-overlap_scheduler=False-torch_compile]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[moe_backend=CUTLASS-mtp_nextn=2-fp8kv=False-attention_dp=False-cuda_graph=True-overlap_scheduler=False-eager]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[moe_backend=CUTLASS-mtp_nextn=2-fp8kv=False-attention_dp=False-cuda_graph=True-overlap_scheduler=False-torch_compile]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[moe_backend=CUTLASS-mtp_nextn=2-fp8kv=True-attention_dp=False-cuda_graph=True-overlap_scheduler=True-eager]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[moe_backend=CUTLASS-mtp_nextn=2-fp8kv=True-attention_dp=False-cuda_graph=True-overlap_scheduler=True-torch_compile]
Contributor


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

find tests -name "test_llm_api_pytorch.py" -type f

Repository: NVIDIA/TensorRT-LLM

Length of output: 118


🏁 Script executed:

rg -n "class TestDeepSeekV3Lite" tests -A 5

Repository: NVIDIA/TensorRT-LLM

Length of output: 1347


🏁 Script executed:

rg -n "torch_compile|torch_compile_fast|eager" tests/accuracy/test_llm_api_pytorch.py | head -50

Repository: NVIDIA/TensorRT-LLM

Length of output: 141


🏁 Script executed:

rg -n "@pytest.mark.parametrize" tests/accuracy/test_llm_api_pytorch.py | grep -E "torch_compile|eager"

Repository: NVIDIA/TensorRT-LLM

Length of output: 141


🏁 Script executed:

rg -n "torch_compile|torch_compile_fast|eager" tests/integration/defs/accuracy/test_llm_api_pytorch.py | head -100

Repository: NVIDIA/TensorRT-LLM

Length of output: 6716


🏁 Script executed:

rg -n "class TestDeepSeekV3Lite" tests/integration/defs/accuracy/test_llm_api_pytorch.py -A 50 | head -100

Repository: NVIDIA/TensorRT-LLM

Length of output: 2945


🏁 Script executed:

sed -n '1403,1500p' tests/integration/defs/accuracy/test_llm_api_pytorch.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 5076


The test list is missing torch_compile_fast variants that the parametrization generates.

The parametrization in test_bfloat16 and test_nvfp4 creates three node ID variants (eager, torch_compile, torch_compile_fast), but this RTX6K list contains only eager and torch_compile entries. The torch_compile_fast variants will not run on RTX6K despite being generated by the code. Confirm whether this exclusion is intentional for this GPU target.

🤖 Prompt for AI Agents
In `@tests/integration/test_lists/qa/llm_function_rtx6k.txt` around lines 1 - 40,
The RTX6K test list is missing the parametrized "torch_compile_fast" node
variant produced by the test functions test_bfloat16 and test_nvfp4; either add
the corresponding "torch_compile_fast" entries for each parametrization to this
list or explicitly document/guard the parametrization in the tests to
skip/disable the fast compile variant on this GPU. Locate the parametrization
for TestDeepSeekV3Lite::test_bfloat16 and test_nvfp4 and either (A) append the
generated torch_compile_fast cases to the RTX6K list, or (B) change the test
parametrization or add a GPU-target check to prevent generating
torch_compile_fast for RTX6K and update the test list to match that behavior.
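
If option (B) is chosen, a GPU-target guard inside the parametrized tests could look roughly like this (a sketch only; it uses torch's device name as a stand-in because the defs' own GPU helpers are not assumed here, and the RTX name substring may need adjusting):

import pytest
import torch

def skip_fast_compile_on_rtx6k(fast_mode: bool) -> None:
    # Skip the torch_compile_fast variant on RTX PRO 6000 targets.
    gpu_name = torch.cuda.get_device_name(0) if torch.cuda.is_available() else ""
    if fast_mode and "RTX PRO 6000" in gpu_name.upper():
        pytest.skip("torch_compile_fast is not tracked in the RTX6K QA list")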

Comment on lines +52 to 63
- accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_bfloat16[attn_backend=TRTLLM-eager]
- accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_bfloat16[attn_backend=TRTLLM-torch_compile_fast]
- accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_chunked_prefill[attn_backend=TRTLLM] TIMEOUT (90)
- accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_dummy_load_format
- accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_guided_decoding[xgrammar]
- accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_guided_decoding_with_eagle3[xgrammar-eagle3_one_model=True]
- accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_guided_decoding_with_eagle3[xgrammar-eagle3_one_model=False]
- accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=False-attn_backend=TRTLLM-torch_compile=False]
- accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=False-attn_backend=TRTLLM-torch_compile=True]
- accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=True-attn_backend=TRTLLM-torch_compile=False]
- accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=True-attn_backend=TRTLLM-torch_compile=True]
- accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=False-attn_backend=TRTLLM-eager]
- accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=False-attn_backend=TRTLLM-torch_compile_fast]
- accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=True-attn_backend=TRTLLM-eager]
- accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=True-attn_backend=TRTLLM-torch_compile_fast]
- accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_eagle3[sampler_async_worker=False-eagle3_one_model=False-overlap_scheduler=False]
Contributor


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
cd tests/integration || exit 1

# First check if files exist
if [ ! -f "defs/.test_durations" ]; then
    echo "ERROR: .test_durations not found"
    exit 1
fi

if [ ! -f "test_lists/test-db/l0_h100.yml" ]; then
    echo "ERROR: l0_h100.yml not found"
    exit 1
fi

# Run the verification script from the comment
python3 - <<'PY'
import json
import pathlib

dur_path = pathlib.Path("defs/.test_durations")
test_list_path = pathlib.Path("test_lists/test-db/l0_h100.yml")

if not dur_path.exists():
    print(f"ERROR: {dur_path} does not exist")
    exit(1)

if not test_list_path.exists():
    print(f"ERROR: {test_list_path} does not exist")
    exit(1)

try:
    dur = json.load(open(dur_path))
except Exception as e:
    print(f"ERROR loading durations: {e}")
    exit(1)

missing = []
found = []
torch_compile_fast_lines = []

for line_num, raw in enumerate(test_list_path.read_text().splitlines(), 1):
    line = raw.strip()
    if not line or line.startswith("#"):
        continue
    if line.startswith("- "):
        line = line[2:].strip()
    
    if "torch_compile_fast" in line:
        torch_compile_fast_lines.append((line_num, line))
        if line in dur:
            found.append(line)
        else:
            missing.append(line)

print(f"Total torch_compile_fast entries found: {len(torch_compile_fast_lines)}")
print(f"Entries in .test_durations: {len(found)}")
print(f"Missing from .test_durations: {len(missing)}")
print()

if missing:
    print("MISSING ENTRIES:")
    for item in missing[:5]:  # Show first 5
        print(f"  {item}")
    if len(missing) > 5:
        print(f"  ... and {len(missing) - 5} more")
PY

Repository: NVIDIA/TensorRT-LLM

Length of output: 917


🏁 Script executed:

#!/bin/bash
cd tests/integration || exit 1

# Check the structure of .test_durations
echo "=== .test_durations file info ==="
wc -l defs/.test_durations
echo ""
echo "=== First few entries ==="
python3 -c "
import json
dur = json.load(open('defs/.test_durations'))
count = 0
for key in list(dur.keys())[:5]:
    print(f'{key}: {dur[key]}')
    count += 1
print(f'... Total entries in .test_durations: {len(dur)}')
"

Repository: NVIDIA/TensorRT-LLM

Length of output: 590


🏁 Script executed:

#!/bin/bash
cd tests/integration || exit 1

python3 - <<'PY'
import json
import pathlib

dur_path = pathlib.Path("defs/.test_durations")
dur = json.load(open(dur_path))

# Check if base versions (without torch_compile_fast) exist
test_bases = [
    "accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_bfloat16[attn_backend=TRTLLM-eager]",
    "accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=False-attn_backend=TRTLLM-eager]",
    "accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_fp8_block_scales[mtp=disable-fp8kv=True-attention_dp=False-cuda_graph=True-overlap_scheduler=True]",
]

print("=== Checking base versions (without torch_compile_fast) ===")
for base in test_bases:
    if base in dur:
        print(f"✓ FOUND: {base}")
        print(f"  Duration: {dur[base]}")
    else:
        print(f"✗ NOT FOUND: {base}")
        # Try partial matching
        partial_matches = [k for k in dur if base.split("::")[1] in k and "torch_compile" not in k]
        if partial_matches:
            print(f"  But found similar: {partial_matches[0] if partial_matches else 'none'}")
    print()
PY

Repository: NVIDIA/TensorRT-LLM

Length of output: 717


🏁 Script executed:

#!/bin/bash
cd tests/integration || exit 1

# Check if .test_durations was modified or if l0_h100.yml is new
if [ -d ".git" ]; then
    git status defs/.test_durations test_lists/test-db/l0_h100.yml 2>&1 | head -20
fi

# Also check for comments or patterns in the test_durations about torch_compile
if grep -i "torch_compile" defs/.test_durations > /dev/null 2>&1; then
    echo "torch_compile found in .test_durations"
    grep -i "torch_compile" defs/.test_durations
else
    echo "torch_compile NOT found in .test_durations"
fi

Repository: NVIDIA/TensorRT-LLM

Length of output: 1458


🏁 Script executed:

#!/bin/bash
cd tests/integration || exit 1

# Check what variants exist in the actual test parametrization
python3 - <<'PY'
import pathlib
import re

test_file = pathlib.Path("accuracy/test_llm_api_pytorch.py")
if test_file.exists():
    content = test_file.read_text()
    
    # Look for torch_compile references
    torch_compile_refs = re.findall(r'torch_compile[_\w]*', content)
    unique_refs = sorted(set(torch_compile_refs))
    
    print("torch_compile variants found in test file:")
    for ref in unique_refs:
        print(f"  - {ref}")
else:
    print("test_llm_api_pytorch.py not found")
PY

Repository: NVIDIA/TensorRT-LLM

Length of output: 96


🏁 Script executed:

#!/bin/bash
cd tests/integration || exit 1

python3 - <<'PY'
import pathlib

test_list_path = pathlib.Path("test_lists/test-db/l0_h100.yml")
lines = test_list_path.read_text().splitlines()

# Lines mentioned: 52-63, 66-69, 80-81, 132-133
affected_ranges = [(52, 63), (66, 69), (80, 81), (132, 133)]

all_torch_compile_fast = []

for start, end in affected_ranges:
    print(f"\n=== Lines {start}-{end} ===")
    for i in range(start-1, min(end, len(lines))):
        line = lines[i].strip()
        if line and not line.startswith("#"):
            if "torch_compile_fast" in line:
                all_torch_compile_fast.append(line)
                print(f"Line {i+1}: {line[:100]}...")

print(f"\n=== SUMMARY ===")
print(f"Total torch_compile_fast entries in affected ranges: {len(all_torch_compile_fast)}")
PY

Repository: NVIDIA/TensorRT-LLM

Length of output: 1003


Add missing torch_compile_fast entries to .test_durations.

The torch_compile_fast parametrization has been added to l0_h100.yml (7 entries across lines 52–63, 66–69, 80–81, 132–133), but none of these variants are present in .test_durations. Tests without duration estimates will use default timeouts, which may not reflect the actual runtime of this optimization variant and cause unexpected failures. Add duration entries for all torch_compile_fast variants or derive them from the corresponding eager or torch_compile baselines:

  • accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_bfloat16[attn_backend=TRTLLM-torch_compile_fast]
  • accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=False-attn_backend=TRTLLM-torch_compile_fast]
  • accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=True-attn_backend=TRTLLM-torch_compile_fast]
  • accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_fp8_block_scales[mtp=disable-fp8kv=True-attention_dp=False-cuda_graph=True-overlap_scheduler=True-torch_compile_fast]
  • accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_fp8_block_scales[mtp=vanilla-fp8kv=True-attention_dp=False-cuda_graph=True-overlap_scheduler=True-torch_compile_fast]
  • accuracy/test_llm_api_pytorch.py::TestQwen3_30B_A3B::test_fp8[latency-torch_compile_fast]
  • accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_fp8_block_scales[mtp=disable-fp8kv=True-attention_dp=False-cuda_graph=True-overlap_scheduler=True-torch_compile_fast] (line 133)
🤖 Prompt for AI Agents
In `@tests/integration/test_lists/test-db/l0_h100.yml` around lines 52 - 63, The
.test_durations file is missing entries for the new torch_compile_fast
parametrizations; add duration estimates for each listed test variant (or
copy/derive the value from the corresponding eager/torch_compile baseline) so
they don't use default timeouts: add entries for
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_bfloat16[attn_backend=TRTLLM-torch_compile_fast],
::TestLlama3_1_8BInstruct::test_fp8[fp8kv=False-attn_backend=TRTLLM-torch_compile_fast],
::TestLlama3_1_8BInstruct::test_fp8[fp8kv=True-attn_backend=TRTLLM-torch_compile_fast],
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_fp8_block_scales[mtp=disable-fp8kv=True-attention_dp=False-cuda_graph=True-overlap_scheduler=True-torch_compile_fast],
::TestDeepSeekV3Lite::test_fp8_block_scales[mtp=vanilla-fp8kv=True-attention_dp=False-cuda_graph=True-overlap_scheduler=True-torch_compile_fast],
and ::TestQwen3_30B_A3B::test_fp8[latency-torch_compile_fast]; ensure naming in
.test_durations exactly matches these test ids and use the same units/format as
existing entries.
