[TRTLLM-9802][feat] Accelerate L0 torch compile test by reducing num … #11140
base: main
Conversation
…samples of GSM8K Signed-off-by: Jin Li <[email protected]>
📝 Walkthrough
This pull request refactors test execution modes and parametrization across the accuracy testing framework. The changes introduce a torch_compile_fast test mode alongside the existing eager and torch_compile variants, together with a num_samples override so the fast variants evaluate GSM8K on a reduced sample set.
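As a rough illustration of the parametrization pattern the walkthrough describes (the mode ids below match the node-ID suffixes seen in the test lists, but the helper names and wiring here are assumptions, not the PR's actual code):

import pytest

# Node-ID suffixes referenced throughout this review: eager, torch_compile, torch_compile_fast.
COMPILE_MODES = [
    pytest.param((False, False), id="eager"),
    pytest.param((True, False), id="torch_compile"),
    pytest.param((True, True), id="torch_compile_fast"),
]

@pytest.mark.parametrize("compile_mode", COMPILE_MODES)
def test_gsm8k_sketch(compile_mode):
    torch_compile_enabled, fast_mode = compile_mode
    # A fast variant caps the GSM8K sample count (50 mirrors the reviewer's
    # suggested override below); the full variants keep the dataset default.
    num_samples = 50 if fast_mode else None
    # ...build the LLM with torch_compile_enabled, then:
    # task.evaluate(llm, num_samples=num_samples)
    assert num_samples in (None, 50)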
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ❌ 3 failed checks (2 warnings, 1 inconclusive)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
tests/integration/defs/accuracy/accuracy_core.py (2)
176-182: ⚠️ Potential issue | 🟠 Major — Validate num_samples overrides before computing hypothesis thresholds.
Negative values will produce invalid math (and `num_samples or ...` also masks an accidental 0). Consider an explicit `None` check plus a positive-value guard.
🛡️ Suggested fix
-        return HypothesisTestingParams(
-            ref_accuracy=entry.get("accuracy"),
-            alpha=entry.get("alpha", self.ALPHA),
-            beta=entry.get("beta", self.BETA),
-            sigma=entry.get("sigma", self.SIGMA),
-            num_samples=num_samples
-            or entry.get("num_samples", self.NUM_SAMPLES),
-            higher_is_better=entry.get("higher_is_better",
-                                       self.HIGHER_IS_BETTER))
+        resolved_num_samples = (entry.get("num_samples", self.NUM_SAMPLES)
+                                if num_samples is None else num_samples)
+        if resolved_num_samples is None or resolved_num_samples <= 0:
+            raise ValueError("num_samples must be a positive integer")
+        return HypothesisTestingParams(
+            ref_accuracy=entry.get("accuracy"),
+            alpha=entry.get("alpha", self.ALPHA),
+            beta=entry.get("beta", self.BETA),
+            sigma=entry.get("sigma", self.SIGMA),
+            num_samples=resolved_num_samples,
+            higher_is_better=entry.get("higher_is_better",
+                                       self.HIGHER_IS_BETTER))
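For illustration, a caller-side view of how the guarded override would behave (the `task` object and call pattern here are assumed, not part of the diff):

# Explicit overrides win, omission falls back to the entry/NUM_SAMPLES default,
# and non-positive values now fail loudly instead of silently passing through.
params = task.get_hypothesis_testing_params(num_samples=50)  # override to 50 samples
params = task.get_hypothesis_testing_params()                # entry value or NUM_SAMPLES
# task.get_hypothesis_testing_params(num_samples=0) would raise ValueError under the guard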
148-160: ⚠️ Potential issue | 🟡 Minor — Document the new num_samples parameter in the docstring.
As per coding guidelines, use Google-style docstrings for Python classes and functions, which can be parsed by Sphinx.
📝 Docstring update
     def get_hypothesis_testing_params(self,
                                       num_samples: Optional[int] = None,
                                       **acc_specs) -> HypothesisTestingParams:
         """Get hypothesis testing parameters via accuracy specifications.

         Args:
+            num_samples: Optional override for the number of samples.
             acc_specs: Accuracy specifications, currently including:

tests/integration/defs/accuracy/test_llm_api_pytorch.py (3)
1-1: ⚠️ Potential issue | 🟡 Minor — Update SPDX copyright year to 2026.
The file was modified in 2026 but the header still cites 2025.
🧾 Proposed fix
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

As per coding guidelines, all TensorRT-LLM source files should contain an NVIDIA copyright header with the year of latest meaningful modification.
1513-1569: ⚠️ Potential issue | 🟡 Minor — Wire `fast_mode` into evaluation (or drop it here).
`fast_mode` is parametrized but unused (Ruff ARG002), so the fast variants still run the full GSM8K evaluation.
💡 Suggested fix
-            task.evaluate(llm)
+            task.evaluate(llm, num_samples=50 if fast_mode else None)
742-779: ⚠️ Potential issue | 🔴 Critical — Add the `fast_mode` parameter to the `test_fp4_tp2pp2` function signature.
The `@pytest.mark.parametrize` decorator at lines 743–744 provides `fast_mode`, but the function signature at line 747 doesn't accept it. The function body uses `fast_mode` at lines 769 and 778, which will raise a pytest argument error and trigger Ruff F821.
Fix
-    def test_fp4_tp2pp2(self, enable_gemm_allreduce_fusion, torch_compile):
+    def test_fp4_tp2pp2(self, enable_gemm_allreduce_fusion, torch_compile,
+                        fast_mode):
🤖 Fix all issues with AI agents
In `@tests/integration/defs/.test_durations`:
- Around line 306-321: The durations file is missing 42 entries for the
torch_compile_fast parameter variants (e.g., keys similar to
"accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_bfloat16[attn_backend=FLASHINFER-torch_compile]");
add corresponding keys that replace "torch_compile" with "torch_compile_fast"
for each existing test variant and set their durations by copying or deriving
from the matching torch_compile entries so tests in l0_b200.yml,
l0_dgx_b200.yml, l0_dgx_h100.yml, l0_gb200_multi_gpus.yml, l0_h100.yml and
l0_rtx_pro_6000.yml have explicit timeouts.
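A minimal sketch of one way to backfill those keys, assuming each fast variant can initially reuse its torch_compile counterpart's duration as an estimate (the copy heuristic, and handling only keys whose last parameter is the compile mode, are assumptions; real values should come from a timed run):

import json

DUR_PATH = "tests/integration/defs/.test_durations"

with open(DUR_PATH) as f:
    durations = json.load(f)

# Derive placeholder durations for fast-compile variants from the matching
# torch_compile entries; adjust the key pattern if the compile mode is not
# the last parameter in a given test id.
derived = {
    key.replace("-torch_compile]", "-torch_compile_fast]"): value
    for key, value in durations.items()
    if key.endswith("-torch_compile]")
}
durations.update({k: v for k, v in derived.items() if k not in durations})

with open(DUR_PATH, "w") as f:
    # Match the existing file's formatting conventions as needed.
    json.dump(durations, f, indent=2, sort_keys=True)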
In `@tests/integration/test_lists/qa/llm_function_rtx6k.txt`:
- Around line 1-40: The RTX6K test list is missing the parametrized
"torch_compile_fast" node variant produced by the test functions test_bfloat16
and test_nvfp4; either add the corresponding "torch_compile_fast" entries for
each parametrization to this list or explicitly document/guard the
parametrization in the tests to skip/disable the fast compile variant on this
GPU. Locate the parametrization for TestDeepSeekV3Lite::test_bfloat16 and
test_nvfp4 and either (A) append the generated torch_compile_fast cases to the
RTX6K list, or (B) change the test parametrization or add a GPU-target check to
prevent generating torch_compile_fast for RTX6K and update the test list to
match that behavior.
In `@tests/integration/test_lists/test-db/l0_h100.yml`:
- Around line 52-63: The .test_durations file is missing entries for the new
torch_compile_fast parametrizations; add duration estimates for each listed test
variant (or copy/derive the value from the corresponding eager/torch_compile
baseline) so they don't use default timeouts: add entries for
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_bfloat16[attn_backend=TRTLLM-torch_compile_fast],
::TestLlama3_1_8BInstruct::test_fp8[fp8kv=False-attn_backend=TRTLLM-torch_compile_fast],
::TestLlama3_1_8BInstruct::test_fp8[fp8kv=True-attn_backend=TRTLLM-torch_compile_fast],
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_fp8_block_scales[mtp=disable-fp8kv=True-attention_dp=False-cuda_graph=True-overlap_scheduler=True-torch_compile_fast],
::TestDeepSeekV3Lite::test_fp8_block_scales[mtp=vanilla-fp8kv=True-attention_dp=False-cuda_graph=True-overlap_scheduler=True-torch_compile_fast],
and ::TestQwen3_30B_A3B::test_fp8[latency-torch_compile_fast]; ensure naming in
.test_durations exactly matches these test ids and use the same units/format as
existing entries.
| "accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_bfloat16[attn_backend=FLASHINFER-eager]": 307.12596721109, | ||
| "accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_bfloat16[attn_backend=FLASHINFER-torch_compile]": 443.91388061689213, | ||
| "accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_bfloat16[attn_backend=TRTLLM-eager]": 191.10617867391557, | ||
| "accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_bfloat16[attn_backend=TRTLLM-torch_compile]": 166.85348949534819, | ||
| "accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_chunked_prefill[attn_backend=FLASHINFER]": 167.15153613401344, | ||
| "accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_chunked_prefill[attn_backend=TRTLLM]": 90.12104846700095, | ||
| "accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_eagle3[eagle3_one_model=False-overlap_scheduler=False]": 1112.0988524899585, | ||
| "accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_eagle3[eagle3_one_model=True-overlap_scheduler=True]": 979.2759481471148, | ||
| "accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=False-attn_backend=FLASHINFER-torch_compile=False]": 237.24446990108117, | ||
| "accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=False-attn_backend=FLASHINFER-torch_compile=True]": 226.39608797896653, | ||
| "accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=False-attn_backend=TRTLLM-torch_compile=False]": 174.38962662010454, | ||
| "accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=False-attn_backend=TRTLLM-torch_compile=True]": 313.69273760309443, | ||
| "accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=True-attn_backend=FLASHINFER-torch_compile=False]": 409.8932851999998, | ||
| "accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=True-attn_backend=FLASHINFER-torch_compile=True]": 344.8807112099603, | ||
| "accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=True-attn_backend=TRTLLM-torch_compile=False]": 103.82129427790642, | ||
| "accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=True-attn_backend=TRTLLM-torch_compile=True]": 164.91815144987777, | ||
| "accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=False-attn_backend=FLASHINFER-eager]": 237.24446990108117, | ||
| "accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=False-attn_backend=FLASHINFER-torch_compile]": 226.39608797896653, | ||
| "accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=False-attn_backend=TRTLLM-eager]": 174.38962662010454, | ||
| "accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=False-attn_backend=TRTLLM-torch_compile]": 313.69273760309443, | ||
| "accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=True-attn_backend=FLASHINFER-eager]": 409.8932851999998, | ||
| "accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=True-attn_backend=FLASHINFER-torch_compile]": 344.8807112099603, | ||
| "accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=True-attn_backend=TRTLLM-eager]": 103.82129427790642, | ||
| "accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=True-attn_backend=TRTLLM-torch_compile]": 164.91815144987777, |
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
python - <<'PY'
import json, pathlib
dur = json.load(open("tests/integration/defs/.test_durations"))
missing = []
for path in pathlib.Path("tests/integration/test_lists").rglob("*"):
if path.suffix not in {".txt", ".yml"}:
continue
for raw in path.read_text().splitlines():
line = raw.strip()
if not line or line.startswith("#"):
continue
if line.startswith("- "):
line = line[2:].strip()
if "torch_compile_fast" in line and line not in dur:
missing.append(f"{path}:{line}")
print("missing:", len(missing))
print("\n".join(missing[:50]))
PY
Repository: NVIDIA/TensorRT-LLM
Length of output: 9086
🏁 Script executed:
# Check test_lists directory structure
find tests/integration/test_lists -type f \( -name "*.txt" -o -name "*.yml" \) | head -20
Repository: NVIDIA/TensorRT-LLM
Length of output: 1142
🏁 Script executed:
# Check a sample of test_lists content to see torch_compile_fast usage
find tests/integration/test_lists -type f \( -name "*.txt" -o -name "*.yml" \) -exec grep -l "torch_compile_fast" {} \; | head -5
Repository: NVIDIA/TensorRT-LLM
Length of output: 381
🏁 Script executed:
# Get count of torch_compile_fast entries in test_lists
find tests/integration/test_lists -type f \( -name "*.txt" -o -name "*.yml" \) -exec grep -c "torch_compile_fast" {} + | awk '{sum+=$1} END {print "Total torch_compile_fast mentions:", sum}'
Repository: NVIDIA/TensorRT-LLM
Length of output: 99
🏁 Script executed:
# Check the durations file structure and existing keys
python - <<'PY'
import json
dur = json.load(open("tests/integration/defs/.test_durations"))
print(f"Total duration entries: {len(dur)}")
torch_compile_keys = [k for k in dur.keys() if "torch_compile" in k]
torch_compile_fast_keys = [k for k in dur.keys() if "torch_compile_fast" in k]
print(f"Keys with 'torch_compile': {len(torch_compile_keys)}")
print(f"Keys with 'torch_compile_fast': {len(torch_compile_fast_keys)}")
if torch_compile_fast_keys:
print("Sample torch_compile_fast keys:")
for key in torch_compile_fast_keys[:3]:
print(f" {key}")
PY
Repository: NVIDIA/TensorRT-LLM
Length of output: 154
Add duration entries for 42 missing torch_compile_fast test cases.
The durations file currently has no entries for torch_compile_fast parameter variants, but 42 test cases with this parameter exist across test list files (l0_b200.yml, l0_dgx_b200.yml, l0_dgx_h100.yml, l0_gb200_multi_gpus.yml, l0_h100.yml, l0_rtx_pro_6000.yml). Without these duration keys, tests will fall back to default timeouts, risking failures or excessive delays.
🤖 Prompt for AI Agents
In `@tests/integration/defs/.test_durations` around lines 306 - 321, The durations
file is missing 42 entries for the torch_compile_fast parameter variants (e.g.,
keys similar to
"accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_bfloat16[attn_backend=FLASHINFER-torch_compile]");
add corresponding keys that replace "torch_compile" with "torch_compile_fast"
for each existing test variant and set their durations by copying or deriving
from the matching torch_compile entries so tests in l0_b200.yml,
l0_dgx_b200.yml, l0_dgx_h100.yml, l0_gb200_multi_gpus.yml, l0_h100.yml and
l0_rtx_pro_6000.yml have explicit timeouts.
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=0-attention_dp=False-cuda_graph=False-overlap_scheduler=False-eager-enable_chunked_prefill=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=0-attention_dp=False-cuda_graph=False-overlap_scheduler=False-torch_compile-enable_chunked_prefill=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=0-attention_dp=True-cuda_graph=False-overlap_scheduler=False-eager-enable_chunked_prefill=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=0-attention_dp=True-cuda_graph=False-overlap_scheduler=False-torch_compile-enable_chunked_prefill=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=0-attention_dp=False-cuda_graph=True-overlap_scheduler=False-eager-enable_chunked_prefill=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=0-attention_dp=False-cuda_graph=True-overlap_scheduler=False-torch_compile-enable_chunked_prefill=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=0-attention_dp=False-cuda_graph=False-overlap_scheduler=True-eager-enable_chunked_prefill=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=0-attention_dp=False-cuda_graph=False-overlap_scheduler=True-torch_compile-enable_chunked_prefill=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=0-attention_dp=True-cuda_graph=True-overlap_scheduler=True-eager-enable_chunked_prefill=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=0-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile-enable_chunked_prefill=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=2-attention_dp=False-cuda_graph=False-overlap_scheduler=False-eager-enable_chunked_prefill=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=2-attention_dp=False-cuda_graph=False-overlap_scheduler=False-torch_compile-enable_chunked_prefill=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=2-attention_dp=True-cuda_graph=False-overlap_scheduler=False-eager-enable_chunked_prefill=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=2-attention_dp=True-cuda_graph=False-overlap_scheduler=False-torch_compile-enable_chunked_prefill=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=2-attention_dp=False-cuda_graph=True-overlap_scheduler=False-eager-enable_chunked_prefill=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=2-attention_dp=False-cuda_graph=True-overlap_scheduler=False-torch_compile-enable_chunked_prefill=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=2-attention_dp=False-cuda_graph=False-overlap_scheduler=True-eager-enable_chunked_prefill=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=2-attention_dp=False-cuda_graph=False-overlap_scheduler=True-torch_compile-enable_chunked_prefill=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=2-attention_dp=True-cuda_graph=True-overlap_scheduler=True-eager-enable_chunked_prefill=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=2-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile-enable_chunked_prefill=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[moe_backend=CUTLASS-mtp_nextn=0-fp8kv=False-attention_dp=False-cuda_graph=False-overlap_scheduler=False-eager]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[moe_backend=CUTLASS-mtp_nextn=0-fp8kv=False-attention_dp=False-cuda_graph=False-overlap_scheduler=False-torch_compile]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[moe_backend=CUTLASS-mtp_nextn=0-fp8kv=True-attention_dp=False-cuda_graph=False-overlap_scheduler=False-eager]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[moe_backend=CUTLASS-mtp_nextn=0-fp8kv=True-attention_dp=False-cuda_graph=False-overlap_scheduler=False-torch_compile]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[moe_backend=CUTLASS-mtp_nextn=0-fp8kv=False-attention_dp=True-cuda_graph=False-overlap_scheduler=False-eager]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[moe_backend=CUTLASS-mtp_nextn=0-fp8kv=False-attention_dp=True-cuda_graph=False-overlap_scheduler=False-torch_compile]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[moe_backend=CUTLASS-mtp_nextn=0-fp8kv=False-attention_dp=False-cuda_graph=True-overlap_scheduler=False-eager]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[moe_backend=CUTLASS-mtp_nextn=0-fp8kv=False-attention_dp=False-cuda_graph=True-overlap_scheduler=False-torch_compile]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[moe_backend=CUTLASS-mtp_nextn=0-fp8kv=False-attention_dp=False-cuda_graph=False-overlap_scheduler=True-eager]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[moe_backend=CUTLASS-mtp_nextn=0-fp8kv=False-attention_dp=False-cuda_graph=False-overlap_scheduler=True-torch_compile]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[moe_backend=CUTLASS-mtp_nextn=0-fp8kv=True-attention_dp=True-cuda_graph=True-overlap_scheduler=True-eager]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[moe_backend=CUTLASS-mtp_nextn=0-fp8kv=True-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[moe_backend=CUTLASS-mtp_nextn=2-fp8kv=False-attention_dp=False-cuda_graph=False-overlap_scheduler=False-eager]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[moe_backend=CUTLASS-mtp_nextn=2-fp8kv=False-attention_dp=False-cuda_graph=False-overlap_scheduler=False-torch_compile]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[moe_backend=CUTLASS-mtp_nextn=2-fp8kv=True-attention_dp=False-cuda_graph=False-overlap_scheduler=False-eager]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[moe_backend=CUTLASS-mtp_nextn=2-fp8kv=True-attention_dp=False-cuda_graph=False-overlap_scheduler=False-torch_compile]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[moe_backend=CUTLASS-mtp_nextn=2-fp8kv=False-attention_dp=False-cuda_graph=True-overlap_scheduler=False-eager]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[moe_backend=CUTLASS-mtp_nextn=2-fp8kv=False-attention_dp=False-cuda_graph=True-overlap_scheduler=False-torch_compile]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[moe_backend=CUTLASS-mtp_nextn=2-fp8kv=True-attention_dp=False-cuda_graph=True-overlap_scheduler=True-eager]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[moe_backend=CUTLASS-mtp_nextn=2-fp8kv=True-attention_dp=False-cuda_graph=True-overlap_scheduler=True-torch_compile]
🧩 Analysis chain
🏁 Script executed:
find tests -name "test_llm_api_pytorch.py" -type f
Repository: NVIDIA/TensorRT-LLM
Length of output: 118
🏁 Script executed:
rg -n "class TestDeepSeekV3Lite" tests -A 5Repository: NVIDIA/TensorRT-LLM
Length of output: 1347
🏁 Script executed:
rg -n "torch_compile|torch_compile_fast|eager" tests/accuracy/test_llm_api_pytorch.py | head -50Repository: NVIDIA/TensorRT-LLM
Length of output: 141
🏁 Script executed:
rg -n "@pytest.mark.parametrize" tests/accuracy/test_llm_api_pytorch.py | grep -E "torch_compile|eager"Repository: NVIDIA/TensorRT-LLM
Length of output: 141
🏁 Script executed:
rg -n "torch_compile|torch_compile_fast|eager" tests/integration/defs/accuracy/test_llm_api_pytorch.py | head -100Repository: NVIDIA/TensorRT-LLM
Length of output: 6716
🏁 Script executed:
rg -n "class TestDeepSeekV3Lite" tests/integration/defs/accuracy/test_llm_api_pytorch.py -A 50 | head -100Repository: NVIDIA/TensorRT-LLM
Length of output: 2945
🏁 Script executed:
sed -n '1403,1500p' tests/integration/defs/accuracy/test_llm_api_pytorch.py
Repository: NVIDIA/TensorRT-LLM
Length of output: 5076
The test list is missing torch_compile_fast variants that the parametrization generates.
The parametrization in test_bfloat16 and test_nvfp4 creates three node ID variants (eager, torch_compile, torch_compile_fast), but this RTX6K list contains only eager and torch_compile entries. The torch_compile_fast variants will not run on RTX6K despite being generated by the code. Confirm whether this exclusion is intentional for this GPU target.
🤖 Prompt for AI Agents
In `@tests/integration/test_lists/qa/llm_function_rtx6k.txt` around lines 1 - 40,
The RTX6K test list is missing the parametrized "torch_compile_fast" node
variant produced by the test functions test_bfloat16 and test_nvfp4; either add
the corresponding "torch_compile_fast" entries for each parametrization to this
list or explicitly document/guard the parametrization in the tests to
skip/disable the fast compile variant on this GPU. Locate the parametrization
for TestDeepSeekV3Lite::test_bfloat16 and test_nvfp4 and either (A) append the
generated torch_compile_fast cases to the RTX6K list, or (B) change the test
parametrization or add a GPU-target check to prevent generating
torch_compile_fast for RTX6K and update the test list to match that behavior.
- accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_bfloat16[attn_backend=TRTLLM-eager]
- accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_bfloat16[attn_backend=TRTLLM-torch_compile_fast]
- accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_chunked_prefill[attn_backend=TRTLLM] TIMEOUT (90)
- accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_dummy_load_format
- accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_guided_decoding[xgrammar]
- accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_guided_decoding_with_eagle3[xgrammar-eagle3_one_model=True]
- accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_guided_decoding_with_eagle3[xgrammar-eagle3_one_model=False]
- accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=False-attn_backend=TRTLLM-torch_compile=False]
- accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=False-attn_backend=TRTLLM-torch_compile=True]
- accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=True-attn_backend=TRTLLM-torch_compile=False]
- accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=True-attn_backend=TRTLLM-torch_compile=True]
- accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=False-attn_backend=TRTLLM-eager]
- accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=False-attn_backend=TRTLLM-torch_compile_fast]
- accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=True-attn_backend=TRTLLM-eager]
- accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=True-attn_backend=TRTLLM-torch_compile_fast]
- accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_eagle3[sampler_async_worker=False-eagle3_one_model=False-overlap_scheduler=False]
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
cd tests/integration || exit 1
# First check if files exist
if [ ! -f "defs/.test_durations" ]; then
echo "ERROR: .test_durations not found"
exit 1
fi
if [ ! -f "test_lists/test-db/l0_h100.yml" ]; then
echo "ERROR: l0_h100.yml not found"
exit 1
fi
# Run the verification script from the comment
python3 - <<'PY'
import json
import pathlib
dur_path = pathlib.Path("defs/.test_durations")
test_list_path = pathlib.Path("test_lists/test-db/l0_h100.yml")
if not dur_path.exists():
print(f"ERROR: {dur_path} does not exist")
exit(1)
if not test_list_path.exists():
print(f"ERROR: {test_list_path} does not exist")
exit(1)
try:
dur = json.load(open(dur_path))
except Exception as e:
print(f"ERROR loading durations: {e}")
exit(1)
missing = []
found = []
torch_compile_fast_lines = []
for line_num, raw in enumerate(test_list_path.read_text().splitlines(), 1):
line = raw.strip()
if not line or line.startswith("#"):
continue
if line.startswith("- "):
line = line[2:].strip()
if "torch_compile_fast" in line:
torch_compile_fast_lines.append((line_num, line))
if line in dur:
found.append(line)
else:
missing.append(line)
print(f"Total torch_compile_fast entries found: {len(torch_compile_fast_lines)}")
print(f"Entries in .test_durations: {len(found)}")
print(f"Missing from .test_durations: {len(missing)}")
print()
if missing:
print("MISSING ENTRIES:")
for item in missing[:5]: # Show first 5
print(f" {item}")
if len(missing) > 5:
print(f" ... and {len(missing) - 5} more")
PY
Repository: NVIDIA/TensorRT-LLM
Length of output: 917
🏁 Script executed:
#!/bin/bash
cd tests/integration || exit 1
# Check the structure of .test_durations
echo "=== .test_durations file info ==="
wc -l defs/.test_durations
echo ""
echo "=== First few entries ==="
python3 -c "
import json
dur = json.load(open('defs/.test_durations'))
count = 0
for key in list(dur.keys())[:5]:
print(f'{key}: {dur[key]}')
count += 1
print(f'... Total entries in .test_durations: {len(dur)}')
"Repository: NVIDIA/TensorRT-LLM
Length of output: 590
🏁 Script executed:
#!/bin/bash
cd tests/integration || exit 1
python3 - <<'PY'
import json
import pathlib
dur_path = pathlib.Path("defs/.test_durations")
dur = json.load(open(dur_path))
# Check if base versions (without torch_compile_fast) exist
test_bases = [
"accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_bfloat16[attn_backend=TRTLLM-eager]",
"accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=False-attn_backend=TRTLLM-eager]",
"accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_fp8_block_scales[mtp=disable-fp8kv=True-attention_dp=False-cuda_graph=True-overlap_scheduler=True]",
]
print("=== Checking base versions (without torch_compile_fast) ===")
for base in test_bases:
if base in dur:
print(f"✓ FOUND: {base}")
print(f" Duration: {dur[base]}")
else:
print(f"✗ NOT FOUND: {base}")
# Try partial matching
partial_matches = [k for k in dur if base.split("::")[1] in k and "torch_compile" not in k]
if partial_matches:
print(f" But found similar: {partial_matches[0] if partial_matches else 'none'}")
print()
PY
Repository: NVIDIA/TensorRT-LLM
Length of output: 717
🏁 Script executed:
#!/bin/bash
cd tests/integration || exit 1
# Check if .test_durations was modified or if l0_h100.yml is new
if [ -d ".git" ]; then
git status defs/.test_durations test_lists/test-db/l0_h100.yml 2>&1 | head -20
fi
# Also check for comments or patterns in the test_durations about torch_compile
if grep -i "torch_compile" defs/.test_durations > /dev/null 2>&1; then
echo "torch_compile found in .test_durations"
grep -i "torch_compile" defs/.test_durations
else
echo "torch_compile NOT found in .test_durations"
fi
Repository: NVIDIA/TensorRT-LLM
Length of output: 1458
🏁 Script executed:
#!/bin/bash
cd tests/integration || exit 1
# Check what variants exist in the actual test parametrization
python3 - <<'PY'
import pathlib
import re
test_file = pathlib.Path("accuracy/test_llm_api_pytorch.py")
if test_file.exists():
content = test_file.read_text()
# Look for torch_compile references
torch_compile_refs = re.findall(r'torch_compile[_\w]*', content)
unique_refs = sorted(set(torch_compile_refs))
print("torch_compile variants found in test file:")
for ref in unique_refs:
print(f" - {ref}")
else:
print("test_llm_api_pytorch.py not found")
PY
Repository: NVIDIA/TensorRT-LLM
Length of output: 96
🏁 Script executed:
#!/bin/bash
cd tests/integration || exit 1
python3 - <<'PY'
import pathlib
test_list_path = pathlib.Path("test_lists/test-db/l0_h100.yml")
lines = test_list_path.read_text().splitlines()
# Lines mentioned: 52-63, 66-69, 80-81, 132-133
affected_ranges = [(52, 63), (66, 69), (80, 81), (132, 133)]
all_torch_compile_fast = []
for start, end in affected_ranges:
print(f"\n=== Lines {start}-{end} ===")
for i in range(start-1, min(end, len(lines))):
line = lines[i].strip()
if line and not line.startswith("#"):
if "torch_compile_fast" in line:
all_torch_compile_fast.append(line)
print(f"Line {i+1}: {line[:100]}...")
print(f"\n=== SUMMARY ===")
print(f"Total torch_compile_fast entries in affected ranges: {len(all_torch_compile_fast)}")
PY
Repository: NVIDIA/TensorRT-LLM
Length of output: 1003
Add missing torch_compile_fast entries to .test_durations.
The torch_compile_fast parametrization has been added to l0_h100.yml (7 entries across lines 52–63, 66–69, 80–81, 132–133), but none of these variants are present in .test_durations. Tests without duration estimates will use default timeouts, which may not reflect the actual runtime of this optimization variant and cause unexpected failures. Add duration entries for all torch_compile_fast variants or derive them from the corresponding eager or torch_compile baselines:
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_bfloat16[attn_backend=TRTLLM-torch_compile_fast]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=False-attn_backend=TRTLLM-torch_compile_fast]
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8[fp8kv=True-attn_backend=TRTLLM-torch_compile_fast]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_fp8_block_scales[mtp=disable-fp8kv=True-attention_dp=False-cuda_graph=True-overlap_scheduler=True-torch_compile_fast]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_fp8_block_scales[mtp=vanilla-fp8kv=True-attention_dp=False-cuda_graph=True-overlap_scheduler=True-torch_compile_fast]
accuracy/test_llm_api_pytorch.py::TestQwen3_30B_A3B::test_fp8[latency-torch_compile_fast]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_fp8_block_scales[mtp=disable-fp8kv=True-attention_dp=False-cuda_graph=True-overlap_scheduler=True-torch_compile_fast] (line 133)
🤖 Prompt for AI Agents
In `@tests/integration/test_lists/test-db/l0_h100.yml` around lines 52 - 63, The
.test_durations file is missing entries for the new torch_compile_fast
parametrizations; add duration estimates for each listed test variant (or
copy/derive the value from the corresponding eager/torch_compile baseline) so
they don't use default timeouts: add entries for
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_bfloat16[attn_backend=TRTLLM-torch_compile_fast],
::TestLlama3_1_8BInstruct::test_fp8[fp8kv=False-attn_backend=TRTLLM-torch_compile_fast],
::TestLlama3_1_8BInstruct::test_fp8[fp8kv=True-attn_backend=TRTLLM-torch_compile_fast],
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_fp8_block_scales[mtp=disable-fp8kv=True-attention_dp=False-cuda_graph=True-overlap_scheduler=True-torch_compile_fast],
::TestDeepSeekV3Lite::test_fp8_block_scales[mtp=vanilla-fp8kv=True-attention_dp=False-cuda_graph=True-overlap_scheduler=True-torch_compile_fast],
and ::TestQwen3_30B_A3B::test_fp8[latency-torch_compile_fast]; ensure naming in
.test_durations exactly matches these test ids and use the same units/format as
existing entries.
…samples of GSM8K
Summary by CodeRabbit
Release Notes
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.
Details
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
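For example, `/bot run --stage-list "A10-PyTorch-1" --disable-fail-fast` launches only that stage and keeps the pipeline going past individual failures; the stage name is just the illustrative value used in the option descriptions above.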
kill
`kill` : Kill all running builds associated with pull request.
skip
`skip --comment COMMENT` : Skip testing for latest commit on pull request.
`--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
`reuse-pipeline` : Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.