[None][feat] Export ONNX for DriveOS LLM #10117
📝 Walkthrough

This pull request introduces a comprehensive ONNX export pipeline for LLMs via AutoDeploy, featuring new transforms for graph optimization, custom ONNX operations, configuration handling, and visualization capabilities to support DriveOS LLM deployment.
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant User
participant AutoDeployConfig
participant Optimizer
participant Transforms as Transform<br/>Pipeline
participant ONNX as ONNX Export
participant Files as Output Files
User->>AutoDeployConfig: Create with mode=<br/>"export_driveos_llm_onnx"
AutoDeployConfig->>AutoDeployConfig: Load config from<br/>export_driveos_llm_onnx.yaml
User->>Optimizer: Build and optimize<br/>graph
rect rgb(220, 240, 255)
Note over Transforms: Transform Pipeline Execution
Optimizer->>Transforms: FuseRopeAttention<br/>→ Pattern match + fuse RoPE+attention
Transforms->>Transforms: Add context_lengths,<br/>rope_rotary_cos_sin placeholders
Optimizer->>Transforms: AdaptToDriveOSLLM<br/>→ Convert to float16, insert casts
Optimizer->>Transforms: GatherLastTokenIds<br/>→ Add token gathering
Optimizer->>Transforms: ShortReshapeAttentionOutput<br/>→ Optimize reshape nodes
end
Optimizer->>ONNX: ExportToONNX transform
rect rgb(240, 255, 240)
Note over ONNX: ONNX Export Phase
ONNX->>ONNX: Prepare dynamic shapes<br/>and placeholders
ONNX->>ONNX: Register custom ops<br/>(AttentionPlugin, GatherND, RoPE)
ONNX->>ONNX: Call torch.onnx.export<br/>with dynamo=True
end
ONNX->>Files: Export model.onnx
ONNX->>Files: Export config.json<br/>(via _config_export)
ONNX->>Files: Export processed_<br/>chat_template.json
ONNX->>Files: Copy tokenizer files<br/>(vocab, tokens, maps)
Files->>User: Return output directory<br/>with all artifacts
```

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
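For orientation, here is a minimal sketch of how the new export mode is driven from user code, assembled from the example and test files in this diff; the model id, sizes, and output path are placeholders, and the exact build entry point may differ from what is shown:

```python
from tensorrt_llm._torch.auto_deploy import LLM, AutoDeployConfig

# Selecting the new mode makes the optimizer load export_driveos_llm_onnx.yaml.
ad_config = AutoDeployConfig(
    model="Qwen/Qwen2.5-0.5B",
    mode="export_driveos_llm_onnx",
    max_batch_size=13,
    max_seq_len=4,
)
# The export_to_onnx transform writes model.onnx, config.json, and tokenizer files here.
ad_config.transforms["export_to_onnx"]["output_dir"] = "/tmp/driveos_llm_onnx"

# Building the LLM runs the transform pipeline, ending in the ONNX export stage.
llm = LLM(**ad_config.to_llm_kwargs())
```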
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 20
🧹 Nitpick comments (22)
tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py (1)
9-12: Remove unused imports or clarify their purpose.

The three new imports (`_detect_fake_mode_from_gm`, `FakeTensor`, `_extract_tensor_metadata`) are not referenced anywhere in this file. Additionally, these are private PyTorch APIs (indicated by the `_` prefix), which may change without notice in future PyTorch releases. If these imports are intended for near-term use in this file, consider adding a TODO comment explaining their purpose. Otherwise, remove them to keep the imports clean.
🔎 Apply this diff to remove the unused imports:
```diff
 from pydantic import Field
-from torch._export.utils import _detect_fake_mode_from_gm
-from torch._subclasses import FakeTensor
 from torch.fx import GraphModule, Node
-from torch.fx.passes.shape_prop import _extract_tensor_metadata

 from ...custom_ops.attention_interface import (
```

tests/unittest/_torch/auto_deploy/_utils_test/torch_attention_reference.py (1)
1-1: Verify NVIDIA copyright header presence.

According to coding guidelines, all TensorRT-LLM code should contain an NVIDIA copyright header with the year of latest meaningful modification. No copyright header is visible at the start of this file. Since this file is being modified in 2025, please verify whether the header exists or needs to be added.
As per coding guidelines, all TensorRT-LLM OSS code requires an NVIDIA copyright header.
tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.py (3)
72-80: Avoid direct tuple mutation.

Line 78 directly assigns to `cast_node.args`, which mutates a tuple in place. While this works for FX graph nodes, it's safer to use the provided `update_arg` method for consistency and clarity.

🔎 Apply this diff to use the `update_arg` method:
```diff
 num_changed = 0
 for cast_node in cast_nodes:
     if cast_node.args[1] == torch.bfloat16:
-        cast_node.args = (cast_node.args[0], torch.float16)
+        cast_node.update_arg(1, torch.float16)
         num_changed += 1
 return num_changed
```
82-83: Clarify the return type of `_to_float16`.

The method signature suggests it returns `bool`, but it calls `gm.half()` (which returns `None` implicitly). Consider either removing the return type annotation or explicitly returning a meaningful value.

🔎 Apply this diff to fix the return type:
```diff
-    def _to_float16(self, gm: GraphModule) -> bool:
+    def _to_float16(self, gm: GraphModule) -> None:
         gm.half()
```
16-28: Consider adding a docstring.

The method performs non-trivial graph manipulation. A docstring would improve maintainability by documenting its purpose, the transformation it applies, and what it returns.
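For illustration, a Google-style docstring sketch for such a graph-manipulation method; the method name, signature, and described behavior below are assumptions based on this review, not the actual code:

```python
def _insert_float16_casts(self, gm: GraphModule) -> int:  # hypothetical name/signature
    """Insert float16 casts required by the DriveOS LLM runtime.

    Walks the FX graph of ``gm`` and inserts ``aten.to`` cast nodes so that
    tensors feeding runtime-sensitive ops are in float16.

    Args:
        gm: The FX ``GraphModule`` being transformed in place.

    Returns:
        The number of cast nodes inserted.
    """
```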
examples/auto_deploy/onnx_export_llm.py (1)
3-3: Remove unused import.

The `onnxscript.opset22` import is not used in this file.

🔎 Apply this diff to remove the unused import:
```diff
 import argparse

-from onnxscript import opset22 as opset22
-
 from tensorrt_llm._torch.auto_deploy import LLM, AutoDeployConfig
```

tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_export_onnx.py (1)
12-12: Consider using `tempfile.mkdtemp()` for the test output directory.

The hardcoded `/tmp/test_ad_export_onnx_qwen2.5-0.5b` path may cause issues in environments where `/tmp` is not available or on Windows. Consider using Python's `tempfile` module for more portable tests.

🔎 Example using tempfile:
```python
import tempfile
import shutil

@pytest.mark.parametrize(
    "model, max_batch_size, max_seq_len, num_attn_ops",
    [
        ("Qwen/Qwen2.5-0.5B", 13, 4, 24),
    ],
)
def test_ad_export_onnx(
    model: str, max_batch_size: int, max_seq_len: int, num_attn_ops: int
):
    output_dir = tempfile.mkdtemp(prefix="test_ad_export_onnx_")
    try:
        ad_config = AutoDeployConfig(
            model=model,
            mode="export_driveos_llm_onnx",
            max_batch_size=max_batch_size,
            max_seq_len=max_seq_len,
        )
        ad_config.transforms["export_to_onnx"]["output_dir"] = output_dir
        # ... rest of test
    finally:
        shutil.rmtree(output_dir, ignore_errors=True)
```

tensorrt_llm/_torch/auto_deploy/config/export_driveos_llm_onnx.yaml (1)
1-3: Missing NVIDIA copyright header.

Per the coding guidelines, all TensorRT-LLM Open Source Software code (including `.py` files referenced by `.yaml` configs) should contain an NVIDIA copyright header. While YAML config files may have different requirements, consider adding a copyright header for consistency with other configuration files in the repository.

🔎 Suggested header:
```diff
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 # This is the set of transforms running in "graph" mode. In this mode, we capture the full graph
 # of the model and optimize it for inference.
 transforms:
```

tensorrt_llm/_torch/auto_deploy/transform/library/short_reshape_attention_output.py (1)
14-39: Typo in method name: "accending" → "ascending".

🔎 Fix the typo:
```diff
-    def _lookup_accending_node(self, node: Node, target, max_depth: int = 3) -> Node:
+    def _lookup_ascending_node(self, node: Node, target, max_depth: int = 3) -> Node:
         if max_depth == 0:
             return None
         if node.target == target:
             return node

         # Helper function to check a single node
         def check_node(n):
             if isinstance(n, Node):
-                result = self._lookup_accending_node(n, target, max_depth - 1)
+                result = self._lookup_ascending_node(n, target, max_depth - 1)
                 if result is not None:
                     return result
             return None
```

Also update the call sites at lines 49-50 and 56-57.
tensorrt_llm/_torch/auto_deploy/transform/graph_module_visualizer.py (3)
514-521: Broad exception handling may hide errors.

Catching bare `Exception` can mask underlying issues. Consider catching more specific exceptions (e.g., `graphviz.ExecutableNotFound`, `OSError`).

🔎 Suggested fix:
```diff
 try:
     dot.render(save_path, cleanup=True)
     print(f"✅ Diagram saved: {save_path}.{format}")
     with open(save_path + ".txt", "w") as f:
         f.write(str(graph_module.graph))
-except Exception as e:
+except (OSError, IOError) as e:
     print(f"❌ Failed to save diagram: {e}")
```
386-399: Remove commented-out debug code.

These commented-out print statements appear to be debugging artifacts. Consider removing them for cleaner code.
🔎 Clean up:
```diff
-    # Print 10 nodes with most inputs
-    # print("Nodes with most inputs:")
-    # node_inputs_sorted = sorted(node_inputs.items(), key=lambda x: len(x[1]), reverse=True)
-    # for node_name, input_list in node_inputs_sorted[:10]:
-    #     print(f"  {node_name}: {len(input_list)}")
-
     # Print 10 nodes with most outputs
     node_outputs_sorted = sorted(node_outputs.items(), key=lambda x: len(x[1]), reverse=True)
-    # print("Nodes with most outputs:")
     large_fanout_nodes: Dict[str, int] = {}
     for node_name, output_list in node_outputs_sorted[:10]:
         if len(output_list) > 10:
             large_fanout_nodes[node_name] = 0
-            # print(f"  {node_name}: {len(output_list)}")
```
617-636: Narrow the exception handling.

The bare `Exception` catch could hide unexpected errors. Consider catching `AttributeError` specifically since `get_submodule` may fail if the target path doesn't exist.

🔎 Suggested fix:
```diff
 try:
     # Try to get actual module type name
     actual_module = graph_module.get_submodule(str(target))
     module_type = actual_module.__class__.__name__
     ...
-except Exception:
+except (AttributeError, KeyError):
     # If unable to get module, fall back to original logic
     module_name = str(target).split(".")[-1] if "." in str(target) else str(target)
     return module_name
```

tensorrt_llm/_torch/auto_deploy/transform/library/_config_export.py (1)
54-57: Use logging instead of print for warnings.

The codebase appears to use `ad_logger` for logging in other transform files. Consider using the logger for consistency and better control over log levels.

🔎 Suggested approach:
```diff
+from ...utils.logger import ad_logger
 ...
-    print(
-        "Warning: head_dim not found in config, calculating as hidden_size // num_attention_heads"
-    )
+    ad_logger.warning(
+        "head_dim not found in config, calculating as hidden_size // num_attention_heads"
+    )
```

Apply the same pattern at lines 92-94, 128-130, and 146-149.
tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py (2)
160-171: Consider narrowing the exception type.

The `except Exception` clause is overly broad. Since you're accessing shape info from tensor metadata, consider catching more specific exceptions like `AttributeError`, `TypeError`, or `IndexError`.

🔎 Suggested fix:
```diff
-            except Exception as e:
+            except (AttributeError, TypeError, IndexError) as e:
                 ad_logger.error(f"  Skipping: failed to extract head_dim: {e}")
                 continue
```
176-196: Consider narrowing the exception type.

Similar to the previous block, this `except Exception` clause should catch specific exceptions for tensor metadata access.

🔎 Suggested fix:
```diff
-            except Exception as e:
+            except (AttributeError, TypeError, IndexError) as e:
                 ad_logger.error(f"  Skipping: failed to calculate num_heads: {e}")
                 continue
```

tensorrt_llm/_torch/auto_deploy/transform/library/_chat_template.py (3)
48-59: Use logger instead of print and clarify the message.

The function uses `print()` statements, and the message "Set use_prompt_tuning" seems unrelated to the `is_vlm` check. Consider using `ad_logger` and updating the message to reflect the actual VLM detection.

🔎 Suggested fix:
```diff
+from ...utils.logger import ad_logger
+
 def is_vlm(model_dir: str) -> bool:
     """Check if the model is a VLM."""
     cfg = AutoConfig.from_pretrained(model_dir, trust_remote_code=True)
     cfg_dict = cfg.to_dict()
     has_vision = "vision_config" in cfg_dict
     has_phi4_vision = "image_embd_layer" in cfg_dict.get("embd_layer", {})
     if has_vision or has_phi4_vision:
-        print("Set use_prompt_tuning to True")
+        ad_logger.debug("Detected VLM model (has vision config)")
         return True
     else:
-        print("Set use_prompt_tuning to False")
+        ad_logger.debug("Detected text-only LLM model")
         return False
```
124-154: Consider logging the first exception for debugging.

The first `except Exception` silently falls through to the fallback. Consider logging the original exception to help debug cases where neither path works.

🔎 Suggested fix:
```diff
 try:
     # Convert dataclass messages to dictionaries using asdict
     message_dicts = [asdict(msg) for msg in messages]
     return tokenizer.apply_chat_template(
         message_dicts, tokenize=False, add_generation_prompt=add_generation_prompt
     )
-except Exception:
+except Exception as e1:
     # Try fallback: convert list content to string for tokenizers that don't support multimodal
     try:
```
315-336: Use logger for consistency with the codebase.

Multiple `print()` calls should use `ad_logger` for consistency with other modules in the codebase.
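A sketch of the substitution, mirroring the `ad_logger` pattern suggested elsewhere in this review; the message text here is illustrative, not the file's actual output:

```python
from ...utils.logger import ad_logger

# Before: print(f"Processed chat template saved to {output_path}")
ad_logger.info(f"Processed chat template saved to {output_path}")
```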
411-424: Use logger instead of print and remove unused args list.The
argslist is created but always empty since all placeholders go intokwargs. Also,print()should be replaced withad_logger.🔎 Suggested fix:
- args = [] kwargs = {} placeholders = gm.graph.find_nodes(op="placeholder") for ph in placeholders: kwargs[ph.name] = ph.meta["val"] - args = tuple(args) - print("Placeholders args:") - for i, e in enumerate(args): - print(f" {i}: {placeholders[i].name:20} {e}") - - print("Placeholders kwargs:") + ad_logger.debug("Placeholders kwargs:") for k, v in kwargs.items(): - print(f" {k}: {v}") + ad_logger.debug(f" {k}: {v}")
440-461: Hardcoded magic numbers for dynamic shape bounds.

The values `16` for `rope_batch_size` and `4096` for `max_position_embeddings`/`past_len` are hardcoded. Consider deriving these from the model config or making them configurable.

🔎 Suggested approach, deriving from config:
```diff
+    max_position = getattr(gm.config, 'max_position_embeddings', 4096)
+
     dynamic_shapes["rope_rotary_cos_sin"] = {
-        0: Dim("rope_batch_size", min=1, max=16),
-        1: Dim("max_position_embeddings", min=1, max=4096),
+        0: Dim("rope_batch_size", min=1, max=cm.info.max_batch_size),
+        1: Dim("max_position_embeddings", min=1, max=max_position),
     }
     # ...
     for i in range(num_layers):
         dynamic_shapes[f"past_key_values_{i}"] = {
-            3: Dim("past_len", min=1, max=4096),
+            3: Dim("past_len", min=1, max=max_position),
         }
```
496-511: Remove the empty args tuple and dead comment.

Since `args` is always empty, passing `tuple(args)` is unnecessary. The commented-out line 508 should be removed.

🔎 Suggested fix:
```diff
     torch.onnx.export(
         gm,
-        tuple(args),
+        (),
         output_path,
         opset_version=20,
         kwargs=kwargs,
         dynamo=True,
         dynamic_shapes=dynamic_shapes,
         report=False,
         output_names=output_names,
         custom_translation_table=custom_translation_table,
     )
-    # export_output.save(output_path)
     ad_logger.info(f"Successfully exported ONNX model to {output_path}")
     return True
```
513-525: Use logger instead of print.

Replace `print()` calls with `ad_logger` for consistency.

🔎 Suggested fix:
```diff
     if reduced_vocab_size is not None:
         model_config["reduced_vocab_size"] = reduced_vocab_size
-        print(f"Added reduced_vocab_size={reduced_vocab_size} to config")
+        ad_logger.info(f"Added reduced_vocab_size={reduced_vocab_size} to config")

     config_path = os.path.join(output_dir, "config.json")
     with open(config_path, "w") as f:
         json.dump(model_config, f, indent=2)
-    print(f"Model configuration saved to {config_path}")
+    ad_logger.info(f"Model configuration saved to {config_path}")
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (24)
- .gitignore (1 hunks)
- docker/common/install_base.sh (1 hunks)
- examples/auto_deploy/onnx_export_llm.py (1 hunks)
- requirements.txt (1 hunks)
- tensorrt_llm/_torch/auto_deploy/config/export_driveos_llm_onnx.yaml (1 hunks)
- tensorrt_llm/_torch/auto_deploy/config/export_driveos_llm_onnx_debug.yaml (1 hunks)
- tensorrt_llm/_torch/auto_deploy/custom_ops/onnx_attention.py (1 hunks)
- tensorrt_llm/_torch/auto_deploy/llm_args.py (2 hunks)
- tensorrt_llm/_torch/auto_deploy/transform/graph_module_visualizer.py (1 hunks)
- tensorrt_llm/_torch/auto_deploy/transform/interface.py (1 hunks)
- tensorrt_llm/_torch/auto_deploy/transform/library/_chat_template.py (1 hunks)
- tensorrt_llm/_torch/auto_deploy/transform/library/_config_export.py (1 hunks)
- tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.py (1 hunks)
- tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py (2 hunks)
- tensorrt_llm/_torch/auto_deploy/transform/library/export_to_onnx.py (1 hunks)
- tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py (1 hunks)
- tensorrt_llm/_torch/auto_deploy/transform/library/gather_last_token_ids.py (1 hunks)
- tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py (1 hunks)
- tensorrt_llm/_torch/auto_deploy/transform/library/short_reshape_attention_output.py (1 hunks)
- tensorrt_llm/_torch/auto_deploy/transform/optimizer.py (2 hunks)
- tensorrt_llm/_torch/auto_deploy/utils/_graph.py (4 hunks)
- tests/unittest/_torch/auto_deploy/_utils_test/torch_attention_reference.py (4 hunks)
- tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_export_onnx.py (1 hunks)
- tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_fuse_rope_attention.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces. Do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used
Python files should use snake_case naming: `some_file.py`
Python classes should use PascalCase naming: `class SomeClass`
Python functions and methods should use snake_case naming: `def my_awesome_function():`
Python local variables should use snake_case naming: `my_variable = ...`
Python variable names that start with a number should be prefixed with 'k': `k_99th_percentile = ...`
Python global variables should use upper snake_case with prefix 'G': `G_MY_GLOBAL = ...`
Python constants should use upper snake_case naming: `MY_CONSTANT = ...`
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings in Python for classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except to the smallest set of errors possible
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible, using the else block for logic
Files:
tensorrt_llm/_torch/auto_deploy/transform/optimizer.py, tensorrt_llm/_torch/auto_deploy/transform/library/gather_last_token_ids.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_export_onnx.py, tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py, tensorrt_llm/_torch/auto_deploy/transform/interface.py, tensorrt_llm/_torch/auto_deploy/transform/library/short_reshape_attention_output.py, examples/auto_deploy/onnx_export_llm.py, tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.py, tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py, tensorrt_llm/_torch/auto_deploy/transform/graph_module_visualizer.py, tensorrt_llm/_torch/auto_deploy/llm_args.py, tensorrt_llm/_torch/auto_deploy/transform/library/export_to_onnx.py, tensorrt_llm/_torch/auto_deploy/custom_ops/onnx_attention.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_fuse_rope_attention.py, tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py, tensorrt_llm/_torch/auto_deploy/transform/library/_chat_template.py, tensorrt_llm/_torch/auto_deploy/transform/library/_config_export.py, tensorrt_llm/_torch/auto_deploy/utils/_graph.py, tests/unittest/_torch/auto_deploy/_utils_test/torch_attention_reference.py
**/*.{cpp,h,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification
Files:
tensorrt_llm/_torch/auto_deploy/transform/optimizer.py, tensorrt_llm/_torch/auto_deploy/transform/library/gather_last_token_ids.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_export_onnx.py, tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py, tensorrt_llm/_torch/auto_deploy/transform/interface.py, tensorrt_llm/_torch/auto_deploy/transform/library/short_reshape_attention_output.py, examples/auto_deploy/onnx_export_llm.py, tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.py, tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py, tensorrt_llm/_torch/auto_deploy/transform/graph_module_visualizer.py, tensorrt_llm/_torch/auto_deploy/llm_args.py, tensorrt_llm/_torch/auto_deploy/transform/library/export_to_onnx.py, tensorrt_llm/_torch/auto_deploy/custom_ops/onnx_attention.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_fuse_rope_attention.py, tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py, tensorrt_llm/_torch/auto_deploy/transform/library/_chat_template.py, tensorrt_llm/_torch/auto_deploy/transform/library/_config_export.py, tensorrt_llm/_torch/auto_deploy/utils/_graph.py, tests/unittest/_torch/auto_deploy/_utils_test/torch_attention_reference.py
🧠 Learnings (10)
📚 Learning: 2025-08-26T09:49:04.956Z
Learnt from: pengbowang-nv
Repo: NVIDIA/TensorRT-LLM PR: 7192
File: tests/integration/test_lists/test-db/l0_dgx_b200.yml:56-72
Timestamp: 2025-08-26T09:49:04.956Z
Learning: In TensorRT-LLM test configuration files, the test scheduling system handles wildcard matching with special rules that prevent duplicate test execution even when the same tests appear in multiple yaml files with overlapping GPU wildcards (e.g., "*b200*" and "*gb200*").
Applied to files:
tensorrt_llm/_torch/auto_deploy/config/export_driveos_llm_onnx.yaml
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_export_onnx.py, tensorrt_llm/_torch/auto_deploy/transform/library/_config_export.py
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.
Applied to files:
tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.py
📚 Learning: 2025-10-20T17:09:21.560Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/transform/library/rms_norm.py:180-182
Timestamp: 2025-10-20T17:09:21.560Z
Learning: In tensorrt_llm/_torch/auto_deploy/transform/library/rms_norm.py, the _gated_rmsnorm_replacement function does not need to cast the output of torch.ops.auto_deploy.torch_rmsnorm_gated back to the input dtype, even though the custom op returns fp32. The dtype handling is managed elsewhere or the fp32 output is acceptable for downstream consumers.
Applied to files:
tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.py
📚 Learning: 2025-10-20T16:54:09.824Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py:6-6
Timestamp: 2025-10-20T16:54:09.824Z
Learning: In tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py, the import `from ...modules.mamba.layernorm_gated import _layer_norm_fwd` is correct and should not be changed to modules.fla.layernorm_gated. The _layer_norm_fwd function exists in both modules/mamba/layernorm_gated.py and modules/fla/layernorm_gated.py, but the mamba version is the intended implementation for this use case.
Applied to files:
tensorrt_llm/_torch/auto_deploy/custom_ops/onnx_attention.py
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation with asserts for total size and TP divisibility.
Applied to files:
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_fuse_rope_attention.py
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation.
Applied to files:
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_fuse_rope_attention.py
📚 Learning: 2025-08-09T02:04:49.623Z
Learnt from: Fridah-nv
Repo: NVIDIA/TensorRT-LLM PR: 6760
File: tensorrt_llm/_torch/auto_deploy/models/quant_config_reader.py:81-98
Timestamp: 2025-08-09T02:04:49.623Z
Learning: In TensorRT-LLM's auto_deploy module, torch.dtype values in configuration dictionaries must be stored as string representations (e.g., "float16" instead of torch.float16) because OmegaConf.merge does not support torch.dtype types. These string representations are converted to actual torch.dtype objects in downstream code.
Applied to files:
tensorrt_llm/_torch/auto_deploy/transform/library/_config_export.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.
Applied to files:
tensorrt_llm/_torch/auto_deploy/transform/library/_config_export.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM's bench configuration, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which is a Dict[str, Any] that can contain default values including `cuda_graph_config`, making the fallback `llm_args["cuda_graph_config"]` safe to use.
Applied to files:
tensorrt_llm/_torch/auto_deploy/transform/library/_config_export.py
🧬 Code graph analysis (13)
tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_export_onnx.py (1)
tensorrt_llm/_torch/auto_deploy/llm_args.py (2)
AutoDeployConfig (54-339), to_llm_kwargs (315-325)
tensorrt_llm/_torch/auto_deploy/transform/library/short_reshape_attention_output.py (4)
tensorrt_llm/_torch/auto_deploy/models/factory.py (1)
ModelFactory (94-346)
tensorrt_llm/_torch/auto_deploy/shim/interface.py (1)
CachedSequenceInterface (11-92)
tensorrt_llm/_torch/auto_deploy/utils/_graph.py (1)
run_shape_prop (223-248)
tensorrt_llm/_torch/auto_deploy/transform/interface.py (6)
BaseTransform (220-507), SharedConfig (62-69), TransformInfo (124-181), TransformRegistry (510-538), register (516-523), _apply (482-495)
examples/auto_deploy/onnx_export_llm.py (1)
tensorrt_llm/_torch/auto_deploy/llm_args.py (1)
to_llm_kwargs(315-325)
tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.py (9)
tensorrt_llm/_torch/auto_deploy/models/factory.py (1)
ModelFactory (94-346)
tensorrt_llm/_torch/auto_deploy/shim/interface.py (1)
CachedSequenceInterface (11-92)
tensorrt_llm/_torch/auto_deploy/utils/node_utils.py (1)
is_op (198-221)
tensorrt_llm/_torch/auto_deploy/transform/interface.py (3)
BaseTransform (220-507), TransformInfo (124-181), _apply (482-495)
tensorrt_llm/_torch/auto_deploy/transform/library/fused_moe.py (1)
target (602-603)
tensorrt_llm/functional.py (1)
replace_all_uses_with (556-573)
tensorrt_llm/_torch/auto_deploy/transform/library/gather_last_token_ids.py (1)
_apply (119-159)
tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py (1)
_apply (156-241)
tensorrt_llm/_torch/auto_deploy/transform/library/short_reshape_attention_output.py (1)
_apply (66-113)
tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py (6)
tensorrt_llm/_torch/auto_deploy/shim/interface.py (1)
CachedSequenceInterface (11-92)
tensorrt_llm/_torch/auto_deploy/utils/_graph.py (3)
add_graph_input (251-315), add_graph_output (350-434), remove_graph_input (502-561)
tensorrt_llm/_torch/auto_deploy/utils/node_utils.py (1)
is_op (198-221)
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (1)
torch_attention (96-212)
tensorrt_llm/_torch/auto_deploy/utils/pattern_matcher.py (1)
call_function (249-276)
tensorrt_llm/_torch/auto_deploy/custom_ops/onnx_attention.py (1)
AttentionPlugin(9-25)
tensorrt_llm/_torch/auto_deploy/transform/graph_module_visualizer.py (6)
tensorrt_llm/_torch/auto_deploy/transform/interface.py (1)
get (526-528)
tensorrt_llm/_torch/auto_deploy/transform/library/fused_moe.py (1)
target (602-603)
tensorrt_llm/_utils.py (1)
numel (1002-1003)
cpp/tensorrt_llm/thop/alltoallOp.cpp (1)
output_list (73-73)
docker/common/install_base.sh (1)
cleanup (23-44)
cpp/kernels/xqa/mha_sm90.cu (1)
tokens(529-532)
tensorrt_llm/_torch/auto_deploy/llm_args.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
Field(68-95)
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_fuse_rope_attention.py (3)
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (1)
torch_attention (96-212)
tensorrt_llm/_torch/auto_deploy/export/export.py (1)
torch_export_to_gm (276-344)
tensorrt_llm/_torch/auto_deploy/transform/optimizer.py (1)
InferenceOptimizer(24-94)
tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py (1)
tensorrt_llm/_torch/autotuner.py (1)
FakeTensor(154-157)
tensorrt_llm/_torch/auto_deploy/transform/library/_chat_template.py (1)
tests/unittest/llmapi/apps/test_chat_utils.py (1)
chat_template_path(188-193)
tensorrt_llm/_torch/auto_deploy/transform/library/_config_export.py (3)
tensorrt_llm/_torch/auto_deploy/transform/interface.py (1)
get (526-528)
tensorrt_llm/_torch/auto_deploy/llm_args.py (2)
to_dict (311-313), to_dict (446-451)
tensorrt_llm/bench/benchmark/__init__.py (1)
model_type(70-71)
tensorrt_llm/_torch/auto_deploy/utils/_graph.py (1)
tensorrt_llm/_torch/auto_deploy/shim/interface.py (1)
args(28-30)
tests/unittest/_torch/auto_deploy/_utils_test/torch_attention_reference.py (1)
tensorrt_llm/builder.py (1)
default(45-50)
🪛 Ruff (0.14.8)
tensorrt_llm/_torch/auto_deploy/transform/library/gather_last_token_ids.py
122-122: Unused method argument: cm
(ARG002)
123-123: Unused method argument: factory
(ARG002)
124-124: Unused method argument: shared_config
(ARG002)
tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_export_onnx.py
12-12: Probable insecure usage of temporary file or directory: "/tmp/test_ad_export_onnx_qwen2.5-0.5b"
(S108)
tensorrt_llm/_torch/auto_deploy/transform/library/short_reshape_attention_output.py
69-69: Unused method argument: cm
(ARG002)
70-70: Unused method argument: factory
(ARG002)
71-71: Unused method argument: shared_config
(ARG002)
tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.py
88-88: Unused method argument: cm
(ARG002)
89-89: Unused method argument: factory
(ARG002)
90-90: Unused method argument: shared_config
(ARG002)
tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py
169-169: Do not catch blind exception: Exception
(BLE001)
194-194: Do not catch blind exception: Exception
(BLE001)
216-216: Unused method argument: cm
(ARG002)
281-281: Unused method argument: cm
(ARG002)
381-381: Consider (input_ids_node, *node.args[1:]) instead of concatenation
Replace with (input_ids_node, *node.args[1:])
(RUF005)
390-390: Unused method argument: factory
(ARG002)
391-391: Unused method argument: shared_config
(ARG002)
tensorrt_llm/_torch/auto_deploy/transform/graph_module_visualizer.py
77-77: Consider moving this statement to an else block
(TRY300)
191-193: Avoid specifying long messages outside the exception class
(TRY003)
433-433: Loop control variable target_name not used within loop body
(B007)
433-433: Loop control variable input_idx not used within loop body
(B007)
519-519: Do not catch blind exception: Exception
(BLE001)
632-632: Do not catch blind exception: Exception
(BLE001)
tensorrt_llm/_torch/auto_deploy/transform/library/export_to_onnx.py
408-408: Unused method argument: factory
(ARG002)
409-409: Unused method argument: shared_config
(ARG002)
530-530: Unused method argument: cm
(ARG002)
532-532: Unused method argument: shared_config
(ARG002)
tensorrt_llm/_torch/auto_deploy/custom_ops/onnx_attention.py
13-13: Unused function argument: context_lengths
(ARG001)
14-14: Unused function argument: rope_rotary_cos_sin
(ARG001)
15-15: Unused function argument: kvcache_start_index
(ARG001)
17-17: Unused function argument: enable_tree_attention
(ARG001)
18-18: Unused function argument: head_size
(ARG001)
19-19: Unused function argument: num_kv_heads
(ARG001)
20-20: Unused function argument: num_q_heads
(ARG001)
32-32: Unused function argument: context_lengths
(ARG001)
33-33: Unused function argument: rope_rotary_cos_sin
(ARG001)
34-34: Unused function argument: kvcache_start_index
(ARG001)
35-35: Unused function argument: enable_tree_attention
(ARG001)
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_fuse_rope_attention.py
61-61: Unused method argument: position_ids
(ARG002)
142-142: Unused function argument: atol
(ARG001)
143-143: Unused function argument: rtol
(ARG001)
tensorrt_llm/_torch/auto_deploy/transform/library/_chat_template.py
131-131: Do not catch blind exception: Exception
(BLE001)
149-154: Avoid specifying long messages outside the exception class
(TRY003)
251-251: Avoid specifying long messages outside the exception class
(TRY003)
257-257: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
257-257: Avoid specifying long messages outside the exception class
(TRY003)
261-263: Prefer TypeError exception for invalid type
(TRY004)
261-263: Avoid specifying long messages outside the exception class
(TRY003)
269-269: Avoid specifying long messages outside the exception class
(TRY003)
330-330: Do not catch blind exception: Exception
(BLE001)
tensorrt_llm/_torch/auto_deploy/transform/library/_config_export.py
38-38: Avoid specifying long messages outside the exception class
(TRY003)
45-45: Avoid specifying long messages outside the exception class
(TRY003)
85-85: Avoid specifying long messages outside the exception class
(TRY003)
121-121: Avoid specifying long messages outside the exception class
(TRY003)
135-135: Avoid specifying long messages outside the exception class
(TRY003)
164-166: Avoid specifying long messages outside the exception class
(TRY003)
195-195: Avoid specifying long messages outside the exception class
(TRY003)
tensorrt_llm/_torch/auto_deploy/utils/_graph.py
371-371: Avoid specifying long messages outside the exception class
(TRY003)
382-382: Consider (*tuple(current_outputs), output_node) instead of concatenation
Replace with (*tuple(current_outputs), output_node)
(RUF005)
466-469: Avoid specifying long messages outside the exception class
(TRY003)
479-482: Avoid specifying long messages outside the exception class
(TRY003)
486-489: Avoid specifying long messages outside the exception class
(TRY003)
495-497: Avoid specifying long messages outside the exception class
(TRY003)
540-544: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (35)
tests/unittest/_torch/auto_deploy/_utils_test/torch_attention_reference.py (1)
59-59: LGTM! Consistent transition to the explicit `.default` dispatcher.

The addition of `.default` to all torch op invocations correctly uses PyTorch's dispatcher pattern to explicitly call the default implementation. The changes are mechanical, consistent across all four call sites, and maintain identical arguments and behavior.

Also applies to: 100-100, 151-151, 188-188
.gitignore (1)
3-3: LGTM!

Adding `.cursor` follows the existing pattern for ignoring IDE-specific directories (`.vscode`, `.idea`).

docker/common/install_base.sh (1)
56-56: LGTM!

Adding system-level `graphviz` correctly supports the Python `graphviz` package for graph visualization in the ONNX export workflow.
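As context, the Python `graphviz` bindings only generate DOT text and shell out to the system `dot` executable when rendering, which is why the apt package is needed alongside the pip package. A minimal sketch (node and file names illustrative):

```python
import graphviz

dot = graphviz.Digraph(comment="fx graph sketch")
dot.node("input_ids")
dot.node("attention_plugin")
dot.edge("input_ids", "attention_plugin")
# render() invokes the system `dot` binary; it fails if graphviz is not installed system-wide.
dot.render("fx_graph_sketch", format="svg", cleanup=True)
```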
tensorrt_llm/_torch/auto_deploy/utils/_graph.py (3)

257-294: LGTM!

The `name_prefix` parameter is consistently applied across the placeholder name, `orig_args`, and `context`, maintaining backward compatibility with the default `"arg_"` prefix.
437-499: LGTM!

Robust helper with clear validation and descriptive error messages for both integer index and Node inputs. The detailed exception messages aid debugging despite the static analysis hints.
392-395: Incorrect `context` initialization for single-output conversion.

When converting a single-output graph to a tuple spec, `context=["output"]` creates a mismatch: there's one child spec (`_LEAF_SPEC`) but the context name doesn't correspond to it properly. This could cause issues when later appending to `out_spec.context` at line 399.

🔎 Suggested fix:
```diff
 if out_spec == _LEAF_SPEC:
-    new_out_spec = TreeSpec(type=tuple, children_specs=[_LEAF_SPEC], context=["output"])
+    new_out_spec = TreeSpec(type=tuple, children_specs=[_LEAF_SPEC, _LEAF_SPEC], context=["output", name])
     graph._codegen.pytree_info = graph._codegen.pytree_info._replace(out_spec=new_out_spec)
     out_spec = graph._codegen.pytree_info.out_spec
+    # Already added both specs, skip the append below
+    object.__setattr__(out_spec, "type", dict)
+    return
```

Alternatively, if the intent is to just wrap the existing output, ensure the context reflects the original output name.
Likely an incorrect or invalid review comment.
tensorrt_llm/_torch/auto_deploy/custom_ops/onnx_attention.py (1)
8-48: LGTM!

The `AttentionPlugin` custom op and its fake implementation correctly define the ONNX export placeholder with proper shape inference. The unused arguments (flagged by static analysis) are expected for signature matching with the target ONNX op.
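For readers unfamiliar with the pattern, here is a generic sketch of a custom op with a registered fake implementation, as used for export-only placeholder ops; the op name and shapes are illustrative, not the actual `AttentionPlugin` signature:

```python
import torch

@torch.library.custom_op("demo::attention_plugin", mutates_args=())
def attention_plugin(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # Export-only placeholder: the real kernel is supplied by the target runtime.
    return torch.empty_like(q)

@attention_plugin.register_fake
def _(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # Shape/dtype inference used while tracing: output mirrors the query tensor.
    return torch.empty_like(q)
```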
tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.py (1)

85-105: Unused method arguments flagged by static analysis.

The parameters `cm`, `factory`, and `shared_config` are part of the BaseTransform interface but unused in this implementation. This is acceptable if they're not needed for this transform, but consider documenting why they're unused. Based on learnings and the BaseTransform interface pattern, these parameters are part of the standard transform signature and may be used by other transforms; the unused parameters here are acceptable.
tensorrt_llm/_torch/auto_deploy/transform/interface.py (1)
52-52: LGTM!

The new EXPORT_ONNX stage is correctly positioned between VISUALIZE and COMPILE, which makes sense for the export workflow. The enum ordering is preserved for stage comparison logic.
tensorrt_llm/_torch/auto_deploy/transform/optimizer.py (1)
13-13: LGTM!

The addition of graph visualization capability via `to_dot` is well-implemented and appropriately gated behind an environment variable for debugging purposes.

tensorrt_llm/_torch/auto_deploy/llm_args.py (1)
154-154: LGTM!

The addition of the "export_driveos_llm_onnx" mode and its YAML mapping follows the existing pattern for mode configuration. The integration is consistent with the broader ONNX export workflow introduced in this PR.
Also applies to: 337-337
examples/auto_deploy/onnx_export_llm.py (1)
31-35: Document the batch_size workaround.

The comment reveals unexpected behavior where `max_batch_size=2` causes static shape collapse while `max_batch_size=13` enables dynamic shapes. This suggests a potential underlying issue with dynamic shape handling that warrants investigation or proper documentation.

Could you clarify whether this is:
- Expected behavior that should be documented in user-facing docs?
- A known limitation/bug that should be tracked in an issue?
- A temporary workaround that will be fixed?
If this is a known limitation, consider adding a more detailed explanation or reference to an issue tracker.
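For reference, dynamic dimensions for the dynamo-based exporter are declared roughly as below; the axis names and bounds are illustrative, while the PR's actual values appear in the diffs elsewhere in this review:

```python
from torch.export import Dim

# Each exported input gets a mapping from axis index to a named symbolic dimension.
dynamic_shapes = {
    "input_ids": {
        0: Dim("batch_size", min=1, max=16),
        1: Dim("seq_len", min=1, max=4096),
    },
}
# This dict is then passed to torch.onnx.export(..., dynamo=True, dynamic_shapes=dynamic_shapes).
```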
tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_export_onnx.py (2)
42-45: LGTM!

The test comprehensively validates the ONNX export output, including file existence, operator counts, and input/output structure. The assertions properly verify the expected graph structure with AttentionPlugin nodes and named inputs/outputs.
1-1: Add NVIDIA copyright header.

All TensorRT-LLM files should include an NVIDIA copyright header.
As per coding guidelines:
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification
🔎 Apply this diff to add the copyright header:
```diff
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 import os
```

⛔ Skipped due to learnings
Learnt from: CR Repo: NVIDIA/TensorRT-LLM PR: 0 File: CODING_GUIDELINES.md:0-0 Timestamp: 2025-12-17T22:39:44.230Z Learning: Applies to **/*.{cpp,h,cu,cuh,py} : All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification

Learnt from: galagam Repo: NVIDIA/TensorRT-LLM PR: 6487 File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12 Timestamp: 2025-08-06T13:58:07.506Z Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Learnt from: xinhe-nv Repo: NVIDIA/TensorRT-LLM PR: 8534 File: scripts/format_test_list.py:1-6 Timestamp: 2025-10-22T06:53:47.017Z Learning: The file `scripts/format_test_list.py` in the TensorRT-LLM repository does not require the NVIDIA Apache-2.0 copyright header.

Learnt from: tburt-nv Repo: NVIDIA/TensorRT-LLM PR: 9881 File: cpp/kernels/fmha_v2/train_ops/train_setup.py:35-36 Timestamp: 2025-12-10T19:14:58.432Z Learning: In cpp/kernels/fmha_v2/train_ops/train_setup.py, the embedded C++ template copyright headers (fmha_dgrad_v2_flash_attention_template and fmha_fprop_v2_flash_attention_template) use copyright dates "2011-2023" because the file was imported without meaningful changes since its original creation. The original copyright dates are preserved for historical accuracy rather than updated to the current year.

Learnt from: CR Repo: NVIDIA/TensorRT-LLM PR: 0 File: CODING_GUIDELINES.md:0-0 Timestamp: 2025-12-17T22:39:44.230Z Learning: Applies to **/*.py : Code developed for TensorRT-LLM should conform to Python 3.8+

Learnt from: CR Repo: NVIDIA/TensorRT-LLM PR: 0 File: CODING_GUIDELINES.md:0-0 Timestamp: 2025-12-17T22:39:44.230Z Learning: Applies to **/*.h : Use a preprocessor guard in C++ header files with the format `TRTLLM_<FILENAME>_H` derived from the filename in all caps

tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_fuse_rope_attention.py (3)
136-144: Unused test parameters are acceptable.

The static analysis flags `position_ids`, `atol`, and `rtol` as unused. However:

- `position_ids` is part of the model's forward signature and may be used in future test variations
- `atol`/`rtol` parameters are defined for potential numerical comparison testing

These are acceptable for test infrastructure flexibility.
Also applies to: 192-219
1-1: Add NVIDIA copyright header.

All TensorRT-LLM files should include an NVIDIA copyright header.
As per coding guidelines:
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification
🔎 Apply this diff to add the copyright header:
```diff
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 """
```

⛔ Skipped due to learnings
Learnt from: CR Repo: NVIDIA/TensorRT-LLM PR: 0 File: CODING_GUIDELINES.md:0-0 Timestamp: 2025-12-17T22:39:44.230Z Learning: Applies to **/*.{cpp,h,cu,cuh,py} : All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification

Learnt from: galagam Repo: NVIDIA/TensorRT-LLM PR: 6487 File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12 Timestamp: 2025-08-06T13:58:07.506Z Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Learnt from: xinhe-nv Repo: NVIDIA/TensorRT-LLM PR: 8534 File: scripts/format_test_list.py:1-6 Timestamp: 2025-10-22T06:53:47.017Z Learning: The file `scripts/format_test_list.py` in the TensorRT-LLM repository does not require the NVIDIA Apache-2.0 copyright header.

Learnt from: tburt-nv Repo: NVIDIA/TensorRT-LLM PR: 9881 File: cpp/kernels/fmha_v2/train_ops/train_setup.py:35-36 Timestamp: 2025-12-10T19:14:58.432Z Learning: In cpp/kernels/fmha_v2/train_ops/train_setup.py, the embedded C++ template copyright headers (fmha_dgrad_v2_flash_attention_template and fmha_fprop_v2_flash_attention_template) use copyright dates "2011-2023" because the file was imported without meaningful changes since its original creation. The original copyright dates are preserved for historical accuracy rather than updated to the current year.

Learnt from: CR Repo: NVIDIA/TensorRT-LLM PR: 0 File: CODING_GUIDELINES.md:0-0 Timestamp: 2025-12-17T22:39:44.230Z Learning: Applies to **/*.py : Code developed for TensorRT-LLM should conform to Python 3.8+

Learnt from: CR Repo: NVIDIA/TensorRT-LLM PR: 0 File: CODING_GUIDELINES.md:0-0 Timestamp: 2025-12-17T22:39:44.230Z Learning: Applies to **/*.h : Use a preprocessor guard in C++ header files with the format `TRTLLM_<FILENAME>_H` derived from the filename in all caps

Learnt from: CR Repo: NVIDIA/TensorRT-LLM PR: 0 File: CODING_GUIDELINES.md:0-0 Timestamp: 2025-12-17T22:39:44.230Z Learning: Applies to **/*.h : The preprocessor guard name in C++ must have prefix `TRTLLM_` followed by the filename, all in caps. Only use the file name, not directory names
84-86: Address RoPE argument ordering discrepancy in the torch_rope_with_explicit_cos_sin call.

The function signature in tensorrt_llm/_torch/auto_deploy/custom_ops/torch_rope.py clearly defines `torch_apply_rope_with_explicit_cos_sin(q, k, cos, sin, unsqueeze_dim)` with return order `(q_embed, k_embed)`. However, line 86 passes arguments as `(k, q, ...)` instead of `(q, k, ...)`, which causes the rotations to be swapped. The reference implementation in tensorrt_llm/_torch/auto_deploy/transform/library/rope.py confirms the expected order is `(q, k)`. Either fix the argument order to `(q, k, cos, sin, 2)` or verify that the receiving variables should be `k_rot, q_rot = ...` if the swap is intentional.

tensorrt_llm/_torch/auto_deploy/config/export_driveos_llm_onnx_debug.yaml (1)
1-144: LGTM!

The debug configuration is well-structured and appropriately disables performance optimizations to focus on faster export and debugging. The comments clearly explain why transforms are disabled, and the remaining transforms support the core ONNX export workflow.
The structure mirrors the main export config while providing a streamlined path for debugging purposes.
tensorrt_llm/_torch/auto_deploy/config/export_driveos_llm_onnx.yaml (1)
4-148: LGTM!

The transform configuration is well-organized with clear section headers, appropriate staging, and documented TODOs for disabled features. The workflow from build through ONNX export is logically structured.
tensorrt_llm/_torch/auto_deploy/transform/library/short_reshape_attention_output.py (2)
41-64: LGTM!

The pattern matching logic correctly identifies reshape nodes that are fed by AttentionPlugin and connected to torch_linear_simple nodes.
66-113: LGTM!

The transform logic is well-structured:
- Correctly extracts dynamic dimensions using `sym_size.int`
- Properly inserts new reshape nodes before the old ones
- Replaces usages and erases old nodes cleanly
- Runs appropriate graph cleanup (shape prop, DCE, lint, recompile)
The unused `cm`, `factory`, and `shared_config` arguments are required by the `BaseTransform._apply` interface signature.

tensorrt_llm/_torch/auto_deploy/transform/library/gather_last_token_ids.py (3)
15-38: LGTM!

Good validation that the last linear simple node is actually producing the output logits, with an appropriate warning if the assumption doesn't hold.
80-117: LGTM!

The GatherND insertion logic is correct and well-documented. The unsqueeze is appropriately applied to prepare indices for the GatherND operation.
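To make the gather concrete, here is what a GatherND over `[batch, 2]` indices computes, emulated in plain PyTorch; the tensor names and sizes are illustrative:

```python
import torch

hidden = torch.randn(4, 7, 64)                # [batch, seq_len, hidden]
last_token_ids = torch.tensor([6, 3, 5, 2])   # last valid position per sequence
indices = torch.stack(
    [torch.arange(hidden.size(0)), last_token_ids], dim=1
)                                             # [batch, 2], as in the docstring fix above
gathered = hidden[indices[:, 0], indices[:, 1]]  # [batch, hidden], equivalent to GatherND
```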
119-159: LGTM!

The transform flow is well-structured with proper early returns for error cases and an appropriate graph cleanup sequence.
tensorrt_llm/_torch/auto_deploy/transform/graph_module_visualizer.py (2)
46-164: LGTM!

The helper functions for edge width calculation, color assignment, and port mapping are well-implemented with appropriate handling of edge cases (single input/output, many inputs/outputs).
638-671: LGTM!

The graph depth calculation with memoization is correctly implemented.
tensorrt_llm/_torch/auto_deploy/transform/library/_config_export.py (3)
1-18: LGTM!

Good to see the proper NVIDIA copyright header included.
68-154: LGTM!

The EAGLE configuration export logic is well-structured with appropriate field validation. The `base_model_hidden_size = target_hidden_size * 3` logic is documented with a reference to the original llm_export.py.
157-202: LGTM!

The public export functions properly validate inputs and dispatch to the appropriate internal handlers. The `export_llm_config` function correctly handles multimodal models by extracting `text_config`.

tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py (3)
15-47: LGTM!

The `MatchResult` container is well-structured with clear attribute names and a useful `__repr__` for debugging.
216-276: LGTM!

The placeholder augmentation logic correctly infers batch and sequence dimensions from the token placeholder and creates appropriately typed placeholders for the attention plugin inputs.
386-434: LGTM!

The `_apply` method properly orchestrates the transformation: pattern matching, placeholder addition, replacement, cleanup, and graph validation. The early return for zero matches is appropriate.

tensorrt_llm/_torch/auto_deploy/transform/library/_chat_template.py (1)
1-14: LGTM!

Proper NVIDIA copyright header is present.
tensorrt_llm/_torch/auto_deploy/transform/library/export_to_onnx.py (2)
46-128: LGTM!

The ONNX custom op schemas and translation functions are well-defined with proper type constraints and documentation.
548-574: LGTM!

The `_apply` method properly orchestrates the export workflow: directory creation, JSON file export, and ONNX model export with appropriate status tracking.
Force-pushed from 3fd31c8 to 249cdb1, then from 7c87b13 to b2fc9a2.
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 4
♻️ Duplicate comments (4)
tensorrt_llm/_torch/auto_deploy/transform/library/gather_last_token_ids.py (1)
54-58: Docstring shape still inconsistent with implementation.

The docstring states `int64[batch_size]` but the implementation creates a tensor with shape `(batch_size, 2)` at line 84. Update the docstring to reflect the actual shape.

🔎 Suggested fix:
```diff
 def _add_last_token_ids_input(self, gm: GraphModule) -> Node:
     """Add last_token_ids as a graph input.

-    Shape: int64[batch_size]
+    Shape: int64[batch_size, 2] - indices for GatherND operation
     """
```

tensorrt_llm/_torch/auto_deploy/transform/library/_chat_template.py (2)
62-66: Union type syntax requires Python 3.10+.

The `str | List[Dict[str, str]]` syntax requires Python 3.10+, but per coding guidelines, code should conform to Python 3.8+. Use `Union` from `typing` for compatibility.

🔎 Suggested fix:
```diff
-from typing import Any, Dict, List, Optional, Tuple
+from typing import Any, Dict, List, Optional, Tuple, Union

 @dataclass
 class Message:
     role: str
-    content: str | List[Dict[str, str]] = field(default_factory=list)
+    content: Union[str, List[Dict[str, str]]] = field(default_factory=list)
```
253-298: Missing return statement and exception chaining.

The function declares a `-> Dict[str, Any]` return type but doesn't return the validated template. Also, the exception at line 257 should chain the original exception.

🔎 Suggested fix:
```diff
 try:
     with open(chat_template_path, "r") as f:
         template = json.load(f)
 except json.JSONDecodeError as e:
-    raise ValueError(f"Invalid JSON in chat template file: {e}")
+    raise ValueError(f"Invalid JSON in chat template file: {e}") from e

 # ... validation logic ...

 print("Chat template validation successful!")
+return template
```

tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.py (1)
118-137: Return type annotation mismatch and direct args mutation.
- The method is annotated as `-> bool` but returns `num_changed` (an `int`).
- Line 135 directly mutates the `cast_node.args` tuple. Use `update_arg` for safer FX graph manipulation.

🔎 Suggested fix:
- def _change_cast_bfloat16_to_float16(self, gm: GraphModule) -> bool: + def _change_cast_bfloat16_to_float16(self, gm: GraphModule) -> int: """Replace all bfloat16 cast operations with float16 casts. ... """ graph = gm.graph cast_nodes = graph.find_nodes(op="call_function", target=torch.ops.aten.to.dtype) num_changed = 0 for cast_node in cast_nodes: if cast_node.args[1] == torch.bfloat16: - cast_node.args = (cast_node.args[0], torch.float16) + cast_node.update_arg(1, torch.float16) num_changed += 1 return num_changed
🧹 Nitpick comments (12)
tensorrt_llm/_torch/auto_deploy/transform/library/_config_export.py (2)
54-57: Consider using ad_logger instead of print(). Other transforms in this PR use ad_logger for logging. Using print() here is inconsistent and won't integrate with the logging configuration.
🔎 Suggested fix:
+from ...utils.logger import ad_logger + # In the function: - print( - "Warning: head_dim not found in config, calculating as hidden_size // num_attention_heads" - ) + ad_logger.warning( + "head_dim not found in config, calculating as hidden_size // num_attention_heads" + )
68-102: Consider extracting common config export logic. _export_native_llm_config and _export_eagle_base_config share nearly identical logic for required field validation, head_dim handling, and partial_rotary_factor defaults. This duplication could be reduced with a shared helper.
tensorrt_llm/_torch/auto_deploy/transform/library/_chat_template.py (1)
317-336: Consider caching the is_vlm result to avoid duplicate calls. is_vlm(model_dir) is called at line 319 and again at line 371. Since it loads the model config, caching the result would improve efficiency.
🔎 Suggested fix:
+ is_vlm_model = is_vlm(model_dir) + tokenizer = None loaders = ( - [AutoProcessor, AutoTokenizer] if is_vlm(model_dir) else [AutoTokenizer, AutoProcessor] + [AutoProcessor, AutoTokenizer] if is_vlm_model else [AutoTokenizer, AutoProcessor] ) # ... later in the function ... - if is_vlm(model_dir): + if is_vlm_model:tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_export_onnx.py (1)
9-17: Use pytest's tmpdir fixture instead of a hardcoded /tmp path. Hardcoded temporary paths can cause issues with concurrent test execution and on systems with restricted /tmp access. The pytest tmpdir fixture provides isolated temporary directories for each test.
🔎 Refactor to use tmpdir fixture
@pytest.mark.parametrize( - "model, max_batch_size, max_seq_len, output_dir, num_attn_ops", + "model, max_batch_size, max_seq_len, num_attn_ops", [ - ("Qwen/Qwen2.5-0.5B", 13, 4, "/tmp/test_ad_export_onnx_qwen2.5-0.5b", 24), + ("Qwen/Qwen2.5-0.5B", 13, 4, 24), ], ) def test_ad_export_onnx( - model: str, max_batch_size: int, max_seq_len: int, output_dir: str, num_attn_ops: int + model: str, max_batch_size: int, max_seq_len: int, num_attn_ops: int, tmp_path ): + output_dir = str(tmp_path / "test_ad_export_onnx_qwen2.5-0.5b") ad_config = AutoDeployConfig(tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_fuse_rope_attention.py (2)
77-81: Remove unused parameter or document why it's kept. The position_ids parameter is declared but never used in the forward method. Either remove it from the signature or add a comment explaining why it's present (e.g., for API compatibility).
🔎 Option 1: Remove unused parameter
def forward( self, input_ids: torch.Tensor, - position_ids: torch.Tensor, ) -> torch.Tensor:🔎 Option 2: Document why it's kept
def forward( self, input_ids: torch.Tensor, - position_ids: torch.Tensor, + position_ids: torch.Tensor, # Required for export API compatibility ) -> torch.Tensor:
155-163: Remove unused tolerance parameters. The atol and rtol parameters are declared but never used in _run_test. These appear to be leftover from a previous implementation.
🔎 Remove unused parameters
def _run_test( head_dim: int, num_q_heads: int, num_kv_heads: int, batch_size: int, seq_len: int, - atol: float = 1e-3, - rtol: float = 1e-3, ):And update the caller:
_run_test( head_dim=head_dim, num_q_heads=num_q_heads, num_kv_heads=num_kv_heads, batch_size=batch_size, seq_len=seq_len, - atol=1e-2, - rtol=1e-2, )tensorrt_llm/_torch/auto_deploy/utils/_graph.py (1)
359-438: Frozen dataclass mutation is acceptable given the constraints. The use of object.__setattr__ on line 437 to mutate out_spec.type is unconventional but justified by the detailed comment (lines 420-437). The ONNX export requires dict-type outputs, but the frozen TreeSpec prevents normal assignment. This is a pragmatic solution to a real constraint. Consider documenting this behavior in the function docstring as well for better visibility.
🔎 Add note to docstring about type mutation
def add_graph_output(gm: GraphModule, output_node: Node, name: str) -> None: """Add a graph output to the given GraphModule. This function appends a new output to the graph's output node and updates the pytree_info metadata accordingly. + + Note: The output spec type is forcibly changed to dict to support arbitrary + named outputs, which is required for ONNX export with custom output names. NOTE: function does NOT do any graph canonicalization. This is left to the user!tensorrt_llm/_torch/auto_deploy/transform/library/short_reshape_attention_output.py (1)
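Regarding the frozen-TreeSpec workaround discussed above, the underlying Python mechanism is ordinary object.__setattr__ on a frozen dataclass; a standalone sketch, not the PR's code:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Spec:
        type: type
        num_leaves: int

    spec = Spec(type=tuple, num_leaves=3)
    # spec.type = dict                       # would raise FrozenInstanceError
    object.__setattr__(spec, "type", dict)   # bypasses the frozen check, as done for out_spec.type
    assert spec.type is dict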
124-130: Unused parameters are likely required by the interface. The parameters cm, factory, and shared_config are unused in _apply but are likely required by the BaseTransform interface signature. This is acceptable and follows the interface contract. If these parameters are truly optional in the base class, consider adding a comment explaining they're not needed for this transform, or use underscore-prefixed names (_cm, _factory, _shared_config) to indicate they're intentionally unused.
tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py (1)
500-505: Minor: Use tuple unpacking for clarity. Consider using tuple unpacking instead of concatenation for slightly more idiomatic Python.
🔎 Suggested fix:
for node in sym_size_int_nodes: if node.args[0] == position_ids_node: - new_args = (input_ids_node,) + node.args[1:] + new_args = (input_ids_node, *node.args[1:]) node.args = new_argstensorrt_llm/_torch/auto_deploy/transform/library/export_to_onnx.py (3)
432-438: Use ad_logger instead of print(). The codebase uses ad_logger for logging throughout. Consider replacing these print calls with ad_logger.debug for consistency with the rest of the module.
🔎 Suggested fix:
454-470: Consider making hardcoded dimension limits configurable. The hardcoded max values for rope_batch_size (16), max_position_embeddings (4096), and past_len (4096) may be limiting for some models. Consider deriving these from the model configuration or making them configurable. Verify that these limits are sufficient for target models; if models require larger values, these hardcoded limits would cause export failures or runtime errors.
534-539: Use ad_logger instead of print(). Similar to the earlier observation, these print statements should use ad_logger for consistent logging.
🔎 Suggested fix:
if reduced_vocab_size is not None: model_config["reduced_vocab_size"] = reduced_vocab_size - print(f"Added reduced_vocab_size={reduced_vocab_size} to config") + ad_logger.info(f"Added reduced_vocab_size={reduced_vocab_size} to config") config_path = os.path.join(output_dir, "config.json") with open(config_path, "w") as f: json.dump(model_config, f, indent=2) - print(f"Model configuration saved to {config_path}") + ad_logger.info(f"Model configuration saved to {config_path}")
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (22)
.gitignore(1 hunks)docker/common/install_base.sh(1 hunks)examples/auto_deploy/onnx_export_llm.py(1 hunks)requirements.txt(1 hunks)tensorrt_llm/_torch/auto_deploy/config/export_driveos_llm_onnx.yaml(1 hunks)tensorrt_llm/_torch/auto_deploy/custom_ops/onnx_attention.py(1 hunks)tensorrt_llm/_torch/auto_deploy/llm_args.py(2 hunks)tensorrt_llm/_torch/auto_deploy/transform/graph_module_visualizer.py(1 hunks)tensorrt_llm/_torch/auto_deploy/transform/interface.py(1 hunks)tensorrt_llm/_torch/auto_deploy/transform/library/_chat_template.py(1 hunks)tensorrt_llm/_torch/auto_deploy/transform/library/_config_export.py(1 hunks)tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.py(1 hunks)tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py(2 hunks)tensorrt_llm/_torch/auto_deploy/transform/library/export_to_onnx.py(1 hunks)tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py(1 hunks)tensorrt_llm/_torch/auto_deploy/transform/library/gather_last_token_ids.py(1 hunks)tensorrt_llm/_torch/auto_deploy/transform/library/short_reshape_attention_output.py(1 hunks)tensorrt_llm/_torch/auto_deploy/transform/optimizer.py(3 hunks)tensorrt_llm/_torch/auto_deploy/utils/_graph.py(4 hunks)tests/unittest/_torch/auto_deploy/_utils_test/torch_attention_reference.py(4 hunks)tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_export_onnx.py(1 hunks)tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_fuse_rope_attention.py(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
- tensorrt_llm/_torch/auto_deploy/transform/optimizer.py
- tests/unittest/_torch/auto_deploy/_utils_test/torch_attention_reference.py
- .gitignore
- docker/common/install_base.sh
- tensorrt_llm/_torch/auto_deploy/transform/interface.py
- tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces. Do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used
Python files should use snake_case naming:some_file.py
Python classes should use PascalCase naming:class SomeClass
Python functions and methods should use snake_case naming:def my_awesome_function():
Python local variables should use snake_case naming:my_variable = ...
Python variable names that start with a number should be prefixed with 'k':k_99th_percentile = ...
Python global variables should use upper snake_case with prefix 'G':G_MY_GLOBAL = ...
Python constants should use upper snake_case naming:MY_CONSTANT = ...
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings in Python for classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except to the smallest set of errors possible
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible, using the else block for logic
Files:
tensorrt_llm/_torch/auto_deploy/utils/_graph.pytensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.pytensorrt_llm/_torch/auto_deploy/transform/graph_module_visualizer.pytensorrt_llm/_torch/auto_deploy/transform/library/gather_last_token_ids.pytests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_export_onnx.pyexamples/auto_deploy/onnx_export_llm.pytensorrt_llm/_torch/auto_deploy/custom_ops/onnx_attention.pytensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.pytensorrt_llm/_torch/auto_deploy/transform/library/short_reshape_attention_output.pytensorrt_llm/_torch/auto_deploy/transform/library/_config_export.pytensorrt_llm/_torch/auto_deploy/llm_args.pytensorrt_llm/_torch/auto_deploy/transform/library/export_to_onnx.pytensorrt_llm/_torch/auto_deploy/transform/library/_chat_template.pytests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_fuse_rope_attention.py
**/*.{cpp,h,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification
Files:
tensorrt_llm/_torch/auto_deploy/utils/_graph.pytensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.pytensorrt_llm/_torch/auto_deploy/transform/graph_module_visualizer.pytensorrt_llm/_torch/auto_deploy/transform/library/gather_last_token_ids.pytests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_export_onnx.pyexamples/auto_deploy/onnx_export_llm.pytensorrt_llm/_torch/auto_deploy/custom_ops/onnx_attention.pytensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.pytensorrt_llm/_torch/auto_deploy/transform/library/short_reshape_attention_output.pytensorrt_llm/_torch/auto_deploy/transform/library/_config_export.pytensorrt_llm/_torch/auto_deploy/llm_args.pytensorrt_llm/_torch/auto_deploy/transform/library/export_to_onnx.pytensorrt_llm/_torch/auto_deploy/transform/library/_chat_template.pytests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_fuse_rope_attention.py
🧠 Learnings (31)
📓 Common learnings
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
📚 Learning: 2025-12-19T06:31:54.973Z
Learnt from: nvyocox
Repo: NVIDIA/TensorRT-LLM PR: 10117
File: tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py:336-339
Timestamp: 2025-12-19T06:31:54.973Z
Learning: In tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py, the cast to torch.float16 for qkv_node before creating the AttentionPlugin is intentional and required because DriveOS LLM expects float16 dtype specifically. This should not be changed to preserve original dtype or made configurable for bfloat16 models in the DriveOS LLM ONNX export path.
Applied to files:
tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.pytensorrt_llm/_torch/auto_deploy/config/export_driveos_llm_onnx.yamltests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_export_onnx.pytensorrt_llm/_torch/auto_deploy/custom_ops/onnx_attention.pytensorrt_llm/_torch/auto_deploy/llm_args.pytensorrt_llm/_torch/auto_deploy/transform/library/export_to_onnx.pytests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_fuse_rope_attention.py
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.
Applied to files:
tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.pytensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py
📚 Learning: 2025-12-17T22:39:44.244Z
Learnt from: CR
Repo: NVIDIA/TensorRT-LLM PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-12-17T22:39:44.244Z
Learning: Applies to **/*.{cpp,h,cu,cuh,py} : All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification
Applied to files:
tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.pytensorrt_llm/_torch/auto_deploy/transform/graph_module_visualizer.pytensorrt_llm/_torch/auto_deploy/transform/library/gather_last_token_ids.pyexamples/auto_deploy/onnx_export_llm.pytensorrt_llm/_torch/auto_deploy/custom_ops/onnx_attention.pytensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.pytensorrt_llm/_torch/auto_deploy/transform/library/short_reshape_attention_output.pytensorrt_llm/_torch/auto_deploy/transform/library/export_to_onnx.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
Repo: NVIDIA/TensorRT-LLM PR: 6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.pytensorrt_llm/_torch/auto_deploy/transform/graph_module_visualizer.pytensorrt_llm/_torch/auto_deploy/transform/library/gather_last_token_ids.pyexamples/auto_deploy/onnx_export_llm.pytensorrt_llm/_torch/auto_deploy/custom_ops/onnx_attention.pytensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.pytensorrt_llm/_torch/auto_deploy/transform/library/short_reshape_attention_output.pytensorrt_llm/_torch/auto_deploy/transform/library/export_to_onnx.py
📚 Learning: 2025-10-22T06:53:47.017Z
Learnt from: xinhe-nv
Repo: NVIDIA/TensorRT-LLM PR: 8534
File: scripts/format_test_list.py:1-6
Timestamp: 2025-10-22T06:53:47.017Z
Learning: The file `scripts/format_test_list.py` in the TensorRT-LLM repository does not require the NVIDIA Apache-2.0 copyright header.
Applied to files:
tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.pytensorrt_llm/_torch/auto_deploy/transform/graph_module_visualizer.pytensorrt_llm/_torch/auto_deploy/transform/library/gather_last_token_ids.pyexamples/auto_deploy/onnx_export_llm.pytensorrt_llm/_torch/auto_deploy/custom_ops/onnx_attention.pytensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.pytensorrt_llm/_torch/auto_deploy/transform/library/short_reshape_attention_output.pytensorrt_llm/_torch/auto_deploy/transform/library/export_to_onnx.py
📚 Learning: 2025-12-10T19:14:58.432Z
Learnt from: tburt-nv
Repo: NVIDIA/TensorRT-LLM PR: 9881
File: cpp/kernels/fmha_v2/train_ops/train_setup.py:35-36
Timestamp: 2025-12-10T19:14:58.432Z
Learning: In cpp/kernels/fmha_v2/train_ops/train_setup.py, the embedded C++ template copyright headers (fmha_dgrad_v2_flash_attention_template and fmha_fprop_v2_flash_attention_template) use copyright dates "2011-2023" because the file was imported without meaningful changes since its original creation. The original copyright dates are preserved for historical accuracy rather than updated to the current year.
Applied to files:
tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.pytensorrt_llm/_torch/auto_deploy/transform/graph_module_visualizer.pytensorrt_llm/_torch/auto_deploy/transform/library/gather_last_token_ids.pyexamples/auto_deploy/onnx_export_llm.pytensorrt_llm/_torch/auto_deploy/custom_ops/onnx_attention.pytensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.pytensorrt_llm/_torch/auto_deploy/transform/library/short_reshape_attention_output.pytensorrt_llm/_torch/auto_deploy/transform/library/export_to_onnx.py
📚 Learning: 2025-12-17T22:39:44.244Z
Learnt from: CR
Repo: NVIDIA/TensorRT-LLM PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-12-17T22:39:44.244Z
Learning: Applies to **/*.h : Use a preprocessor guard in C++ header files with the format `TRTLLM_<FILENAME>_H` derived from the filename in all caps
Applied to files:
tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.pyexamples/auto_deploy/onnx_export_llm.py
📚 Learning: 2025-12-17T22:39:44.244Z
Learnt from: CR
Repo: NVIDIA/TensorRT-LLM PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-12-17T22:39:44.244Z
Learning: Applies to **/*.py : Code developed for TensorRT-LLM should conform to Python 3.8+
Applied to files:
tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.pyexamples/auto_deploy/onnx_export_llm.pytensorrt_llm/_torch/auto_deploy/transform/library/export_to_onnx.py
📚 Learning: 2025-12-17T22:39:44.244Z
Learnt from: CR
Repo: NVIDIA/TensorRT-LLM PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-12-17T22:39:44.244Z
Learning: Applies to **/*.h : The preprocessor guard name in C++ must have prefix `TRTLLM_` followed by the filename, all in caps. Only use the file name, not directory names
Applied to files:
tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.pyexamples/auto_deploy/onnx_export_llm.py
📚 Learning: 2025-08-21T00:16:56.457Z
Learnt from: farshadghodsian
Repo: NVIDIA/TensorRT-LLM PR: 7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.457Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.
Applied to files:
tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.pyexamples/auto_deploy/onnx_export_llm.pytensorrt_llm/_torch/auto_deploy/custom_ops/onnx_attention.py
📚 Learning: 2025-10-20T17:09:21.560Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/transform/library/rms_norm.py:180-182
Timestamp: 2025-10-20T17:09:21.560Z
Learning: In tensorrt_llm/_torch/auto_deploy/transform/library/rms_norm.py, the _gated_rmsnorm_replacement function does not need to cast the output of torch.ops.auto_deploy.torch_rmsnorm_gated back to the input dtype, even though the custom op returns fp32. The dtype handling is managed elsewhere or the fp32 output is acceptable for downstream consumers.
Applied to files:
tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.pytensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py
📚 Learning: 2025-10-20T16:54:09.824Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py:6-6
Timestamp: 2025-10-20T16:54:09.824Z
Learning: In tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py, the import `from ...modules.mamba.layernorm_gated import _layer_norm_fwd` is correct and should not be changed to modules.fla.layernorm_gated. The _layer_norm_fwd function exists in both modules/mamba/layernorm_gated.py and modules/fla/layernorm_gated.py, but the mamba version is the intended implementation for this use case.
Applied to files:
tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.pytensorrt_llm/_torch/auto_deploy/custom_ops/onnx_attention.pytensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.pytests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_fuse_rope_attention.py
📚 Learning: 2025-09-19T21:28:13.751Z
Learnt from: jhaotingc
Repo: NVIDIA/TensorRT-LLM PR: 7856
File: cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp:159-166
Timestamp: 2025-09-19T21:28:13.751Z
Learning: In TensorRT-LLM blockScaleMoe routing (cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu), the DeepSeek routing method performs reinterpret_cast<float*>(routingLogits) at line 89, which could cause issues if routing_logits are BF16. However, Qwen3-FP8 models use RenormalizeNaive routing method and are not affected by this dtype casting issue.
Applied to files:
tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.pytensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py
📚 Learning: 2025-11-12T17:28:52.144Z
Learnt from: cheshirekow
Repo: NVIDIA/TensorRT-LLM PR: 9016
File: 3rdparty/README.md:20-20
Timestamp: 2025-11-12T17:28:52.144Z
Learning: In the TensorRT-LLM repository, "nspect" is the name of an internal tool used for detecting package installations in containers. It should not be flagged as a typo.
Applied to files:
tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.py
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
tensorrt_llm/_torch/auto_deploy/transform/library/gather_last_token_ids.pytensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py
📚 Learning: 2025-08-26T09:49:04.956Z
Learnt from: pengbowang-nv
Repo: NVIDIA/TensorRT-LLM PR: 7192
File: tests/integration/test_lists/test-db/l0_dgx_b200.yml:56-72
Timestamp: 2025-08-26T09:49:04.956Z
Learning: In TensorRT-LLM test configuration files, the test scheduling system handles wildcard matching with special rules that prevent duplicate test execution even when the same tests appear in multiple yaml files with overlapping GPU wildcards (e.g., "*b200*" and "*gb200*").
Applied to files:
tensorrt_llm/_torch/auto_deploy/config/export_driveos_llm_onnx.yaml
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_export_onnx.py
📚 Learning: 2025-08-08T05:06:31.596Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.596Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.
Applied to files:
examples/auto_deploy/onnx_export_llm.py
📚 Learning: 2025-09-23T15:01:00.070Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels, the <sstream> header is not needed as an explicit include in config.cu because it's provided transitively through other headers. Local compilation testing confirms this works without the explicit include.
Applied to files:
examples/auto_deploy/onnx_export_llm.py
📚 Learning: 2025-08-14T15:43:23.107Z
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: tensorrt_llm/_torch/attention_backend/trtllm.py:259-262
Timestamp: 2025-08-14T15:43:23.107Z
Learning: In TensorRT-LLM's attention backend, tensor parameters in the plan() method are assigned directly without validation (dtype, device, contiguity checks). This maintains consistency across all tensor inputs and follows the pattern of trusting callers to provide correctly formatted tensors.
Applied to files:
tensorrt_llm/_torch/auto_deploy/custom_ops/onnx_attention.pytensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py
📚 Learning: 2025-12-12T10:07:36.866Z
Learnt from: lirundong
Repo: NVIDIA/TensorRT-LLM PR: 9725
File: tensorrt_llm/_torch/custom_ops/cuda_tile_custom_ops.py:110-178
Timestamp: 2025-12-12T10:07:36.866Z
Learning: In PyTorch custom operators registered with torch.library.custom_op, mutable operators that return None and specify mutates_args do NOT require a register_fake decorator. The mutation tracking is handled automatically without needing a FakeTensor kernel, as documented in the PyTorch tutorial on mutable Python custom operators.
Applied to files:
tensorrt_llm/_torch/auto_deploy/custom_ops/onnx_attention.py
📚 Learning: 2025-12-19T06:31:46.370Z
Learnt from: nvyocox
Repo: NVIDIA/TensorRT-LLM PR: 10117
File: tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py:336-339
Timestamp: 2025-12-19T06:31:46.370Z
Learning: In tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py, ensure that the qkv_node is cast to torch.float16 before creating the AttentionPlugin. This casting is intentional and required because DriveOS LLM expects float16 dtype. Do not revert to the original dtype, and do not make this behavior configurable for bfloat16 models within the DriveOS LLM ONNX export path. If needed, document the rationale in code comments and add a direct assertion to prevent accidental changes to the dtype before plugin creation.
Applied to files:
tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py
📚 Learning: 2025-08-09T02:04:49.623Z
Learnt from: Fridah-nv
Repo: NVIDIA/TensorRT-LLM PR: 6760
File: tensorrt_llm/_torch/auto_deploy/models/quant_config_reader.py:81-98
Timestamp: 2025-08-09T02:04:49.623Z
Learning: In TensorRT-LLM's auto_deploy module, torch.dtype values in configuration dictionaries must be stored as string representations (e.g., "float16" instead of torch.float16) because OmegaConf.merge does not support torch.dtype types. These string representations are converted to actual torch.dtype objects in downstream code.
Applied to files:
tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.pytensorrt_llm/_torch/auto_deploy/transform/library/_config_export.py
📚 Learning: 2025-08-27T15:03:57.149Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 7294
File: tensorrt_llm/_torch/pyexecutor/sampler.py:368-392
Timestamp: 2025-08-27T15:03:57.149Z
Learning: In TensorRT-LLM's sampler.py, int32 usage for softmax_indices and related tensor indexing is intentional and should not be changed to int64. The torch.IntTensor type hint is correct for the sample() function's softmax_indices parameter.
Applied to files:
tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation.
Applied to files:
tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.pytests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_fuse_rope_attention.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.
Applied to files:
tensorrt_llm/_torch/auto_deploy/transform/library/_config_export.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM's bench configuration, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which is a Dict[str, Any] that can contain default values including `cuda_graph_config`, making the fallback `llm_args["cuda_graph_config"]` safe to use.
Applied to files:
tensorrt_llm/_torch/auto_deploy/transform/library/_config_export.py
📚 Learning: 2025-08-27T14:23:55.566Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 7294
File: tensorrt_llm/_torch/modules/rms_norm.py:17-17
Timestamp: 2025-08-27T14:23:55.566Z
Learning: The TensorRT-LLM project requires Python 3.10+ as evidenced by the use of TypeAlias from typing module, match/case statements, and union type | syntax throughout the codebase, despite some documentation still mentioning Python 3.8+.
Applied to files:
tensorrt_llm/_torch/auto_deploy/transform/library/_chat_template.py
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation with asserts for total size and TP divisibility.
Applied to files:
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_fuse_rope_attention.py
🧬 Code graph analysis (7)
tensorrt_llm/_torch/auto_deploy/utils/_graph.py (1)
tensorrt_llm/_torch/auto_deploy/shim/interface.py (1)
args(28-30)
tensorrt_llm/_torch/auto_deploy/transform/library/gather_last_token_ids.py (2)
tensorrt_llm/_torch/auto_deploy/shim/interface.py (1)
CachedSequenceInterface(11-92)tensorrt_llm/_torch/auto_deploy/utils/_graph.py (2)
add_graph_input(265-324)run_shape_prop(237-262)
tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_export_onnx.py (1)
tensorrt_llm/_torch/auto_deploy/llm_args.py (2)
AutoDeployConfig(54-339)to_llm_kwargs(315-325)
tensorrt_llm/_torch/auto_deploy/transform/library/_config_export.py (3)
tensorrt_llm/_torch/auto_deploy/transform/interface.py (1)
get(526-528)tensorrt_llm/_torch/auto_deploy/llm_args.py (2)
to_dict(311-313)to_dict(446-451)tensorrt_llm/bench/benchmark/__init__.py (1)
model_type(70-71)
tensorrt_llm/_torch/auto_deploy/transform/library/export_to_onnx.py (3)
tensorrt_llm/_torch/auto_deploy/transform/library/_chat_template.py (1)
process_chat_template(300-443)tensorrt_llm/_torch/auto_deploy/transform/library/_config_export.py (1)
export_llm_config(175-203)tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py (1)
get_config_class(122-123)
tensorrt_llm/_torch/auto_deploy/transform/library/_chat_template.py (4)
tensorrt_llm/_torch/auto_deploy/llm_args.py (2)
to_dict(311-313)to_dict(446-451)tensorrt_llm/_torch/auto_deploy/transform/interface.py (1)
get(526-528)tensorrt_llm/_torch/auto_deploy/utils/pattern_matcher.py (1)
placeholder(229-247)tests/unittest/llmapi/apps/test_chat_utils.py (1)
chat_template_path(188-193)
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_fuse_rope_attention.py (3)
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (1)
torch_attention(96-212)tensorrt_llm/_torch/auto_deploy/export/export.py (1)
torch_export_to_gm(276-344)tensorrt_llm/_torch/auto_deploy/transform/optimizer.py (1)
InferenceOptimizer(25-93)
🪛 Ruff (0.14.8)
tensorrt_llm/_torch/auto_deploy/utils/_graph.py
380-380: Avoid specifying long messages outside the exception class
(TRY003)
391-391: Consider (*tuple(current_outputs), output_node) instead of concatenation
Replace with (*tuple(current_outputs), output_node)
(RUF005)
464-467: Avoid specifying long messages outside the exception class
(TRY003)
472-475: Avoid specifying long messages outside the exception class
(TRY003)
480-484: Avoid specifying long messages outside the exception class
(TRY003)
tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.py
150-150: Unused method argument: cm
(ARG002)
151-151: Unused method argument: factory
(ARG002)
152-152: Unused method argument: shared_config
(ARG002)
tensorrt_llm/_torch/auto_deploy/transform/graph_module_visualizer.py
86-86: Consider moving this statement to an else block
(TRY300)
449-449: Loop control variable target_name not used within loop body
(B007)
449-449: Loop control variable input_idx not used within loop body
(B007)
535-535: Do not catch blind exception: Exception
(BLE001)
648-648: Do not catch blind exception: Exception
(BLE001)
tensorrt_llm/_torch/auto_deploy/transform/library/gather_last_token_ids.py
136-136: Unused method argument: cm
(ARG002)
137-137: Unused method argument: factory
(ARG002)
138-138: Unused method argument: shared_config
(ARG002)
tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_export_onnx.py
12-12: Probable insecure usage of temporary file or directory: "/tmp/test_ad_export_onnx_qwen2.5-0.5b"
(S108)
tensorrt_llm/_torch/auto_deploy/custom_ops/onnx_attention.py
33-33: Unused function argument: context_lengths
(ARG001)
34-34: Unused function argument: rope_rotary_cos_sin
(ARG001)
35-35: Unused function argument: kvcache_start_index
(ARG001)
37-37: Unused function argument: enable_tree_attention
(ARG001)
38-38: Unused function argument: head_size
(ARG001)
39-39: Unused function argument: num_kv_heads
(ARG001)
40-40: Unused function argument: num_q_heads
(ARG001)
81-81: Unused function argument: context_lengths
(ARG001)
82-82: Unused function argument: rope_rotary_cos_sin
(ARG001)
83-83: Unused function argument: kvcache_start_index
(ARG001)
84-84: Unused function argument: enable_tree_attention
(ARG001)
tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py
246-246: Do not catch blind exception: Exception
(BLE001)
271-271: Do not catch blind exception: Exception
(BLE001)
293-293: Unused method argument: cm
(ARG002)
367-367: Unused method argument: cm
(ARG002)
502-502: Consider (input_ids_node, *node.args[1:]) instead of concatenation
Replace with (input_ids_node, *node.args[1:])
(RUF005)
511-511: Unused method argument: factory
(ARG002)
512-512: Unused method argument: shared_config
(ARG002)
tensorrt_llm/_torch/auto_deploy/transform/library/short_reshape_attention_output.py
127-127: Unused method argument: cm
(ARG002)
128-128: Unused method argument: factory
(ARG002)
129-129: Unused method argument: shared_config
(ARG002)
tensorrt_llm/_torch/auto_deploy/transform/library/_config_export.py
38-38: Avoid specifying long messages outside the exception class
(TRY003)
45-45: Avoid specifying long messages outside the exception class
(TRY003)
85-85: Avoid specifying long messages outside the exception class
(TRY003)
121-121: Avoid specifying long messages outside the exception class
(TRY003)
135-135: Avoid specifying long messages outside the exception class
(TRY003)
164-166: Avoid specifying long messages outside the exception class
(TRY003)
195-195: Avoid specifying long messages outside the exception class
(TRY003)
tensorrt_llm/_torch/auto_deploy/transform/library/export_to_onnx.py
422-422: Unused method argument: factory
(ARG002)
423-423: Unused method argument: shared_config
(ARG002)
544-544: Unused method argument: cm
(ARG002)
546-546: Unused method argument: shared_config
(ARG002)
tensorrt_llm/_torch/auto_deploy/transform/library/_chat_template.py
131-131: Do not catch blind exception: Exception
(BLE001)
149-154: Avoid specifying long messages outside the exception class
(TRY003)
251-251: Avoid specifying long messages outside the exception class
(TRY003)
257-257: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
257-257: Avoid specifying long messages outside the exception class
(TRY003)
261-263: Prefer TypeError exception for invalid type
(TRY004)
261-263: Avoid specifying long messages outside the exception class
(TRY003)
269-269: Avoid specifying long messages outside the exception class
(TRY003)
330-330: Do not catch blind exception: Exception
(BLE001)
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_fuse_rope_attention.py
26-26: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
27-27: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
80-80: Unused method argument: position_ids
(ARG002)
161-161: Unused function argument: atol
(ARG001)
162-162: Unused function argument: rtol
(ARG001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (41)
requirements.txt (1)
13-14: LGTM! The onnxscript dependency is properly pinned to version 0.5.4 for reproducibility, addressing the previous review feedback. Keeping graphviz unpinned is acceptable given its stable release status.
tensorrt_llm/_torch/auto_deploy/transform/library/gather_last_token_ids.py (3)
1-14: LGTM!The NVIDIA copyright header is properly included, addressing the previous review feedback.
94-131: LGTM!The GatherND insertion logic correctly manipulates the FX graph, inserting the unsqueeze and GatherND nodes before the linear layer and properly rewiring the inputs.
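For context, the (batch_size, 2) index layout pairs each batch index with its last-token position, which is what GatherND needs to pick one hidden state per sequence; a hedged eager-mode sketch with illustrative names:

    import torch

    def build_last_token_indices(last_token_ids: torch.Tensor) -> torch.Tensor:
        # last_token_ids: int64[batch_size], position of the last valid token per sequence
        batch = torch.arange(last_token_ids.shape[0], dtype=torch.int64)
        return torch.stack([batch, last_token_ids], dim=-1)   # int64[batch_size, 2]

    hidden = torch.randn(4, 7, 16)                  # [batch, seq, hidden]
    idx = build_last_token_indices(torch.tensor([6, 3, 0, 5]))
    gathered = hidden[idx[:, 0], idx[:, 1]]         # [batch, hidden], same result GatherND produces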
133-173: LGTM! The _apply orchestration is well-structured with clear steps and proper cleanup. The unused arguments (cm, factory, shared_config) are required by the BaseTransform interface signature.
tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.py (3)
1-26: LGTM!The copyright header and imports are properly structured. The class docstring clearly documents the four transformations performed for DriveOS compatibility.
69-116: LGTM! The method correctly identifies attention plugin outputs, traces through getitem and reshape nodes, and inserts float16 casts. The replace_all_uses_with followed by update_arg is the proper pattern to avoid circular references.
147-167: LGTM! The _apply method correctly orchestrates all adaptation steps. The unused arguments are required by the BaseTransform interface.
tensorrt_llm/_torch/auto_deploy/transform/library/_config_export.py (3)
1-18: LGTM!Copyright header and version constant are properly defined.
157-172: LGTM!The vision config export correctly validates the presence of vision configuration and preserves the original config for MRoPE compatibility.
175-203: LGTM! The dispatch logic correctly routes to the appropriate exporter based on model type and handles multimodal models by extracting text_config.
tensorrt_llm/_torch/auto_deploy/transform/library/_chat_template.py (3)
1-14: LGTM!Copyright header and module docstring are properly structured.
107-155: LGTM!The fallback pattern for handling tokenizers that don't support multimodal content is appropriate. The inner exception properly chains the original error.
340-443: LGTM!The chat template extraction logic is well-structured, correctly handling system/user/assistant roles, multimodal content patterns for VLMs, and default system prompt detection. The output JSON structure is comprehensive.
tensorrt_llm/_torch/auto_deploy/llm_args.py (1)
154-154: LGTM! Mode extension follows existing patterns. The addition of "export_driveos_llm_onnx" to the mode literal and its corresponding YAML mapping is consistent with the existing architecture and properly extends the configuration system for the new ONNX export workflow.
Also applies to: 337-337
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_fuse_rope_attention.py (1)
26-27: noqa directives are actually needed despite the Ruff warning. The # noqa: F401 comments are intentionally suppressing unused-import warnings because these imports register custom ops (torch.ops.auto_deploy.*) via side effects. Ruff 0.14.8's RUF100 rule incorrectly flags them as unnecessary. The imports serve to register the custom operators before the test runs, and removing the noqa would cause linters to flag them as unused imports. Keep the noqa directives as-is.
tensorrt_llm/_torch/auto_deploy/config/export_driveos_llm_onnx.yaml (1)
1-148: LGTM! Well-structured ONNX export configuration.The YAML configuration provides a comprehensive pipeline for DriveOS LLM ONNX export with:
- Clear stage organization (factory, export, pattern_matcher, sharding, etc.)
- Good documentation (e.g., lines 56-60 explaining why optimize_rope is disabled)
- Consistent structure matching existing configs
The export pipeline sequence (fuse_rope_attention → short_reshape_attention_output → gather_last_token_ids → adapt_to_driveos_llm → export_to_onnx) logically flows from graph optimization to final export.
tensorrt_llm/_torch/auto_deploy/utils/_graph.py (3)
25-34: LGTM! Recursive post-init utility is well-designed. The _call_post_init_recursive helper correctly traverses the TreeSpec hierarchy from leaves to root, ensuring internal cached values (like num_leaves) are updated after modifying children_specs. This is essential for maintaining pytree integrity.
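A minimal sketch of the leaves-to-root traversal such a helper performs, assuming each spec exposes children_specs and recomputes its cached counts in __post_init__ (as dataclass-based TreeSpec does):

    def call_post_init_recursive(spec) -> None:
        # Refresh children first so the parent sees up-to-date child counts.
        for child in getattr(spec, "children_specs", []):
            call_post_init_recursive(child)
        post_init = getattr(spec, "__post_init__", None)
        if post_init is not None:
            post_init()   # recompute cached values such as num_leaves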
265-324: LGTM! The name_prefix addition improves flexibility. Adding the name_prefix parameter allows customization of placeholder naming while maintaining backward compatibility with the default "arg_" prefix. The implementation correctly applies the prefix consistently across placeholder creation, orig_args, and context.
440-525: LGTM! Input removal correctly handles the args vs kwargs distinction. The remove_graph_input function properly addresses the past review concern about index mismatches by:
- Computing the global placeholder index (line 493)
- Determining if it's an arg or kwarg based on position (lines 496-508)
- Computing a relative index for the appropriate spec (lines 501, 506)
- Using relative_idx for all spec operations (lines 517-520)
This ensures correct removal regardless of whether the input is a positional arg or kwarg.
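The arg/kwarg split boils down to simple index arithmetic once the number of positional-arg placeholders is known; a hedged sketch of the bookkeeping described above:

    def split_placeholder_index(global_idx: int, num_args: int):
        # Placeholders are laid out as [positional args..., keyword args...].
        if global_idx < num_args:
            return "args", global_idx              # relative index into the args spec
        return "kwargs", global_idx - num_args     # relative index into the kwargs spec

    # e.g. with 2 positional args, placeholder #3 is the second kwarg:
    assert split_placeholder_index(3, 2) == ("kwargs", 1)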
tensorrt_llm/_torch/auto_deploy/transform/graph_module_visualizer.py (2)
175-538: LGTM! Comprehensive visualization implementation. The to_dot function provides sophisticated graph visualization with:
- Intelligent port assignment for multi-input/output nodes
- Edge coloring and width based on tensor properties
- Special handling for large fan-in/fan-out nodes via duplication
- Constant node creation for non-Node inputs
The magic number 9223372036854775807 (2^63-1) handling on lines 298-299, 304-305 is a valid workaround for what appears to be a torch.SymInt representation issue where -1 gets converted to LONG_MAX.
The broad exception catching on line 535 is acceptable for visualization code, as failures shouldn't break the pipeline. Good practice to log the error and save the text dump as fallback.
541-687: LGTM! The graph analysis utility provides useful metrics. The analyze_graph_structure function computes meaningful graph statistics (node counts, connections, complexity, depth) that can aid in debugging and optimization decisions. The recursive depth calculation correctly handles nested dependencies.
tensorrt_llm/_torch/auto_deploy/transform/library/short_reshape_attention_output.py (1)
124-171: LGTM! The reshape optimization correctly handles dynamic shapes. The _apply method effectively:
- Identifies reshape nodes following AttentionPlugin outputs (lines 132-133)
- Extracts dynamic dimensions from the input tensor using sym_size.int (lines 143-144)
- Creates new reshape nodes with the symbolic shape [dim0, dim1, -1] (lines 147-153)
- Properly replaces and removes old nodes (lines 156-159)
- Runs post-pass cleanup and shape propagation (lines 163-166)
This ensures reshapes handle dynamic batch and sequence dimensions correctly for ONNX export.
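A hedged FX sketch of that rewrite, reading the two leading dynamic dims with aten.sym_size.int and rebuilding the reshape as [dim0, dim1, -1]; argument positions are illustrative and the real transform also refreshes metadata:

    import torch

    def shorten_reshape(graph, old_reshape):
        src = old_reshape.args[0]                    # tensor feeding the reshape (attention output)
        with graph.inserting_before(old_reshape):
            dim0 = graph.call_function(torch.ops.aten.sym_size.int, (src, 0))
            dim1 = graph.call_function(torch.ops.aten.sym_size.int, (src, 1))
            new_reshape = graph.call_function(
                torch.ops.aten.reshape.default, (src, [dim0, dim1, -1])
            )
        old_reshape.replace_all_uses_with(new_reshape)
        graph.erase_node(old_reshape)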
examples/auto_deploy/onnx_export_llm.py (1)
38-42: The batch_size dimension collapse with small values (e.g., max_batch_size=2) is a known limitation in PyTorch's ONNX export when using small max values in dynamic shape definitions. The documented workaround using batch_size=13 is appropriate and correctly implemented. The dynamic shape configuration in export_to_onnx.py properly specifies symbolic dimensions per PyTorch standards. No code changes required; the existing comment adequately documents the workaround and its rationale.tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py (6)
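On the dynamic batch_size point above: with the dynamo-based exporter, the usual way to keep an axis symbolic is an explicit torch.export.Dim with bounds. A sketch under the assumption that torch.onnx.export(..., dynamo=True) is being used as in this PR; gm and example_input_ids stand in for the prepared graph and sample tokens:

    import torch

    batch = torch.export.Dim("batch_size", min=2, max=64)   # illustrative bounds
    seq = torch.export.Dim("seq_len", min=2, max=4096)

    torch.onnx.export(
        gm,
        (example_input_ids,),
        "model.onnx",
        dynamo=True,
        dynamic_shapes={"input_ids": {0: batch, 1: seq}},
    )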
1-13: LGTM! Copyright header added.The NVIDIA copyright header is now properly included at the top of the file.
29-93: LGTM! The MatchResult class is well-documented with clear attributes and a useful __repr__ for debugging.
123-291: LGTM!The pattern matching logic is thorough with good step-by-step validation and informative debug logging. The broad exception handling is acceptable here since pattern matching should be resilient to unexpected graph structures.
293-362: LGTM!The placeholder creation correctly derives symbolic dimensions from the token_ids tensor and maintains proper shape propagation.
364-474: LGTM!The replacement logic correctly creates the fused AttentionPlugin with proper input wiring, output handling, and dead code elimination. Based on learnings, the float16 cast is intentional for DriveOS LLM compatibility.
507-556: LGTM! The orchestration method correctly sequences the transform stages: pattern matching, placeholder creation, replacement, cleanup, recompilation, linting, and shape propagation. The unused factory and shared_config arguments are interface requirements from BaseTransform.
tensorrt_llm/_torch/auto_deploy/custom_ops/onnx_attention.py (5)
1-21: LGTM! Copyright header and module docstring.The file now includes the required NVIDIA copyright header and has a clear module-level docstring explaining the purpose of these placeholder custom operations.
28-74: LGTM! The attention_plugin custom op is correctly registered as a placeholder with comprehensive docstrings. The unused arguments are expected since this is a placeholder for ONNX export.
77-117: LGTM! The fake implementation correctly computes output shapes for torch.compile tracing, properly handling KV-cache accumulation with present_kv_len = seq_len + past_len.
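A condensed sketch of the shape arithmetic such a fake kernel performs; the tensor layouts and argument list are illustrative, not the exact registration:

    import torch

    def attention_plugin_fake(qkv, past_key_values, num_q_heads, num_kv_heads, head_size):
        batch, seq_len, _ = qkv.shape
        past_len = past_key_values.shape[-2]
        present_kv_len = seq_len + past_len          # cache grows by the newly processed tokens
        out = qkv.new_empty(batch, seq_len, num_q_heads * head_size)
        present = past_key_values.new_empty(batch, 2, num_kv_heads, present_kv_len, head_size)
        return out, present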
120-147: LGTM! The _dummy_gather_nd helper now correctly handles both 2D and 3D indices with a unified shape computation, addressing the previous issue about the missing return for 2D indices.
150-195: LGTM! The gather_nd and gather_nd_fake implementations correctly delegate to the shared _dummy_gather_nd helper, maintaining consistency between the actual op and its fake implementation.
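For sanity-checking that shared shape logic, a small eager reference for GatherND with batch_dims=0 (the plugin's exact semantics should be taken from the ONNX spec):

    import torch

    def gather_nd_reference(data: torch.Tensor, indices: torch.Tensor) -> torch.Tensor:
        r = indices.shape[-1]                        # how many leading dims of data are indexed
        flat = indices.reshape(-1, r).long()
        gathered = data[tuple(flat[:, j] for j in range(r))]
        return gathered.reshape(*indices.shape[:-1], *data.shape[r:])

    data = torch.arange(24.0).reshape(4, 6)
    idx = torch.tensor([[0, 5], [3, 0]])             # int64[2, 2]
    assert torch.equal(gather_nd_reference(data, idx), torch.tensor([5.0, 18.0]))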
1-42: LGTM! Copyright header and imports.The file now includes the required NVIDIA copyright header and has appropriate imports for ONNX schema registration and export functionality.
44-58: LGTM!The
ExportToONNXConfigclass provides clean configuration with sensible defaults and proper field descriptions.
60-117: LGTM!The custom RoPE schema is well-defined with proper type constraints and the translation function correctly creates the ONNX custom op node.
120-142: LGTM!The
custom_simple_linear_opcorrectly handles the weight transpose and optional bias addition, whilecustom_gather_nd_opproperly delegates to the standard ONNX GatherND op.
145-271: LGTM!The
AttentionPluginschema is comprehensive with proper input/output specifications, type constraints, and required attributes. The translation function correctly maps to the TRT domain opset.
273-405: LGTM!The
torch_attentionschema correctly defines the SDPA operation with GQA support, and the translation function properly handles theis_causalbool-to-int conversion for ONNX compatibility.
562-588: LGTM!The
_applymethod correctly orchestrates the export workflow, creates necessary directories, and returns the originalGraphModuleunchanged since this is an export-only transform.
[none][feat] Add AutoDeploy export-onnx mode
Add a new mode "export-onnx" to AutoDeploy.
The mode is almost identical to the default one, with two differences:
1. Fuse torch_rope_with_explicit_cos_sin &
torch_cached_attention_with_cache into onnx_rope_attention
2. The result is not TRT Engine but .onnx
Files added:
- export_onnx.py: The transformation to fuse the ops
- graph_module_visualizer.py: Convert GraphModule to .dot
- examples/onnx_export_llm.py: Example usage
- onnx_driveos_llm.yaml: The new mode config file
- onnx_attention.py: The definition of the fused op
[none][feat] fix small graphviz bug, remove useless code
[none][feat] Rename mode from onnx_driveos_llm to export_driveos_llm_onnx
[none][feat] Rename export_onnx.py to fuse_rope_attention.py
[none][feat] Annotate .meta['val'] with add_graph_input()
[none][feat] Successfully export .onnx
[none][feat] Add set_kvcache_placeholder_metadata transform
[none][feat] Skip torch_cached_attention_prepare_metadata
[none][feat] Fix SetKVCachePlaceholderMetadata transform
[none][feat] Remove unused placeholder of prepare_metadata
[none][feat] Fix to run DeepSeek-R1
[none][feat] Add remove_graph_input, refactor remove_unused_placeholder()
[none][feat] Merge K&V cache placeholder
[none][feat] Replace sin_cos with input
[none][feat] Manually fuse rope & attn
[none][feat] Export torch_attention_bsnd_grouped_sdpa with dynamic shape
[none][feat] Manually match rope & attn, not replace yet
[none][feat] Successfully export ONNX with dynamic input
[none][feat] Hack out_spec to add graph output
[none][feat] Fix present_key_values shape
[none][feat] Fix input & output names
[none][feat] Change out_spec in add_graph_output
[none][feat] Fix export of torch_linear_simple
The original translation misses a transpose on the weight.
[none][feat] Fix present_key_values shape
[none][feat] Rewire reshape's new shape as TRT-LLM edge
[none][feat] Fix non-text rebase conflicts
[none][feat] Fix AttentionPlugin domain. should be "" not "ai.onnx"
[none][feat] Enhance visualize, use .meta["val"] instead of .meta["tensor_meta"]
[none][feat] Fix visualize tensor width calculation
When calculating the width of the tensor, check whether the dimension is an int or a SymInt.
The original implementation accidentally introduced constraints on the symbolic int.
I don't know exactly how it happens; I don't think it should introduce new constraints, but it does.
[none][feat] Fix output dynamic batch_size
Originally the max batch size was 2; however, for reasons unknown, when it is set to 2 the batch_size collapses to the literal static int 2 even though we explicitly mark it as a dynamic axis.
Stranger still, when it is set to 13, the batch_size stays dynamic.
default=13, # to enable dynamic batch_size, the max batch size must be > 1
[none][feat] Rename fuse_rope_attention_manually to fuse_rope_attention
[none][feat] Remove fuse_rope_attention.py
[none][feat] Rewire reshape to make the graph like Luxiao's
[none][feat] Fix last_token_ids dtype from i32 to i64
[none][feat] Catch up update to date DriveOS LLM
- Add placeholder kvcache_start_index
- AttentionPlugin add input kvcache_start_index
- Insert Unsqueeze -1 before GatherND
- rope_rotary_cos_sin dynamic axis name changed from
rope_max_position_length to max_position_embeddings
- logits' dtype should be float32, insert a cast
- Insert cast to f16 before AttentionPlugin
- All cast to bf16 should be f16
[none][feat] Catch up update to date DriveOS LLM
- model.half() converts the whole model to f16, including weights
- Remove AttentionPlugin attribute kv_cache_capacity & max_batch_size
- AttentionPlugin output[1] shape infer by seq_len + past_len
- AttentionPlugin domain changed from `onnx.ai` to `trt`
- Placeholder `kvcache_start_index` dynamic axes changed from `batch_size` to `kv_cache_start_batch_size`
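One way to express that shape rule so tracing and export can see it is a custom op with a fake (meta) implementation. The namespace, signature, and layouts below are hypothetical, chosen only to illustrate the seq_len + past_len rule; they are not the PR's actual AttentionPlugin definition:

```python
import torch


@torch.library.custom_op("demo_trt::attention_plugin", mutates_args=())
def attention_plugin(qkv: torch.Tensor, past_kv: torch.Tensor) -> torch.Tensor:
    # Never executed in eager mode; the op only needs a schema for export.
    raise NotImplementedError("AttentionPlugin runs inside the DriveOS LLM runtime")


@attention_plugin.register_fake
def _(qkv: torch.Tensor, past_kv: torch.Tensor) -> torch.Tensor:
    batch, seq_len, _ = qkv.shape                       # assumed [batch, seq, hidden]
    _, num_kv_heads, past_len, head_dim = past_kv.shape
    # The present-KV output grows by the number of new tokens: past_len + seq_len.
    return past_kv.new_empty(batch, num_kv_heads, past_len + seq_len, head_dim)
```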
[none][feat] Catch up with up-to-date main
[none][feat] Add test for fuse_rope_attention transform
- Add test for fuse_rope_attention
- Enhance run_test_transformed_gm to support Modules with multiple inputs
- Fix add_graph_output for graph with only one _LEAF_SPEC
[none][feat] Add unit test for fuse_rope_attn
- Add a unit test
- Fix add_graph_output when out_spec is _LEAF_SPEC
[none][feat] Export .json files
[none][feat] add AutoDeploy export onnx end-to-end test
[none][feat] Export ONNX with cpu to reduce GPU memory footprint
[none][feat] Use model.config to get head_dim, instead of using literal
[none][feat] Visualize graph only when env var AD_DEBUG_VISUALIZE_DIR is set
- We no longer visualize by default, only when AD_DEBUG_VISUALIZE_DIR is set.
- AD_DEBUG_VISUALIZE_DIR is also the output dir, so you can choose where the graphs are written (see the snippet after this list).
- Simplify the logging messages; move many of them from info to debug.
- Add .cursor to .gitignore
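For example (a usage sketch; the directory path is arbitrary):

```python
import os

# Dump the FX graph after each transform as .dot files into this directory.
os.environ["AD_DEBUG_VISUALIZE_DIR"] = "/tmp/ad_graph_dumps"

# ...then build/optimize the model as usual; without the variable set,
# no visualization files are written.
```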
Signed-off-by: yocox <[email protected]>
The dimension is wrong: it should be num_kv_head, but it is hardwired to 2. Signed-off-by: yocox <[email protected]>
- Remove the "arg_" prefix for add_graph_input(). See NVIDIA#10117 (comment). Remove all such prefixes in the code to fix the regression.
- Replace the test_ad_export_onnx model path from "Qwen/Qwen2.5-0.5B" to "/home/scratch.trt_llm_data/llm-models/Qwen2.5-0.5B-Instruct" to avoid the "huggingface_hub.errors.HfHubHTTPError: 429 Client Error: Too Many Requests ..." error.
Signed-off-by: yocox <[email protected]>
- Simplify the adapt_to_driveos_llm graph cleanup. Remove:
gm.graph.eliminate_dead_code()
gm.graph.lint()
gm.recompile()
and use is_clean=False instead.
- Adapt transformations from main:
- match_bmm_moe_pattern
- fuse_fp8_moe
- fuse_nvfp4_moe
- Enhance comment for debug_visualize_dir option
Signed-off-by: yocox <[email protected]>
- Specify the model path with get_small_model_config
- Simplify the test code; make batch size and seq length fixed instead of arguments
Signed-off-by: yocox <[email protected]>
Signed-off-by: yocox <[email protected]>
9835586 to c90c23b (Compare)
/bot run
PR_Github #34175 [ run ] triggered by Bot. Commit:
PR_Github #34175 [ run ] completed with state
/bot run
PR_Github #34224 [ run ] triggered by Bot. Commit:
PR_Github #34224 [ run ] completed with state
lucaslie left a comment
That was a great last minute addition to add the explicit export_onnx function :)
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast
PR_Github #34227 [ run ] triggered by Bot. Commit:
PR_Github #34227 [ run ] completed with state
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast
PR_Github #34235 [ run ] triggered by Bot. Commit:
PR_Github #34235 [ run ] completed with state
Description
The issue
To run inference on DriveOS LLM efficiently, the graph must be optimized. Currently, the DriveOS LLM workflow rewrites the graph at the ONNX level. However, matching and manipulating the graph at the ONNX level is difficult and tedious because it has already been lowered to fine-grained ops.
The solution
In this PR, we leverage TensorRT-LLM AutoDeploy. AutoDeploy already performs most of the optimization, so we only have to adapt the optimized graph to fit DriveOS LLM's requirements and export the resulting graph as a checkpoint in ONNX format.
The original workflow:
The new workflow:
We add a new AutoDeploy configuration, export_driveos_llm_onnx, to do that. This configuration is like the default configuration with the following differences: RoPE and attention are fused into a single AttentionPlugin op, and the result is exported as an ONNX graph containing AttentionPlugin nodes instead of being built into a TRT engine.
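To make the export step concrete, here is a minimal, self-contained sketch of a dynamo-based torch.onnx.export call with dynamic batch and sequence axes. The tiny module and names are placeholders standing in for the optimized AutoDeploy GraphModule, not the PR's exact code (see examples/onnx_export_llm.py for the real entry point):

```python
import torch


class TinyBlock(torch.nn.Module):
    """Placeholder for the optimized GraphModule produced by AutoDeploy."""

    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(64, 64)

    def forward(self, hidden_states):
        return self.proj(hidden_states)


model = TinyBlock().eval()
example = torch.randn(2, 8, 64)  # [batch, seq, hidden]

batch = torch.export.Dim("batch_size", min=1, max=64)
seq = torch.export.Dim("seq_len", min=1, max=4096)

torch.onnx.export(
    model,
    (example,),
    "model.onnx",
    dynamo=True,  # use the torch.export-based exporter
    dynamic_shapes={"hidden_states": {0: batch, 1: seq}},
    input_names=["hidden_states"],
    output_names=["logits"],
)
```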
Visualization
During development we also added a debugging feature to visualize the FX graph after each transformation. See issue #9370.
Test Coverage
This PR has 2 tests:
To run the test
Summary by CodeRabbit
Release Notes
New Features
Tests
Chores
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provides a user-friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
Details
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
kill
Kill all running builds associated with the pull request.
skip
skip --comment COMMENT
Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.