Conversation


@nvyocox nvyocox commented Dec 18, 2025

Description

The issue

To run inference efficiently on DriveOS LLM, the graph must be optimized. Currently, the DriveOS LLM workflow rewrites the graph at the ONNX level. However, pattern matching and manipulation on ONNX are difficult and tedious because the graph has already been lowered to fine-grained ops.

The solution

In this PR, we utilize TensorRT LLM AutoDeploy. AutoDeploy already performs most of the optimization, so we only need to adapt the optimized graph to fit DriveOS LLM's requirements and export the resulting graph as a checkpoint in ONNX format.

The original workflow:

HF model ------------------> ONNX ---> TRT Engine
             DriveOS LLM

The new workflow:

HF model ---> FX Graph ------------------> FX Graph -----> ONNX ---> TRT Engine
                       AutoDeploy rewrite          export

We add a new AutoDeploy configuration, export_driveos_llm_onnx, to do that. This configuration is like the default configuration, with the following differences (see the usage sketch after this list):

  • It does not rewrite the attention to KV-cached attention.
  • It fuses the RoPE & attention ops into DriveOS LLM's AttentionPlugin op.
  • It exports the resulting graph to ONNX.
  • It adds a new graph output for each attention op, returning the updated KV cache.
  • It adds new inputs required by DriveOS LLM's AttentionPlugin.
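
For reference, here is a minimal usage sketch based on the example script and tests added in this PR. The model name and size limits come from the end-to-end test; the output directory path and the exact LLM construction call are assumptions.

from tensorrt_llm._torch.auto_deploy import LLM, AutoDeployConfig

# Configure AutoDeploy with the new export mode added in this PR.
ad_config = AutoDeployConfig(
    model="Qwen/Qwen2.5-0.5B",
    mode="export_driveos_llm_onnx",
    max_batch_size=13,
    max_seq_len=4,
)
# Point the export_to_onnx transform at a (hypothetical) output directory.
ad_config.transforms["export_to_onnx"]["output_dir"] = "/tmp/driveos_llm_onnx"

# Building the LLM runs the AutoDeploy pipeline, ending with the ONNX export stage
# that writes model.onnx, config.json, and the tokenizer files to output_dir.
llm = LLM(**ad_config.to_llm_kwargs())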

Visualization

During development we also added a debugging feature to visualize the FX graph after each transformation. See issue #9370.
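
A rough sketch of how to enable it (the environment variable name AD_DEBUG_VISUALIZE_DIR comes from the walkthrough below; the directory path here is hypothetical):

import os

# When this variable is set, the optimizer renders a Graphviz diagram after each
# completed transform.
os.environ["AD_DEBUG_VISUALIZE_DIR"] = "/tmp/ad_fx_diagrams"

# Build and optimize the model as usual afterwards; one diagram per transform
# should appear in the directory above.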

Test Coverage

This PR adds two tests:

  1. A unit test for the transformation that fuses the RoPE & attention ops.
  2. An end-to-end test on the Qwen2.5-0.5B model.

To run the tests:

cd tests
# Fuse unit test
pytest unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_fuse_rope_attention.py
# End to end test
pytest unittest/_torch/auto_deploy/unit/singlegpu/test_ad_export_onnx.py

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for exporting LLM models to ONNX format for DriveOS deployment with automatic configuration and tokenizer export.
    • Added graph visualization capabilities to analyze model architecture as Graphviz diagrams.
    • Added chat template processing for vision-language and text-only models.
  • Tests

    • Added ONNX export and RoPE attention fusion transformation tests.
  • Chores

    • Updated dependencies and Docker base image configuration.


PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@nvyocox nvyocox requested review from a team as code owners December 18, 2025 09:41

coderabbitai bot commented Dec 18, 2025

📝 Walkthrough

Walkthrough

This pull request introduces a comprehensive ONNX export pipeline for LLMs via AutoDeploy, featuring new transforms for graph optimization, custom ONNX operations, configuration handling, and visualization capabilities to support DriveOS LLM deployment.

Changes

Cohort / File(s) Summary
Configuration and Dependencies
.gitignore, requirements.txt, docker/common/install_base.sh
Added .cursor to ignored files; added onnxscript==0.5.4 and graphviz dependencies; updated Docker base image to include graphviz.
ONNX Export Configuration
tensorrt_llm/_torch/auto_deploy/config/export_driveos_llm_onnx.yaml
New multi-stage YAML pipeline for graph-mode ONNX export, defining build, pattern matching, quantization, fusion, and ONNX export stages with detailed transform configurations.
AutoDeploy Core Extensions
tensorrt_llm/_torch/auto_deploy/llm_args.py, tensorrt_llm/_torch/auto_deploy/transform/interface.py
Extended AutoDeployConfig.mode to include "export_driveos_llm_onnx" mode with YAML routing; added EXPORT_ONNX stage to transform pipeline.
Custom ONNX Operations
tensorrt_llm/_torch/auto_deploy/custom_ops/onnx_attention.py
New module defining placeholder custom ops for ONNX export: AttentionPlugin, GatherND, with fake implementations for shape inference and export compatibility.
Graph Transform Library
tensorrt_llm/_torch/auto_deploy/transform/library/export_to_onnx.py, adapt_to_driveos_llm.py, fuse_rope_attention.py, gather_last_token_ids.py, short_reshape_attention_output.py
New transforms: ExportToONNX (main ONNX export with config/JSON file handling), AdaptToDriveOSLLM (dtype conversion and cast insertion), FuseRopeAttention (RoPE+attention fusion), GatherLastTokenIds (token gathering), ShortReshapeAttentionOutput (reshape optimization).
Configuration and Utilities
tensorrt_llm/_torch/auto_deploy/transform/library/_config_export.py, _chat_template.py
New modules for exporting LLM/Eagle configs with required field extraction and version tagging; chat template processing with VLM detection and role/content pattern extraction.
Graph Utilities and Visualization
tensorrt_llm/_torch/auto_deploy/utils/_graph.py, tensorrt_llm/_torch/auto_deploy/transform/graph_module_visualizer.py, tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py
Extended graph I/O utilities with add_graph_output, remove_graph_input, add_graph_input name prefixing; new graph_module_visualizer for Graphviz diagram generation with structure analysis; preserved model config during export.
Transform Orchestration
tensorrt_llm/_torch/auto_deploy/transform/optimizer.py
Added diagram generation on transform completion when AD_DEBUG_VISUALIZE_DIR is set for debugging.
User-Facing Scripts and Tests
examples/auto_deploy/onnx_export_llm.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_export_onnx.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_fuse_rope_attention.py
New example script demonstrating ONNX export workflow; integration test validating full export pipeline (files, ONNX model structure, inputs/outputs); unit test for RoPE attention fusion with model pattern validation.
Reference Implementation Updates
tests/unittest/_torch/auto_deploy/_utils_test/torch_attention_reference.py
Updated torch op invocations to use .default(...) wrapper for cached attention operations.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant AutoDeployConfig
    participant Optimizer
    participant Transforms as Transform<br/>Pipeline
    participant ONNX as ONNX Export
    participant Files as Output Files
    
    User->>AutoDeployConfig: Create with mode=<br/>"export_driveos_llm_onnx"
    AutoDeployConfig->>AutoDeployConfig: Load config from<br/>export_driveos_llm_onnx.yaml
    User->>Optimizer: Build and optimize<br/>graph
    
    rect rgb(220, 240, 255)
    Note over Transforms: Transform Pipeline Execution
    
    Optimizer->>Transforms: FuseRopeAttention<br/>→ Pattern match + fuse RoPE+attention
    Transforms->>Transforms: Add context_lengths,<br/>rope_rotary_cos_sin placeholders
    
    Optimizer->>Transforms: AdaptToDriveOSLLM<br/>→ Convert to float16, insert casts
    
    Optimizer->>Transforms: GatherLastTokenIds<br/>→ Add token gathering
    
    Optimizer->>Transforms: ShortReshapeAttentionOutput<br/>→ Optimize reshape nodes
    end
    
    Optimizer->>ONNX: ExportToONNX transform
    rect rgb(240, 255, 240)
    Note over ONNX: ONNX Export Phase
    ONNX->>ONNX: Prepare dynamic shapes<br/>and placeholders
    ONNX->>ONNX: Register custom ops<br/>(AttentionPlugin, GatherND, RoPE)
    ONNX->>ONNX: Call torch.onnx.export<br/>with dynamo=True
    end
    
    ONNX->>Files: Export model.onnx
    ONNX->>Files: Export config.json<br/>(via _config_export)
    ONNX->>Files: Export processed_<br/>chat_template.json
    ONNX->>Files: Copy tokenizer files<br/>(vocab, tokens, maps)
    
    Files->>User: Return output directory<br/>with all artifacts

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Areas requiring extra attention:

  • export_to_onnx.py: Dense logic with multiple custom ONNX op schema definitions, translation hooks, and dynamic shape construction; validate custom op signatures and ONNX compatibility.
  • fuse_rope_attention.py: Pattern matching and graph rewriting with placeholder augmentation and dead-code elimination; ensure correctness of node traversal and replacement logic.
  • onnx_attention.py: Custom op definitions with shape inference and fake implementations; verify tensor shape contracts and export-time behavior consistency.
  • graph_module_visualizer.py: Complex port assignment logic with edge rendering, node labeling, and depth calculation; check correctness of graph structure analysis.
  • _config_export.py and _chat_template.py: Configuration extraction with field validation and optional multimodal handling; validate error cases and field fallbacks.
  • Integration across transforms: Verify correct sequencing and data flow through multiple graph transformation stages in export_driveos_llm_onnx.yaml and optimizer.py.
  • Test coverage: New test files validate end-to-end export and fusion; ensure parametrization covers edge cases (batch sizes, sequence lengths, model architectures).

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 70.21%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (2 passed)

  • Title check (✅ Passed): The PR title '[None][feat] Export ONNX for DriveOS LLM' clearly describes the main feature addition of ONNX export capability for DriveOS LLM.
  • Description check (✅ Passed): The PR description clearly explains the issue, solution, and workflow improvements. It includes the original vs. new workflow, key differences in the new AutoDeploy configuration, and mentions visualization and test coverage.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 20

🧹 Nitpick comments (22)
tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py (1)

9-12: Remove unused imports or clarify their purpose.

The three new imports (_detect_fake_mode_from_gm, FakeTensor, _extract_tensor_metadata) are not referenced anywhere in this file. Additionally, these are private PyTorch APIs (indicated by _ prefix), which may change without notice in future PyTorch releases.

If these imports are intended for near-term use in this file, consider adding a TODO comment explaining their purpose. Otherwise, remove them to keep the imports clean.

🔎 Apply this diff to remove the unused imports:
 from pydantic import Field
-from torch._export.utils import _detect_fake_mode_from_gm
-from torch._subclasses import FakeTensor
 from torch.fx import GraphModule, Node
-from torch.fx.passes.shape_prop import _extract_tensor_metadata
 
 from ...custom_ops.attention_interface import (
tests/unittest/_torch/auto_deploy/_utils_test/torch_attention_reference.py (1)

1-1: Verify NVIDIA copyright header presence.

According to coding guidelines, all TensorRT-LLM code should contain an NVIDIA copyright header with the year of latest meaningful modification. No copyright header is visible at the start of this file. Since this file is being modified in 2025, please verify whether the header exists or needs to be added.

As per coding guidelines, all TensorRT-LLM OSS code requires an NVIDIA copyright header.

tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.py (3)

72-80: Avoid direct tuple mutation.

Line 78 directly assigns to cast_node.args, which mutates a tuple in place. While this works for FX graph nodes, it's safer to use the provided update_arg method for consistency and clarity.

🔎 Apply this diff to use the update_arg method:
     num_changed = 0
     for cast_node in cast_nodes:
         if cast_node.args[1] == torch.bfloat16:
-            cast_node.args = (cast_node.args[0], torch.float16)
+            cast_node.update_arg(1, torch.float16)
             num_changed += 1
     return num_changed

82-83: Clarify return type of _to_float16.

The method signature suggests it returns bool, but it calls gm.half() (which returns None implicitly). Consider either removing the return type annotation or explicitly returning a meaningful value.

🔎 Apply this diff to fix the return type:
-    def _to_float16(self, gm: GraphModule) -> bool:
+    def _to_float16(self, gm: GraphModule) -> None:
         gm.half()

16-28: Consider adding a docstring.

The method performs non-trivial graph manipulation. A docstring would improve maintainability by documenting its purpose, the transformation it applies, and what it returns.

examples/auto_deploy/onnx_export_llm.py (1)

3-3: Remove unused import.

The onnxscript.opset22 import is not used in this file.

🔎 Apply this diff to remove the unused import:
 import argparse
 
-from onnxscript import opset22 as opset22
-
 from tensorrt_llm._torch.auto_deploy import LLM, AutoDeployConfig
tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_export_onnx.py (1)

12-12: Consider using tempfile.mkdtemp() for test output directory.

The hardcoded /tmp/test_ad_export_onnx_qwen2.5-0.5b path may cause issues in environments where /tmp is not available or on Windows. Consider using Python's tempfile module for more portable tests.

🔎 Example using tempfile:
import tempfile
import shutil

@pytest.mark.parametrize(
    "model, max_batch_size, max_seq_len, num_attn_ops",
    [
        ("Qwen/Qwen2.5-0.5B", 13, 4, 24),
    ],
)
def test_ad_export_onnx(
    model: str, max_batch_size: int, max_seq_len: int, num_attn_ops: int
):
    output_dir = tempfile.mkdtemp(prefix="test_ad_export_onnx_")
    try:
        ad_config = AutoDeployConfig(
            model=model,
            mode="export_driveos_llm_onnx",
            max_batch_size=max_batch_size,
            max_seq_len=max_seq_len,
        )
        ad_config.transforms["export_to_onnx"]["output_dir"] = output_dir
        # ... rest of test
    finally:
        shutil.rmtree(output_dir, ignore_errors=True)
tensorrt_llm/_torch/auto_deploy/config/export_driveos_llm_onnx.yaml (1)

1-3: Missing NVIDIA copyright header.

Per the coding guidelines, all TensorRT-LLM Open Source Software code (including .py files referenced by .yaml configs) should contain an NVIDIA copyright header. While YAML config files may have different requirements, consider adding a copyright header for consistency with other configuration files in the repository.

🔎 Suggested header:
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 # This is the set of transforms running in "graph" mode. In this mode, we capture the full graph
 # of the model and optimize it for inference.
 transforms:
tensorrt_llm/_torch/auto_deploy/transform/library/short_reshape_attention_output.py (1)

14-39: Typo in method name: "accending" → "ascending".

🔎 Fix the typo:
-    def _lookup_accending_node(self, node: Node, target, max_depth: int = 3) -> Node:
+    def _lookup_ascending_node(self, node: Node, target, max_depth: int = 3) -> Node:
         if max_depth == 0:
             return None
         if node.target == target:
             return node

         # Helper function to check a single node
         def check_node(n):
             if isinstance(n, Node):
-                result = self._lookup_accending_node(n, target, max_depth - 1)
+                result = self._lookup_ascending_node(n, target, max_depth - 1)
                 if result is not None:
                     return result
             return None

Also update the call sites at lines 49-50 and 56-57.

tensorrt_llm/_torch/auto_deploy/transform/graph_module_visualizer.py (3)

514-521: Broad exception handling may hide errors.

Catching bare Exception can mask underlying issues. Consider catching more specific exceptions (e.g., graphviz.ExecutableNotFound, OSError).

🔎 Suggested fix:
     try:
         dot.render(save_path, cleanup=True)
         print(f"✅ Diagram saved: {save_path}.{format}")
         with open(save_path + ".txt", "w") as f:
             f.write(str(graph_module.graph))
-    except Exception as e:
+    except (OSError, IOError) as e:
         print(f"❌ Failed to save diagram: {e}")

386-399: Remove commented-out debug code.

These commented-out print statements appear to be debugging artifacts. Consider removing them for cleaner code.

🔎 Clean up:
-    # Print 10 nodes with most inputs
-    # print("Nodes with most inputs:")
-    # node_inputs_sorted = sorted(node_inputs.items(), key=lambda x: len(x[1]), reverse=True)
-    # for node_name, input_list in node_inputs_sorted[:10]:
-    #     print(f"  {node_name}: {len(input_list)}")

-    # Print 10 nodes with most outputs
     node_outputs_sorted = sorted(node_outputs.items(), key=lambda x: len(x[1]), reverse=True)
-    # print("Nodes with most outputs:")
     large_fanout_nodes: Dict[str, int] = {}
     for node_name, output_list in node_outputs_sorted[:10]:
         if len(output_list) > 10:
             large_fanout_nodes[node_name] = 0
-        # print(f"  {node_name}: {len(output_list)}")

617-636: Narrow the exception handling.

The bare Exception catch could hide unexpected errors. Consider catching AttributeError specifically since get_submodule may fail if the target path doesn't exist.

🔎 Suggested fix:
     try:
         # Try to get actual module type name
         actual_module = graph_module.get_submodule(str(target))
         module_type = actual_module.__class__.__name__
         ...
-    except Exception:
+    except (AttributeError, KeyError):
         # If unable to get module, fall back to original logic
         module_name = str(target).split(".")[-1] if "." in str(target) else str(target)
         return module_name
tensorrt_llm/_torch/auto_deploy/transform/library/_config_export.py (1)

54-57: Use logging instead of print for warnings.

The codebase appears to use ad_logger for logging in other transform files. Consider using the logger for consistency and better control over log levels.

🔎 Suggested approach:
+from ...utils.logger import ad_logger
 ...
-        print(
-            "Warning: head_dim not found in config, calculating as hidden_size // num_attention_heads"
-        )
+        ad_logger.warning(
+            "head_dim not found in config, calculating as hidden_size // num_attention_heads"
+        )

Apply the same pattern at lines 92-94, 128-130, and 146-149.

tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py (2)

160-171: Consider narrowing the exception type.

The except Exception clause is overly broad. Since you're accessing shape info from tensor metadata, consider catching more specific exceptions like AttributeError, TypeError, or IndexError.

🔎 Suggested fix:
-            except Exception as e:
+            except (AttributeError, TypeError, IndexError) as e:
                 ad_logger.error(f"  Skipping: failed to extract head_dim: {e}")
                 continue

176-196: Consider narrowing the exception type.

Similar to the previous block, this except Exception clause should catch specific exceptions for tensor metadata access.

🔎 Suggested fix:
-            except Exception as e:
+            except (AttributeError, TypeError, IndexError) as e:
                 ad_logger.error(f"  Skipping: failed to calculate num_heads: {e}")
                 continue
tensorrt_llm/_torch/auto_deploy/transform/library/_chat_template.py (3)

48-59: Use logger instead of print and clarify message.

The function uses print() statements and the message "Set use_prompt_tuning" seems unrelated to the is_vlm check. Consider using ad_logger and updating the message to reflect the actual VLM detection.

🔎 Suggested fix:
+from ...utils.logger import ad_logger
+
 def is_vlm(model_dir: str) -> bool:
     """Check if the model is a VLM."""
     cfg = AutoConfig.from_pretrained(model_dir, trust_remote_code=True)
     cfg_dict = cfg.to_dict()
     has_vision = "vision_config" in cfg_dict
     has_phi4_vision = "image_embd_layer" in cfg_dict.get("embd_layer", {})
     if has_vision or has_phi4_vision:
-        print("Set use_prompt_tuning to True")
+        ad_logger.debug("Detected VLM model (has vision config)")
         return True
     else:
-        print("Set use_prompt_tuning to False")
+        ad_logger.debug("Detected text-only LLM model")
         return False

124-154: Consider logging the first exception for debugging.

The first except Exception silently falls through to the fallback. Consider logging the original exception to help debug cases where neither path works.

🔎 Suggested fix:
     try:
         # Convert dataclass messages to dictionaries using asdict
         message_dicts = [asdict(msg) for msg in messages]

         return tokenizer.apply_chat_template(
             message_dicts, tokenize=False, add_generation_prompt=add_generation_prompt
         )
-    except Exception:
+    except Exception as e1:
         # Try fallback: convert list content to string for tokenizers that don't support multimodal
         try:

315-336: Use logger for consistency with the codebase.

Multiple print() calls should use ad_logger for consistency with other modules in the codebase.

tensorrt_llm/_torch/auto_deploy/transform/library/export_to_onnx.py (4)

411-424: Use logger instead of print and remove unused args list.

The args list is created but always empty since all placeholders go into kwargs. Also, print() should be replaced with ad_logger.

🔎 Suggested fix:
-        args = []
         kwargs = {}
         placeholders = gm.graph.find_nodes(op="placeholder")
         for ph in placeholders:
             kwargs[ph.name] = ph.meta["val"]
-        args = tuple(args)

-        print("Placeholders args:")
-        for i, e in enumerate(args):
-            print(f"  {i}: {placeholders[i].name:20} {e}")
-
-        print("Placeholders kwargs:")
+        ad_logger.debug("Placeholders kwargs:")
         for k, v in kwargs.items():
-            print(f"  {k}: {v}")
+            ad_logger.debug(f"  {k}: {v}")

440-461: Hardcoded magic numbers for dynamic shape bounds.

The values 16 for rope_batch_size and 4096 for max_position_embeddings/past_len are hardcoded. Consider deriving these from model config or making them configurable.

🔎 Suggested deriving from config:
+        max_position = getattr(gm.config, 'max_position_embeddings', 4096)
+
         dynamic_shapes["rope_rotary_cos_sin"] = {
-            0: Dim("rope_batch_size", min=1, max=16),
-            1: Dim("max_position_embeddings", min=1, max=4096),
+            0: Dim("rope_batch_size", min=1, max=cm.info.max_batch_size),
+            1: Dim("max_position_embeddings", min=1, max=max_position),
         }
         # ...
         for i in range(num_layers):
             dynamic_shapes[f"past_key_values_{i}"] = {
-                3: Dim("past_len", min=1, max=4096),
+                3: Dim("past_len", min=1, max=max_position),
             }

496-511: Remove empty args tuple and dead comment.

Since args is always empty, passing tuple(args) is unnecessary. The commented line 508 should be removed.

🔎 Suggested fix:
         torch.onnx.export(
             gm,
-            tuple(args),
+            (),
             output_path,
             opset_version=20,
             kwargs=kwargs,
             dynamo=True,
             dynamic_shapes=dynamic_shapes,
             report=False,
             output_names=output_names,
             custom_translation_table=custom_translation_table,
         )
-        # export_output.save(output_path)

         ad_logger.info(f"Successfully exported ONNX model to {output_path}")
         return True

513-525: Use logger instead of print.

Replace print() calls with ad_logger for consistency.

🔎 Suggested fix:
         if reduced_vocab_size is not None:
             model_config["reduced_vocab_size"] = reduced_vocab_size
-            print(f"Added reduced_vocab_size={reduced_vocab_size} to config")
+            ad_logger.info(f"Added reduced_vocab_size={reduced_vocab_size} to config")

         config_path = os.path.join(output_dir, "config.json")
         with open(config_path, "w") as f:
             json.dump(model_config, f, indent=2)
-        print(f"Model configuration saved to {config_path}")
+        ad_logger.info(f"Model configuration saved to {config_path}")
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between df15be3 and a6cf560.

📒 Files selected for processing (24)
  • .gitignore (1 hunks)
  • docker/common/install_base.sh (1 hunks)
  • examples/auto_deploy/onnx_export_llm.py (1 hunks)
  • requirements.txt (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/config/export_driveos_llm_onnx.yaml (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/config/export_driveos_llm_onnx_debug.yaml (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/onnx_attention.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/llm_args.py (2 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/graph_module_visualizer.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/interface.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/_chat_template.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/_config_export.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py (2 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/export_to_onnx.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/gather_last_token_ids.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/short_reshape_attention_output.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/optimizer.py (2 hunks)
  • tensorrt_llm/_torch/auto_deploy/utils/_graph.py (4 hunks)
  • tests/unittest/_torch/auto_deploy/_utils_test/torch_attention_reference.py (4 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_export_onnx.py (1 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_fuse_rope_attention.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces. Do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used
Python files should use snake_case naming: some_file.py
Python classes should use PascalCase naming: class SomeClass
Python functions and methods should use snake_case naming: def my_awesome_function():
Python local variables should use snake_case naming: my_variable = ...
Python variable names that start with a number should be prefixed with 'k': k_99th_percentile = ...
Python global variables should use upper snake_case with prefix 'G': G_MY_GLOBAL = ...
Python constants should use upper snake_case naming: MY_CONSTANT = ...
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings in Python for classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except to the smallest set of errors possible
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible, using the else block for logic

Files:

  • tensorrt_llm/_torch/auto_deploy/transform/optimizer.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/gather_last_token_ids.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_export_onnx.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py
  • tensorrt_llm/_torch/auto_deploy/transform/interface.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/short_reshape_attention_output.py
  • examples/auto_deploy/onnx_export_llm.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py
  • tensorrt_llm/_torch/auto_deploy/transform/graph_module_visualizer.py
  • tensorrt_llm/_torch/auto_deploy/llm_args.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/export_to_onnx.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/onnx_attention.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_fuse_rope_attention.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/_chat_template.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/_config_export.py
  • tensorrt_llm/_torch/auto_deploy/utils/_graph.py
  • tests/unittest/_torch/auto_deploy/_utils_test/torch_attention_reference.py
**/*.{cpp,h,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification

Files:

  • tensorrt_llm/_torch/auto_deploy/transform/optimizer.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/gather_last_token_ids.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_export_onnx.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py
  • tensorrt_llm/_torch/auto_deploy/transform/interface.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/short_reshape_attention_output.py
  • examples/auto_deploy/onnx_export_llm.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py
  • tensorrt_llm/_torch/auto_deploy/transform/graph_module_visualizer.py
  • tensorrt_llm/_torch/auto_deploy/llm_args.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/export_to_onnx.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/onnx_attention.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_fuse_rope_attention.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/_chat_template.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/_config_export.py
  • tensorrt_llm/_torch/auto_deploy/utils/_graph.py
  • tests/unittest/_torch/auto_deploy/_utils_test/torch_attention_reference.py
🧠 Learnings (10)
📚 Learning: 2025-08-26T09:49:04.956Z
Learnt from: pengbowang-nv
Repo: NVIDIA/TensorRT-LLM PR: 7192
File: tests/integration/test_lists/test-db/l0_dgx_b200.yml:56-72
Timestamp: 2025-08-26T09:49:04.956Z
Learning: In TensorRT-LLM test configuration files, the test scheduling system handles wildcard matching with special rules that prevent duplicate test execution even when the same tests appear in multiple yaml files with overlapping GPU wildcards (e.g., "*b200*" and "*gb200*").

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/config/export_driveos_llm_onnx.yaml
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_export_onnx.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/_config_export.py
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.py
📚 Learning: 2025-10-20T17:09:21.560Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/transform/library/rms_norm.py:180-182
Timestamp: 2025-10-20T17:09:21.560Z
Learning: In tensorrt_llm/_torch/auto_deploy/transform/library/rms_norm.py, the _gated_rmsnorm_replacement function does not need to cast the output of torch.ops.auto_deploy.torch_rmsnorm_gated back to the input dtype, even though the custom op returns fp32. The dtype handling is managed elsewhere or the fp32 output is acceptable for downstream consumers.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.py
📚 Learning: 2025-10-20T16:54:09.824Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py:6-6
Timestamp: 2025-10-20T16:54:09.824Z
Learning: In tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py, the import `from ...modules.mamba.layernorm_gated import _layer_norm_fwd` is correct and should not be changed to modules.fla.layernorm_gated. The _layer_norm_fwd function exists in both modules/mamba/layernorm_gated.py and modules/fla/layernorm_gated.py, but the mamba version is the intended implementation for this use case.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/custom_ops/onnx_attention.py
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation with asserts for total size and TP divisibility.

Applied to files:

  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_fuse_rope_attention.py
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation.

Applied to files:

  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_fuse_rope_attention.py
📚 Learning: 2025-08-09T02:04:49.623Z
Learnt from: Fridah-nv
Repo: NVIDIA/TensorRT-LLM PR: 6760
File: tensorrt_llm/_torch/auto_deploy/models/quant_config_reader.py:81-98
Timestamp: 2025-08-09T02:04:49.623Z
Learning: In TensorRT-LLM's auto_deploy module, torch.dtype values in configuration dictionaries must be stored as string representations (e.g., "float16" instead of torch.float16) because OmegaConf.merge does not support torch.dtype types. These string representations are converted to actual torch.dtype objects in downstream code.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/transform/library/_config_export.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/transform/library/_config_export.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM's bench configuration, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which is a Dict[str, Any] that can contain default values including `cuda_graph_config`, making the fallback `llm_args["cuda_graph_config"]` safe to use.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/transform/library/_config_export.py
🧬 Code graph analysis (13)
tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_export_onnx.py (1)
tensorrt_llm/_torch/auto_deploy/llm_args.py (2)
  • AutoDeployConfig (54-339)
  • to_llm_kwargs (315-325)
tensorrt_llm/_torch/auto_deploy/transform/library/short_reshape_attention_output.py (4)
tensorrt_llm/_torch/auto_deploy/models/factory.py (1)
  • ModelFactory (94-346)
tensorrt_llm/_torch/auto_deploy/shim/interface.py (1)
  • CachedSequenceInterface (11-92)
tensorrt_llm/_torch/auto_deploy/utils/_graph.py (1)
  • run_shape_prop (223-248)
tensorrt_llm/_torch/auto_deploy/transform/interface.py (6)
  • BaseTransform (220-507)
  • SharedConfig (62-69)
  • TransformInfo (124-181)
  • TransformRegistry (510-538)
  • register (516-523)
  • _apply (482-495)
examples/auto_deploy/onnx_export_llm.py (1)
tensorrt_llm/_torch/auto_deploy/llm_args.py (1)
  • to_llm_kwargs (315-325)
tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.py (9)
tensorrt_llm/_torch/auto_deploy/models/factory.py (1)
  • ModelFactory (94-346)
tensorrt_llm/_torch/auto_deploy/shim/interface.py (1)
  • CachedSequenceInterface (11-92)
tensorrt_llm/_torch/auto_deploy/utils/node_utils.py (1)
  • is_op (198-221)
tensorrt_llm/_torch/auto_deploy/transform/interface.py (3)
  • BaseTransform (220-507)
  • TransformInfo (124-181)
  • _apply (482-495)
tensorrt_llm/_torch/auto_deploy/transform/library/fused_moe.py (1)
  • target (602-603)
tensorrt_llm/functional.py (1)
  • replace_all_uses_with (556-573)
tensorrt_llm/_torch/auto_deploy/transform/library/gather_last_token_ids.py (1)
  • _apply (119-159)
tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py (1)
  • _apply (156-241)
tensorrt_llm/_torch/auto_deploy/transform/library/short_reshape_attention_output.py (1)
  • _apply (66-113)
tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py (6)
tensorrt_llm/_torch/auto_deploy/shim/interface.py (1)
  • CachedSequenceInterface (11-92)
tensorrt_llm/_torch/auto_deploy/utils/_graph.py (3)
  • add_graph_input (251-315)
  • add_graph_output (350-434)
  • remove_graph_input (502-561)
tensorrt_llm/_torch/auto_deploy/utils/node_utils.py (1)
  • is_op (198-221)
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (1)
  • torch_attention (96-212)
tensorrt_llm/_torch/auto_deploy/utils/pattern_matcher.py (1)
  • call_function (249-276)
tensorrt_llm/_torch/auto_deploy/custom_ops/onnx_attention.py (1)
  • AttentionPlugin (9-25)
tensorrt_llm/_torch/auto_deploy/transform/graph_module_visualizer.py (6)
tensorrt_llm/_torch/auto_deploy/transform/interface.py (1)
  • get (526-528)
tensorrt_llm/_torch/auto_deploy/transform/library/fused_moe.py (1)
  • target (602-603)
tensorrt_llm/_utils.py (1)
  • numel (1002-1003)
cpp/tensorrt_llm/thop/alltoallOp.cpp (1)
  • output_list (73-73)
docker/common/install_base.sh (1)
  • cleanup (23-44)
cpp/kernels/xqa/mha_sm90.cu (1)
  • tokens (529-532)
tensorrt_llm/_torch/auto_deploy/llm_args.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
  • Field (68-95)
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_fuse_rope_attention.py (3)
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (1)
  • torch_attention (96-212)
tensorrt_llm/_torch/auto_deploy/export/export.py (1)
  • torch_export_to_gm (276-344)
tensorrt_llm/_torch/auto_deploy/transform/optimizer.py (1)
  • InferenceOptimizer (24-94)
tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py (1)
tensorrt_llm/_torch/autotuner.py (1)
  • FakeTensor (154-157)
tensorrt_llm/_torch/auto_deploy/transform/library/_chat_template.py (1)
tests/unittest/llmapi/apps/test_chat_utils.py (1)
  • chat_template_path (188-193)
tensorrt_llm/_torch/auto_deploy/transform/library/_config_export.py (3)
tensorrt_llm/_torch/auto_deploy/transform/interface.py (1)
  • get (526-528)
tensorrt_llm/_torch/auto_deploy/llm_args.py (2)
  • to_dict (311-313)
  • to_dict (446-451)
tensorrt_llm/bench/benchmark/__init__.py (1)
  • model_type (70-71)
tensorrt_llm/_torch/auto_deploy/utils/_graph.py (1)
tensorrt_llm/_torch/auto_deploy/shim/interface.py (1)
  • args (28-30)
tests/unittest/_torch/auto_deploy/_utils_test/torch_attention_reference.py (1)
tensorrt_llm/builder.py (1)
  • default (45-50)
🪛 Ruff (0.14.8)
tensorrt_llm/_torch/auto_deploy/transform/library/gather_last_token_ids.py

122-122: Unused method argument: cm

(ARG002)


123-123: Unused method argument: factory

(ARG002)


124-124: Unused method argument: shared_config

(ARG002)

tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_export_onnx.py

12-12: Probable insecure usage of temporary file or directory: "/tmp/test_ad_export_onnx_qwen2.5-0.5b"

(S108)

tensorrt_llm/_torch/auto_deploy/transform/library/short_reshape_attention_output.py

69-69: Unused method argument: cm

(ARG002)


70-70: Unused method argument: factory

(ARG002)


71-71: Unused method argument: shared_config

(ARG002)

tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.py

88-88: Unused method argument: cm

(ARG002)


89-89: Unused method argument: factory

(ARG002)


90-90: Unused method argument: shared_config

(ARG002)

tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py

169-169: Do not catch blind exception: Exception

(BLE001)


194-194: Do not catch blind exception: Exception

(BLE001)


216-216: Unused method argument: cm

(ARG002)


281-281: Unused method argument: cm

(ARG002)


381-381: Consider (input_ids_node, *node.args[1:]) instead of concatenation

Replace with (input_ids_node, *node.args[1:])

(RUF005)


390-390: Unused method argument: factory

(ARG002)


391-391: Unused method argument: shared_config

(ARG002)

tensorrt_llm/_torch/auto_deploy/transform/graph_module_visualizer.py

77-77: Consider moving this statement to an else block

(TRY300)


191-193: Avoid specifying long messages outside the exception class

(TRY003)


433-433: Loop control variable target_name not used within loop body

(B007)


433-433: Loop control variable input_idx not used within loop body

(B007)


519-519: Do not catch blind exception: Exception

(BLE001)


632-632: Do not catch blind exception: Exception

(BLE001)

tensorrt_llm/_torch/auto_deploy/transform/library/export_to_onnx.py

408-408: Unused method argument: factory

(ARG002)


409-409: Unused method argument: shared_config

(ARG002)


530-530: Unused method argument: cm

(ARG002)


532-532: Unused method argument: shared_config

(ARG002)

tensorrt_llm/_torch/auto_deploy/custom_ops/onnx_attention.py

13-13: Unused function argument: context_lengths

(ARG001)


14-14: Unused function argument: rope_rotary_cos_sin

(ARG001)


15-15: Unused function argument: kvcache_start_index

(ARG001)


17-17: Unused function argument: enable_tree_attention

(ARG001)


18-18: Unused function argument: head_size

(ARG001)


19-19: Unused function argument: num_kv_heads

(ARG001)


20-20: Unused function argument: num_q_heads

(ARG001)


32-32: Unused function argument: context_lengths

(ARG001)


33-33: Unused function argument: rope_rotary_cos_sin

(ARG001)


34-34: Unused function argument: kvcache_start_index

(ARG001)


35-35: Unused function argument: enable_tree_attention

(ARG001)

tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_fuse_rope_attention.py

61-61: Unused method argument: position_ids

(ARG002)


142-142: Unused function argument: atol

(ARG001)


143-143: Unused function argument: rtol

(ARG001)

tensorrt_llm/_torch/auto_deploy/transform/library/_chat_template.py

131-131: Do not catch blind exception: Exception

(BLE001)


149-154: Avoid specifying long messages outside the exception class

(TRY003)


251-251: Avoid specifying long messages outside the exception class

(TRY003)


257-257: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


257-257: Avoid specifying long messages outside the exception class

(TRY003)


261-263: Prefer TypeError exception for invalid type

(TRY004)


261-263: Avoid specifying long messages outside the exception class

(TRY003)


269-269: Avoid specifying long messages outside the exception class

(TRY003)


330-330: Do not catch blind exception: Exception

(BLE001)

tensorrt_llm/_torch/auto_deploy/transform/library/_config_export.py

38-38: Avoid specifying long messages outside the exception class

(TRY003)


45-45: Avoid specifying long messages outside the exception class

(TRY003)


85-85: Avoid specifying long messages outside the exception class

(TRY003)


121-121: Avoid specifying long messages outside the exception class

(TRY003)


135-135: Avoid specifying long messages outside the exception class

(TRY003)


164-166: Avoid specifying long messages outside the exception class

(TRY003)


195-195: Avoid specifying long messages outside the exception class

(TRY003)

tensorrt_llm/_torch/auto_deploy/utils/_graph.py

371-371: Avoid specifying long messages outside the exception class

(TRY003)


382-382: Consider (*tuple(current_outputs), output_node) instead of concatenation

Replace with (*tuple(current_outputs), output_node)

(RUF005)


466-469: Avoid specifying long messages outside the exception class

(TRY003)


479-482: Avoid specifying long messages outside the exception class

(TRY003)


486-489: Avoid specifying long messages outside the exception class

(TRY003)


495-497: Avoid specifying long messages outside the exception class

(TRY003)


540-544: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (35)
tests/unittest/_torch/auto_deploy/_utils_test/torch_attention_reference.py (1)

59-59: LGTM! Consistent transition to explicit .default dispatcher.

The addition of .default to all torch op invocations correctly uses PyTorch's dispatcher pattern to explicitly call the default implementation. The changes are mechanical, consistent across all four call sites, and maintain identical arguments and behavior.

Also applies to: 100-100, 151-151, 188-188

.gitignore (1)

3-3: LGTM!

Adding .cursor follows the existing pattern for ignoring IDE-specific directories (.vscode, .idea).

docker/common/install_base.sh (1)

56-56: LGTM!

Adding system-level graphviz correctly supports the Python graphviz package for graph visualization in the ONNX export workflow.

tensorrt_llm/_torch/auto_deploy/utils/_graph.py (3)

257-294: LGTM!

The name_prefix parameter is consistently applied across the placeholder name, orig_args, and context, maintaining backward compatibility with the default "arg_" prefix.


437-499: LGTM!

Robust helper with clear validation and descriptive error messages for both integer index and Node inputs. The detailed exception messages aid debugging despite the static analysis hints.


392-395: Incorrect context initialization for single-output conversion.

When converting a single-output graph to a tuple spec, context=["output"] creates a mismatch - there's one child spec (_LEAF_SPEC) but the context name doesn't correspond to it properly. This could cause issues when later appending to out_spec.context at line 399.

🔎 Suggested fix
     if out_spec == _LEAF_SPEC:
-        new_out_spec = TreeSpec(type=tuple, children_specs=[_LEAF_SPEC], context=["output"])
+        new_out_spec = TreeSpec(type=tuple, children_specs=[_LEAF_SPEC, _LEAF_SPEC], context=["output", name])
         graph._codegen.pytree_info = graph._codegen.pytree_info._replace(out_spec=new_out_spec)
         out_spec = graph._codegen.pytree_info.out_spec
+        # Already added both specs, skip the append below
+        object.__setattr__(out_spec, "type", dict)
+        return

Alternatively, if the intent is to just wrap the existing output, ensure the context reflects the original output name.

Likely an incorrect or invalid review comment.

tensorrt_llm/_torch/auto_deploy/custom_ops/onnx_attention.py (1)

8-48: LGTM!

The AttentionPlugin custom op and its fake implementation correctly define the ONNX export placeholder with proper shape inference. The unused arguments (flagged by static analysis) are expected for signature matching with the target ONNX op.

tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.py (1)

85-105: Unused method arguments flagged by static analysis.

The parameters cm, factory, and shared_config are part of the BaseTransform interface but unused in this implementation. This is acceptable if they're not needed for this transform, but consider documenting why they're unused.

Based on learnings and the BaseTransform interface pattern, these parameters are part of the standard transform signature and may be used by other transforms. The unused parameters here are acceptable.

tensorrt_llm/_torch/auto_deploy/transform/interface.py (1)

52-52: LGTM!

The new EXPORT_ONNX stage is correctly positioned between VISUALIZE and COMPILE, which makes sense for the export workflow. The enum ordering is preserved for stage comparison logic.

tensorrt_llm/_torch/auto_deploy/transform/optimizer.py (1)

13-13: LGTM!

The addition of graph visualization capability via to_dot is well-implemented and appropriately gated behind an environment variable for debugging purposes.

tensorrt_llm/_torch/auto_deploy/llm_args.py (1)

154-154: LGTM!

The addition of the "export_driveos_llm_onnx" mode and its YAML mapping follows the existing pattern for mode configuration. The integration is consistent with the broader ONNX export workflow introduced in this PR.

Also applies to: 337-337

examples/auto_deploy/onnx_export_llm.py (1)

31-35: Document the batch_size workaround.

The comment reveals unexpected behavior where max_batch_size=2 causes static shape collapse while max_batch_size=13 enables dynamic shapes. This suggests a potential underlying issue with dynamic shape handling that warrants investigation or proper documentation.

Could you clarify whether this is:

  1. Expected behavior that should be documented in user-facing docs?
  2. A known limitation/bug that should be tracked in an issue?
  3. A temporary workaround that will be fixed?

If this is a known limitation, consider adding a more detailed explanation or reference to an issue tracker.

tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_export_onnx.py (2)

42-45: LGTM!

The test comprehensively validates the ONNX export output, including file existence, operator counts, and input/output structure. The assertions properly verify the expected graph structure with AttentionPlugin nodes and named inputs/outputs.
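
For readers unfamiliar with the shape of such a test, the structural checks look roughly like this (paths, names, and counts are illustrative, not the test's literal values):

import os
from collections import Counter

import onnx

onnx_path = "/tmp/driveos_export/model.onnx"  # hypothetical output location
assert os.path.exists(onnx_path)

model = onnx.load(onnx_path)
op_counts = Counter(node.op_type for node in model.graph.node)
assert op_counts["AttentionPlugin"] == 24  # one fused op per attention layer

input_names = {i.name for i in model.graph.input}
output_names = {o.name for o in model.graph.output}
assert "input_ids" in input_names  # assumed input name
assert len(output_names) > 1  # logits plus the per-layer KV-cache outputs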


1-1: Add NVIDIA copyright header.

All TensorRT-LLM files should include an NVIDIA copyright header.

As per coding guidelines:

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification

🔎 Apply this diff to add the copyright header:
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 import os
⛔ Skipped due to learnings
Learnt from: CR
Repo: NVIDIA/TensorRT-LLM PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-12-17T22:39:44.230Z
Learning: Applies to **/*.{cpp,h,cu,cuh,py} : All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification
Learnt from: galagam
Repo: NVIDIA/TensorRT-LLM PR: 6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Learnt from: xinhe-nv
Repo: NVIDIA/TensorRT-LLM PR: 8534
File: scripts/format_test_list.py:1-6
Timestamp: 2025-10-22T06:53:47.017Z
Learning: The file `scripts/format_test_list.py` in the TensorRT-LLM repository does not require the NVIDIA Apache-2.0 copyright header.
Learnt from: tburt-nv
Repo: NVIDIA/TensorRT-LLM PR: 9881
File: cpp/kernels/fmha_v2/train_ops/train_setup.py:35-36
Timestamp: 2025-12-10T19:14:58.432Z
Learning: In cpp/kernels/fmha_v2/train_ops/train_setup.py, the embedded C++ template copyright headers (fmha_dgrad_v2_flash_attention_template and fmha_fprop_v2_flash_attention_template) use copyright dates "2011-2023" because the file was imported without meaningful changes since its original creation. The original copyright dates are preserved for historical accuracy rather than updated to the current year.
Learnt from: CR
Repo: NVIDIA/TensorRT-LLM PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-12-17T22:39:44.230Z
Learning: Applies to **/*.py : Code developed for TensorRT-LLM should conform to Python 3.8+
Learnt from: CR
Repo: NVIDIA/TensorRT-LLM PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-12-17T22:39:44.230Z
Learning: Applies to **/*.h : Use a preprocessor guard in C++ header files with the format `TRTLLM_<FILENAME>_H` derived from the filename in all caps
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_fuse_rope_attention.py (3)

136-144: Unused test parameters are acceptable.

The static analysis flags position_ids, atol, and rtol as unused. However:

  • position_ids is part of the model's forward signature and may be used in future test variations
  • atol/rtol parameters are defined for potential numerical comparison testing

These are acceptable for test infrastructure flexibility.

Also applies to: 192-219


1-1: Add NVIDIA copyright header.

All TensorRT-LLM files should include an NVIDIA copyright header.

As per coding guidelines:

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification

🔎 Apply this diff to add the copyright header:
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 """
⛔ Skipped due to learnings
Learnt from: CR
Repo: NVIDIA/TensorRT-LLM PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-12-17T22:39:44.230Z
Learning: Applies to **/*.{cpp,h,cu,cuh,py} : All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification
Learnt from: galagam
Repo: NVIDIA/TensorRT-LLM PR: 6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Learnt from: xinhe-nv
Repo: NVIDIA/TensorRT-LLM PR: 8534
File: scripts/format_test_list.py:1-6
Timestamp: 2025-10-22T06:53:47.017Z
Learning: The file `scripts/format_test_list.py` in the TensorRT-LLM repository does not require the NVIDIA Apache-2.0 copyright header.
Learnt from: tburt-nv
Repo: NVIDIA/TensorRT-LLM PR: 9881
File: cpp/kernels/fmha_v2/train_ops/train_setup.py:35-36
Timestamp: 2025-12-10T19:14:58.432Z
Learning: In cpp/kernels/fmha_v2/train_ops/train_setup.py, the embedded C++ template copyright headers (fmha_dgrad_v2_flash_attention_template and fmha_fprop_v2_flash_attention_template) use copyright dates "2011-2023" because the file was imported without meaningful changes since its original creation. The original copyright dates are preserved for historical accuracy rather than updated to the current year.
Learnt from: CR
Repo: NVIDIA/TensorRT-LLM PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-12-17T22:39:44.230Z
Learning: Applies to **/*.py : Code developed for TensorRT-LLM should conform to Python 3.8+
Learnt from: CR
Repo: NVIDIA/TensorRT-LLM PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-12-17T22:39:44.230Z
Learning: Applies to **/*.h : Use a preprocessor guard in C++ header files with the format `TRTLLM_<FILENAME>_H` derived from the filename in all caps
Learnt from: CR
Repo: NVIDIA/TensorRT-LLM PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-12-17T22:39:44.230Z
Learning: Applies to **/*.h : The preprocessor guard name in C++ must have prefix `TRTLLM_` followed by the filename, all in caps. Only use the file name, not directory names

84-86: Address RoPE argument ordering discrepancy in torch_rope_with_explicit_cos_sin call.

The function signature in tensorrt_llm/_torch/auto_deploy/custom_ops/torch_rope.py clearly defines torch_apply_rope_with_explicit_cos_sin(q, k, cos, sin, unsqueeze_dim) with return order (q_embed, k_embed). However, line 86 passes arguments as (k, q, ...) instead of (q, k, ...), which causes the rotations to be swapped. The reference implementation in tensorrt_llm/_torch/auto_deploy/transform/library/rope.py confirms the expected order is (q, k). Either fix the argument order to (q, k, cos, sin, 2) or verify the receiving variables should be k_rot, q_rot = ... if the swap is intentional.
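
For context, a self-contained rotate-half RoPE reference (standard HF-style math, not the repo's custom op) that shows why the (q, k) order matters, especially under GQA where q and k carry different head counts:

import torch


def rotate_half(x: torch.Tensor) -> torch.Tensor:
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def apply_rope(q, k, cos, sin, unsqueeze_dim=1):
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)
    q_embed = q * cos + rotate_half(q) * sin
    k_embed = k * cos + rotate_half(k) * sin
    return q_embed, k_embed  # queries first, keys second


b, s, d = 2, 4, 64
q = torch.randn(b, 14, s, d)  # 14 query heads
k = torch.randn(b, 2, s, d)   # 2 KV heads (GQA)
cos, sin = torch.randn(b, s, d), torch.randn(b, s, d)
q_rot, k_rot = apply_rope(q, k, cos, sin)
assert q_rot.shape == q.shape and k_rot.shape == k.shape  # swapping q/k would break these shapes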

tensorrt_llm/_torch/auto_deploy/config/export_driveos_llm_onnx_debug.yaml (1)

1-144: LGTM!

The debug configuration is well-structured and appropriately disables performance optimizations to focus on faster export and debugging. The comments clearly explain why transforms are disabled, and the remaining transforms support the core ONNX export workflow.

The structure mirrors the main export config while providing a streamlined path for debugging purposes.

tensorrt_llm/_torch/auto_deploy/config/export_driveos_llm_onnx.yaml (1)

4-148: LGTM!

The transform configuration is well-organized with clear section headers, appropriate staging, and documented TODOs for disabled features. The workflow from build through ONNX export is logically structured.

tensorrt_llm/_torch/auto_deploy/transform/library/short_reshape_attention_output.py (2)

41-64: LGTM!

The pattern matching logic correctly identifies reshape nodes that are fed by AttentionPlugin and connected to torch_linear_simple nodes.
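
Schematically, that kind of producer/consumer match over an FX graph looks like this (op targets are assumptions, not the transform's exact code):

import torch
import torch.fx as fx


def find_attention_reshapes(gm: fx.GraphModule, attn_target, linear_target):
    """Return reshape nodes fed by `attn_target` whose users include `linear_target`."""
    matches = []
    for node in gm.graph.nodes:
        if node.op != "call_function" or node.target != torch.ops.aten.reshape.default:
            continue
        producer = node.args[0]
        feeds_linear = any(
            user.op == "call_function" and user.target == linear_target
            for user in node.users
        )
        if isinstance(producer, fx.Node) and producer.target == attn_target and feeds_linear:
            matches.append(node)
    return matches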


66-113: LGTM!

The transform logic is well-structured:

  • Correctly extracts dynamic dimensions using sym_size.int
  • Properly inserts new reshape nodes before the old ones
  • Replaces usages and erases old nodes cleanly
  • Runs appropriate graph cleanup (shape prop, DCE, lint, recompile)

The unused cm, factory, and shared_config arguments are required by the BaseTransform._apply interface signature.

tensorrt_llm/_torch/auto_deploy/transform/library/gather_last_token_ids.py (3)

15-38: LGTM!

Good validation that the last torch_linear_simple node actually produces the output logits, with an appropriate warning when that assumption does not hold.


80-117: LGTM!

The GatherND insertion logic is correct and well-documented. The unsqueeze is appropriately applied to prepare indices for the GatherND operation.
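
For reference, a plain-PyTorch equivalent of what the inserted GatherND computes (shapes and names assumed): indices of shape (batch_size, 2) holding [batch_idx, last_token_idx] select one hidden state per sequence before the final linear layer.

import torch

batch, seq, hidden = 3, 5, 8
hidden_states = torch.randn(batch, seq, hidden)
last_pos = torch.tensor([4, 2, 3])  # last valid token per sequence

# GatherND-style indices: one (batch_idx, token_idx) pair per sequence.
indices = torch.stack([torch.arange(batch), last_pos], dim=-1)  # (batch, 2)

gathered = hidden_states[indices[:, 0], indices[:, 1]]  # (batch, hidden)
assert gathered.shape == (batch, hidden)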


119-159: LGTM!

The transform flow is well-structured with proper early returns for error cases and appropriate graph cleanup sequence.

tensorrt_llm/_torch/auto_deploy/transform/graph_module_visualizer.py (2)

46-164: LGTM!

The helper functions for edge width calculation, color assignment, and port mapping are well-implemented with appropriate handling of edge cases (single input/output, many inputs/outputs).


638-671: LGTM!

The graph depth calculation with memoization is correctly implemented.
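
The idea is roughly the following (a toy sketch, not the visualizer's exact code):

import torch
import torch.fx as fx


class Tiny(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x + 1) * 2


gm = fx.symbolic_trace(Tiny())
_depth_cache = {}


def node_depth(node: fx.Node) -> int:
    """Longest path from any placeholder to `node`, memoized per node."""
    if node in _depth_cache:
        return _depth_cache[node]
    preds = node.all_input_nodes
    depth = 0 if not preds else 1 + max(node_depth(p) for p in preds)
    _depth_cache[node] = depth
    return depth


print(max(node_depth(n) for n in gm.graph.nodes))  # overall graph depth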

tensorrt_llm/_torch/auto_deploy/transform/library/_config_export.py (3)

1-18: LGTM!

Good to see the proper NVIDIA copyright header included.


68-154: LGTM!

The EAGLE configuration export logic is well-structured with appropriate field validation. The base_model_hidden_size = target_hidden_size * 3 logic is documented with a reference to the original llm_export.py.


157-202: LGTM!

The public export functions properly validate inputs and dispatch to the appropriate internal handlers. The export_llm_config function correctly handles multimodal models by extracting text_config.

tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py (3)

15-47: LGTM!

The MatchResult container is well-structured with clear attribute names and a useful __repr__ for debugging.


216-276: LGTM!

The placeholder augmentation logic correctly infers batch and sequence dimensions from the token placeholder and creates appropriately typed placeholders for the attention plugin inputs.


386-434: LGTM!

The _apply method properly orchestrates the transformation: pattern matching, placeholder addition, replacement, cleanup, and graph validation. The early return for zero matches is appropriate.

tensorrt_llm/_torch/auto_deploy/transform/library/_chat_template.py (1)

1-14: LGTM!

Proper NVIDIA copyright header is present.

tensorrt_llm/_torch/auto_deploy/transform/library/export_to_onnx.py (2)

46-128: LGTM!

The ONNX custom op schemas and translation functions are well-defined with proper type constraints and documentation.


548-574: LGTM!

The _apply method properly orchestrates the export workflow: directory creation, JSON file export, and ONNX model export with appropriate status tracking.

@svc-trtllm-gh-bot svc-trtllm-gh-bot added the Community want to contribute PRs initiated from Community label Dec 18, 2025
@nvyocox nvyocox changed the title [none][feat] Export ONNX for DriveOS LLM [draft][none][feat] Export ONNX for DriveOS LLM Dec 19, 2025
@nvyocox nvyocox changed the title [draft][none][feat] Export ONNX for DriveOS LLM [none][feat] Export ONNX for DriveOS LLM Dec 19, 2025
@nvyocox nvyocox marked this pull request as draft December 19, 2025 01:41
@nvyocox nvyocox force-pushed the export-driveos-llm-onnx branch 2 times, most recently from 3fd31c8 to 249cdb1 on December 19, 2025 at 05:47
@nvyocox nvyocox changed the title [none][feat] Export ONNX for DriveOS LLM [None][feat] Export ONNX for DriveOS LLM Dec 19, 2025
@nvyocox nvyocox force-pushed the export-driveos-llm-onnx branch 4 times, most recently from 7c87b13 to b2fc9a2 on December 19, 2025 at 07:33
@nvyocox
Collaborator Author

nvyocox commented Dec 19, 2025

@coderabbitai review

@coderabbitai
Contributor

coderabbitai bot commented Dec 19, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (4)
tensorrt_llm/_torch/auto_deploy/transform/library/gather_last_token_ids.py (1)

54-58: Docstring shape still inconsistent with implementation.

The docstring states int64[batch_size] but the implementation creates a tensor with shape (batch_size, 2) at line 84. Update the docstring to reflect the actual shape.

🔎 Suggested fix:
     def _add_last_token_ids_input(self, gm: GraphModule) -> Node:
         """Add last_token_ids as a graph input.

-        Shape: int64[batch_size]
+        Shape: int64[batch_size, 2] - indices for GatherND operation
         """
tensorrt_llm/_torch/auto_deploy/transform/library/_chat_template.py (2)

62-66: Union type syntax requires Python 3.10+.

The str | List[Dict[str, str]] syntax requires Python 3.10+, but per coding guidelines, code should conform to Python 3.8+. Use Union from typing for compatibility.

🔎 Suggested fix:
-from typing import Any, Dict, List, Optional, Tuple
+from typing import Any, Dict, List, Optional, Tuple, Union

 @dataclass
 class Message:
     role: str
-    content: str | List[Dict[str, str]] = field(default_factory=list)
+    content: Union[str, List[Dict[str, str]]] = field(default_factory=list)

253-298: Missing return statement and exception chaining.

The function declares -> Dict[str, Any] return type but doesn't return the validated template. Also, the exception at line 257 should chain the original exception.

🔎 Suggested fix:
     try:
         with open(chat_template_path, "r") as f:
             template = json.load(f)
     except json.JSONDecodeError as e:
-        raise ValueError(f"Invalid JSON in chat template file: {e}")
+        raise ValueError(f"Invalid JSON in chat template file: {e}") from e
 
     # ... validation logic ...
 
     print("Chat template validation successful!")
+    return template
tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.py (1)

118-137: Return type annotation mismatch and direct args mutation.

  1. The method is annotated as -> bool but returns num_changed (an int).
  2. Line 135 directly mutates cast_node.args tuple. Use update_arg for safer FX graph manipulation.
🔎 Suggested fix:
-    def _change_cast_bfloat16_to_float16(self, gm: GraphModule) -> bool:
+    def _change_cast_bfloat16_to_float16(self, gm: GraphModule) -> int:
         """Replace all bfloat16 cast operations with float16 casts.
         ...
         """
         graph = gm.graph
         cast_nodes = graph.find_nodes(op="call_function", target=torch.ops.aten.to.dtype)
         num_changed = 0
         for cast_node in cast_nodes:
             if cast_node.args[1] == torch.bfloat16:
-                cast_node.args = (cast_node.args[0], torch.float16)
+                cast_node.update_arg(1, torch.float16)
                 num_changed += 1
         return num_changed
🧹 Nitpick comments (12)
tensorrt_llm/_torch/auto_deploy/transform/library/_config_export.py (2)

54-57: Consider using ad_logger instead of print for consistency.

Other transforms in this PR use ad_logger for logging. Using print() here is inconsistent and won't integrate with the logging configuration.

🔎 Suggested fix:
+from ...utils.logger import ad_logger
+
 # In the function:
-        print(
-            "Warning: head_dim not found in config, calculating as hidden_size // num_attention_heads"
-        )
+        ad_logger.warning(
+            "head_dim not found in config, calculating as hidden_size // num_attention_heads"
+        )

68-102: Consider extracting common config export logic.

_export_native_llm_config and _export_eagle_base_config share nearly identical logic for required field validation, head_dim handling, and partial_rotary_factor defaults. This duplication could be reduced with a shared helper.
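
One possible factoring, as a hedged sketch (function and field names are hypothetical; assumes hidden_size and num_attention_heads are among the required fields):

from typing import Any, Dict, Iterable


def build_common_config(cfg: Dict[str, Any], required: Iterable[str]) -> Dict[str, Any]:
    missing = [key for key in required if key not in cfg]
    if missing:
        raise ValueError(f"Missing required config fields: {missing}")
    out = {key: cfg[key] for key in required}
    if "head_dim" in cfg:
        out["head_dim"] = cfg["head_dim"]
    else:
        # Shared fallback used by both native and EAGLE export paths.
        out["head_dim"] = out["hidden_size"] // out["num_attention_heads"]
    out["partial_rotary_factor"] = cfg.get("partial_rotary_factor", 1.0)
    return out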

tensorrt_llm/_torch/auto_deploy/transform/library/_chat_template.py (1)

317-336: Consider caching is_vlm result to avoid duplicate calls.

is_vlm(model_dir) is called at line 319 and again at line 371. Since it loads the model config, caching the result would improve efficiency.

🔎 Suggested fix:
+    is_vlm_model = is_vlm(model_dir)
+
     tokenizer = None
     loaders = (
-        [AutoProcessor, AutoTokenizer] if is_vlm(model_dir) else [AutoTokenizer, AutoProcessor]
+        [AutoProcessor, AutoTokenizer] if is_vlm_model else [AutoTokenizer, AutoProcessor]
     )
     # ... later in the function ...
-    if is_vlm(model_dir):
+    if is_vlm_model:
tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_export_onnx.py (1)

9-17: Use pytest's tmp_path fixture instead of a hardcoded /tmp path.

Hardcoded temporary paths can cause issues with concurrent test execution and on systems with restricted /tmp access. The pytest tmp_path fixture provides an isolated temporary directory for each test.

🔎 Refactor to use the tmp_path fixture
 @pytest.mark.parametrize(
-    "model, max_batch_size, max_seq_len, output_dir, num_attn_ops",
+    "model, max_batch_size, max_seq_len, num_attn_ops",
     [
-        ("Qwen/Qwen2.5-0.5B", 13, 4, "/tmp/test_ad_export_onnx_qwen2.5-0.5b", 24),
+        ("Qwen/Qwen2.5-0.5B", 13, 4, 24),
     ],
 )
 def test_ad_export_onnx(
-    model: str, max_batch_size: int, max_seq_len: int, output_dir: str, num_attn_ops: int
+    model: str, max_batch_size: int, max_seq_len: int, num_attn_ops: int, tmp_path
 ):
+    output_dir = str(tmp_path / "test_ad_export_onnx_qwen2.5-0.5b")
     ad_config = AutoDeployConfig(
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_fuse_rope_attention.py (2)

77-81: Remove unused parameter or document why it's kept.

The position_ids parameter is declared but never used in the forward method. Either remove it from the signature or add a comment explaining why it's present (e.g., for API compatibility).

🔎 Option 1: Remove unused parameter
     def forward(
         self,
         input_ids: torch.Tensor,
-        position_ids: torch.Tensor,
     ) -> torch.Tensor:
🔎 Option 2: Document why it's kept
     def forward(
         self,
         input_ids: torch.Tensor,
-        position_ids: torch.Tensor,
+        position_ids: torch.Tensor,  # Required for export API compatibility
     ) -> torch.Tensor:

155-163: Remove unused tolerance parameters.

The atol and rtol parameters are declared but never used in _run_test. These appear to be leftover from a previous implementation.

🔎 Remove unused parameters
 def _run_test(
     head_dim: int,
     num_q_heads: int,
     num_kv_heads: int,
     batch_size: int,
     seq_len: int,
-    atol: float = 1e-3,
-    rtol: float = 1e-3,
 ):

And update the caller:

     _run_test(
         head_dim=head_dim,
         num_q_heads=num_q_heads,
         num_kv_heads=num_kv_heads,
         batch_size=batch_size,
         seq_len=seq_len,
-        atol=1e-2,
-        rtol=1e-2,
     )
tensorrt_llm/_torch/auto_deploy/utils/_graph.py (1)

359-438: Frozen dataclass mutation is acceptable given the constraints.

The use of object.__setattr__ on line 437 to mutate out_spec.type is unconventional but justified by the detailed comment (lines 420-437). The ONNX export requires dict-type outputs, but the frozen TreeSpec prevents normal assignment. This is a pragmatic solution to a real constraint.

Consider documenting this behavior in the function docstring as well for better visibility.

🔎 Add note to docstring about type mutation
 def add_graph_output(gm: GraphModule, output_node: Node, name: str) -> None:
     """Add a graph output to the given GraphModule.
 
     This function appends a new output to the graph's output node and updates
     the pytree_info metadata accordingly.
+    
+    Note: The output spec type is forcibly changed to dict to support arbitrary
+    named outputs, which is required for ONNX export with custom output names.
 
     NOTE: function does NOT do any graph canonicalization. This is left to the user!
tensorrt_llm/_torch/auto_deploy/transform/library/short_reshape_attention_output.py (1)

124-130: Unused parameters are likely required by interface.

The parameters cm, factory, and shared_config are unused in _apply but are likely required by the BaseTransform interface signature. This is acceptable and follows the interface contract.

If these parameters are truly optional in the base class, consider adding a comment explaining they're not needed for this transform, or use underscore-prefixed names (_cm, _factory, _shared_config) to indicate they're intentionally unused.

tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py (1)

500-505: Minor: Use tuple unpacking for clarity.

Consider using tuple unpacking instead of concatenation for slightly more idiomatic Python.

🔎 Suggested fix:
         for node in sym_size_int_nodes:
             if node.args[0] == position_ids_node:
-                new_args = (input_ids_node,) + node.args[1:]
+                new_args = (input_ids_node, *node.args[1:])
                 node.args = new_args
tensorrt_llm/_torch/auto_deploy/transform/library/export_to_onnx.py (3)

432-438: Use ad_logger instead of print for consistency.

The codebase uses ad_logger for logging throughout. Consider replacing these print statements with ad_logger.debug for consistency with the rest of the module.

🔎 Suggested fix:
-        print("Placeholders args:")
+        ad_logger.debug("Placeholders args:")
         for i, e in enumerate(args):
-            print(f"  {i}: {placeholders[i].name:20} {e}")
+            ad_logger.debug(f"  {i}: {placeholders[i].name:20} {e}")
 
-        print("Placeholders kwargs:")
+        ad_logger.debug("Placeholders kwargs:")
         for k, v in kwargs.items():
-            print(f"  {k}: {v}")
+            ad_logger.debug(f"  {k}: {v}")

454-470: Consider making hardcoded dimension limits configurable.

The hardcoded max values for rope_batch_size (16), max_position_embeddings (4096), and past_len (4096) may be limiting for some models. Consider deriving these from model configuration or making them configurable.

Verify that these limits are sufficient for target models. If models require larger values, these hardcoded limits would cause export failures or runtime errors.
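
A hedged sketch of deriving the bounds from the model config instead (attribute names follow common Hugging Face config conventions and are assumptions here):

def dynamic_dim_limits(model_config, max_batch_size: int = 16) -> dict:
    """Derive export-time dynamic-dimension upper bounds from the model config."""
    max_pos = getattr(model_config, "max_position_embeddings", 4096)
    return {
        "rope_batch_size": max_batch_size,
        "max_position_embeddings": max_pos,
        "past_len": max_pos,
    }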


534-539: Use ad_logger instead of print for consistency.

Similar to the earlier observation, these print statements should use ad_logger for consistent logging.

🔎 Suggested fix:
         if reduced_vocab_size is not None:
             model_config["reduced_vocab_size"] = reduced_vocab_size
-            print(f"Added reduced_vocab_size={reduced_vocab_size} to config")
+            ad_logger.info(f"Added reduced_vocab_size={reduced_vocab_size} to config")
 
         config_path = os.path.join(output_dir, "config.json")
         with open(config_path, "w") as f:
             json.dump(model_config, f, indent=2)
-        print(f"Model configuration saved to {config_path}")
+        ad_logger.info(f"Model configuration saved to {config_path}")
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a6cf560 and b2fc9a2.

📒 Files selected for processing (22)
  • .gitignore (1 hunks)
  • docker/common/install_base.sh (1 hunks)
  • examples/auto_deploy/onnx_export_llm.py (1 hunks)
  • requirements.txt (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/config/export_driveos_llm_onnx.yaml (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/onnx_attention.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/llm_args.py (2 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/graph_module_visualizer.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/interface.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/_chat_template.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/_config_export.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py (2 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/export_to_onnx.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/gather_last_token_ids.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/short_reshape_attention_output.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/optimizer.py (3 hunks)
  • tensorrt_llm/_torch/auto_deploy/utils/_graph.py (4 hunks)
  • tests/unittest/_torch/auto_deploy/_utils_test/torch_attention_reference.py (4 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_export_onnx.py (1 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_fuse_rope_attention.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
  • tensorrt_llm/_torch/auto_deploy/transform/optimizer.py
  • tests/unittest/_torch/auto_deploy/_utils_test/torch_attention_reference.py
  • .gitignore
  • docker/common/install_base.sh
  • tensorrt_llm/_torch/auto_deploy/transform/interface.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces. Do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used
Python files should use snake_case naming: some_file.py
Python classes should use PascalCase naming: class SomeClass
Python functions and methods should use snake_case naming: def my_awesome_function():
Python local variables should use snake_case naming: my_variable = ...
Python variable names that start with a number should be prefixed with 'k': k_99th_percentile = ...
Python global variables should use upper snake_case with prefix 'G': G_MY_GLOBAL = ...
Python constants should use upper snake_case naming: MY_CONSTANT = ...
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings in Python for classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except to the smallest set of errors possible
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible, using the else block for logic

Files:

  • tensorrt_llm/_torch/auto_deploy/utils/_graph.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.py
  • tensorrt_llm/_torch/auto_deploy/transform/graph_module_visualizer.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/gather_last_token_ids.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_export_onnx.py
  • examples/auto_deploy/onnx_export_llm.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/onnx_attention.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/short_reshape_attention_output.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/_config_export.py
  • tensorrt_llm/_torch/auto_deploy/llm_args.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/export_to_onnx.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/_chat_template.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_fuse_rope_attention.py
**/*.{cpp,h,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification

Files:

  • tensorrt_llm/_torch/auto_deploy/utils/_graph.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.py
  • tensorrt_llm/_torch/auto_deploy/transform/graph_module_visualizer.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/gather_last_token_ids.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_export_onnx.py
  • examples/auto_deploy/onnx_export_llm.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/onnx_attention.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/short_reshape_attention_output.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/_config_export.py
  • tensorrt_llm/_torch/auto_deploy/llm_args.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/export_to_onnx.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/_chat_template.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_fuse_rope_attention.py
🧠 Learnings (31)
📓 Common learnings
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
📚 Learning: 2025-12-19T06:31:54.973Z
Learnt from: nvyocox
Repo: NVIDIA/TensorRT-LLM PR: 10117
File: tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py:336-339
Timestamp: 2025-12-19T06:31:54.973Z
Learning: In tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py, the cast to torch.float16 for qkv_node before creating the AttentionPlugin is intentional and required because DriveOS LLM expects float16 dtype specifically. This should not be changed to preserve original dtype or made configurable for bfloat16 models in the DriveOS LLM ONNX export path.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.py
  • tensorrt_llm/_torch/auto_deploy/config/export_driveos_llm_onnx.yaml
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_export_onnx.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/onnx_attention.py
  • tensorrt_llm/_torch/auto_deploy/llm_args.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/export_to_onnx.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_fuse_rope_attention.py
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py
📚 Learning: 2025-12-17T22:39:44.244Z
Learnt from: CR
Repo: NVIDIA/TensorRT-LLM PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-12-17T22:39:44.244Z
Learning: Applies to **/*.{cpp,h,cu,cuh,py} : All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.py
  • tensorrt_llm/_torch/auto_deploy/transform/graph_module_visualizer.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/gather_last_token_ids.py
  • examples/auto_deploy/onnx_export_llm.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/onnx_attention.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/short_reshape_attention_output.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/export_to_onnx.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
Repo: NVIDIA/TensorRT-LLM PR: 6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.py
  • tensorrt_llm/_torch/auto_deploy/transform/graph_module_visualizer.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/gather_last_token_ids.py
  • examples/auto_deploy/onnx_export_llm.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/onnx_attention.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/short_reshape_attention_output.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/export_to_onnx.py
📚 Learning: 2025-10-22T06:53:47.017Z
Learnt from: xinhe-nv
Repo: NVIDIA/TensorRT-LLM PR: 8534
File: scripts/format_test_list.py:1-6
Timestamp: 2025-10-22T06:53:47.017Z
Learning: The file `scripts/format_test_list.py` in the TensorRT-LLM repository does not require the NVIDIA Apache-2.0 copyright header.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.py
  • tensorrt_llm/_torch/auto_deploy/transform/graph_module_visualizer.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/gather_last_token_ids.py
  • examples/auto_deploy/onnx_export_llm.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/onnx_attention.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/short_reshape_attention_output.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/export_to_onnx.py
📚 Learning: 2025-12-10T19:14:58.432Z
Learnt from: tburt-nv
Repo: NVIDIA/TensorRT-LLM PR: 9881
File: cpp/kernels/fmha_v2/train_ops/train_setup.py:35-36
Timestamp: 2025-12-10T19:14:58.432Z
Learning: In cpp/kernels/fmha_v2/train_ops/train_setup.py, the embedded C++ template copyright headers (fmha_dgrad_v2_flash_attention_template and fmha_fprop_v2_flash_attention_template) use copyright dates "2011-2023" because the file was imported without meaningful changes since its original creation. The original copyright dates are preserved for historical accuracy rather than updated to the current year.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.py
  • tensorrt_llm/_torch/auto_deploy/transform/graph_module_visualizer.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/gather_last_token_ids.py
  • examples/auto_deploy/onnx_export_llm.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/onnx_attention.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/short_reshape_attention_output.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/export_to_onnx.py
📚 Learning: 2025-12-17T22:39:44.244Z
Learnt from: CR
Repo: NVIDIA/TensorRT-LLM PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-12-17T22:39:44.244Z
Learning: Applies to **/*.h : Use a preprocessor guard in C++ header files with the format `TRTLLM_<FILENAME>_H` derived from the filename in all caps

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.py
  • examples/auto_deploy/onnx_export_llm.py
📚 Learning: 2025-12-17T22:39:44.244Z
Learnt from: CR
Repo: NVIDIA/TensorRT-LLM PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-12-17T22:39:44.244Z
Learning: Applies to **/*.py : Code developed for TensorRT-LLM should conform to Python 3.8+

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.py
  • examples/auto_deploy/onnx_export_llm.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/export_to_onnx.py
📚 Learning: 2025-12-17T22:39:44.244Z
Learnt from: CR
Repo: NVIDIA/TensorRT-LLM PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-12-17T22:39:44.244Z
Learning: Applies to **/*.h : The preprocessor guard name in C++ must have prefix `TRTLLM_` followed by the filename, all in caps. Only use the file name, not directory names

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.py
  • examples/auto_deploy/onnx_export_llm.py
📚 Learning: 2025-08-21T00:16:56.457Z
Learnt from: farshadghodsian
Repo: NVIDIA/TensorRT-LLM PR: 7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.457Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.py
  • examples/auto_deploy/onnx_export_llm.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/onnx_attention.py
📚 Learning: 2025-10-20T17:09:21.560Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/transform/library/rms_norm.py:180-182
Timestamp: 2025-10-20T17:09:21.560Z
Learning: In tensorrt_llm/_torch/auto_deploy/transform/library/rms_norm.py, the _gated_rmsnorm_replacement function does not need to cast the output of torch.ops.auto_deploy.torch_rmsnorm_gated back to the input dtype, even though the custom op returns fp32. The dtype handling is managed elsewhere or the fp32 output is acceptable for downstream consumers.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py
📚 Learning: 2025-10-20T16:54:09.824Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py:6-6
Timestamp: 2025-10-20T16:54:09.824Z
Learning: In tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py, the import `from ...modules.mamba.layernorm_gated import _layer_norm_fwd` is correct and should not be changed to modules.fla.layernorm_gated. The _layer_norm_fwd function exists in both modules/mamba/layernorm_gated.py and modules/fla/layernorm_gated.py, but the mamba version is the intended implementation for this use case.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/onnx_attention.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_fuse_rope_attention.py
📚 Learning: 2025-09-19T21:28:13.751Z
Learnt from: jhaotingc
Repo: NVIDIA/TensorRT-LLM PR: 7856
File: cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp:159-166
Timestamp: 2025-09-19T21:28:13.751Z
Learning: In TensorRT-LLM blockScaleMoe routing (cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu), the DeepSeek routing method performs reinterpret_cast<float*>(routingLogits) at line 89, which could cause issues if routing_logits are BF16. However, Qwen3-FP8 models use RenormalizeNaive routing method and are not affected by this dtype casting issue.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py
📚 Learning: 2025-11-12T17:28:52.144Z
Learnt from: cheshirekow
Repo: NVIDIA/TensorRT-LLM PR: 9016
File: 3rdparty/README.md:20-20
Timestamp: 2025-11-12T17:28:52.144Z
Learning: In the TensorRT-LLM repository, "nspect" is the name of an internal tool used for detecting package installations in containers. It should not be flagged as a typo.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.py
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/transform/library/gather_last_token_ids.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py
📚 Learning: 2025-08-26T09:49:04.956Z
Learnt from: pengbowang-nv
Repo: NVIDIA/TensorRT-LLM PR: 7192
File: tests/integration/test_lists/test-db/l0_dgx_b200.yml:56-72
Timestamp: 2025-08-26T09:49:04.956Z
Learning: In TensorRT-LLM test configuration files, the test scheduling system handles wildcard matching with special rules that prevent duplicate test execution even when the same tests appear in multiple yaml files with overlapping GPU wildcards (e.g., "*b200*" and "*gb200*").

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/config/export_driveos_llm_onnx.yaml
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_export_onnx.py
📚 Learning: 2025-08-08T05:06:31.596Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.596Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.

Applied to files:

  • examples/auto_deploy/onnx_export_llm.py
📚 Learning: 2025-09-23T15:01:00.070Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels, the <sstream> header is not needed as an explicit include in config.cu because it's provided transitively through other headers. Local compilation testing confirms this works without the explicit include.

Applied to files:

  • examples/auto_deploy/onnx_export_llm.py
📚 Learning: 2025-08-14T15:43:23.107Z
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: tensorrt_llm/_torch/attention_backend/trtllm.py:259-262
Timestamp: 2025-08-14T15:43:23.107Z
Learning: In TensorRT-LLM's attention backend, tensor parameters in the plan() method are assigned directly without validation (dtype, device, contiguity checks). This maintains consistency across all tensor inputs and follows the pattern of trusting callers to provide correctly formatted tensors.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/custom_ops/onnx_attention.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py
📚 Learning: 2025-12-12T10:07:36.866Z
Learnt from: lirundong
Repo: NVIDIA/TensorRT-LLM PR: 9725
File: tensorrt_llm/_torch/custom_ops/cuda_tile_custom_ops.py:110-178
Timestamp: 2025-12-12T10:07:36.866Z
Learning: In PyTorch custom operators registered with torch.library.custom_op, mutable operators that return None and specify mutates_args do NOT require a register_fake decorator. The mutation tracking is handled automatically without needing a FakeTensor kernel, as documented in the PyTorch tutorial on mutable Python custom operators.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/custom_ops/onnx_attention.py
📚 Learning: 2025-12-19T06:31:46.370Z
Learnt from: nvyocox
Repo: NVIDIA/TensorRT-LLM PR: 10117
File: tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py:336-339
Timestamp: 2025-12-19T06:31:46.370Z
Learning: In tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py, ensure that the qkv_node is cast to torch.float16 before creating the AttentionPlugin. This casting is intentional and required because DriveOS LLM expects float16 dtype. Do not revert to the original dtype, and do not make this behavior configurable for bfloat16 models within the DriveOS LLM ONNX export path. If needed, document the rationale in code comments and add a direct assertion to prevent accidental changes to the dtype before plugin creation.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py
📚 Learning: 2025-08-09T02:04:49.623Z
Learnt from: Fridah-nv
Repo: NVIDIA/TensorRT-LLM PR: 6760
File: tensorrt_llm/_torch/auto_deploy/models/quant_config_reader.py:81-98
Timestamp: 2025-08-09T02:04:49.623Z
Learning: In TensorRT-LLM's auto_deploy module, torch.dtype values in configuration dictionaries must be stored as string representations (e.g., "float16" instead of torch.float16) because OmegaConf.merge does not support torch.dtype types. These string representations are converted to actual torch.dtype objects in downstream code.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/_config_export.py
📚 Learning: 2025-08-27T15:03:57.149Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 7294
File: tensorrt_llm/_torch/pyexecutor/sampler.py:368-392
Timestamp: 2025-08-27T15:03:57.149Z
Learning: In TensorRT-LLM's sampler.py, int32 usage for softmax_indices and related tensor indexing is intentional and should not be changed to int64. The torch.IntTensor type hint is correct for the sample() function's softmax_indices parameter.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_fuse_rope_attention.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/transform/library/_config_export.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM's bench configuration, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which is a Dict[str, Any] that can contain default values including `cuda_graph_config`, making the fallback `llm_args["cuda_graph_config"]` safe to use.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/transform/library/_config_export.py
📚 Learning: 2025-08-27T14:23:55.566Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 7294
File: tensorrt_llm/_torch/modules/rms_norm.py:17-17
Timestamp: 2025-08-27T14:23:55.566Z
Learning: The TensorRT-LLM project requires Python 3.10+ as evidenced by the use of TypeAlias from typing module, match/case statements, and union type | syntax throughout the codebase, despite some documentation still mentioning Python 3.8+.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/transform/library/_chat_template.py
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation with asserts for total size and TP divisibility.

Applied to files:

  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_fuse_rope_attention.py
🧬 Code graph analysis (7)
tensorrt_llm/_torch/auto_deploy/utils/_graph.py (1)
tensorrt_llm/_torch/auto_deploy/shim/interface.py (1)
  • args (28-30)
tensorrt_llm/_torch/auto_deploy/transform/library/gather_last_token_ids.py (2)
tensorrt_llm/_torch/auto_deploy/shim/interface.py (1)
  • CachedSequenceInterface (11-92)
tensorrt_llm/_torch/auto_deploy/utils/_graph.py (2)
  • add_graph_input (265-324)
  • run_shape_prop (237-262)
tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_export_onnx.py (1)
tensorrt_llm/_torch/auto_deploy/llm_args.py (2)
  • AutoDeployConfig (54-339)
  • to_llm_kwargs (315-325)
tensorrt_llm/_torch/auto_deploy/transform/library/_config_export.py (3)
tensorrt_llm/_torch/auto_deploy/transform/interface.py (1)
  • get (526-528)
tensorrt_llm/_torch/auto_deploy/llm_args.py (2)
  • to_dict (311-313)
  • to_dict (446-451)
tensorrt_llm/bench/benchmark/__init__.py (1)
  • model_type (70-71)
tensorrt_llm/_torch/auto_deploy/transform/library/export_to_onnx.py (3)
tensorrt_llm/_torch/auto_deploy/transform/library/_chat_template.py (1)
  • process_chat_template (300-443)
tensorrt_llm/_torch/auto_deploy/transform/library/_config_export.py (1)
  • export_llm_config (175-203)
tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py (1)
  • get_config_class (122-123)
tensorrt_llm/_torch/auto_deploy/transform/library/_chat_template.py (4)
tensorrt_llm/_torch/auto_deploy/llm_args.py (2)
  • to_dict (311-313)
  • to_dict (446-451)
tensorrt_llm/_torch/auto_deploy/transform/interface.py (1)
  • get (526-528)
tensorrt_llm/_torch/auto_deploy/utils/pattern_matcher.py (1)
  • placeholder (229-247)
tests/unittest/llmapi/apps/test_chat_utils.py (1)
  • chat_template_path (188-193)
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_fuse_rope_attention.py (3)
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (1)
  • torch_attention (96-212)
tensorrt_llm/_torch/auto_deploy/export/export.py (1)
  • torch_export_to_gm (276-344)
tensorrt_llm/_torch/auto_deploy/transform/optimizer.py (1)
  • InferenceOptimizer (25-93)
🪛 Ruff (0.14.8)
tensorrt_llm/_torch/auto_deploy/utils/_graph.py

380-380: Avoid specifying long messages outside the exception class

(TRY003)


391-391: Consider (*tuple(current_outputs), output_node) instead of concatenation

Replace with (*tuple(current_outputs), output_node)

(RUF005)


464-467: Avoid specifying long messages outside the exception class

(TRY003)


472-475: Avoid specifying long messages outside the exception class

(TRY003)


480-484: Avoid specifying long messages outside the exception class

(TRY003)

tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.py

150-150: Unused method argument: cm

(ARG002)


151-151: Unused method argument: factory

(ARG002)


152-152: Unused method argument: shared_config

(ARG002)

tensorrt_llm/_torch/auto_deploy/transform/graph_module_visualizer.py

86-86: Consider moving this statement to an else block

(TRY300)


449-449: Loop control variable target_name not used within loop body

(B007)


449-449: Loop control variable input_idx not used within loop body

(B007)


535-535: Do not catch blind exception: Exception

(BLE001)


648-648: Do not catch blind exception: Exception

(BLE001)

tensorrt_llm/_torch/auto_deploy/transform/library/gather_last_token_ids.py

136-136: Unused method argument: cm

(ARG002)


137-137: Unused method argument: factory

(ARG002)


138-138: Unused method argument: shared_config

(ARG002)

tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_export_onnx.py

12-12: Probable insecure usage of temporary file or directory: "/tmp/test_ad_export_onnx_qwen2.5-0.5b"

(S108)

tensorrt_llm/_torch/auto_deploy/custom_ops/onnx_attention.py

33-33: Unused function argument: context_lengths

(ARG001)


34-34: Unused function argument: rope_rotary_cos_sin

(ARG001)


35-35: Unused function argument: kvcache_start_index

(ARG001)


37-37: Unused function argument: enable_tree_attention

(ARG001)


38-38: Unused function argument: head_size

(ARG001)


39-39: Unused function argument: num_kv_heads

(ARG001)


40-40: Unused function argument: num_q_heads

(ARG001)


81-81: Unused function argument: context_lengths

(ARG001)


82-82: Unused function argument: rope_rotary_cos_sin

(ARG001)


83-83: Unused function argument: kvcache_start_index

(ARG001)


84-84: Unused function argument: enable_tree_attention

(ARG001)

tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py

246-246: Do not catch blind exception: Exception

(BLE001)


271-271: Do not catch blind exception: Exception

(BLE001)


293-293: Unused method argument: cm

(ARG002)


367-367: Unused method argument: cm

(ARG002)


502-502: Consider (input_ids_node, *node.args[1:]) instead of concatenation

Replace with (input_ids_node, *node.args[1:])

(RUF005)


511-511: Unused method argument: factory

(ARG002)


512-512: Unused method argument: shared_config

(ARG002)

tensorrt_llm/_torch/auto_deploy/transform/library/short_reshape_attention_output.py

127-127: Unused method argument: cm

(ARG002)


128-128: Unused method argument: factory

(ARG002)


129-129: Unused method argument: shared_config

(ARG002)

tensorrt_llm/_torch/auto_deploy/transform/library/_config_export.py

38-38: Avoid specifying long messages outside the exception class

(TRY003)


45-45: Avoid specifying long messages outside the exception class

(TRY003)


85-85: Avoid specifying long messages outside the exception class

(TRY003)


121-121: Avoid specifying long messages outside the exception class

(TRY003)


135-135: Avoid specifying long messages outside the exception class

(TRY003)


164-166: Avoid specifying long messages outside the exception class

(TRY003)


195-195: Avoid specifying long messages outside the exception class

(TRY003)

tensorrt_llm/_torch/auto_deploy/transform/library/export_to_onnx.py

422-422: Unused method argument: factory

(ARG002)


423-423: Unused method argument: shared_config

(ARG002)


544-544: Unused method argument: cm

(ARG002)


546-546: Unused method argument: shared_config

(ARG002)

tensorrt_llm/_torch/auto_deploy/transform/library/_chat_template.py

131-131: Do not catch blind exception: Exception

(BLE001)


149-154: Avoid specifying long messages outside the exception class

(TRY003)


251-251: Avoid specifying long messages outside the exception class

(TRY003)


257-257: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


257-257: Avoid specifying long messages outside the exception class

(TRY003)


261-263: Prefer TypeError exception for invalid type

(TRY004)


261-263: Avoid specifying long messages outside the exception class

(TRY003)


269-269: Avoid specifying long messages outside the exception class

(TRY003)


330-330: Do not catch blind exception: Exception

(BLE001)

tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_fuse_rope_attention.py

26-26: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


27-27: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


80-80: Unused method argument: position_ids

(ARG002)


161-161: Unused function argument: atol

(ARG001)


162-162: Unused function argument: rtol

(ARG001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (41)
requirements.txt (1)

13-14: LGTM!

The onnxscript dependency is properly pinned to version 0.5.4 for reproducibility, addressing the previous review feedback. Keeping graphviz unpinned is acceptable given its stable release status.

tensorrt_llm/_torch/auto_deploy/transform/library/gather_last_token_ids.py (3)

1-14: LGTM!

The NVIDIA copyright header is properly included, addressing the previous review feedback.


94-131: LGTM!

The GatherND insertion logic correctly manipulates the FX graph, inserting the unsqueeze and GatherND nodes before the linear layer and properly rewiring the inputs.


133-173: LGTM!

The _apply orchestration is well-structured with clear steps and proper cleanup. The unused arguments (cm, factory, shared_config) are required by the BaseTransform interface signature.

tensorrt_llm/_torch/auto_deploy/transform/library/adapt_to_driveos_llm.py (3)

1-26: LGTM!

The copyright header and imports are properly structured. The class docstring clearly documents the four transformations performed for DriveOS compatibility.


69-116: LGTM!

The method correctly identifies attention plugin outputs, traces through getitem and reshape nodes, and inserts float16 casts. The replace_all_uses_with followed by update_arg is the proper pattern to avoid circular references.
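
For readers unfamiliar with that idiom, a minimal sketch of the cast-insertion pattern looks roughly like this (illustrative only, not the PR's code; the helper name and aten overload are assumptions):

```python
import torch
import torch.fx as fx


def insert_fp16_cast_after(gm: fx.GraphModule, node: fx.Node) -> None:
    # Insert a cast-to-float16 node right after `node`.
    with gm.graph.inserting_after(node):
        cast = gm.graph.call_function(torch.ops.aten.to.dtype, args=(node, torch.float16))
    # Re-route every consumer of `node` to the cast...
    node.replace_all_uses_with(cast)
    # ...which also rewired the cast's own input onto itself, so restore it.
    cast.update_arg(0, node)
    gm.recompile()
```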


147-167: LGTM!

The _apply method correctly orchestrates all adaptation steps. The unused arguments are required by the BaseTransform interface.

tensorrt_llm/_torch/auto_deploy/transform/library/_config_export.py (3)

1-18: LGTM!

Copyright header and version constant are properly defined.


157-172: LGTM!

The vision config export correctly validates the presence of vision configuration and preserves the original config for MRoPE compatibility.


175-203: LGTM!

The dispatch logic correctly routes to the appropriate exporter based on model type and handles multimodal models by extracting text_config.

tensorrt_llm/_torch/auto_deploy/transform/library/_chat_template.py (3)

1-14: LGTM!

Copyright header and module docstring are properly structured.


107-155: LGTM!

The fallback pattern for handling tokenizers that don't support multimodal content is appropriate. The inner exception properly chains the original error.


340-443: LGTM!

The chat template extraction logic is well-structured, correctly handling system/user/assistant roles, multimodal content patterns for VLMs, and default system prompt detection. The output JSON structure is comprehensive.

tensorrt_llm/_torch/auto_deploy/llm_args.py (1)

154-154: LGTM! Mode extension follows existing patterns.

The addition of "export_driveos_llm_onnx" to the mode literal and its corresponding YAML mapping is consistent with the existing architecture and properly extends the configuration system for the new ONNX export workflow.

Also applies to: 337-337

tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_fuse_rope_attention.py (1)

26-27: noqa directives are actually needed despite Ruff warning.

The # noqa: F401 comments are intentionally suppressing unused import warnings because these imports register custom ops (torch.ops.auto_deploy.*) via side effects. Ruff 0.14.8's RUF100 rule incorrectly flags these as unnecessary.

The imports serve to register the custom operators before the test runs, and removing the noqa would cause linters to flag them as unused imports. Keep the noqa directives as-is.
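
For context, the side-effect import pattern being defended looks like this (module paths taken from the files touched in this PR; the exact import form in the test may differ):

```python
# Importing these modules registers the torch.ops.auto_deploy.* custom ops as a
# side effect; the imported names are never referenced directly in the test body.
import tensorrt_llm._torch.auto_deploy.custom_ops.torch_attention  # noqa: F401
import tensorrt_llm._torch.auto_deploy.custom_ops.onnx_attention  # noqa: F401
```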

tensorrt_llm/_torch/auto_deploy/config/export_driveos_llm_onnx.yaml (1)

1-148: LGTM! Well-structured ONNX export configuration.

The YAML configuration provides a comprehensive pipeline for DriveOS LLM ONNX export with:

  • Clear stage organization (factory, export, pattern_matcher, sharding, etc.)
  • Good documentation (e.g., lines 56-60 explaining why optimize_rope is disabled)
  • Consistent structure matching existing configs

The export pipeline sequence (fuse_rope_attention → short_reshape_attention_output → gather_last_token_ids → adapt_to_driveos_llm → export_to_onnx) logically flows from graph optimization to final export.

tensorrt_llm/_torch/auto_deploy/utils/_graph.py (3)

25-34: LGTM! Recursive post-init utility is well-designed.

The _call_post_init_recursive helper correctly traverses the TreeSpec hierarchy from leaves to root, ensuring internal cached values (like num_leaves) are updated after modifying children_specs. This is essential for maintaining pytree integrity.
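
A rough sketch of such a bottom-up refresh, assuming torch.utils._pytree.TreeSpec exposes children_specs and a dataclass __post_init__ that recomputes the cached counts (an illustration, not the PR's code):

```python
from torch.utils._pytree import TreeSpec


def call_post_init_recursive(spec: TreeSpec) -> None:
    # Refresh leaves first so each parent sees up-to-date child counts.
    for child in spec.children_specs:
        call_post_init_recursive(child)
    # __post_init__ recomputes cached values such as num_leaves/num_nodes
    # after children_specs has been mutated.
    spec.__post_init__()
```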


265-324: LGTM! name_prefix addition improves flexibility.

Adding the name_prefix parameter allows customization of placeholder naming while maintaining backward compatibility with the default "arg_" prefix. The implementation correctly applies the prefix consistently across placeholder creation, orig_args, and context.


440-525: LGTM! Input removal correctly handles args vs kwargs distinction.

The remove_graph_input function properly addresses the past review concern about index mismatches by:

  1. Computing the global placeholder index (line 493)
  2. Determining if it's an arg or kwarg based on position (lines 496-508)
  3. Computing a relative index for the appropriate spec (lines 501, 506)
  4. Using relative_idx for all spec operations (lines 517-520)

This ensures correct removal regardless of whether the input is a positional arg or kwarg.
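
The index bookkeeping boils down to something like the following simplified sketch (illustrative, not the actual implementation):

```python
def split_placeholder_index(global_idx: int, num_args: int) -> tuple[str, int]:
    """Map a flat placeholder index onto the (args, kwargs) pytree structure.

    Placeholders 0..num_args-1 belong to positional args; later ones belong to
    kwargs, so their index must be made relative to the kwargs spec.
    """
    if global_idx < num_args:
        return "args", global_idx
    return "kwargs", global_idx - num_args
```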

tensorrt_llm/_torch/auto_deploy/transform/graph_module_visualizer.py (2)

175-538: LGTM! Comprehensive visualization implementation.

The to_dot function provides sophisticated graph visualization with:

  • Intelligent port assignment for multi-input/output nodes
  • Edge coloring and width based on tensor properties
  • Special handling for large fan-in/fan-out nodes via duplication
  • Constant node creation for non-Node inputs

The magic number 9223372036854775807 (2^63-1) handling on lines 298-299, 304-305 is a valid workaround for what appears to be a torch.SymInt representation issue where -1 gets converted to LONG_MAX.

The broad exception catching on line 535 is acceptable for visualization code, as failures shouldn't break the pipeline. Good practice to log the error and save the text dump as fallback.
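
A hedged sketch of the sentinel handling (not the PR's exact code) shows why the type check matters: comparing a SymInt can accidentally add constraints, so symbolic dims are kept symbolic and only plain ints are compared against the sentinel.

```python
import torch

_LONG_MAX = 9223372036854775807  # 2**63 - 1, observed stand-in for -1


def dim_label(d) -> str:
    # Keep symbolic dims as symbols (e.g. "s0") instead of comparing them,
    # and map the LONG_MAX sentinel back to -1 for readability.
    if isinstance(d, torch.SymInt):
        return str(d)
    return "-1" if d == _LONG_MAX else str(d)
```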


541-687: LGTM! Graph analysis utility provides useful metrics.

The analyze_graph_structure function computes meaningful graph statistics (node counts, connections, complexity, depth) that can aid in debugging and optimization decisions. The recursive depth calculation correctly handles nested dependencies.
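
The depth metric can be computed with a short topological pass over the FX graph; an iterative equivalent of the recursive version might look like this (a sketch relying on gm.graph.nodes being topologically ordered, which FX guarantees):

```python
import torch.fx as fx


def graph_depth(gm: fx.GraphModule) -> int:
    # Length of the longest producer -> consumer chain in the graph.
    depth: dict[fx.Node, int] = {}
    for node in gm.graph.nodes:  # nodes iterate in topological order
        preds = [depth[p] for p in node.all_input_nodes]
        depth[node] = 1 + max(preds, default=0)
    return max(depth.values(), default=0)
```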

tensorrt_llm/_torch/auto_deploy/transform/library/short_reshape_attention_output.py (1)

124-171: LGTM! Reshape optimization correctly handles dynamic shapes.

The _apply method effectively:

  1. Identifies reshape nodes following AttentionPlugin outputs (lines 132-133)
  2. Extracts dynamic dimensions from the input tensor using sym_size.int (lines 143-144)
  3. Creates new reshape nodes with symbolic shape [dim0, dim1, -1] (lines 147-153)
  4. Properly replaces and removes old nodes (lines 156-159)
  5. Runs post-pass cleanup and shape propagation (lines 163-166)

This ensures reshapes handle dynamic batch and sequence dimensions correctly for ONNX export.
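
The node surgery described above roughly corresponds to this sketch (illustrative; the helper name and op overloads are assumptions):

```python
import torch
import torch.fx as fx


def rebuild_dynamic_reshape(gm: fx.GraphModule, old_reshape: fx.Node, src: fx.Node) -> None:
    # Rebuild the reshape as [dim0, dim1, -1], with dim0/dim1 read symbolically
    # from the input tensor so batch and sequence stay dynamic in the export.
    with gm.graph.inserting_before(old_reshape):
        dim0 = gm.graph.call_function(torch.ops.aten.sym_size.int, (src, 0))
        dim1 = gm.graph.call_function(torch.ops.aten.sym_size.int, (src, 1))
        new_reshape = gm.graph.call_function(
            torch.ops.aten.reshape.default, (src, [dim0, dim1, -1])
        )
    old_reshape.replace_all_uses_with(new_reshape)
    gm.graph.erase_node(old_reshape)
```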

examples/auto_deploy/onnx_export_llm.py (1)

38-42: The batch_size dimension collapse with small values (e.g., max_batch_size=2) is a known limitation in PyTorch's ONNX export when using small max values in dynamic shape definitions. The documented workaround using batch_size=13 is appropriate and correctly implemented. The dynamic shape configuration in export_to_onnx.py properly specifies symbolic dimensions per PyTorch standards. No code changes required; the existing comment adequately documents the workaround and its rationale.
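
For readers reproducing this, the shape-declaration side of the workaround looks roughly like the following (the dimension names, bounds, and use of torch.export.Dim are illustrative of the general mechanism, not the example's exact code):

```python
import torch

# Declare symbolic batch/sequence dims for export. Very small example batch
# sizes (e.g. 2) can get specialized to constants, hence the dummy batch of 13.
batch = torch.export.Dim("batch_size", min=1, max=64)
seq = torch.export.Dim("seq_len", min=1, max=4096)
dynamic_shapes = {"input_ids": {0: batch, 1: seq}}
example_input_ids = torch.zeros(13, 32, dtype=torch.int64)  # batch 13 > 1
```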

tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_attention.py (6)

1-13: LGTM! Copyright header added.

The NVIDIA copyright header is now properly included at the top of the file.


29-93: LGTM!

The MatchResult class is well-documented with clear attributes and a useful __repr__ for debugging.


123-291: LGTM!

The pattern matching logic is thorough with good step-by-step validation and informative debug logging. The broad exception handling is acceptable here since pattern matching should be resilient to unexpected graph structures.


293-362: LGTM!

The placeholder creation correctly derives symbolic dimensions from the token_ids tensor and maintains proper shape propagation.


364-474: LGTM!

The replacement logic correctly creates the fused AttentionPlugin with proper input wiring, output handling, and dead code elimination. Based on learnings, the float16 cast is intentional for DriveOS LLM compatibility.


507-556: LGTM!

The orchestration method correctly sequences the transform stages: pattern matching, placeholder creation, replacement, cleanup, recompilation, linting, and shape propagation. The unused factory and shared_config arguments are interface requirements from BaseTransform.

tensorrt_llm/_torch/auto_deploy/custom_ops/onnx_attention.py (5)

1-21: LGTM! Copyright header and module docstring.

The file now includes the required NVIDIA copyright header and has a clear module-level docstring explaining the purpose of these placeholder custom operations.


28-74: LGTM!

The attention_plugin custom op is correctly registered as a placeholder with comprehensive docstrings. The unused arguments are expected since this is a placeholder for ONNX export.


77-117: LGTM!

The fake implementation correctly computes output shapes for torch.compile tracing, properly handling KV-cache accumulation with present_kv_len = seq_len + past_len.
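
The registration-plus-fake pattern behind these two comments can be sketched generically as follows (the namespace, signature, and cache layout are assumptions; the real op in onnx_attention.py has many more inputs and outputs):

```python
import torch


@torch.library.custom_op("demo::attention_plugin", mutates_args=())
def attention_plugin(q: torch.Tensor, past_kv: torch.Tensor) -> torch.Tensor:
    # Placeholder eager body; the real kernel is DriveOS LLM's TRT plugin.
    seq_len, past_len = q.shape[1], past_kv.shape[-2]
    return past_kv.new_empty(*past_kv.shape[:-2], seq_len + past_len, past_kv.shape[-1])


@attention_plugin.register_fake
def _(q: torch.Tensor, past_kv: torch.Tensor) -> torch.Tensor:
    # Shape-only logic so tracing works: present_kv_len = seq_len + past_len.
    seq_len, past_len = q.shape[1], past_kv.shape[-2]
    return past_kv.new_empty(*past_kv.shape[:-2], seq_len + past_len, past_kv.shape[-1])
```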


120-147: LGTM!

The _dummy_gather_nd helper now correctly handles both 2D and 3D indices with a unified shape computation, addressing the previous issue about the missing return for 2D indices.
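
The unified shape rule mentioned here is the standard ONNX GatherND rule (batch_dims=0); a shape-only stand-in looks like this (a sketch, not the PR's helper):

```python
import torch


def dummy_gather_nd(data: torch.Tensor, indices: torch.Tensor) -> torch.Tensor:
    # output.shape = indices.shape[:-1] + data.shape[indices.shape[-1]:],
    # which covers both the 2D and 3D index cases uniformly.
    index_depth = indices.shape[-1]
    out_shape = list(indices.shape[:-1]) + list(data.shape[index_depth:])
    return data.new_zeros(out_shape)
```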


150-195: LGTM!

The gather_nd and gather_nd_fake implementations correctly delegate to the shared _dummy_gather_nd helper, maintaining consistency between the actual op and its fake implementation.

tensorrt_llm/_torch/auto_deploy/transform/library/export_to_onnx.py (7)

1-42: LGTM! Copyright header and imports.

The file now includes the required NVIDIA copyright header and has appropriate imports for ONNX schema registration and export functionality.


44-58: LGTM!

The ExportToONNXConfig class provides clean configuration with sensible defaults and proper field descriptions.


60-117: LGTM!

The custom RoPE schema is well-defined with proper type constraints and the translation function correctly creates the ONNX custom op node.


120-142: LGTM!

The custom_simple_linear_op correctly handles the weight transpose and optional bias addition, while custom_gather_nd_op properly delegates to the standard ONNX GatherND op.
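
For context, the kind of onnxscript translation involved here can be sketched as below; this illustrates the y = x @ W^T + b lowering only and is not the PR's exact function (the transpose is the fix called out in the commit log).

```python
from onnxscript import FLOAT, script
from onnxscript import opset18 as op


@script()
def simple_linear(x: FLOAT["N", "K"], weight: FLOAT["M", "K"], bias: FLOAT["M"]) -> FLOAT["N", "M"]:
    # nn.Linear stores weight as [out, in], so transpose before MatMul.
    weight_t = op.Transpose(weight, perm=[1, 0])
    return op.Add(op.MatMul(x, weight_t), bias)
```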


145-271: LGTM!

The AttentionPlugin schema is comprehensive with proper input/output specifications, type constraints, and required attributes. The translation function correctly maps to the TRT domain opset.


273-405: LGTM!

The torch_attention schema correctly defines the SDPA operation with GQA support, and the translation function properly handles the is_causal bool-to-int conversion for ONNX compatibility.


562-588: LGTM!

The _apply method correctly orchestrates the export workflow, creates necessary directories, and returns the original GraphModule unchanged since this is an export-only transform.

@nvyocox nvyocox force-pushed the export-driveos-llm-onnx branch 3 times, most recently from db15c67 to 59a90e4 Compare December 19, 2025 08:42
[none][feat] Add AutoDeploy export-onnx mode

Add a new mode "export-onnx" to AutoDeploy.
The mode is almost identical to the default one, with 2 differences:

  1. Fuse torch_rope_with_explicit_cos_sin &
     torch_cached_attention_with_cache into onnx_rope_attention
  2. The result is not a TRT engine but a .onnx file

Files added:

  - export_onnx.py: The transformation to fuse the ops
  - graph_module_visualizer.py: Convert GraphModule to .dot
  - examples/onnx_export_llm.py: Example usage
  - onnx_driveos_llm.yaml: The new mode config file
  - onnx_attention.py: The definition of the fused op

[none][feat] fix small graphviz bug, remove useless code

[none][feat] Rename mode from onnx_driveos_llm to export_driveos_llm_onnx

[none][feat] Rename export_onnx.py to fuse_rope_attention.py

[none][feat] Annotate .meta['val'] with add_graph_input()

[none][feat] Successfully export .onnx

[none][feat] Add set_kvcache_placeholder_metadata transform

[none][feat] Skip torch_cached_attention_prepare_metadata

[none][feat] Fix SetKVCachePlaceholderMetadata transform

[none][feat] Remove unused placeholder of prepare_metadata

[none][feat] Fix to run DeepSeek-R1

[none][feat] Add remove_graph_input, refactor remove_unused_placeholder()

[none][feat] Merge K&V cache placeholder

[none][feat] Replace sin_cos with input

[none][feat] Manually fuse rope & attn

[none][feat] Export torch_attention_bsnd_grouped_sdpa with dynamic shape

[none][feat] Manually match rope & attn, not replace yet

[none][feat] Successfully export ONNX with dynamic input

[none][feat] Hack out_spec to add graph output

[none][feat] Fix present_key_values shape

[none][feat] Fix input & output names

[none][feat] Change out_spec in add_graph_output

[none][feat] Fix export of torch_linear_simple

The original translation misses a transpose on the weight.

[none][feat] Fix present_key_values shape

[none][feat] Rewire reshape's new shape as TRT-LLM edge

[none][feat] Fix non-text rebase conflicts

[none][feat] Fix AttentionPlugin domain. should be "" not "ai.onnx"

[none][feat] Enhance visualize, use .meta["val"] instead of .meta["tensor_meta"]

[none][feat] Fix visualize tensor width calculation

When calculating the width of a tensor, check whether the dimension is an int or a SymInt.

The original implementation accidentally introduced constraints on the symbolic int.
I don't know exactly how this happens; I don't think it should
introduce new constraints, but it does.

[none][feat] Fix output dynamic batch_size

Originally the max batch size was 2; however, for unknown reasons, when it is set to 2 the batch_size collapses to a literal static int 2 even though we explicitly mark it as a dynamic axis.
Stranger still, when it is set to 13 the batch_size stays dynamic.
default=13,  # to enable dynamic batch_size, the max batch size must be > 1

[none][feat] Rename fuse_rope_attention_manually to fuse_rope_attention

[none][feat] Remove fuse_rope_attention.py

[none][feat] Rewire reshape to make the graph like Luxiao's

[none][feat] Fix last_token_ids dtype from i32 to i64

[none][feat] Catch up update to date DriveOS LLM

- Add placeholder kvcache_start_index
- AttentionPlugin add input kvcache_start_index
- Insert Unsqueeze -1 before GatherND
- rope_rotary_cos_sin dynamic axis name changed from
  rope_max_position_length to max_position_embeddings
- logits' dtype should be float32, insert a cast
- Insert cast to f16 before AttentionPlugin
- All cast to bf16 should be f16

[none][feat] Catch up update to date DriveOS LLM

- model.half() convert whole model to f16, including weight
- Remove AttentionPlugin attribute kv_cache_capacity & max_batch_size
- AttentionPlugin output[1] shape infer by seq_len + past_len
- AttentionPlugin domain changed from `onnx.ai` to `trt`
- Placeholder `kvcache_start_index` dynamic axes changed from `batch_size` to `kv_cache_start_batch_size`

[none][feat] Catch up-to-date main

[none][feat] Add test for fuse_rope_attention transform

- Add test for fuse_rope_attention
- Enhance run_test_transformed_gm to support Modules with multiple inputs
- Fix add_graph_output for graphs with only one _LEAF_SPEC

[none][feat] Add unit test for fuse_rope_attn

- Add a unit test
- Fix add_graph_output when out_spec is _LEAF_SPEC

[none][feat] Export .json files

[none][feat] add AutoDeploy export onnx end-to-end test

[none][feat] Export ONNX with cpu to reduce GPU memory footprint

[none][feat] Use model.config to get head_dim, instead of using literal

[none][feat] Visualize graph only when env var AD_DEBUG_VISUALIZE_DIR is set

- We no longer visualize by default; visualization runs only when AD_DEBUG_VISUALIZE_DIR is set.
- AD_DEBUG_VISUALIZE_DIR is also the output dir, so you can specify where the output is written.
- Simplify the logging messages; move most of them from info to debug.
- Add .cursor to .gitignore

Signed-off-by: yocox <[email protected]>
The dimension is wrong: it should be num_kv_head,
but it is hardwired to 2.

Signed-off-by: yocox <[email protected]>
- Remove "arg_" prefix for add_graph_input().
  See NVIDIA#10117 (comment)
  Remove all prefixes in the code to fix the regression.

- Replace test_ad_export_onnx model path from "Qwen/Qwen2.5-0.5B"
  to "/home/scratch.trt_llm_data/llm-models/Qwen2.5-0.5B-Instruct"
  to avoid "huggingface_hub.errors.HfHubHTTPError: 429 Client Error: Too Many Requests ..." error.

Signed-off-by: yocox <[email protected]>
- Simplify adapt_to_driveos_llm graph cleanup. Remove:

    gm.graph.eliminate_dead_code()
    gm.graph.lint()
    gm.recompile()

  and use is_clean=False instead.

- Adapt transformations from main:
  - match_bmm_moe_pattern
  - fuse_fp8_moe
  - fuse_nvfp4_moe

- Enhance comment for debug_visualize_dir option

Signed-off-by: yocox <[email protected]>
- Specify model path with get_small_model_config
- Simplify test code; make batch size and sequence length fixed values instead of arguments

Signed-off-by: yocox <[email protected]>
@nvyocox nvyocox force-pushed the export-driveos-llm-onnx branch from 9835586 to c90c23b Compare January 30, 2026 05:22
@nvyocox
Collaborator Author

nvyocox commented Jan 30, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #34175 [ run ] triggered by Bot. Commit: c90c23b

@tensorrt-cicd
Collaborator

PR_Github #34175 [ run ] completed with state SUCCESS. Commit: c90c23b
/LLM/main/L0_MergeRequest_PR pipeline #26368 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@nvyocox
Collaborator Author

nvyocox commented Jan 30, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #34224 [ run ] triggered by Bot. Commit: c90c23b

@tensorrt-cicd
Collaborator

PR_Github #34224 [ run ] completed with state DISABLED
CI server is currently disabled for scheduled maintenance. Estimated completion time: 7 AM PST on 1/30.

Member

@lucaslie lucaslie left a comment

That was a great last minute addition to add the explicit export_onnx function :)

@lucaslie
Member

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast

@lucaslie lucaslie enabled auto-merge (squash) January 30, 2026 14:48
@tensorrt-cicd
Collaborator

PR_Github #34227 [ run ] triggered by Bot. Commit: c90c23b

@tensorrt-cicd
Collaborator

PR_Github #34227 [ run ] completed with state DISABLED
CI server is currently disabled for scheduled maintenance. Estimated completion time: 8 AM PST on 1/30.

@lucaslie
Member

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #34235 [ run ] triggered by Bot. Commit: c90c23b

@tensorrt-cicd
Collaborator

PR_Github #34235 [ run ] completed with state SUCCESS. Commit: c90c23b
/LLM/main/L0_MergeRequest_PR pipeline #26404 completed with status: 'SUCCESS'

@lucaslie lucaslie merged commit 4af4720 into NVIDIA:main Jan 30, 2026
5 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in AutoDeploy Board Jan 30, 2026