
Conversation

@taylor-yb-lee
Collaborator

@taylor-yb-lee taylor-yb-lee commented Jan 28, 2026

Improved AutoDeploy backend weight-loading efficiency by replacing the accelerate module's parameter-wise verification with a file-based preloading strategy. Loading weight files directly to the CPU before transferring them to the device bypasses the overhead of individual parameter checks. For models exceeding 40K parameters, this change reduced loading latency by 80%.
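A minimal sketch of the idea (illustrative only; preload_and_apply, weight_files, and the single model.to(device) transfer are assumptions, not the PR's actual helpers):

import torch
from safetensors.torch import load_file


def preload_and_apply(model: torch.nn.Module, weight_files: list, device: str) -> None:
    """Load all checkpoint shards to CPU first, then move the model to the device in one pass."""
    state_dict = {}
    for path in weight_files:
        # safetensors shards deserialize straight into CPU memory, with no per-parameter checks
        state_dict.update(load_file(path, device="cpu"))
    # One load_state_dict call replaces accelerate's per-parameter verification loop
    model.load_state_dict(state_dict, strict=False)
    model.to(device)
    del state_dict  # free the CPU copy once the model owns the weights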

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous; skipping tests without due care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous; reusing results without due care and validation can break the top of tree.

Summary by CodeRabbit

Release Notes

  • Chores
    • Optimized HuggingFace model checkpoint loading with configurable CPU preload behavior. Users can now control whether checkpoints are preloaded to CPU or loaded directly onto the target device, improving efficiency and memory management based on specific deployment scenarios.

✏️ Tip: You can customize this high-level summary in your review settings.

@taylor-yb-lee taylor-yb-lee added the AutoDeploy <NV> AutoDeploy Backend label Jan 28, 2026
Member

@lucaslie lucaslie left a comment


looks great to me overall

Comment on lines +439 to +442
# Choose loading method based on environment variable
# Default behavior: preload checkpoint files to CPU
# Set AD_DISABLE_PRELOAD=1 to use accelerate's load_checkpoint_in_model (no CPU preload)
disable_preload = os.environ.get("AD_DISABLE_PRELOAD", "0") == "1"
Member


Do you want to keep this configurability or remove it?

Seems to me we don't need to keep it around

Collaborator Author


I wasn't sure, but could it be useful to allow turning off the preloading on host machines with small memory?
(Though the PT backend doesn't allow turning this off either.)
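For reference, a hedged sketch of how the toggle would be exercised (the flag name and semantics come from the snippet above; exactly where hf.py reads it is assumed):

import os

# Default: leave AD_DISABLE_PRELOAD unset (or "0") so checkpoint files are preloaded to CPU.
os.environ.pop("AD_DISABLE_PRELOAD", None)

# On hosts with limited CPU memory, set the flag to "1" before building the model to fall
# back to accelerate's load_checkpoint_in_model path (no CPU preload).
os.environ["AD_DISABLE_PRELOAD"] = "1"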


@taylor-yb-lee taylor-yb-lee marked this pull request as ready for review January 28, 2026 19:18
@taylor-yb-lee taylor-yb-lee requested a review from a team as a code owner January 28, 2026 19:18
@coderabbitai
Contributor

coderabbitai bot commented Jan 28, 2026

📝 Walkthrough

Walkthrough

Modified checkpoint loading in the HuggingFace model factory to support configurable preload behavior. When preloading is disabled via the AD_DISABLE_PRELOAD flag, the factory falls back to accelerate's direct device loading; otherwise (the default) it preloads the checkpoint to CPU memory first. Added support for loading from index.json files by aggregating weights across referenced checkpoints.

Changes

Cohort / File(s): Checkpoint Loading Refactor — tensorrt_llm/_torch/auto_deploy/models/hf.py
Summary: Added configurable CPU preload flow with new helper methods _load_checkpoint_with_preload() and _load_full_checkpoint_to_cpu(). Extended logic to support index.json-based checkpoint aggregation. Added imports: json, safetensors.torch. Introduced AD_DISABLE_PRELOAD flag to toggle between direct device loading and CPU preload paths. Enhanced logging for rank, preload decisions, and progress tracking (see the sketch below).

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Factory as AutoModelForCausalLMFactory
    participant Accelerate as accelerate
    participant CPUMem as CPU Memory
    participant Model as nn.Module
    participant Checkpoint as Checkpoint Files

    User->>Factory: Load HF model with checkpoint
    Factory->>Factory: Check AD_DISABLE_PRELOAD flag
    
    alt AD_DISABLE_PRELOAD unset or 0 (default: preload to CPU)
        Factory->>Checkpoint: Read checkpoint/index.json
        Checkpoint-->>Factory: Checkpoint data/file list
        Factory->>CPUMem: Load full checkpoint to CPU
        CPUMem-->>Factory: Aggregated weights dict
        Factory->>Model: Load weights into model
        Model-->>Factory: Model loaded
        Factory->>CPUMem: Free CPU memory
    else AD_DISABLE_PRELOAD=1 (accelerate direct device loading)
        Factory->>Accelerate: Use load_checkpoint_in_model
        Accelerate->>Checkpoint: Load directly on device
        Checkpoint-->>Accelerate: Weights on device
        Accelerate->>Model: Populate model
        Model-->>Accelerate: Model loaded
    end
    
    Factory-->>User: Model ready

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)

Docstring Coverage — ⚠️ Warning: Docstring coverage is 75.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them to satisfy the coverage threshold.

Description check — ❓ Inconclusive: The PR description has a clear opening statement explaining the optimization strategy and performance improvement, but it lacks required sections from the template: the 'Description' and 'Test Coverage' sections are present only as commented placeholders, and the PR Checklist is partially completed. Resolution: provide detailed explanations in the 'Description' section about the issue being solved and the solution implemented, and document test coverage by listing the tests that validate the new preloading logic and backward compatibility with the accelerate path.

✅ Passed checks (1 passed)

Title check — ✅ Passed: The title accurately describes the main change: optimizing Auto Deploy weight loading through CPU preloading, which aligns with the implementation of _load_checkpoint_with_preload and the configurable AD_DISABLE_PRELOAD flow.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@tensorrt_llm/_torch/auto_deploy/models/hf.py`:
- Around lines 460-477: _load_checkpoint_with_preload currently ignores the device parameter when placing tensors. Wrap the model.load_state_dict(all_weights, strict=False) call with the same hf_load_state_dict_with_device(device) context manager used elsewhere (or otherwise move the CPU-stored tensors to the target device while loading) so the weights land on the requested device. Use the helper hf_load_state_dict_with_device and the loader _load_full_checkpoint_to_cpu to locate where to add the context manager, and keep deleting all_weights after loading as before (see the sketch below).
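A minimal sketch of what that change could look like (hedged: the variable names mirror the review text rather than the exact source, and hf_load_state_dict_with_device is assumed to already be in scope in hf.py as the comment states):

# Inside _load_checkpoint_with_preload (illustrative, not the actual source):
all_weights = self._load_full_checkpoint_to_cpu(ckpt_file)
# Wrapping load_state_dict so the CPU-resident tensors are placed on the requested device
with hf_load_state_dict_with_device(device):
    model.load_state_dict(all_weights, strict=False)
del all_weights  # release the CPU copy once weights are on the device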
🧹 Nitpick comments (1)
tensorrt_llm/_torch/auto_deploy/models/hf.py (1)

479-521: Implementation is sound with appropriate format handling.

The method correctly handles both index.json multi-file checkpoints and single checkpoint files. Good security practice using weights_only=True for torch.load.

Minor style note: Static analysis flagged long exception messages on lines 519 and 521. Consider extracting to a custom exception class if this pattern is used elsewhere, but this is not critical.

♻️ Optional: Extract exception messages to improve readability
+class CheckpointLoadError(ValueError):
+    """Exception for checkpoint loading failures."""
+    pass
+
 def _load_full_checkpoint_to_cpu(self, ckpt_file: str) -> dict:
     """Load the full checkpoint to CPU memory."""
     # ... existing code ...
             else:
-                raise ValueError(f"Unsupported checkpoint format: {ckpt_file}")
+                raise CheckpointLoadError(f"Unsupported checkpoint format: {ckpt_file}")
         else:
-            raise ValueError(f"Checkpoint file not found or unsupported: {ckpt_file}")
+            raise CheckpointLoadError(f"Checkpoint file not found or unsupported: {ckpt_file}")

@taylor-yb-lee taylor-yb-lee changed the title [#10725][feat] Optimize Auto Deploy weight loading by preloading weights to CPU [#11086][feat] Optimize Auto Deploy weight loading by preloading weights to CPU Jan 29, 2026
Signed-off-by: Taylor Yeonbok Lee <[email protected]>
@taylor-yb-lee
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #34146 [ run ] triggered by Bot. Commit: 2e9b2e8

@tensorrt-cicd
Collaborator

PR_Github #34146 [ run ] completed with state SUCCESS. Commit: 2e9b2e8
/LLM/main/L0_MergeRequest_PR pipeline #26344 completed with status: 'SUCCESS'

Signed-off-by: Taylor Yeonbok Lee <[email protected]>

Labels

AutoDeploy <NV> AutoDeploy Backend


Development

Successfully merging this pull request may close these issues.

[Feature]: AutoDeploy: Weight preloading for improving model loading time for AutoDeploy Backend
