
Conversation

@taylor-yb-lee
Collaborator

@taylor-yb-lee taylor-yb-lee commented Jan 28, 2026

Improved AutoDeploy backend weight-loading efficiency by replacing the accelerate module's parameter-wise verification with a file-based preloading strategy. Loading weight files directly to the CPU before transferring them to the device bypasses the overhead of individual parameter checks. For models exceeding 40K parameters, this change reduced loading latency by 80%.
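A minimal sketch of the idea (illustrative only; preload_and_apply, weight_files, and the single model.to(device) transfer are assumptions, not the PR's actual helpers):

import torch
from safetensors.torch import load_file


def preload_and_apply(model: torch.nn.Module, weight_files: list, device: str) -> None:
    """Load all checkpoint shards to CPU first, then move the model to the device in one pass."""
    state_dict = {}
    for path in weight_files:
        # safetensors shards deserialize straight into CPU memory, with no per-parameter checks
        state_dict.update(load_file(path, device="cpu"))
    # One load_state_dict call replaces accelerate's per-parameter verification loop
    model.load_state_dict(state_dict, strict=False)
    model.to(device)
    del state_dict  # free the CPU copy once the model owns the weights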

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous; skipping tests without due care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous; reusing results without due care and validation can break the top of tree.

Summary by CodeRabbit

Release Notes

  • Chores
    • Optimized HuggingFace model checkpoint loading with configurable CPU preload behavior. Users can now control whether checkpoints are preloaded to CPU or loaded directly onto the target device, improving efficiency and memory management based on specific deployment scenarios.

✏️ Tip: You can customize this high-level summary in your review settings.

@taylor-yb-lee taylor-yb-lee added the AutoDeploy <NV> AutoDeploy Backend label Jan 28, 2026
Member

@lucaslie lucaslie left a comment


looks great to me overall

Comment on lines +439 to +442
# Choose loading method based on environment variable
# Default behavior: preload checkpoint files to CPU
# Set AD_DISABLE_PRELOAD=1 to use accelerate's load_checkpoint_in_model (no CPU preload)
disable_preload = os.environ.get("AD_DISABLE_PRELOAD", "0") == "1"
Member


Do you want to keep this configurability or remove it?

Seems to me we don't need to keep it around

Collaborator Author


I wasn't sure, but could it be useful to allow turning off the preloading on host machines with small memory?
(Though the PT backend doesn't allow turning this off either.)
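For reference, a hedged sketch of how the toggle would be exercised (the flag name and semantics come from the snippet above; exactly where hf.py reads it is assumed):

import os

# Default: leave AD_DISABLE_PRELOAD unset (or "0") so checkpoint files are preloaded to CPU.
os.environ.pop("AD_DISABLE_PRELOAD", None)

# On hosts with limited CPU memory, set the flag to "1" before building the model to fall
# back to accelerate's load_checkpoint_in_model path (no CPU preload).
os.environ["AD_DISABLE_PRELOAD"] = "1"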


@taylor-yb-lee taylor-yb-lee marked this pull request as ready for review January 28, 2026 19:18
@taylor-yb-lee taylor-yb-lee requested a review from a team as a code owner January 28, 2026 19:18
@coderabbitai
Contributor

coderabbitai bot commented Jan 28, 2026

📝 Walkthrough

Walkthrough

Modified checkpoint loading in the HuggingFace model factory to support configurable preload behavior. When preloading is disabled via the AD_DISABLE_PRELOAD flag, the factory falls back to accelerate's direct device loading; otherwise (the default) it preloads the checkpoint to CPU memory first. Added support for loading from index.json files by aggregating weights across referenced checkpoints.

Changes

Cohort / File(s): Checkpoint Loading Refactor — tensorrt_llm/_torch/auto_deploy/models/hf.py
Summary: Added configurable CPU preload flow with new helper methods _load_checkpoint_with_preload() and _load_full_checkpoint_to_cpu(). Extended logic to support index.json-based checkpoint aggregation. Added imports: json, safetensors.torch. Introduced AD_DISABLE_PRELOAD flag to toggle between direct device loading and CPU preload paths. Enhanced logging for rank, preload decisions, and progress tracking (see the sketch below).

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Factory as AutoModelForCausalLMFactory
    participant Accelerate as accelerate
    participant CPUMem as CPU Memory
    participant Model as nn.Module
    participant Checkpoint as Checkpoint Files

    User->>Factory: Load HF model with checkpoint
    Factory->>Factory: Check AD_DISABLE_PRELOAD flag
    
    alt AD_DISABLE_PRELOAD unset or 0 (default: preload to CPU)
        Factory->>Checkpoint: Read checkpoint/index.json
        Checkpoint-->>Factory: Checkpoint data/file list
        Factory->>CPUMem: Load full checkpoint to CPU
        CPUMem-->>Factory: Aggregated weights dict
        Factory->>Model: Load weights into model
        Model-->>Factory: Model loaded
        Factory->>CPUMem: Free CPU memory
    else AD_DISABLE_PRELOAD=1 (accelerate direct device loading)
        Factory->>Accelerate: Use load_checkpoint_in_model
        Accelerate->>Checkpoint: Load directly on device
        Checkpoint-->>Accelerate: Weights on device
        Accelerate->>Model: Populate model
        Model-->>Accelerate: Model loaded
    end
    
    Factory-->>User: Model ready

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)

Docstring Coverage — ⚠️ Warning: Docstring coverage is 75.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them to satisfy the coverage threshold.

Description check — ❓ Inconclusive: The PR description has a clear opening statement explaining the optimization strategy and performance improvement, but it lacks required sections from the template: the 'Description' and 'Test Coverage' sections are present only as commented placeholders, and the PR Checklist is partially completed. Resolution: provide detailed explanations in the 'Description' section about the issue being solved and the solution implemented, and document test coverage by listing the tests that validate the new preloading logic and backward compatibility with the accelerate path.

✅ Passed checks (1 passed)

Title check — ✅ Passed: The title accurately describes the main change: optimizing Auto Deploy weight loading through CPU preloading, which aligns with the implementation of _load_checkpoint_with_preload and the configurable AD_DISABLE_PRELOAD flow.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@tensorrt_llm/_torch/auto_deploy/models/hf.py`:
- Around lines 460-477: _load_checkpoint_with_preload currently ignores the device parameter when placing tensors. Wrap the model.load_state_dict(all_weights, strict=False) call with the same hf_load_state_dict_with_device(device) context manager used elsewhere (or otherwise move the CPU-stored tensors to the target device while loading) so the weights land on the requested device. Use the helper hf_load_state_dict_with_device and the loader _load_full_checkpoint_to_cpu to locate where to add the context manager, and keep deleting all_weights after loading as before (see the sketch below).
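A minimal sketch of what that change could look like (hedged: the variable names mirror the review text rather than the exact source, and hf_load_state_dict_with_device is assumed to already be in scope in hf.py as the comment states):

# Inside _load_checkpoint_with_preload (illustrative, not the actual source):
all_weights = self._load_full_checkpoint_to_cpu(ckpt_file)
# Wrapping load_state_dict so the CPU-resident tensors are placed on the requested device
with hf_load_state_dict_with_device(device):
    model.load_state_dict(all_weights, strict=False)
del all_weights  # release the CPU copy once weights are on the device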
🧹 Nitpick comments (1)
tensorrt_llm/_torch/auto_deploy/models/hf.py (1)

479-521: Implementation is sound with appropriate format handling.

The method correctly handles both index.json multi-file checkpoints and single checkpoint files. Good security practice using weights_only=True for torch.load.

Minor style note: Static analysis flagged long exception messages on lines 519 and 521. Consider extracting to a custom exception class if this pattern is used elsewhere, but this is not critical.

♻️ Optional: Extract exception messages to improve readability
+class CheckpointLoadError(ValueError):
+    """Exception for checkpoint loading failures."""
+    pass
+
 def _load_full_checkpoint_to_cpu(self, ckpt_file: str) -> dict:
     """Load the full checkpoint to CPU memory."""
     # ... existing code ...
             else:
-                raise ValueError(f"Unsupported checkpoint format: {ckpt_file}")
+                raise CheckpointLoadError(f"Unsupported checkpoint format: {ckpt_file}")
         else:
-            raise ValueError(f"Checkpoint file not found or unsupported: {ckpt_file}")
+            raise CheckpointLoadError(f"Checkpoint file not found or unsupported: {ckpt_file}")

@taylor-yb-lee taylor-yb-lee changed the title [#10725][feat] Optimize Auto Deploy weight loading by preloading weights to CPU [#11086][feat] Optimize Auto Deploy weight loading by preloading weights to CPU Jan 29, 2026
Signed-off-by: Taylor Yeonbok Lee <[email protected]>
@taylor-yb-lee
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #34146 [ run ] triggered by Bot. Commit: 2e9b2e8

@tensorrt-cicd
Collaborator

PR_Github #34146 [ run ] completed with state SUCCESS. Commit: 2e9b2e8
/LLM/main/L0_MergeRequest_PR pipeline #26344 completed with status: 'SUCCESS'

Signed-off-by: Taylor Yeonbok Lee <[email protected]>

Labels

AutoDeploy <NV> AutoDeploy Backend


Development

Successfully merging this pull request may close these issues.

[Feature]: AutoDeploy: Weight preloading for improving model loading time for AutoDeploy Backend
