feat: cuda device_map for pipelines. #12122

sayakpaul · 2025-08-11T06:35:26Z

What does this PR do?

TL;DR: This PR adds a device_map option at the pipeline level to speed up end-to-end pipeline loading on a target device.

To benefit from #11904, users have to follow this pattern:

from diffusers import AutoModel, DiffusionPipeline
import torch

# first initialize the model with device_map to benefit from fast loading
model = AutoModel.from_pretrained(..., device_map="cuda", torch_dtype=...)

# initialize the pipeline
pipe = DiffusionPipeline.from_pretrained(..., transformer=model, torch_dtype=...)

# place the remaining modules on cuda; components such as schedulers
# and tokenizers are not nn.Modules and have no .to() method
for name, component in pipe.components.items():
    if name != "model_already_loaded_above" and isinstance(component, torch.nn.Module):
        component.to("cuda")

# run inference
...

We could improve the UX by letting users pass device_map="cuda" (or any other valid value) while initializing the pipeline. This PR does that:

from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained(..., device_map="cuda", torch_dtype=...)
...
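To check where each component ended up after loading, something like the following could be used. This is only a sketch: `summarize_component_devices` is a hypothetical helper, not part of diffusers, and it assumes that loaded models expose a `device` attribute while components such as schedulers and tokenizers do not.

```python
from collections import defaultdict


def summarize_component_devices(components):
    """Group component names by the device they report.

    `components` is a mapping of name -> object, e.g. `pipe.components`.
    Objects without a `device` attribute are listed under "n/a".
    """
    placement = defaultdict(list)
    for name, component in components.items():
        device = getattr(component, "device", None)
        key = str(device) if device is not None else "n/a"
        placement[key].append(name)
    return dict(placement)
```

Calling `summarize_component_devices(pipe.components)` after `from_pretrained(..., device_map="cuda")` should then show at a glance whether anything stayed off the target device.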

For pipelines like Flux, passing device_map when loading the text encoders might be about the same as calling to(). However, for pipelines like Qwen-Image, which use a mid-range model like Qwen25VL-7B, passing device_map="cuda" while initializing the pipeline should be beneficial (of course, the target device needs enough VRAM to support this). Below are the results I got for Qwen-Image (with cold cache):

time: 8.494s (no device_map)
time: 6.678s (device_map)

Code

import time
t_ini = time.time()

import torch
from diffusers import DiffusionPipeline
print(f"import time: {time.time() - t_ini:.3f}s")

model_id = "Qwen/Qwen-Image"

t0 = time.time()
torch.cuda.synchronize()
print(f"CUDA sync time: {time.time() - t0:.3f}s")

print("starting pipe load")
t1 = time.time()
pipe = DiffusionPipeline.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)
torch.cuda.synchronize()
t2 = time.time()

diff = t2 - t1
print(f"time: {diff:.3f}s")

print(getattr(pipe, "hf_device_map", None))
_ = pipe("dog", num_inference_steps=2)
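The same measurement pattern (start a clock, run the load, synchronize, read the clock) can be wrapped in a small reusable timer. This is a minimal sketch; `timed` and its `sync` and `results` parameters are hypothetical names introduced here, and it uses `time.perf_counter()` rather than `time.time()`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, sync=None, results=None):
    """Time the enclosed block and print the elapsed seconds.

    sync:    optional zero-argument callable run before the clock is read,
             e.g. torch.cuda.synchronize, so pending GPU work is included.
    results: optional dict that receives the elapsed time under `label`.
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        if sync is not None:
            sync()
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label}: {elapsed:.3f}s")
```

Under these assumptions, the load above could be timed with `with timed("pipe load", sync=torch.cuda.synchronize): pipe = DiffusionPipeline.from_pretrained(...)`.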

Any objections?

HuggingFaceDocBuilderDev · 2025-08-11T06:43:28Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

DN6

LGTM. Could we add a simple fast GPU test?

sayakpaul · 2025-08-12T16:16:51Z

@DN6 done! I have added a test, too.

DN6 · 2025-08-13T08:20:34Z

src/diffusers/pipelines/pipeline_utils.py

    numpy_to_pil,
 )
 from ..utils.hub_utils import _check_legacy_sharding_variant_format, load_or_create_model_card, populate_model_card
+from ..utils.testing_utils import torch_device


I would prefer not to import from testing_utils for non-test modules (in fact we should move this module out of src). It would be better to redefine the relevant torch_device functionality in torch_utils and import from there.

Yeah, my thoughts too. We should actually move torch_device to utils.
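A device-detection helper living outside the testing utilities might look roughly like the sketch below. It is not the implementation that landed in diffusers; `get_default_device` is a hypothetical name, and the backend order (cuda, then xpu, then mps, then cpu) is an assumption made for illustration.

```python
import importlib.util


def get_default_device():
    """Return the name of the first available accelerator, else "cpu"."""
    # Without torch installed there is nothing to probe.
    if importlib.util.find_spec("torch") is None:
        return "cpu"

    import torch

    if torch.cuda.is_available():
        return "cuda"
    # Older torch builds may lack these backends, hence the getattr guards.
    xpu = getattr(torch, "xpu", None)
    if xpu is not None and xpu.is_available():
        return "xpu"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"
```

Keeping the probe in a plain utility module lets pipeline code import it without pulling in anything test-related.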

SunMarc

Nice!

feat: cuda device_map for pipelines.

eba6f82

sayakpaul requested review from SunMarc and a-r-r-o-w August 11, 2025 06:35

Merge branch 'main' into cuda-device-map-pipe

b4483cc

DN6 reviewed Aug 12, 2025

sayakpaul mentioned this pull request Aug 12, 2025

[core] parallel loading of shards #12028

Merged

sayakpaul added 2 commits August 12, 2025 21:11

Merge branch 'main' into cuda-device-map-pipe

7228ab8

up

a31e59e

sayakpaul changed the title ~~[wip] feat: cuda device_map for pipelines.~~ feat: cuda device_map for pipelines. Aug 12, 2025

sayakpaul requested a review from DN6 August 12, 2025 16:16

sayakpaul marked this pull request as ready for review August 12, 2025 16:16

sayakpaul added 3 commits August 12, 2025 22:00

up

30c575b

Merge branch 'main' into cuda-device-map-pipe

0ab1d9e

empty

3dd4fb2

sayakpaul mentioned this pull request Aug 13, 2025

[docs] Parallel loading of shards #12135

Merged

Merge branch 'main' into cuda-device-map-pipe

8448bdf

DN6 reviewed Aug 13, 2025

up

f657b6b

sayakpaul requested a review from DN6 August 13, 2025 08:30

DN6 approved these changes Aug 14, 2025

sayakpaul added 2 commits August 14, 2025 08:42

Merge branch 'main' into cuda-device-map-pipe

5e6f142

Merge branch 'main' into cuda-device-map-pipe

736971c

sayakpaul merged commit 46a0c6a into main Aug 14, 2025
35 checks passed

sayakpaul deleted the cuda-device-map-pipe branch August 14, 2025 05:01

SunMarc reviewed Aug 20, 2025
