diff --git a/docs/source/en/using-diffusers/loading.md b/docs/source/en/using-diffusers/loading.md
index 591a1382967e..20f0cc51e0af 100644
--- a/docs/source/en/using-diffusers/loading.md
+++ b/docs/source/en/using-diffusers/loading.md
@@ -112,6 +112,30 @@ print(pipe.transformer.dtype, pipe.vae.dtype) # (torch.bfloat16, torch.float16)
 If a component is not explicitly specified in the dictionary and no `default` is provided, it will be loaded with `torch.float32`.
 
+### Parallel loading
+
+Large models are often [sharded](../training/distributed_inference#model-sharding) into smaller files so that they are easier to load. Diffusers supports loading shards in parallel to speed up the loading process.
+
+Set the environment variables below to enable and configure parallel loading.
+
+- Set `HF_ENABLE_PARALLEL_LOADING` to `"YES"` to enable parallel loading of shards.
+- Set `HF_PARALLEL_LOADING_WORKERS` to configure the number of parallel threads to use when loading shards. More workers load a model faster but use more memory.
+
+The `device_map` argument should be set to `"cuda"` to pre-allocate a large chunk of memory based on the model size. This substantially reduces model load time because warming up the memory allocator upfront avoids many smaller allocation calls later.
+
+```py
+import os
+import torch
+from diffusers import DiffusionPipeline
+
+os.environ["HF_ENABLE_PARALLEL_LOADING"] = "YES"
+pipeline = DiffusionPipeline.from_pretrained(
+    "Wan-AI/Wan2.2-I2V-A14B-Diffusers",
+    torch_dtype=torch.bfloat16,
+    device_map="cuda"
+)
+```
+
 ### Local pipeline
 
 To load a pipeline locally, use [git-lfs](https://git-lfs.github.com/) to manually download a checkpoint to your local disk.
 
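The trailing context above ends at the local-loading instructions. As a rough sketch of that step, not part of this diff, the snippet below loads a checkpoint from disk with `from_pretrained`; the `./Wan2.2-I2V-A14B-Diffusers` directory is a hypothetical location for a checkpoint already downloaded with git-lfs.

```py
import torch
from diffusers import DiffusionPipeline

# Hypothetical local directory containing a checkpoint previously fetched with git-lfs.
local_dir = "./Wan2.2-I2V-A14B-Diffusers"

# Passing a local path to from_pretrained loads directly from disk instead of the Hub.
pipeline = DiffusionPipeline.from_pretrained(
    local_dir,
    torch_dtype=torch.bfloat16,
)
```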