31 Aug 18:43

6fc8aff

Patch Release 0.20.2 - Correct SDXL Inpaint Strength Default

Stable Diffusion XL's strength default was accidentally set to 1.0 when creating the pipeline. The default should be set to 0.9999 instead. This patch release fixes that.
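To see why the two values behave differently, the sketch below mimics how img2img-style pipelines commonly turn `strength` into the number of denoising steps that are run. The helper name `steps_to_run` and the exact rounding rule are assumptions for illustration, not the diffusers implementation.

```python
def steps_to_run(num_inference_steps: int, strength: float) -> int:
    """Return how many of the scheduled steps are executed for a given strength.

    Assumed rule: int(num_inference_steps * strength) steps, capped at the total.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    return min(int(num_inference_steps * strength), num_inference_steps)


for strength in (1.0, 0.9999, 0.5):
    print(strength, steps_to_run(50, strength))
```

Under this assumed rule, 50 scheduled steps at strength 0.9999 run 49 steps, while strength 1.0 runs all 50.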

All commits

[SDXL Inpaint] Correct strength default by @patrickvonplaten in #4858

Contributors

patrickvonplaten


28 Aug 04:47

sayakpaul

v0.20.1

9b93b6c

Patch Release: Fix `torch.compile()` support for ControlNets

3eb498e#r125606630 introduced a 🐛 that broke the torch.compile() support for ControlNets. This patch release fixes that.

All commits

[Docs] Fix docs controlnet missing /Tip by @patrickvonplaten in #4717
[Torch compile] Fix torch compile for controlnet by @patrickvonplaten in #4795

Contributors

patrickvonplaten


17 Aug 08:46

sayakpaul

v0.20.0

e3380c0

v0.20.0: SDXL ControlNets with MultiControlNet, GLIGEN, Tiny Autoencoder, SDXL DreamBooth LoRA in free-tier Colab, and more

SDXL ControlNets 🚀

The 🧨 diffusers team has trained two ControlNets on Stable Diffusion XL (SDXL):

Canny (diffusers/controlnet-canny-sdxl-1.0)
Depth (diffusers/controlnet-depth-sdxl-1.0)

You can find all the SDXL ControlNet checkpoints here, including some smaller ones (5 to 7x smaller).

To know more about how to use these ControlNets to perform inference, check out the respective model cards and the documentation. To train custom SDXL ControlNets, you can try out our training script.

MultiControlNet for SDXL

This release also introduces support for combining multiple ControlNets trained on SDXL and performing inference with them. Refer to the documentation to learn more.

GLIGEN

The GLIGEN model was developed by researchers and engineers from University of Wisconsin-Madison, Columbia University, and Microsoft. The StableDiffusionGLIGENPipeline can generate photorealistic images conditioned on grounding inputs. Along with text and bounding boxes, if input images are given, this pipeline can insert objects described by text at the region defined by bounding boxes. Otherwise, it’ll generate an image described by the caption/prompt and insert objects described by text at the region defined by bounding boxes. It’s trained on COCO2014D and COCO2014CD datasets, and the model uses a frozen CLIP ViT-L/14 text encoder to condition itself on grounding inputs.

(GIF from the official website)

Grounded inpainting

import torch
from diffusers import StableDiffusionGLIGENPipeline
from diffusers.utils import load_image

# Insert objects described by text at the region defined by bounding boxes
pipe = StableDiffusionGLIGENPipeline.from_pretrained(
    "masterful/gligen-1-4-inpainting-text-box", variant="fp16", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

input_image = load_image(
    "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/gligen/livingroom_modern.png"
)
prompt = "a birthday cake"
boxes = [[0.2676, 0.6088, 0.4773, 0.7183]]
phrases = ["a birthday cake"]

images = pipe(
    prompt=prompt,
    gligen_phrases=phrases,
    gligen_inpaint_image=input_image,
    gligen_boxes=boxes,
    gligen_scheduled_sampling_beta=1,
    output_type="pil",
    num_inference_steps=50,
).images
images[0].save("./gligen-1-4-inpainting-text-box.jpg")

Grounded generation

import torch
from diffusers import StableDiffusionGLIGENPipeline
from diffusers.utils import load_image

# Generate an image described by the prompt and
# insert objects described by text at the region defined by bounding boxes
pipe = StableDiffusionGLIGENPipeline.from_pretrained(
    "masterful/gligen-1-4-generation-text-box", variant="fp16", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

prompt = "a waterfall and a modern high speed train running through the tunnel in a beautiful forest with fall foliage"
boxes = [[0.1387, 0.2051, 0.4277, 0.7090], [0.4980, 0.4355, 0.8516, 0.7266]]
phrases = ["a waterfall", "a modern high speed train running through the tunnel"]

images = pipe(
    prompt=prompt,
    gligen_phrases=phrases,
    gligen_boxes=boxes,
    gligen_scheduled_sampling_beta=1,
    output_type="pil",
    num_inference_steps=50,
).images
images[0].save("./gligen-1-4-generation-text-box.jpg")
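The `boxes` values in both examples are fractions of the image size. Assuming the `[xmin, ymin, xmax, ymax]` ordering, a small helper can convert pixel coordinates into that normalized form. `normalize_box` is a hypothetical name and is not part of diffusers.

```python
from typing import List, Sequence


def normalize_box(box_px: Sequence[float], width: int, height: int) -> List[float]:
    """Convert a pixel box (xmin, ymin, xmax, ymax) to fractions of the image size."""
    xmin, ymin, xmax, ymax = box_px
    if not (0 <= xmin < xmax <= width and 0 <= ymin < ymax <= height):
        raise ValueError("box must lie inside the image with xmin < xmax and ymin < ymax")
    return [
        round(xmin / width, 4),
        round(ymin / height, 4),
        round(xmax / width, 4),
        round(ymax / height, 4),
    ]


# A box covering the left half of a 512x512 image.
print(normalize_box((0, 0, 256, 512), width=512, height=512))
```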

Refer to the documentation to learn more.

Thanks to @nikhil-masterful for contributing GLIGEN in #4441.

Tiny Autoencoder

@madebyollin trained two Autoencoders (on Stable Diffusion and Stable Diffusion XL, respectively) to dramatically cut down the image decoding time. The effects are especially pronounced when working with larger-resolution images. You can use AutoencoderTiny to take advantage of it.

Here’s the example usage for Stable Diffusion:

import torch
from diffusers import DiffusionPipeline, AutoencoderTiny

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base", torch_dtype=torch.float16
)
pipe.vae = AutoencoderTiny.from_pretrained("madebyollin/taesd", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "slice of delicious New York-style berry cheesecake"
image = pipe(prompt, num_inference_steps=25).images[0]
image.save("cheesecake.png")

Refer to the documentation to learn more. Refer to this material to understand the implications of using this Autoencoder in terms of inference latency and memory footprint.

Fine-tuning Stable Diffusion XL with DreamBooth and LoRA on a free-tier Colab Notebook

Stable Diffusion XL’s (SDXL) high memory requirements often seem restrictive when it comes to using it for downstream applications. Even if one uses parameter-efficient fine-tuning techniques like LoRA, fine-tuning just the UNet component of SDXL can be quite memory-intensive. So, running it on a free-tier Colab Notebook (that usually has a 16 GB T4 GPU attached) seems impossible.

Now, with better support for gradient checkpointing and other recipes like 8 Bit Adam (via bitsandbytes), it is possible to fine-tune the UNet of SDXL with DreamBooth and LoRA on a free-tier Colab Notebook.

Check out the Colab Notebook to learn more.

Thanks to @ethansmith2000 for improving the gradient checkpointing support in #4474.

Support of `push_to_hub` for models, schedulers, and pipelines

Our models, schedulers, and pipelines now accept a push_to_hub option in save_pretrained() and also come with a push_to_hub() method. Below are some usage examples.

Models

from diffusers import ControlNetModel

controlnet = ControlNetModel(
    block_out_channels=(32, 64),
    layers_per_block=2,
    in_channels=4,
    down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
    cross_attention_dim=32,
    conditioning_embedding_out_channels=(16, 32),
)
controlnet.push_to_hub("my-controlnet-model")
# or controlnet.save_pretrained("my-controlnet-model", push_to_hub=True)

Schedulers

from diffusers import DDIMScheduler

scheduler = DDIMScheduler(
    beta_start=0.00085,
    beta_end=0.012,
    beta_schedule="scaled_linear",
    clip_sample=False,
    set_alpha_to_one=False,
)
scheduler.push_to_hub("my-controlnet-scheduler")

Pipelines

from diffusers import (
    UNet2DConditionModel,
    AutoencoderKL,
    DDIMScheduler,
    StableDiffusionPipeline,
)
from transformers import CLIPTextModel, CLIPTextConfig, CLIPTokenizer

unet = UNet2DConditionModel(
    block_out_channels=(32, 64),
    layers_per_block=2,
    sample_size=32,
    in_channels=4,
    out_channels=4,
    down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
    up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
    cross_attention_dim=32,
)

scheduler = DDIMScheduler(
    beta_start=0.00085,
    beta_end=0.012,
    beta_schedule="scaled_linear",
    clip_sample=False,
    set_alpha_to_one=False,
)

vae = AutoencoderKL(
    block_out_channels=[32, 64],
    in_channels=3,
    out_channels=3,
    down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
    up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
    latent_channels=4,
)

text_encoder_config = CLIPTextConfig(
    bos_token_id=0,
    eos_token_id=2,
    hidden_size=32,
    intermediate_size=37,
    layer_norm_eps=1e-05,
    num_attention_heads=4,
    num_hidden_layers=5,
    pad_token_id=1,
    vocab_size=1000,
)
text_encoder = CLIPTextModel(text_encoder_config)
tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")

components = {
    "unet": unet,
    "scheduler": scheduler,
    "vae": vae,
    "text_encoder": text_encoder,
    "tokenizer": tokenizer,
    "safety_checker": None,
    "feature_extractor": None,
}
pipeline = StableDiffusionPipeline(**components)
pipeline.push_to_hub("my-pipeline")

Refer to the documentation to know more.

Thanks to @Wauplin for his generous and constructive feedback on this feature (refer to #4218).

Better support for loading Kohya-trained LoRA checkpoints

Providing seamless support for loading Kohya-trained LoRA checkpoints from diffusers is important for us. This is wh...

Contributors

levi, slessans, and 45 other contributors


30 Jul 10:26

patrickvonplaten

v0.19.3

de9c72d

Patch release: Fix incorrect filenaming

0.19.3 is a patch release to make sure import diffusers works without transformers being installed.

It includes a fix of this issue.
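An optional dependency is typically handled by checking whether it is installed and importing it only where it is needed. The sketch below shows that general pattern with the standard library. `is_available` and `get_text_encoder_class` are hypothetical names, not the diffusers internals.

```python
import importlib.util


def is_available(package: str) -> bool:
    """Return True if `package` can be imported in the current environment."""
    return importlib.util.find_spec(package) is not None


def get_text_encoder_class():
    """Import the optional dependency lazily, only when it is actually requested."""
    if not is_available("transformers"):
        raise ImportError("This feature requires `transformers`; please install it first.")
    from transformers import CLIPTextModel

    return CLIPTextModel


# Defining the functions above never touches `transformers`,
# so importing this module succeeds even when it is missing.
print(is_available("json"))
```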

All commits

[SDXL] Fix dummy imports incorrect naming by @patrickvonplaten in #4370

Contributors

patrickvonplaten


28 Jul 18:27

patrickvonplaten

v0.19.2

965e52c

Patch Release: Support for SDXL Kohya-style LoRAs, Fix batched inference SDXL Img2Img, Improve watermarker

We still had some bugs 🐛 in 0.19.1, notably:

SDXL (Kohya-style) LoRA

The official SD-XL 1.0 LoRA (Kohya-styled) is now supported thanks to #4287. You can try it as follows:

from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16)
pipe.load_lora_weights("stabilityai/stable-diffusion-xl-base-1.0", weight_name="sd_xl_offset_example-lora_1.0.safetensors")
pipe.to("cuda")

prompt = "beautiful scenery nature glass bottle landscape, purple galaxy bottle"
negative_prompt = "text, watermark"

image = pipe(prompt, negative_prompt=negative_prompt, num_inference_steps=25).images[0]

In addition, a couple more SDXL LoRAs are now supported:

(SDXL 0.9:)

To know more details and the known limitations, please check out the documentation.

Thanks to @isidentical for their sincere help in the PR.

Batched inference

@bghira found that batched inference with SDXL Img2Img led to weird artifacts. This is fixed in #4327.

Downloads

Under some circumstances, SD-XL 1.0 could download ONNX weights. This is corrected in #4338.

Improved SDXL behavior

#4346 allows the user to disable the watermarker under certain circumstances to improve the usability of SDXL.

All commits:

[SDXL Refiner] Fix refiner forward pass for batched input by @patrickvonplaten in #4327
[ONNX] Don't download ONNX model by default by @patrickvonplaten in #4338
[SDXL] Make watermarker optional under certain circumstances to improve usability of SDXL 1.0 by @patrickvonplaten in #4346
[Feat] Support SDXL Kohya-style LoRA by @sayakpaul in #4287

Contributors

sayakpaul, patrickvonplaten, and 2 other contributors


27 Jul 18:41

patrickvonplaten

v0.19.1

aa4634a

Patch Release: Fix torch compile and local_files_only

In 0.19.0 some bugs 🐛 found their way into the release. We're very sorry about this 🙏

This patch release fixes all of them.

All commits

update Kandinsky doc by @yiyixuxu in #4301
[Torch.compile] Fixes torch compile graph break by @patrickvonplaten in #4315
Fix SDXL conversion from original to diffusers by @duongna21 in #4280
fix a bug in StableDiffusionUpscalePipeline when prompt is None by @yiyixuxu in #4278
[Local loading] Correct bug with local files only by @patrickvonplaten in #4318
Release: v0.19.1 by @patrickvonplaten (direct commit on v0.19.1-patch)

Contributors

yiyixuxu, patrickvonplaten, and duongna21


26 Jul 19:35

patrickvonplaten

v0.19.0

ef9824f

v0.19.0: SD-XL 1.0 (permissive license), AutoPipelines, Improved Kandinsky & Asymmetric VQGAN, T2I Adapter

SDXL 1.0

Stable Diffusion XL (SDXL) 1.0 with the permissive CreativeML Open RAIL++-M License was released today. We provide full compatibility with SDXL in diffusers.

from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
pipe.to("cuda")

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt=prompt).images[0]
image

Many additional cool features are released:

Pipelines for
- Img2Img
- Inpainting
Torch compile support
Model offloading
Ensemble of Denoising Experts (eDiff-I approach) - thanks to @bghira @SytanSD @Birch-san @AmericanPresidentJimmyCarter
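The ensemble idea is that the base model handles the early, high-noise part of the schedule and the refiner handles the rest. The toy sketch below only splits a list of step indices at a chosen fraction. `split_steps` and its `handoff` parameter are illustrative names, not the pipeline API.

```python
from typing import List, Tuple


def split_steps(num_inference_steps: int, handoff: float) -> Tuple[List[int], List[int]]:
    """Split step indices so the base model runs the first `handoff` fraction."""
    if not 0.0 < handoff < 1.0:
        raise ValueError("handoff must be strictly between 0 and 1")
    cut = round(num_inference_steps * handoff)
    steps = list(range(num_inference_steps))
    return steps[:cut], steps[cut:]


base_steps, refiner_steps = split_steps(40, handoff=0.8)
print(len(base_steps), len(refiner_steps))
```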

Refer to the documentation to know more.

New training scripts for SDXL

When there’s a new pipeline, there ought to be new training scripts. We added support for the following training scripts that build on top of SDXL:

Shoutout to @harutatsuakiyama for contributing the training script for InstructPix2Pix in #4079.

New pipelines for SDXL

The ControlNet and InstructPix2Pix training scripts also needed their respective pipelines. So, we added support for the following pipelines as well:

StableDiffusionXLControlNetPipeline
StableDiffusionXLInstructPix2PixPipeline

The ControlNet and InstructPix2Pix pipelines don’t have interesting checkpoints yet. We hope that the community will be able to leverage the training scripts from this release to help produce some.

Shoutout to @harutatsuakiyama for contributing the StableDiffusionXLInstructPix2PixPipeline in #4079.

The AutoPipeline API

We now support Auto APIs for the following tasks: text-to-image, image-to-image, and inpainting:

Here is how to use one:

from diffusers import AutoPipelineForText2Image
import torch

pipe_t2i = AutoPipelineForText2Image.from_pretrained(
    "runwayml/stable-diffusion-v1-5", requires_safety_checker=False, torch_dtype=torch.float16
).to("cuda")

prompt = "photo a majestic sunrise in the mountains, best quality, 4k"
image = pipe_t2i(prompt).images[0]
image.save("image.png")

Without any extra memory, you can then switch to Image-to-Image:

from diffusers import AutoPipelineForImage2Image

pipe_i2i = AutoPipelineForImage2Image.from_pipe(pipe_t2i)

image = pipe_i2i("sunrise in snowy mountains", image=image, strength=0.75).images[0]
image.save("image.png")
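The switch costs no extra memory because the new pipeline reuses the components that are already loaded instead of loading copies. The toy classes below illustrate that idea only. They are stand-ins and do not reproduce how diffusers implements `from_pipe`.

```python
class ToyPipeline:
    """Minimal stand-in for a pipeline that holds model components."""

    def __init__(self, **components):
        self.components = components

    @classmethod
    def from_pipe(cls, other: "ToyPipeline") -> "ToyPipeline":
        # Pass the same objects along; nothing is copied or reloaded.
        return cls(**other.components)


class ToyText2Image(ToyPipeline):
    pass


class ToyImage2Image(ToyPipeline):
    pass


weights = bytearray(1024)  # pretend this is a large model
t2i = ToyText2Image(unet=weights)
i2i = ToyImage2Image.from_pipe(t2i)
print(i2i.components["unet"] is t2i.components["unet"])
```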

Supported Pipelines: SDv1, SDv2, SDXL, Kandinsky, ControlNet, IF ... with more to come.

Refer to the documentation to know more.

A new “combined pipeline” for the Kandinsky series

We introduced a new “combined pipeline” for the Kandinsky series to make it easier to use the Kandinsky prior and decoder together. This eliminates the need to initialize and use multiple pipelines for Kandinsky to generate images. Here is an example:

from diffusers import AutoPipelineForText2Image
import torch

pipe = AutoPipelineForText2Image.from_pretrained(
    "kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()

prompt = "A lion in galaxies, spirals, nebulae, stars, smoke, iridescent, intricate detail, octane render, 8k"
image = pipe(prompt=prompt, num_inference_steps=25).images[0] 
image.save("image.png")

The following pipelines, which can be accessed via the "Auto" pipelines were added:

To know more, check out the following pages:

🚨🚨🚨 Breaking change for Kandinsky Mask Inpainting 🚨🚨🚨

NOW: mask_image repaints white pixels and preserves black pixels.

Kandinsky was using an incorrect mask format. Instead of using white pixels as a mask (like SD & IF do), Kandinsky models were using black pixels. This needed to be corrected so that the diffusers API is aligned. We cannot have different mask formats for different pipelines.

Important => This means that everyone that already used Kandinsky Inpaint in production / pipeline now needs to change the mask to:

# For PIL input
import PIL.ImageOps
mask = PIL.ImageOps.invert(mask)

# For PyTorch and Numpy input
mask = 1 - mask
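Both snippets flip the mask so that previously black pixels become white and vice versa. As a dependency-free illustration, assuming a mask stored as nested lists of floats in [0, 1] (the `invert_mask` helper is hypothetical):

```python
from typing import List

Mask = List[List[float]]


def invert_mask(mask: Mask) -> Mask:
    """Flip a [0, 1] mask: 1.0 (white) becomes 0.0 (black) and vice versa."""
    return [[1.0 - value for value in row] for row in mask]


old_mask = [[0.0, 0.0], [1.0, 0.0]]  # old convention: black marked the repaint area
new_mask = invert_mask(old_mask)     # new convention: white marks the repaint area
print(new_mask)
```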

Asymmetric VQGAN

Designing a Better Asymmetric VQGAN for StableDiffusion introduced a VQGAN that is particularly well-suited for inpainting tasks. This release brings the support of this new VQGAN. Here is how it can be used:

from io import BytesIO
from PIL import Image
import requests
from diffusers import AsymmetricAutoencoderKL, StableDiffusionInpaintPipeline

def download_image(url: str) -> Image.Image:
    response = requests.get(url)
    return Image.open(BytesIO(response.content)).convert("RGB")

prompt = "a photo of a person"
img_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/repaint/celeba_hq_256.png"
mask_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/repaint/mask_256.png"

image = download_image(img_url).resize((256, 256))
mask_image = download_image(mask_url).resize((256, 256))

pipe = StableDiffusionInpaintPipeline.from_pretrained("runwayml/stable-diffusion-inpainting")
pipe.vae = AsymmetricAutoencoderKL.from_pretrained("cross-attention/asymmetric-autoencoder-kl-x-1-5")
pipe.to("cuda")

image = pipe(prompt=prompt, image=image, mask_image=mask_image).images[0]
image.save("image.jpeg")

Refer to the documentation to know more.

Thanks to @cross-attention for contributing this model in #3956.

Improved support for loading Kohya-style LoRA checkpoints

We are committed to providing seamless interoperability support of Kohya-trained checkpoints from diffusers. To that end, we improved the existing support for loading Kohya-trained checkpoints in diffusers. Users can expect further improvements in the upcoming releases.

Thanks to @takuma104 and @isidentical for contributing the improvements in #4147.

T2I Adapter

pip install matplotlib

from PIL import Image
import torch
import numpy as np
import matplotlib
from diffusers import T2IAdapter, StableDiffusionAdapterPipeline

def colorize(value, vmin=None, vmax=None, cmap='gray_r', invalid_val=-99, invalid_mask=None, background_color=(128, 128, 128, 255), gamma_corrected=False, value_transform=None):
    """Converts a depth map to a color image.

    Args:
        value (torch.Tensor, numpy.ndarray): Input depth map. Shape: (H, W) or (1, H, W) or (1, 1, H, W). All singular dimensions are squeezed.
        vmin (float, optional): vmin-valued entries are mapped to start color of cmap. If None, value.min() is used. Defaults to None.
        vmax (float, optional): vmax-valued entries are mapped to end color of cmap. If None, value.max() is used. Defaults to None.
        cmap (str, optional): matplotlib colormap to use. Defaults to 'gray_r'.
        invalid_val (int, optional): Specifies value of invalid pixels that should be colored as 'background_color'. Defaults to -99.
        invalid_mask (numpy.ndarray, optional): Boolean mask for invalid regions. Defaults to None.
        background_color (tuple[int], optional): 4-tuple RGB color to give to invalid pixels. Defaults to (128, 128, 128, 255).
        gamma_corrected (bool, optional): Apply gamma correction to colored image. Defaults to False.
        value_transform (Callable, optional): Apply transform funct...

Contributors

takuma104, larme, and 38 other contributors


11 Jul 17:21

patrickvonplaten

v0.18.2

5e80827

Patch Release: v0.18.2

Patch release to fix:

1. torch.compile for SD-XL for certain GPUs
2. from_single_file for all SD models
3. Fix broken ONNX export
4. Fix incorrect VAE FP16 casting
5. Deprecate loading variants that don't exist

Note:

Loading any Stable Diffusion safetensors or ckpt with StableDiffusionPipeline.from_single_file or StableDiffusionImg2ImgPipeline.from_single_file or StableDiffusionInpaintPipeline.from_single_file or StableDiffusionXLPipeline.from_single_file, ...

is now almost as fast as from_pretrained(...) and it's much more tested now.

All commits:

Make sure torch compile doesn't access unet config by @patrickvonplaten in #4008
[DiffusionPipeline] Deprecate not throwing error when loading non-existant variant by @patrickvonplaten in #4011
Correctly keep vae in float16 when using PyTorch 2 or xFormers by @pcuenca in #4019
minor improvements to the SDXL doc. by @sayakpaul in #3985
Remove remaining not in upscale pipeline by @pcuenca in #4020
FIX force_download in download utility by @Wauplin in #4036
Improve single loading file by @patrickvonplaten in #4041
keep _use_default_values as a list type by @oOraph in #4040

Contributors

pcuenca, Wauplin, and 3 other contributors


07 Jul 15:07

patrickvonplaten

v0.18.1

1c0f6bb

Patch Release for Stable Diffusion XL 0.9

Patch release 0.18.1: Stable Diffusion XL 0.9 Research Release

Stable Diffusion XL 0.9 is now fully supported under the SDXL 0.9 Research License.

Having received access to stabilityai/stable-diffusion-xl-base-0.9, you can easily use it with diffusers:

Text-to-Image

from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-0.9", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
pipe.to("cuda")

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt=prompt).images[0]

Refining the image output

from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-0.9", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
pipe.to("cuda")

refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-0.9", torch_dtype=torch.float16, use_safetensors=True, variant="fp16"
)
refiner.to("cuda")

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"

# Keep the base output as latents so the refiner can consume it directly
image = pipe(prompt=prompt, output_type="latent").images[0]
image = refiner(prompt=prompt, image=image[None, :]).images[0]

Loading single file checkpoints / original file format

from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline
import torch

# The local paths below are placeholders; point them at the downloaded original checkpoint files
pipe = StableDiffusionXLPipeline.from_single_file(
    "./sd_xl_base_0.9.safetensors", torch_dtype=torch.float16
)
pipe.to("cuda")

refiner = StableDiffusionXLImg2ImgPipeline.from_single_file(
    "./sd_xl_refiner_0.9.safetensors", torch_dtype=torch.float16
)
refiner.to("cuda")

Memory optimization via model offloading

- pipe.to("cuda")
+ pipe.enable_model_cpu_offload()

and

- refiner.to("cuda")
+ refiner.enable_model_cpu_offload()

Speed-up inference with torch.compile

+ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
+ refiner.unet = torch.compile(refiner.unet, mode="reduce-overhead", fullgraph=True)

Note: If you're running the model with torch < 2.0, please make sure to run:

+pipe.enable_xformers_memory_efficient_attention()
+refiner.enable_xformers_memory_efficient_attention()
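Whether the xFormers calls are needed can be decided at runtime from the installed torch version. A minimal sketch, where `needs_xformers` is a hypothetical helper and the version string would normally come from `torch.__version__`:

```python
def needs_xformers(torch_version: str) -> bool:
    """Return True for torch releases older than 2.0."""
    # Drop local build tags such as "+cu118" before parsing the major version.
    major = int(torch_version.split("+")[0].split(".")[0])
    return major < 2


print(needs_xformers("1.13.1"))
print(needs_xformers("2.0.1+cu118"))
```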

For more details have a look at the official docs.

All commits

typo in safetensors (safetenstors) by @YoraiLevi in #3976
Fix code snippet for Audio Diffusion by @osanseviero in #3987
feat: add Dropout to Flax UNet by @SauravMaheshkar in #3894
Add 'rank' parameter to Dreambooth LoRA training script by @isidentical in #3945
Don't use bare prints in a library by @cmd410 in #3991
[Tests] Fix some slow tests by @patrickvonplaten in #3989
Add sdxl prompt embeddings by @patrickvonplaten in #3995

Contributors

osanseviero, patrickvonplaten, and 4 other contributors


06 Jul 17:50

patrickvonplaten

v0.18.0

6fc169c

Shap-E, Consistency Models, Video2Video

Shap-E

Shap-E is a 3D image generation model from OpenAI introduced in Shap-E: Generating Conditional 3D Implicit Functions.

We provide support for text-to-3d image generation and 2d-to-3d image generation from Diffusers.

Text to 3D

import torch
from diffusers import ShapEPipeline
from diffusers.utils import export_to_gif

ckpt_id = "openai/shap-e"
pipe = ShapEPipeline.from_pretrained(ckpt_id).to("cuda")

guidance_scale = 15.0
prompt = "A birthday cupcake"
images = pipe(
    prompt,
    guidance_scale=guidance_scale,
    num_inference_steps=64,
    frame_size=256,
).images

gif_path = export_to_gif(images[0], "cake_3d.gif")

Image to 3D

import torch
from diffusers import ShapEImg2ImgPipeline
from diffusers.utils import export_to_gif, load_image

ckpt_id = "openai/shap-e-img2img"
pipe = ShapEImg2ImgPipeline.from_pretrained(ckpt_id).to("cuda")

img_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/shap_e/burger_in.png"
image = load_image(img_url)

generator = torch.Generator(device="cuda").manual_seed(0)
batch_size = 4
guidance_scale = 3.0

images = pipe(
    image,
    num_images_per_prompt=batch_size,
    generator=generator,
    guidance_scale=guidance_scale,
    num_inference_steps=64,
    frame_size=256,
    output_type="pil",
).images

gif_path = export_to_gif(images[0], "burger_sampled_3d.gif")

Original image

Generated

For more details, check out the official documentation.

The model was contributed by @yiyixuxu in #3742.

Consistency models

Consistency models are diffusion models supporting fast one-step or few-step image generation. They were proposed by OpenAI in Consistency Models.

import torch

from diffusers import ConsistencyModelPipeline

device = "cuda"
# Load the cd_imagenet64_l2 checkpoint.
model_id_or_path = "openai/diffusers-cd_imagenet64_l2"
pipe = ConsistencyModelPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
pipe.to(device)

# Onestep Sampling
image = pipe(num_inference_steps=1).images[0]
image.save("consistency_model_onestep_sample.png")

# Onestep sampling, class-conditional image generation
# ImageNet-64 class label 145 corresponds to king penguins
image = pipe(num_inference_steps=1, class_labels=145).images[0]
image.save("consistency_model_onestep_sample_penguin.png")

# Multistep sampling, class-conditional image generation
# Timesteps can be explicitly specified; the particular timesteps below are from the original Github repo.
# https://github.com/openai/consistency_models/blob/main/scripts/launch.sh#L77
image = pipe(timesteps=[22, 0], class_labels=145).images[0]
image.save("consistency_model_multistep_sample_penguin.png")

For more details, see the official docs.

The model was contributed by our community members @dg845 and @ayushtues in #3492.

Video-to-Video

Previous video generation pipelines tended to produce watermarks because those watermarks were present in their pretraining dataset. With the latest additions of the following checkpoints, we can now generate watermark-free videos:

import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained("cerspense/zeroscope_v2_576w", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()

# memory optimization
pipe.unet.enable_forward_chunking(chunk_size=1, dim=1)
pipe.enable_vae_slicing()

prompt = "Darth Vader surfing a wave"
video_frames = pipe(prompt, num_frames=24).frames
video_path = export_to_video(video_frames)

For more details, check out the official docs.

It was contributed by @patrickvonplaten in #3900.

All commits

remove seed by @yiyixuxu in #3734
Correct Token to upload docs by @patrickvonplaten in #3744
Correct another push token by @patrickvonplaten in #3745
[Stable Diffusion Inpaint & ControlNet inpaint] Correct timestep inpaint by @patrickvonplaten in #3749
[Documentation] Replace dead link to Flax install guide by @JeLuF in #3739
[documentation] grammatical fixes in installation.mdx by @LiamSwayne in #3735
Text2video zero refinements by @19and99 in #3733
[Tests] Relax tolerance of flaky failing test by @patrickvonplaten in #3755
[MultiControlNet] Allow save and load by @patrickvonplaten in #3747
Update pipeline_flax_stable_diffusion_controlnet.py by @jfozard in #3306
update conversion script for Kandinsky unet by @yiyixuxu in #3766
[docs] Fix Colab notebook cells by @stevhliu in #3777
[Bug Report template] modify the issue template to include core maintainers. by @sayakpaul in #3785
[Enhance] Update reference by @okotaku in #3723
Fix broken cpu-offloading in legacy inpainting SD pipeline by @cmdr2 in #3773
Fix some bad comment in training scripts by @patrickvonplaten in #3798
Added LoRA loading to StableDiffusionKDiffusionPipeline by @tripathiarpan20 in #3751
UnCLIP Image Interpolation -> Keep same initial noise across interpolation steps by @Abhinay1997 in #3782
feat: add PR template. by @sayakpaul in #3786
Ldm3d first PR by @estelleafl in #3668
Complete set_attn_processor for prior and vae by @patrickvonplaten in #3796
fix typo by @Isotr0py in #3800
manual check for checkpoints_total_limit instead of using accelerate by @williamberman in #3681
[train text to image] add note to loading from checkpoint by @williamberman in #3806
device map legacy attention block weight conversion by @williamberman in #3804
[docs] Zero SNR by @stevhliu in #3776
[ldm3d] Fixed small typo by @estelleafl in #3820
[Examples] Improve the model card pushed from the train_text_to_image.py script by @sayakpaul in #3810
[Docs] add missing pipelines from the overview pages and minor fixes by @sayakpaul in #3795
[Pipeline] Add new pipeline for ParaDiGMS -- parallel sampling of diffusion models by @AndyShih12 in #3716
Update control_brightness.mdx by @dqueue in #3825
Support ControlNet models with different number of channels in control images by @JCBrouwer in #3815
Add ddpm kandinsky by @yiyixuxu in #3783
[docs] More API stuff by @stevhliu in #3835
relax tol attention conversion test by @williamberman in #3842
fix: random module seeding by @sayakpaul in #3846
fix audio_diffusion tests by @teticio in #3850
Correct bad attn naming by @patrickvonplaten in #3797
[Conversion] Small fixes by @patrickvonplaten in #3848
Fix some audio tests by @patrickvonplaten in #3841
[Docs] add: contributor note in the paradigms docs. by @sayakpaul in #3852
Update Habana Gaudi doc by @regisss in #3863
Add guidance start/stop by @holwech in #3770
feat: rename single-letter vars in resnet.py by @SauravMaheshkar in #3868
Fixing the global_step key not found by @VincentNeemie in #3844
Support for manual CLIP loading in StableDiffusionPipeline - txt2img. by @WadRex in #3832
fix sde add noise typo by @UranusITS in #3839
[Tests] add test for checking soft dependencies. by @sayakpaul in #3847
[Enhance] Add LoRA rank args in train_text_to_image_lora by @okotaku in #3866
[docs] Model API by @stevhliu in #3562
fix/docs: Fix the broken doc links by @Aisuko in #3897
Add video img2img by @patrickvonplaten in #3900
fix/doc-code: Updating to the latest version parameters by @Aisuko in #3924
fix/doc: no import torch issue by @Aisuko in #3923
Correct controlnet out of list error by @patrickvonplaten in #3928
Adding better way to define multiple concepts and also validation capabilities. by @mauricio-repetto in #3807
[ldm3d] Update code to be functional with the new checkpoints by @estelleafl in #3875
Improve memory text to video by @patrickvonplaten in #3930
revert automatic chunking by @patrickvonplaten in #3934
avoid upcasting by assigning dtype to noise tensor by @prathikr in #3713
Fix failing np tests by @patrickvonplaten in #3942
Add timestep_spacing and steps_offset to schedulers by @pcuenca in #3947
Add Consistency Models Pipeline by @dg845 in #3492
Update consistency_models.mdx by @sayakpaul in #3961
Make UNet2DConditionOutput pickle-able by @prathikr in #3857
[Consistency Models] correct checkpoint url in the doc by @sayakpaul in #3962
[Text-to-video] Add torch.compile() compatibility by @sayakpaul in #3949
[SD-XL] Add new pipelines by @patrickvonplaten in #3859
Kandinsky 2.2 by @cene555 in #3903
Add Shap-E by @yiyixuxu in #3742
disable num attenion heads by @patrickvonplaten in #3969
Improve SD XL by @patrickvonplaten in #3968
fix/doc-code: import torch and fix the broken document address by @Aisuko in #3941