Stable Diffusion 3.5 is an improved version of its predecessor, Stable Diffusion 3. The model is available on the Hugging Face Hub and can be used with 🧨 Diffusers.
This release comes with two checkpoints:

- A large (8B) model
- A large (8B) timestep-distilled model that enables few-step inference
This post focuses on how to use Stable Diffusion 3.5 (SD3.5) with Diffusers, covering both inference and training.
Architecture changes
The SD3.5 (Large) transformer architecture is very similar to that of SD3 (Medium), with the following differences:
- QK normalization: QK normalization has become standard for training large transformer models, and SD3.5 Large is no exception.
- Dual attention layers: Instead of using a single attention layer for each modality stream within an MMDiT block, SD3.5 uses dual attention layers.
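To make the QK-normalization idea concrete, here is a minimal, self-contained sketch (not the actual SD3.5 implementation; the class and variable names are illustrative, and it assumes a recent PyTorch release that provides nn.RMSNorm). The query and key projections are normalized per attention head before the attention scores are computed, which keeps their magnitudes bounded and helps prevent attention logits from blowing up during large-scale training.

import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    # Toy attention layer illustrating QK normalization.
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.to_qkv = nn.Linear(dim, dim * 3)
        self.norm_q = nn.RMSNorm(self.head_dim)  # normalizes each query head
        self.norm_k = nn.RMSNorm(self.head_dim)  # normalizes each key head
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, _ = x.shape
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, tokens, head_dim).
        q, k, v = (t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2) for t in (q, k, v))
        q, k = self.norm_q(q), self.norm_k(k)  # the QK-normalization step
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(b, n, -1)
        return self.to_out(out)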
The rest of the details regarding the text encoders, VAE, and noise scheduler are exactly the same as in SD3 Medium. For more information about SD3, we recommend checking the original paper.
Using SD3.5 with Diffusers
Make sure you install the latest version of Diffusers:
pip install -U diffusers
The model is gated, so before using it with Diffusers, you first need to go to the Stable Diffusion 3.5 Large Hugging Face page, fill out the form, and accept the gate. Once you have been granted access, you need to log in so that your system knows you have accepted the gate. Use the command below to log in:
huggingface-cli login
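Alternatively, if you prefer to authenticate from a Python session rather than the shell, you can use the login() helper from huggingface_hub (the access token is created in your Hugging Face account settings):

from huggingface_hub import login

# Prompts interactively for your access token;
# you can also pass it directly, e.g. login(token="hf_...").
login()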
The following snippet downloads the 8B parameter version of SD3.5 in torch.bfloat16 precision. This is the format used in the original checkpoint published by Stability AI, and it is the recommended way to run inference.
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large", torch_dtype=torch.bfloat16
).to("cuda")

image = pipe(
    prompt="a photo of a cat holding a sign that says hello world",
    negative_prompt="",
    num_inference_steps=40,
    height=1024,
    width=1024,
    guidance_scale=4.5,
).images[0]

image.save("sd3_hello_world.png")
This release also comes with a timestep-distilled model that eliminates classifier-free guidance and can generate images in fewer steps (typically 4-8).
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large-turbo", torch_dtype=torch.bfloat16
).to("cuda")

image = pipe(
    prompt="a photo of a cat holding a sign that says hello world",
    num_inference_steps=4,
    height=1024,
    width=1024,
    guidance_scale=1.0,
).images[0]

image.save("sd3_hello_world.png")
All the examples shown in the SD3 blog post and the official Diffusers documentation should already work with SD3.5. In particular, both of those resources cover optimizing the memory requirements for running inference. Since SD3.5 Large is significantly larger than SD3 Medium, memory optimization is important to enable inference on consumer hardware.
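As a quick illustration of one such memory optimization, the sketch below enables model CPU offloading, which keeps pipeline components in CPU RAM and only moves them to the GPU when they are needed. It trades some inference speed for a much smaller VRAM footprint (the prompt and file name are just examples):

import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large", torch_dtype=torch.bfloat16
)
# Offload components to the CPU and load them on the GPU on demand.
# Do not call .to("cuda") when offloading is enabled.
pipe.enable_model_cpu_offload()

image = pipe(
    prompt="a photo of a cat holding a sign that says hello world",
    num_inference_steps=28,
    guidance_scale=4.5,
).images[0]
image.save("sd3_offload.png")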
Performing inference using quantization
Diffusers natively supports working with bitsandbytes quantization, which optimizes memory even further.
First, make sure to install all required libraries.
pip install -Uq git+https://github.com/huggingface/transformers@main
pip install -Uq bitsandbytes
Next, load the transformer in "NF4" precision.
from diffusers import BitsAndBytesConfig, SD3Transformer2DModel
import torch

model_id = "stabilityai/stable-diffusion-3.5-large"

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_nf4 = SD3Transformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16,
)
Now you are ready to perform inference.
from diffusers import StableDiffusion3Pipeline

pipeline = StableDiffusion3Pipeline.from_pretrained(
    model_id,
    transformer=model_nf4,
    torch_dtype=torch.bfloat16,
)
pipeline.enable_model_cpu_offload()

prompt = "A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus, basking in a river of melted butter amidst a breakfast-themed landscape. It features the hippo's distinctive, bulky body shape, but instead of the usual grey skin, the creature's body resembles a freshly baked, golden-brown, crispy waffle, and its skin is textured with the familiar grid pattern of a waffle. The environment combines the hippo's natural habitat with a breakfast table setting, with a river of warm melted butter peeking through the lush, pancake-like foliage in the background and oversized utensils and plates scattered around. As the sun rises over this fantastical world, the contented creature lets out a yawn in its river of butter while a flock of birds takes flight nearby."

image = pipeline(
    prompt=prompt,
    negative_prompt="",
    num_inference_steps=28,
    guidance_scale=4.5,
    max_sequence_length=512,
).images[0]

image.save("whimsical.png")
You can control other knobs in BitsAndBytesConfig. See the documentation for more information.
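For instance, a sketch of a slightly more aggressive configuration (whether the extra savings are worth it depends on your model and hardware) could enable nested quantization on top of NF4:

import torch
from diffusers import BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,  # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)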
It is also possible to directly load a model that was quantized with the same nf4_config as above. This is especially useful for machines with low RAM. Refer to this Colab notebook for an end-to-end example.
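As a rough sketch of that workflow (the local directory name here is purely illustrative, and serialization of 4-bit checkpoints assumes a recent bitsandbytes release), you could save the quantized transformer once and reload it later without passing a quantization config, since the quantization details are stored with the model:

# Save the NF4-quantized transformer from the previous section
# (directory name is a hypothetical example).
model_nf4.save_pretrained("sd3.5-large-transformer-nf4")

from diffusers import SD3Transformer2DModel
import torch

# Reload it directly; no quantization_config is needed this time.
transformer = SD3Transformer2DModel.from_pretrained(
    "sd3.5-large-transformer-nf4", torch_dtype=torch.bfloat16
)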
Training LoRAs for SD3.5 Large with quantization
Thanks to libraries like bitsandbytes and peft, it is possible to fine-tune large models like SD3.5 Large on consumer GPUs with 24 GB of VRAM. You can leverage the existing SD3 training script for LoRA training. The training command below already works:
accelerate launch train_dreambooth_lora_sd3.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-3.5-large" \
  --dataset_name="Norod78/Yarn-art-style" \
  --output_dir="yart_art_sd3-5_lora" \
  --mixed_precision="bf16" \
  --instance_prompt="Frog, yarn art style" \
  --caption_column="text" \
  --resolution=768 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=4e-4 \
  --report_to="wandb" \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=700 \
  --rank=16 \
  --seed="0" \
  --push_to_hub
However, to make it work with quantization, a few knobs need to be tweaked. Here are some pointers on how to do that.
Initialize the transformer either with the quantization configuration or load a quantized checkpoint directly. Then prepare it using peft's prepare_model_for_kbit_training(). The rest of the process remains the same, thanks to peft's strong support for bitsandbytes.
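A minimal sketch of those steps might look like the following (the LoRA rank, target_modules, and other hyperparameters are illustrative assumptions, not necessarily the exact values used by the training script):

import torch
from diffusers import BitsAndBytesConfig, SD3Transformer2DModel
from peft import LoraConfig, prepare_model_for_kbit_training

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# 1) Initialize the transformer with the quantization config
#    (or load an already-quantized checkpoint instead).
transformer = SD3Transformer2DModel.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large",
    subfolder="transformer",
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16,
)
# 2) Prepare the quantized model for k-bit training.
transformer = prepare_model_for_kbit_training(transformer)

# 3) Attach LoRA adapters; only these small matrices are trained.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    init_lora_weights="gaussian",
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # assumed attention projections
)
transformer.add_adapter(lora_config)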
For a more complete example, see this sample script.
Using single-file loading with the Stable Diffusion 3.5 transformer
You can load the Stable Diffusion 3.5 transformer model from the original checkpoint file published by Stability AI using the from_single_file method:
import torch
from diffusers import SD3Transformer2DModel, StableDiffusion3Pipeline

transformer = SD3Transformer2DModel.from_single_file(
    "https://huggingface.co/stabilityai/stable-diffusion-3.5-large-turbo/blob/main/sd3.5_large.safetensors",
    torch_dtype=torch.bfloat16,
)
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()

image = pipe("a cat holding a sign that says hello world").images[0]
image.save("sd35.png")
Important links
Acknowledgements: Thanks to Daniel Frank for the background photo used in the thumbnail of this blog post. Thanks to Pedro Cuenca and Tom Aarsen for their reviews of the post draft.