Stable Diffusion 3 (SD3), Stability AI's latest iteration of the Stable Diffusion family of models, is now available on the Hugging Face Hub and can be used with Diffusers.
The model released today is Stable Diffusion 3 Medium, with 2B parameters.
As part of this release, we have provided models on the Hub, a Diffusers integration, and SD3 DreamBooth and LoRA training scripts.
Table of Contents
What’s new in SD3?
Model
SD3 is a latent diffusion model that consists of three different text encoders (CLIP L/14, OpenCLIP bigG/14, and T5-v1.1-XXL), a novel Multimodal Diffusion Transformer (MMDiT) model, and a 16-channel autoencoder similar to the one used in Stable Diffusion XL.
SD3 processes text inputs and pixel latents as a sequence of embeddings. Positional encodings are added to 2x2 patches of the latents, which are then flattened into a patch encoding sequence. This sequence, together with the text encoding sequence, is fed into the MMDiT blocks, where both are embedded into a common dimensionality, concatenated, and passed through a sequence of modulated attention layers and MLPs.
To account for the differences between the two modalities, the MMDiT blocks use two separate sets of weights to embed the text and image sequences into the common dimensionality. The sequences are joined before the attention operation, which allows each representation to work in its own space while still taking the other into account during attention. This two-way flow of information between text and image tokens differs from previous approaches to text-to-image synthesis, where text information is injected into the latent via cross-attention with a fixed text representation.
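To make this concrete, here is a minimal, self-contained sketch of the joint attention idea. This is illustrative only, not the actual MMDiT implementation; the dimensions and module names are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointAttentionSketch(nn.Module):
    """Illustrative sketch of MMDiT-style joint attention: each modality keeps
    its own projection weights, and attention runs over the concatenated sequence."""
    def __init__(self, dim=512):
        super().__init__()
        self.img_qkv = nn.Linear(dim, dim * 3)  # image-stream weights
        self.txt_qkv = nn.Linear(dim, dim * 3)  # text-stream weights

    def forward(self, img_tokens, txt_tokens):
        n_img = img_tokens.shape[1]
        q_i, k_i, v_i = self.img_qkv(img_tokens).chunk(3, dim=-1)
        q_t, k_t, v_t = self.txt_qkv(txt_tokens).chunk(3, dim=-1)
        # Concatenate both modalities so each one attends to the other.
        q = torch.cat([q_i, q_t], dim=1)
        k = torch.cat([k_i, k_t], dim=1)
        v = torch.cat([v_i, v_t], dim=1)
        out = F.scaled_dot_product_attention(q, k, v)
        # Split the result back into separate image and text streams.
        return out[:, :n_img], out[:, n_img:]
```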
SD3 also uses the pooled text embeddings from both CLIP models as part of its timestep conditioning. These embeddings are first concatenated and added to the timestep embedding before being passed to each MMDiT block.
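As a rough illustration of that conditioning path (the dimensions and the projection layer below are assumptions for the example, not the real SD3 code):

```python
import torch
import torch.nn as nn

batch = 2
pooled_clip_l = torch.randn(batch, 768)    # pooled CLIP L/14 embedding
pooled_clip_g = torch.randn(batch, 1280)   # pooled OpenCLIP bigG/14 embedding
timestep_emb = torch.randn(batch, 1536)    # assumed timestep embedding size

# Concatenate the two pooled embeddings, project to the timestep embedding
# size, and add -- the resulting vector conditions every MMDiT block.
project = nn.Linear(768 + 1280, 1536)
conditioning = timestep_emb + project(torch.cat([pooled_clip_l, pooled_clip_g], dim=-1))
print(conditioning.shape)  # torch.Size([2, 1536])
```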
Training with Rectified Flow Matching
In addition to the architectural changes, SD3 is trained with a conditional flow-matching objective. In this approach, the forward noising process is defined as a rectified flow that connects the data and noise distributions along a straight line.
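In code form, the straight-line forward process described above can be sketched roughly as follows. This is an illustration of the rectified-flow idea, not the actual SD3 training code.

```python
import torch

def rectified_flow_forward(x0, noise, t):
    """Interpolate linearly between clean latents (t=0) and pure noise (t=1).
    The training target is the constant velocity along that straight path."""
    t = t.view(-1, 1, 1, 1)
    x_t = (1.0 - t) * x0 + t * noise
    velocity_target = noise - x0
    return x_t, velocity_target

# Example usage with random latents
x0 = torch.randn(4, 16, 64, 64)   # "clean" latents (16 channels, as in SD3's autoencoder)
noise = torch.randn_like(x0)
t = torch.rand(4)                 # one timestep per sample, in [0, 1]
x_t, target = rectified_flow_forward(x0, noise, t)
```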
The rectified flow-matching sampling process is simpler and performs well even with a reduced number of sampling steps. To support inference with SD3, we have introduced a new scheduler (FlowMatchEulerDiscreteScheduler) with the rectified flow-matching formulation and Euler method steps. It also implements resolution-dependent shifting of the timestep schedule via a shift parameter. Increasing the shift value handles noise scaling better at higher resolutions. We recommend shift=3.0 for the 2B model.
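If you want to set the shift explicitly, the scheduler can be re-created from its config. This is a minimal sketch, assuming the pipeline loading shown later in this post:

```python
import torch
from diffusers import StableDiffusion3Pipeline, FlowMatchEulerDiscreteScheduler

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
)
# Rebuild the scheduler with the recommended shift value for the 2B model.
pipe.scheduler = FlowMatchEulerDiscreteScheduler.from_config(pipe.scheduler.config, shift=3.0)
```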
To quickly try out SD3, see the following applications:
Use SD3 with Diffusers
To use SD3 with Diffusers, make sure to upgrade to the latest Diffusers release:

```bash
pip install --upgrade diffusers
```
Because the model is gated, before using it with Diffusers you first need to go to the Stable Diffusion 3 Medium Hugging Face page, fill in the form, and accept the gate. Once you are in, you need to log in so that the system knows you have accepted the gate. Use the following command to log in:

```bash
huggingface-cli login
```
The following snippet downloads the 2B parameter version of SD3 in fp16 precision. This is the format used in the original checkpoint published by Stability AI, and it is the recommended way to run inference.
From text to image
```python
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "A cat holding a sign that says hello world",
    negative_prompt="",
    num_inference_steps=28,
    guidance_scale=7.0,
).images[0]
image
```
From image to image
```python
import torch
from diffusers import StableDiffusion3Img2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusion3Img2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
).to("cuda")

init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png")
prompt = "cat wizard, gandalf, lord of the rings, detailed, fantasy, cute, adorable, Pixar, Disney, 8k"
image = pipe(prompt, image=init_image).images[0]
image
```
See the SD3 documentation here.
SD3 memory optimization
SD3 uses three text encoders, one of which is the very large T5-XXL model. This makes it challenging to run the model on GPUs with less than 24GB of VRAM, even when using fp16 precision.
To counter this, the Diffusers integration ships with memory optimizations that allow SD3 to run on a wider range of devices.
Running inference with model offloading
The most basic memory optimization available in Diffusers lets you offload the model's components to the CPU during inference, saving memory at the cost of a slight increase in inference latency. Model offloading moves a model component onto the GPU only when it needs to be executed, while keeping the remaining components on the CPU.
```python
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()

prompt = "A smiling cartoon dog sits at a table with a coffee mug as the room goes up in flames. \"This is fine,\" the dog assures himself."
image = pipe(prompt).images[0]
```
Remove T5 text encoder during inference
Removing the memory-intensive 4.7B-parameter T5-XXL text encoder during inference can significantly reduce the memory requirements of SD3, with only a slight loss in performance.
```python
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    text_encoder_3=None,
    tokenizer_3=None,
    torch_dtype=torch.float16,
).to("cuda")

prompt = "A smiling cartoon dog sits at a table with a coffee mug as the room goes up in flames. \"This is fine,\" the dog assures himself."
image = pipe(prompt).images[0]
```
Using a quantized version of the T5-XXL model
You can use the bitsandbytes library to load the T5-XXL text encoder in 8-bit precision to further reduce memory requirements.
```python
import torch
from diffusers import StableDiffusion3Pipeline
from transformers import T5EncoderModel, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model_id = "stabilityai/stable-diffusion-3-medium-diffusers"
text_encoder = T5EncoderModel.from_pretrained(
    model_id,
    subfolder="text_encoder_3",
    quantization_config=quantization_config,
)
pipe = StableDiffusion3Pipeline.from_pretrained(
    model_id,
    text_encoder_3=text_encoder,
    device_map="balanced",
    torch_dtype=torch.float16,
)
```
You can find the complete code snippet here.
Memory optimization overview
All benchmark runs were conducted using the 2B version of the SD3 model on an A100 GPU with 80GB of VRAM, using fp16 precision and PyTorch 2.3.
For the memory benchmarks, we use 3 pipeline calls for warm-up and report the average inference time over 10 pipeline calls, using the default arguments of the StableDiffusion3Pipeline __call__() method.
| Technique | Inference Time (secs) | Memory (GB) |
| --- | --- | --- |
| Default | 4.762 | 18.765 |
| Offload | 32.765 (~6.8x 🔼) | 12.0645 (~1.55x 🔽) |
| Offload + no T5 | 19.110 (~4.013x 🔼) | 4.266 (~4.398x 🔽) |
| 8-bit T5 | 4.932 (~1.036x 🔼) | (~1.77x 🔽) |
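For reference, a measurement loop matching the methodology above could look like the following sketch. The helper is hypothetical, not the exact benchmarking script used for the table.

```python
import time
import torch

def benchmark(pipe, prompt, warmup=3, runs=10):
    # Warm-up calls are not timed.
    for _ in range(warmup):
        _ = pipe(prompt)
    torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    for _ in range(runs):
        _ = pipe(prompt)
    avg_secs = (time.perf_counter() - start) / runs
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    return avg_secs, peak_gb
```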
Performance Optimization for SD3
To reduce inference latency, torch.compile() can be used to obtain an optimized compute graph for the VAE and transformer components.
```python
import torch
from diffusers import StableDiffusion3Pipeline

torch.set_float32_matmul_precision("high")
torch._inductor.config.conv_1x1_as_mm = True
torch._inductor.config.coordinate_descent_tuning = True
torch._inductor.config.epilogue_fusion = False
torch._inductor.config.coordinate_descent_check_all_directions = True

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype=torch.float16,
).to("cuda")
pipe.set_progress_bar_config(disable=True)

pipe.transformer.to(memory_format=torch.channels_last)
pipe.vae.to(memory_format=torch.channels_last)

pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)
pipe.vae.decode = torch.compile(pipe.vae.decode, mode="max-autotune", fullgraph=True)

# Warm up
prompt = "a photo of a cat holding a sign that says hello world"
for _ in range(3):
    _ = pipe(prompt=prompt, generator=torch.manual_seed(1))

# Run inference
image = pipe(prompt=prompt, generator=torch.manual_seed(1)).images[0]
image.save("sd3_hello_world.png")
```
For the complete script, see here.
We benchmarked the performance of torch.compile() on SD3 on a single 80GB A100 machine using fp16 precision and PyTorch 2.3, running 10 iterations of the inference pipeline with 20 diffusion steps each. We found that the average inference time with the compiled version of the model was 0.585 seconds, a 4x speedup over eager execution.
Fine-tuning with DreamBooth and LoRA
Additionally, we are providing a DreamBooth fine-tuning script for SD3 that leverages LoRA. The script can be used to efficiently fine-tune SD3 and serves as a reference for implementing rectified-flow-based training pipelines. Other popular implementations of rectified flow include minRF.
To get started with the script, first make sure you have the right setup and a demo dataset available (for example, this one). Refer here for more details. You will also need to install peft and bitsandbytes.
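For example:

```bash
pip install -U peft bitsandbytes
```

Then you can launch training: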
```bash
export MODEL_NAME="stabilityai/stable-diffusion-3-medium-diffusers"
export INSTANCE_DIR="dog"
export OUTPUT_DIR="dreambooth-sd3-lora"

accelerate launch train_dreambooth_lora_sd3.py \
  --pretrained_model_name_or_path=${MODEL_NAME} \
  --instance_data_dir=${INSTANCE_DIR} \
  --output_dir=/raid/.cache/${OUTPUT_DIR} \
  --mixed_precision="fp16" \
  --instance_prompt="a photo of sks dog" \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --learning_rate=1e-5 \
  --report_to="wandb" \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=500 \
  --weighting_scheme="logit_normal" \
  --validation_prompt="A photo of sks dog in a bucket" \
  --validation_epochs=25 \
  --seed="0" \
  --push_to_hub
```
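Once training finishes, a minimal sketch of how the resulting LoRA could be loaded for inference is shown below. The local path is hypothetical; point it at wherever the script saved the weights (or at the Hub repo created by --push_to_hub).

```python
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
).to("cuda")
# Load the DreamBooth LoRA weights produced by the training command above.
pipe.load_lora_weights("dreambooth-sd3-lora")
image = pipe("A photo of sks dog in a bucket", num_inference_steps=28, guidance_scale=7.0).images[0]
```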
Acknowledgments
We would like to thank the Stability AI team for making Stable Diffusion 3 happen and for providing us with early access, and Linoy for helping with the blog post thumbnail.