A simple optimization investigation of SDXL

By versatileai · September 6, 2025 · 8 min read

Stable Diffusion XL (SDXL) is the latest latent diffusion model from Stability AI for generating high-quality, photorealistic images. It overcomes challenges of previous Stable Diffusion models, such as getting spatial composition right and rendering hands and text correctly. In addition, SDXL is more context-aware and needs fewer words in the prompt to produce a good-looking image.

However, all of these improvements come at the expense of a significantly larger model. How much larger? The base SDXL model has 3.5B parameters (the UNet in particular), which is roughly three times larger than the previous Stable Diffusion model.
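To put that size in perspective, a back-of-the-envelope calculation (using the 3.5B parameter count stated above) shows how much memory the weights alone occupy at different precisions:

```python
params = 3.5e9  # parameter count from the text above

# 4 bytes per fp32 weight, 2 bytes per fp16 weight
fp32_gb = params * 4 / 1e9
fp16_gb = params * 2 / 1e9
print(fp32_gb, fp16_gb)  # 14.0 7.0
```

That is the weights alone, before activations, attention matrices, or the VAE are accounted for, which is why the measured peak memory below is far higher.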

I ran some tests on an A100 GPU (40GB) to explore how to optimize SDXL for inference speed and memory usage. Each inference run generates four images and is repeated three times; when measuring inference latency, only the final of the three iterations is considered.

Running SDXL out of the box at full precision with the default attention mechanism takes 72.2 seconds and consumes 28GB of memory!

```python
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0"
).to("cuda")
pipe.unet.set_default_attn_processor()
```

This is not very practical and can slow you down, since you'll often want to generate more than four images. And if you don't have a more powerful GPU, you'll run into that frustrating out-of-memory error message. So how can we optimize SDXL to increase inference speed and reduce its memory usage?

Diffusers has many optimization tricks and techniques that can help you run memory-intensive models like SDXL. The two things we focus on are inference speed and memory.

Note: the techniques described in this post apply to all pipelines.

Inference speed

Because diffusion is a random process, there is no guarantee you'll get an image you like, and in many cases you'll need to run inference multiple times and iterate. That makes optimizing for speed important. This section focuses on reducing inference time with lower precision, PyTorch 2.0's memory-efficient attention, and torch.compile.

Lower precision

Model weights are stored at a certain precision, expressed as a floating-point data type. The standard floating-point data type is float32 (fp32), which can accurately represent a wide range of floating-point numbers. For inference, you often don't need that much accuracy, so you can use float16 (fp16), which captures a narrower range. fp16 takes half the storage of fp32, and computations on it are faster because they are simpler. In addition, the latest GPU cards have hardware optimized for fp16 computation, making it even faster.
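As a quick sanity check of the "half the memory" claim, Python's standard struct module can report the storage size of half- and single-precision floats (plain Python, independent of PyTorch):

```python
import struct

fp32_bytes = struct.calcsize("f")  # single precision: 4 bytes
fp16_bytes = struct.calcsize("e")  # half precision: 2 bytes

# fp16 needs exactly half the storage of fp32
assert fp16_bytes * 2 == fp32_bytes
```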

With Diffusers, you can use fp16 for inference by specifying the torch_dtype parameter to convert the weights when the model is loaded:

```python
from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")
pipe.unet.set_default_attn_processor()
```

Compared to the completely unoptimized SDXL pipeline, using fp16 takes only 14.8 seconds and 21.7GB of memory. You've sped up inference by almost a whole minute!

Memory-efficient attention

The attention blocks used in transformer modules can be a huge bottleneck, because their memory use grows quadratically with the input sequence length. This can quickly take up a lot of memory and leave you with an out-of-memory error message. 😬
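The quadratic growth is easy to see with a back-of-the-envelope formula for the naive attention score matrix (the head count and element size here are illustrative assumptions, not SDXL's actual values):

```python
def attn_scores_bytes(seq_len, num_heads=8, batch=1, bytes_per_el=2):
    # Naive attention materializes a (batch, heads, seq_len, seq_len)
    # score matrix, so memory grows with the square of the sequence length.
    return batch * num_heads * seq_len * seq_len * bytes_per_el

# Doubling the sequence length quadruples the memory for the scores
assert attn_scores_bytes(2048) == 4 * attn_scores_bytes(1024)
```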

Memory-efficient attention algorithms reduce the memory burden of computing attention, whether by exploiting sparsity or by tiling. These optimized algorithms used to be available mostly in third-party libraries that had to be installed separately, but starting with PyTorch 2.0, that's no longer the case. PyTorch 2.0 introduced scaled dot product attention (SDPA), which offers fused implementations of FlashAttention, memory-efficient attention (xFormers), and a native PyTorch implementation in C++. SDPA is probably the easiest way to speed up inference: if you're using Diffusers with PyTorch ≥ 2.0, it's enabled automatically by default!

```python
from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")
```

Using fp16 and SDPA takes the same amount of memory as fp16 alone, but improves inference time to 11.4 seconds. Let's use this as the new baseline against which the other optimizations are compared.

torch.compile

PyTorch 2.0 also introduced the torch.compile API for just-in-time (JIT) compilation of PyTorch code into kernels optimized for inference. Unlike other compiler solutions, torch.compile requires minimal changes to your existing code: it's as easy as wrapping your model with the function.

The mode parameter lets you optimize for memory overhead or inference speed during compilation, which gives you a lot more flexibility.

```python
from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
```

Compared to the previous baseline (fp16 + SDPA), wrapping the UNet with torch.compile improves inference time to 10.2 seconds.

Note: compiling a model is slow the first time, but once the model is compiled, all subsequent calls are much faster!

Model Memory Footprint

Today's models keep getting bigger and bigger, and fitting them into memory is a challenge. This section focuses on reducing the memory footprint of these huge models so they can run on consumer GPUs. The techniques range from offloading model components to the CPU, to decoding the latents into images in several steps rather than all at once, to using a distilled version of the autoencoder.

Model CPU offload

Model offloading saves memory by loading the UNet into GPU memory while the other components of the diffusion model (the text encoders and VAE) are kept on the CPU. This way, the UNet can run for multiple iterations on the GPU until it's no longer needed.

```python
from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()
```

Compared to the baseline, this now takes 20.2GB of memory, saving you 1.5GB.

Sequential CPU Offload

Another type of offloading, which can save even more memory at the cost of slower inference, is sequential CPU offloading. Rather than offloading an entire model like the UNet, the weights stored in the different UNet submodules are offloaded to the CPU and only moved to the GPU right before the forward pass. Essentially, only parts of the model are loaded at any given time, saving even more memory. The only drawback is that it's significantly slower, because the submodules are loaded and offloaded many times.
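The idea can be illustrated with a toy sketch (pure Python, not Diffusers' actual hook machinery): each submodule is moved to the "GPU" just before its forward pass and offloaded right after, so only one submodule occupies accelerator memory at a time.

```python
class Submodule:
    """Toy stand-in for a UNet submodule."""
    def __init__(self, name):
        self.name = name
        self.device = "cpu"

    def forward(self, x):
        self.device = "gpu"   # loaded just before the forward pass
        y = x + 1             # stand-in for the real computation
        self.device = "cpu"   # offloaded immediately afterwards
        return y

blocks = [Submodule(f"block{i}") for i in range(3)]
x = 0
for block in blocks:
    x = block.forward(x)

# After the pass, every submodule is back on the CPU
assert x == 3 and all(b.device == "cpu" for b in blocks)
```

The repeated device transfers in the real implementation are exactly what makes this mode so much slower.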

```python
from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
)
pipe.enable_sequential_cpu_offload()
```

Compared to the baseline, this takes 19.9GB of memory, but inference time increases to 67 seconds.

VAE slicing

In SDXL, the variational autoencoder (VAE) decodes the refined latents (predicted by the UNet) into realistic images. The memory requirement of this step scales with the number of images being predicted (the batch size). Depending on the image resolution and the available GPU VRAM, it can be quite memory-intensive.

This is where "slicing" comes in handy. The input tensor to be decoded is split into slices, and the decoding computation is completed over several steps. This saves memory and allows larger batch sizes.
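Conceptually, slicing amounts to decoding the batch one slice at a time instead of all at once. A minimal sketch (not the actual Diffusers implementation, and with a trivial stand-in for the decoder):

```python
def decode_sliced(latents, decode_one):
    # Decode each latent in the batch separately, so peak memory scales
    # with a single image rather than the whole batch.
    return [decode_one(z) for z in latents]

# Toy "decoder" that just scales its input
images = decode_sliced([1, 2, 3, 4], lambda z: z * 10)
assert images == [10, 20, 30, 40]
```

The results are identical to decoding the whole batch at once; only the peak memory profile changes.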

```python
from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")
pipe.enable_vae_slicing()
```

Using sliced computations reduces memory to 15.4GB. Adding sequential CPU offloading reduces it further to 11.45GB, which lets you generate four 1024×1024 images per prompt. However, sequential offloading also increases inference latency.

Precomputed text embeddings

Text-conditioned image generation models typically use a text encoder to compute embeddings from the input prompt, and SDXL uses two text encoders! This contributes quite a bit to inference latency. However, these embeddings don't change throughout the diffusion process, so they can be precomputed and reused. That way, once the text embeddings are computed, the text encoders can be removed from memory.
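The caching idea itself is generic. As a toy illustration (with a hypothetical embed function standing in for the text encoders, not SDXL's real API), functools.lru_cache shows how a repeated prompt can skip recomputation entirely:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def embed(prompt):
    # Hypothetical stand-in for running the two SDXL text encoders
    return tuple(ord(c) % 7 for c in prompt)

a = embed("an astronaut riding a horse")
b = embed("an astronaut riding a horse")
assert a is b  # the second call returns the cached object
```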

First, load the text encoders and their corresponding tokenizers, and compute the embeddings from the input prompt:

```python
tokenizers = [tokenizer, tokenizer_2]
text_encoders = [text_encoder, text_encoder_2]

(
    prompt_embeds,
    negative_prompt_embeds,
    pooled_prompt_embeds,
    negative_pooled_prompt_embeds,
) = encode_prompt(tokenizers, text_encoders, prompt)
```

Next, flush the GPU memory and remove the text encoders:

```python
del text_encoder, text_encoder_2, tokenizer, tokenizer_2
flush()
```
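The flush() helper isn't defined in the snippet above; a plausible implementation (an assumption on my part, commonly seen in similar scripts) collects Python garbage and empties the CUDA allocator cache:

```python
import gc

def flush():
    # Assumed helper: release Python-side references, then clear the
    # CUDA allocator cache if PyTorch is available.
    gc.collect()
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass

flush()  # safe to call even without a GPU
```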

The embeddings are now ready to be passed directly to the SDXL pipeline:

```python
from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    text_encoder=None,
    text_encoder_2=None,
    tokenizer=None,
    tokenizer_2=None,
    torch_dtype=torch.float16,
).to("cuda")

call_args = dict(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_prompt_embeds,
    pooled_prompt_embeds=pooled_prompt_embeds,
    negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
    num_images_per_prompt=num_images_per_prompt,
    num_inference_steps=num_inference_steps,
)
image = pipe(**call_args).images[0]
```

Combined with SDPA and fp16, this reduces memory to 21.9GB. The other memory-optimization techniques described above can also be used with cached computations.

Tiny autoencoder

As mentioned earlier, the VAE decodes the latents into images, so this step is naturally bottlenecked by the size of the VAE. Let's just use a smaller autoencoder! madebyollin's tiny autoencoder, available on the Hub, is only 10MB and was distilled from the original VAE used by SDXL.

```python
from diffusers import AutoencoderTiny, StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
)
pipe.vae = AutoencoderTiny.from_pretrained(
    "madebyollin/taesdxl", torch_dtype=torch.float16
)
pipe.to("cuda")
```

This setup reduces memory requirements to 15.6GB while simultaneously reducing inference latency.

Note: the tiny autoencoder may omit some of the finer details from the image, which is why it's better suited for generating image previews.

Conclusion

To conclude, here is a summary of the savings from each optimization:

A note on profiling: when profiling GPUs to measure the trade-off between inference latency and memory requirements, it's important to be aware of the hardware being used. These findings may not transfer uniformly from one setup to another. For example, torch.compile appears to only benefit modern GPUs, at least for SDXL.

| Technique | Memory (GB) | Inference latency (ms) |
|---|---|---|
| Unoptimized pipeline | 28.09 | 72200.5 |
| fp16 | 21.72 | 14800.9 |
| fp16 + SDPA (default) | 21.72 | 11413.0 |
| default + torch.compile | — | ~10200 |
| default + model CPU offload | 20.2 | — |
| default + sequential CPU offload | 19.9 | 67034.0 |
| default + VAE slicing | 15.40 | 11232.2 |
| default + VAE slicing + sequential CPU offload | 11.47 | 66869.2 |
| default + precomputed text embeddings | 21.85 | 11909.0 |
| default + tiny autoencoder | 15.48 | 10449.7 |

I hope these optimizations make it easier to run your favorite pipeline. Try these techniques and share your images with us! 🤗

Acknowledgements: Thanks to Pedro Cuenca for an informative review of the draft.
