A simple optimization investigation of SDXL

By versatileai · September 6, 2025 · 8 min read

Stable Diffusion XL (SDXL) is the latest latent diffusion model from Stability AI for generating high-quality, photorealistic images. It overcomes challenges of previous Stable Diffusion models, such as getting spatial composition right and rendering hands and text correctly. In addition, SDXL is more context-aware and needs fewer words in the prompt to produce a good-looking image.

However, all of these improvements come at the expense of a significantly larger model. How much larger? The base SDXL model has 3.5B parameters (the UNet in particular), which is roughly three times larger than the previous Stable Diffusion model.
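To put that size in perspective, a back-of-the-envelope calculation (using the 3.5B parameter count stated above) shows how much memory the weights alone occupy at different precisions:

```python
params = 3.5e9  # parameter count from the text above

# 4 bytes per fp32 weight, 2 bytes per fp16 weight
fp32_gb = params * 4 / 1e9
fp16_gb = params * 2 / 1e9
print(fp32_gb, fp16_gb)  # 14.0 7.0
```

That is the weights alone, before activations, attention matrices, or the VAE are accounted for, which is why the measured peak memory below is far higher.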

I ran some tests on an A100 GPU (40GB) to explore how to optimize SDXL for inference speed and memory usage. Each inference run generates four images and is repeated three times; when measuring inference latency, only the final of the three iterations is considered.

Running SDXL out of the box at full precision with the default attention mechanism takes 72.2 seconds and consumes 28GB of memory!

```python
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0"
).to("cuda")
pipe.unet.set_default_attn_processor()
```

This is not very practical and can slow you down, since you'll often want to generate more than four images. And if you don't have a more powerful GPU, you'll run into that frustrating out-of-memory error message. So how can we optimize SDXL to increase inference speed and reduce its memory usage?

Diffusers has many optimization tricks and techniques that can help you run memory-intensive models like SDXL. The two things we focus on are inference speed and memory.

Note: the techniques described in this post apply to all pipelines.

Inference speed

Because diffusion is a random process, there is no guarantee you'll get an image you like, and in many cases you'll need to run inference multiple times and iterate. That makes optimizing for speed important. This section focuses on reducing inference time with lower precision, PyTorch 2.0's memory-efficient attention, and torch.compile.

Lower precision

Model weights are stored at a certain precision, expressed as a floating-point data type. The standard floating-point data type is float32 (fp32), which can accurately represent a wide range of floating-point numbers. For inference, you often don't need that much accuracy, so you can use float16 (fp16), which captures a narrower range. fp16 takes half the storage of fp32, and computations on it are faster because they are simpler. In addition, the latest GPU cards have hardware optimized for fp16 computation, making it even faster.
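As a quick sanity check of the "half the memory" claim, Python's standard struct module can report the storage size of half- and single-precision floats (plain Python, independent of PyTorch):

```python
import struct

fp32_bytes = struct.calcsize("f")  # single precision: 4 bytes
fp16_bytes = struct.calcsize("e")  # half precision: 2 bytes

# fp16 needs exactly half the storage of fp32
assert fp16_bytes * 2 == fp32_bytes
```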

With Diffusers, you can use fp16 for inference by specifying the torch_dtype parameter to convert the weights when the model is loaded:

```python
from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")
pipe.unet.set_default_attn_processor()
```

Compared to the completely unoptimized SDXL pipeline, using fp16 takes only 14.8 seconds and 21.7GB of memory. You've sped up inference by almost a whole minute!

Memory-efficient attention

The attention blocks used in transformer modules can be a huge bottleneck, because their memory use grows quadratically with the input sequence length. This can quickly take up a lot of memory and leave you with an out-of-memory error message. 😬
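The quadratic growth is easy to see with a back-of-the-envelope formula for the naive attention score matrix (the head count and element size here are illustrative assumptions, not SDXL's actual values):

```python
def attn_scores_bytes(seq_len, num_heads=8, batch=1, bytes_per_el=2):
    # Naive attention materializes a (batch, heads, seq_len, seq_len)
    # score matrix, so memory grows with the square of the sequence length.
    return batch * num_heads * seq_len * seq_len * bytes_per_el

# Doubling the sequence length quadruples the memory for the scores
assert attn_scores_bytes(2048) == 4 * attn_scores_bytes(1024)
```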

Memory-efficient attention algorithms reduce the memory burden of computing attention, whether by exploiting sparsity or by tiling. These optimized algorithms used to be available mostly in third-party libraries that had to be installed separately, but starting with PyTorch 2.0, that's no longer the case. PyTorch 2.0 introduced scaled dot product attention (SDPA), which offers fused implementations of FlashAttention, memory-efficient attention (xFormers), and a native PyTorch implementation in C++. SDPA is probably the easiest way to speed up inference: if you're using Diffusers with PyTorch ≥ 2.0, it's enabled automatically by default!

```python
from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")
```

Using fp16 and SDPA takes the same amount of memory as fp16 alone, but improves inference time to 11.4 seconds. Let's use this as the new baseline against which the other optimizations are compared.

torch.compile

PyTorch 2.0 also introduced the torch.compile API for just-in-time (JIT) compilation of PyTorch code into kernels optimized for inference. Unlike other compiler solutions, torch.compile requires minimal changes to your existing code: it's as easy as wrapping your model with the function.

The mode parameter lets you optimize for memory overhead or inference speed during compilation, which gives you a lot more flexibility.

```python
from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
```

Compared to the previous baseline (fp16 + SDPA), wrapping the UNet with torch.compile improves inference time to 10.2 seconds.

Note: compiling a model is slow the first time, but once the model is compiled, all subsequent calls are much faster!

Model Memory Footprint

Today's models keep getting bigger and bigger, and fitting them into memory is a challenge. This section focuses on reducing the memory footprint of these huge models so they can run on consumer GPUs. The techniques range from offloading model components to the CPU, to decoding the latents into images in several steps rather than all at once, to using a distilled version of the autoencoder.

Model CPU offload

Model offloading saves memory by loading the UNet into GPU memory while the other components of the diffusion model (the text encoders and VAE) are kept on the CPU. This way, the UNet can run for multiple iterations on the GPU until it's no longer needed.

```python
from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()
```

Compared to the baseline, this now takes 20.2GB of memory, saving you 1.5GB.

Sequential CPU Offload

Another type of offloading, which can save even more memory at the cost of slower inference, is sequential CPU offloading. Rather than offloading an entire model like the UNet, the weights stored in the different UNet submodules are offloaded to the CPU and only moved to the GPU right before the forward pass. Essentially, only parts of the model are loaded at any given time, saving even more memory. The only drawback is that it's significantly slower, because the submodules are loaded and offloaded many times.
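The idea can be illustrated with a toy sketch (pure Python, not Diffusers' actual hook machinery): each submodule is moved to the "GPU" just before its forward pass and offloaded right after, so only one submodule occupies accelerator memory at a time.

```python
class Submodule:
    """Toy stand-in for a UNet submodule."""
    def __init__(self, name):
        self.name = name
        self.device = "cpu"

    def forward(self, x):
        self.device = "gpu"   # loaded just before the forward pass
        y = x + 1             # stand-in for the real computation
        self.device = "cpu"   # offloaded immediately afterwards
        return y

blocks = [Submodule(f"block{i}") for i in range(3)]
x = 0
for block in blocks:
    x = block.forward(x)

# After the pass, every submodule is back on the CPU
assert x == 3 and all(b.device == "cpu" for b in blocks)
```

The repeated device transfers in the real implementation are exactly what makes this mode so much slower.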

```python
from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
)
pipe.enable_sequential_cpu_offload()
```

Compared to the baseline, this takes 19.9GB of memory, but inference time increases to 67 seconds.

VAE slicing

In SDXL, the variational autoencoder (VAE) decodes the refined latents (predicted by the UNet) into realistic images. The memory requirement of this step scales with the number of images being predicted (the batch size). Depending on the image resolution and the available GPU VRAM, it can be quite memory-intensive.

This is where "slicing" comes in handy. The input tensor to be decoded is split into slices, and the decoding computation is completed over several steps. This saves memory and allows larger batch sizes.
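Conceptually, slicing amounts to decoding the batch one slice at a time instead of all at once. A minimal sketch (not the actual Diffusers implementation, and with a trivial stand-in for the decoder):

```python
def decode_sliced(latents, decode_one):
    # Decode each latent in the batch separately, so peak memory scales
    # with a single image rather than the whole batch.
    return [decode_one(z) for z in latents]

# Toy "decoder" that just scales its input
images = decode_sliced([1, 2, 3, 4], lambda z: z * 10)
assert images == [10, 20, 30, 40]
```

The results are identical to decoding the whole batch at once; only the peak memory profile changes.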

```python
from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")
pipe.enable_vae_slicing()
```

Using sliced computations reduces memory to 15.4GB. Adding sequential CPU offloading reduces it further to 11.45GB, which lets you generate four 1024×1024 images per prompt. However, sequential offloading also increases inference latency.

Precomputed text embeddings

Text-conditioned image generation models typically use a text encoder to compute embeddings from the input prompt, and SDXL uses two text encoders! This contributes quite a bit to inference latency. However, these embeddings don't change throughout the diffusion process, so they can be precomputed and reused. That way, once the text embeddings are computed, the text encoders can be removed from memory.
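The caching idea itself is generic. As a toy illustration (with a hypothetical embed function standing in for the text encoders, not SDXL's real API), functools.lru_cache shows how a repeated prompt can skip recomputation entirely:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def embed(prompt):
    # Hypothetical stand-in for running the two SDXL text encoders
    return tuple(ord(c) % 7 for c in prompt)

a = embed("an astronaut riding a horse")
b = embed("an astronaut riding a horse")
assert a is b  # the second call returns the cached object
```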

First, load the text encoders and their corresponding tokenizers, and compute the embeddings from the input prompt:

```python
tokenizers = [tokenizer, tokenizer_2]
text_encoders = [text_encoder, text_encoder_2]

(
    prompt_embeds,
    negative_prompt_embeds,
    pooled_prompt_embeds,
    negative_pooled_prompt_embeds,
) = encode_prompt(tokenizers, text_encoders, prompt)
```

Next, flush the GPU memory and remove the text encoders:

```python
del text_encoder, text_encoder_2, tokenizer, tokenizer_2
flush()
```
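The flush() helper isn't defined in the snippet above; a plausible implementation (an assumption on my part, commonly seen in similar scripts) collects Python garbage and empties the CUDA allocator cache:

```python
import gc

def flush():
    # Assumed helper: release Python-side references, then clear the
    # CUDA allocator cache if PyTorch is available.
    gc.collect()
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass

flush()  # safe to call even without a GPU
```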

The embeddings are now ready to be passed directly to the SDXL pipeline:

```python
from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    text_encoder=None,
    text_encoder_2=None,
    tokenizer=None,
    tokenizer_2=None,
    torch_dtype=torch.float16,
).to("cuda")

call_args = dict(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_prompt_embeds,
    pooled_prompt_embeds=pooled_prompt_embeds,
    negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
    num_images_per_prompt=num_images_per_prompt,
    num_inference_steps=num_inference_steps,
)
image = pipe(**call_args).images[0]
```

Combined with SDPA and fp16, this reduces memory to 21.9GB. The other memory-optimization techniques described above can also be used with cached computations.

Tiny autoencoder

As mentioned earlier, the VAE decodes the latents into images, so this step is naturally bottlenecked by the size of the VAE. Let's just use a smaller autoencoder! madebyollin's tiny autoencoder, available on the Hub, is only 10MB and was distilled from the original VAE used by SDXL.

```python
from diffusers import AutoencoderTiny, StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
)
pipe.vae = AutoencoderTiny.from_pretrained(
    "madebyollin/taesdxl", torch_dtype=torch.float16
)
pipe.to("cuda")
```

This setup reduces memory requirements to 15.6GB while simultaneously reducing inference latency.

Note: the tiny autoencoder may omit some of the finer details from the image, which is why it's better suited for generating image previews.

Conclusion

To conclude, here is a summary of the savings from each optimization:

A note on profiling: when profiling GPUs to measure the trade-off between inference latency and memory requirements, it's important to be aware of the hardware being used. These findings may not transfer uniformly from one setup to another. For example, torch.compile appears to only benefit modern GPUs, at least for SDXL.

| Technique | Memory (GB) | Inference latency (ms) |
|---|---|---|
| Unoptimized pipeline | 28.09 | 72200.5 |
| fp16 | 21.72 | 14800.9 |
| fp16 + SDPA (default) | 21.72 | 11413.0 |
| default + torch.compile | — | ~10200 |
| default + model CPU offload | 20.2 | — |
| default + sequential CPU offload | 19.9 | 67034.0 |
| default + VAE slicing | 15.40 | 11232.2 |
| default + VAE slicing + sequential CPU offload | 11.47 | 66869.2 |
| default + precomputed text embeddings | 21.85 | 11909.0 |
| default + tiny autoencoder | 15.48 | 10449.7 |

I hope these optimizations make it easier to run your favorite pipeline. Try these techniques and share your images with us! 🤗

Acknowledgements: Thanks to Pedro Cuenca for an informative review of the draft.
