
Over the past few months, we have seen the emergence of Transformer-based diffusion backbones for high-resolution text-to-image (T2I) generation. These models use the transformer architecture as the building block of the diffusion process, instead of the UNet architecture that was prevalent in many early diffusion models. Thanks to the nature of Transformers, these backbones show good scalability, with models ranging from 0.6B to 8B parameters.
As models grow larger, their memory requirements increase. The problem is compounded by the fact that a diffusion pipeline usually consists of several components: a text encoder, a diffusion backbone, and an image decoder. Moreover, modern diffusion pipelines use multiple text encoders: for example, Stable Diffusion 3 uses three.
These high memory requirements can make it hard to run these models on consumer GPUs, slowing adoption and making experimentation harder. In this post, we show how to improve the memory efficiency of Transformer-based diffusion pipelines by using Quanto's quantization utilities together with the Diffusers library.
Preliminaries
For a detailed introduction to Quanto, please refer to this post. In short, Quanto is a quantization toolkit built on top of PyTorch. It is part of Hugging Face Optimum, a set of tools for hardware optimization.
Model quantization is a popular tool among LLM practitioners, but less so with diffusion models. Quanto helps bridge this gap and delivers memory savings with little or no quality degradation.
For benchmarking, we use an H100 GPU with the following environment:
Unless otherwise specified, we default to performing computations in FP16. We chose not to quantize the VAE to prevent numerical instability issues. Our benchmarking code can be found here.
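To give a rough idea of how such numbers can be collected, the sketch below measures peak memory and average latency for an already loaded pipeline. It is a simplified stand-in rather than the linked benchmark script; the helper name and its arguments are our own.

import time
import torch

def benchmark(pipeline, prompt, batch_size=1, num_runs=3):
    # Track peak GPU memory from this point onwards.
    torch.cuda.reset_peak_memory_stats()
    # Warm-up run so one-time setup costs do not skew the timing.
    _ = pipeline(prompt, num_images_per_prompt=batch_size)

    torch.cuda.synchronize()
    start = time.time()
    for _ in range(num_runs):
        _ = pipeline(prompt, num_images_per_prompt=batch_size)
    torch.cuda.synchronize()

    latency = (time.time() - start) / num_runs
    memory_gb = torch.cuda.max_memory_allocated() / 1024**3
    return memory_gb, latency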
At the time of this writing, Diffusers provides the following Transformer-based diffusion pipelines for text-to-image generation:
We also have Latte, a Transformer-based text-to-video generation pipeline.
For brevity, we limit our study to three of them: PixArt-Sigma, Stable Diffusion 3, and Aura Flow. The table below shows the parameter counts of their diffusion backbones:
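If you want to check a backbone's parameter count yourself, you can count it directly. A quick sketch for the PixArt-Sigma transformer (the same idea applies to the other backbones):

from diffusers import PixArtTransformer2DModel

transformer = PixArtTransformer2DModel.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS", subfolder="transformer"
)
# Sum the sizes of all parameter tensors and report the count in billions.
num_params = sum(p.numel() for p in transformer.parameters())
print(f"{num_params / 1e9:.2f}B parameters")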
It is worth keeping in mind that this post primarily focuses on memory efficiency, at a slight or negligible cost in inference latency.
Quantizing a DiffusionPipeline with Quanto
Quantizing a model with Quanto is straightforward:
from optimum.quanto import freeze, qfloat8, quantize
from diffusers import PixArtSigmaPipeline
import torch

pipeline = PixArtSigmaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS", torch_dtype=torch.float16
).to("cuda")

quantize(pipeline.transformer, weights=qfloat8)
freeze(pipeline.transformer)
We call quantize() on the module to be quantized, specifying what we want to quantize. In the case above, we only quantize the parameters and leave the activations as they are, quantizing to the FP8 data type. Finally, we call freeze() to replace the original parameters with the quantized ones.
We can then run the pipeline as usual:
image = pipeline("ghibli style, a fantasy landscape with castles").images[0]

(FP16 vs. diffusion transformer in FP8)
With FP8, we observe the following memory savings, at a slightly higher latency and with virtually no degradation in quality:
Batch Size | Quantization | Memory (GB) | Latency (Seconds)
1 | None | 12.086 | 1.200
1 | FP8 | 11.547 | 1.540
4 | None | 12.087 | 4.482
4 | FP8 | 11.548 | 5.109
We can quantize the text encoder in the same way:
quantize(pipeline.text_encoder, weights=qfloat8)
freeze(pipeline.text_encoder)
The text encoder is also a transformer model, so it can be quantized too. Quantizing both the text encoder and the diffusion backbone yields much larger memory improvements:
Batch Size | Quantization | Quantize TE | Memory (GB) | Latency (Seconds)
1 | FP8 | False | 11.547 | 1.540
1 | FP8 | True | 5.363 | 1.601
4 | FP8 | False | 11.548 | 5.109
4 | FP8 | True | 5.364 | 5.141
Quantizing the text encoder produces results that are very similar to the previous case.
Generality of the observations
Quantizing the text encoder together with the diffusion backbone generally works for the models we tried. Stable Diffusion 3 is a special case, as it uses three different text encoders. We found that quantizing the second text encoder does not work well, so we recommend one of the following alternatives instead:

- Quantize only the first text encoder, or
- Quantize only the third text encoder, or
- Quantize the first and third text encoders.
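As a concrete illustration, here is a minimal sketch of the last option for Stable Diffusion 3. It assumes the Diffusers StableDiffusion3Pipeline, whose text encoders are exposed as text_encoder, text_encoder_2, and text_encoder_3; the checkpoint name is just an example.

from optimum.quanto import freeze, qfloat8, quantize
from diffusers import StableDiffusion3Pipeline
import torch

pipeline = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
).to("cuda")

# Quantize the diffusion backbone plus the first and third text encoders,
# leaving the second (CLIP) text encoder untouched.
for module in (pipeline.transformer, pipeline.text_encoder, pipeline.text_encoder_3):
    quantize(module, weights=qfloat8)
    freeze(module)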
The table below gives an idea of the expected memory savings for various text encoder quantization combinations (the diffusion transformer is quantized in all cases):
Batch Size | Quantization | Quantize TE 1 | Quantize TE 2 | Quantize TE 3 | Memory (GB) | Latency (Seconds)
1 | FP8 | ✅ | ✅ | ✅ | 8.200 | 2.858
1 | FP8 | ❌ | ❌ | ✅ | 8.294 | 2.781
1 | FP8 | ✅ | ✅ | ❌ | 14.384 | 2.833
1 | FP8 | ❌ | ✅ | ❌ | 14.475 | 2.818
1 | FP8 | ✅ | ❌ | ✅ | 8.325 | 2.875
(Generated samples: quantized text encoder 1, quantized text encoder 3, and quantized text encoders 1 and 3.)
Misc findings
BFLOAT16 is usually better on the H100
Using bfloat16 can be faster on supported GPU architectures, such as the H100 or 4090. The table below presents some numbers for PixArt, measured on our H100 reference hardware (a minimal bfloat16 sketch follows the table):
Batch Size | Precision | Quantization | Memory (GB) | Latency (Seconds) | Quantize TE
1 | FP16 | INT8 | 5.363 | 1.538 | True
1 | BF16 | INT8 | 5.364 | 1.454 | True
1 | FP16 | FP8 | 5.363 | 1.601 | True
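Switching to bfloat16 only changes the dtype the pipeline is loaded with; everything else stays the same as the FP16 example above. A minimal sketch:

from optimum.quanto import freeze, qint8, quantize
from diffusers import PixArtSigmaPipeline
import torch

pipeline = PixArtSigmaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS", torch_dtype=torch.bfloat16
).to("cuda")

# Quantize both the diffusion backbone and the text encoder to INT8.
quantize(pipeline.transformer, weights=qint8)
freeze(pipeline.transformer)
quantize(pipeline.text_encoder, weights=qint8)
freeze(pipeline.text_encoder)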
The promise of qint8
We generally found quantizing with qint8 (instead of qfloat8) to be better in terms of inference latency. This effect becomes more pronounced when we horizontally fuse the attention QKV projections (by calling fuse_qkv_projections() in Diffusers), which thickens the dimensions of the int8 kernels and speeds up computation. We present some evidence for PixArt below; a short code sketch follows the table.
Batch Size | Quantization | Memory (GB) | Latency (Seconds) | Quantize TE | QKV Projection
1 | INT8 | 5.363 | 1.538 | True | False
1 | INT8 | 5.536 | 1.504 | True | True
4 | INT8 | 5.365 | 5.129 | True | False
4 | INT8 | 5.538 | 4.989 | True | True
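A sketch of this combination is shown below. We assume the fusion helper is exposed on the transformer as fuse_qkv_projections(), as it is for several Diffusers models; fusion is applied before quantization so the fused, wider projections are what get mapped to the int8 kernels.

from optimum.quanto import freeze, qint8, quantize

# Fuse the attention QKV projections into single, wider linear layers,
# then quantize the transformer weights to INT8.
pipeline.transformer.fuse_qkv_projections()
quantize(pipeline.transformer, weights=qint8)
freeze(pipeline.transformer)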
How about INT4?
We additionally experimented with qint4 when using bfloat16. This is only applicable to bfloat16 on H100, because other configurations are not supported yet. With qint4, we can expect further improvements in memory consumption at the cost of increased inference latency. The increased latency is expected because there is no native hardware support for int4 computation: the weights are transferred using 4 bits, but computation is still performed in bfloat16. The table below shows our results for PixArt-Sigma, followed by a minimal setup sketch:
Batch Size | Quantize TE | Memory (GB) | Latency (Seconds)
1 | No | 9.380 | 7.431
1 | Yes | 3.058 | 7.604
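A minimal qint4 setup under these constraints could look like the following; note the bfloat16 dtype. Excluding the final projection layer, discussed next, can be added on top of this.

from optimum.quanto import freeze, qint4, quantize
from diffusers import PixArtSigmaPipeline
import torch

# qint4 currently requires bfloat16 computation on H100.
pipeline = PixArtSigmaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS", torch_dtype=torch.bfloat16
).to("cuda")

quantize(pipeline.transformer, weights=qint4)
freeze(pipeline.transformer)
quantize(pipeline.text_encoder, weights=qint4)
freeze(pipeline.text_encoder)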
Note, however, that the aggressive discretization of int4 can make the end results take a hit. This is why, for Transformer-based models in general, the final projection layer is usually left out of quantization. In Quanto, we do this as follows:
quantize(pipeline.transformer, weights=qint4, exclude="proj_out")
freeze(pipeline.transformer)
"proj_out" corresponds to the last layer of pipeline.transformer. The images below show the results for various settings:
(Generated samples: Quantize TE: No, Layer Exclusion: None; Quantize TE: No, Layer Exclusion: "proj_out"; Quantize TE: Yes, Layer Exclusion: None; Quantize TE: Yes, Layer Exclusion: "proj_out".)
A common practice for recovering lost image quality is quantization-aware training, which is also supported in Quanto. This technique is outside the scope of this post; feel free to contact us if you are interested!
All the results of the experiments in this post can be found here.
Bonus – saving and loading Diffusers models in Quanto
Quantized Diffusers models can be saved and loaded:
from diffusers import PixArtTransformer2DModel
from optimum.quanto import QuantizedPixArtTransformer2DModel, qfloat8

model = PixArtTransformer2DModel.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS", subfolder="transformer"
)
qmodel = QuantizedPixArtTransformer2DModel.quantize(model, weights=qfloat8)
qmodel.save_pretrained("pixart-sigma-fp8")
The resulting checkpoint is 587 MB in size, instead of the original 2.44 GB. We can then load it:
from optimum.quanto import QuantizedPixArtTransformer2DModel
import torch

transformer = QuantizedPixArtTransformer2DModel.from_pretrained("pixart-sigma-fp8")
transformer.to(device="cuda", dtype=torch.float16)
And use it in a DiffusionPipeline:
from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS",
    transformer=None,
    torch_dtype=torch.float16,
).to("cuda")
pipe.transformer = transformer

prompt = "A small cactus with a happy face in the Sahara desert."
image = pipe(prompt).images[0]
In the future, we can expect to pass the transformer directly when initializing the pipeline, so that this will work:
pipe = PixArtSigmaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS",
-   transformer=None,
+   transformer=transformer,
    torch_dtype=torch.float16,
).to("cuda")
The QuantizedPixArtTransformer2DModel implementation is available here for reference. If you would like more Diffusers models to be supported in Quanto for saving and loading, please open an issue here and mention @sayakpaul.
Tip
Depending on your requirements, you may want to apply different types of quantization to different pipeline modules. For example, you could use FP8 for the text encoder but INT8 for the diffusion transformer. Thanks to the flexibility of Diffusers and Quanto, this can be done seamlessly. Quantization can also be combined with other memory optimization techniques in Diffusers, such as enable_model_cpu_offload(), to better fit your use case.
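For example, assuming pipeline is an already loaded text-to-image pipeline such as the PixArt-Sigma one from earlier, the combination could look like this sketch:

from optimum.quanto import freeze, qfloat8, qint8, quantize

# FP8 for the text encoder, INT8 for the diffusion backbone.
quantize(pipeline.text_encoder, weights=qfloat8)
freeze(pipeline.text_encoder)
quantize(pipeline.transformer, weights=qint8)
freeze(pipeline.transformer)

# Optionally combine quantization with model offloading to trim peak GPU memory.
pipeline.enable_model_cpu_offload()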
Conclusion
In this post, we demonstrated how to quantize Transformer models from Diffusers and optimize their memory consumption. The savings become even more pronounced when the text encoders involved are quantized as well. We hope you will apply some of these workflows to your projects and benefit from them.
Thanks to Pedro Cuenca for his extensive reviews of this post.