SD Turbo and SDXL Turbo are two fast text-to-image models that can generate viable images in a single step. This is a significant improvement over the 30+ steps often required with previous Stable Diffusion models. SD Turbo is a distilled version of Stable Diffusion 2.1, while SDXL Turbo is a distilled version of SDXL 1.0. Previously, we showed how to accelerate Stable Diffusion inference with ONNX Runtime. ONNX Runtime not only offers performance benefits when used with SD Turbo and SDXL Turbo, but also lets you access the models in languages other than Python, such as C# and Java.
Performance improvements
In this post, we introduce optimizations to the ONNX Runtime CUDA and TensorRT execution providers that speed up SD Turbo and SDXL Turbo inference on NVIDIA GPUs.
ONNX Runtime outperformed PyTorch for all (batch size, number of steps) combinations, with throughput gains of up to 229% for the SDXL Turbo model and 120% for the SD Turbo model. ONNX Runtime CUDA performs particularly well with dynamic shapes, but also shows a noticeable improvement over PyTorch with static shapes.
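To make the reported gains concrete, a throughput gain of 229% means ONNX Runtime produced 3.29x as many images per second as the baseline. A small sketch of the arithmetic (the sample numbers below are illustrative, not the measured values):

```python
def throughput_gain_pct(ort_images_per_sec: float, baseline_images_per_sec: float) -> float:
    """Percent throughput gain of ONNX Runtime over a baseline framework."""
    return (ort_images_per_sec / baseline_images_per_sec - 1.0) * 100.0

# Illustrative numbers only: a 229% gain corresponds to a 3.29x throughput ratio.
gain = throughput_gain_pct(3.29, 1.0)
print(round(gain))  # -> 229
```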
How to run SD Turbo and SDXL Turbo
To accelerate inference with the ONNX Runtime CUDA execution provider, access the optimized versions of SD Turbo and SDXL Turbo on Hugging Face.
The models are generated by Olive, an easy-to-use, hardware-aware model optimization tool. Note that for best performance, the FP16 VAE must be enabled via the command line, as is done in the shared optimized versions. See the SD Turbo and SDXL Turbo usage examples for information on how to run the SD and SDXL pipelines with the ONNX files hosted on Hugging Face.
Alternatively, follow the instructions here to accelerate inference with the ONNX Runtime TensorRT execution provider.
Below is an example of image generation using the SDXL turbo model guided by a text prompt.
python3 demo_txt2img_xl.py \
  --version xl-turbo \
  "little cute gremlin wearing a jacket, cinematic, vivid colors, intricate masterpiece, golden ratio, highly detailed"
Figure 1. Image of a cute little gremlin wearing a jacket, generated from a text prompt using SDXL Turbo.
Note that the example image was generated in four steps, demonstrating the ability of SD Turbo and SDXL Turbo to generate viable images in fewer steps than previous Stable Diffusion models.
For a user-friendly way to try out Stable Diffusion models, see the Automatic1111 SD WebUI ONNX Runtime extension. This extension enables optimized execution of Stable Diffusion UNet models on NVIDIA GPUs, using the ONNX Runtime CUDA execution provider to run inference on Olive-optimized models. Currently, the extension is only optimized for Stable Diffusion 1.5. SD Turbo and SDXL Turbo models can also be used, but performance optimizations for them are still in progress.
Stable diffusion applications in C# and Java
Taking advantage of the cross-platform, performance, and usability benefits of ONNX Runtime, community members have also contributed their own samples and UI tools for Stable Diffusion powered by ONNX Runtime.
These community contributions include OnnxStack, a .NET library that builds on our previous C# tutorial to provide users with a variety of capabilities for many different Stable Diffusion models when running inference with C# and ONNX Runtime.
Additionally, Oracle has released a Stable Diffusion sample in Java that runs inference on top of ONNX Runtime. This project is also based on the C# tutorial.
Benchmark results
We benchmarked the SD Turbo and SDXL Turbo models on a Standard_ND96amsr_A100_v4 VM with an A100-SXM4-80GB GPU and on a Lenovo desktop with an RTX 4090 GPU (WSL Ubuntu 20.04), generating 512×512 images with the LCM scheduler. Results were measured with the following specifications:
onnxruntime-gpu == 1.17.0 (built from source)
torch == 2.1.0a0+32f93b1
tensorrt == 8.6.1
transformers
We recommend that you use the steps linked in the Examples section to reproduce these results.
We used SDXL-VAE-FP16-Fix when testing SDXL Turbo because the original SDXL Turbo VAE cannot run at FP16 precision. There is a slight discrepancy between its output and that of the original VAE, but the decoded images are close enough for most purposes.
The static-shape PyTorch pipeline uses the channels-last memory format and torch.compile in reduce-overhead mode.
The following charts show throughput in images per second for different frameworks across (batch size, number of steps) combinations. The labels above each bar show the speedup compared with torch.compile.
We chose 1 and 4 steps because both SD Turbo and SDXL Turbo can generate viable images in just one step, but typically produce the highest-quality images in 3 to 5 steps.
SDXL Turbo
The graphs below show image throughput per second for the SDXL Turbo model with both static and dynamic shapes. Results were collected on an A100-SXM4-80GB GPU across different (batch size, number of steps) combinations. For dynamic shapes, the TensorRT engine supports batch sizes 1 to 8 and image sizes from 512×512 to 768×768, but is optimized for batch size 1 and image size 512×512.
SD Turbo
The following two graphs show image throughput per second for the SD Turbo model with both static and dynamic shapes on an A100-SXM4-80GB GPU.
The final set of graphs shows image throughput per second for the SD Turbo model with both static and dynamic shapes on an RTX 4090 GPU. In this dynamic shape test, the TensorRT engine is built for batch sizes 1 to 8 (optimized for batch size 1) and a fixed image size of 512×512 due to memory limitations.
How much faster are SD Turbo and SDXL Turbo with ONNX Runtime?
These results show that ONNX Runtime significantly outperforms PyTorch with both static and dynamic shapes for all of the (batch, steps) combinations shown. This conclusion applies to both model sizes (SD Turbo and SDXL Turbo) and to both GPUs. Notably, ONNX Runtime with CUDA (dynamic shape) was 229% faster than torch eager for the (batch, steps) combination (1, 4).
Furthermore, ONNX Runtime with the TensorRT execution provider performs slightly better for static shapes, given that ORT_TRT throughput is higher than the corresponding ORT_CUDA throughput for most (batch, steps) combinations. In general, static shapes are preferable if the user knows the batch and image size at graph definition time (for example, when the user plans to generate only images with batch size 1 and image size 512×512). In those situations, static shapes are faster. However, if the user later decides to switch to a different batch and/or image size, TensorRT must build a new engine (doubling the engine files on disk) and switch engines (spending additional time loading the new engine).
On the other hand, ONNX Runtime with the CUDA execution provider is often the better choice for SD Turbo and SDXL Turbo models with dynamic shapes on an A100-SXM4-80GB GPU, while ONNX Runtime with the TensorRT execution provider performs slightly better for most (batch, steps) combinations with dynamic shapes on an RTX 4090 GPU. The advantage of dynamic shapes is faster inference when the user does not know the batch and image size until graph execution time (for example, running batch size 1 with image size 512×512 for one image and a different image size, such as 512×768, for another). In these cases, the user builds and saves a single engine rather than switching engines during inference.
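The static versus dynamic shape trade-off described above can be sketched with a toy engine cache: with static shapes, every new (batch, height, width) combination triggers a fresh engine build, while a single dynamic-shape engine covers the whole supported range. This is a simplified illustration of the bookkeeping, not TensorRT's actual API:

```python
class StaticEngineCache:
    """Toy model of static-shape engines: one engine per exact input shape."""
    def __init__(self):
        self.engines = {}      # (batch, height, width) -> engine
        self.builds = 0        # each build costs time and disk space

    def get_engine(self, batch, height, width):
        key = (batch, height, width)
        if key not in self.engines:
            self.builds += 1   # new shape: build and store another engine
            self.engines[key] = f"engine{key}"
        return self.engines[key]

cache = StaticEngineCache()
cache.get_engine(1, 512, 512)
cache.get_engine(1, 512, 512)  # cache hit, no new build
cache.get_engine(1, 512, 768)  # new shape: a second engine lands on disk
print(cache.builds)  # -> 2
```

A dynamic-shape engine corresponds to a cache with a single entry that serves every shape within its optimization range, avoiding both the extra disk space and the engine-switch latency.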
GPU Optimization
In addition to the techniques introduced in our previous Stable Diffusion blog post, the following optimizations were applied by ONNX Runtime to generate the SD Turbo and SDXL Turbo results outlined in this post:
- Enable CUDA Graph for static shape inputs.
- Add Flash Attention V2.
- Remove extra outputs in the text encoder (keeping only the hidden state output specified by the clip_skip parameter).
- Add SkipGroupNorm fusion to fuse group normalization with the Add nodes that precede it.
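As an illustration of the SkipGroupNorm pattern, the fusion computes group normalization directly on x + skip instead of materializing the addition as a separate graph node. A minimal pure-Python sketch of the math (real implementations operate on 4D tensors on the GPU; this only shows that the fused form matches Add followed by GroupNorm):

```python
import math

def group_norm(x, num_groups, eps=1e-5):
    """Group normalization over a flat list of channel values."""
    group_size = len(x) // num_groups
    out = []
    for g in range(num_groups):
        group = x[g * group_size:(g + 1) * group_size]
        mean = sum(group) / group_size
        var = sum((v - mean) ** 2 for v in group) / group_size
        out.extend((v - mean) / math.sqrt(var + eps) for v in group)
    return out

def skip_group_norm(x, skip, num_groups, eps=1e-5):
    """Fused version: add the skip connection and normalize in one pass."""
    return group_norm([a + b for a, b in zip(x, skip)], num_groups, eps)

x = [1.0, 2.0, 3.0, 4.0]
skip = [0.5, -0.5, 1.0, 0.0]
fused = skip_group_norm(x, skip, num_groups=2)
unfused = group_norm([a + b for a, b in zip(x, skip)], num_groups=2)
assert all(abs(f - u) < 1e-9 for f, u in zip(fused, unfused))
```

On the GPU, fusing the two operations saves a kernel launch and a round trip through memory for the intermediate x + skip tensor.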
ONNX Runtime also added support for new features such as LoRA weights for Latent Consistency Models (LCMs).
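A LoRA adapter stores a low-rank update to a weight matrix, which can be merged at load time as W' = W + scale * (B @ A). A minimal sketch with plain Python lists (the matrices, rank, and scale below are illustrative, not values from any real model):

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def merge_lora(W, A, B, scale=1.0):
    """Return W + scale * (B @ A), the merged LoRA weight."""
    delta = matmul(B, A)
    return [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

# Rank-1 update of a 2x2 weight matrix (illustrative numbers).
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]        # 2x1 down-projection output
A = [[0.5, 0.5]]          # 1x2 up-projection input
merged = merge_lora(W, A, B, scale=0.1)
print(merged)  # -> [[1.05, 0.05], [0.1, 1.1]]
```

Because the rank of B @ A is small, the adapter adds only a fraction of the storage of the full weight, and merging it ahead of time keeps inference cost identical to the base model.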
Next Steps
In the future, we plan to continue improving Stable Diffusion performance by updating our demos to support new features such as IP-Adapter and Stable Video Diffusion. ControlNet support will also be available soon.
We are also working to optimize SD Turbo and SDXL Turbo performance in our existing Stable Diffusion web UI extension, and to add support for both models to the Windows UI developed by members of the ONNX Runtime community.
Additionally, a tutorial on how to run SD Turbo and SDXL Turbo with C# and ONNX Runtime is coming soon. In the meantime, check out our previous tutorial on Stable Diffusion.
Resources
Take a look at some of the resources discussed in this post.

