Fast LoRA inference for Flux using diffusers and PEFT

By versatileai | July 24, 2025

LoRA adapters offer enormous customization for models of all shapes and sizes. When it comes to image generation, they can empower models with different styles, different characters, and more. Sometimes they can also be used to reduce inference latency. Hence, their importance is paramount, particularly when it comes to customizing and fine-tuning models.

In this post, we take the Flux.1-Dev model for text-to-image generation, given its widespread popularity and adoption, and show how to optimize its inference speed when using LoRAs (~2.3x speedup). Over 30K adapters have been trained for it (as reported on the Hugging Face Hub), hence its importance to the community.

Even though we show speedups for Flux, we believe our recipes are general enough to apply to other models as well.

If you cannot wait to get started with the code, check out the accompanying code repository.

When serving LoRAs, it is common to hotswap them (swap out one LoRA and replace it with another). A LoRA changes the base model architecture. Furthermore, LoRAs can differ from one another: each may have a different rank and target different layers for adaptation. To account for these dynamic properties of LoRAs, we must take the necessary steps to ensure that the optimizations we apply are robust.

For example, you can apply torch.compile to a model loaded with a particular LoRA to obtain speedups in inference latency. However, the moment you swap in a different LoRA (with a potentially different configuration), you run into recompilation, causing slower inference.

You can also fuse the LoRA parameters into the base model parameters, perform compilation, and unfuse the LoRA parameters when loading new ones. However, this approach again hits recompilation every time a new adapter is swapped in, due to potential architecture-level changes.
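
To make the fusion approach concrete, here is a minimal sketch of the fuse-then-compile workflow, using hypothetical adapter identifiers; it is not the recipe used in this post, only an illustration of why swapping adapters after fusion can invalidate the compiled graph.

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Fuse the first LoRA into the base weights and compile once.
pipe.load_lora_weights("<lora-repo-or-path-1>")  # hypothetical placeholder
pipe.fuse_lora()
pipe.transformer.compile(fullgraph=True)

# Switching to another LoRA means unfusing and loading a different adapter,
# which can change the module graph and trigger a fresh compilation.
pipe.unfuse_lora()
pipe.load_lora_weights("<lora-repo-or-path-2>")  # hypothetical placeholder
pipe.fuse_lora()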

Our optimization recipe takes the above situations into account as realistically as possible. Below are the key components of our optimization recipe:

  • Flash Attention 3 (FA3)
  • torch.compile
  • FP8 quantization from torchao
  • Hotswapping-ready compilation

Of the above, it should be noted that FP8 quantization is lossy, but it often offers the best speed-memory trade-off. We tested the recipe mainly on NVIDIA GPUs, but it should work on AMD GPUs too.

In previous blog posts (post 1 and post 2), we have already covered the benefits of using the first three components of the optimization recipe. Applying them one by one is just a matter of a few lines of code.

from diffusers import DiffusionPipeline, TorchAoConfig
from diffusers.quantizers import PipelineQuantizationConfig
from utils.fa3_processor import FlashFluxAttnProcessor3_0
import torch

pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
    quantization_config=PipelineQuantizationConfig(
        quant_mapping={"transformer": TorchAoConfig("float8dq_e4m3_row")}
    ),
).to("cuda")

pipe.transformer.set_attn_processor(FlashFluxAttnProcessor3_0())
pipe.transformer.compile(fullgraph=True, mode="max-autotune")

pipe_kwargs = {
    "prompt": "A cat holding a sign that says hello world",
    "height": 1024,
    "width": 1024,
    "guidance_scale": 3.5,
    "num_inference_steps": 28,
    "max_sequence_length": 512,
}

image = pipe(**pipe_kwargs).images[0]

This is where the FA3 processor comes from.

The problems start to surface when we try to swap in a new LoRA on the compiled diffusion transformer (pipe.transformer) without triggering recompilation.

Normally, loading and unloading LoRAs requires recompilation, which defeats the speed advantages gained from compilation. Thankfully, there is a way to avoid this. By passing hotswap=True, diffusers leaves the model architecture unchanged and only exchanges the weights of the LoRA adapter itself, which does not require recompilation.

pipe.enable_lora_hotswap(target_rank=max_rank)

# load the first LoRA adapter and compile
pipe.load_lora_weights(<lora-adapter-name-1>)
pipe.transformer.compile(mode="max-autotune", fullgraph=True)
image = pipe(**pipe_kwargs).images[0]

# hotswap the second LoRA adapter without triggering recompilation
pipe.load_lora_weights(<lora-adapter-name-2>, hotswap=True)
image = pipe(**pipe_kwargs).images[0]

(As a reminder, since torch.compile is a just-in-time compiler, the first call to the pipe will be slow. Subsequent calls, however, should be significantly faster.)
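
To see the warm-up effect yourself, a quick timing check along the following lines can help. This is a generic sketch that reuses the pipe and pipe_kwargs defined above; it is not part of the post's benchmark harness.

import time
import torch

def timed_generation():
    torch.cuda.synchronize()
    start = time.perf_counter()
    image = pipe(**pipe_kwargs).images[0]
    torch.cuda.synchronize()
    return image, time.perf_counter() - start

_, warmup_s = timed_generation()  # includes torch.compile's just-in-time compilation
_, steady_s = timed_generation()  # closer to the latency you would see when serving
print(f"warm-up: {warmup_s:.3f}s, steady-state: {steady_s:.3f}s")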

This generally allows us to swap LoRAs without recompilation, but there are limitations:

  • You must provide the maximum rank among all LoRA adapters ahead of time. So if you have one adapter with rank 16 and another with rank 32, you need to pass max_rank=32 (a sketch for determining this is shown below).
  • LoRA adapters that are hotswapped in can only target the same layers, or a subset of the layers, that the first LoRA targets.
  • Targeting the text encoder is not supported yet.
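
If you do not know the ranks of your adapters in advance, a best-effort inspection like the following sketch can help for PEFT/diffusers-style LoRA checkpoints. The file names are hypothetical, and the key-naming assumptions may not cover every LoRA format.

from safetensors.torch import load_file

def lora_max_rank(checkpoint_paths):
    # Best effort: in PEFT-style checkpoints, the rank is the smaller dimension
    # of the lora_A (or lora_down) weight matrices.
    max_rank = 0
    for path in checkpoint_paths:
        for name, weight in load_file(path).items():
            if "lora_A" in name or "lora_down" in name:
                max_rank = max(max_rank, min(weight.shape))
    return max_rank

max_rank = lora_max_rank(["adapter_1.safetensors", "adapter_2.safetensors"])  # hypothetical paths
pipe.enable_lora_hotswap(target_rank=max_rank)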

For more information about hotswapping in diffusers and its limitations, see the hotswapping section of the documentation.

The benefit of this workflow becomes evident when we compare inference latencies with and without compilation plus hotswapping.

Option | Time (s) ⬇️ | Speedup (vs baseline) ⬆️ | Notes
baseline | 7.8910 | – | Baseline
optimized | 3.5464 | 2.23x | Hotswapping and compilation without recompilation hiccups (FP8 on by default)
no_fp8 | 4.3520 | 1.81x | Same as optimized, but with FP8 quantization disabled
baseline + compile | 5.0920 | 1.55x | Compilation on, but suffers from intermittent recompilation stalls
no_fa3_fp8 | 5.0850 | 1.55x | Disables FA3 and FP8
no_compile_fp8 | 7.5190 | 1.05x | Disables compilation and FP8 quantization

Important takeaways:

  • The "baseline + compile" option provides a decent speedup over the baseline, but it suffers from recompilation issues that increase the overall execution time. The benchmark numbers do not include compilation time.
  • The best speedup is achieved with hotswapping-ready compilation (the "optimized" option), which eliminates the recompilation issues.
  • The "optimized" option has FP8 quantization enabled, which can lead to quality degradation. You can still get a decent speedup without FP8 (the "no_fp8" option).
  • For demonstration purposes, we use a pool of two LoRAs for hotswapping with compilation (see the sketch below). See the accompanying code repository for the complete code.
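
As an illustration of the two-LoRA pool mentioned above, a serving loop might look roughly like the following. The adapter identifiers and prompts are hypothetical placeholders; the full benchmark logic lives in the accompanying repository.

# Hotswap between a pool of two LoRAs: the first load compiles once,
# subsequent swaps reuse the compiled graph.
lora_pool = ["<lora-repo-or-path-1>", "<lora-repo-or-path-2>"]  # hypothetical placeholders

pipe.enable_lora_hotswap(target_rank=max_rank)
pipe.load_lora_weights(lora_pool[0])
pipe.transformer.compile(mode="max-autotune", fullgraph=True)

for request_idx, prompt in enumerate(["a watercolor cat", "a pixel-art dog"]):
    adapter = lora_pool[request_idx % len(lora_pool)]
    if request_idx > 0:
        # Swap weights in place; the architecture stays fixed, so no recompilation.
        pipe.load_lora_weights(adapter, hotswap=True)
    image = pipe(**{**pipe_kwargs, "prompt": prompt}).images[0]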

The optimization recipes discussed so far assume access to powerful GPUs like the H100. But what can we do if we are limited to using consumer GPUs such as the RTX 4090? Let’s look into it.

Flux.1-Dev (without a LoRA) takes ~33GB of memory to run in the bfloat16 data type. Depending on the size of the LoRA module, this memory footprint can increase even further. Without any optimization, this does not fit on many consumer GPUs like the RTX 4090, which only has 24GB. Throughout the rest of this section, we use an RTX 4090 machine as our testbed.

First, to enable end-to-end execution of Flux.1-Dev, we can apply CPU offloading, in which components that are not needed for the current computation are offloaded to the CPU, freeing up accelerator memory. Doing so lets us run the entire pipeline in ~22GB, taking 35.403 seconds on the RTX 4090. Enabling compilation reduces the latency to 31.205 seconds (a 1.12x speedup). In code, it is just a few lines:

pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()
pipe.transformer.compile_repeated_blocks(fullgraph=True)

image = pipe(**pipe_kwargs).images[0]

Note that FP8 quantization was not applied here, as it is not supported in combination with CPU offloading and compilation (see the supporting issue thread). Therefore, simply applying FP8 quantization to the Flux transformer is not sufficient to mitigate the memory exhaustion problem, so we dropped it in this example.

So, to still take advantage of the FP8 quantization scheme, we need a way to do without CPU offloading. For Flux.1-Dev, if we additionally quantize the T5 text encoder, we should be able to load and run the complete pipeline in 24GB. Below is a comparison of results with and without the T5 text encoder quantized (NF4 quantization from bitsandbytes).

(Figure: outputs with and without the T5 text encoder quantized)

As you can see in the figure above, quantizing the T5 text encoder does not cause much quality loss. Combining the quantized T5 text encoder and the FP8-quantized Flux transformer with torch.compile takes us from 32.27 seconds down to 9.668 seconds (a massive ~3.3x speedup) without noticeable quality degradation.

(Figure: outputs with quantization and compilation applied)

It is possible to generate images within 24GB of VRAM without quantizing the T5 text encoder, but that would have slightly complicated our generation pipeline.
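
For reference, quantizing the T5 text encoder with NF4 might look roughly like the following sketch, using the bitsandbytes integration in transformers. The exact setup used for the numbers above lives in the accompanying repository and may differ.

import torch
from transformers import BitsAndBytesConfig, T5EncoderModel

# Load Flux's T5 text encoder (text_encoder_2) in NF4 via bitsandbytes.
text_encoder_2 = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    torch_dtype=torch.bfloat16,
)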

Now that we have a way to run the Flux.1-Dev pipeline with FP8 quantization on the RTX 4090, we can apply the previously established optimization recipe to optimize LoRA inference on the same hardware. FA3 is not supported on the RTX 4090, so we add the newly introduced T5 quantization to the mix and stick to the following optimization recipe:

  • FP8 quantization
  • torch.compile
  • Hotswapping-ready compilation
  • T5 quantization (with NF4)
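
Wiring these components together might look roughly like the sketch below, reusing the NF4-quantized text_encoder_2 from the previous sketch. The adapter identifiers are hypothetical placeholders, and the accompanying repository's exact flags may differ.

from diffusers import DiffusionPipeline, TorchAoConfig
from diffusers.quantizers import PipelineQuantizationConfig

# FP8 (torchao) transformer + NF4 T5 text encoder + hotswapping-ready compilation.
pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    text_encoder_2=text_encoder_2,  # NF4-quantized T5 from the sketch above
    torch_dtype=torch.bfloat16,
    quantization_config=PipelineQuantizationConfig(
        quant_mapping={"transformer": TorchAoConfig("float8dq_e4m3_row")}
    ),
).to("cuda")

pipe.enable_lora_hotswap(target_rank=max_rank)
pipe.load_lora_weights("<lora-repo-or-path-1>")  # hypothetical placeholder
pipe.transformer.compile(mode="max-autotune", fullgraph=True)
image = pipe(**pipe_kwargs).images[0]

# Hotswap a second LoRA without recompiling.
pipe.load_lora_weights("<lora-repo-or-path-2>", hotswap=True)  # hypothetical placeholder
image = pipe(**pipe_kwargs).images[0]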

The table below shows the inference latencies for different combinations of the above components.

Option | Key flags | Time (s) ⬇️ | Speedup (vs baseline) ⬆️
baseline | disable_fp8=False, disable_compile=True, quantize_t5=True, offload=False | 23.6060 | –
… | disable_fp8=False, disable_compile=False, quantize_t5=False, … | … | …

Quick Notes:

Compilation offers a massive ~2x speedup over the baseline. The other options resulted in OOM errors, even with offloading enabled.

This post outlined a recipe for optimizing fast LoRA inference with Flux and showed considerable speedups. Our approach combines Flash Attention 3, torch.compile, and FP8 quantization while ensuring hotswapping capability without recompilation issues. On high-end GPUs like the H100, this optimized setup delivers a 2.23x speedup over the baseline.

For consumer GPUs, specifically the RTX 4090, we addressed memory constraints by introducing NF4 quantization of the T5 text encoder and leveraging regional compilation. This comprehensive recipe achieved a substantial 2.04x speedup, enabling fluid and performant LoRA inference even when VRAM is limited. The key insight is that by carefully managing compilation and quantization, the benefits of LoRAs can be fully realized across a variety of hardware configurations.

We hope the recipes in this post encourage you to optimize your LoRA-based use cases and benefit from fast inference.

Resources

Below is a list of the important resources cited throughout this post.
