Welcome back 👋
In the last two posts (Part 1 and Part 2), we looked at a wide range of tricks for the architecture and training of diffusion models. We evaluated each idea in isolation, measuring throughput, convergence speed, and final image quality, to understand what would actually move the needle.
In this post, we want to answer a more practical question: what happens when you combine all the tricks that worked?
Rather than optimizing one dimension at a time, we stack the most promising elements and see how far performance can be pushed under a tight compute budget.
To make things concrete, we’re doing a 24-hour speedrun.
32× H200 GPUs for 24 hours, roughly $1,500 of total compute (at $2/GPU/hour).
This is a far cry from the early days of diffusion models, when training a competitive model could cost millions of dollars. Our goal here is to demonstrate how far the field has evolved, and how far you can go with careful engineering in just one day of training.
This speedrun is more than just a fun experiment: it will serve as the basis for more extensive training recipes in the future.
In addition to the results, we have also open-sourced the code on GitHub, including:

- The training code used for this speedrun
- The experimental framework from the previous blog posts
- Training recipes

so you can reproduce, modify, and extend everything yourself.
So let’s take a look at what this 24-hour run entailed.
x-prediction and training in pixel space
We use the x-prediction formulation from Back to Basics: Let Denoising Generative Models Denoise (Li and He, 2025). As explained in Part 2, this allows training directly in pixel space and completely eliminates the need for a VAE. We use a patch size of 32 and a 256-dimensional bottleneck in the first token projection layer. This design keeps the sequence length under control, making pixel-space training computationally manageable even at higher resolutions.
The sequence length at 512px is:

(512 / 32)² = 256

At 1024px, the sequence length would be:

(1024 / 32)² = 1024
Instead of following the usual 256px → 512px → 1024px schedule, we start directly at 512px and then fine-tune at 1024px.
With controlled token counts and modern hardware, training in pixel space is no longer prohibitive. It’s simply a cleaner, more direct formulation.
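The token-count arithmetic above is simple enough to sketch directly:

```python
def seq_len(resolution: int, patch: int = 32) -> int:
    """Number of tokens after patchifying a square image."""
    assert resolution % patch == 0
    return (resolution // patch) ** 2

print(seq_len(512))   # 256 tokens at 512px
print(seq_len(1024))  # 1024 tokens at 1024px
```

At 1024px the sequence is only 4× longer than at 512px, which is what keeps the two-stage schedule affordable.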
Perceptual losses
One very nice side effect of predicting x₀ and working directly in pixel space is that you can reuse the entire classic computer vision toolbox.
When a model outputs latents, perceptual supervision becomes awkward: you either need to decode back to pixels, or define a loss in the learned latent space, which may or may not match human perception. Predicting pixels directly makes everything easy again: perceptual losses can be incorporated exactly as originally designed.
We take inspiration from the paper “PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss” (Ma et al.), where the authors introduce an additional perceptual objective alongside the diffusion loss. They show that adding perceptual signals can significantly improve convergence speed and final visual quality.
For this 24-hour run, we add two auxiliary losses: an LPIPS loss and a DINO-feature perceptual loss.
The idea is simple: in addition to the standard flow-matching objective, we encourage the predicted clean image to match the target image in a perceptual feature space. LPIPS captures low-level perceptual similarity, while DINO features provide a stronger semantic signal.
The general idea is the same as in the paper, but we tweaked some details. In our experiments, we empirically found the following to be more effective:

- Applying the perceptual loss to the pooled full image, rather than to per-patch features.
- Applying the perceptual loss at all noise levels.

These are implementation details, but they consistently gave better results in our setup.
We used a weight of 0.1 for the LPIPS loss and 0.01 for the DINO perceptual loss, consistent with the values recommended in the original paper.
These losses are lightweight compared to the forward pass of the main transformer, so they add little overhead while providing consistent quality improvements.
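As a sketch of how the three terms might combine, here `lpips_fn` and `dino_fn` are hypothetical placeholders for a real LPIPS network and a DINOv3 feature extractor, not the actual PRX API:

```python
import torch
import torch.nn.functional as F

W_LPIPS, W_DINO = 0.1, 0.01  # weights from the text above

def training_loss(x0_pred, x0, lpips_fn, dino_fn):
    """Combine the main denoising objective with two perceptual terms.

    With x-prediction the model outputs a clean image directly, so the
    perceptual losses apply to `x0_pred` exactly as originally designed.
    `lpips_fn` and `dino_fn` are hypothetical stand-ins.
    """
    main = F.mse_loss(x0_pred, x0)                   # pixel-space reconstruction term
    perc = lpips_fn(x0_pred, x0)                     # low-level perceptual similarity
    sem = F.mse_loss(dino_fn(x0_pred), dino_fn(x0))  # semantic features, pooled over the image
    return main + W_LPIPS * perc + W_DINO * sem
```

Because both auxiliary terms are applied at every noise level, the weighting keeps them from dominating the main objective early in training.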
Token routing with TREAD
To reduce the cost of each step, we use token routing with TREAD (Krause et al., 2025). This randomly selects a subset of the tokens, lets them bypass a contiguous chunk of transformer blocks, and re-injects them later, so no information is dropped.
We chose TREAD over SPRINT primarily for simplicity, and because we felt SPRINT’s extra complexity wasn’t worth the modest additional compute savings in our setting (a routed sequence length of 64 for SPRINT vs. 128 for TREAD at 512px).
Following the TREAD recipe, we route 50% of the tokens past everything from the second block to the penultimate block of the transformer.
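A minimal sketch of the routing mechanics, assuming a gather/scatter implementation (the real TREAD/PRX code differs in details):

```python
import torch

def tread_route(tokens: torch.Tensor, keep_ratio: float = 0.5):
    """Randomly pick tokens to keep; the rest bypass the routed blocks.

    Returns the kept tokens plus the indices needed to merge the
    processed tokens back afterwards. A sketch of the TREAD idea,
    not the exact PRX implementation.
    """
    B, N, D = tokens.shape
    n_keep = int(N * keep_ratio)
    perm = torch.rand(B, N).argsort(dim=1)
    keep_idx = perm[:, :n_keep]   # processed by the routed blocks
    skip_idx = perm[:, n_keep:]   # bypass, re-injected later
    kept = tokens.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return kept, keep_idx, skip_idx

def tread_merge(full: torch.Tensor, processed: torch.Tensor, keep_idx: torch.Tensor):
    """Write processed tokens back into their original positions."""
    out = full.clone()
    out.scatter_(1, keep_idx.unsqueeze(-1).expand(-1, -1, full.size(-1)), processed)
    return out
```

Nothing is dropped: the bypassed tokens re-enter the sequence unchanged, they simply skip the expensive middle of the network.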
Because routed models can behave poorly with vanilla CFG, especially when undertrained, we implemented a simple self-guidance scheme inspired by guided token-sparse diffusion models (Krause et al., 2025): it uses the dense/routed pair of conditional predictions for guidance, rather than relying on an unconditional branch.
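One plausible formulation of that guidance rule, stated with scalars for clarity (this is our reading of the scheme, not the paper's exact equation):

```python
def self_guided_pred(dense_cond, routed_cond, w):
    """Extrapolate from the routed (weaker) conditional prediction toward
    the dense conditional prediction, instead of contrasting against an
    unconditional branch as in vanilla CFG.
    With w = 1 this reduces to the dense prediction alone."""
    return routed_cond + w * (dense_cond - routed_cond)

print(self_guided_pred(2.0, 1.0, 2.0))  # 3.0
```

In practice both predictions come from the same model, with and without token routing, so no separate unconditional forward pass is needed.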
Aligning representations with REPA and DINOv3
We used REPA (Yu et al., 2024) for representation alignment.
As the teacher, we used DINOv3 (Siméoni et al., 2025), which gave the largest quality improvement in our previous experiments.
Specifically, we apply the alignment loss once, at the 8th transformer block, with a loss weight of 0.5.
Because we combine REPA with TREAD routing, we only compute the alignment loss for non-routed tokens, i.e. tokens that actually pass through the block where the loss is applied. This keeps the REPA signal consistent and avoids comparing features of tokens that skipped the computation.
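The interaction between REPA and routing can be sketched as follows; the tensor names and the negative-cosine form are assumptions for illustration, and the 0.5 loss weight from above would be applied to the result:

```python
import torch
import torch.nn.functional as F

def repa_loss(hidden, teacher_feats, keep_idx, proj):
    """REPA-style alignment on non-routed tokens only (hypothetical sketch).

    hidden:        (B, N_keep, D) activations at the alignment block, for
                   tokens that actually passed through it
    teacher_feats: (B, N, D_t) DINOv3 features for all tokens
    keep_idx:      (B, N_keep) indices of the non-routed tokens
    proj:          learned head mapping model dim D to teacher dim D_t
    """
    d_t = teacher_feats.size(-1)
    # select only the teacher features for tokens that were not routed away
    target = teacher_feats.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, d_t))
    # negative cosine similarity between projected activations and teacher features
    return -F.cosine_similarity(proj(hidden), target, dim=-1).mean()
```

Restricting the loss to `keep_idx` is what prevents the alignment term from penalizing tokens whose features never went through block 8.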
Optimizer: Muon
We used the Muon optimizer via the FSDP implementation from muon_fsdp_2, as it showed a clear improvement over Adam in our previous runs.
Muon only applies to 2D parameters (essentially weight matrices); everything else (biases, norms, embeddings, etc.) is optimized with Adam. Our setup therefore has two parameter groups:
| Group | Applies to | Key parameters |
|---|---|---|
| Muon | 2D parameters | lr=1e-4, momentum=0.95, nesterov=true, ns_steps=5 |
| Adam | All non-2D parameters | lr=1e-4, betas=(0.9, 0.95), eps=1e-8 |
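The split can be sketched as a simple partition by parameter dimensionality (a sketch only; the actual run uses the muon_fsdp_2 implementation):

```python
import torch.nn as nn

def build_param_groups(model: nn.Module):
    """Split parameters into the two groups from the table above:
    2-D weight matrices go to Muon, everything else to Adam."""
    muon, adam = [], []
    for p in model.parameters():
        if p.requires_grad:
            (muon if p.ndim == 2 else adam).append(p)
    return (
        dict(params=muon, lr=1e-4, momentum=0.95, nesterov=True, ns_steps=5),
        dict(params=adam, lr=1e-4, betas=(0.9, 0.95), eps=1e-8),
    )
```

Each dict can then be handed to the corresponding optimizer's constructor.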
Training settings
We trained using three publicly available synthetic datasets.
The schedule is simple: go fast at 512px, then sharpen at 1024px.

- 100k steps at 512px with batch size 1024
- 20k steps at 1024px with batch size 512, without REPA

We also keep an EMA of the weights for sampling and evaluation:
smoothing = 0.999
update_interval = 10ba
ema_start = 0ba
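The EMA update itself is a one-liner per parameter; a minimal sketch of the rule implied by the settings above:

```python
import torch

@torch.no_grad()
def ema_update(ema_params, model_params, smoothing=0.999):
    """In-place EMA of the weights. With the settings above this runs
    every 10 batches (update_interval = 10ba), starting from batch 0."""
    for e, p in zip(ema_params, model_params):
        e.mul_(smoothing).add_(p, alpha=1.0 - smoothing)
```

The EMA copy, not the raw weights, is what we sample from and evaluate.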
Results and conclusions
Below are the evaluation curves we tracked throughout the run and some sample grids from the final checkpoint.
This is already a pretty strong place to land after a single day of training. The model isn’t perfect yet (you can spot texture glitches, the occasional weird anatomy, and it can get a little wonky on really difficult prompts), but it’s clearly usable. Prompt following is strong, the overall aesthetic is consistent, and the 1024px stage accomplishes what we wanted from it: sharpening details without breaking the composition.
The important point is how close we are. The remaining issues look more like under-training artifacts and data diversity limitations than symptoms of structural flaws in the recipe. The failure modes are consistent with what you would expect from a model that has not yet seen enough diverse data. With more compute and broader coverage, this exact setup should continue to improve in a fairly predictable manner.
Zooming out, this speedrun highlights how far diffusion training has come. By combining pixel-space training, efficient token routing, representation alignment, and lightweight perceptual guidance, it is now possible to train a meaningful model in about a day, on a budget that would have seemed unrealistic not long ago.
What’s next?
This 24-hour run is a starting point, not a finish line. Next, we’ll keep pushing the same recipe: a bit more scale, and iterating on dataset combinations and captions.
All code and configuration behind this Speedrun, as well as the complete experimental framework used in parts 1 and 2, is available in the PRX repository: https://github.com/Photoroom/PRX.
The exact training datasets used in this run are not redistributed, but the pipeline is fully configurable and designed to be easily adapted to your own data. You can plug in different datasets, tune individual components (TREAD, REPA, perceptual losses, Muon, etc.), and run controlled experiments with minimal friction. Our goal is to make this a hands-on playground for fast diffusion research, and we hope the community will use it to explore, benchmark, and iterate on these techniques in their own settings.
If you have read this far, thank you! We’d also love for you to join our Discord community, where we share our progress and results with PRX and discuss diffusion and text-to-image research.
That’s all for now. Stay tuned for the next round of experiments! 🚀

