PRX Part 3 — Train a Text-to-Image Model in 24 Hours!

By versatileai, March 3, 2026

Welcome back 👋

In the last two posts (Part 1 and Part 2), we looked at a wide range of architectural and training tricks for diffusion models. We evaluated each idea individually, measuring throughput, convergence speed, and final image quality, to understand what would actually move the needle.

In this post, we want to answer a more practical question.

What happens when you combine all the tricks that worked?

Rather than optimizing one dimension at a time, we stack the most promising elements and see how far performance can be pushed under a tight compute budget.

To make things concrete, we’re doing a 24-hour speedrun.

32 H200 GPUs for 24 hours, roughly $1,500 of total compute ($2/GPU/hour)

This is a far cry from the early days of the field, when training a competitive model could cost millions of dollars. Our goal here is to demonstrate how far things have evolved, and how far you can go with careful engineering in just one day of training.
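As a sanity check on the headline number, the budget arithmetic is straightforward:

```python
# Back-of-the-envelope compute budget for the speedrun.
gpus = 32    # H200s
hours = 24   # one-day speedrun
rate = 2.0   # $ per GPU per hour, as quoted above

total_cost = gpus * hours * rate
print(total_cost)  # 1536.0, i.e. the ~$1,500 quoted above
```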

This speedrun is more than just a fun experiment. It will be the basis for future extensive training recipes.

In addition to the results, we have also open-sourced the code on GitHub, including:

  • The training code used for this speedrun
  • The experimental framework from the previous blog posts

So you can reproduce, modify, and extend everything yourself.

Training recipes

So let’s take a look at what this 24-hour run entailed.

x-prediction and training in pixel space

We use the x-prediction formulation from Back to Basics: Let Denoising Generative Models Denoise (Li and He, 2025). As explained in Part 2, this allows direct training in pixel space and completely eliminates the need for a VAE. We use a patch size of 32 and a 256-dimensional bottleneck in the first token projection layer. This design keeps the sequence length under control, making pixel-space training computationally manageable even at higher resolutions.

At 512px, the sequence length is:

(512 / 32)^2 = 256

At 1024px, the sequence length would be:

(1024 / 32)^2 = 1024
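The patchify-plus-bottleneck step can be sketched as follows. This is a minimal NumPy illustration, not the actual PRX code: the random matrix stands in for the learned 256-dimensional bottleneck projection.

```python
import numpy as np

def patchify(img, patch=32):
    # img: (H, W, C) -> (num_patches, patch*patch*C) raw patch tokens
    H, W, C = img.shape
    gh, gw = H // patch, W // patch
    x = img.reshape(gh, patch, gw, patch, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * C)

rng = np.random.default_rng(0)
img = rng.standard_normal((512, 512, 3))
tokens = patchify(img)                            # (256, 3072)
W_bottleneck = rng.standard_normal((3072, 256)) * 0.02  # stand-in for the learned layer
compressed = tokens @ W_bottleneck                # (256, 256): 256-dim bottleneck
print(tokens.shape, compressed.shape)
```

With patch size 32 a 512px image yields just 256 tokens, which is what keeps pixel-space attention affordable.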

Instead of following the usual 256px → 512px → 1024px schedule, we start directly at 512px and then finetune at 1024px.

With controlled token counts and modern hardware, training in pixel space is no longer prohibitive. It is simply a cleaner, more direct formulation.

Perceptual losses

One very nice side effect of predicting x0 directly in pixel space is that you can reuse the entire classic computer vision toolbox.

When a model outputs latents, perceptual supervision becomes awkward: you either have to decode back to pixels, or define a loss in the learned latent space, which may or may not match human perception. Predicting pixels directly makes everything easy again, and perceptual losses can be applied exactly as originally designed.

We take inspiration from the paper “PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss” (Ma et al.), where the authors introduce an additional perceptual objective alongside the diffusion loss. They show that adding perceptual signals can significantly improve convergence speed and final visual quality.

This 24-hour run adds two auxiliary losses.

The idea is simple: in addition to the standard flow-matching objective, we encourage the predicted clean image to match the target image in a perceptual feature space. LPIPS captures low-level perceptual similarity, while DINO features provide a stronger semantic signal.

The general idea is the same as in the paper, but we tweaked some details. In our experiments, we empirically found the following to be more effective:

  • Apply the perceptual loss to the pooled full image rather than to per-patch features.
  • Apply the perceptual loss at all noise levels.

These are implementation details, but our setup consistently gave better results.

We used a weight of 0.1 for the LPIPS loss and 0.01 for the DINO perceptual loss, consistent with the values recommended in the original paper.

These losses are lightweight compared to the forward pass of the main transformer, so our setup adds little overhead while providing consistent quality improvements.
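The overall objective can be sketched like this. The two feature extractors below are hypothetical stand-ins (in the real setup they would be LPIPS and a frozen DINO encoder); only the structure and the 0.1 / 0.01 weights come from the text above.

```python
import numpy as np

# Hypothetical stand-ins for the real perceptual feature extractors.
def lpips_features(img):
    return np.tanh(img.reshape(-1)[:64])

def dino_features(img):
    return np.tanh(img.reshape(-1)[::128][:32])

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def total_loss(pred_x0, target_x0, flow_loss, w_lpips=0.1, w_dino=0.01):
    # Flow-matching loss plus pooled-image perceptual terms.
    l_lpips = mse(lpips_features(pred_x0), lpips_features(target_x0))
    l_dino = mse(dino_features(pred_x0), dino_features(target_x0))
    return flow_loss + w_lpips * l_lpips + w_dino * l_dino

x = np.ones((64, 64))
print(total_loss(x, x, flow_loss=0.42))  # 0.42: perceptual terms vanish when pred == target
```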

Token routing with TREAD

To reduce the cost of each step, we use token routing with TREAD (Krause et al., 2025). TREAD randomly selects a subset of tokens, routes them past a contiguous span of transformer blocks, and reinjects them later, so no tokens are dropped.

We chose TREAD over SPRINT primarily for simplicity, and because we felt the extra complexity of SPRINT was not worth its modest additional compute savings in our setting (effective sequence length 64 vs. 128 for TREAD at 512px).

Following the TREAD recipe, we route 50% of the tokens past the blocks between the second and the penultimate block of the transformer.
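The routing mechanics can be sketched as follows. This is an illustrative NumPy toy, not the PRX implementation: blocks are plain callables, and the kept fraction and routed span are parameters.

```python
import numpy as np

def tread_forward(tokens, blocks, route_start=1, route_end=-1, keep_frac=0.5, rng=None):
    """TREAD-style token routing sketch: a random subset of tokens goes
    through the middle blocks; the rest bypass them and are reinjected
    at their original positions afterwards, so nothing is dropped."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = tokens.shape[0]
    keep = np.sort(rng.choice(n, size=int(n * keep_frac), replace=False))

    x = tokens
    for blk in blocks[:route_start]:           # dense prefix
        x = blk(x)
    out = x.copy()
    sub = x[keep]
    for blk in blocks[route_start:route_end]:  # middle blocks see only kept tokens
        sub = blk(sub)
    out[keep] = sub                            # reinject at original positions
    for blk in blocks[route_end:]:             # dense suffix
        out = blk(out)
    return out

blocks = [lambda x: x + 1.0] * 4
out = tread_forward(np.zeros((8, 2)), blocks)
print(sorted(set(out.ravel())))  # [2.0, 4.0]: bypassed tokens saw 2 fewer blocks
```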

Because routed models can look bad with vanilla CFG, especially when undertrained, we implemented a simple self-guidance scheme inspired by guided token-sparse diffusion models (Krause et al., 2025). It guides using the pair of dense and routed conditional predictions, rather than relying on an unconditional branch.
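One plausible form of this guidance (our reading, with the same algebra as CFG; the cited paper defines the exact formulation) extrapolates from the weaker routed prediction toward the stronger dense one:

```python
def self_guided(dense_cond, routed_cond, scale=2.0):
    # CFG-style extrapolation, with the routed conditional prediction
    # playing the role usually taken by the unconditional branch.
    return routed_cond + scale * (dense_cond - routed_cond)

print(self_guided(1.0, 0.5, scale=2.0))  # 1.5
```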

Aligning representations with REPA and DINOv3

We used REPA (Yu et al., 2024) for representation alignment.

For the teacher, we used DINOv3 (Siméoni et al., 2025), which gave the largest quality improvement in our previous experiments.

Specifically, we apply the alignment loss once, at the 8th transformer block, with a loss weight of 0.5.

Because we combine REPA with TREAD routing, we compute the alignment loss only for non-routed tokens, i.e. tokens that actually pass through the block where the loss is applied. This keeps the REPA signal consistent and avoids comparing features of tokens that skipped the computation.
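The masking is a one-liner in practice. A minimal sketch, assuming a cosine-similarity alignment objective and illustrative names (`kept_idx` are the indices of tokens that went through the alignment block):

```python
import numpy as np

def repa_loss(hidden, teacher_feats, kept_idx):
    """REPA-style alignment loss restricted to non-routed tokens.
    hidden: (N, D) student features at the alignment block;
    teacher_feats: (N, D) projected DINOv3 features (names illustrative)."""
    h = hidden[kept_idx]
    t = teacher_feats[kept_idx]
    # negative cosine similarity, averaged over the kept tokens
    h = h / np.linalg.norm(h, axis=-1, keepdims=True)
    t = t / np.linalg.norm(t, axis=-1, keepdims=True)
    return float(-np.mean(np.sum(h * t, axis=-1)))

feats = np.ones((4, 8))
print(repa_loss(feats, feats, kept_idx=[0, 2]))  # -1.0 when perfectly aligned
```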

Optimizer: Muon

We used the Muon optimizer with the FSDP implementation from muon_fsdp_2, as we saw a clear improvement over Adam in previous runs.

Muon is applied only to 2D parameters (essentially, the weight matrices). Everything else (biases, norms, embeddings, etc.) is optimized with Adam, so the setup has two parameter groups:

| Group | Applies to | Key parameters |
| --- | --- | --- |
| Muon | 2D parameters | lr=1e-4, momentum=0.95, nesterov=true, ns_steps=5 |
| Adam | All non-2D parameters | lr=1e-4, betas=(0.9, 0.95), eps=1e-8 |
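The split itself is just a filter on parameter dimensionality. A pure-Python sketch (each "parameter" here is a hypothetical (name, shape) pair, not real model state):

```python
# Split parameters into Muon (2D matrices) vs. Adam (everything else).
params = [
    ("attn.qkv.weight", (768, 2304)),  # 2D -> Muon
    ("mlp.fc1.weight", (768, 3072)),   # 2D -> Muon
    ("norm.weight", (768,)),           # 1D -> Adam
    ("pos_embed", (1, 256, 768)),      # 3D -> Adam
]

muon_group = [name for name, shape in params if len(shape) == 2]
adam_group = [name for name, shape in params if len(shape) != 2]
print(muon_group, adam_group)
```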

Training settings

We trained using three publicly available synthetic datasets.

The schedule is basically: a fast pass at 512px, then sharpening at 1024px.

  • 512px for 100k steps with batch size 1024
  • 1024px for 20k steps with batch size 512, without REPA

We also keep the EMA of the weights for sampling and evaluation.

smoothing = 0.999
update_interval = 10ba
ema_start = 0ba
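The EMA update is the standard exponential moving average, refreshed every `update_interval` batches. A minimal sketch with scalar "weights" (illustrative, not the training-loop code):

```python
def ema_update(ema, weights, smoothing=0.999):
    # Exponential moving average of model weights.
    return [smoothing * e + (1.0 - smoothing) * w for e, w in zip(ema, weights)]

# With update_interval = 10, the EMA is refreshed every 10 batches:
ema, weights = [0.0], [1.0]
for step in range(1, 101):
    if step % 10 == 0:
        ema = ema_update(ema, weights)
print(ema)  # creeps slowly toward the current weights
```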

Results and conclusions

Below are the evaluation curves we tracked throughout the run and some sample grids from the final checkpoint.

This is already a pretty solid place to land for a single day of training. The model isn’t perfect yet (you can spot texture glitches and the occasional bit of weird anatomy, and it can get a little wonky on really difficult prompts), but it’s clearly usable. Prompt following is strong, the overall aesthetic is consistent, and the 1024px stage mostly does what we wanted: sharpening details without breaking the composition.

The important point is that we are very close. The remaining issues look more like undertraining artifacts and data-diversity limitations than symptoms of structural flaws in the recipe. The failure modes are consistent with what you would expect from a model that has not yet seen enough diverse data. With more compute and broader coverage, this exact setup should keep improving in a fairly predictable way.

Zooming out, this speedrun highlights how far diffusion training has come. By combining pixel-space training, efficient token routing, representation alignment, and lightweight perceptual supervision, it is now possible to train a meaningful model in about a day, on a budget that would have seemed unrealistic not long ago.

What’s next?

This 24-hour run is just a starting point, not a finish line. Next, we will keep pushing the same recipe with a bit more scale, iterating on dataset combinations and captions.

All code and configuration behind this Speedrun, as well as the complete experimental framework used in parts 1 and 2, is available in the PRX repository: https://github.com/Photoroom/PRX.

The exact training datasets used in this run are not redistributed, but the pipeline is fully configurable and designed to be easily adapted to your own data. You can plug in different datasets, tune individual components (TREAD, REPA, perceptual losses, Muon, etc.), and run controlled experiments with minimal friction. Our goal is to make this a hands-on playground for fast-iteration diffusion research, and we hope the community will use it to explore, benchmark, and iterate on these techniques in their own settings.

If you have read this far, thank you. We’d also love for you to join our Discord community, where we share our progress and results on PRX and discuss diffusion and text-to-image research.

That’s all for now. Stay tuned for the next round of experiments! 🚀
