Why diffuse text?
Although the AI research community has explored diffusion-based text generation for years, applying it to large-scale models has remained a challenge. DiffusionGemma addresses this by changing how the model uses the hardware.
Trade-offs with traditional models
Most language models behave like typewriters, producing one token at a time from left to right. In the cloud, this is efficient because servers can batch thousands of user requests together and share the hardware load. However, when run locally for a single user, this token-by-token process does not fully utilize the dedicated GPU or TPU, which spends most of its time simply waiting for the next “keystroke.”
DiffusionGemma addresses this inefficiency. Rather than predicting tokens sequentially, it drafts entire paragraphs of 256 tokens in parallel. By handing your computer’s processor a large amount of work at once, DiffusionGemma makes fuller use of your hardware. This turns inference from a sequential typewriter into a giant printing press that stamps entire blocks of text simultaneously.
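
To make the typewriter versus printing press contrast concrete, here is a rough back-of-the-envelope sketch that counts sequential forward passes for each approach. Only the 256-token block size comes from the text above. The function names and the `steps_per_block` value (how many refinement passes a block needs) are hypothetical and chosen purely for illustration; they are not taken from DiffusionGemma itself.

```python
import math

BLOCK_SIZE = 256  # tokens drafted together, as stated in the text


def autoregressive_passes(num_tokens: int) -> int:
    """A token-by-token model needs one sequential forward pass per token."""
    return num_tokens


def block_diffusion_passes(num_tokens: int, steps_per_block: int,
                           block_size: int = BLOCK_SIZE) -> int:
    """Each block is refined over `steps_per_block` passes.

    `steps_per_block` is an assumed, illustrative parameter.
    """
    blocks = math.ceil(num_tokens / block_size)
    return blocks * steps_per_block


def tokens_per_pass(num_tokens: int, passes: int) -> float:
    """Average number of tokens produced per sequential pass."""
    return num_tokens / passes


if __name__ == "__main__":
    n = 1024
    ar = autoregressive_passes(n)
    bd = block_diffusion_passes(n, steps_per_block=16)  # 16 is an assumption
    print(f"autoregressive:  {ar} passes, {tokens_per_pass(n, ar):.1f} tokens/pass")
    print(f"block diffusion: {bd} passes, {tokens_per_pass(n, bd):.1f} tokens/pass")
```

Note that pass counts are not wall-clock time: each block-level pass processes 256 positions and therefore does more arithmetic than a single-token pass. The point of the sketch is only that the work arrives in fewer, larger batches, which is the kind of workload a dedicated GPU or TPU is built to handle.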

