The rapidly evolving landscape of large language models (LLMs) has focused primarily on decoder-only architectures. While these models have demonstrated impressive capabilities across a wide range of generation tasks, the classic encoder-decoder architecture, exemplified by T5 (Text-to-Text Transfer Transformer), remains a popular choice for many real-world applications. Encoder-decoder models often excel at tasks such as summarization, translation, and QA thanks to their inference efficiency, design flexibility, and the richer encoder representations they build of the input. Yet powerful encoder-decoder architectures have received comparatively little attention.
Today, we revisit this architecture and introduce T5Gemma, a new collection of encoder-decoder LLMs developed by converting pre-trained decoder-only models into the encoder-decoder architecture through a technique called adaptation. T5Gemma is built on the Gemma 2 framework and includes adapted Gemma 2 2B and 9B models as well as a set of newly trained T5-sized models (Small, Base, Large, and XL). We are excited to release the pre-trained and instruction-tuned T5Gemma models to the community to open up new opportunities for research and development.
From decoder-only to encoder-decoder
T5Gemma asks the following question: can we build a top-tier encoder-decoder model from a pre-trained decoder-only model? We answer it with a technique called model adaptation. The core idea is to initialize the parameters of an encoder-decoder model with the weights of an already pre-trained decoder-only model, and then further adapt them through UL2- or PrefixLM-based pre-training.
Overview of our approach: a new encoder-decoder model is initialized with the parameters of a pretrained decoder-only model.
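To make the idea concrete, here is a minimal sketch of the initialization step. It assumes the decoder-only checkpoint is available as a flat dict of named weight tensors; the parameter names and the choice to seed cross-attention from self-attention are illustrative, not the exact recipe from the paper.

```python
# Minimal sketch of the adaptation initialization (not the actual T5Gemma training code).
# Assumes `decoder_only_weights` is a flat dict like {"layers.0.self_attn.q_proj": tensor, ...}.

def init_encoder_decoder_from_decoder_only(decoder_only_weights):
    """Build initial encoder-decoder weights from a pretrained decoder-only model.

    Self-attention and feed-forward weights seed both the encoder and decoder stacks.
    Cross-attention, which the decoder-only model lacks, is seeded here from
    self-attention as a simple stand-in. The adapted model is then further
    pre-trained (e.g. with a PrefixLM or UL2 objective) so the new parts specialize.
    """
    encoder, decoder = {}, {}
    for name, weight in decoder_only_weights.items():
        # Both stacks start from the same pretrained weights.
        encoder[f"encoder.{name}"] = weight
        decoder[f"decoder.{name}"] = weight
        # Hypothetical naming: seed cross-attention from the matching self-attention weight.
        if "self_attn" in name:
            decoder[f"decoder.{name.replace('self_attn', 'cross_attn')}"] = weight
    return {**encoder, **decoder}
```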
This adaptation method is highly flexible and allows for creative combinations of model sizes. For example, pairing a large encoder with a small decoder (say, a 9B encoder with a 2B decoder) creates an "unbalanced" model. This lets you tune the quality-efficiency trade-off for specific tasks, such as summarization, where a deep understanding of the input matters more than the complexity of the generated output.
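Continuing the sketch above, an unbalanced pairing could in principle be assembled by taking the encoder half from a larger checkpoint and the decoder half from a smaller one. The variables gemma_9b_weights and gemma_2b_weights below are hypothetical placeholders for loaded decoder-only state dicts, and bridging the two different hidden sizes in cross-attention is omitted.

```python
# Sketch of an "unbalanced" 9B-encoder / 2B-decoder pairing under the same assumptions.
# `gemma_9b_weights` and `gemma_2b_weights` are hypothetical placeholders for loaded
# decoder-only state dicts; reshaping cross-attention to bridge hidden sizes is omitted.
enc = {k: v for k, v in init_encoder_decoder_from_decoder_only(gemma_9b_weights).items()
       if k.startswith("encoder.")}
dec = {k: v for k, v in init_encoder_decoder_from_decoder_only(gemma_2b_weights).items()
       if k.startswith("decoder.")}
unbalanced_9b_2b = {**enc, **dec}
```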
Toward improving the trade-off between quality and efficiency
How does T5Gemma perform?
In our experiments, the T5Gemma models achieve performance comparable to or better than their decoder-only Gemma counterparts, nearly dominating the quality versus inference-efficiency Pareto frontier across several benchmarks, including SuperGLUE, which measures the quality of the learned representations.

Encoder-decoder models consistently deliver superior performance for a given amount of inference compute, leading the quality-efficiency frontier across a variety of benchmarks.
This performance advantage is not just theoretical; it translates into real-world quality and speed. When we measured actual latency on GSM8K (math reasoning), T5Gemma showed a clear win. For example, T5Gemma 9B-9B achieves higher accuracy than Gemma 2 9B at similar latency. Even more impressively, T5Gemma 9B-2B delivers a significant accuracy boost over the 2B-2B model while its latency stays close to that of the much smaller Gemma 2 2B. Ultimately, these experiments show that encoder-decoder adaptation offers a flexible and powerful way to balance quality and inference speed.
Unlocking foundational and fine-tuned capabilities
Can an encoder-decoder LLM match the capabilities of a decoder-only model?
Yes: T5Gemma shows promising capabilities both before and after instruction tuning.
After pre-training, T5Gemma achieves impressive results on complex tasks that require reasoning. For example, T5Gemma 9B-9B scores over 9 points higher on GSM8K (math reasoning) and over 4 points higher on DROP (reading comprehension) than the original Gemma 2 9B model. This pattern suggests that initializing an encoder-decoder architecture through adaptation can yield a more capable and performant base model.

Detailed results for pre-trained models, showing how our adapted models improve significantly over decoder-only Gemma 2 on several reasoning-intensive benchmarks.
These foundational improvements from pre-training set the stage for even more dramatic gains after instruction tuning. For example, comparing Gemma 2 IT with T5Gemma IT, the performance gap widens significantly: T5Gemma 2B-2B IT gains nearly 12 MMLU points over Gemma 2 2B, and its GSM8K score rises from 58.0% to 70.7%. The adapted architecture not only provides a better starting point but also responds more effectively to instruction tuning, ultimately producing a substantially more capable and useful final model.

Detailed results for instruction-tuned + RLHF models, showing that post-training significantly amplifies the performance benefits of the encoder-decoder architecture.
Explore the models: T5Gemma checkpoint release
We are very excited to introduce this new way of building powerful general-purpose encoder-decoder models by adapting pre-trained decoder-only LLMs like Gemma 2. To accelerate further research and enable the community to build on this work, we are pleased to release a suite of T5Gemma checkpoints.
The release includes:
- Multiple sizes: checkpoints for T5-sized models (Small, Base, Large, and XL), Gemma 2-based models (2B and 9B), and an additional model sized between T5 Large and T5 XL.
- Multiple variants: pre-trained and instruction-tuned models.
- Flexible configurations: a powerful and efficient unbalanced 9B-2B checkpoint for exploring the trade-off between encoder and decoder sizes.
- Different training objectives: models trained with either the PrefixLM objective, for strong generative performance, or the UL2 objective, for high-quality representations.
We hope these checkpoints will be a valuable resource for exploring model architecture, efficiency, and performance.
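As an illustration of how one of these checkpoints might be used, the sketch below assumes they are published in a Hugging Face transformers-compatible format; the model ID shown is a placeholder, not a confirmed checkpoint name.

```python
# Hypothetical usage sketch; the checkpoint ID below is a placeholder, not a confirmed name.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "google/t5gemma-9b-2b"  # placeholder for an unbalanced encoder-decoder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Encoder-decoder models read the full input with the encoder and generate with the decoder.
inputs = tokenizer("Summarize: The quick brown fox jumped over the lazy dog.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```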
Get started with T5Gemma
We can’t wait to see what you build with T5Gemma. For more information, see the following links:
- Read the paper to learn about the research behind this project.
- Explore the models’ capabilities and fine-tune them for your own use case using the Colab notebooks.

