The first Gemma model launched early last year and has since grown into a thriving Gemmaverse with over 160 million total downloads. This ecosystem includes a family of more than a dozen specialized models for everything from safeguarding to medical applications and, most inspiring of all, countless innovations from the community. From innovators like Roboflow building enterprise computer vision to Tokyo University of Science creating a high-performance Japanese Gemma variant, your work has shown us the path forward.
Building on this incredible momentum, we're excited to announce the full release of Gemma 3n. While last month's preview offered a glimpse, today unlocks the full power of this mobile-first architecture. Gemma 3n is designed for the developer community that helped shape Gemma. Supported by your favorite tools, including Hugging Face Transformers, llama.cpp, Google AI Edge, Ollama, and MLX, it's easy to fine-tune and deploy for your specific on-device applications. This post is the developer deep dive: we explore some of the innovations behind Gemma 3n, share new benchmark results, and show you how to start building today.
What’s new with Gemma 3n?
Gemma 3n represents a major advance in on-device AI, bringing powerful multimodal capabilities to edge devices with performance previously only seen in last year's cloud-based frontier models.
Achieving this leap in on-device performance required a fundamental rethinking of the model. The foundation is Gemma 3n’s unique mobile-first architecture, and it all starts with MatFormer.
MatFormer: One model, different sizes
At the core of Gemma 3n is the MatFormer (🪆Matryoshka Transformer) architecture, a novel nested transformer built for elastic inference. Think of it like matryoshka dolls: a larger model contains smaller, fully functional versions of itself. This approach extends the concept of Matryoshka Representation Learning from just embeddings to all transformer components.
As shown in the figure above, during the MatFormer training of the 4B effective parameter (E4B) model, a 2B effective parameter (E2B) sub-model is simultaneously optimized within it. This provides developers with two powerful features and use cases:
1: Pre-extracted models: You can directly download and use either the main E4B model for the best capabilities, or the standalone E2B sub-model, which we have already extracted for you and which offers up to 2x faster inference (see the loading sketch after the chart below).
2: Custom sizes with Mix-n-Match: For more control to suit your specific hardware constraints, you can create a custom-sized spectrum of models between E2B and E4B using a method called Mix-n-Match. This technique allows us to precisely slice the parameters of the E4B model, mainly by adjusting the hidden dimensions (8192 to 16384) of the feedforward network per layer and selectively skipping some layers. We are releasing MatFormer Lab, a tool that shows how to obtain these optimal models, identified by evaluating different settings on benchmarks such as MMLU.
MMLU scores for pre-trained Gemma 3n checkpoints at various model sizes (using Mix-n-Match)
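If you want to try the pre-extracted checkpoints (option 1 above), here is a minimal sketch using Hugging Face Transformers. The Hub model IDs and the text-generation task are assumptions, so check the official Gemma 3n collection for the exact repository names and recommended loading code.

```python
# Minimal sketch: loading a pre-extracted Gemma 3n checkpoint with Transformers.
# The Hub IDs below are assumptions -- verify them on the official Gemma 3n collection.
import torch
from transformers import pipeline

model_id = "google/gemma-3n-E2B-it"   # standalone E2B sub-model; swap in the E4B ID for best quality

generator = pipeline(
    "text-generation",
    model=model_id,
    device_map="auto",            # place weights on the available accelerator
    torch_dtype=torch.bfloat16,   # half precision keeps the footprint small
)

messages = [{"role": "user", "content": "Explain MatFormer in one sentence."}]
print(generator(messages, max_new_tokens=64)[0]["generated_text"])
```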
Looking ahead, the MatFormer architecture also paves the way for flexible execution. Although not part of the implementation launched today, this feature allows a single deployed E4B model to dynamically switch between E4B and E2B inference paths on the fly, optimizing performance and memory usage in real-time based on the current task and device load.
Per-layer embeddings (PLE): Improving memory efficiency
Gemma 3n models incorporate per-layer embeddings (PLE). This innovation is tailored for on-device deployment and dramatically improves model quality without increasing the high-speed memory footprint required by your device's accelerator (GPU/TPU).
While the total parameter counts for the Gemma 3n E2B and E4B models are 5B and 8B respectively, PLE allows a significant portion of these parameters (the embeddings associated with each layer) to be loaded onto and computed efficiently on the CPU. This means only the core transformer weights (roughly 2B for E2B and 4B for E4B) need to sit in the typically more constrained accelerator memory (VRAM).

Per-layer embeddings allow you to use Gemma 3n E2B while loading only ~2B parameters into the accelerator.
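As a back-of-the-envelope illustration of this accounting, the sketch below estimates the accelerator weight footprint for E2B when the per-layer embeddings stay in host memory; the split is derived from the ~5B total and ~2B core figures above and is for illustration only.

```python
# Illustrative arithmetic only: how PLE shrinks the accelerator-resident weights for E2B.
BYTES_PER_PARAM = 2                              # bf16 / fp16 weights

total_params = 5e9                               # full E2B parameter count (~5B)
core_params = 2e9                                # core transformer weights kept in accelerator memory (~2B)
ple_params_on_cpu = total_params - core_params   # per-layer embeddings offloaded to the CPU (~3B)

print(f"Accelerator weight footprint: {core_params * BYTES_PER_PARAM / 1e9:.0f} GB")
print(f"Host-side (CPU) embeddings:   {ple_params_on_cpu * BYTES_PER_PARAM / 1e9:.0f} GB")
```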
KV cache sharing: speeding up long context processing
Processing long inputs, such as sequences derived from audio or video streams, is essential to many advanced on-device multimodal applications. Gemma 3n introduces KV cache sharing, a feature designed to significantly reduce time to first token for streaming response applications.
KV cache sharing optimizes how the model handles the initial input processing stage (often called the “prefill” phase). The keys and values of the middle layers from local and global attention are shared directly with all the top layers, delivering a notable 2x improvement in prefill performance compared to Gemma 3 4B. In practice, this means the model can ingest and understand long prompt sequences much faster than before.
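To make the mechanism concrete, here is a purely conceptual sketch of prefill with a shared KV cache. It is not the actual Gemma 3n implementation, and project_kv is a hypothetical per-layer projection, but it shows how the layers above the sharing point reuse the middle layer's keys and values instead of recomputing them.

```python
# Conceptual sketch of KV cache sharing during prefill -- not the real implementation.
from dataclasses import dataclass, field

@dataclass
class KVCache:
    keys: list = field(default_factory=list)
    values: list = field(default_factory=list)

def prefill(layers, prompt_tokens, share_from_layer):
    """Build one KV cache per layer; layers above `share_from_layer` reuse its cache."""
    caches, shared = [], None
    for i, layer in enumerate(layers):
        if shared is not None:
            caches.append(shared)          # top layers reuse the middle layer's keys/values
            continue
        cache = KVCache()
        for tok in prompt_tokens:
            k, v = layer.project_kv(tok)   # hypothetical per-layer KV projection
            cache.keys.append(k)
            cache.values.append(v)
        caches.append(cache)
        if i == share_from_layer:
            shared = cache                 # everything above this layer skips KV computation
    return caches
```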
Understanding speech: introducing speech-to-text and translation
Gemma 3n uses an advanced audio encoder based on the Universal Speech Model (USM). The encoder generates a token for roughly every 160 ms of audio (about 6 tokens per second), and these tokens are integrated as input to the language model, providing a granular representation of the sound context.
This integrated audio functionality enables key features for on-device development, including:
Automatic speech recognition (ASR): enables high-quality speech-to-text transcription directly on the device.
Automatic speech translation (AST): translates spoken language into text in another language.
We have seen particularly strong AST results for translation between English and Spanish, French, Italian, and Portuguese, offering great potential for developers targeting applications in those languages. For tasks such as speech translation, chain-of-thought prompting can greatly improve results. For example:
User: Transcribe the following audio segment in Spanish, then translate it into English:
Model:
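Below is a minimal sketch of running that prompt against a local audio file with Hugging Face Transformers. The auto classes, message schema, model ID, and file name are assumptions based on the standard multimodal chat-template workflow, so consult the official model card for the exact integration details.

```python
# Minimal sketch: Spanish speech -> transcription + English translation.
# Model ID, auto classes, and message schema are assumptions; see the model card.
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "google/gemma-3n-E4B-it"                      # assumed Hub ID
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")

messages = [{
    "role": "user",
    "content": [
        {"type": "audio", "audio": "clip_es.wav"},       # hypothetical <=30 s Spanish clip
        {"type": "text", "text": "Transcribe the following audio segment in Spanish, "
                                 "then translate it into English:"},
    ],
}]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]  # keep only the model's reply
print(processor.decode(new_tokens, skip_special_tokens=True))
```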
At launch, the Gemma 3n encoder is implemented to process audio clips of up to 30 seconds. However, this is not a fundamental limitation: the underlying audio encoder is a streaming encoder that can process arbitrarily long audio with additional long-form audio training. Follow-up implementations will unlock low-latency, long-running streaming applications.
MobileNet-V5: The new state-of-the-art vision encoder
In addition to its integrated audio capabilities, Gemma 3n is equipped with a new, highly efficient vision encoder, MobileNet-V5-300M, delivering state-of-the-art performance for multimodal tasks on edge devices.
MobileNet-V5 is designed for flexibility and power even on constrained hardware, providing developers with the following capabilities:
Multiple input resolutions: natively supports 256×256, 512×512, and 768×768 pixel resolutions, so you can balance performance and detail for your specific application (see the sizing sketch after this list).
Broad visual understanding: co-trained on a wide range of multimodal datasets, it excels at a broad set of image and video understanding tasks.
High throughput: processes up to 60 frames per second on a Google Pixel, enabling real-time, on-device video analysis and interactive experiences.
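As a small illustration of the resolution trade-off above, the sketch below pre-sizes an image to one of the natively supported resolutions before handing it to the model's processor; the helper and file name are hypothetical.

```python
# Illustrative helper: resize an input image to a natively supported resolution.
from PIL import Image

SUPPORTED_SIZES = (256, 512, 768)   # native MobileNet-V5 input resolutions

def prepare_image(path: str, size: int = 512) -> Image.Image:
    """256 is fastest, 768 preserves the most detail; pick per your latency budget."""
    assert size in SUPPORTED_SIZES, f"choose one of {SUPPORTED_SIZES}"
    return Image.open(path).convert("RGB").resize((size, size))

frame = prepare_image("frame.jpg", size=768)   # hypothetical input file
```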
This level of performance is achieved through multiple architectural innovations, including:
An advanced foundation of MobileNet-V4 blocks, including Universal Inverted Bottlenecks and Mobile MQA.
A significantly scaled-up architecture featuring a hybrid, deep pyramid model that is 10x larger than the largest MobileNet-V4 variant.
A novel multi-scale Fusion VLM adapter that enhances token quality to improve accuracy and efficiency.
Benefiting from the new architectural design and advanced distillation techniques, MobileNet-V5-300M substantially outperforms the baseline SoViT in Gemma 3 (trained with SigLIP, no distillation). On a Google Pixel Edge TPU it delivers a 13x speedup with quantization (6.5x without), requires 46% fewer parameters, and has a 4x smaller memory footprint, all while delivering significantly higher accuracy on vision-language tasks.
We look forward to sharing more about the work behind this model. Stay tuned for the upcoming MobileNet-V5 technical report, which will detail the model architecture, data scaling strategies, and advanced distillation techniques.
Making Gemma 3n accessible from day one was a top priority. We're proud to partner with many amazing open source developers, including the teams behind AMD, Axolotl, Docker, Hugging Face, llama.cpp, LMStudio, MLX, NVIDIA, Ollama, RedHat, SGLang, Unsloth, and vLLM, to ensure broad support across popular tools and platforms.
But this ecosystem is just the beginning. The real power of this technology lies in what you build with it. That's why we're launching the Gemma 3n Impact Challenge. Your mission: use Gemma 3n's unique on-device, offline, and multimodal capabilities to build a product for a better world. With $150,000 in prizes, we're looking for compelling video stories and “wow”-factor demos that show real-world impact. Join the challenge and help build a better future.
Get started with Gemma 3n now
Are you ready to explore the possibilities of Gemma 3n now? Here’s how.
Direct experimentation: try Gemma 3n in just a few clicks using Google AI Studio. Gemma models can also be deployed directly from AI Studio to Cloud Run.
Learn and integrate: quickly integrate Gemma into your projects with our comprehensive documentation, or get started with our inference and fine-tuning guides.
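If you start in Google AI Studio, moving to code takes only a few lines with the Gemini API client. The sketch below shows one way to do it; the model name string is an assumption, so verify the exact identifier in AI Studio's model list.

```python
# Minimal sketch: calling Gemma 3n through the Gemini API after getting a key in AI Studio.
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")
response = client.models.generate_content(
    model="gemma-3n-e4b-it",          # assumed model name -- verify in AI Studio
    contents="Suggest three ideas for an offline, on-device app built on Gemma 3n.",
)
print(response.text)
```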

