Today we’re introducing Gemma 4 12B, our latest model designed to bring agentic, multimodal intelligence directly to your laptop. Gemma 4 12B packs powerful capabilities into a reduced memory footprint, bridging the gap between the edge-friendly E4B and the more capable 26B Mixture of Experts (MoE) model. It’s also the first mid-sized model to feature native audio input.
Thanks to the developer community, Gemma 4 models have been downloaded over 150 million times. You’ve built everything from wearable robotic arms for physical assistance to enterprise-grade AI security. We can’t wait to see what you build with this latest addition.
Here’s a summary of what makes Gemma 4 12B unique:
Novel integrated architecture: There is no separate multimodal encoder. Vision and audio inputs flow directly into the LLM backbone.
Advanced reasoning: Enables powerful multi-step reasoning and agentic workflows, with benchmark performance close to the 26B model.
Laptop-friendly: Small enough to run locally with just 16 GB of VRAM or unified memory.
Open and accessible: Released under the Apache 2.0 license, with support from across the developer ecosystem.
Drafter included: Gemma 4 12B comes with a multi-token prediction (MTP) drafter to reduce latency.
Together, these features bring advanced multimodal capabilities to everyday hardware without sacrificing speed or reasoning quality. Now let’s take a closer look at how Gemma 4 12B accomplishes this.
Run state-of-the-art agents locally
Gemma 4 12B offers performance close to the larger 26B MoE model on standard benchmarks, with less than half the total memory footprint. It’s small enough to run locally on a consumer laptop with 16 GB of RAM, enabling powerful multimodal and agentic experiences on-device.
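To get a feel for the memory claim, here is a rough back-of-the-envelope sketch of how much memory the weights alone might need. It rests on several assumptions that are not from this post: the parameter counts are read straight off the model names (12 billion and 26 billion), the storage formats (16-bit, 8-bit, 4-bit) are illustrative, and the KV cache, activations, and runtime overhead are ignored, so real usage will be higher.

```python
# Rough weight-only memory estimate. All inputs are assumptions for
# illustration: nominal parameter counts taken from the model names and
# hypothetical storage formats. KV cache and activations are NOT included.

GIB = 1024 ** 3  # bytes per GiB


def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Return the memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / GIB


# Nominal parameter counts (assumed from the names "12B" and "26B").
PARAMS = {"12B": 12e9, "26B MoE": 26e9}

# Hypothetical storage formats and their bits per parameter.
PRECISIONS = {"16-bit": 16, "8-bit": 8, "4-bit": 4}

for name, num_params in PARAMS.items():
    for fmt, bits in PRECISIONS.items():
        size = weight_memory_gib(num_params, bits)
        print(f"{name:8s} {fmt:7s} {size:6.1f} GiB")

# Ratio of weight memory at equal precision (independent of the format).
ratio = PARAMS["12B"] / PARAMS["26B MoE"]
print(f"12B / 26B weight ratio: {ratio:.2f}")
```

Under these assumptions, the 12B weights come to about 22 GiB at 16 bits and about 11 GiB at 8 bits, so fitting within 16 GB would likely depend on a quantized format. The ratio to the 26B model is about 0.46 at equal precision, which is in line with the "less than half" figure above.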

