Gemma 3n was announced as a preview during Google I/O. It is designed to run locally on your hardware, which has the on-device community very excited. On top of that, it is natively multimodal, supporting image, text, audio, and video inputs 🤯
Today, Gemma 3n is finally available in the most used open-source libraries. This includes transformers & timm, MLX, llama.cpp (text inputs), Transformers.js, Ollama, Google AI Edge, and others.
This post quickly goes over practical snippets that show how to use the model with these libraries, and how easy it is to fine-tune it for other domains.
Models released today
This is the Gemma 3n release collection.
Two model sizes are released today, each with two variants (base and instruct). The model names follow a non-standard nomenclature: they are called gemma-3n-E2B and gemma-3n-E4B. The E in front of the parameter count stands for effective. Their actual parameter counts are 5B and 8B respectively, but thanks to improvements in memory efficiency, they only require 2B and 4B to be held in VRAM (GPU memory).
So these models behave like 2B and 4B models in terms of hardware support, but punch above 2B/4B in terms of quality. The E2B model can run on as little as 2GB of GPU RAM, while the E4B runs on as little as 3GB.
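If you want to sanity-check that footprint on your own hardware, a minimal, illustrative sketch is to load the E2B model in bfloat16 and read back PyTorch's peak allocation; the exact number will vary with dtype, drivers, and how much gets offloaded.

import torch
from transformers import AutoModelForImageTextToText

# Illustrative only: measure the peak GPU allocation after loading the E2B model.
model = AutoModelForImageTextToText.from_pretrained(
    "google/gemma-3n-e2b-it", torch_dtype=torch.bfloat16
).to("cuda")
print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")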
Model details
In addition to the language decoder, Gemma 3n uses an audio encoder and a vision encoder. We highlight their main features below and describe how they were added to transformers and timm, as these serve as references for other implementations.
Vision encoder (MobileNet-V5). Gemma 3n uses a new version of MobileNet: MobileNet-V5-300, added to the new version of timm released today. It has 300M parameters and supports resolutions of 256×256, 512×512, and 768×768. It achieves 60 FPS on a Google Pixel, outperforming ViT Giant while using 3x fewer parameters.
Audio encoder. Based on the Universal Speech Model (USM), it processes audio in 160ms chunks and enables speech-to-text and translation features (e.g., English to Spanish/French).
Gemma 3n architecture and language model. The architecture itself has been added to the new version of transformers released today. The implementation relies on timm for the image encoder, so there is a single reference implementation of the MobileNet architecture.
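Since timm now carries the reference implementation of the vision encoder, a quick way to see which MobileNet-V5 variants your installed timm version registers is to query its model registry (an illustrative snippet; the exact model names are whatever timm reports, not something to take from here):

import timm

# List the MobileNet-V5 variants registered in the installed timm release.
print(timm.list_models("*mobilenetv5*"))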
Architecture highlights
MatFormer architecture: a nested transformer design, similar to Matryoshka embeddings, which lets you extract subsets of layers as if they were standalone models (a conceptual sketch follows below). E2B and E4B were trained together, with E2B configured as a sub-model of E4B. Users can “mix and match” layers depending on their hardware characteristics and memory budget.
Per-Layer Embeddings (PLE): reduces accelerator memory usage by offloading the embeddings to the CPU. This is why the E2B model occupies roughly the GPU memory of a 2B-parameter model even though it has 5B actual parameters.
KV cache sharing: speeds up long-context processing for audio and video, achieving a 2x faster prefill than Gemma 3 4B.
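To build intuition for the “mix and match” idea, here is a purely conceptual PyTorch sketch (not Gemma 3n's actual implementation): a smaller feed-forward block reuses a prefix slice of the larger block's weights, so a sub-model can be carved out of the full model without extra weights.

import torch
import torch.nn as nn

class NestedFFN(nn.Module):
    # Conceptual Matryoshka-style FFN: smaller variants reuse a prefix of the weights.
    def __init__(self, d_model=256, d_hidden=1024):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x, active_hidden=None):
        h = torch.relu(self.up(x))
        if active_hidden is not None:
            # "Mix and match": keep only the first `active_hidden` hidden units,
            # i.e. run a nested sub-model of the full block.
            h = h[..., :active_hidden]
            return h @ self.down.weight[:, :active_hidden].T + self.down.bias
        return self.down(h)

ffn = NestedFFN()
x = torch.randn(1, 256)
full = ffn(x)                      # full-capacity block ("E4B-like")
small = ffn(x, active_hidden=512)  # nested sub-block ("E2B-like"), same weights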
Performance and benchmarks
LMArena score: E4B is the first model under 10B parameters to achieve a score above 1300. MMLU score: Gemma 3n shows competitive performance across a range of sizes (E4B, E2B, and several mix-n-match configurations). Multilingual support: 140 languages for text and 35 languages for multimodal interactions.
Demo space
The easiest way to try the model is via its dedicated Hugging Face Space, where you can try different prompts using different modalities.
📱Space
Inference with transformers
Install the latest versions of timm (for the vision encoder) and transformers to run inference, or if you want to fine-tune it:
pip install -U -q timm
pip install -U -q transformers
Inference with pipeline
The easiest way to start using Gemma 3n is via the pipeline abstraction in transformers:
import torch
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3n-e4b-it",
    device="cuda",
    torch_dtype=torch.bfloat16,
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/arig23498/demo-data/resolve/main/airplane.jpg"},
            {"type": "text", "text": "Describe this image"},
        ],
    }
]

output = pipe(text=messages, max_new_tokens=32)
print(output[0]["generated_text"][-1]["content"])
output:
The image shows a futuristic and sophisticated aircraft that soars through the sky. It is designed with a distinctive, almost alien aesthetic and features a wide body and large
Detailed inference with transformers
Let's write a model_generation function that initializes the model and processor from the Hub and takes care of prompt processing and model inference:
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model_id = "google/gemma-3n-e4b-it"
device = "cuda"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)

def model_generation(model, messages):
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    )
    input_len = inputs["input_ids"].shape[-1]
    inputs = inputs.to(model.device, dtype=model.dtype)

    with torch.inference_mode():
        generation = model.generate(**inputs, max_new_tokens=32, disable_compile=False)
        generation = generation[:, input_len:]
        decoded = processor.batch_decode(generation, skip_special_tokens=True)

    print(decoded[0])
The model accepts all modalities as input, so here is a brief description of how to use each of them via transformers.
Text only
messages = [
    {
        "role": "user",
        "content": [{"type": "text", "text": "What is the capital of France?"}],
    }
]
model_generation(model, messages)
output:
The French capital is **Paris**.
Interleaved with audio
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe the following speech segment in English:"},
            {"type": "audio", "audio": "https://huggingface.co/datasets/arig23498/demo-data/resolve/main/speech.wav"},
        ],
    }
]
model_generation(model, messages)
output:
Send a text to the microphone. I’ll go home late tomorrow.
Interleaved with image/video
Video inputs are supported as a collection of image frames (see the sketch after the image example below).
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/arig23498/demo-data/resolve/main/airplane.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
model_generation(model, messages)
output:
This image shows a futuristic and refined white plane against a transparent blue sky. The plane is tilted
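Since video is treated as a sequence of frames, one practical pattern (an illustrative sketch, assuming OpenCV is installed and a hypothetical local video file) is to sample a handful of frames and pass them as multiple image entries in the same message:

import cv2  # pip install opencv-python
from PIL import Image

def sample_frames(video_path, num_frames=8):
    # Uniformly sample frames from the video and return them as PIL images.
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in range(0, total, max(total // num_frames, 1)):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames[:num_frames]

frames = sample_frames("my_video.mp4")  # hypothetical local file
content = [{"type": "image", "image": frame} for frame in frames]
content.append({"type": "text", "text": "Describe what happens in this video."})
model_generation(model, [{"role": "user", "content": content}])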
Inference with MLX
Gemma 3n comes with day-0 support in MLX across all three modalities. Make sure to upgrade your mlx-vlm installation:
pip install -U mlx-vlm
Let's start with vision:
python -m mlx_vlm.generate --model google/gemma-3n-e4b-it --max-tokens 100 --temperature 0.5 --prompt "Describe this image in detail." --image https://huggingface.co/datasets/arig23498/demo-data/resolve/main/airplane.jpg
And audio:
python -m mlx_vlm.generate --model google/gemma-3n-e4b-it --max-tokens 100 --temperature 0.0 --prompt "Transcribe the following speech segment in English:" --audio https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/audio-samples/jfk.wav
Inference with llama.cpp
In addition to MLX, Gemma 3n (text only) works out of the box with llama.cpp. Make sure you install llama.cpp/Ollama from source.
See the installation instructions for llama.cpp: https://github.com/ggml-org/llama.cpp/blob/master/docs/install.md
You can run it like this:
llama-server -hf ggml-org/gemma-3n-e4b-it-gguf:q8_0
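Once the server is up, you can query it from Python over its OpenAI-compatible HTTP API (a minimal sketch, assuming llama-server's default address of http://localhost:8080):

import requests

# Send a chat request to the local llama-server (OpenAI-compatible endpoint).
response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "max_tokens": 64,
    },
)
print(response.json()["choices"][0]["message"]["content"])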
Inference with Transformers.js and ONNXRuntime
Finally, we have also released ONNX weights for the gemma-3n-E2B-it model variant, enabling flexible deployment across a variety of runtimes and platforms. For JavaScript developers, Gemma 3n has been integrated into Transformers.js and is available as of version 3.6.0.
For more information on how to run the models with these libraries, check the Usage section of the model cards.
Fine-tuning with free Google Colab
Given the size of the model, it is very convenient to fine-tune it for specific downstream tasks across modalities. To make fine-tuning easier, we have created a simple notebook that lets you experiment on a free Google Colab!
We also provide a dedicated notebook for fine-tuning on audio tasks, so you can easily adapt the model to speech datasets and benchmarks.
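To give a flavor of what such a fine-tune can look like in code, here is a minimal, hypothetical sketch using TRL's SFTTrainer on a placeholder text dataset; the notebooks above are the actual reference, and the dataset and hyperparameters here are only for illustration.

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder conversational dataset; swap in your own domain data.
dataset = load_dataset("trl-lib/Capybara", split="train")

training_args = SFTConfig(
    output_dir="gemma-3n-e2b-it-finetune",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    bf16=True,
)

trainer = SFTTrainer(
    model="google/gemma-3n-e2b-it",  # model id as used elsewhere in this post
    train_dataset=dataset,
    args=training_args,
)
trainer.train()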
Hugging Face Gemma Recipes
This release also introduces the Hugging Face Gemma Recipes repository. There you will find notebooks and scripts to run and fine-tune the models.
We would love for you to add more recipes using models from the Gemma family. Feel free to open issues and send pull requests to the repository.
Conclusion
We are always excited to host Google and the Gemma family of models, and we hope the community comes together to make the most of them. Multimodal, small, and very capable: this makes for a great model release!
If you would like to discuss the models in more detail, start a discussion right below this blog post. We will be happy to help!
Thanks to Arthur, Cyril, Raushan, Lysandre, and everyone at Hugging Face who took care of the integrations and made them available to the community!