When you talk to Gemma 4, she decides for herself whether she needs to look through her webcam to answer you. Everything runs locally on a Jetson Orin Nano Super: You speak → Parakeet STT → Gemma 4 → (webcam, if needed) → Kokoro TTS → speaker
Press SPACE to record, press SPACE again to stop. This is a simple VLA: the model decides for itself what to do based on the context of what is being asked, with no keyword triggers or hard-coded logic. If your question requires Gemma to open her eyes, she takes a photo, interprets it, and answers with that context in mind. She isn't just narrating photos; she's using what she sees to answer real questions.
Honestly? It's pretty impressive to see this running on the Jetson Orin Nano. 🙂
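To make the flow concrete, here is a minimal Python sketch of that loop. Every function is a placeholder standing in for the real STT / LLM / TTS calls in Gemma4_vla.py; none of these names come from the actual script.

```python
# Simplified sketch of the voice loop. All functions are placeholders,
# not the real API of Gemma4_vla.py.

def transcribe(audio_bytes: bytes) -> str:
    """Stand-in for Parakeet STT (runs locally via onnx_asr)."""
    return audio_bytes.decode("utf-8")  # stub: pretend the audio is text

def ask_gemma(question: str) -> str:
    """Stand-in for the llama-server chat call. The real script also passes
    the look_and_answer tool definition so the model can decide on its own
    to request a webcam frame."""
    return f"You asked: {question}"

def speak(text: str) -> str:
    """Stand-in for Kokoro TTS playback through the USB speaker."""
    return f"[spoken] {text}"

def handle_turn(audio_bytes: bytes) -> str:
    question = transcribe(audio_bytes)   # speech -> text
    answer = ask_gemma(question)         # text (+ optional photo) -> answer
    return speak(answer)                 # text -> audio out
```

The point of the sketch is the ordering: the webcam is not a fixed pipeline stage, it only enters the loop when the model asks for it inside `ask_gemma`.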
Get the code
The complete script for this tutorial can be found in my Google_Gemma repository on GitHub next to the Gemma 2 demo.
👉 github.com/asierarranz/Google_Gemma
Grab it one of two ways (pick one):
git clone https://github.com/asierarranz/Google_Gemma.git
cd Google_Gemma/Gemma4

or

wget https://raw.githubusercontent.com/asierarranz/Google_Gemma/main/Gemma4/Gemma4_vla.py
All you need is that one file (Gemma4_vla.py); the STT/TTS models and audio assets are downloaded from Hugging Face on the first run.
Hardware
What we used:
NVIDIA Jetson Orin Nano Super (8 GB)
Logitech C920 webcam (with built-in microphone)
USB speakers
USB keyboard (to press SPACE)
These are just what we used; any webcam, USB microphone, and USB speaker that Linux recognizes should work.
Step 1: System packages
If you’re new to Jetson, let’s install the basics.
sudo apt update
sudo apt install -y \
  git build-essential cmake curl wget pkg-config \
  python3-pip python3-venv python3-dev \
  alsa-utils pulseaudio-utils v4l-utils psmisc \
  ffmpeg libsndfile1
build-essential and cmake are only required if you use the native llama.cpp route (Option A in Step 4). The rest are for audio, webcam, and Python.
Step 2: Python environment
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install opencv-python-headless onnx_asr kokoro-onnx soundfile huggingface_hub numpy
Step 3: Free up RAM (optional but recommended)
Note: This step may not be strictly necessary. But we're pushing this 8 GB board pretty hard with a fairly demanding model, so giving it some headroom makes the whole experience smoother, especially if Docker or other heavy services have been running.
These are the commands that worked well for me. Use whichever ones help.
Add swap
Swap won't speed up inference, but it acts as a safety net during model loading so the OOM killer doesn't strike at the worst possible moment.
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
Kill the memory hogs
sudo systemctl stop docker 2>/dev/null || true
sudo systemctl stop containerd 2>/dev/null || true
pkill -f tracker-miner-fs-3 || true
pkill -f gnome-software || true
free -h
Close everything you don’t need, including browser tabs and IDE windows. Every MB counts.
If you go the Docker route in Step 4, obviously don't stop Docker here; you'll need it. Still kill the rest, though.
Still low on RAM?
In our testing, Q4_K_M (native build) and Q4_K_S (Docker) work well on 8 GB boards after the cleanup above. But if there's something else you can't get rid of and you're still low on memory, you can step down to a Q3 quant: the same model, slightly lower quality, but noticeably lighter. Just swap the filenames in Step 4.
gemma-4-E2B-it-Q3_K_M.gguf # instead of Q4_K_M
But honestly, if you can, stick with Q4_K_M. That’s the sweet spot.
Step 4: Serve Gemma 4
Before launching the demo, you need llama-server serving Gemma 4. Build llama.cpp natively on the Jetson: this gives you the best performance and full control, including the vision projector, which the VLA demo requires.
Build llama.cpp
cd ~
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES="87" \
  -DGGML_NATIVE=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j4
Download the model and vision projector
mkdir -p ~/models && cd ~/models
wget -O gemma-4-E2B-it-Q4_K_M.gguf \
  https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF/resolve/main/gemma-4-E2B-it-Q4_K_M.gguf
wget -O mmproj-gemma4-e2b-f16.gguf \
  https://huggingface.co/ggml-org/gemma-4-E2B-it-GGUF/resolve/main/mmproj-gemma4-e2b-f16.gguf
The mmproj file is the vision projector. Without it, Gemma is blind, so don't skip it.
Start the server
~/llama.cpp/build/bin/llama-server \
  -m ~/models/gemma-4-E2B-it-Q4_K_M.gguf \
  --mmproj ~/models/mmproj-gemma4-e2b-f16.gguf \
  -c 2048 \
  --image-min-tokens 70 --image-max-tokens 70 \
  --ubatch-size 512 --batch-size 512 \
  --host 0.0.0.0 --port 8080 \
  -ngl 99 --flash-attn on \
  --no-mmproj-offload --jinja -np 1
One flag worth mentioning: -ngl 99 tells llama-server to push all of the model's layers to the GPU (99 simply means "as many as the model has"). If you run into memory issues, lower that number to keep fewer layers on the GPU and leave the rest on the CPU. With this setup, though, all layers on the GPU should work fine.
Make sure it’s running
From another terminal:
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"gemma4","messages":[{"role":"user","content":"Hello!"}],"max_tokens":32}' \
  | python3 -m json.tool
If JSON is returned, you’re good to go.
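If you'd rather poke the server from Python than curl, the same health check looks roughly like this. It's a stdlib-only sketch, not code from Gemma4_vla.py; the payload simply follows the OpenAI-style chat completions format llama-server speaks.

```python
import json
import urllib.request

def build_chat_request(prompt: str, max_tokens: int = 32) -> dict:
    """Build an OpenAI-style chat completions payload for llama-server."""
    return {
        # llama-server serves a single loaded model, so any name works here
        "model": "gemma4",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def check_server(url: str = "http://localhost:8080/v1/chat/completions") -> dict:
    """POST a hello message and return the parsed JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_chat_request("Hello!")).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)
```

Call `check_server()` once the server from Step 4 is up; getting back a dict with a `choices` key means you're good to go.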
Step 5: Find your microphone, speakers, and webcam
Microphone
arecord -l
Find your USB microphone. In our case, the C920 was displayed as plughw:3,0.
Speaker
pactl list short sinks
This lists your PulseAudio sinks; pick the one that matches your speaker. You'll end up with a long, ugly name like alsa_output.usb-…. Mine was alsa_output.usb-Generic_USB2.0_Device_20130100ph0-00.analog-stereo; yours will differ.
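If you don't fancy copying that name by hand, a few lines of Python can pick the USB sink out of the `pactl list short sinks` output. This helper is my own convenience sketch, not part of the demo script; it assumes the usual tab-separated `pactl` short format, where the second column is the sink name.

```python
from typing import Optional

def pick_usb_sink(pactl_output: str) -> Optional[str]:
    """Return the first USB sink name from `pactl list short sinks` output.

    Each line of the short listing is tab-separated:
    index <TAB> name <TAB> driver <TAB> sample spec <TAB> state
    """
    for line in pactl_output.strip().splitlines():
        fields = line.split("\t")
        if len(fields) > 1 and "usb" in fields[1].lower():
            return fields[1]
    return None
```

Feed it the captured output of `pactl list short sinks` (e.g. via `subprocess.run`) and use the result as your SPK_DEVICE.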
Webcam
v4l2-ctl –list-devices
Usually index 0 (i.e. /dev/video0).
Quick test
export MIC_DEVICE="plughw:3,0"
export SPK_DEVICE="alsa_output.usb-Generic_USB2.0_Device_20130100ph0-00.analog-stereo"
arecord -D "$MIC_DEVICE" -f S16_LE -r 16000 -c 1 -d 3 /tmp/test.wav
paplay --device="$SPK_DEVICE" /tmp/test.wav
When you hear your voice, you’re ready.
Step 6: Run the demo
Make sure the server from step 4 is running, then do the following:
source .venv/bin/activate
export MIC_DEVICE="plughw:3,0"
export SPK_DEVICE="alsa_output.usb-Generic_USB2.0_Device_20130100ph0-00.analog-stereo"
export WEBCAM=0
export VOICE="af_jessica"
python3 Gemma4_vla.py
On first launch, the script downloads Parakeet STT and Kokoro TTS and generates a voice-prompt WAV; this takes about a minute. Then it goes live.
SPACE → start recording
Speak your question
SPACE → stop recording
There is also a text-only mode if you want to skip the audio setup and test the LLM path directly.
python3 Gemma4_vla.py –text
Change the voice
Kokoro ships with many voices. To switch:
export VOICE="am_puck"
python3 Gemma4_vla.py
Good ones: af_jessica, af_nova, am_puck, bf_emma, am_onyx.
How it works
The script exposes exactly one tool to Gemma 4:
{
  "name": "look_and_answer",
  "description": "Take a photo with the webcam and analyze what you see."
}
When you ask a question:

1. Your speech is transcribed locally (Parakeet STT)
2. Gemma gets the text plus the tool definition
3. If the question needs vision, she calls look_and_answer; the script captures a webcam frame and sends it back
4. Gemma answers and Kokoro speaks it aloud
There is no keyword matching. The model decides when it needs to look. That's the VLA part.
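That round trip can be sketched like this. The exact wire format depends on the chat template llama-server applies; this sketch assumes the OpenAI-style `tools` / `tool_calls` fields and a data-URL `image_url` content part, which is one way llama-server accepts images. It's illustrative, not the script's actual code.

```python
import base64

# Tool definition sent with every chat request (OpenAI tools format).
TOOLS = [{
    "type": "function",
    "function": {
        "name": "look_and_answer",
        "description": "Take a photo with the webcam and analyze what you see.",
        "parameters": {"type": "object", "properties": {}},
    },
}]

def wants_to_look(response: dict) -> bool:
    """True if the model's first choice asked to call look_and_answer."""
    message = response["choices"][0]["message"]
    for call in message.get("tool_calls") or []:
        if call["function"]["name"] == "look_and_answer":
            return True
    return False

def frame_to_data_url(jpeg_bytes: bytes) -> str:
    """Encode a captured JPEG frame as a data URL so it can be sent back
    to the server inside an image_url content part."""
    b64 = base64.b64encode(jpeg_bytes).decode("ascii")
    return f"data:image/jpeg;base64,{b64}"
```

When `wants_to_look` returns True, the script grabs a frame, wraps it with `frame_to_data_url`, and sends a follow-up request; otherwise the first answer goes straight to TTS.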
llama-server's --jinja flag enables this: it activates Gemma's native tool-calling support.
Troubleshooting
Server runs out of memory: re-run the cleanup from Step 3 and close everything. The model fits in 8 GB, but you have to keep things tidy.
No sound: check pactl list short sinks and make sure SPK_DEVICE matches an actual sink name.
Microphone records silence: double-check the device with arecord -l, then test a recording manually.
First run is slow: that's normal; it's downloading models and generating the voice prompts. The second run is fast.
Environment variables
Variable    | Default                                   | Description
LLAMA_URL   | http://127.0.0.1:8080/v1/chat/completions | llama-server endpoint
MIC_DEVICE  | plughw:3,0                                | ALSA capture device
SPK_DEVICE  | alsa_output.usb-…analog-stereo            | PulseAudio sink for playback
WEBCAM      | 0                                         | webcam index (/dev/videoN)
VOICE       | af_jessica                                | Kokoro TTS voice
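In Python, reading these with sensible fallbacks is a one-liner per variable. This is a sketch of what such a config loader could look like, not the script's actual code; SPK_DEVICE gets no hard-coded default here because the sink name is machine-specific.

```python
import os

def load_config(env=os.environ) -> dict:
    """Read the demo's settings from the environment, falling back to the
    defaults listed in the table above."""
    return {
        "llama_url": env.get("LLAMA_URL", "http://127.0.0.1:8080/v1/chat/completions"),
        "mic_device": env.get("MIC_DEVICE", "plughw:3,0"),
        "spk_device": env.get("SPK_DEVICE", ""),  # machine-specific; must match your sink
        "webcam": int(env.get("WEBCAM", "0")),    # index N maps to /dev/videoN
        "voice": env.get("VOICE", "af_jessica"),
    }
```

Anything you export before launching (as in Step 6) overrides the default.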
Bonus: Just want to try Gemma 4 in text mode?
If you're not interested in the full VLA demo and just want to try Gemma 4 on Jetson without building anything, the Jetson AI Lab provides a ready-to-use Docker image with llama.cpp precompiled for Orin.
sudo docker run -it --rm --pull always \
  --runtime=nvidia --network host \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  ghcr.io/nvidia-ai-iot/llama_cpp:latest-jetson-orin \
  llama-server -hf unsloth/gemma-4-E2B-it-GGUF:Q4_K_S
One line, no compilation; -hf pulls the GGUF from Hugging Face on first run. Point any OpenAI-compatible client at http://localhost:8080 and chat.
Note: this Docker path is text-only. The vision projector is not loaded, so the VLA demo above won't work. For the full webcam experience, stick with the native build from Step 4.
We hope you enjoyed this tutorial! If you have any questions or comments, please feel free to contact us. 🙂
Asier Arranz | NVIDIA

