When you talk to Gemma 4, she decides for herself whether she needs to look through her webcam to answer you. Everything runs locally on a Jetson Orin Nano Super: You speak → Parakeet STT → Gemma 4 → (webcam, if needed) → Kokoro TTS → speaker
Press SPACE to record, press SPACE again to stop. This is a simple VLA: the model decides for itself what to do based on the context of what is being asked, with no keyword triggers or hard-coded logic. If your question requires Gemma to open her eyes, she takes a photo, interprets it, and answers with that context in mind. She isn't just narrating photos; she's using what she sees to answer real questions.
Honestly? It's pretty impressive to see this running on the Jetson Orin Nano. 🙂
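To make the flow concrete, here is a minimal Python sketch of that loop. Every function is a placeholder standing in for the real STT / LLM / TTS calls in Gemma4_vla.py; none of these names come from the actual script.

```python
# Simplified sketch of the voice loop. All functions are placeholders,
# not the real API of Gemma4_vla.py.

def transcribe(audio_bytes: bytes) -> str:
    """Stand-in for Parakeet STT (runs locally via onnx_asr)."""
    return audio_bytes.decode("utf-8")  # stub: pretend the audio is text

def ask_gemma(question: str) -> str:
    """Stand-in for the llama-server chat call. The real script also passes
    the look_and_answer tool definition so the model can decide on its own
    to request a webcam frame."""
    return f"You asked: {question}"

def speak(text: str) -> str:
    """Stand-in for Kokoro TTS playback through the USB speaker."""
    return f"[spoken] {text}"

def handle_turn(audio_bytes: bytes) -> str:
    question = transcribe(audio_bytes)   # speech -> text
    answer = ask_gemma(question)         # text (+ optional photo) -> answer
    return speak(answer)                 # text -> audio out
```

The point of the sketch is the ordering: the webcam is not a fixed pipeline stage, it only enters the loop when the model asks for it inside `ask_gemma`.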
Get the code
The complete script for this tutorial can be found in my Google_Gemma repository on GitHub next to the Gemma 2 demo.
👉 github.com/asierarranz/Google_Gemma
Grab it one of two ways (pick one):
git clone https://github.com/asierarranz/Google_Gemma.git
cd Google_Gemma/Gemma4

or

wget https://raw.githubusercontent.com/asierarranz/Google_Gemma/main/Gemma4/Gemma4_vla.py
All you need is that one file (Gemma4_vla.py); the STT/TTS models and audio assets are downloaded from Hugging Face on the first run.
Hardware
What we used:
NVIDIA Jetson Orin Nano Super (8 GB)
Logitech C920 webcam (with built-in microphone)
USB speakers
USB keyboard (to press SPACE)
These are just what we used; any webcam, USB microphone, and USB speaker that Linux recognizes should work.
Step 1: System packages
If you’re new to Jetson, let’s install the basics.
sudo apt update
sudo apt install -y \
  git build-essential cmake curl wget pkg-config \
  python3-pip python3-venv python3-dev \
  alsa-utils pulseaudio-utils v4l-utils psmisc \
  ffmpeg libsndfile1
build-essential and cmake are only required if you use the native llama.cpp route (Option A in Step 4). The rest are for audio, webcam, and Python.
Step 2: Python environment
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install opencv-python-headless onnx_asr kokoro-onnx soundfile huggingface_hub numpy
Step 3: Free up RAM (optional but recommended)
Note: This step may not be strictly necessary. But we're pushing this 8 GB board pretty hard with a fairly demanding model, so giving it some headroom makes the whole experience smoother, especially if Docker or other heavy services have been running.
These are the commands that worked well for me. Use whichever ones help.
Add swap
Swap won't speed up inference, but it acts as a safety net during model loading so the OOM killer doesn't strike at the worst possible moment.
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
Kill the memory hogs
sudo systemctl stop docker 2>/dev/null || true
sudo systemctl stop containerd 2>/dev/null || true
pkill -f tracker-miner-fs-3 || true
pkill -f gnome-software || true
free -h
Close everything you don’t need, including browser tabs and IDE windows. Every MB counts.
If you go the Docker route in Step 4, obviously don't stop Docker here; you'll need it. Still kill the rest, though.
Still low on RAM?
In our testing, Q4_K_M (native build) and Q4_K_S (Docker) work well on 8 GB boards after the cleanup above. But if there's something else you can't get rid of and you're still low on memory, you can step down to a Q3 quant: the same model, slightly lower quality, but noticeably lighter. Just swap the filenames in Step 4.
gemma-4-E2B-it-Q3_K_M.gguf # instead of Q4_K_M
But honestly, if you can, stick with Q4_K_M. That’s the sweet spot.
Step 4: Serve Gemma 4
Before launching the demo, you need llama-server serving Gemma 4. Build llama.cpp natively on the Jetson: this gives you the best performance and full control, including the vision projector, which the VLA demo requires.
Build llama.cpp
cd ~
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES="87" \
  -DGGML_NATIVE=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j4
Download the model and vision projector
mkdir -p ~/models && cd ~/models
wget -O gemma-4-E2B-it-Q4_K_M.gguf \
  https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF/resolve/main/gemma-4-E2B-it-Q4_K_M.gguf
wget -O mmproj-gemma4-e2b-f16.gguf \
  https://huggingface.co/ggml-org/gemma-4-E2B-it-GGUF/resolve/main/mmproj-gemma4-e2b-f16.gguf
The mmproj file is the vision projector. Without it, Gemma is blind, so don't skip it.
Start the server
~/llama.cpp/build/bin/llama-server \
  -m ~/models/gemma-4-E2B-it-Q4_K_M.gguf \
  --mmproj ~/models/mmproj-gemma4-e2b-f16.gguf \
  -c 2048 \
  --image-min-tokens 70 --image-max-tokens 70 \
  --ubatch-size 512 --batch-size 512 \
  --host 0.0.0.0 --port 8080 \
  -ngl 99 --flash-attn on \
  --no-mmproj-offload --jinja -np 1
One flag worth mentioning: -ngl 99 tells llama-server to push all of the model's layers to the GPU (99 simply means "as many as the model has"). If you run into memory issues, lower that number to keep fewer layers on the GPU and leave the rest on the CPU. With this setup, though, all layers on the GPU should work fine.
Make sure it’s running
From another terminal:
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"gemma4","messages":[{"role":"user","content":"Hello!"}],"max_tokens":32}' \
  | python3 -m json.tool
If JSON is returned, you’re good to go.
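If you'd rather poke the server from Python than curl, the same health check looks roughly like this. It's a stdlib-only sketch, not code from Gemma4_vla.py; the payload simply follows the OpenAI-style chat completions format llama-server speaks.

```python
import json
import urllib.request

def build_chat_request(prompt: str, max_tokens: int = 32) -> dict:
    """Build an OpenAI-style chat completions payload for llama-server."""
    return {
        # llama-server serves a single loaded model, so any name works here
        "model": "gemma4",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def check_server(url: str = "http://localhost:8080/v1/chat/completions") -> dict:
    """POST a hello message and return the parsed JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_chat_request("Hello!")).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)
```

Call `check_server()` once the server from Step 4 is up; getting back a dict with a `choices` key means you're good to go.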
Step 5: Find your microphone, speakers, and webcam
Microphone
arecord -l
Find your USB microphone. In our case, the C920 was displayed as plughw:3,0.
Speaker
pactl list short sinks
This lists your PulseAudio sinks; pick the one that matches your speaker. You'll end up with a long, ugly name like alsa_output.usb-…. Mine was alsa_output.usb-Generic_USB2.0_Device_20130100ph0-00.analog-stereo; yours will differ.
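If you don't fancy copying that name by hand, a few lines of Python can pick the USB sink out of the `pactl list short sinks` output. This helper is my own convenience sketch, not part of the demo script; it assumes the usual tab-separated `pactl` short format, where the second column is the sink name.

```python
from typing import Optional

def pick_usb_sink(pactl_output: str) -> Optional[str]:
    """Return the first USB sink name from `pactl list short sinks` output.

    Each line of the short listing is tab-separated:
    index <TAB> name <TAB> driver <TAB> sample spec <TAB> state
    """
    for line in pactl_output.strip().splitlines():
        fields = line.split("\t")
        if len(fields) > 1 and "usb" in fields[1].lower():
            return fields[1]
    return None
```

Feed it the captured output of `pactl list short sinks` (e.g. via `subprocess.run`) and use the result as your SPK_DEVICE.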
Webcam
v4l2-ctl –list-devices
Usually index 0 (i.e. /dev/video0).
Quick test
export MIC_DEVICE="plughw:3,0"
export SPK_DEVICE="alsa_output.usb-Generic_USB2.0_Device_20130100ph0-00.analog-stereo"
arecord -D "$MIC_DEVICE" -f S16_LE -r 16000 -c 1 -d 3 /tmp/test.wav
paplay --device="$SPK_DEVICE" /tmp/test.wav
When you hear your voice, you’re ready.
Step 6: Run the demo
Make sure the server from step 4 is running, then do the following:
source .venv/bin/activate
export MIC_DEVICE="plughw:3,0"
export SPK_DEVICE="alsa_output.usb-Generic_USB2.0_Device_20130100ph0-00.analog-stereo"
export WEBCAM=0
export VOICE="af_jessica"
python3 Gemma4_vla.py
On first launch, the script downloads Parakeet STT and Kokoro TTS and generates a voice-prompt WAV; this takes about a minute. Then it goes live.
SPACE → start recording
Speak your question
SPACE → stop recording
There is also a text-only mode if you want to skip the audio setup and test the LLM path directly.
python3 Gemma4_vla.py –text
Change the voice
Kokoro ships with many voices. To switch:
export VOICE="am_puck"
python3 Gemma4_vla.py
Good ones: af_jessica, af_nova, am_puck, bf_emma, am_onyx.
How it works
The script exposes exactly one tool to Gemma 4:
{
  "name": "look_and_answer",
  "description": "Take a photo with the webcam and analyze what you see."
}
When you ask a question:

1. Your speech is transcribed locally (Parakeet STT)
2. Gemma gets the text plus the tool definition
3. If the question needs vision, she calls look_and_answer; the script captures a webcam frame and sends it back
4. Gemma answers and Kokoro speaks it aloud
There is no keyword matching. The model decides when it needs to look. That's the VLA part.
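That round trip can be sketched like this. The exact wire format depends on the chat template llama-server applies; this sketch assumes the OpenAI-style `tools` / `tool_calls` fields and a data-URL `image_url` content part, which is one way llama-server accepts images. It's illustrative, not the script's actual code.

```python
import base64

# Tool definition sent with every chat request (OpenAI tools format).
TOOLS = [{
    "type": "function",
    "function": {
        "name": "look_and_answer",
        "description": "Take a photo with the webcam and analyze what you see.",
        "parameters": {"type": "object", "properties": {}},
    },
}]

def wants_to_look(response: dict) -> bool:
    """True if the model's first choice asked to call look_and_answer."""
    message = response["choices"][0]["message"]
    for call in message.get("tool_calls") or []:
        if call["function"]["name"] == "look_and_answer":
            return True
    return False

def frame_to_data_url(jpeg_bytes: bytes) -> str:
    """Encode a captured JPEG frame as a data URL so it can be sent back
    to the server inside an image_url content part."""
    b64 = base64.b64encode(jpeg_bytes).decode("ascii")
    return f"data:image/jpeg;base64,{b64}"
```

When `wants_to_look` returns True, the script grabs a frame, wraps it with `frame_to_data_url`, and sends a follow-up request; otherwise the first answer goes straight to TTS.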
llama-server's --jinja flag enables this: it activates Gemma's native tool-calling support.
Troubleshooting
Server runs out of memory: re-run the cleanup from Step 3 and close everything. The model fits in 8 GB, but you have to keep things tidy.
No sound: check pactl list short sinks and make sure SPK_DEVICE matches an actual sink name.
Microphone records silence: double-check the device with arecord -l, then test a recording manually.
First run is slow: that's normal; it's downloading models and generating the voice prompts. The second run is fast.
Environment variables
Variable    | Default                                   | Description
LLAMA_URL   | http://127.0.0.1:8080/v1/chat/completions | llama-server endpoint
MIC_DEVICE  | plughw:3,0                                | ALSA capture device
SPK_DEVICE  | alsa_output.usb-…analog-stereo            | PulseAudio sink for playback
WEBCAM      | 0                                         | webcam index (/dev/videoN)
VOICE       | af_jessica                                | Kokoro TTS voice
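In Python, reading these with sensible fallbacks is a one-liner per variable. This is a sketch of what such a config loader could look like, not the script's actual code; SPK_DEVICE gets no hard-coded default here because the sink name is machine-specific.

```python
import os

def load_config(env=os.environ) -> dict:
    """Read the demo's settings from the environment, falling back to the
    defaults listed in the table above."""
    return {
        "llama_url": env.get("LLAMA_URL", "http://127.0.0.1:8080/v1/chat/completions"),
        "mic_device": env.get("MIC_DEVICE", "plughw:3,0"),
        "spk_device": env.get("SPK_DEVICE", ""),  # machine-specific; must match your sink
        "webcam": int(env.get("WEBCAM", "0")),    # index N maps to /dev/videoN
        "voice": env.get("VOICE", "af_jessica"),
    }
```

Anything you export before launching (as in Step 6) overrides the default.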
Bonus: Just want to try Gemma 4 in text mode?
If you're not interested in the full VLA demo and just want to try Gemma 4 on Jetson without building anything, the Jetson AI Lab provides a ready-to-use Docker image with llama.cpp precompiled for Orin.
sudo docker run -it --rm --pull always \
  --runtime=nvidia --network host \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  ghcr.io/nvidia-ai-iot/llama_cpp:latest-jetson-orin \
  llama-server -hf unsloth/gemma-4-E2B-it-GGUF:Q4_K_S
One line, no compilation; -hf pulls the GGUF from Hugging Face on first run. Point any OpenAI-compatible client at http://localhost:8080 and chat.
Note: this Docker path is text-only. The vision projector is not loaded, so the VLA demo above won't work. For the full webcam experience, stick with the native build from Step 4.
We hope you enjoyed this tutorial! If you have any questions or comments, please feel free to contact us. 🙂
Asier Arranz | NVIDIA

