Visual language models (VLMs) represent a major leap forward in AI by blending visual recognition with semantic reasoning. Unlike traditional models constrained to fixed label sets, VLMs leverage a shared embedding space to interpret and discuss complex, open-ended environments in natural language.
Rapid advances in inference accuracy and efficiency have made these models ideal for edge devices. The NVIDIA Jetson family, from the high-performance AGX Thor and AGX Orin to the compact Orin Nano Super, is purpose-built to accelerate applications for physical AI and robotics, delivering the optimized runtimes needed for leading open source models.
This tutorial shows how to use the vLLM framework to deploy the NVIDIA Cosmos Reasoning 2B model across the Jetson lineup. We also show how to connect this model to the Live VLM WebUI to enable a real-time webcam-based interface for interactive physical AI.
Prerequisites
Supported devices:
Jetson AGX Thor Developer Kit
Jetson AGX Orin (64 GB / 32 GB)
Jetson Orin Nano Super
JetPack version:
JetPack 6 (L4T r36.x) — for Orin devices
JetPack 7 (L4T r38.x) — for Thor
Storage: NVMe SSD required
Model weights are up to 5 GB for the FP8 checkpoint, plus up to 8 GB for the vLLM container image
NGC account:
Create an NVIDIA NGC account (free) to download both the model and the vLLM container.
Deployment overview
| | Jetson AGX Thor | Jetson AGX Orin | Jetson Orin Nano Super |
|---|---|---|---|
| vLLM container | nvcr.io/nvidia/vllm:26.01-py3 | ghcr.io/nvidia-ai-iot/vllm:r36.4-tegra-aarch64-cu126-22.04 | ghcr.io/nvidia-ai-iot/vllm:r36.4-tegra-aarch64-cu126-22.04 |
| Model | FP8 via NGC (volume mount) | FP8 via NGC (volume mount) | FP8 via NGC (volume mount) |
| Maximum model length | 8192 tokens | 8192 tokens | 256 tokens (memory constraint) |
| GPU memory utilization | 0.8 | 0.8 | 0.65 |
The workflow is the same on all devices.
1. Download the FP8 model checkpoint via the NGC CLI.
2. Pull the vLLM Docker image for your device.
3. Start a container with the model mounted as a volume.
4. Connect the Live VLM WebUI to the vLLM endpoint.
Step 1: Install NGC CLI
The NGC CLI allows you to download model checkpoints from the NVIDIA NGC catalog.
Download and install
mkdir -p ~/Projects/CosmosReasoning
cd ~/Projects/CosmosReasoning

# Download NGC CLI for ARM64
# Get the latest installer URL from https://org.ngc.nvidia.com/setup/installers/cli
wget -O ngccli_arm64.zip https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/4.13.0/files/ngccli_arm64.zip
unzip ngccli_arm64.zip
chmod u+x ngc-cli/ngc

# Add to PATH
export PATH="$PATH:$(pwd)/ngc-cli"
Configure the CLI
ngc config set
You will be prompted to:
API key — generate one in the NGC API key settings
CLI output format — choose json or ascii
Org — press Enter to accept the default
Step 2: Download the model
Download the FP8-quantized checkpoint, which is used by all Jetson devices.
cd ~/Projects/CosmosReasoning
ngc registry model download-version "nim/nvidia/cosmos-reason2-2b:1208-fp8-static-kv8"
This creates a directory called cosmos-reason2-2b_v1208-fp8-static-kv8/ containing the model weights. Make a note of the full path; you will mount it as a volume in your Docker container.
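Before moving on, it can save time to sanity-check the download. The sketch below assumes the default download location from the commands above; the exact file list inside the checkpoint may differ.

```shell
# Quick sanity check of the downloaded checkpoint
# (path assumes the default location used in Step 2)
MODEL_PATH="$HOME/Projects/CosmosReasoning/cosmos-reason2-2b_v1208-fp8-static-kv8"
if [ -f "$MODEL_PATH/config.json" ]; then
  STATUS="checkpoint looks complete"
  du -sh "$MODEL_PATH"   # roughly 5 GB for the FP8 weights
else
  STATUS="config.json missing - re-run the ngc download"
fi
echo "$STATUS"
```

If the second message appears, re-run the `ngc registry model download-version` command from Step 2.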
Step 3: Pull the vLLM Docker image
For Jetson AGX Thor
docker pull nvcr.io/nvidia/vllm:26.01-py3
For Jetson AGX Orin / Orin Nano Super
docker pull ghcr.io/nvidia-ai-iot/vllm:r36.4-tegra-aarch64-cu126-22.04
Step 4: Serve Cosmos Reasoning 2B with vLLM
Option A: Jetson AGX Thor
Thor has enough GPU memory to run the model with its full context length.
Set the path to the downloaded model and free up cache memory on the host.
MODEL_PATH="$HOME/Projects/CosmosReasoning/cosmos-reason2-2b_v1208-fp8-static-kv8"
sudo sysctl -w vm.drop_caches=3
Start the container with the model mounted.
docker run --rm -it \
  --runtime nvidia \
  --network host \
  --ipc host \
  -v "$MODEL_PATH:/models/cosmos-reason2-2b:ro" \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
  nvcr.io/nvidia/vllm:26.01-py3 \
  bash
Activate the environment within the container and serve the model.
vllm serve /models/cosmos-reason2-2b \
  --max-model-len 8192 \
  --media-io-kwargs '{"video": {"num_frames": -1}}' \
  --reasoning-parser qwen3 \
  --gpu-memory-utilization 0.8
Note: The --reasoning-parser qwen3 flag enables chain-of-thought reasoning extraction. The --media-io-kwargs flag configures how video frames are processed.
Wait until you see the following log line:
INFO: Uvicorn running on http://0.0.0.0:8000
Option B: Jetson AGX Orin
AGX Orin has enough memory to run the model with the same generous parameters as Thor.
Set the path to the downloaded model and free up cache memory on the host.
MODEL_PATH="$HOME/Projects/CosmosReasoning/cosmos-reason2-2b_v1208-fp8-static-kv8"
sudo sysctl -w vm.drop_caches=3
1. Start the container.
docker run --rm -it \
  --runtime nvidia \
  --network host \
  -v "$MODEL_PATH:/models/cosmos-reason2-2b:ro" \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
  ghcr.io/nvidia-ai-iot/vllm:r36.4-tegra-aarch64-cu126-22.04 \
  bash
2. Activate the environment within the container and serve the model:
cd /opt/
source venv/bin/activate
vllm serve /models/cosmos-reason2-2b \
  --max-model-len 8192 \
  --media-io-kwargs '{"video": {"num_frames": -1}}' \
  --reasoning-parser qwen3 \
  --gpu-memory-utilization 0.8
Wait until you see the following log line:
INFO: Uvicorn running on http://0.0.0.0:8000
Option C: Jetson Orin Nano Super (memory-constrained)
The Orin Nano Super has significantly less RAM and requires aggressive memory optimization flags.
Set the path to the downloaded model and free up cache memory on the host.
MODEL_PATH="$HOME/Projects/CosmosReasoning/cosmos-reason2-2b_v1208-fp8-static-kv8"
sudo sysctl -w vm.drop_caches=3
1. Start the container.
docker run --rm -it \
  --runtime nvidia \
  --network host \
  -v "$MODEL_PATH:/models/cosmos-reason2-2b:ro" \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
  ghcr.io/nvidia-ai-iot/vllm:r36.4-tegra-aarch64-cu126-22.04 \
  bash
2. Activate the environment within the container and serve the model:
cd /opt/
source venv/bin/activate
vllm serve /models/cosmos-reason2-2b \
  --host 0.0.0.0 \
  --port 8000 \
  --trust-remote-code \
  --enforce-eager \
  --max-model-len 256 \
  --max-num-batched-tokens 256 \
  --gpu-memory-utilization 0.65 \
  --max-num-seqs 1 \
  --enable-chunked-prefill \
  --limit-mm-per-prompt '{"image":1,"video":1}' \
  --mm-processor-kwargs '{"num_frames":2,"max_pixels":150528}'
Key flag descriptions (Orin Nano Super only):
| Flag | Purpose |
|---|---|
| --enforce-eager | Disable CUDA graphs to save memory |
| --max-model-len 256 | Limit context to fit in available memory |
| --max-num-batched-tokens 256 | Match the model length limit |
| --gpu-memory-utilization 0.65 | Reserve headroom for system processes |
| --max-num-seqs 1 | Serve one sequence at a time to minimize memory |
| --enable-chunked-prefill | Process prefill in chunks for memory efficiency |
| --limit-mm-per-prompt | Limit each prompt to one image and one video |
| --mm-processor-kwargs | Reduce video frame count and image resolution |
| VLLM_SKIP_WARMUP=true | Skip warmup to save time and memory (environment variable) |
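To see why such a short context still helps on a memory-constrained board, a back-of-envelope KV-cache estimate is useful. All the numbers below (layer count, KV heads, head dimension) are illustrative assumptions for a small model, not measured values for cosmos-reason2-2b; the point is the formula, not the exact figure.

```shell
# Rough KV-cache size estimate (illustrative numbers, not model specs)
LAYERS=28        # assumed transformer layer count
KV_HEADS=2       # assumed KV attention heads (with grouped-query attention)
HEAD_DIM=128     # assumed per-head dimension
BYTES=1          # FP8 KV cache: 1 byte per element
TOKENS=256       # the --max-model-len used on Orin Nano Super

# bytes = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_elem * tokens
KV_BYTES=$((2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES * TOKENS))
echo "KV cache for $TOKENS tokens: $((KV_BYTES / 1024)) KiB"
```

Doubling --max-model-len doubles this figure linearly, which is why the context limit is the first knob to turn when memory is tight.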
Wait until the server is confirmed to be ready.
INFO: Uvicorn running on http://0.0.0.0:8000
Make sure the server is running
From another terminal on the Jetson:
curl http://localhost:8000/v1/models
You should see the model listed in the response.
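If you want to script against that response, the model id can be pulled out with Python's stdlib JSON parser. The snippet below runs against a canned response for illustration; in practice you would replace the RESPONSE assignment with the output of curl -s http://localhost:8000/v1/models.

```shell
# Parse the /v1/models response to get the exact model id
# (RESPONSE is a canned example of the server's output shape)
RESPONSE='{"object":"list","data":[{"id":"/models/cosmos-reason2-2b","object":"model"}]}'
MODEL_ID=$(printf '%s' "$RESPONSE" | python3 -c "import sys, json; print(json.load(sys.stdin)['data'][0]['id'])")
echo "Serving model: $MODEL_ID"
```

This id is exactly the string you must pass as "model" in API requests in the next step.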
Step 5: Test with a quick API call
Make sure your model responds correctly before connecting to the WebUI.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/cosmos-reason2-2b",
    "messages": [
      {
        "role": "user",
        "content": "What features does it have?"
      }
    ],
    "max_tokens": 128
  }' | python3 -m json.tool
Tip: The model name used in the API request must match what vLLM reports. Check with curl http://localhost:8000/v1/models.
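Since this is a vision model, you will usually send an image alongside the text. The sketch below builds a multimodal request body in the OpenAI-compatible image_url/data-URL shape that vLLM accepts; the 1x1 PNG generated here is a placeholder for a real camera frame, purely to show the payload structure.

```shell
# Build a single-image chat request body (the generated 1x1 PNG is a
# stand-in for a real camera frame)
python3 - <<'EOF'
import base64, json, struct, zlib

def chunk(tag, data):
    c = struct.pack(">I", len(data)) + tag + data
    return c + struct.pack(">I", zlib.crc32(tag + data))

# Minimal valid PNG: 1x1 black RGB pixel
ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 2, 0, 0, 0)
idat = zlib.compress(b"\x00\x00\x00\x00")  # filter byte + one RGB pixel
png = (b"\x89PNG\r\n\x1a\n" + chunk(b"IHDR", ihdr)
       + chunk(b"IDAT", idat) + chunk(b"IEND", b""))

body = {
    "model": "/models/cosmos-reason2-2b",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "data:image/png;base64,"
                                  + base64.b64encode(png).decode()}},
            {"type": "text", "text": "Describe this image."},
        ],
    }],
    "max_tokens": 64,
}
with open("request.json", "w") as f:
    json.dump(body, f)
print("wrote request.json")
EOF
```

Then send it to the running server with: curl -s http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d @request.json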
Step 6: Connect to Live VLM WebUI
Live VLM WebUI provides a real-time webcam-to-VLM interface. Pointing it at the vLLM endpoint serving Cosmos Reasoning 2B lets you stream your webcam and get live AI analysis of each frame.
Install the Live VLM WebUI
The easiest way is to install from PyPI with uv (open another terminal).
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
cd ~/Projects/CosmosReasoning
uv venv .live-vlm --python 3.12
source .live-vlm/bin/activate
uv pip install live-vlm-webui
live-vlm-webui
Or use Docker.
git clone https://github.com/nvidia-ai-iot/live-vlm-webui.git
cd live-vlm-webui
./scripts/start_container.sh
Configure the WebUI
1. Open https://localhost:8090 in your browser.
2. Accept the self-signed certificate (click Advanced → Continue).
3. In the VLM API Configuration section of the left sidebar, set the API base URL to http://localhost:8000/v1.
4. Click the Update button to detect the model.
5. Select the Cosmos Reasoning 2B model from the dropdown.
6. Select your camera and click Start.
The WebUI streams webcam frames to Cosmos Reasoning 2B and displays the model’s analysis in real time.
Recommended WebUI settings for Orin Nano Super
The Orin Nano Super runs with a much shorter context length, so adjust the following settings in the WebUI:
Max tokens: set to 100-150 (shorter responses complete faster)
Frame processing interval: set to 60+ (gives the model time between frames)
Troubleshooting
Out of memory on Orin
Issue: vLLM crashes with CUDA out of memory error.
Solution:
Free up system memory before starting.
sudo sysctl -w vm.drop_caches=3
Lower --gpu-memory-utilization (try 0.55 or 0.50)
Reduce --max-model-len further (try 128)
Make sure no other GPU-intensive processes are running
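When debugging out-of-memory crashes, it helps to check how much memory the system actually has free before launching vLLM (Jetson shares RAM between CPU and GPU). A quick check via /proc/meminfo:

```shell
# Check available system memory (kB) before launching vLLM;
# on Jetson, CPU and GPU share this pool
AVAIL_KB=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)
echo "MemAvailable: ${AVAIL_KB} kB (~$((AVAIL_KB / 1024 / 1024)) GiB)"
```

Run this before and after sudo sysctl -w vm.drop_caches=3 to confirm that dropping caches actually freed memory.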
Model not found in WebUI
Issue: Models are not displayed in the Live VLM WebUI dropdown.
Solution:
Verify that vLLM is running: curl http://localhost:8000/v1/models
Ensure that the WebUI API base URL is set to http://localhost:8000/v1 (not https).
If vLLM and the WebUI run in separate containers, use http://<device-ip>:8000/v1 instead of localhost.
Slow inference on Orin
Issue: Each response takes a very long time.
Solution:
This is expected in memory-constrained configurations; Cosmos Reasoning 2B FP8 on Orin prioritizes memory fit over speed.
Reduce max tokens in the WebUI to get shorter, faster responses.
Increase the frame interval to prevent the model from constantly processing new frames.
vLLM fails to load model
Issue: vLLM reports that the model path does not exist or cannot be loaded.
Solution:
Verify that the NGC download completed successfully: ls ~/Projects/CosmosReasoning/cosmos-reason2-2b_v1208-fp8-static-kv8/
Verify that the volume mount path in the docker run command is correct.
Ensure that the model directory is mounted read-only (:ro) and that the in-container path matches what you pass to vllm serve.
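A quick way to confirm the mount from inside the container is to check for the model's config file at the path you will pass to the serve command. This sketch assumes the /models/cosmos-reason2-2b mount point used throughout this tutorial:

```shell
# Run inside the container started in Step 4: confirm the checkpoint
# is visible at the mounted path before serving
if [ -f /models/cosmos-reason2-2b/config.json ]; then
  MOUNT_MSG="mount OK"
else
  MOUNT_MSG="mount missing - check the -v flag on docker run"
fi
echo "$MOUNT_MSG"
```

If the mount is missing, exit the container and re-check that MODEL_PATH was set in the shell where you ran docker run.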
Summary
This tutorial demonstrated how to use vLLM to deploy an NVIDIA Cosmos Reasoning 2B model to the Jetson family of devices.
Cosmos Reasoning 2B’s chain of thought capabilities combined with the real-time streaming of Live VLM WebUI make it ideal for prototyping and evaluating vision AI applications at the edge.