Visual language models (VLMs) represent a major leap forward in AI by blending visual recognition with semantic reasoning. Unlike traditional models constrained to fixed label sets, VLMs leverage a shared embedding space to interpret and discuss complex, open-ended environments in natural language.
Rapid advances in inference accuracy and efficiency have made these models ideal for edge devices. The NVIDIA Jetson family, from the high-performance AGX Thor and AGX Orin to the compact Orin Nano Super, is purpose-built to accelerate applications for physical AI and robotics, delivering the optimized runtimes needed for leading open source models.
This tutorial shows how to use the vLLM framework to deploy the NVIDIA Cosmos Reasoning 2B model across the Jetson lineup. We also show how to connect this model to the Live VLM WebUI to enable a real-time webcam-based interface for interactive physical AI.
Prerequisites
Supported devices:
Jetson AGX Thor Developer Kit
Jetson AGX Orin (64 GB / 32 GB)
Jetson Orin Nano Super
JetPack version:
JetPack 6 (L4T r36.x) — for Orin devices
JetPack 7 (L4T r38.x) — for Thor
Storage: NVMe SSD required
Model weights are up to 5 GB for the FP8 checkpoint, plus up to 8 GB for the vLLM container image
NGC account:
Create an NVIDIA NGC account (free) to download both the model and the vLLM container.
Deployment overview
| | Jetson AGX Thor | Jetson AGX Orin | Jetson Orin Nano Super |
|---|---|---|---|
| vLLM container | nvcr.io/nvidia/vllm:26.01-py3 | ghcr.io/nvidia-ai-iot/vllm:r36.4-tegra-aarch64-cu126-22.04 | ghcr.io/nvidia-ai-iot/vllm:r36.4-tegra-aarch64-cu126-22.04 |
| Model | FP8 via NGC (volume mount) | FP8 via NGC (volume mount) | FP8 via NGC (volume mount) |
| Maximum model length | 8192 tokens | 8192 tokens | 256 tokens (memory constraint) |
| GPU memory utilization | 0.8 | 0.8 | 0.65 |
The workflow is the same on all devices.
1. Download the FP8 model checkpoint via the NGC CLI.
2. Pull the vLLM Docker image for your device.
3. Start a container with the model mounted as a volume.
4. Connect the Live VLM WebUI to the vLLM endpoint.
Step 1: Install NGC CLI
The NGC CLI allows you to download model checkpoints from the NVIDIA NGC catalog.
Download and install
mkdir -p ~/Projects/CosmosReasoning
cd ~/Projects/CosmosReasoning

# Download NGC CLI for ARM64
# Get the latest installer URL from https://org.ngc.nvidia.com/setup/installers/cli
wget -O ngccli_arm64.zip https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/4.13.0/files/ngccli_arm64.zip
unzip ngccli_arm64.zip
chmod u+x ngc-cli/ngc

# Add to PATH
export PATH="$PATH:$(pwd)/ngc-cli"
Configure the CLI
ngc config set
You will be prompted to:
API key — generate one in the NGC API key settings
CLI output format — choose json or ascii
Org — press Enter to accept the default
Step 2: Download the model
Download the FP8-quantized checkpoint, which is used by all Jetson devices.
cd ~/Projects/CosmosReasoning
ngc registry model download-version "nim/nvidia/cosmos-reason2-2b:1208-fp8-static-kv8"
This creates a directory called cosmos-reason2-2b_v1208-fp8-static-kv8/ containing the model weights. Make a note of the full path; you will mount it as a volume in your Docker container.
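Before moving on, it can save time to sanity-check the download. The sketch below assumes the default download location from the commands above; the exact file list inside the checkpoint may differ.

```shell
# Quick sanity check of the downloaded checkpoint
# (path assumes the default location used in Step 2)
MODEL_PATH="$HOME/Projects/CosmosReasoning/cosmos-reason2-2b_v1208-fp8-static-kv8"
if [ -f "$MODEL_PATH/config.json" ]; then
  STATUS="checkpoint looks complete"
  du -sh "$MODEL_PATH"   # roughly 5 GB for the FP8 weights
else
  STATUS="config.json missing - re-run the ngc download"
fi
echo "$STATUS"
```

If the second message appears, re-run the `ngc registry model download-version` command from Step 2.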
Step 3: Pull the vLLM Docker image
For Jetson AGX Thor
docker pull nvcr.io/nvidia/vllm:26.01-py3
For Jetson AGX Orin / Orin Nano Super
docker pull ghcr.io/nvidia-ai-iot/vllm:r36.4-tegra-aarch64-cu126-22.04
Step 4: Serve Cosmos Reasoning 2B with vLLM
Option A: Jetson AGX Thor
Thor has enough GPU memory to run the model with its full context length.
Set the path to the downloaded model and free up cache memory on the host.
MODEL_PATH="$HOME/Projects/CosmosReasoning/cosmos-reason2-2b_v1208-fp8-static-kv8"
sudo sysctl -w vm.drop_caches=3
Start the container with the model mounted.
docker run --rm -it \
  --runtime nvidia \
  --network host \
  --ipc host \
  -v "$MODEL_PATH:/models/cosmos-reason2-2b:ro" \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
  nvcr.io/nvidia/vllm:26.01-py3 \
  bash
Activate the environment within the container and serve the model.
vllm serve /models/cosmos-reason2-2b \
  --max-model-len 8192 \
  --media-io-kwargs '{"video": {"num_frames": -1}}' \
  --reasoning-parser qwen3 \
  --gpu-memory-utilization 0.8
Note: The --reasoning-parser qwen3 flag enables chain-of-thought reasoning extraction. The --media-io-kwargs flag configures how video frames are processed.
Wait until you see the following log line:
INFO: Uvicorn running on http://0.0.0.0:8000
Option B: Jetson AGX Orin
AGX Orin has enough memory to run the model with the same generous parameters as Thor.
Set the path to the downloaded model and free up cache memory on the host.
MODEL_PATH="$HOME/Projects/CosmosReasoning/cosmos-reason2-2b_v1208-fp8-static-kv8"
sudo sysctl -w vm.drop_caches=3
1. Start the container.
docker run --rm -it \
  --runtime nvidia \
  --network host \
  -v "$MODEL_PATH:/models/cosmos-reason2-2b:ro" \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
  ghcr.io/nvidia-ai-iot/vllm:r36.4-tegra-aarch64-cu126-22.04 \
  bash
2. Activate the environment within the container and serve the model:
cd /opt/
source venv/bin/activate
vllm serve /models/cosmos-reason2-2b \
  --max-model-len 8192 \
  --media-io-kwargs '{"video": {"num_frames": -1}}' \
  --reasoning-parser qwen3 \
  --gpu-memory-utilization 0.8
Wait until you see the following log line:
INFO: Uvicorn running on http://0.0.0.0:8000
Option C: Jetson Orin Nano Super (memory-constrained)
The Orin Nano Super has significantly less RAM and requires aggressive memory optimization flags.
Set the path to the downloaded model and free up cache memory on the host.
MODEL_PATH="$HOME/Projects/CosmosReasoning/cosmos-reason2-2b_v1208-fp8-static-kv8"
sudo sysctl -w vm.drop_caches=3
1. Start the container.
docker run --rm -it \
  --runtime nvidia \
  --network host \
  -v "$MODEL_PATH:/models/cosmos-reason2-2b:ro" \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
  ghcr.io/nvidia-ai-iot/vllm:r36.4-tegra-aarch64-cu126-22.04 \
  bash
2. Activate the environment within the container and serve the model:
cd /opt/
source venv/bin/activate
vllm serve /models/cosmos-reason2-2b \
  --host 0.0.0.0 \
  --port 8000 \
  --trust-remote-code \
  --enforce-eager \
  --max-model-len 256 \
  --max-num-batched-tokens 256 \
  --gpu-memory-utilization 0.65 \
  --max-num-seqs 1 \
  --enable-chunked-prefill \
  --limit-mm-per-prompt '{"image":1,"video":1}' \
  --mm-processor-kwargs '{"num_frames":2,"max_pixels":150528}'
Key flag descriptions (Orin Nano Super only):
| Flag | Purpose |
|---|---|
| --enforce-eager | Disable CUDA graphs to save memory |
| --max-model-len 256 | Limit context to fit in available memory |
| --max-num-batched-tokens 256 | Match the model length limit |
| --gpu-memory-utilization 0.65 | Reserve headroom for system processes |
| --max-num-seqs 1 | Serve one sequence at a time to minimize memory |
| --enable-chunked-prefill | Process prefill in chunks for memory efficiency |
| --limit-mm-per-prompt | Limit each prompt to one image and one video |
| --mm-processor-kwargs | Reduce video frame count and image resolution |
| VLLM_SKIP_WARMUP=true | Skip warmup to save time and memory (environment variable) |
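To see why such a short context still helps on a memory-constrained board, a back-of-envelope KV-cache estimate is useful. All the numbers below (layer count, KV heads, head dimension) are illustrative assumptions for a small model, not measured values for cosmos-reason2-2b; the point is the formula, not the exact figure.

```shell
# Rough KV-cache size estimate (illustrative numbers, not model specs)
LAYERS=28        # assumed transformer layer count
KV_HEADS=2       # assumed KV attention heads (with grouped-query attention)
HEAD_DIM=128     # assumed per-head dimension
BYTES=1          # FP8 KV cache: 1 byte per element
TOKENS=256       # the --max-model-len used on Orin Nano Super

# bytes = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_elem * tokens
KV_BYTES=$((2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES * TOKENS))
echo "KV cache for $TOKENS tokens: $((KV_BYTES / 1024)) KiB"
```

Doubling --max-model-len doubles this figure linearly, which is why the context limit is the first knob to turn when memory is tight.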
Wait until the server is confirmed to be ready.
INFO: Uvicorn running on http://0.0.0.0:8000
Make sure the server is running
From another terminal on the Jetson:
curl http://localhost:8000/v1/models
You should see the model listed in the response.
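If you want to script against that response, the model id can be pulled out with Python's stdlib JSON parser. The snippet below runs against a canned response for illustration; in practice you would replace the RESPONSE assignment with the output of curl -s http://localhost:8000/v1/models.

```shell
# Parse the /v1/models response to get the exact model id
# (RESPONSE is a canned example of the server's output shape)
RESPONSE='{"object":"list","data":[{"id":"/models/cosmos-reason2-2b","object":"model"}]}'
MODEL_ID=$(printf '%s' "$RESPONSE" | python3 -c "import sys, json; print(json.load(sys.stdin)['data'][0]['id'])")
echo "Serving model: $MODEL_ID"
```

This id is exactly the string you must pass as "model" in API requests in the next step.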
Step 5: Test with a quick API call
Make sure your model responds correctly before connecting to the WebUI.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/cosmos-reason2-2b",
    "messages": [
      {
        "role": "user",
        "content": "What features does it have?"
      }
    ],
    "max_tokens": 128
  }' | python3 -m json.tool
Tip: The model name used in the API request must match what vLLM reports. Check with curl http://localhost:8000/v1/models.
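Since this is a vision model, you will usually send an image alongside the text. The sketch below builds a multimodal request body in the OpenAI-compatible image_url/data-URL shape that vLLM accepts; the 1x1 PNG generated here is a placeholder for a real camera frame, purely to show the payload structure.

```shell
# Build a single-image chat request body (the generated 1x1 PNG is a
# stand-in for a real camera frame)
python3 - <<'EOF'
import base64, json, struct, zlib

def chunk(tag, data):
    c = struct.pack(">I", len(data)) + tag + data
    return c + struct.pack(">I", zlib.crc32(tag + data))

# Minimal valid PNG: 1x1 black RGB pixel
ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 2, 0, 0, 0)
idat = zlib.compress(b"\x00\x00\x00\x00")  # filter byte + one RGB pixel
png = (b"\x89PNG\r\n\x1a\n" + chunk(b"IHDR", ihdr)
       + chunk(b"IDAT", idat) + chunk(b"IEND", b""))

body = {
    "model": "/models/cosmos-reason2-2b",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "data:image/png;base64,"
                                  + base64.b64encode(png).decode()}},
            {"type": "text", "text": "Describe this image."},
        ],
    }],
    "max_tokens": 64,
}
with open("request.json", "w") as f:
    json.dump(body, f)
print("wrote request.json")
EOF
```

Then send it to the running server with: curl -s http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d @request.json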
Step 6: Connect to Live VLM WebUI
Live VLM WebUI provides a real-time webcam-to-VLM interface. Pointing it at the vLLM endpoint serving Cosmos Reasoning 2B lets you stream your webcam and get live AI analysis of each frame.
Install the Live VLM WebUI
The easiest way is to install from PyPI with uv (open another terminal).
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
cd ~/Projects/CosmosReasoning
uv venv .live-vlm --python 3.12
source .live-vlm/bin/activate
uv pip install live-vlm-webui
live-vlm-webui
Or use Docker.
git clone https://github.com/nvidia-ai-iot/live-vlm-webui.git
cd live-vlm-webui
./scripts/start_container.sh
Configure the WebUI
1. Open https://localhost:8090 in your browser.
2. Accept the self-signed certificate (click Advanced → Continue).
3. In the VLM API Configuration section of the left sidebar, set the API base URL to http://localhost:8000/v1.
4. Click the Update button to detect the model.
5. Select the Cosmos Reasoning 2B model from the dropdown.
6. Select your camera and click Start.
The WebUI streams webcam frames to Cosmos Reasoning 2B and displays the model’s analysis in real time.
Recommended WebUI settings for Orin Nano Super
The Orin Nano Super runs with a much shorter context length, so adjust the following settings in the WebUI:
Max tokens: set to 100-150 (shorter responses complete faster)
Frame processing interval: set to 60+ (gives the model time between frames)
Troubleshooting
Out of memory on Orin
Issue: vLLM crashes with CUDA out of memory error.
Solution:
Free up system memory before starting.
sudo sysctl -w vm.drop_caches=3
Lower --gpu-memory-utilization (try 0.55 or 0.50)
Reduce --max-model-len further (try 128)
Make sure no other GPU-intensive processes are running
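When debugging out-of-memory crashes, it helps to check how much memory the system actually has free before launching vLLM (Jetson shares RAM between CPU and GPU). A quick check via /proc/meminfo:

```shell
# Check available system memory (kB) before launching vLLM;
# on Jetson, CPU and GPU share this pool
AVAIL_KB=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)
echo "MemAvailable: ${AVAIL_KB} kB (~$((AVAIL_KB / 1024 / 1024)) GiB)"
```

Run this before and after sudo sysctl -w vm.drop_caches=3 to confirm that dropping caches actually freed memory.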
Model not found in WebUI
Issue: Models are not displayed in the Live VLM WebUI dropdown.
Solution:
Verify that vLLM is running: curl http://localhost:8000/v1/models
Ensure that the WebUI API base URL is set to http://localhost:8000/v1 (not https).
If vLLM and the WebUI run in separate containers, use http://<device-ip>:8000/v1 instead of localhost.
Slow inference on Orin
Issue: Each response takes a very long time.
Solution:
This is expected in memory-constrained configurations; Cosmos Reasoning 2B FP8 on Orin prioritizes memory fit over speed.
Reduce max tokens in the WebUI to get shorter, faster responses.
Increase the frame interval to prevent the model from constantly processing new frames.
vLLM fails to load model
Issue: vLLM reports that the model path does not exist or cannot be loaded.
Solution:
Verify that the NGC download completed successfully: ls ~/Projects/CosmosReasoning/cosmos-reason2-2b_v1208-fp8-static-kv8/
Verify that the volume mount path in the docker run command is correct.
Ensure that the model directory is mounted read-only (:ro) and that the in-container path matches what you pass to vllm serve.
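A quick way to confirm the mount from inside the container is to check for the model's config file at the path you will pass to the serve command. This sketch assumes the /models/cosmos-reason2-2b mount point used throughout this tutorial:

```shell
# Run inside the container started in Step 4: confirm the checkpoint
# is visible at the mounted path before serving
if [ -f /models/cosmos-reason2-2b/config.json ]; then
  MOUNT_MSG="mount OK"
else
  MOUNT_MSG="mount missing - check the -v flag on docker run"
fi
echo "$MOUNT_MSG"
```

If the mount is missing, exit the container and re-check that MODEL_PATH was set in the shell where you ran docker run.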
Summary
This tutorial demonstrated how to use vLLM to deploy an NVIDIA Cosmos Reasoning 2B model to the Jetson family of devices.
Cosmos Reasoning 2B’s chain of thought capabilities combined with the real-time streaming of Live VLM WebUI make it ideal for prototyping and evaluating vision AI applications at the edge.