Deploying an open source vision language model (VLM) on Jetson

By versatileai · February 24, 2026

Vision language models (VLMs) represent a major leap forward in AI by blending visual recognition with semantic reasoning. VLMs go beyond traditional models constrained to fixed label sets, leveraging a joint embedding space to interpret and discuss complex, open-ended environments in natural language.
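The embedding-space idea can be sketched in a few lines: an image and candidate text descriptions are mapped into the same vector space, and cosine similarity scores how well each description matches the image. The vectors below are made-up toy values for illustration only; a real VLM produces them with learned image and text encoders.

```python
import math

def cosine_similarity(a, b):
    """Score how well two embeddings match (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy embeddings: in a real VLM these come from learned encoders.
image_embedding = [0.9, 0.1, 0.2]
captions = {
    "a robot arm picking up a box": [0.88, 0.12, 0.25],
    "a cat sleeping on a couch": [0.1, 0.95, 0.05],
}

# The caption whose embedding is closest to the image wins.
best = max(captions, key=lambda c: cosine_similarity(image_embedding, captions[c]))
print(best)
```

Open-vocabulary behavior falls out of this design: any sentence can be embedded and scored, so the model is not limited to a fixed label list.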

Rapid advances in inference accuracy and efficiency have made these models ideal for edge devices. The NVIDIA Jetson family, from the high-performance AGX Thor and AGX Orin to the compact Orin Nano Super, is purpose-built to accelerate applications for physical AI and robotics, delivering the optimized runtimes needed for leading open source models.

This tutorial shows how to use the vLLM framework to deploy the NVIDIA Cosmos Reasoning 2B model across the Jetson lineup. We also show how to connect the model to the Live VLM WebUI to enable a real-time, webcam-based interface for interactive physical AI.

Prerequisites

Supported devices:

  • Jetson AGX Thor Developer Kit
  • Jetson AGX Orin (64 GB / 32 GB)
  • Jetson Orin Nano Super

JetPack version:

  • JetPack 6 (L4T r36.x) for Orin devices
  • JetPack 7 (L4T r38.x) for Thor

Storage: NVMe SSD required. The weights take up to 5 GB for the FP8 model, and the vLLM container images up to 8 GB.

Account: create a free NVIDIA NGC account to download both the model and the vLLM container.

Overview

  • Jetson AGX Thor: vLLM container nvcr.io/nvidia/vllm:26.01-py3; model FP8 via NGC (volume mount); max model length 8192 tokens; GPU memory utilization 0.8
  • Jetson AGX Orin: vLLM container ghcr.io/nvidia-ai-iot/vllm:r36.4-tegra-aarch64-cu126-22.04; model FP8 via NGC (volume mount); max model length 8192 tokens; GPU memory utilization 0.8
  • Jetson Orin Nano Super: vLLM container ghcr.io/nvidia-ai-iot/vllm:r36.4-tegra-aarch64-cu126-22.04; model FP8 via NGC (volume mount); max model length 256 tokens (memory constraint); GPU memory utilization 0.65

The workflow is the same on all devices.

1. Download the FP8 model checkpoint via the NGC CLI.
2. Pull the vLLM Docker image for your device.
3. Start a container with the model mounted as a volume.
4. Connect the Live VLM WebUI to the vLLM endpoint.
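The per-device settings from the overview table can be captured in a small helper that assembles the serve command. This is an illustrative sketch of my own, not official tooling; the device keys are hypothetical names, while the flag values mirror the table above.

```python
# Per-device vLLM settings, mirroring the overview table above.
DEVICE_CONFIG = {
    "agx-thor": {"max_model_len": 8192, "gpu_mem_util": 0.8},
    "agx-orin": {"max_model_len": 8192, "gpu_mem_util": 0.8},
    "orin-nano-super": {"max_model_len": 256, "gpu_mem_util": 0.65},
}

def serve_command(device, model_path="/models/cosmos-reason2-2b"):
    """Assemble a vllm serve invocation for a given Jetson device."""
    cfg = DEVICE_CONFIG[device]
    return (
        f"vllm serve {model_path} "
        f"--max-model-len {cfg['max_model_len']} "
        f"--gpu-memory-utilization {cfg['gpu_mem_util']}"
    )

print(serve_command("orin-nano-super"))
```

The Nano Super entry shows the pattern that recurs throughout this tutorial: the same model, but a much smaller context and more conservative memory budget.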

Step 1: Install NGC CLI

NGC CLI allows you to download model checkpoints from the NVIDIA NGC catalog.

Download and install

mkdir -p ~/Projects/CosmosReasoning
cd ~/Projects/CosmosReasoning

# Download NGC CLI for ARM64
# Get the latest installer URL from https://org.ngc.nvidia.com/setup/installers/cli
wget -O ngccli_arm64.zip https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/4.13.0/files/ngccli_arm64.zip
unzip ngccli_arm64.zip
chmod u+x ngc-cli/ngc

# Add to PATH
export PATH="$PATH:$(pwd)/ngc-cli"

Configure CLI

ngc config set

You will be prompted to:

  • API key: generate one in the NGC API key settings
  • CLI output format: choose json or ascii
  • Org: press Enter to accept the default

Step 2: Download the model

Download the FP8 quantization checkpoint. This is used by all Jetson devices.

cd ~/Projects/CosmosReasoning
ngc registry model download-version "nim/nvidia/cosmos-reason2-2b:1208-fp8-static-kv8"

This creates a directory called cosmos-reason2-2b_v1208-fp8-static-kv8/ containing the model weights. Note the full path; you will mount it as a volume in your Docker container.
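Before wiring the directory into Docker, it can save a debugging round-trip to confirm the download actually contains usable weights. The heuristic below is my own (not part of the NGC tooling): it just checks for a config file and at least one weights file.

```python
import os

def looks_like_checkpoint(path):
    """Heuristic: a usable checkpoint dir has config.json plus weight files."""
    try:
        names = os.listdir(path)
    except FileNotFoundError:
        return False
    has_config = "config.json" in names
    has_weights = any(n.endswith((".safetensors", ".bin")) for n in names)
    return has_config and has_weights

# Example (path from this tutorial):
# looks_like_checkpoint(os.path.expanduser(
#     "~/Projects/CosmosReasoning/cosmos-reason2-2b_v1208-fp8-static-kv8"))
```

If this returns False, re-run the NGC download before proceeding.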

Step 3: Pull the vLLM Docker image

For Jetson AGX Thor

docker pull nvcr.io/nvidia/vllm:26.01-py3

For Jetson AGX Orin / Orin Nano Super

docker pull ghcr.io/nvidia-ai-iot/vllm:r36.4-tegra-aarch64-cu126-22.04

Step 4: Serve Cosmos Reasoning 2B using vLLM

Option A: Jetson AGX Thor

Thor has enough GPU memory to run the model with a generous context length.

Set the path to the downloaded model and free up cache memory on the host.

MODEL_PATH="$HOME/Projects/CosmosReasoning/cosmos-reason2-2b_v1208-fp8-static-kv8"
sudo sysctl -w vm.drop_caches=3

Start the container with the model mounted.

docker run --rm -it \
  --runtime nvidia \
  --network host \
  --ipc host \
  -v "$MODEL_PATH:/models/cosmos-reason2-2b:ro" \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
  nvcr.io/nvidia/vllm:26.01-py3 \
  bash

Inside the container, serve the model:

vllm serve /models/cosmos-reason2-2b \
  --max-model-len 8192 \
  --media-io-kwargs '{"video": {"num_frames": -1}}' \
  --reasoning-parser qwen3 \
  --gpu-memory-utilization 0.8

Note: The --reasoning-parser qwen3 flag enables extraction of the model's chain-of-thought reasoning. The --media-io-kwargs flag configures video frame processing.

Wait until you see the following output:

INFO:     Uvicorn running on http://0.0.0.0:8000

Option B: Jetson AGX Orin

AGX Orin has enough memory to run the model with the same generous parameters as Thor.

Set the path to the downloaded model and free up cache memory on the host.

MODEL_PATH="$HOME/Projects/CosmosReasoning/cosmos-reason2-2b_v1208-fp8-static-kv8"
sudo sysctl -w vm.drop_caches=3

1. Start the container.

docker run --rm -it \
  --runtime nvidia \
  --network host \
  -v "$MODEL_PATH:/models/cosmos-reason2-2b:ro" \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
  ghcr.io/nvidia-ai-iot/vllm:r36.4-tegra-aarch64-cu126-22.04 \
  bash

2. Inside the container, activate the virtual environment and serve the model:

cd /opt
source venv/bin/activate
vllm serve /models/cosmos-reason2-2b \
  --max-model-len 8192 \
  --media-io-kwargs '{"video": {"num_frames": -1}}' \
  --reasoning-parser qwen3 \
  --gpu-memory-utilization 0.8

Wait until you see the following output:

INFO:     Uvicorn running on http://0.0.0.0:8000

Option C: Jetson Orin Nano Super (memory constrained)

The Orin Nano Super has significantly less RAM and requires aggressive memory-optimization flags.

Set the path to the downloaded model and free up cache memory on the host.

MODEL_PATH="$HOME/Projects/CosmosReasoning/cosmos-reason2-2b_v1208-fp8-static-kv8"
sudo sysctl -w vm.drop_caches=3

1. Start the container.

docker run --rm -it \
  --runtime nvidia \
  --network host \
  -v "$MODEL_PATH:/models/cosmos-reason2-2b:ro" \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
  ghcr.io/nvidia-ai-iot/vllm:r36.4-tegra-aarch64-cu126-22.04 \
  bash

2. Inside the container, activate the virtual environment and serve the model:

cd /opt
source venv/bin/activate
vllm serve /models/cosmos-reason2-2b \
  --host 0.0.0.0 \
  --port 8000 \
  --trust-remote-code \
  --enforce-eager \
  --max-model-len 256 \
  --max-num-batched-tokens 256 \
  --gpu-memory-utilization 0.65 \
  --max-num-seqs 1 \
  --enable-chunked-prefill \
  --limit-mm-per-prompt '{"image":1,"video":1}' \
  --mm-processor-kwargs '{"num_frames":2,"max_pixels":150528}'

Key flag descriptions (Orin Nano Super only):

  • --enforce-eager: disable CUDA graphs to save memory
  • --max-model-len 256: limit context to fit in available memory
  • --max-num-batched-tokens 256: match the model length limit
  • --gpu-memory-utilization 0.65: reserve headroom for system processes
  • --max-num-seqs 1: process one sequence at a time to minimize memory
  • --enable-chunked-prefill: process prefill in chunks for memory efficiency
  • --limit-mm-per-prompt: limit to one image and one video per prompt
  • --mm-processor-kwargs: reduce video frame count and image resolution
  • VLLM_SKIP_WARMUP=true: environment variable to skip warmup, saving time and memory
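To see why --max-model-len dominates the memory budget, it helps to estimate KV-cache size: every token stores a key and a value vector per layer per KV head. The layer/head/dimension numbers below are illustrative placeholders, not the actual Cosmos Reasoning 2B architecture; the point is that the cache grows linearly with context length.

```python
def kv_cache_bytes(max_len, layers, kv_heads, head_dim, bytes_per_elem):
    """Approximate KV-cache size: 2 (K and V) per token, per layer, per KV head."""
    return 2 * max_len * layers * kv_heads * head_dim * bytes_per_elem

# Illustrative numbers only (not the real model architecture):
# 28 layers, 4 KV heads, head_dim 128, FP8 KV cache (1 byte/element).
small = kv_cache_bytes(256, 28, 4, 128, 1)
large = kv_cache_bytes(8192, 28, 4, 128, 1)
print(f"256 tokens:  {small / 1024**2:.1f} MiB")
print(f"8192 tokens: {large / 1024**2:.1f} MiB")
```

Going from 256 to 8192 tokens multiplies the cache 32x under these assumptions, which is why the Nano Super config trades context for fit.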

Wait until the server reports it is ready:

INFO:     Uvicorn running on http://0.0.0.0:8000

Verify the server is running

From another terminal on the Jetson:

curl http://localhost:8000/v1/models

You should see the model listed in the response.

Step 5: Test with a quick API call

Make sure your model responds correctly before connecting to the WebUI.

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/cosmos-reason2-2b",
    "messages": [
      { "role": "user", "content": "What features does it have?" }
    ],
    "max_tokens": 128
  }' | python3 -m json.tool

Tip: The model name used in the API request must match what vLLM reports. Check with curl http://localhost:8000/v1/models.
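For scripting beyond curl, the same request can be built with Python's standard library. The payload shape follows the OpenAI-compatible chat completions API that vLLM exposes; the multimodal variant (an image passed as a data URL) is an optional extra, and the helper names here are my own.

```python
import json
import urllib.request

def build_chat_request(model, prompt, image_data_url=None, max_tokens=128):
    """Build an OpenAI-compatible /v1/chat/completions payload."""
    if image_data_url is None:
        content = prompt  # plain-text prompt
    else:
        content = [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_data_url}},
        ]
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    }

def post_chat(payload, base_url="http://localhost:8000"):
    """Send the request to a running vLLM server and return the parsed reply."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

As with the curl example, the model field must match what the server reports at /v1/models.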

Step 6: Connect to Live VLM WebUI

Live VLM WebUI provides a real-time webcam-to-VLM interface. With Cosmos Reasoning 2B served by vLLM, you can stream your webcam and get live AI analysis of each frame.

Install the live VLM WebUI

The easiest way is to install it with uv (open another terminal).

curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
cd ~/Projects/CosmosReasoning
uv venv .live-vlm --python 3.12
source .live-vlm/bin/activate
uv pip install live-vlm-webui
live-vlm-webui

Or use Docker.

git clone https://github.com/nvidia-ai-iot/live-vlm-webui.git
cd live-vlm-webui
./scripts/start_container.sh

Configure the WebUI

1. Open https://localhost:8090 in your browser.
2. Accept the self-signed certificate (click Advanced → Continue).
3. In the VLM API Configuration section of the left sidebar, set the API base URL to http://localhost:8000/v1.
4. Click the Update button to detect the model.
5. Select the Cosmos Reasoning 2B model from the dropdown.
6. Select your camera and click Start.

The WebUI streams webcam frames to Cosmos Reasoning 2B and displays the model’s analysis in real time.

Recommended WebUI settings for Orin Nano Super

The Orin Nano Super runs with a much shorter context length, so adjust the following settings in the WebUI:

  • Max tokens: set to 100-150 (shorter responses complete faster)
  • Frame processing interval: set to 60+ (gives the model time between frames)

Troubleshooting

Orin is running out of memory

Issue: vLLM crashes with a CUDA out-of-memory error.

Solution:

Free up system memory before starting:

sudo sysctl -w vm.drop_caches=3

Then:

  • Lower --gpu-memory-utilization (try 0.55 or 0.50)
  • Reduce --max-model-len further (try 128)
  • Make sure no other GPU-intensive processes are running

Model not found in WebUI

Issue: Models are not displayed in the Live VLM WebUI dropdown.

Solution:

  • Verify that vLLM is running: curl http://localhost:8000/v1/models
  • Ensure the WebUI API base URL is set to http://localhost:8000/v1 (not https).
  • If vLLM and the WebUI run in separate containers, use the device's IP address (http://<device-ip>:8000/v1) instead of localhost.

Inference is slow on Orin

Issue: each response takes a very long time.

Solution:

  • This is expected in memory-constrained configurations; Cosmos Reasoning 2B FP8 on Orin prioritizes fitting in memory over speed.
  • Reduce max_tokens in the WebUI for shorter, faster responses.
  • Increase the frame interval so the model is not constantly processing new frames.

vLLM fails to load model

Issue: vLLM reports that the model path does not exist or cannot be loaded.

Solution:

  • Verify that the NGC download completed successfully: ls ~/Projects/CosmosReasoning/cosmos-reason2-2b_v1208-fp8-static-kv8/
  • Verify the volume mount path in your docker run command.
  • Ensure the model directory is mounted read-only (:ro) and that the path inside the container matches what you pass to vllm serve.

Summary

This tutorial demonstrated how to use vLLM to deploy an NVIDIA Cosmos Reasoning 2B model to the Jetson family of devices.

Cosmos Reasoning 2B’s chain of thought capabilities combined with the real-time streaming of Live VLM WebUI make it ideal for prototyping and evaluating vision AI applications at the edge.

