As the capabilities of large language models (LLMs) have improved, a new class of models has emerged: vision language models (VLMs). These models can analyze images and videos to describe scenes, create captions, and answer questions about visual content.
Running AI models on your own device can be difficult because these models are often computationally demanding. But it also offers significant benefits: increased privacy, since your data never leaves your machine, and better speed and reliability, since no internet connection or external server is required. This is where tools like Optimum Intel and OpenVINO come in, along with smaller, more efficient models like SmolVLM. This blog post shows how to run a VLM locally in three simple steps, without expensive hardware or dedicated GPUs (although all code samples in this post can also run on Intel GPUs).
Deploy a model using Optimum
Small models like SmolVLM are built to consume fewer resources, but they can be optimized further. In this blog post, we will show you how to optimize your model to reduce memory usage, speed up inference, and make deployment on resource-constrained devices more efficient.
To follow this tutorial, you will need optimum-intel and openvino installed. This can be done like this:

pip install optimum-intel[openvino] transformers==4.52.*
Step 1: Convert the model
First, you need to convert your model to OpenVINO IR. There are multiple options to do this.
You can use the Optimum CLI:

optimum-cli export openvino -m HuggingFaceTB/SmolVLM2-256M-Video-Instruct smolvlm_ov/

Alternatively, you can convert the model on the fly when loading it:
from optimum.intel import OVModelForVisualCausalLM

model_id = "HuggingFaceTB/SmolVLM2-256M-Video-Instruct"
model = OVModelForVisualCausalLM.from_pretrained(model_id)
model.save_pretrained("smolvlm_ov")
Step 2: Quantization
Next, optimize the model. Quantization reduces the precision of model weights and activations, making the model smaller and faster. Essentially, it is a way to map values from high-precision data types, such as 32-bit floating point numbers (FP32), to lower-precision formats, typically 8-bit integers (INT8). While quantization brings important benefits, it can also cause some loss of accuracy.
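To make the FP32-to-INT8 mapping concrete, here is a minimal NumPy sketch of symmetric 8-bit quantization. It is illustrative only; the actual quantization applied by Optimum/OpenVINO uses more sophisticated schemes (e.g. per-channel scales).

```python
# Minimal sketch of symmetric INT8 quantization, for illustration only.
import numpy as np

def quantize_int8(weights):
    """Map FP32 values to INT8 using a single scale factor."""
    scale = float(np.abs(weights).max()) / 127.0  # largest magnitude maps to 127
    q = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map INT8 values back to approximate FP32."""
    return q.astype(np.float32) * scale

w = np.array([0.05, -1.2, 0.73, 2.0], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# INT8 storage is 4x smaller than FP32; the round-trip error is at most scale/2
print(q)
print(np.abs(w - w_hat).max(), scale / 2)
```

Each stored value now occupies one byte instead of four, at the cost of a small, bounded rounding error.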
Optimum supports two main post-training quantization methods: weight-only quantization and static quantization.
Let’s examine each one.
Option 1: Quantization of weights only
Weight-only quantization means that only the weights are quantized, and the activations remain at their original precision. The result is smaller, more memory efficient models, and faster load times. However, since activations are not quantized, there is a limit to the improvement in inference speed. Weight-only quantization is an easy first step because it usually does not result in significant accuracy loss.
Starting with OpenVINO 2024.3, if a model’s weights are quantized, the corresponding activations are also quantized at runtime, making them even faster depending on the device.
To do this, create an OVWeightQuantizationConfig quantization configuration as follows:
from optimum.intel import OVModelForVisualCausalLM, OVWeightQuantizationConfig

q_config = OVWeightQuantizationConfig(bits=8)
q_model = OVModelForVisualCausalLM.from_pretrained(model_id, quantization_config=q_config)
q_model.save_pretrained("smolvlm_int8")
or use the equivalent CLI command:

optimum-cli export openvino -m HuggingFaceTB/SmolVLM2-256M-Video-Instruct --weight-format int8 smolvlm_int8/
Option 2: Static quantization
In static quantization, both weights and activations are quantized before inference. To estimate the activation quantization parameters, a calibration step is performed in which a small representative dataset is fed to the model. In our case, we use 50 calibration samples, applying static quantization to the vision encoder and weight-only quantization to the rest of the model. Experiments show that statically quantizing the vision encoder provides significant performance improvements without significant accuracy loss. Because the vision encoder is called only once per generation, the overall gain from optimizing this component is lower than the gain from optimizing more frequently invoked components, such as the language model. Still, this approach can be beneficial in certain scenarios, for example when short answers are expected, especially with multiple images as input.
from optimum.intel import OVModelForVisualCausalLM, OVPipelineQuantizationConfig, OVQuantizationConfig, OVWeightQuantizationConfig

q_config = OVPipelineQuantizationConfig(
    quantization_configs={
        "lm_model": OVWeightQuantizationConfig(bits=8),
        "text_embeddings_model": OVWeightQuantizationConfig(bits=8),
        "vision_embeddings_model": OVQuantizationConfig(bits=8),
    },
    dataset=dataset,
    num_samples=num_samples,
)
q_model = OVModelForVisualCausalLM.from_pretrained(model_id, quantization_config=q_config)
q_model.save_pretrained("smolvlm_static_int8")
Quantizing activations adds small errors that can accumulate and affect accuracy, so careful subsequent testing is important. See the documentation for details and examples.
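The accumulation effect can be seen in a toy experiment (random linear layers, not a real VLM): round-tripping every activation through INT8, as static quantization does at runtime, makes the relative error grow with depth.

```python
# Toy illustration of activation-quantization error accumulating with depth.
import numpy as np

rng = np.random.default_rng(0)

def int8_roundtrip(v):
    """Quantize a tensor to INT8 and back."""
    scale = np.abs(v).max() / 127.0
    return np.clip(np.round(v / scale), -128, 127) * scale

x = rng.standard_normal(256)
# 8 random layers scaled so activation magnitudes stay roughly constant
layers = [rng.standard_normal((256, 256)) / 16.0 for _ in range(8)]

exact = approx = x
rel_errors = []
for w in layers:
    exact = w @ exact
    approx = int8_roundtrip(w @ approx)  # quantize activations at every layer
    rel_errors.append(np.abs(exact - approx).mean() / np.abs(exact).mean())

print(f"after layer 1: {rel_errors[0]:.2e}, after layer 8: {rel_errors[-1]:.2e}")
```

This is why a calibration dataset and careful post-quantization evaluation matter more for static quantization than for weight-only quantization.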
Step 3: Perform inference
Now you can perform inference using the quantized model.
generated_ids = q_model.generate(**inputs, max_new_tokens=100)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts[0])
If you have a recent Intel laptop, Intel AI PC, or Intel discrete GPU, you can load the model onto the GPU by passing device="GPU" when loading it:
model = OVModelForVisualCausalLM.from_pretrained(model_id, device="GPU")
We also created a space where you can play with the original model and the quantized variants obtained by applying weight-only and mixed quantization, respectively. This demo runs on a 4th generation Intel Xeon (Sapphire Rapids) processor.
Check out the notebook to reproduce these results.
Evaluation and conclusion
We ran benchmarks to compare the performance of PyTorch, OpenVINO, and OpenVINO 8-bit WOQ versions of the original model. The goal was to evaluate the impact of weight-only quantization on latency and throughput on Intel CPU hardware. This test used a single image as input.
To evaluate the performance of our model, we measured the following metrics:
Time to first token (TTFT): the time it takes to generate the first output token.
Time per output token (TPOT): the average time it takes to generate each subsequent output token.
End-to-end latency: the total time it takes to generate all output tokens.
Decoding throughput: the number of tokens per second the model produces during the decoding phase.
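These metrics can all be derived from per-token completion timestamps. The helper below is a hypothetical sketch of that calculation, not the harness used for the benchmark:

```python
# Deriving the four metrics from a request start time plus the
# completion time of each generated token (hypothetical helper).
def generation_metrics(start, token_times):
    ttft = token_times[0] - start                   # time to first token
    e2e = token_times[-1] - start                   # end-to-end latency
    decode_time = token_times[-1] - token_times[0]  # decoding phase only
    n_decoded = len(token_times) - 1                # tokens after the first
    return {
        "ttft": ttft,
        "tpot": decode_time / n_decoded,            # time per output token
        "e2e": e2e,
        "throughput": n_decoded / decode_time,      # decode tokens/second
    }

# e.g. first token at t=0.25 s, then four more at 20 ms intervals
m = generation_metrics(0.0, [0.25, 0.27, 0.29, 0.31, 0.33])
print(m)  # ttft=0.25, tpot≈0.02, e2e=0.33, throughput≈50 tokens/s
```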
Here are the results for Intel CPU:
Setup               TTFT (s)   TPOT (s)   End-to-end latency (s)   Decoding throughput (tokens/s)
pytorch             5.150      1.385      25.927                   0.722
openvino            0.420      0.021      0.738                    47.237
openvino-8bit-woq   0.247      0.016      0.482                    63.928
This benchmark shows how a small, optimized multimodal model such as SmolVLM2-256M performs on Intel CPUs. Testing showed that the PyTorch version exhibited high latency, with a time to first token (TTFT) above 5 seconds and a decoding throughput of 0.7 tokens/second. Simply converting the model with Optimum and running it with OpenVINO significantly reduces TTFT to 0.42 seconds (a 12x speedup) and increases throughput to 47 tokens/second (65x). Applying 8-bit weight-only quantization further reduces TTFT (1.7x) and increases throughput (1.4x), while also reducing model size and improving efficiency.
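As a quick sanity check, the speedup factors quoted above can be recomputed directly from the table:

```python
# Recomputing the reported speedups from the benchmark table
results = {
    "pytorch":           {"ttft": 5.150, "throughput": 0.722},
    "openvino":          {"ttft": 0.420, "throughput": 47.237},
    "openvino-8bit-woq": {"ttft": 0.247, "throughput": 63.928},
}

ov_ttft_speedup = results["pytorch"]["ttft"] / results["openvino"]["ttft"]
ov_tput_speedup = results["openvino"]["throughput"] / results["pytorch"]["throughput"]
woq_ttft_speedup = results["openvino"]["ttft"] / results["openvino-8bit-woq"]["ttft"]
woq_tput_speedup = results["openvino-8bit-woq"]["throughput"] / results["openvino"]["throughput"]

print(f"OpenVINO vs PyTorch:   TTFT x{ov_ttft_speedup:.1f}, throughput x{ov_tput_speedup:.1f}")
print(f"WOQ vs plain OpenVINO: TTFT x{woq_ttft_speedup:.1f}, throughput x{woq_tput_speedup:.1f}")
```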
Platform configuration for the above performance data:
System board: MSI B860M GAMING PLUS WIFI (MS-7E42)
CPU: Intel® Core™ Ultra 7 265K
Socket/Physical Core: 1/20 (20 threads)
Hyper-Threading/Turbo Settings: Disabled
Memory: 64GB DDR5 @ 6400MHz
TDP: 665W
BIOS: American Megatrends International, LLC. 2.A10
BIOS release date: November 28, 2024
OS: Ubuntu 24.10
Kernel: 6.11.0–25-generic
OpenVINO version: 2025.2.0
Torch: 2.8.0
TorchVision: 0.23.0+cpu
Optimum Intel: 1.25.2
Transformers: 4.53.3
Benchmark date: May 15, 2025
Benchmarked by: Intel Corporation. Performance may vary depending on usage, configuration, and other factors. See the platform configuration above.
Useful links and resources
Notice and Disclaimer
Performance varies depending on usage, configuration, and other factors. For more information, please visit the Performance Index site. Performance results are based on testing as of the date indicated in the configuration and may not reflect all publicly available updates. For configuration details, see Backup. No product or component can be absolutely secure. Costs and results may vary. Intel technologies may require enabled hardware, software, or service activation. © Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.

