Optimum: Optimize and deploy with Intel and OpenVINO GenAI

February 26, 2025

Deploying transformer models on the edge or client side requires careful consideration of performance and compatibility. Python is powerful, but it is not always ideal for such deployments, especially in C++-dominated environments. In this blog, we use Optimum-Intel and OpenVINO™ GenAI to optimize and deploy Hugging Face transformer models, ensuring efficient AI inference with minimal dependencies.

Table of Contents

  • Why use OpenVINO™ for edge deployment?
  • Step 1: Set up your environment
  • Step 2: Export the model to OpenVINO IR
  • Step 3: Optimize the model
  • Step 4: Deploy with the OpenVINO GenAI API
  • Conclusion

Why use OpenVINO™ for Edge Deployment?

OpenVINO™ was originally developed as a C++ AI inference solution, making it ideal for edge and client deployments where minimizing dependencies is critical. With the introduction of the GenAI API, designed to simplify deployment and enhance performance, integrating large language models (LLMs) into C++ or Python applications becomes even easier.

Step 1: Set up your environment

Prerequisites

To get started, make sure your environment is properly configured for both Python and C++. Install the required Python packages:

pip install --upgrade --upgrade-strategy eager "optimum[openvino]"

The specific packages used in this blog post are:

  • transformers==4.44
  • openvino==24.3
  • openvino-tokenizers==24.3
  • optimum-intel==1.20
  • lm-eval==0.4.3

Follow these instructions to install the GenAI C++ library.

Step 2: Export the model to OpenVINO IR

Optimum-Intel is the result of a collaboration between Hugging Face and Intel. It is designed to optimize transformer models for inference on Intel hardware. Optimum-Intel supports OpenVINO as an inference backend, and its API includes wrappers for various model architectures built on top of the OpenVINO inference API. All of these wrappers start with an OV prefix, for example OVModelForCausalLM. Otherwise, the API is similar to that of the 🤗 Transformers library.

Two options are available to export a transformer model to the OpenVINO intermediate representation (IR): the Python .from_pretrained() method or the Optimum command line interface (CLI). Below is an example using both methods.

Use the Python API

from optimum.intel import OVModelForCausalLM

model_id = "meta-llama/Meta-Llama-3.1-8B"
model = OVModelForCausalLM.from_pretrained(model_id, export=True)
model.save_pretrained("./llama-3.1-8b-ov")

Using the Command Line Interface (CLI)

optimum-cli export openvino -m meta-llama/Meta-Llama-3.1-8B ./llama-3.1-8b-ov

The ./llama-3.1-8b-ov folder contains the .xml and .bin IR model files, as well as the required configuration files from the source model. The tokenizer is also converted to the OpenVINO Tokenizers library format, and the corresponding configuration files are created in the same folder.
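
To sanity-check the exported folder before moving on (this step is not in the original walkthrough), the IR and its converted tokenizer can be reloaded with Optimum-Intel and 🤗 Transformers for a quick test generation; the prompt below is only a placeholder:

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

# Reload the exported IR and its tokenizer directly from the export folder
model = OVModelForCausalLM.from_pretrained("./llama-3.1-8b-ov")
tokenizer = AutoTokenizer.from_pretrained("./llama-3.1-8b-ov")

inputs = tokenizer("What is OpenVINO?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))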

Step 3: Optimize the model

Model optimization is highly recommended when running LLMs on resource-constrained edge and client devices. Weight-only quantization is a mainstream approach that significantly reduces latency and model footprint. Optimum-Intel provides weight-only quantization through the Neural Network Compression Framework (NNCF), which features a variety of optimization techniques designed specifically for LLMs: from data-free INT8 and INT4 weight quantization to data-aware methods such as AWQ, GPTQ, quantization scale estimation, and mixed-precision quantization. By default, the weights of models with more than one billion parameters are quantized to INT8 precision, which is safe in terms of accuracy. This means that the export steps above produce a model with 8-bit weights. However, 4-bit integer weight-only quantization allows for a better accuracy/performance trade-off.
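
As a small illustrative sketch (not part of the original post), the default data-free 8-bit behavior can also be requested explicitly through OVWeightQuantizationConfig:

from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

# Data-free weight-only quantization to INT8 (the default applied to models with >1B parameters)
model = OVModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B",
    export=True,
    quantization_config=OVWeightQuantizationConfig(bits=8),
)
model.save_pretrained("./llama-3.1-8b-ov-int8")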

For the meta-llama/Meta-Llama-3.1-8B model, we recommend stacking AWQ and quantization scale estimation with mixed-precision INT4/INT8 weight quantization, using a calibration dataset that reflects the deployment use case. As with export, there are two options for applying 4-bit weight-only quantization to an LLM.

Use the Python API

Specify the quantization_config parameter in the .from_pretrained() method. In this case, create an OVWeightQuantizationConfig object and pass it to this parameter as follows:

from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

model_id = "meta-llama/Meta-Llama-3.1-8B"
quantization_config = OVWeightQuantizationConfig(
    bits=4,
    awq=True,
    scale_estimation=True,
    group_size=64,
    dataset="c4",
)
model = OVModelForCausalLM.from_pretrained(model_id, export=True, quantization_config=quantization_config)
model.save_pretrained("./llama-3.1-8b-ov")

Using the Command Line Interface (CLI):

optimum-cli export openvino -m meta-llama/Meta-Llama-3.1-8B --weight-format int4 --awq --scale-estimation --group-size 64 --dataset wikitext2 ./llama-3.1-8b-ov

Note: Model optimization takes time, as it applies several methods in sequence and runs model inference over the specified dataset.

Optimizing models with the API is more flexible: for example, the calibration dataset can be passed as an iterable object, such as an instance of a Dataset object from the 🤗 Datasets library or a list of strings.
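
For illustration only (the prompts below are hypothetical placeholders), a small list of in-domain strings can be passed directly as the calibration dataset:

from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

# Hypothetical in-domain calibration prompts; replace with text from the target use case
calibration_texts = [
    "Summarize the following support ticket: ...",
    "Translate this product description into French: ...",
]

quantization_config = OVWeightQuantizationConfig(
    bits=4,
    awq=True,
    scale_estimation=True,
    group_size=64,
    dataset=calibration_texts,  # a list of strings instead of a named dataset
)
model = OVModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B", export=True, quantization_config=quantization_config
)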

Weight-only quantization usually causes some degradation of accuracy metrics. To compare the optimized and source models, we report the word perplexity measured on the WikiText dataset with the 🤗 LM Evaluation Harness, which supports both Transformers and Optimum-Intel models out of the box.

Model                          PPL, PyTorch FP32   PPL, OpenVINO INT8   PPL, OpenVINO INT4
meta-llama/Meta-Llama-3.1-8B   7.3366              7.3463               7.8288
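
As a hedged sketch of how such a measurement might be launched, assuming the installed lm-eval build registers its Optimum-Intel/OpenVINO backend under the model name "openvino" (verify against your version):

import lm_eval

# Sketch: evaluate the exported OpenVINO model on the WikiText word-perplexity task
results = lm_eval.simple_evaluate(
    model="openvino",                    # assumed backend name; check your lm-eval version
    model_args="pretrained=./llama-3.1-8b-ov",
    tasks=["wikitext"],
)
print(results["results"]["wikitext"])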

Step 4: Deploy using the OpenVINO GenAI API

After conversion and optimization, deploying the model with OpenVINO GenAI is straightforward. OpenVINO GenAI's LLMPipeline class provides both Python and C++ APIs and supports a variety of text-generation methods with minimal dependencies.

Python API Example

import argparse
import openvino_genai

# Parse the model directory and prompt from the command line
parser = argparse.ArgumentParser()
parser.add_argument("model_dir")
parser.add_argument("prompt")
args = parser.parse_args()

device = "CPU"  # GPU can also be used
pipe = openvino_genai.LLMPipeline(args.model_dir, device)
config = openvino_genai.GenerationConfig()
config.max_new_tokens = 100
print(pipe.generate(args.prompt, config))

OpenVINO GenAI is designed for lightweight deployment, so only minimal dependencies are needed to run this example in a Python environment. You can install the OpenVINO GenAI package in the same Python environment, or create a separate one to compare application footprints:

pip install openvino-genai==24.3

C++ API Example

Let's see how to run the same pipeline with the OpenVINO GenAI C++ API. The GenAI API is designed to be intuitive and provides a seamless migration from the Transformers API.

Note: In the example below, you can specify any other device available in your environment in the "device" variable; for example, if you are using an Intel CPU with integrated graphics, "GPU" is a good option to try. To check which devices are available, use the ov::Core::get_available_devices method (see query-device-properties).

#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>

int main(int argc, char* argv[]) {
    std::string model_path = "./llama-3.1-8b-ov";
    std::string device = "CPU";  // GPU can also be used
    ov::genai::LLMPipeline pipe(model_path, device);
    std::cout << pipe.generate("What is the LLM model?", ov::genai::max_new_tokens(256));
}
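
As a quick aside (not in the original post), the device list mentioned in the note above can also be queried from Python with the OpenVINO runtime bindings:

import openvino as ov

# List the inference devices OpenVINO can see on this machine, e.g. ['CPU', 'GPU']
core = ov.Core()
print(core.available_devices)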

Customize the generation settings

Within LLMPipeline, you can also specify custom generation options using ov::genai::GenerationConfig:

ov::genai::GenerationConfig config;
config.max_new_tokens = 256;
std::string result = pipe.generate(prompt, config);

LLMPipeline not only makes it easy to leverage various decoding algorithms such as beam search, but also lets you build interactive chat scenarios with streamers, as in the example below. In addition, LLMPipeline applies enhanced internal optimizations, such as reducing prompt-processing time by reusing the KV cache of previous chat history via the chat-specific methods start_chat() and finish_chat().

ov::genai::GenerationConfig config;
config.max_new_tokens = 100;
config.do_sample = true;
config.top_p = 0.9;
config.top_k = 30;

auto streamer = [](std::string subword) {
    std::cout << subword << std::flush;
    return false;
};

pipe.generate(prompt, config, streamer);

Finally, let's look at how to use LLMPipeline in a chat scenario.

pipe.start_chat();
for (size_t i = 0; i < questions.size(); i++) {
    std::cout << "question:\n";
    std::getline(std::cin, prompt);
    std::cout << pipe.generate(prompt) << std::endl;
}
pipe.finish_chat();
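
For completeness, here is a rough Python counterpart of the chat loop (not part of the original post); it assumes the openvino_genai Python bindings expose start_chat(), finish_chat(), and a streamer callback on generate(), so treat it as a sketch:

import openvino_genai

def streamer(subword):
    # Print tokens as they arrive; returning False tells the pipeline to keep generating
    print(subword, end="", flush=True)
    return False

pipe = openvino_genai.LLMPipeline("./llama-3.1-8b-ov", "CPU")
config = openvino_genai.GenerationConfig()
config.max_new_tokens = 100

pipe.start_chat()
while True:
    try:
        prompt = input("question:\n")
    except EOFError:
        break
    pipe.generate(prompt, config, streamer)
    print("\n----------")
pipe.finish_chat()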

Conclusion

The combination of Optimum-Intel and OpenVINO™ GenAI provides a powerful and flexible solution for deploying Hugging Face models at the edge. By following these steps, you can achieve optimized, high-performance AI inference in environments where Python is not ideal, ensuring that applications run smoothly across Intel hardware.

Additional resources

  • Learn more in this tutorial.
  • To build the C++ example above, see this document.
  • OpenVINO Documentation
  • Jupyter Notebooks
  • Optimum Documentation
