Deploying transformer models at the edge or on client devices requires careful consideration of performance and compatibility. Python is powerful, but it is not always ideal for such deployments, especially in C++-dominated environments. In this blog, we use Optimum Intel and OpenVINO™ GenAI to optimize and deploy Hugging Face transformer models, ensuring efficient AI inference with minimal dependencies.
Table of contents

Why use OpenVINO™ for edge deployment
Step 1: Set up your environment
Step 2: Export the model to OpenVINO IR
Step 3: Optimize the model
Step 4: Deploy using the OpenVINO GenAI API
Conclusion
Why use OpenVINO™ for edge deployment?
OpenVINO™ was originally developed as a C++ AI inference solution, making it ideal for edge and client deployments where minimizing dependencies is critical. The introduction of the GenAI API makes it even easier to integrate large language models (LLMs) into C++ or Python applications, as it is designed to simplify deployment and enhance performance.
Step 1: Set up your environment
Prerequisites
To get started, make sure your environment is properly configured for both Python and C++. Install the required Python packages:
pip install --upgrade --upgrade-strategy eager "optimum[openvino]"
The specific packages used in this blog post are:
transformers==4.44
openvino==24.3
openvino-tokenizers==24.3
optimum-intel==1.20
lm-eval==0.4.3
Follow these instructions to install the OpenVINO GenAI C++ library.
Step 2: Export the model to OpenVINO IR
Optimum Intel is the result of a collaboration between Hugging Face and Intel. It is designed to optimize transformer models for inference on Intel hardware. Optimum-Intel supports OpenVINO as an inference backend, and its API includes wrappers for various model architectures built on top of the OpenVINO inference API. All of these wrappers start with an OV prefix, for example OVModelForCausalLM. Otherwise, the API is similar to that of the 🤗 Transformers library.
There are two options for exporting a transformer model to the OpenVINO intermediate representation (IR): the Python .from_pretrained() method or the Optimum command line interface (CLI). Below is an example of each.
Use the Python API
from optimum.intel import OVModelForCausalLM

model_id = "meta-llama/Meta-Llama-3.1-8B"
model = OVModelForCausalLM.from_pretrained(model_id, export=True)
model.save_pretrained("./llama-3.1-8b-ov")
Using the Command Line Interface (CLI)
optimum-cli export openvino -m meta-llama/Meta-Llama-3.1-8B ./llama-3.1-8b-ov
The ./llama-3.1-8b-ov folder contains the .xml and .bin IR model files, as well as the required configuration files from the source model. The tokenizer is also converted to the format of the openvino-tokenizers library, and the corresponding configuration files are created in the same folder.
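Because the OV wrappers mirror the 🤗 Transformers API, the exported folder can be loaded and used for generation right away. Below is a minimal sketch; the prompt and generation length are illustrative, and the tokenizer is loaded from the export folder (if the tokenizer files are not present there, load them from the original model ID instead).

from transformers import AutoTokenizer
from optimum.intel import OVModelForCausalLM

model_dir = "./llama-3.1-8b-ov"

# Loads the already-exported IR; no re-export happens here
model = OVModelForCausalLM.from_pretrained(model_dir)
tokenizer = AutoTokenizer.from_pretrained(model_dir)

inputs = tokenizer("What is OpenVINO?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))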
Step 3: Optimize the model
Model optimization is highly recommended when running LLMs on resource-constrained edge and client devices. Weight-only quantization is a mainstream approach that significantly reduces latency and model footprint. Optimum-Intel provides weight-only quantization through the Neural Network Compression Framework (NNCF), which offers a variety of optimization techniques designed specifically for LLMs: from data-free INT8 and INT4 weight quantization to data-aware methods such as AWQ, GPTQ, quantization scale estimation, and mixed-precision quantization. By default, the weights of models with more than 1 billion parameters are quantized to INT8 precision, which is safe in terms of accuracy. This means that the export steps above produce a model with 8-bit weights. However, 4-bit integer weight-only quantization allows for a better accuracy/performance trade-off.
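If you prefer to control this default behavior explicitly rather than rely on it, here is a minimal sketch. It assumes Optimum-Intel's load_in_8bit flag and the 8-bit OVWeightQuantizationConfig option behave as documented; treat it as an illustration rather than a required step.

from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

model_id = "meta-llama/Meta-Llama-3.1-8B"

# Explicitly request the default data-free 8-bit weight compression
model_int8 = OVModelForCausalLM.from_pretrained(
    model_id, export=True, quantization_config=OVWeightQuantizationConfig(bits=8)
)

# Or keep the original weight precision by disabling the default compression
model_fp = OVModelForCausalLM.from_pretrained(model_id, export=True, load_in_8bit=False)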
For the meta-llama/Meta-Llama-3.1-8B model, we recommend combining AWQ, quantization scale estimation, and mixed-precision INT4/INT8 weight quantization, using a calibration dataset that reflects the deployment use case. As with exporting, there are two options for applying 4-bit weight-only quantization to the model.
Use the Python API
Specify the quantization_config parameter in the .from_pretrained() method. In this case, create an OVWeightQuantizationConfig object and pass it to that parameter as follows:
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

model_id = "meta-llama/Meta-Llama-3.1-8B"
quantization_config = OVWeightQuantizationConfig(bits=4, awq=True, scale_estimation=True, group_size=64, dataset="c4")
model = OVModelForCausalLM.from_pretrained(model_id, export=True, quantization_config=quantization_config)
model.save_pretrained("./llama-3.1-8b-ov")
Using the Command Line Interface (CLI):
optimum-cli export openvino -m meta-llama/Meta-Llama-3.1-8B --weight-format int4 --awq --scale-estimation --group-size 64 --dataset wikitext2 ./llama-3.1-8b-ov
Note: The model optimization process above can take additional time, because it applies several methods in sequence and runs model inference over the specified dataset.
Optimizing the model via the API is more flexible, as it allows you to use custom calibration datasets that can be passed as an iterable object, for example an instance of a 🤗 Datasets Dataset object or simply a list of strings, as sketched below.
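For example, here is a minimal sketch that passes a small hand-written list of strings as the calibration dataset (the sample prompts are purely illustrative):

from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

# A small calibration set that reflects the target use case (illustrative samples)
calibration_samples = [
    "How do I export a transformer model to OpenVINO IR?",
    "Summarize the main steps for deploying an LLM on a client device.",
    "Explain weight-only quantization in one paragraph.",
]

quantization_config = OVWeightQuantizationConfig(
    bits=4,
    awq=True,
    scale_estimation=True,
    group_size=64,
    dataset=calibration_samples,  # a list of strings instead of a named dataset
)
model = OVModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B", export=True, quantization_config=quantization_config
)
model.save_pretrained("./llama-3.1-8b-ov")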
Weight quantization usually introduces some degradation of accuracy metrics. To compare the optimized and source models, we report the word perplexity metric measured on the Wikitext dataset with the lm-evaluation-harness project, which supports both 🤗 Transformers and Optimum-Intel models out of the box. A sketch of such a measurement follows the table below.
Model                          PPL (PyTorch FP32)   PPL (OpenVINO INT8)   PPL (OpenVINO INT4)
meta-llama/Meta-Llama-3.1-8B   7.3366               7.3463                7.8288
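As a sketch of how such a measurement can be run, the snippet below uses the lm-evaluation-harness Python API. It assumes that the openvino model backend is registered in the installed lm-eval release and that the wikitext task reports a word perplexity metric; the exact backend name and result keys may differ between versions.

import lm_eval

# Evaluate the exported/optimized OpenVINO model on Wikitext perplexity
results = lm_eval.simple_evaluate(
    model="openvino",
    model_args="pretrained=./llama-3.1-8b-ov",
    tasks=["wikitext"],
)
print(results["results"]["wikitext"])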
Step 4: Deploy using the OpenVINO GenAI API
After conversion and optimization, it is easy to deploy the model using OpenVINO GenAI. The LLMPipeline class in OpenVINO GenAI provides both Python and C++ APIs and supports a variety of text generation methods with minimal dependencies.
Python API Example
import argparse

import openvino_genai

parser = argparse.ArgumentParser()
parser.add_argument("model_dir")
parser.add_argument("prompt")
args = parser.parse_args()

device = "CPU"  # GPU can be used as well
pipe = openvino_genai.LLMPipeline(args.model_dir, device)
config = openvino_genai.GenerationConfig()
config.max_new_tokens = 100
print(pipe.generate(args.prompt, config))
OpenVINO GenAI is designed for lightweight deployment, so the only dependency required to run this example is the openvino-genai Python package. You can either install it into the same Python environment as before or create a separate environment to compare the application footprint.
pip install openvino-genai==24.3
C++ API Example
Let’s see how to run the same pipeline with the OpenVINO GenAI C++ API. The GenAI API is designed to be intuitive and provides a seamless migration from the 🤗 Transformers API.
Note: In the example below, you can specify any other device available in your environment via the device variable. For example, if you are using an Intel CPU with integrated graphics, “GPU” is a good option to try. To check which devices are available, you can use the ov::Core::get_available_devices method (see query-device-properties).
#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>

int main(int argc, char* argv[]) {
    std::string model_path = "./llama-3.1-8b-ov";
    std::string device = "CPU";  // GPU can be used as well

    ov::genai::LLMPipeline pipe(model_path, device);
    std::cout << pipe.generate("What is OpenVINO?", ov::genai::max_new_tokens(256)) << std::endl;  // example prompt
}
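As mentioned in the note above, you can query which devices are available on a machine. For a quick check from Python, here is a minimal sketch using the openvino runtime package (the printed names depend on your hardware and drivers):

import openvino as ov

core = ov.Core()
# Prints device names such as ['CPU', 'GPU'] depending on the available hardware
print(core.available_devices)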
Customize the generation settings
With LLMPipeline, you can also specify custom generation options via ov::genai::GenerationConfig:
ov::genai::GenerationConfig config;
config.max_new_tokens = 256;
std::string result = pipe.generate(prompt, config);
LLMPipeline not only lets you easily leverage various decoding algorithms such as beam search, but also lets you build interactive chat scenarios with a streamer, as in the example below. In addition, LLMPipeline offers enhanced internal optimizations, such as reduced prompt processing time by reusing the KV cache of the previous chat history via the chat-specific methods start_chat() and finish_chat().
ov::genai::GenerationConfig config;
config.max_new_tokens = 100;
config.do_sample = true;
config.top_p = 0.9;
config.top_k = 30;

auto streamer = [](std::string subword) {
    std::cout << subword << std::flush;
    return false;  // return true to stop generation early
};

// Since a streamer is set, it is called on every newly generated token.
pipe.generate(prompt, config, streamer);
And finally, let’s take a look at how to use LLMPipeline in a chat scenario:
pipe.start_chat();
for (size_t i = 0; i < questions.size(); i++) {  // questions: the user prompts for each chat turn
    std::cout << "question:\n";
    std::getline(std::cin, prompt);

    std::cout << pipe.generate(prompt, config, streamer) << std::endl;
}
pipe.finish_chat();
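The same chat flow is also available from Python. Below is a minimal sketch; it assumes the Python bindings expose start_chat()/finish_chat() and per-token streaming callbacks in the same way as the C++ API, and the prompt loop reading from standard input is illustrative.

import openvino_genai

pipe = openvino_genai.LLMPipeline("./llama-3.1-8b-ov", "CPU")
config = openvino_genai.GenerationConfig()
config.max_new_tokens = 100

def streamer(subword):
    print(subword, end="", flush=True)
    return False  # returning True would stop generation early

pipe.start_chat()
while True:
    prompt = input("question:\n")
    if not prompt:
        break
    pipe.generate(prompt, config, streamer)
    print()
pipe.finish_chat()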
Conclusion
The combination of Optimum-Intel and OpenVINO™ GenAI provides a powerful and flexible solution for deploying Hugging Face models at the edge. By following these steps, you can achieve optimized, high-performance AI inference in environments where Python is not ideal, ensuring that applications run smoothly across Intel hardware.
Additional resources
Learn more in this tutorial.
To build the C++ examples above, refer to this document.
OpenVINO documentation
Jupyter notebooks
Optimum documentation