tl;dr:
Qwen3-8B is one of the most exciting recent releases: a model with native agentic capabilities that is a natural fit for the AI PC.
With OpenVINO GenAI, I was able to accelerate generation by ~1.3× using speculative decoding with a lightweight Qwen3-0.6B draft model.
We then pushed the speedup further, to ~1.4×, by applying a simple pruning procedure to the draft.
I wrap up by showing how these improvements can be used to run fast local AI agents with smolagents.
Qwen3
Qwen3-8B is part of the latest Qwen family, trained with explicit agentic behavior. It supports long-context processing suited to tool invocation, multi-step reasoning, and complex agent workflows. Integration with frameworks such as Hugging Face's smolagents, Qwen-Agent, and AutoGen enables a wide range of agent applications built around tool use and reasoning. Unlike single-turn chatbots, agent applications rely on reasoning models that generate "thinking" traces, intermediate steps that inflate token usage, which makes inference speed critical for responsiveness. The combination of optimized inference and built-in agentic intelligence makes Qwen3-8B a compelling foundation for next-generation AI agents.
Accelerating Qwen3-8B on Intel® Core™ Ultra with speculative decoding
We start by benchmarking the 4-bit optimized OpenVINO version of Qwen3-8B on an Intel Lunar Lake integrated GPU, establishing this as the baseline for further acceleration.
Speculative decoding is a technique for speeding up autoregressive generation. A smaller, faster draft model proposes multiple tokens, which the large target model then validates in a single forward pass. In our setup, Qwen3-8B acted as the target model, while Qwen3-0.6B served as the draft. This approach delivered an average speedup of ~1.3× over the baseline.
from openvino_genai import LLMPipeline, draft_model

target_path = "/path/to/target/qwen3-8b-int4-ov"
draft_path = "/path/to/draft/qwen3-0.6b-int8-ov"
device = "GPU"

pipe = LLMPipeline(target_path, device, draft_model=draft_model(draft_path, device))
streamer = lambda x: print(x, end="", flush=True)
pipe.generate("What is speculative decoding? How does it improve inference speed?",
              max_new_tokens=100, streamer=streamer)
Before initializing the LLMPipeline, make sure both the target and draft models are converted to OpenVINO format. You can download pre-converted models from the links provided, or follow these instructions to convert your own.
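If you convert the models yourself, Optimum Intel's CLI export is one way to produce the IRs used above. A sketch of the commands; the model IDs, weight formats, and output directory names here are assumptions chosen to match the paths in the snippet above:

```shell
# Export the 8B target with 4-bit weight compression (assumed model ID and output dir)
optimum-cli export openvino --model Qwen/Qwen3-8B --weight-format int4 qwen3-8b-int4-ov

# Export the 0.6B draft with 8-bit weights
optimum-cli export openvino --model Qwen/Qwen3-0.6B --weight-format int8 qwen3-0.6b-int8-ov
```

Each command writes the converted model (OpenVINO IR plus tokenizer files) into the named directory, which can then be passed to LLMPipeline.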
Pushing performance further
The speedup from speculative decoding depends on the expected number of tokens generated per target forward pass, E(#generated_tokens), the speculative window size γ, and the draft-to-target latency ratio c. A smaller, faster (if less accurate) draft can often achieve greater acceleration. This motivated shrinking the draft model while preserving its quality.
$$\text{speedup} = \frac{E(\#generated\_tokens)}{\gamma c + 1}$$
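As a quick sanity check on this formula, here is a small sketch that plugs in illustrative numbers. The acceptance rate α and latency ratios below are assumptions for illustration, not measured values; under the standard speculative-decoding analysis, E(#generated_tokens) = (1 − α^(γ+1)) / (1 − α) for a per-token acceptance rate α:

```python
def expected_generated_tokens(alpha: float, gamma: int) -> float:
    """Expected tokens accepted per target pass, for acceptance rate alpha
    and speculative window size gamma."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def speedup(alpha: float, gamma: int, c: float) -> float:
    """Theoretical speedup: E(#generated_tokens) / (gamma * c + 1)."""
    return expected_generated_tokens(alpha, gamma) / (gamma * c + 1)

# Illustrative numbers only: 70% acceptance, window of 5,
# draft 10x faster than the target (c = 0.1).
base = speedup(alpha=0.7, gamma=5, c=0.1)
# A faster (e.g. pruned) draft lowers c, which raises the speedup:
pruned = speedup(alpha=0.7, gamma=5, c=0.07)
print(f"{base:.2f}x -> {pruned:.2f}x")  # -> 1.96x -> 2.18x
```

The second call shows the mechanism behind the pruning result below: shrinking the draft reduces c in the denominator, so the same acceptance behavior yields a larger end-to-end speedup.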
Our recent work shows that model depth (the number of layers) is a major contributor to inference latency. Taking inspiration from recent research on layer-wise compression (1), our approach identifies blocks of consecutive layers that contribute little, as measured by angular distance, and removes them; fine-tuning is then applied to restore accuracy. We used this method to prune six of the 28 layers of the Qwen3-0.6B draft model. To restore the pruned draft's quality, we fine-tuned it on synthetic data generated by Qwen3-8B: responses to 500K prompts from the BAAI/Infinity-Instruct dataset.
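A minimal sketch of the layer-selection idea, using plain Python lists as stand-in hidden states (the real method scores activations averaged over a calibration set). The angular distance d(x, y) = arccos(cos_sim(x, y)) / π between the hidden state entering a block of n layers and the one leaving it measures how much the block actually changes the representation; the lowest-scoring block is the pruning candidate:

```python
import math

def angular_distance(x, y):
    """Angular distance d(x, y) = arccos(cos_sim(x, y)) / pi, in [0, 1]."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return math.acos(max(-1.0, min(1.0, dot / norm))) / math.pi

def best_block_to_prune(hidden_states, n):
    """hidden_states[i] is the hidden state entering layer i (with the final
    output appended), so the block of n layers starting at i maps
    hidden_states[i] -> hidden_states[i + n]. Return the start index of the
    block whose input and output are closest in angular distance."""
    scores = [
        angular_distance(hidden_states[i], hidden_states[i + n])
        for i in range(len(hidden_states) - n)
    ]
    return min(range(len(scores)), key=scores.__getitem__)

# Toy example: the 2-layer block starting at index 2 barely rotates the
# hidden state, so it is the cheapest to remove.
states = [[1.0, 0.0], [0.8, 0.6], [0.6, 0.8], [0.59, 0.81], [0.6, 0.8], [0.0, 1.0]]
print(best_block_to_prune(states, n=2))  # -> 2
```

In the real pipeline the selected layers are deleted from the model and the truncated network is fine-tuned, as described above, to recover the lost quality.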
The resulting pruned draft model delivered roughly a 1.4× speedup over the baseline, improving on the ~1.3× gain achieved with the original draft. This result is consistent with theoretical expectations: reducing draft latency improves the overall speedup, enabling faster and more efficient inference.
This shows how pruning combined with speculative decoding can unlock faster and more efficient reasoning.
Check out the notebook and the depth-pruned Qwen3-0.6B draft model, and reproduce the results step by step.
Integration with smolagents
To showcase real-world possibilities, we deployed the optimized setup with the smolagents library. This integration lets developers plug in Qwen3-8B (paired with the pruned draft) to invoke APIs and external tools, write and run code, handle long-context reasoning, and build agents that run efficiently on Intel® Core™ Ultra. The benefits are not limited to Hugging Face: the same pairing can be used seamlessly in frameworks such as AutoGen and Qwen-Agent, further strengthening the agent ecosystem.
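A hypothetical sketch of this glue code, reusing the speculative-decoding pipeline from earlier. The paths are placeholders, and the assumption that the agent accepts a plain callable that takes chat messages and returns text is illustrative only; smolagents ships its own model classes, and its exact Model interface varies by version, so check its documentation before copying this:

```python
import openvino_genai
from smolagents import CodeAgent, DuckDuckGoSearchTool

# Target + pruned draft, as set up in the speculative-decoding section above.
draft = openvino_genai.draft_model("/path/to/draft/qwen3-0.6b-pruned-ov", "GPU")
pipe = openvino_genai.LLMPipeline("/path/to/target/qwen3-8b-int4-ov", "GPU",
                                  draft_model=draft)

def local_qwen3(messages, **kwargs):
    """Hypothetical adapter: flatten the chat history into a single prompt
    and generate locally with the accelerated pipeline."""
    prompt = "\n".join(m["content"] for m in messages)
    return pipe.generate(prompt, max_new_tokens=512)

# A CodeAgent can search the web and run Python, mirroring the demo below.
agent = CodeAgent(tools=[DuckDuckGoSearchTool()], model=local_qwen3,
                  additional_authorized_imports=["pptx"])
agent.run("Research the latest OpenVINO release and build a short summary deck.")
```

Because all generation stays in the local OpenVINO pipeline, every intermediate reasoning step of the agent benefits from the ~1.4× speculative-decoding speedup.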
In the demo, we assigned a task to the accelerated Qwen3-based agent. Here's how it works:
1. The agent used a web search tool to collect the latest information.
2. It then switched to the Python interpreter to generate slides with the python-pptx library.
This simple workflow highlights just a few of the possibilities that accelerated Qwen3 models unlock when they meet frameworks like smolagents, enabling practical and efficient AI agents on AI PCs. Try it here:
https://www.youtube.com/watch?v=irsd5lnxik
References
(1) Gromov, A., Tirumala, K., Shapourian, H., Glorioso, P., & Roberts, D. A. (2025). The Unreasonable Ineffectiveness of the Deeper Layers. Poster presented at ICLR 2025. https://arxiv.org/abs/2403.17887
Performance and Legal Notices
Performance results are based on OpenVINO™ 2025.2 internal benchmarks as of September 2025, measured on an Intel® Core™ Ultra 7 268V 2.20 GHz processor with an integrated Intel® Arc™ 140V GPU and 32 GB of DDR5 memory. Performance varies by use, configuration, and other factors. Learn more at www.intel.com/performanceindex. No product or component can be absolutely secure. Your costs and results may vary. Intel technologies may require enabled hardware, software, or service activation. © Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

