tl;dr:
Qwen3-8B is one of the most exciting recent releases: a model with native agentic capabilities that is a natural fit for the AI PC.
With OpenVINO GenAI, I was able to accelerate generation by ~1.3× using speculative decoding with a lightweight Qwen3-0.6B draft model.
We then pushed the speedup further, to ~1.4×, by applying a simple pruning procedure to the draft.
I wrap up by showing how these improvements can be used to run fast local AI agents with smolagents.
Qwen3
Qwen3-8B is part of the latest Qwen family, trained with explicit agentic behavior. It supports long-context processing suited to tool invocation, multi-step reasoning, and complex agent workflows. Integration with frameworks such as Hugging Face's smolagents, Qwen-Agent, and AutoGen enables a wide range of agent applications built around tool use and reasoning. Unlike single-turn chatbots, agent applications rely on reasoning models that generate "thinking" traces, intermediate steps that inflate token usage, which makes inference speed critical for responsiveness. The combination of optimized inference and built-in agentic intelligence makes Qwen3-8B a compelling foundation for next-generation AI agents.
Accelerating Qwen3-8B on Intel® Core™ Ultra with speculative decoding
We start by benchmarking the 4-bit optimized OpenVINO version of Qwen3-8B on an Intel Lunar Lake integrated GPU, establishing this as the baseline for further acceleration.
Speculative decoding is a technique for speeding up autoregressive generation. A smaller, faster draft model proposes multiple tokens, which the large target model then validates in a single forward pass. In our setup, Qwen3-8B acted as the target model, while Qwen3-0.6B served as the draft. This approach delivered an average speedup of ~1.3× over the baseline.
from openvino_genai import LLMPipeline, draft_model

target_path = "/path/to/target/qwen3-8b-int4-ov"
draft_path = "/path/to/draft/qwen3-0.6b-int8-ov"
device = "GPU"

pipe = LLMPipeline(target_path, device, draft_model=draft_model(draft_path, device))
streamer = lambda x: print(x, end="", flush=True)
pipe.generate("What is speculative decoding? How does it improve inference speed?",
              max_new_tokens=100, streamer=streamer)
Before initializing the LLMPipeline, make sure both the target and draft models are converted to OpenVINO format. You can download pre-converted models from the links provided, or follow these instructions to convert your own.
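If you convert the models yourself, Optimum Intel's CLI export is one way to produce the IRs used above. A sketch of the commands; the model IDs, weight formats, and output directory names here are assumptions chosen to match the paths in the snippet above:

```shell
# Export the 8B target with 4-bit weight compression (assumed model ID and output dir)
optimum-cli export openvino --model Qwen/Qwen3-8B --weight-format int4 qwen3-8b-int4-ov

# Export the 0.6B draft with 8-bit weights
optimum-cli export openvino --model Qwen/Qwen3-0.6B --weight-format int8 qwen3-0.6b-int8-ov
```

Each command writes the converted model (OpenVINO IR plus tokenizer files) into the named directory, which can then be passed to LLMPipeline.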
Pushing performance further
The speedup from speculative decoding depends on the expected number of tokens generated per target forward pass, E(#generated_tokens), the speculative window size γ, and the draft-to-target latency ratio c. A smaller, faster (if less accurate) draft can often achieve greater acceleration. This motivated shrinking the draft model while preserving its quality.
$$\text{speedup} = \frac{E(\#generated\_tokens)}{\gamma c + 1}$$
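As a quick sanity check on this formula, here is a small sketch that plugs in illustrative numbers. The acceptance rate α and latency ratios below are assumptions for illustration, not measured values; under the standard speculative-decoding analysis, E(#generated_tokens) = (1 − α^(γ+1)) / (1 − α) for a per-token acceptance rate α:

```python
def expected_generated_tokens(alpha: float, gamma: int) -> float:
    """Expected tokens accepted per target pass, for acceptance rate alpha
    and speculative window size gamma."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def speedup(alpha: float, gamma: int, c: float) -> float:
    """Theoretical speedup: E(#generated_tokens) / (gamma * c + 1)."""
    return expected_generated_tokens(alpha, gamma) / (gamma * c + 1)

# Illustrative numbers only: 70% acceptance, window of 5,
# draft 10x faster than the target (c = 0.1).
base = speedup(alpha=0.7, gamma=5, c=0.1)
# A faster (e.g. pruned) draft lowers c, which raises the speedup:
pruned = speedup(alpha=0.7, gamma=5, c=0.07)
print(f"{base:.2f}x -> {pruned:.2f}x")  # -> 1.96x -> 2.18x
```

The second call shows the mechanism behind the pruning result below: shrinking the draft reduces c in the denominator, so the same acceptance behavior yields a larger end-to-end speedup.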
Our recent work shows that model depth (the number of layers) is a major contributor to inference latency. Taking inspiration from recent research on layer-wise compression (1), our approach identifies blocks of consecutive layers that contribute little, as measured by angular distance, and removes them; fine-tuning is then applied to restore accuracy. We used this method to prune six of the 28 layers of the Qwen3-0.6B draft model. To restore the pruned draft's quality, we fine-tuned it on synthetic data generated by Qwen3-8B: responses to 500K prompts from the BAAI/Infinity-Instruct dataset.
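A minimal sketch of the layer-selection idea, using plain Python lists as stand-in hidden states (the real method scores activations averaged over a calibration set). The angular distance d(x, y) = arccos(cos_sim(x, y)) / π between the hidden state entering a block of n layers and the one leaving it measures how much the block actually changes the representation; the lowest-scoring block is the pruning candidate:

```python
import math

def angular_distance(x, y):
    """Angular distance d(x, y) = arccos(cos_sim(x, y)) / pi, in [0, 1]."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return math.acos(max(-1.0, min(1.0, dot / norm))) / math.pi

def best_block_to_prune(hidden_states, n):
    """hidden_states[i] is the hidden state entering layer i (with the final
    output appended), so the block of n layers starting at i maps
    hidden_states[i] -> hidden_states[i + n]. Return the start index of the
    block whose input and output are closest in angular distance."""
    scores = [
        angular_distance(hidden_states[i], hidden_states[i + n])
        for i in range(len(hidden_states) - n)
    ]
    return min(range(len(scores)), key=scores.__getitem__)

# Toy example: the 2-layer block starting at index 2 barely rotates the
# hidden state, so it is the cheapest to remove.
states = [[1.0, 0.0], [0.8, 0.6], [0.6, 0.8], [0.59, 0.81], [0.6, 0.8], [0.0, 1.0]]
print(best_block_to_prune(states, n=2))  # -> 2
```

In the real pipeline the selected layers are deleted from the model and the truncated network is fine-tuned, as described above, to recover the lost quality.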
The resulting pruned draft model delivered roughly a 1.4× speedup over the baseline, improving on the ~1.3× gain achieved with the original draft. This result is consistent with theoretical expectations: reducing draft latency improves the overall speedup, enabling faster and more efficient inference.
This shows how pruning combined with speculative decoding can unlock faster and more efficient reasoning.
Check out the notebook and the depth-pruned Qwen3-0.6B draft model, and reproduce the results step by step.
Integration with smolagents
To showcase real-world possibilities, we deployed the optimized setup with the smolagents library. This integration lets developers plug in Qwen3-8B (paired with the pruned draft) to invoke APIs and external tools, write and run code, handle long-context reasoning, and build agents that run efficiently on Intel® Core™ Ultra. The benefits are not limited to Hugging Face: the same pairing can be used seamlessly in frameworks such as AutoGen and Qwen-Agent, further strengthening the agent ecosystem.
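A hypothetical sketch of this glue code, reusing the speculative-decoding pipeline from earlier. The paths are placeholders, and the assumption that the agent accepts a plain callable that takes chat messages and returns text is illustrative only; smolagents ships its own model classes, and its exact Model interface varies by version, so check its documentation before copying this:

```python
import openvino_genai
from smolagents import CodeAgent, DuckDuckGoSearchTool

# Target + pruned draft, as set up in the speculative-decoding section above.
draft = openvino_genai.draft_model("/path/to/draft/qwen3-0.6b-pruned-ov", "GPU")
pipe = openvino_genai.LLMPipeline("/path/to/target/qwen3-8b-int4-ov", "GPU",
                                  draft_model=draft)

def local_qwen3(messages, **kwargs):
    """Hypothetical adapter: flatten the chat history into a single prompt
    and generate locally with the accelerated pipeline."""
    prompt = "\n".join(m["content"] for m in messages)
    return pipe.generate(prompt, max_new_tokens=512)

# A CodeAgent can search the web and run Python, mirroring the demo below.
agent = CodeAgent(tools=[DuckDuckGoSearchTool()], model=local_qwen3,
                  additional_authorized_imports=["pptx"])
agent.run("Research the latest OpenVINO release and build a short summary deck.")
```

Because all generation stays in the local OpenVINO pipeline, every intermediate reasoning step of the agent benefits from the ~1.4× speculative-decoding speedup.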
In the demo, we assigned a task to the accelerated Qwen3-based agent. Here's how it works:
1. The agent used a web search tool to collect the latest information.
2. It then switched to the Python interpreter to generate slides with the python-pptx library.
This simple workflow highlights just a few of the possibilities that accelerated Qwen3 models unlock when they meet frameworks like smolagents, enabling practical and efficient AI agents on AI PCs. Try it here:
https://www.youtube.com/watch?v=irsd5lnxik
References
(1) Gromov, A., Tirumala, K., Shapourian, H., Glorioso, P., & Roberts, D. A. (2025). The Unreasonable Ineffectiveness of the Deeper Layers. Poster presented at ICLR 2025. https://arxiv.org/abs/2403.17887
Performance and Legal Notices
Performance results are based on OpenVINO™ 2025.2 internal benchmarks as of September 2025, measured on an Intel® Core™ Ultra 7 268V 2.20 GHz processor with an integrated Intel® Arc™ 140V GPU and 32 GB of DDR5 memory. Performance varies by use, configuration, and other factors. Learn more at www.intel.com/performanceindex. No product or component can be absolutely secure. Your costs and results may vary. Intel technologies may require enabled hardware, software, or service activation. © Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

