Intel Gaudi Fast Generation Support

By versatileai · April 29, 2025
As model sizes grow, generative AI deployments require significant inference resources. This increases both the cost per generation and the power consumed to meet that demand.

Optimizing text generation inference is essential to reduce latency, infrastructure costs, and power consumption. Doing so improves both the user experience and the efficiency of text generation tasks.

Assisted decoding is a common method for speeding up text generation. As shown in a previous post, we adapted and optimized it for Intel Gaudi, which offers performance comparable to the NVIDIA H100 GPU at a price in the same ballpark as the NVIDIA A100 80GB GPU. This work is now part of Optimum Habana, which extends Hugging Face libraries such as Transformers and Diffusers to fully optimize AI workflows for Intel Gaudi processors.

Speculative Sampling – Assisted Decode

Speculative sampling speeds up text generation by pairing a small draft model with the target model. The draft model proposes K tokens, which the target model evaluates in a single forward pass. Accepted tokens are kept; at the first rejection, the target model generates the next token itself, and the process repeats. Because the acceptance rule preserves the target distribution, the output quality matches that of autoregressive sampling from the target model alone. In practice, this technique lets you specify a draft model when generating text and has been shown to yield roughly a 2x speedup for large transformer-based models, accelerating text generation on Intel Gaudi processors.

However, because the draft and target models differ in size, a difference reflected in their KV caches, the challenge is to apply each model's optimization strategies simultaneously. This article assumes quantized models and combines KV caching with speculative sampling. Each model maintains its own KV cache: the draft model generates K tokens, the target model evaluates them, the target model produces the next token whenever a draft token is rejected, and the draft model then proposes the next K tokens, repeating the process.
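The draft/verify loop described above can be sketched in plain Python. This is a toy illustration of the speculative sampling acceptance rule from (2), not the Optimum Habana implementation: distributions are represented as token-to-probability dicts, `draft_dist` and `target_dist` are hypothetical callables standing in for the two models, and KV caches are omitted.

```python
import random

random.seed(0)

def sample(dist):
    """Sample a token from a {token: probability} dict."""
    r, acc = random.random(), 0.0
    for tok, p in dist.items():
        acc += p
        if r < acc:
            return tok
    return tok  # fall through on floating-point rounding

def speculative_step(prefix, draft_dist, target_dist, k):
    """One round of speculative sampling.

    The draft proposes up to k tokens; each is accepted with probability
    min(1, p_target / p_draft). On the first rejection, the target
    resamples from the normalized residual max(0, p_target - p_draft)
    and the round ends. If all k tokens are accepted, the target emits
    one extra token for free.
    """
    out = list(prefix)
    for _ in range(k):
        q = draft_dist(out)
        x = sample(q)               # draft model's proposal
        p = target_dist(out)
        if random.random() < min(1.0, p.get(x, 0.0) / q[x]):
            out.append(x)           # accepted: keep the draft token
        else:
            # rejected: resample from the residual distribution
            resid = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
            z = sum(resid.values())
            out.append(sample({t: v / z for t, v in resid.items()}))
            return out
    out.append(sample(target_dist(out)))  # bonus token after k acceptances
    return out
```

When draft and target agree exactly, every proposal is accepted and each round yields k + 1 tokens from a single target evaluation; the 2x speedups cited in the article come from acceptance rates staying high in practice.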

Note that the authors of (2) prove that speculative sampling recovers the target distribution, ensuring the same sampling quality as autoregressive sampling from the target model itself. Speculative sampling is therefore not worthwhile when the draft model is not sufficiently smaller than the target, or when its acceptance rate is too low to benefit from the draft model's smaller size.

A similar technique, known as assisted generation, was developed independently at about the same time (3). Its author contributed the method to Hugging Face Transformers, where the .generate() call accepts an optional assistant_model parameter that enables it.

Usage and experiment

Using assisted generation is easy; an example is provided here. As expected, the --assistant_model parameter specifies the draft model. The draft model generates K tokens that the target model evaluates; when a draft token is rejected, the target model generates the next token, and the draft model then proposes the next K tokens, repeating the process. The draft model's acceptance rate depends in part on the input text. Across large transformer-based models, we typically observed a speedup of about 2x.
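For reference, here is what the assistant_model parameter looks like in a plain Hugging Face Transformers call. This is a minimal sketch, not the Gaudi-specific Optimum Habana script the article refers to: the GPT-2 checkpoints are illustrative stand-ins chosen because the target and draft share a tokenizer, which assisted generation requires.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Target model and a smaller draft ("assistant") model from the same
# family; they must share a tokenizer/vocabulary.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
target = AutoModelForCausalLM.from_pretrained("gpt2")
assistant = AutoModelForCausalLM.from_pretrained("distilgpt2")

inputs = tokenizer("Speculative sampling speeds up", return_tensors="pt")

# Passing assistant_model switches .generate() to assisted decoding:
# the assistant drafts tokens and the target verifies them.
outputs = target.generate(**inputs,
                          assistant_model=assistant,
                          max_new_tokens=32,
                          pad_token_id=tokenizer.eos_token_id)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)
```

On Gaudi, the same idea is exposed through the Optimum Habana text-generation example via its --assistant_model flag, with the Gaudi-optimized model classes substituted for the plain ones above.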

Conclusion

Assisted generation for accelerating text generation is now supported on Gaudi and is easy to use, improving the performance you can get from Intel Gaudi processors. The method is based on speculative sampling, which has been shown to be effective for large transformer-based models.

(1) N. Shazeer, “Fast Transformer Decoding: One Write-Head is All You Need,” November 2019. arXiv:1911.02150.

(2) C. Chen, S. Borgeaud, G. Irving, J.-B. Lespiau, L. Sifre, and J. Jumper, “Accelerating Large Language Model Decoding with Speculative Sampling,” February 2023. arXiv:2302.01318.

(3) J. Gante, “Assisted Generation: A New Direction Toward Low-Latency Text Generation,” May 2023. https://huggingface.co/blog/Assisted-generation.
