As model sizes grow, generative AI deployments require substantial inference resources. This not only increases the cost per generation, but also increases the power consumption needed to serve such demands.
Optimizing text generation inference is therefore essential to reduce latency, infrastructure costs, and power consumption, which in turn improves the user experience and the efficiency of text generation tasks.
Assisted decoding is a common method for speeding up text generation. As shown in a previous post, I adapted and optimized it for Intel Gaudi, which offers performance comparable to the NVIDIA H100 GPU at a price in the same ballpark as the NVIDIA A100 80GB GPU. This work is now part of Optimum Habana, which extends Hugging Face libraries such as Transformers and Diffusers to fully optimize AI workflows for Intel Gaudi processors.
Speculative Sampling – Assisted Decoding
Speculative sampling is a technique used to speed up text generation. It works by having a small draft model generate K tokens, which are then evaluated by the target model. If a draft token is rejected, the target model generates the next token itself, and the process repeats. Speculative sampling improves text generation speed while achieving sampling quality comparable to autoregressive sampling from the target model. The technique lets you specify a draft model when generating text, and it has been shown to yield roughly a 2x speedup for large transformer-based models. Overall, it accelerates text generation and improves throughput on Intel Gaudi processors.
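To make the accept/reject loop concrete, here is a minimal sketch of one speculative sampling round in the spirit of (2); it is illustrative only, not the Optimum Habana implementation. It assumes draft_model and target_model are callables that map a (1, seq_len) tensor of token IDs to (1, seq_len, vocab_size) logits (with a Hugging Face causal LM you would take the .logits attribute):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def speculative_step(draft_model, target_model, input_ids, K):
    """One round of speculative sampling, following the scheme in (2).

    Returns input_ids extended by the tokens accepted this round.
    """
    # 1) The draft model proposes K tokens autoregressively.
    draft_ids, q_dists = input_ids, []
    for _ in range(K):
        q = F.softmax(draft_model(draft_ids)[:, -1, :], dim=-1)
        tok = torch.multinomial(q, num_samples=1)
        q_dists.append(q[0])
        draft_ids = torch.cat([draft_ids, tok], dim=-1)

    # 2) A single target forward pass scores all K proposals at once.
    p_all = F.softmax(target_model(draft_ids), dim=-1)

    n = input_ids.shape[1]
    for i in range(K):
        tok = draft_ids[0, n + i]
        p, q = p_all[0, n + i - 1], q_dists[i]
        # 3) Accept with probability min(1, p/q); on rejection, resample
        #    from the residual max(0, p - q), which recovers the target
        #    distribution exactly.
        if torch.rand(()) < torch.clamp(p[tok] / q[tok], max=1.0):
            continue
        residual = torch.clamp(p - q, min=0.0)
        new_tok = torch.multinomial(residual / residual.sum(), 1)
        return torch.cat([draft_ids[:, : n + i], new_tok.view(1, 1)], dim=-1)

    # 4) Every proposal was accepted: draw one bonus token from the target.
    bonus = torch.multinomial(p_all[0, -1], 1)
    return torch.cat([draft_ids, bonus.view(1, 1)], dim=-1)
```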
However, since the draft and target models differ in size, and therefore in the shape of their KV caches, the challenge is to apply each model's optimization strategies simultaneously. This article assumes quantized models and uses KV caches together with speculative sampling. Each model maintains its own KV cache: the draft model generates K tokens, the target model evaluates them and generates the next token itself whenever a draft token is rejected, and the draft model then proposes the next K tokens. The process repeats.
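To illustrate why per-model caches matter, the short sketch below (with placeholder models, not the ones benchmarked here) shows that each model builds an independent, differently shaped past_key_values object, so the two caches can be allocated and optimized separately:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder models: any target/draft pair sharing a tokenizer works.
tokenizer = AutoTokenizer.from_pretrained("gpt2-large")
target = AutoModelForCausalLM.from_pretrained("gpt2-large")
draft = AutoModelForCausalLM.from_pretrained("distilgpt2")

ids = tokenizer("Speculative sampling keeps", return_tensors="pt").input_ids

with torch.no_grad():
    # Each forward pass builds that model's own KV cache.
    target_past = target(ids, use_cache=True).past_key_values
    draft_past = draft(ids, use_cache=True).past_key_values

# The caches are independent and differently shaped (layer count, number
# of heads), so each can be sized and optimized for its own model.
print(target_past[0][0].shape)  # keys of the target's first layer
print(draft_past[0][0].shape)   # keys of the draft's first layer
```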
Note that the authors of (2) prove that speculative sampling recovers the target distribution: a draft token x is accepted with probability min(1, p(x)/q(x)), where p and q are the target and draft distributions, and on rejection a replacement is drawn from the normalized residual max(0, p - q); since min(q(x), p(x)) + max(0, p(x) - q(x)) = p(x), the output follows the target distribution exactly. This guarantees the same sampling quality as autoregressive sampling from the target model itself. Consequently, speculative sampling is not worthwhile only when the draft model is not enough smaller than the target to yield savings, or when its acceptance rate is too low to benefit from the draft model's smaller size.
A technique similar to speculative sampling, known as assisted generation, was developed independently around the same time (3). Its author merged it into Hugging Face Transformers, where the .generate() call has an optional assistant_model parameter that enables the method.
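In plain Transformers, enabling the method is a one-line change to .generate(). The model names below are placeholders; any target/draft pair sharing a tokenizer will do:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2-large")
model = AutoModelForCausalLM.from_pretrained("gpt2-large")
# The draft (assistant) model must use the same tokenizer as the target.
assistant = AutoModelForCausalLM.from_pretrained("distilgpt2")

inputs = tokenizer("Assisted generation works by", return_tensors="pt")
outputs = model.generate(**inputs,
                         assistant_model=assistant,  # enables assisted decoding
                         max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```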
Usage and Experiments
Using assisted generation is easy; an example is provided below. As expected, the --assistant_model parameter is used to specify the draft model, which proposes K tokens for the target model to verify, exactly as described above. The acceptance rate of the draft model depends in part on the input text. Typically, large transformer-based models showed a speedup of about 2x.
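On Intel Gaudi, the same call works once Transformers has been adapted through Optimum Habana. The sketch below assumes a Gaudi environment with the Habana PyTorch bridge installed; the model names are again placeholders:

```python
import habana_frameworks.torch.core  # registers the "hpu" device
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi

# Patch Transformers with Gaudi-optimized model implementations.
adapt_transformers_to_gaudi()

tokenizer = AutoTokenizer.from_pretrained("gpt2-large")
model = AutoModelForCausalLM.from_pretrained("gpt2-large").to("hpu")
assistant = AutoModelForCausalLM.from_pretrained("distilgpt2").to("hpu")

inputs = tokenizer("Intel Gaudi accelerates", return_tensors="pt").to("hpu")
outputs = model.generate(**inputs, assistant_model=assistant, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```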
Conclusion
Accelerating text generation on Gaudi with assisted generation is now supported and easy to use, and it can improve the performance of text generation workloads on Intel Gaudi processors. The method is based on speculative sampling, which has been shown to be effective at improving the performance of large transformer-based models.
(1) N. Shazeer, "Fast Transformer Decoding: One Write-Head is All You Need," November 2019. arXiv:1911.02150.
(2) C. Chen, S. Borgeaud, G. Irving, J.-B. Lespiau, L. Sifre, and J. Jumper, "Accelerating Large Language Model Decoding with Speculative Sampling," February 2023. arXiv:2302.01318.
(3) J. Gante, "Assisted Generation: a new direction toward low-latency text generation," May 2023. https://huggingface.co/blog/assisted-generation.