TL;DR: Many LLMs such as gemma-2-9b and Mixtral-8x22B-Instruct-v0.1 lack much smaller versions that could serve as assistant models for assisted generation. In this blog post, we introduce Universal Assisted Generation: a technique developed by Intel Labs and Hugging Face that extends assisted generation to work with a small language model from any model family 🤯. As a result, it is now possible to accelerate inference from any decoder or Mixture of Experts model by 1.5x-2.0x with almost zero overhead 🔥🔥🔥. Let's dive in!
Introduction
Today's most powerful open-weight LLMs typically have billions to hundreds of billions of parameters (hello Llama-3.1-405B 👋), and bringing these beasts into production poses a range of engineering challenges. One such challenge is that generating text from these large models is slow, which has prompted the community to develop a wide range of techniques to speed up the decoding process. Assisted generation, also known as speculative decoding, is a very common and practical approach for speeding up LLM inference without compromising accuracy. In this blog post, we'll take a look at how assisted generation works and share our research to extend it to any of the 140,000 language models on the Hugging Face Hub 🚀!
Assisted Generation
The core idea behind assisted generation involves the use of a pair of models, referred to as the target model and the assistant model. The assistant model is a smaller, more efficient version of the target model. For example, Llama-3.2-1B can be used as the assistant model for the larger Llama-3.1-70B target model. Assisted generation is an iterative process: in each iteration, the assistant model autoregressively generates a sequence of tokens, one at a time. The target model then verifies all the assistant tokens in that sequence in a single forward pass. The speedup comes from verifying multiple tokens in each target-model forward pass, rather than generating only one token per pass. See the original blog post for a more detailed explanation. Combined with the recently introduced dynamic speculation strategies, assisted generation can speed up text generation by 1.5x to 3x, depending on the task and the models used.
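For readers who have not used it before, here is a minimal sketch of standard (same-family) assisted generation with 🤗 Transformers, using the Llama pair mentioned above. The checkpoint choices, device_map, and max_new_tokens settings are illustrative assumptions, not a prescribed setup.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Same-family pair from the example above: both models share one tokenizer.
target_checkpoint = "meta-llama/Llama-3.1-70B"
assistant_checkpoint = "meta-llama/Llama-3.2-1B"

tokenizer = AutoTokenizer.from_pretrained(target_checkpoint)
model = AutoModelForCausalLM.from_pretrained(target_checkpoint, device_map="auto")
assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint, device_map="auto")

inputs = tokenizer("Alice and Bob", return_tensors="pt").to(model.device)

# The assistant drafts several tokens; the target verifies them in a single forward pass.
outputs = model.generate(**inputs, assistant_model=assistant_model, max_new_tokens=64)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```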
The significant speedups offered by assisted generation come with a major drawback: the target and assistant models must share the same tokenizer, meaning they must come from the same model family. However, many widely used models lack smaller versions that are both compact and accurate enough to deliver substantial latency reductions. In our experience, meaningful speedups are typically seen only when the assistant model is at least 50-100 times smaller than the target model. For example, CodeLlama-13b has no smaller version, and gemma-2-9b only has a 2b variant, which is still not sufficiently small or fast to achieve significant performance improvements.
Universal Assisted Generation
To alleviate this pain point, Intel Labs, together with our friends at Hugging Face, developed Universal Assisted Generation (UAG). UAG lets you pair any target and assistant models, regardless of their tokenizers. For example, gemma-2-9b can be used as the target model with the tiny vicuna-68m as the assistant.
The main idea behind the method is two-way tokenizer translation. When the assistant model completes a generation iteration, its tokens are converted to text, which is then tokenized with the target model's tokenizer to produce target tokens. After the verification step, the target tokens are converted back into the assistant token format in the same way and appended to the assistant model's context before the next iteration begins.
Because the assistant and target tokenizers use different vocabularies, we also need to handle discrepancies between them. To accurately re-encode the newly generated assistant tokens, it is essential to prepend a context window of the preceding tokens. The entire sequence is then re-encoded into the target token format and aligned with the most recent target tokens to pinpoint the exact position where the newly generated tokens should be appended. This process is illustrated in the video below.
Although not shown in the video above, re-encoding tokens from the target back to the assistant follows a similar process. Here, however, any mismatching tokens must be discarded from the assistant model's key-value (KV) cache to keep it consistent.
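As a rough illustration of the assistant-to-target translation step (a simplified sketch, not the actual 🤗 Transformers implementation), the new assistant tokens can be decoded to text together with a short context window, re-encoded with the target tokenizer, and then aligned against the re-encoded window. The function name and the window size of 10 below are arbitrary assumptions.

```python
def reencode_new_assistant_tokens(assistant_ids, num_new, assistant_tok, target_tok, window=10):
    """Rough sketch: translate newly drafted assistant tokens into target-tokenizer ids.

    assistant_ids is the full assistant-side token sequence, whose last num_new tokens
    are the fresh draft. A short context window is decoded together with the draft so
    the target-side re-encoding can be aligned; a real implementation must also handle
    tokens that merge or split differently at the boundary.
    """
    # Decode the context window alone, and the window plus the newly drafted tokens.
    prefix_text = assistant_tok.decode(assistant_ids[-(window + num_new):-num_new],
                                       skip_special_tokens=True)
    full_text = assistant_tok.decode(assistant_ids[-(window + num_new):],
                                     skip_special_tokens=True)

    # Re-encode both pieces of text with the target tokenizer.
    prefix_ids = target_tok.encode(prefix_text, add_special_tokens=False)
    full_ids = target_tok.encode(full_text, add_special_tokens=False)

    # Drop the shared prefix; what remains approximates the new candidate target tokens.
    i = 0
    while i < len(prefix_ids) and i < len(full_ids) and prefix_ids[i] == full_ids[i]:
        i += 1
    return full_ids[i:]
```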
Benchmarks
The table below shows the latency improvements observed for target models when paired with assistant models that use different tokenizers.
| Target model | Assistant model | Dataset | Task | Speedup |
|---|---|---|---|---|
| codellama/CodeLlama-13b-Instruct-hf | bigcode/tiny_starcoder_py | openai/humaneval | code generation | 1.90x |
| mistralai/Mixtral-8x22B-Instruct-v0.1 | double7/vicuna-68m | cnn_dailymail | summarization | 1.52x |
| google/gemma-2-9b | double7/vicuna-68m | cnn_dailymail | summarization | 1.76x |
| mistralai/Mixtral-8x22B-Instruct-v0.1 | Qwen/Qwen2-0.5B-Instruct | tau/scrolls | long-context summarization | 1.78x |
| meta-llama/Llama-3.1-70B | Qwen/Qwen2-0.5B-Instruct | tau/scrolls | long-context summarization | 1.78x |
| microsoft/Phi-3-medium-128k-instruct | Qwen/Qwen2-0.5B-Instruct | tau/scrolls | long-context summarization | 1.91x |
Note that none of the target models above have small variants (under 1 billion parameters) that are suitable for acceleration with standard assisted generation.
Each experiment was run on 100 randomly selected examples. The Llama and Mixtral target model experiments used two and four A100 GPUs, respectively; all other experiments ran on a single A6000 GPU.
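If you want to reproduce a rough version of these numbers, one simple approach is to time generation on the same prompts with and without the assistant and take the ratio. The helper below is only a sketch under that assumption; it ignores warm-up, batching, and other benchmarking details.

```python
import time

import torch


def timed_generate(model, inputs, **generate_kwargs):
    """Return the wall-clock latency (seconds) of a single generate() call."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, **generate_kwargs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.perf_counter() - start


# Speedup = baseline latency / assisted latency, averaged over the sampled examples, e.g.:
# baseline = timed_generate(model, inputs, max_new_tokens=256)
# assisted = timed_generate(model, inputs, max_new_tokens=256,
#                           assistant_model=assistant_model,
#                           tokenizer=tokenizer, assistant_tokenizer=assistant_tokenizer)
# print(f"speedup: {baseline / assisted:.2f}x")
```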
Code
Universal assisted generation was integrated into 🤗 Transformers in release 4.46.0.
To use it, pass tokenizer and assistant_tokenizer to generate():
from transformers import AutoModelForCausalLM, AutoTokenizer

prompt = "Alice and Bob"
checkpoint = "google/gemma-2-9b"
assistant_checkpoint = "double7/vicuna-68m"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
assistant_tokenizer = AutoTokenizer.from_pretrained(assistant_checkpoint)
inputs = tokenizer(prompt, return_tensors="pt")

model = AutoModelForCausalLM.from_pretrained(checkpoint)
assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint)
output = model.generate(**inputs, assistant_model=assistant_model, tokenizer=tokenizer, assistant_tokenizer=assistant_tokenizer)
tokenizer.batch_decode(output, skip_special_tokens=True)
# ['Alice and Bob are sitting at a bar. Alice is drinking a beer and Bob is drinking a beer.']
Future Directions
When do_sample=True is passed, standard assisted generation uses the speculative sampling algorithm (Algorithm 1 in the paper), whereas UAG currently supports only multinomial sampling. In multinomial sampling, if the target model does not sample the same token as the assistant, the token is automatically rejected, which is not the case with speculative sampling. In practice, this means that UAG with do_sample=True achieves lower throughput than when the assistant shares the target's tokenizer. In the future, we plan to add support for speculative sampling with UAG. Additionally, we intend to integrate UAG into 🤗 Transformers pipelines for more concise and streamlined usage.
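For context, the speculative sampling rule from the paper accepts a drafted token x with probability min(1, p_target(x) / p_assistant(x)) and, on rejection, resamples from the normalized positive part of the difference between the two distributions. The snippet below is a minimal sketch of that acceptance step for a single token (function name and interface are our own), not the Transformers implementation.

```python
import torch


def speculative_accept(token_id, p_target, p_assistant):
    """Accept or reject one drafted token, following the speculative sampling rule.

    p_target and p_assistant are the two models' probability distributions over the
    vocabulary at the current position (1-D tensors summing to 1).
    """
    # Accept the assistant's token with probability min(1, p_target / p_assistant).
    accept_prob = torch.clamp(p_target[token_id] / p_assistant[token_id], max=1.0)
    if torch.rand(()) < accept_prob:
        return token_id
    # On rejection, resample from the normalized residual max(0, p_target - p_assistant).
    residual = torch.clamp(p_target - p_assistant, min=0.0)
    residual = residual / residual.sum()
    return int(torch.multinomial(residual, num_samples=1))
```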
References