TL;DR: Many LLMs such as gemma-2-9b and Mixtral-8x22B-Instruct-v0.1 lack much smaller versions that could serve as assistant models for assisted generation. In this blog post, we introduce Universal Assisted Generation: a technique developed by Intel Labs and Hugging Face that extends assisted generation to work with a small language model from any model family 🤯. As a result, it is now possible to accelerate inference from any decoder or Mixture of Experts model by 1.5x-2.0x with almost zero overhead 🔥🔥🔥. Let's dive in!
Introduction
Today's most powerful open-weight LLMs typically have billions to hundreds of billions of parameters (hello Llama-3.1-405B 👋), and bringing these beasts into production poses a range of engineering challenges. One such challenge is that generating text from these large models is slow, which has prompted the community to develop a wide range of techniques to speed up the decoding process. Assisted generation, also known as speculative decoding, is a very common and practical approach for speeding up LLM inference without compromising accuracy. In this blog post, we'll take a look at how assisted generation works and share our research to extend it to any of the 140,000 language models on the Hugging Face Hub 🚀!
Assisted Generation
The core idea behind assisted generation involves the use of a pair of models, referred to as the target model and the assistant model. The assistant model is a smaller, more efficient version of the target model. For example, Llama-3.2-1B can be used as the assistant model for the larger Llama-3.1-70B target model. Assisted generation is an iterative process: in each iteration, the assistant model autoregressively generates a sequence of tokens, one at a time. The target model then verifies all the assistant tokens in that sequence in a single forward pass. The speedup comes from verifying multiple tokens in each target-model forward pass, rather than generating only one token per pass. See the original blog post for a more detailed explanation. Combined with the recently introduced dynamic speculation strategies, assisted generation can speed up text generation by 1.5x to 3x, depending on the task and the models used.
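For readers who have not used it before, here is a minimal sketch of standard (same-family) assisted generation with 🤗 Transformers, using the Llama pair mentioned above. The checkpoint choices, device_map, and max_new_tokens settings are illustrative assumptions, not a prescribed setup.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Same-family pair from the example above: both models share one tokenizer.
target_checkpoint = "meta-llama/Llama-3.1-70B"
assistant_checkpoint = "meta-llama/Llama-3.2-1B"

tokenizer = AutoTokenizer.from_pretrained(target_checkpoint)
model = AutoModelForCausalLM.from_pretrained(target_checkpoint, device_map="auto")
assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint, device_map="auto")

inputs = tokenizer("Alice and Bob", return_tensors="pt").to(model.device)

# The assistant drafts several tokens; the target verifies them in a single forward pass.
outputs = model.generate(**inputs, assistant_model=assistant_model, max_new_tokens=64)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```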
The significant speedups offered by assisted generation come with a major drawback: the target and assistant models must share the same tokenizer, meaning they must come from the same model family. However, many widely used models lack smaller versions that are both compact and accurate enough to deliver substantial latency reductions. In our experience, meaningful speedups are typically seen only when the assistant model is at least 50-100 times smaller than the target model. For example, CodeLlama-13b has no smaller version, and gemma-2-9b only has a 2b variant, which is still not sufficiently small or fast to achieve significant performance improvements.
Universal Assisted Generation
To alleviate this pain point, Intel Labs, together with our friends at Hugging Face, developed Universal Assisted Generation (UAG). UAG lets you pair any target and assistant models, regardless of their tokenizers. For example, gemma-2-9b can be used as the target model with the tiny vicuna-68m as the assistant.
The main idea behind the method is two-way tokenizer translation. When the assistant model completes a generation iteration, its tokens are converted to text, which is then tokenized with the target model's tokenizer to produce target tokens. After the verification step, the target tokens are converted back into the assistant token format in the same way and appended to the assistant model's context before the next iteration begins.
Because the assistant and target tokenizers use different vocabularies, we also need to handle discrepancies between them. To accurately re-encode the newly generated assistant tokens, it is essential to prepend a context window of the preceding tokens. The entire sequence is then re-encoded into the target token format and aligned with the most recent target tokens to pinpoint the exact position where the newly generated tokens should be appended. This process is illustrated in the video below.
Although not shown in the video above, re-encoding tokens from the target back to the assistant follows a similar process. Here, however, any mismatching tokens must be discarded from the assistant model's key-value (KV) cache to keep it consistent.
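As a rough illustration of the assistant-to-target translation step (a simplified sketch, not the actual 🤗 Transformers implementation), the new assistant tokens can be decoded to text together with a short context window, re-encoded with the target tokenizer, and then aligned against the re-encoded window. The function name and the window size of 10 below are arbitrary assumptions.

```python
def reencode_new_assistant_tokens(assistant_ids, num_new, assistant_tok, target_tok, window=10):
    """Rough sketch: translate newly drafted assistant tokens into target-tokenizer ids.

    assistant_ids is the full assistant-side token sequence, whose last num_new tokens
    are the fresh draft. A short context window is decoded together with the draft so
    the target-side re-encoding can be aligned; a real implementation must also handle
    tokens that merge or split differently at the boundary.
    """
    # Decode the context window alone, and the window plus the newly drafted tokens.
    prefix_text = assistant_tok.decode(assistant_ids[-(window + num_new):-num_new],
                                       skip_special_tokens=True)
    full_text = assistant_tok.decode(assistant_ids[-(window + num_new):],
                                     skip_special_tokens=True)

    # Re-encode both pieces of text with the target tokenizer.
    prefix_ids = target_tok.encode(prefix_text, add_special_tokens=False)
    full_ids = target_tok.encode(full_text, add_special_tokens=False)

    # Drop the shared prefix; what remains approximates the new candidate target tokens.
    i = 0
    while i < len(prefix_ids) and i < len(full_ids) and prefix_ids[i] == full_ids[i]:
        i += 1
    return full_ids[i:]
```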
Benchmarks
The table below shows the latency improvements observed for target models when paired with assistant models that use different tokenizers.
| Target model | Assistant model | Dataset | Task | Speedup |
|---|---|---|---|---|
| codellama/CodeLlama-13b-Instruct-hf | bigcode/tiny_starcoder_py | openai/humaneval | code generation | 1.90x |
| mistralai/Mixtral-8x22B-Instruct-v0.1 | double7/vicuna-68m | cnn_dailymail | summarization | 1.52x |
| google/gemma-2-9b | double7/vicuna-68m | cnn_dailymail | summarization | 1.76x |
| mistralai/Mixtral-8x22B-Instruct-v0.1 | Qwen/Qwen2-0.5B-Instruct | tau/scrolls | long-context summarization | 1.78x |
| meta-llama/Llama-3.1-70B | Qwen/Qwen2-0.5B-Instruct | tau/scrolls | long-context summarization | 1.78x |
| microsoft/Phi-3-medium-128k-instruct | Qwen/Qwen2-0.5B-Instruct | tau/scrolls | long-context summarization | 1.91x |
Note that none of the target models above have small variants (under 1 billion parameters) that are suitable for acceleration with standard assisted generation.
Each experiment was run on 100 randomly selected examples. The Llama and Mixtral target model experiments used two and four A100 GPUs, respectively; all other experiments ran on a single A6000 GPU.
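If you want to reproduce a rough version of these numbers, one simple approach is to time generation on the same prompts with and without the assistant and take the ratio. The helper below is only a sketch under that assumption; it ignores warm-up, batching, and other benchmarking details.

```python
import time

import torch


def timed_generate(model, inputs, **generate_kwargs):
    """Return the wall-clock latency (seconds) of a single generate() call."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, **generate_kwargs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.perf_counter() - start


# Speedup = baseline latency / assisted latency, averaged over the sampled examples, e.g.:
# baseline = timed_generate(model, inputs, max_new_tokens=256)
# assisted = timed_generate(model, inputs, max_new_tokens=256,
#                           assistant_model=assistant_model,
#                           tokenizer=tokenizer, assistant_tokenizer=assistant_tokenizer)
# print(f"speedup: {baseline / assisted:.2f}x")
```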
Code
Universal assisted generation was integrated into 🤗 Transformers in release 4.46.0.
To use it, pass tokenizer and assistant_tokenizer to generate():
from transformers import AutoModelForCausalLM, AutoTokenizer

prompt = "Alice and Bob"
checkpoint = "google/gemma-2-9b"
assistant_checkpoint = "double7/vicuna-68m"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
assistant_tokenizer = AutoTokenizer.from_pretrained(assistant_checkpoint)
inputs = tokenizer(prompt, return_tensors="pt")

model = AutoModelForCausalLM.from_pretrained(checkpoint)
assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint)
output = model.generate(**inputs, assistant_model=assistant_model, tokenizer=tokenizer, assistant_tokenizer=assistant_tokenizer)
tokenizer.batch_decode(output, skip_special_tokens=True)
# ['Alice and Bob are sitting at a bar. Alice is drinking a beer and Bob is drinking a beer.']
Future Directions
When do_sample=True is passed, standard assisted generation uses the speculative sampling algorithm (Algorithm 1 in the paper), whereas UAG currently supports only multinomial sampling. In multinomial sampling, if the target model does not sample the same token as the assistant, the token is automatically rejected, which is not the case with speculative sampling. In practice, this means that UAG with do_sample=True achieves lower throughput than when the assistant shares the target's tokenizer. In the future, we plan to add support for speculative sampling with UAG. Additionally, we intend to integrate UAG into 🤗 Transformers pipelines for more concise and streamlined usage.
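For context, the speculative sampling rule from the paper accepts a drafted token x with probability min(1, p_target(x) / p_assistant(x)) and, on rejection, resamples from the normalized positive part of the difference between the two distributions. The snippet below is a minimal sketch of that acceptance step for a single token (function name and interface are our own), not the Transformers implementation.

```python
import torch


def speculative_accept(token_id, p_target, p_assistant):
    """Accept or reject one drafted token, following the speculative sampling rule.

    p_target and p_assistant are the two models' probability distributions over the
    vocabulary at the current position (1-D tensors summing to 1).
    """
    # Accept the assistant's token with probability min(1, p_target / p_assistant).
    accept_prob = torch.clamp(p_target[token_id] / p_assistant[token_id], max=1.0)
    if torch.rand(()) < accept_prob:
        return token_id
    # On rejection, resample from the normalized residual max(0, p_target - p_assistant).
    residual = torch.clamp(p_target - p_assistant, min=0.0)
    residual = residual / residual.sum()
    return int(torch.multinomial(residual, num_samples=1))
```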
References