Accelerating StarCoder on Intel Xeon: Q8/Q4 and speculative decoding

By versatileai | July 28, 2025

Code generation models have recently become extremely popular with the release of state-of-the-art open-source models such as BigCode's StarCoder and Meta AI's Code Llama, and a growing body of work focuses on making large language models (LLMs) more accessible. In this blog, we are happy to share the latest results of LLM optimization on Intel Xeon, focusing on StarCoder, a popular code-generation LLM.

StarCoder is a state-of-the-art LLM designed to assist users with a variety of coding tasks, including code completion, bug fixing, code summarization, and even generating code snippets from natural-language descriptions. The StarCoder model is a member of the StarCoder family, which also includes the StarCoderBase variant. These large language models for code (code LLMs) are trained on permissively licensed data from GitHub covering more than 80 programming languages, Git commits, GitHub issues, and Jupyter notebooks. This work demonstrates more than 7x inference acceleration of the StarCoder-15B model on 4th Generation Intel Xeon by combining 8-bit and 4-bit quantization with assisted generation.

Try the demo on Hugging Face Spaces, running on a 4th Generation Intel Xeon Scalable processor.

Step 1: Baselines and evaluations

We establish a baseline by running StarCoder (15B) with PyTorch and Intel Extension for PyTorch (IPEX). Several datasets are designed to assess the quality of automatic code completion; here we use the popular HumanEval dataset to evaluate both the quality and the performance of the model. HumanEval consists of 164 programming problems given as a function signature with a docstring, and the model completes the function's code. The average prompt length is 139 tokens. We measure quality with the Big Code Evaluation Harness and report the pass@1 metric, and we measure performance as the Time To First Token (TTFT) and Time Per Output Token (TPOT) on the HumanEval test set, reporting their averages.

4th Generation Intel Xeon processors feature AI inference acceleration known as Intel® Advanced Matrix Extensions (Intel® AMX). Specifically, every core includes BFLOAT16 (BF16) and INT8 GEMM accelerators to speed up deep learning training and inference workloads. AMX-accelerated inference is available through PyTorch 2.0 and Intel Extension for PyTorch (IPEX), alongside other optimizations of common operators used in LLM inference (e.g., layer normalization, softmax, scaled dot product). As a starting point, we run inference on the BF16 model using the out-of-the-box optimizations in PyTorch and IPEX. Figure 1 shows the latency of the baseline model, and Table 1 summarizes both latency and accuracy.


Figure 1. Baseline model latency.
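To make the baseline setup concrete, here is a minimal sketch of BF16 inference with PyTorch and IPEX. It is illustrative only: the checkpoint name, prompt, and generation settings are assumptions, not the exact benchmark configuration.

# Sketch: BF16 baseline inference with PyTorch + Intel Extension for PyTorch (IPEX).
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigcode/starcoder"  # assumed checkpoint, for illustration only
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

# Apply IPEX out-of-the-box optimizations (operator fusion, AMX-friendly BF16 kernels).
model = ipex.optimize(model, dtype=torch.bfloat16)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))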

LLM quantization

Text generation in LLMs is autoregressive, so the entire model must be loaded from memory to the CPU for every newly generated token. As a result, off-chip memory (DRAM) bandwidth is the biggest bottleneck in the token generation process. Quantization is a common way to alleviate this problem: it reduces the model size, which reduces the time spent loading the model weights.
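As a rough back-of-the-envelope illustration of why weight size dominates token latency, the sketch below assumes a round 200 GB/s of sustained DRAM bandwidth (an assumed figure, not a measurement) and treats each generated token as one full pass over the weights:

# Rough, assumed numbers for illustration only: one generated token is modeled as
# one full read of the weights from DRAM, giving a memory-bound latency floor.
params = 15e9                                   # StarCoder-15B parameters
bytes_per_weight = {"bf16": 2, "int8": 1, "int4": 0.5}
dram_bandwidth = 200e9                          # assumed ~200 GB/s sustained bandwidth

for fmt, nbytes in bytes_per_weight.items():
    weight_bytes = params * nbytes
    latency_ms = weight_bytes / dram_bandwidth * 1e3
    print(f"{fmt}: ~{latency_ms:.0f} ms per token (memory-bound floor)")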

This work focuses on two types of quantization:

  • Weight-only quantization (WOQ) – only the weights are quantized; activations are left unquantized and the computation is performed at higher precision (e.g., BF16).
  • Static quantization (SQ) – both weights and activations are quantized. A calibration step pre-computes the quantization parameters, which allows the computation to be performed at lower precision (e.g., INT8). Figure 2 illustrates the computation flow for INT8 static quantization.

Step 2: 8-bit quantization (INT8)

SmoothQuant is a post-training quantization algorithm used to quantize LLMs to INT8 with minimal accuracy loss. Static quantization methods tend to perform poorly on LLMs because of the large-magnitude outliers found in specific activation channels: since activations are quantized per token, static quantization either truncates the outliers or underutilizes the quantization range for the low-magnitude activations. The SmoothQuant algorithm solves this by introducing a pre-quantization phase that applies a smoothing scaling factor to both activations and weights, smoothing out the activation outliers and ensuring better utilization of the quantization levels.
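The core smoothing transform can be sketched in a few lines of PyTorch. The alpha value and tensor shapes below are illustrative assumptions; in practice IPEX or a quantization toolkit performs this step during calibration.

# Sketch of the SmoothQuant idea on one linear layer: migrate activation outliers into the weights.
import torch

def smooth_linear(act_max, weight, alpha=0.5):
    # act_max: per-channel max |activation| observed during calibration, shape [in_features]
    # weight:  linear weight, shape [out_features, in_features]
    w_max = weight.abs().amax(dim=0)                              # per-input-channel weight magnitude
    scale = (act_max.pow(alpha) / w_max.pow(1 - alpha)).clamp(min=1e-5)
    # Activations are divided by `scale` and weights multiplied by it, so X @ W.T is unchanged,
    # but the activation distribution becomes much easier to quantize.
    smoothed_weight = weight * scale                              # broadcast over input channels
    return scale, smoothed_weight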

Figure 2. Computational diagram for INT8 static quantization.

We apply SmoothQuant to the StarCoder model using IPEX. Q8-StarCoder was produced using the test split of the MBPP dataset as the calibration set. Our evaluation shows that Q8-StarCoder has no accuracy loss relative to the baseline (in fact, there is even a slight improvement). In terms of performance, Q8-StarCoder achieves roughly a 2.19x speedup in TTFT and roughly 2.20x in TPOT. Figure 3 shows the latency (TPOT) of Q8-StarCoder compared to the BF16 baseline model.

Figure 3. Latency speedup of the 8-bit quantized model.

Step 3: 4-bit quantization (INT4)

INT8 reduces the model size by 2x compared to BF16 (8 bits per weight instead of 16), but memory bandwidth remains the biggest bottleneck. To further reduce the time spent loading the model from memory, we quantized the model weights to 4 bits using WOQ. Note that with 4-bit WOQ the weights must be dequantized back to 16 bits before computation (Figure 4), which introduces compute overhead.

Figure 4. Computational diagram of a model quantized to INT4.

The basic WOQ technique, per-tensor asymmetric round-to-nearest (RTN) quantization, poses challenges and often results in reduced accuracy, but the literature (Zhewei Yao, 2022) shows that group-wise quantization of the model's weights helps preserve accuracy. To avoid accuracy degradation, we perform 4-bit quantization over groups of consecutive values along the input channel (e.g., 128 values per group) and compute a scaling factor for each group. We found group-wise 4-bit RTN sufficient to preserve StarCoder's accuracy on the HumanEval dataset. The 4-bit model achieves a 3.35x TPOT speedup compared to the BF16 baseline (Figure 5), but suffers an expected 0.84x slowdown in TTFT (Table 1) due to the overhead of dequantizing the 4-bit weights to 16-bit before computation.
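For intuition, here is a minimal sketch of group-wise asymmetric RTN quantization to 4 bits, assuming a group size of 128 along the input channel; the exact packing and layout used by the optimized kernels may differ.

# Sketch: group-wise asymmetric RTN quantization to 4 bits (group size 128 along input channels).
import torch

def quantize_rtn_int4(weight, group_size=128):
    # weight: [out_features, in_features]; in_features is assumed divisible by group_size.
    out_f, in_f = weight.shape
    w = weight.reshape(out_f, in_f // group_size, group_size)
    w_min = w.amin(dim=-1, keepdim=True)
    w_max = w.amax(dim=-1, keepdim=True)
    scale = ((w_max - w_min) / 15.0).clamp(min=1e-8)              # 4 bits -> 16 levels (0..15)
    zero_point = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(w / scale + zero_point), 0, 15)
    # Dequantized weights are what the matmul actually sees after the 4 -> 16 bit step.
    dequant = (q - zero_point) * scale
    return q.to(torch.uint8), scale, zero_point, dequant.reshape(out_f, in_f)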

Figure 5. Latency speedup of the 4-bit quantized model.

Different bottlenecks for the first token and subsequent tokens

The first step, generating the first token, processes the entire input prompt in parallel and demands significant compute resources when the prompt is long, so compute is the bottleneck at this stage. Switching from BF16 to INT8 precision therefore improves performance over the baseline (and over 4-bit WOQ, which adds compute overhead in the form of dequantization). From the second step onward, however, the system autoregressively generates the remaining tokens, and the model is loaded from memory again for each newly generated token. The bottleneck becomes memory bandwidth rather than the amount of computation (FLOPs), so INT4 outperforms INT8 and BF16.

Step 4: Assisted Generation (AG)

Another way to reduce inference latency and alleviate the memory-bandwidth bottleneck is Assisted Generation (AG), a practical implementation of speculative decoding. AG improves the balance between memory and compute operations, relying on the observation that a small, fast draft model often generates the same tokens as the larger target model.

AG uses the small, fast draft model to greedily generate K candidate tokens. These tokens are produced much faster, but some of them may not match the output of the original target model. In the next step, the target model therefore checks the validity of all K candidate tokens in parallel in a single forward pass. This speeds up decoding because the latency of decoding K tokens in parallel is lower than that of autoregressively generating K tokens.
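Conceptually, one draft-and-verify round of greedy speculative decoding can be sketched as follows; this is an illustration of the idea, not the optimized implementation used in the benchmark.

# Conceptual sketch of one draft-and-verify round in greedy speculative decoding.
import torch

def speculative_step(target, draft, input_ids, k=4):
    # 1) The draft model greedily proposes k candidate tokens, one at a time (fast, small model).
    candidate_ids = input_ids
    for _ in range(k):
        next_tok = draft(candidate_ids).logits[:, -1, :].argmax(dim=-1, keepdim=True)
        candidate_ids = torch.cat([candidate_ids, next_tok], dim=-1)

    # 2) The target model scores all candidates in a single forward pass
    #    (compute-bound, instead of k memory-bound passes).
    target_logits = target(candidate_ids).logits
    target_preds = target_logits[:, input_ids.shape[1] - 1 : -1, :].argmax(dim=-1)

    # 3) Accept the longest prefix of candidates matching what the target would have generated.
    #    Real implementations also append the target's own token at the first mismatch,
    #    so at least one new token is produced per round; omitted here for brevity.
    proposed = candidate_ids[:, input_ids.shape[1]:]
    matches = (target_preds == proposed).long().cumprod(dim=-1)
    n_accepted = int(matches.sum())
    return torch.cat([input_ids, proposed[:, :n_accepted]], dim=-1)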

To accelerate StarCoder, we use bigcode/tiny_starcoder_py as the draft model. This model shares a similar architecture with StarCoder but contains only 164M parameters, roughly 95x fewer than StarCoder, and is therefore much faster. In addition to quantizing the target model, we also quantize the draft model to achieve an even greater speedup. We considered both 8-bit SmoothQuant and 4-bit WOQ quantization for the draft and target models and found that 8-bit SmoothQuant for both yields the best results: about a 7.30x TPOT speedup (Figure 6).
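In the transformers library, assisted generation is exposed through the assistant_model argument of generate; the sketch below shows the general usage with illustrative model ids and settings, rather than the exact quantized benchmark setup.

# Sketch: assisted generation with a small draft model (ids and settings are illustrative).
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "bigcode/starcoder"            # large target model
draft_id = "bigcode/tiny_starcoder_py"     # ~164M-parameter draft model

tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id)
draft = AutoModelForCausalLM.from_pretrained(draft_id)

inputs = tokenizer("def quicksort(arr):", return_tensors="pt")
# The draft model proposes candidate tokens; the target model validates them in one forward pass.
output = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))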

These quantization choices are supported by the following observations:

  • Quantization of the draft model: with an 8-bit quantized, 164M-parameter StarCoder as the draft model, the model largely fits into the CPU cache. As a result, the memory-bandwidth bottleneck is reduced, since tokens can be generated without repeatedly reading the draft model from off-chip memory for every token. In this case there is no memory bottleneck, and we see a better speedup with StarCoder-164M quantized to 8-bit than with 4-bit WOQ. Note that 4-bit WOQ retains the advantage when memory bandwidth is the bottleneck, thanks to its smaller memory footprint, but it carries compute overhead because the weights must be dequantized from 4-bit to 16-bit before computation.
  • Quantization of the target model: in assisted generation, the target model processes a sequence of K tokens generated by the draft model. Pushing K tokens through the target model at once (in parallel), instead of the “standard” sequential autoregressive processing, shifts the balance from memory bandwidth toward a compute bottleneck. We therefore observed that an 8-bit quantized target model yields a higher speedup than a 4-bit one, due to the additional compute overhead of dequantizing every value from 4-bit to 16-bit.

Figure 6. Latency speedup of the optimized models.

StarCoder | Quantization | Precision | HumanEval (pass@1) | TTFT (ms) | TTFT speedup | TPOT (ms) | TPOT speedup
Baseline | None | A16W16 | 33.54 | 357.9 | 1.00x | 181.0 | 1.00x
INT8 | SmoothQuant | A8W8 | 33.96 | 163.4 | 2.19x | 82.4 | 2.20x
INT4 | RTN (g128) | A16W4 | 32.80 | 425.1 | 0.84x | 54.0 | 3.35x
INT8 + AG | SmoothQuant | A8W8 | 33.96 | 183.6 | 1.95x | 24.8 | 7.30x

Table 1. Accuracy and latency measurements of the StarCoder model on 4th Gen Intel Xeon.

To load the resulting model and run inference, you can simply replace your AutoModelForXxx class with the corresponding IPEXModelForXxx class from Optimum Intel.

Before you begin, make sure you have all the libraries you need installed.

pip install --upgrade-strategy eager optimum[ipex]

- from transformers import AutoModelForCausalLM
+ from optimum.intel import IPEXModelForCausalLM
  from transformers import AutoTokenizer, pipeline

- model = AutoModelForCausalLM.from_pretrained(model_id)
+ model = IPEXModelForCausalLM.from_pretrained(model_id)
  tokenizer = AutoTokenizer.from_pretrained(model_id)
  pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
  results = pipe("He is a scary magician and")
