As large language models (LLMs) and vision-language models (VLMs) continue to grow in size and complexity, deploying them efficiently becomes increasingly difficult. Quantization offers a solution by reducing model size and inference latency. Intel's AutoRound is a cutting-edge quantization tool that balances accuracy, efficiency, and compatibility.
AutoRound is a weight-only post-training quantization (PTQ) method developed by Intel. It uses signed gradient descent to jointly optimize weight rounding and clipping ranges, enabling accurate low-bit quantization (e.g., INT2 to INT8) with minimal accuracy loss in most scenarios. At INT2, for example, it achieves up to 2.1x higher relative accuracy than popular baselines. The figure below gives an overview of the core AutoRound algorithm; see our paper for more details.
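To make the core idea concrete, here is a minimal, self-contained sketch of the kind of optimization AutoRound performs. It is a toy illustration with random tensors, not the library's actual implementation: a per-weight rounding offset in [-0.5, 0.5] and a learnable clipping scale are tuned with signed gradient descent so that the quantized layer reproduces the original layer's output on calibration data.

# Toy sketch of signed-gradient rounding optimization (not the library's code).
import torch

torch.manual_seed(0)
W = torch.randn(256, 256)          # pretrained weight of one linear layer
X = torch.randn(64, 256)           # calibration activations for that layer
bits = 4
qmax = 2 ** (bits - 1) - 1         # symmetric signed range, e.g. [-8, 7]

v = torch.zeros_like(W, requires_grad=True)             # rounding perturbation in [-0.5, 0.5]
alpha = torch.ones(W.shape[0], 1, requires_grad=True)   # learnable clipping scale

lr = 1.0 / 200                                          # ~200 steps, as in the default recipe
for step in range(200):
    scale = (W.abs().amax(dim=1, keepdim=True) * alpha) / qmax
    x = W / scale + v
    q = torch.clamp(torch.round(x), -qmax - 1, qmax)
    q = (q - x).detach() + x                            # straight-through estimator for round()
    W_q = q * scale
    loss = ((X @ W.T - X @ W_q.T) ** 2).mean()          # layer output reconstruction error
    loss.backward()
    with torch.no_grad():
        # Signed gradient descent: step by a fixed size in the sign direction.
        v -= lr * v.grad.sign()
        v.clamp_(-0.5, 0.5)
        alpha -= lr * alpha.grad.sign()
    v.grad = None
    alpha.grad = None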
Despite its strong accuracy, AutoRound is fast and lightweight: quantizing a 72B model takes only 37 minutes on an A100 GPU in light mode. It also supports mixed-bit tuning, lm-head quantization, export to GPTQ/AWQ/GGUF formats, and flexible tuning recipes.
1. Excellent accuracy at low bit widths
AutoRound delivers very promising results, especially in low-bit quantization scenarios. Evaluations across diverse tasks show that it outperforms widely used methods by a clear margin at 2-bit precision (source). At 4-bit, AutoRound remains competitive in most cases, as shown on the low-bit open LLM leaderboard.
Average accuracy over 10+ tasks at W2G128.
Average accuracy over 10+ tasks at W4.
2. Wide compatibility
Models
LLMs: AutoRound supports nearly all popular LLM architectures, including well-known models such as Qwen, LLaMA, and DeepSeek. Ready-to-use quantized models are available on Hugging Face through collections such as OPEA, Kaitchup, and fbaldassarri.
VLMs: AutoRound supports more than 10 vision-language models (VLMs), including Mistral-Small-3.1, Gemma3, and more. The complete list is in the README, and ready-to-use quantized models are available in the OPEA Hugging Face collection. For models that are not yet supported, you can still apply RTN (round-to-nearest) quantization; a short sketch follows this list.
Devices
CPU, Intel GPU, and CUDA.
Quantization configurations
Int8 weight-only, Int4 weight-only, Int3 weight-only, Int2 weight-only, and mixed-bit weight-only.
Export formats
AutoRound, GPTQ, AWQ, and some GGUF formats.
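For architectures that AutoRound cannot yet tune, RTN remains available through the same API. Below is a minimal sketch, assuming recent releases accept iters=0 to skip the tuning loop (check the README for the exact option); the model name is only a placeholder.

# Sketch: plain round-to-nearest (RTN) quantization via AutoRound,
# assuming iters=0 disables the signed-gradient tuning loop.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen3-0.6B"  # placeholder; substitute the unsupported model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

autoround = AutoRound(model, tokenizer, bits=4, group_size=128, iters=0)
autoround.quantize_and_save("./tmp_rtn", format="auto_round")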
3. Flexible/Efficient Quantization
AutoRound requires only 200 tuning steps and a small calibration dataset of just 128 samples to reach high accuracy. This efficiency translates into faster quantization and lower resource consumption compared to other, more computationally intensive INT2 methods. (A sketch of how these calibration knobs appear in the API follows the table below.)
Quantization time comparison across tools and calibration settings (– = not reported):

Method (samples / seqlen / dataset) | Qwen2.5-3B | Llama3.1-8B | Qwen2.5-72B
AutoAWQ (128 / 512 / pile) | 7 min | 13 min | 105 min
AutoAWQ (512 / 2048 / pile) | 17 min | 27 min | 230 min
GPTQ in Transformers (? / ? / c4) | 13 min | 22 min | OOM
AutoRound light (128 / 2048 / pile-10k) | 3 min | 6 min | 37 min
AutoRound (128 / 2048 / pile-10k) | 8 min | 13 min | 120 min
AutoRound (512 / 2048 / pile-10k) | 9 min | – | 149 min
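These cost differences mostly come down to calibration settings. Below is a hedged sketch of how such knobs could be set through the Python API, assuming recent releases expose nsamples, seqlen, iters, and dataset arguments (verify the exact names and the official light/best recipes against the AutoRound README).

# Sketch: trading accuracy for speed by shrinking the tuning budget.
# Argument names (nsamples, seqlen, iters, dataset) are assumptions based on
# recent auto-round releases; check the README before relying on them.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen3-0.6B"  # small placeholder model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

autoround = AutoRound(
    model,
    tokenizer,
    bits=4,
    group_size=128,
    nsamples=128,                  # calibration samples
    seqlen=2048,                   # calibration sequence length
    iters=50,                      # fewer tuning steps than the default 200
    dataset="NeelNanda/pile-10k",  # calibration dataset
)
autoround.quantize_and_save("./tmp_autoround_fast", format="auto_round")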
Install
pip install auto-round
Quantization and serialization
Currently, only offline mode is supported to generate quantized models.
Using the command line
auto-round \
    --model Qwen/Qwen3-0.6B \
    --bits 4 \
    --group_size 128 \
    --format "auto_round,auto_awq,auto_gptq" \
    --output_dir ./tmp_autoround
AutoRound also offers two other recipes, auto-round-best and auto-round-light, designed for maximum accuracy and improved speed, respectively.
auto-round-best \
    --model Qwen/Qwen3-0.6B \
    --output_dir ./tmp_autoround
For 2-bit quantization, we recommend auto-round-best or auto-round. See the table below for a comparison of the three recipes.
W4G128 average accuracy and time cost across 13 tasks (mmlu-pro, if_eval, gsm8k, etc.), measured on an NVIDIA A100 80GB with PyTorch 2.6.0 and enable_torch_compile:
Recipe | Qwen2.5-0.5B-Instruct | Falcon3-3B | Qwen2.5-7B-Instruct | Meta-Llama-3.1-8B-Instruct | Falcon3-10B | Qwen2.5-72B-Instruct
16 bits | 0.4192 | 0.5203 | 0.6470 | 0.6212 | 0.6151 | 0.7229
Best | 0.4137 (7m) | 0.5142 (23m) | 0.6426 (58m) | 0.6116 (65m) | 0.6092 (81m) | 0.7242 (575m)
Default | 0.4129 (2m) | 0.5133 (6m) | 0.6441 (13m) | 0.6106 (13m) | 0.6080 (2m) | 0.7252 (118m)
Light | – | 0.5108 (3m) | 0.6453 (5m) | 0.6104 (6m) | 0.6063 (6m) | 0.7243 (37m)
AutoRound API Usage
This setting offers a better trade-off between accuracy and tuning cost and is recommended in all scenarios.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen3-0.6B"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

bits, group_size, sym = 4, 128, True
autoround = AutoRound(model, tokenizer, bits=bits, group_size=group_size, sym=sym)

output_dir = "./tmp_autoround"
autoround.quantize_and_save(output_dir, format="auto_round,auto_awq,auto_gptq")
See the AutoRound README for API usage of the best/light settings and for mixed-bit configurations.
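As a rough illustration of what a mixed-bit setup can look like (the layer_config argument and the layer names below are assumptions for illustration; the README documents the exact interface):

# Sketch: mixed-bit quantization via a per-layer configuration.
# layer_config and the layer names shown are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen3-0.6B"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

layer_config = {
    "lm_head": {"bits": 4},                          # quantize the LM head too
    "model.layers.0.self_attn.q_proj": {"bits": 8},  # keep a sensitive layer at 8-bit
}
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, layer_config=layer_config)
autoround.quantize_and_save("./tmp_autoround_mixed", format="auto_round")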
Inference
AutoRound automatically selects the best available backend based on the installed libraries and prompts the user to install additional libraries when a better backend is available. For more information, see the Hugging Face README or the AutoRound README.
CPU/Intel GPU/CUDA
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "OPEA/Qwen2.5-1.5B-Instruct-int4-sym-inc"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

text = "There's a girl who likes adventures."
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50, do_sample=False)[0]))
Convert GPTQ/AWQ to Auto-Round
Most GPTQ/AWQ models can be converted to the AutoRound format for improved compatibility and support on Intel devices. Note that the quantization configuration will change if the model is serialized.
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoRoundConfig

model_name = "ybelkada/opt-125m-gptq-4bit"
quantization_config = AutoRoundConfig()
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="cpu", torch_dtype="auto", quantization_config=quantization_config
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

text = "There's a girl who likes adventures."
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50, do_sample=False)[0]))
AutoRound marks a meaningful step forward in post-training quantization for large language and vision-language models. By combining high accuracy, outstanding efficiency, and broad compatibility with popular models, devices, and export formats, AutoRound makes low-bit quantization both practical and powerful. Whether you are deploying LLMs at scale or experimenting with edge inference for VLMs, AutoRound offers the tools and flexibility you need to achieve top performance with minimal overhead. We invite you to try it out and join the growing community pushing the boundaries of efficient AI deployment.
Contributions to AutoRound are welcome and appreciated! Whether it's fixing bugs, improving documentation, adding new features, or suggesting improvements, your help is always valued.
If you encounter any issues with AutoRound, please open an issue in the AutoRound repository.
We thank the open-source low-precision libraries AutoGPTQ, AutoAWQ, GPTQModel, Triton, Marlin, and ExLlamaV2, whose CUDA kernels are used in AutoRound.