Quantization is a technique that reduces the computational and memory costs of evaluating deep learning models by using low-precision data types such as 8-bit integers (int8) instead of the usual 32-bit floating point (float32) to represent weights and activations.
Reducing the number of bits means the resulting model requires less memory storage, which is crucial for deploying large language models on consumer devices. It also enables specific optimizations for lower bit-width data types, such as int8 or float8 matrix multiplications on CUDA devices.
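As a rough, Quanto-independent illustration of the memory savings, the sketch below compares the storage of the same weight matrix in float32 and int8 (the sizes are just PyTorch element sizes, nothing specific to Quanto):

import torch

# Same 4096 x 4096 weight matrix stored in float32 vs int8.
weights_fp32 = torch.randn(4096, 4096, dtype=torch.float32)
weights_int8 = torch.zeros(4096, 4096, dtype=torch.int8)

print(weights_fp32.nelement() * weights_fp32.element_size() / 1e6, "MB")  # ~67 MB
print(weights_int8.nelement() * weights_int8.element_size() / 1e6, "MB")  # ~17 MB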
Many open-source libraries are available to quantize PyTorch deep learning models, each providing very powerful features, yet often restricted to specific model configurations and devices.
They are also often based on the same design principles but are unfortunately not compatible with one another.
Today, we are excited to introduce Quanto, a PyTorch quantization backend for Optimum.
It has been designed with versatility and simplicity in mind:
- all features are available in eager mode (works with non-traceable models),
- quantized models can be placed on any device (including CUDA and MPS),
- quantization and dequantization stubs are inserted automatically,
- quantized functional operations are inserted automatically,
- quantized modules are inserted automatically (see the list of supported modules),
- serialization is compatible with PyTorch weight_only and 🤗 Safetensors,
- accelerated matrix multiplications on CUDA devices (int8-int8, fp16-int4, bf16-int8, bf16-int4),
- supports int2, int4, int8 and float8 weights,
- supports int8 and float8 activations.
While recent quantization methods seem to focus on large language models (LLMs), Quanto intends to provide extremely simple quantization primitives for simple quantization schemes (linear quantization, per-group quantization) that are adaptable across any modality.
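To give an idea of what such a primitive looks like, here is a minimal sketch of per-group linear quantization, independent of Quanto's actual internals (the function names and the group size of 128 are purely illustrative):

import torch

def quantize_per_group(x: torch.Tensor, group_size: int = 128):
    # One symmetric int8 scale per group of `group_size` consecutive values.
    grouped = x.reshape(-1, group_size)
    scales = grouped.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp(torch.round(grouped / scales), -127, 127).to(torch.int8)
    return q, scales

def dequantize_per_group(q: torch.Tensor, scales: torch.Tensor, shape) -> torch.Tensor:
    # Map the int8 values back to float using the per-group scales.
    return (q.to(torch.float32) * scales).reshape(shape)

x = torch.randn(256, 256)
q, scales = quantize_per_group(x)
x_hat = dequantize_per_group(q, scales, x.shape)
print((x - x_hat).abs().max())  # small reconstruction error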
Quantization Workflow
Quanto is available as a pip package.

pip install optimum-quanto
A typical quantization workflow consists of the following steps:
1. Quantization
The first step is to transform a standard float model into a dynamically quantized model.
from optimum.quanto import quantize, qint8

quantize(model, weights=qint8, activations=qint8)
At this stage, only the model inference is changed and the weights are dynamically quantized.
2. Calibration (optional if activations are not quantized)
Quanto supports a calibration mode that records the activation ranges while passing representative samples through the quantized model.
from optimum.quanto import Calibration

with Calibration(momentum=0.9):
    model(samples)
This automatically activates the quantization of the activations in the quantized modules.
3. Tune, aka Quantization-Aware-Training (optional)
If the performance of the model degrades too much, you can tune it for a few epochs to recover the performance of the float model.
import torch

model.train()
for batch_idx, (data, target) in enumerate(train_loader):
    data, target = data.to(device), target.to(device)
    optimizer.zero_grad()
    output = model(data).dequantize()
    loss = torch.nn.functional.nll_loss(output, target)
    loss.backward()
    optimizer.step()
4. Freeze integer weights
When freezing a model, its float weights are replaced by quantized weights.

from optimum.quanto import freeze

freeze(model)
5. Serialize quantized models
The weights of a quantized model can be serialized to a state_dict and saved to a file. Both pickle and safetensors (recommended) are supported.

from safetensors.torch import save_file

save_file(model.state_dict(), 'model.safetensors')
In order to reload these weights, you also need to save the quantization map of the quantized model.
import json

from optimum.quanto import quantization_map

with open('quantization_map.json', 'w') as f:
    json.dump(quantization_map(model), f)
6. Reload a quantized model
A serialized quantized model can be reloaded from a state_dict and a quantization_map using the requantize helper. Note that you need to first instantiate an empty model.
import json

import torch
from safetensors.torch import load_file
from optimum.quanto import requantize

state_dict = load_file('model.safetensors')
with open('quantization_map.json', 'r') as f:
    quantization_map = json.load(f)

# Create an empty model from your modeling code and requantize it
with torch.device('meta'):
    new_model = ...
requantize(new_model, state_dict, quantization_map, device=torch.device('cuda'))
Check out an instantiated example of the quantization workflow. You can also check out this notebook, where we show how to quantize a BLOOM model with Quanto.
Performance
Below are two graphs evaluating the accuracy of different quantized configurations of meta-llama/Meta-Llama-3.1-8B.
Note: the first bar in each group always corresponds to the non-quantized model.
These results are obtained without applying post-training optimization algorithms such as HQQ or AWQ.
The graph below gives the latency per token measured on an NVIDIA A10 GPU.

Stay tuned for updated results, as we are constantly improving Quanto with optimizers and optimized kernels.
See Quanto Benchmarks for detailed results for the various model architectures and configurations.
Transformers Integration
Quanto is seamlessly integrated in the Hugging Face transformers library. You can quantize any model by passing a QuantoConfig to from_pretrained!
Currently, you must use the latest version of Accelerate to ensure that your integration is fully compatible.
from transformers import AutoModelForCausalLM, AutoTokenizer, QuantoConfig

model_id = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
quantization_config = QuantoConfig(weights="int8")
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config)
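For illustration, assuming the snippet above ran successfully, the quantized model can then be used with the regular generation API:

inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = quantized_model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))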
You can quantize the weights to int8, float8, int4, or int2 simply by passing the correct argument to QuantoConfig; the activations can be either int8 or float8. For float8, you need hardware compatible with float8 precision, otherwise Quanto will silently upcast the weights and activations to torch.float32 or torch.float16 (depending on the original data type of the model) when running the matmul (only when the weight is quantized). Trying to use float8 on an MPS device will currently raise an error in torch.
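As a sketch of the configurations described above (the variable names are just illustrative, and the activations argument is assumed to be left unset unless you actually want quantized activations):

# Weight-only int4 quantization.
int4_config = QuantoConfig(weights="int4")

# int8 weights and int8 activations.
int8_config = QuantoConfig(weights="int8", activations="int8")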
Quanto is device agnostic, meaning you can quantize and run your model regardless of whether you are on CPU/GPU/MPS (Apple Silicon).
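A minimal sketch of this device-agnostic behaviour, using the lower-level optimum.quanto API and skipping backends that are not available on the current machine (the choice of facebook/opt-125m is just for illustration):

import torch
from transformers import AutoModelForCausalLM
from optimum.quanto import quantize, freeze, qint8

for device in ("cpu", "cuda", "mps"):
    # Skip backends that are not available on this machine.
    if device == "cuda" and not torch.cuda.is_available():
        continue
    if device == "mps" and not torch.backends.mps.is_available():
        continue
    model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m").to(device)
    quantize(model, weights=qint8)
    freeze(model)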
Quanto is also torch.compile friendly. You can quantize a model with Quanto and call torch.compile on it to compile it for faster generation. This feature might not work out of the box if dynamic quantization is involved (i.e. Quantization-Aware Training or quantized activations enabled). Make sure to keep activations=None when creating your QuantoConfig in case you use the transformers integration.
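A minimal sketch of this combination, weight-only quantization followed by compilation (model choice is illustrative):

import torch
from transformers import AutoModelForCausalLM, QuantoConfig

quantized_model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",
    quantization_config=QuantoConfig(weights="int8"),  # weight-only: activations stay in float
)
compiled_model = torch.compile(quantized_model)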
You can also quantize any model, regardless of the modality, using Quanto! Here we show how to quantize the openai/whisper-large-v3 model in int8 using Quanto.
import torch
from transformers import AutoModelForSpeechSeq2Seq, QuantoConfig

model_id = "openai/whisper-large-v3"
quanto_config = QuantoConfig(weights="int8")

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="cuda",
    quantization_config=quanto_config
)
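For a quick illustration of running the quantized Whisper model above, here is a sketch using a dummy 16 kHz audio array as a placeholder input (replace it with real audio; the silence input is purely for demonstration):

import numpy as np
import torch
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(model_id)
audio = np.zeros(16000, dtype=np.float32)  # one second of silence as a placeholder
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to("cuda", dtype=torch.float16)
generated_ids = model.generate(input_features)
print(processor.batch_decode(generated_ids, skip_special_tokens=True))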
Check out this notebook for a complete tutorial on how to properly use Quanto with the transformers integration!
Contributing to Quanto
Contributions to Quanto are highly welcome, especially in the following areas:
- optimized kernels for quantization operations targeting specific devices,
- post-training quantization optimizers to recover the accuracy lost during quantization,
- helper classes for transformers or diffusers models.