The integration of GaLore into large language model (LLM) training represents a significant advance in deep learning, especially in terms of memory efficiency and the democratization of AI research. By enabling the training of billion-parameter models on consumer-grade hardware, reducing the memory footprint of optimizer states, and leveraging advanced projection matrix techniques, GaLore opens new perspectives for researchers and practitioners with limited access to high-end computational resources.
Scaling LLMs on consumer-grade hardware
The ability to train models of up to 7 billion parameters, including models based on the Llama architecture, on consumer GPUs such as the NVIDIA RTX 4090 is groundbreaking. This is achieved by significantly reducing the memory requirements traditionally associated with optimizer states and gradients during training. The approach exploits the inherent low-rank structure of gradients in deep neural networks and applies projections that reduce the dimensionality of the data that must be stored and manipulated.
Memory efficiency in the optimizer state
The optimizer state of adaptive optimization algorithms such as Adam accounts for a significant portion of the memory footprint during model training. GaLore addresses this by projecting the gradients into a low-dimensional subspace before they are processed by the optimizer. This not only reduces the memory required to store these states, but also maintains the effectiveness of the optimization process.
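The sketch below is a minimal, simplified illustration of this idea, not the library's actual implementation: the gradient of a weight matrix is projected onto a low-rank subspace obtained from its SVD, Adam-style moments are kept only in that subspace, and the resulting update is projected back before being applied to the weights (bias correction and GaLore's scale factor are omitted for brevity; all shapes and values are assumptions for demonstration).

import torch

def galore_style_step(weight, grad, state, rank=128, lr=1e-3,
                      beta1=0.9, beta2=0.999, eps=1e-8):
    # Simplified single step of low-rank projected Adam (illustrative only).
    if "P" not in state:
        # Projection matrix: top-r left singular vectors of the current gradient.
        U, _, _ = torch.linalg.svd(grad, full_matrices=False)
        state["P"] = U[:, :rank]                       # (m, r)
        state["m"] = torch.zeros(rank, grad.shape[1])  # first moment, kept low-rank
        state["v"] = torch.zeros(rank, grad.shape[1])  # second moment, kept low-rank
    P = state["P"]

    low_rank_grad = P.T @ grad                         # (r, n) instead of (m, n)

    # Standard Adam moment updates, but on the projected gradient.
    state["m"].mul_(beta1).add_(low_rank_grad, alpha=1 - beta1)
    state["v"].mul_(beta2).addcmul_(low_rank_grad, low_rank_grad, value=1 - beta2)
    low_rank_update = state["m"] / (state["v"].sqrt() + eps)

    # Project the update back to the full space and apply it to the weights.
    weight -= lr * (P @ low_rank_update)

# Toy usage with random tensors, just to show the shapes involved.
weight = torch.randn(1024, 4096)
state = {}
for _ in range(3):
    grad = torch.randn(1024, 4096)   # stand-in for the real gradient
    galore_style_step(weight, grad, state)

In the actual optimizer, the projection matrix is also refreshed periodically; that mechanism is discussed in the subspace switching section below.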
The memory savings are substantial, with the authors reporting a reduction of more than 82.5% in the memory required to store optimizer states during training. This allows larger models to be trained within the same memory constraints, or larger batch sizes to be used. The savings become even more pronounced when GaLore is combined with an 8-bit precision optimizer.
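As a rough back-of-the-envelope illustration (the shapes, rank, and resulting numbers below are assumptions, not figures from the paper), the following snippet compares the memory Adam needs for its two moment tensors over a full weight matrix against the same moments kept in a rank-r projected space:

# Rough, illustrative arithmetic only: Adam keeps two moment tensors per weight matrix.
m, n, r = 4096, 11008, 256          # example weight shape and projection rank (assumed)
bytes_per_value = 4                  # fp32 optimizer states

full_state = 2 * m * n * bytes_per_value        # moments over the full (m, n) gradient
projected_state = 2 * r * n * bytes_per_value   # moments over the projected (r, n) gradient

print(f"full Adam state:      {full_state / 2**20:.1f} MiB")
print(f"projected Adam state: {projected_state / 2**20:.1f} MiB")
print(f"reduction:            {1 - projected_state / full_state:.1%}")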
Subspace switching and advanced projection techniques
A key element of GaLore's effectiveness is its dynamic subspace switching mechanism, which allows the model to move between different low-rank subspaces over the course of training. This ensures that the optimization is not confined to a limited portion of the parameter space and preserves the capacity for full-parameter learning. Deciding when and how to switch subspaces is crucial: the switching frequency is a balance between maintaining a consistent optimization trajectory and adapting to the evolving low-rank structure of the gradients.
The ability to dynamically adjust these projections in response to changes in the gradient structure is a powerful tool, giving finer control over the memory-versus-optimization trade-offs inherent in training large models.
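A minimal sketch of this mechanism is shown below, assuming a periodic refresh of the projection matrix every update_proj_gap steps (the hyperparameter name follows the GaLore paper; the loop itself and all values are illustrative, not the library's code):

import torch

# Illustrative stand-ins: a stream of gradients for one (m x n) weight matrix.
m, n, rank = 512, 512, 32
update_proj_gap = 200            # refresh the subspace every 200 steps (example value)
P = None                         # current projection matrix

for step in range(1000):
    grad = torch.randn(m, n)     # placeholder for the real gradient at this step

    # Periodically recompute the projection from the current gradient's SVD,
    # so the low-rank subspace tracks the evolving structure of the gradients.
    if step % update_proj_gap == 0:
        U, _, _ = torch.linalg.svd(grad, full_matrices=False)
        P = U[:, :rank]

    low_rank_grad = P.T @ grad   # optimizer state lives in this (rank x n) space
    # ... low-rank optimizer update and back-projection as sketched above ...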
Combining GaLore with 8-bit optimizers
The combination of GaLore and an 8-bit precision optimizer is synergistic, maximizing memory efficiency while maintaining the integrity and performance of the training process. The 8-bit optimizer reduces the memory footprint by quantizing the optimizer states. Used together with GaLore's projection mechanism, the result is a highly memory-efficient training regime that does not compromise model accuracy or convergence speed.
This combination is particularly effective in scenarios where memory is a critical bottleneck, such as training large models on consumer-grade hardware or working in otherwise memory-constrained environments. It makes it possible to use more complex models and larger datasets within the same hardware constraints, pushing the boundaries of what can be achieved with limited resources.
Implementation details
Integrating an 8-bit optimizer into large language model (LLM) training involves quantizing gradients, weights, and optimizer states into an 8-bit representation. This quantization significantly reduces the memory footprint and allows larger models or larger batch sizes to be trained within the same memory constraints. The integration involves several important steps, some of which would benefit greatly from native CUDA implementations for increased efficiency. Combining GaLore's projections with these quantization techniques and a special parameterization of the weight matrices opens up new possibilities for further integration, leading to additional reductions in memory usage. We are currently exploring this direction with the bitsandbytes library.
Overview of the 8-bit optimization algorithm
Gradient projection: GaLore uses a projection matrix to project the full-precision gradient into a low-rank subspace. This step reduces the dimensionality of the gradient before it is quantized into an 8-bit format.
Quantization: The projected gradient, the model weights, and the optimizer states (such as Adam's moving averages) are quantized from 32-bit floating point to an 8-bit integer representation. This involves scaling the floating-point values into the 8-bit range and rounding them to the nearest representable value.
Optimizer update: The model weights are updated using the 8-bit quantized gradient. This involves dequantizing the gradient back to floating point, applying the optimizer's update rules (such as Adam's moment updates and parameter adjustment), and then re-quantizing the updated optimizer states back to 8-bit for storage.
Dequantization and weight update: The 8-bit quantized weights are dequantized to a floating-point representation for computation, although they retain the limited range of values inherent to their quantized form. This step is necessary because standard operations in frameworks like PyTorch do not support 8-bit integer weights, and such integer weights cannot accumulate gradients. While this approach does not inherently improve accuracy, it makes gradient computation on quantized weights practical within the constraints of current deep learning libraries. Note that after dequantization and before applying the weight update, GaLore applies another projection that maps the low-rank update back to its original space.
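The snippet below is a simplified, self-contained sketch of the quantize/dequantize round trip described in these steps. Simple absmax scaling is assumed here for illustration; production 8-bit optimizers such as bitsandbytes use more sophisticated block-wise and dynamic quantization schemes.

import torch

def quantize_8bit(x: torch.Tensor):
    # Absmax quantization of a float tensor to int8, returning the scale.
    scale = x.abs().max().clamp(min=1e-12) / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_8bit(q: torch.Tensor, scale: torch.Tensor):
    # Map int8 values back to floating point using the stored scale.
    return q.to(torch.float32) * scale

# Round trip on a projected gradient (shapes and values are illustrative).
low_rank_grad = torch.randn(128, 4096)
q_grad, scale = quantize_8bit(low_rank_grad)
restored = dequantize_8bit(q_grad, scale)

print("max abs error:", (restored - low_rank_grad).abs().max().item())
print("memory ratio (int8 vs fp32):", q_grad.element_size() / low_rank_grad.element_size())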
Usage with Hugging Face transformers
To use the GaLore optimizers with the Hugging Face transformers library, first update to a version that supports them, either by installing the latest release with pip install transformers>=4.39.0 or by installing transformers from source.
Next, install the galore-torch library with pip install galore-torch. Below is a complete end-to-end example that uses transformers and trl to pre-train Mistral-7B on the IMDB dataset.
import torch
import datasets
import trl
from transformers import TrainingArguments, AutoConfig, AutoTokenizer, AutoModelForCausalLM

train_dataset = datasets.load_dataset("imdb", split="train")

args = TrainingArguments(
    output_dir="./test-galore",
    max_steps=100,
    per_device_train_batch_size=2,
    optim="galore_adamw",
    optim_target_modules=["attn", "mlp"],
)

model_id = "mistralai/Mistral-7B-v0.1"

config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_config(config).to(0)

trainer = trl.SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    dataset_text_field="text",
    max_seq_length=512,
)

trainer.train()
TrainingArguments: Simply pass a valid optim_target_modules (a single string, a regular expression, or a list of strings or regular expressions is supported) and a valid GaLore optimizer name such as galore_adamw, galore_adamw_8bit, or galore_adafactor.
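For example, the arguments below show an illustrative variant of the configuration above, selecting the target modules with regular expressions and switching to the 8-bit GaLore optimizer (the specific patterns and output directory are assumptions for demonstration):

from transformers import TrainingArguments

# Same training setup as above, but selecting modules via regular expressions
# and using the 8-bit GaLore variant to save additional optimizer memory.
args = TrainingArguments(
    output_dir="./test-galore-8bit",
    max_steps=100,
    per_device_train_batch_size=2,
    optim="galore_adamw_8bit",
    optim_target_modules=[r".*attn.*", r".*mlp.*"],
)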
Layer-wise updates
Another important feature worth mentioning is the layer-wise optimizer, i.e. updating the weights one layer at a time. Typically, the optimizer performs a single weight update for all layers after backpropagation, which requires keeping the entire weight gradient in memory. By adopting layer-wise weight updates, the memory footprint during training can be reduced further. Under the hood, this is implemented with PyTorch post-accumulation gradient hooks on the layers the user wants to update.
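Conceptually, this works roughly as in the sketch below (a simplified illustration, not the library's actual implementation), using PyTorch's per-parameter register_post_accumulate_grad_hook to run an optimizer step and free each gradient as soon as it is ready; a standard AdamW is used here as a stand-in for the GaLore optimizer.

import torch

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.Linear(64, 2))

# One small optimizer per parameter, stepped inside its hook
# (the real layer-wise GaLore would use its low-rank optimizer here).
optimizers = {p: torch.optim.AdamW([p], lr=1e-3) for p in model.parameters()}

def make_hook(param):
    def hook(*_):
        # Update this parameter as soon as its gradient has been accumulated,
        # then drop the gradient so it never has to be kept for all layers at once.
        optimizers[param].step()
        optimizers[param].zero_grad(set_to_none=True)
    return hook

for p in model.parameters():
    p.register_post_accumulate_grad_hook(make_hook(p))

# Training step: no global optimizer.step() is needed; updates happen in the hooks.
x, y = torch.randn(8, 64), torch.randint(0, 2, (8,))
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()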
To use this feature, simply append _layerwise to the optimizer name, e.g. galore_adamw_layerwise.
Conclusion
GaLore's innovative approach to exploiting the low-rank structure of gradients represents an important advance in memory-efficient training for LLMs. By enabling the training of billion-parameter models on consumer-grade hardware, reducing the optimizer state memory footprint through projection techniques, and enabling dynamic subspace switching, GaLore democratizes access to large-scale model training. Its compatibility with 8-bit precision optimizers further enhances its utility, providing a path to training larger and more complex models without the need for specialized computational resources. This opens up new possibilities for research and application in AI, and it is an exciting time for practitioners and researchers alike.
Resources
See the original paper. Twitter references: 1, 2, 3. The paper also draws a comparison between GaLore and ReLoRA, which may be of interest to some readers. Readers with unanswered questions, or who want to discuss the results constructively, are welcome to join the author's Slack community. For anyone interested in further releases along these lines, follow Jiawei Zhao and Titus von Koeller (for information about the latest bitsandbytes releases), as well as Younes Belkada.