Pulling your hair out because LLM fine-tuning is taking forever? In this post, we'll show you a lightweight library developed by the community that makes LLM fine-tuning blazingly fast!
Before diving into Unsloth, it may be helpful to read our QLoRA blog post, or to be familiar with LLM fine-tuning using the Hugging Face PEFT library.
Unsloth: 2x faster, 40% less memory, 0% accuracy degradation
Unsloth is a lightweight library for faster LLM fine-tuning, fully compatible with the Hugging Face ecosystem (Hub, Transformers, PEFT, TRL). The library is actively developed by the Unsloth team (Daniel and Michael) and the open-source community. It supports most NVIDIA GPUs, from the GTX 1070 all the way up to H100s, and can be used with the whole trainer suite from the TRL library (SFTTrainer, DPOTrainer, PPOTrainer). At the time of writing, Unsloth supports the Llama (CodeLlama, Yi, etc.) and Mistral architectures.
Unsloth works by overwriting some parts of the modeling code with optimized operations. By manually deriving backpropagation steps and rewriting all PyTorch modules into Triton kernels, Unsloth both reduces memory usage and makes fine-tuning faster. Importantly, accuracy degradation is 0% with respect to normal QLoRA, because no approximations are made in the optimized code.
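To give an intuition for what "manually deriving a backpropagation step" means, below is a toy sketch (not Unsloth's actual code) of a SiLU activation whose backward pass is written by hand as a custom torch.autograd.Function instead of being traced by autograd. Unsloth applies the same idea to entire transformer blocks and then implements the math as Triton kernels.

import torch

# Toy sketch: SiLU with a hand-derived backward pass (illustration only).
class HandwrittenSiLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * torch.sigmoid(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        s = torch.sigmoid(x)
        # d/dx [x * sigmoid(x)] = sigmoid(x) * (1 + x * (1 - sigmoid(x)))
        return grad_output * s * (1 + x * (1 - s))

x = torch.randn(4, requires_grad=True)
HandwrittenSiLU.apply(x).sum().backward()
print(x.grad)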
Benchmarking
1 A100 40GB          Dataset      Hugging Face   + Flash Attention 2   Unsloth   VRAM saved
Tiny Llama 1.1b      Alpaca       1x             1.55x                 2.74x     -57.8%
DPO with Zephyr      Ultra Chat   1x             1.24x                 1.88x     -11.6%

Free Colab T4        Dataset      Hugging Face   + PyTorch 2.1.1       Unsloth   VRAM saved
DPO with Zephyr      Ultra Chat   1x             1.09x                 1.55x     -18.6%
Unsloth was benchmarked across 59 runs using 4 datasets on Tesla T4 and A100 Google Colab instances. QLoRA was applied to all linear layers (attention and MLP) with a rank of 16, and gradient checkpointing was turned on. Testing against the latest Transformers release (4.36), which natively integrates SDPA, with PyTorch 2.1.1, Unsloth is up to 2.7x faster and uses up to 74% less memory. We also tested Unsloth on a free Google Colab instance (low RAM, 1 T4 GPU, PyTorch 2.1.0, CUDA 12.1). All 59 notebooks are provided for full reproducibility, and more details are listed in Unsloth's benchmarking details.
How do I use Unsloth?
Simply load the model with FastLanguageModel.from_pretrained! Currently, Unsloth supports Llama and Mistral type architectures (Yi, DeepSeek, TinyLlama, Llamafied Qwen); please open a GitHub issue if you want support for others! Also, on the latest Transformers main branch, you can now directly load pre-quantized 4-bit models! This makes downloading models 4x faster and reduces memory fragmentation by about 500MB, which lets you fit larger batches. We have a few pre-quantized models for your convenience, including unsloth/llama-2-7b-bnb-4bit, unsloth/llama-2-13b-bnb-4bit, unsloth/mistral-7b-bnb-4bit, and unsloth/codellama-34b-bnb-4bit.
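For example, with a recent Transformers release, loading one of these pre-quantized checkpoints with plain Transformers is just a regular from_pretrained call. A minimal sketch, assuming the bitsandbytes quantization settings shipped in the checkpoint's config are picked up automatically:

from transformers import AutoModelForCausalLM, AutoTokenizer

# The checkpoint already stores 4-bit weights and its quantization config,
# so no separate BitsAndBytesConfig is needed here.
model = AutoModelForCausalLM.from_pretrained(
    "unsloth/mistral-7b-bnb-4bit",
    device_map = "auto",
)
tokenizer = AutoTokenizer.from_pretrained("unsloth/mistral-7b-bnb-4bit")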
You will need to provide your intended maximum sequence length to FastLanguageModel.from_pretrained. Unsloth performs RoPE scaling internally, so larger maximum sequence lengths are automatically supported. Otherwise the API is pretty much the same as Transformers' from_pretrained, except that FastLanguageModel.from_pretrained also returns the model tokenizer for convenience.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-bnb-4bit",
    max_seq_length = 2048,
    load_in_4bit = True,
)
Once the model is loaded, use FastLanguageModel.get_peft_model to attach the adapter and perform QLoRA fine-tuning.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = True,
)
Once the adapter is attached, you can use the model directly within any class from the HF ecosystem, such as the SFTTrainer from TRL!
Unsloth + TRL Integration
To use Unsloth with the TRL library, simply pass the Unsloth model into SFTTrainer or DPOTrainer! The trained model is fully compatible with the Hugging Face ecosystem, so you can push the final model to the Hub and use Transformers for inference out of the box!
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
from unsloth import FastLanguageModel

max_seq_length = 2048 # Supports automatic RoPE scaling, so choose any number

# Get the dataset
dataset = load_dataset("imdb", split = "train")

# Load the model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = None, # None for auto detection
    load_in_4bit = True,
)

# Patch the model and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = True,
    random_state = 3407,
    max_seq_length = max_seq_length,
)

trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    tokenizer = tokenizer,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        max_steps = 60,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        output_dir = "output",
        optim = "adamw_8bit",
        seed = 3407,
    ),
)
trainer.train()
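To illustrate what "out of the box" looks like in practice, here is a hedged sketch (the repository name is a placeholder, not part of the original example) of pushing the trained adapter to the Hub and loading it back for inference with the standard PEFT and Transformers APIs.

# Hypothetical follow-up: push the trained LoRA adapter to the Hub
# ("your-username/mistral-7b-imdb-qlora" is a placeholder repo id).
model.push_to_hub("your-username/mistral-7b-imdb-qlora")
tokenizer.push_to_hub("your-username/mistral-7b-imdb-qlora")

# Load it back with plain PEFT/Transformers for inference.
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

model = AutoPeftModelForCausalLM.from_pretrained(
    "your-username/mistral-7b-imdb-qlora",
    load_in_4bit = True,
    device_map = "auto",
)
tokenizer = AutoTokenizer.from_pretrained("your-username/mistral-7b-imdb-qlora")

inputs = tokenizer("The movie was", return_tensors = "pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens = 32)
print(tokenizer.decode(outputs[0], skip_special_tokens = True))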
Reproducible notebooks
For those of you who want to try out Unsloth with the SFTTrainer on a free-tier Google Colab instance, we share fully reproducible notebooks below.
Here's an example of Llama 7b on a free Tesla T4 Colab.
Here's an example of Mistral 7b on a free Tesla T4 Colab.
Here's an example of CodeLlama 34b on an A100 Colab.
Here's an example of the Zephyr DPO replication on a T4 Colab.