Training with packed instruction tuning examples (without padding) is now compatible with Flash Attention 2 in Hugging Face, thanks to a recent PR and the new DataCollatorWithFlattening.
It can provide up to a 2x improvement in training throughput while maintaining convergence quality. Read on for the details!
Introduction
Padding input sequences in mini-batches is the usual way to collate inputs during training. However, this introduces inefficiencies because of the irrelevant padding tokens. Packing examples without padding, and using the token position information, is a more efficient alternative. However, previous packing implementations did not consider example boundaries when using Flash Attention 2, resulting in undesired cross-example attention that reduces quality and convergence.
Hugging Face Transformers now addresses this with the introduction of a new data collator and a new feature that maintain boundary awareness during packing.
By selecting DataCollatorWithFlattening, Hugging Face Trainer users can now seamlessly concatenate sequences into a single tensor while accounting for sequence boundaries during Flash Attention 2 computations. This is achieved through flash_attn_varlen_func, which uses the cumulative sequence lengths in each mini-batch (cu_seqlens).
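To make the mechanism concrete, here is a minimal sketch, not the actual Transformers implementation, of how cu_seqlens can be recovered from position_ids that restart at zero at every example boundary; the helper name is ours and purely illustrative.

import torch

def cu_seqlens_from_position_ids(position_ids: torch.Tensor) -> torch.Tensor:
    # Illustrative helper: a new example starts wherever the position counter
    # resets to 0, so those indices (plus the total token count) give cu_seqlens.
    flat = position_ids.flatten()
    starts = torch.nonzero(flat == 0, as_tuple=True)[0]
    total = torch.tensor([flat.numel()], device=flat.device)
    return torch.cat([starts, total]).to(torch.int32)

# Two packed examples of lengths 3 and 2:
position_ids = torch.tensor([[0, 1, 2, 0, 1]])
print(cu_seqlens_from_position_ids(position_ids))  # tensor([0, 3, 5], dtype=torch.int32)

Boundary markers in this form are what flash_attn_varlen_func consumes, which is how attention is kept from crossing from one packed example into the next.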
Hugging Face SFTTrainer users in the TRL library get the same benefit by setting a new flag, padding_free=True, when calling the data collator DataCollatorForCompletionOnlyLM.
Up to 2x throughput increase
We see a significant improvement in training throughput using this feature with the new DataCollatorWithFlattening. The following figure shows the throughput measured in tokens/second during training. In this example, the throughput is the per-GPU average over 8 A100-80GB GPUs over one epoch of a 20K sample selected from each of two different instruction tuning datasets, FLAN and OrcaMath.
FLAN has short sequences on average but a large variance in sequence length, so example lengths in each batch can vary widely. This means that padded FLAN batches can incur significant overhead from unused padding tokens. Training on the FLAN dataset shows a substantial benefit from the new DataCollatorWithFlattening in terms of increased throughput: the models shown here, Llama2-7B, Mistral-7B, and Granite-8B-Code, all see a 2x throughput increase.
OrcaMath has longer examples and a lower variance in example length, so the improvement from packing is smaller. Our experiments show a 1.4x increase in throughput when training with this form of packing on the OrcaMath dataset across these three models.
Memory usage also improves through packing with the new DataCollatorWithFlattening. The following figure shows the peak memory usage of the same three models trained on the same two datasets. Peak memory is reduced by 20% on the FLAN dataset, which benefits considerably from packing.
The peak memory reduction is 6% on the OrcaMath dataset, with its more uniform example lengths.
Packing examples can harm training convergence when it reduces the number of optimization steps. This new feature, however, retains the mini-batches and hence the same number of optimization steps as would be used with padded examples. Thus, there is no impact on training convergence, as seen in the following figure, which shows identical validation loss for the same three models trained on the same two datasets.
How it works
Consider a batch of data with a batch size of 4, where the four sequences have different lengths.
After concatenating the examples, the padding-free collator returns the input_ids, labels, and position_ids of each example. Hence, for this batch of data, the collator provides a single concatenated sequence together with position_ids that mark where each example begins.
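As a concrete illustration, the sketch below runs DataCollatorWithFlattening on a toy batch of four tokenized examples; the token IDs are invented for the example, and the commented outputs show the expected flattened layout, with position_ids restarting at zero for every example.

from transformers import DataCollatorWithFlattening

# Four toy tokenized examples (token IDs invented for illustration).
features = [
    {"input_ids": [1, 20, 21, 22, 2]},
    {"input_ids": [1, 30, 31, 2]},
    {"input_ids": [1, 40, 41, 42, 43, 2]},
    {"input_ids": [1, 50, 2]},
]

collator = DataCollatorWithFlattening()
batch = collator(features)

# The four examples are concatenated into a single padding-free row, and the
# position_ids restart at 0 at every example boundary.
print(batch["input_ids"].shape)  # a single flattened sequence of 18 tokens
print(batch["position_ids"])     # [[0,1,2,3,4, 0,1,2,3, 0,1,2,3,4,5, 0,1,2]]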
The required changes are lightweight and limited to providing the position_ids to Flash Attention 2.
However, this relies on the model exposing position_ids. At the time of writing, 14 models expose them and are supported by the solution. Specifically, Llama 2 and 3, Mistral, Mixtral, Granite, DBRX, Falcon, Gemma, OLMo, Phi 1, 2, and 3, Qwen 2 and 2 MoE, StableLM, and StarCoder 2 are all supported.
Get started
Reaping the benefits of packing with position_ids is easy.
If you are using the Hugging Face Trainer from Transformers, only two steps are required:
1) Instantiate the model with Flash Attention 2
2) Use the new DataCollatorWithFlattening
If you are using the Hugging Face SFTTrainer from TRL with DataCollatorForCompletionOnlyLM, then the two required steps are:
1) Instantiate the model with Flash Attention 2
2) Set padding_free=True when calling DataCollatorForCompletionOnlyLM, as follows: collator = DataCollatorForCompletionOnlyLM(response_template_ids, tokenizer=tokenizer, padding_free=True)
How to use it
For Trainer users, the following example shows how to use the new feature:
# Using DataCollatorWithFlattening

import torch

# load the model with Flash Attention 2
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
    "instructlab/merlinite-7b-lab",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2"
)

# read the dataset as usual
from datasets import load_dataset
train_dataset = load_dataset("json", data_files="path/to/my/dataset")["train"]

# use the new DataCollatorWithFlattening
from transformers import DataCollatorWithFlattening
data_collator = DataCollatorWithFlattening()

# train
from transformers import TrainingArguments, Trainer
train_args = TrainingArguments(output_dir="/save/path")
trainer = Trainer(
    args=train_args,
    model=model,
    train_dataset=train_dataset,
    data_collator=data_collator
)
trainer.train()
For TRL users, the following example shows how to use the new feature with SFTTrainer:
# Using DataCollatorForCompletionOnlyLM with padding_free=True

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer, DataCollatorForCompletionOnlyLM

dataset = load_dataset("lucasmccabe-lmi/CodeAlpaca-20k", split="train")

# load the model with Flash Attention 2
model = AutoModelForCausalLM.from_pretrained(
    "instructlab/merlinite-7b-lab",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2"
)
tokenizer = AutoTokenizer.from_pretrained("instructlab/merlinite-7b-lab")
tokenizer.pad_token = tokenizer.eos_token

def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example['instruction'])):
        text = f"### Question: {example['instruction'][i]}\n ### Answer: {example['output'][i]}"
        output_texts.append(text)
    return output_texts

response_template = " ### Answer:"
response_template_ids = tokenizer.encode(response_template, add_special_tokens=False)[2:]

# set padding_free=True to pack the examples without padding
collator = DataCollatorForCompletionOnlyLM(response_template_ids, tokenizer=tokenizer, padding_free=True)

trainer = SFTTrainer(
    model,
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="./tmp",
        gradient_checkpointing=True,
        per_device_train_batch_size=8
    ),
    formatting_func=formatting_prompts_func,
    data_collator=collator,
)
trainer.train()
Conclusion
Packing instruction tuning examples, instead of padding, is now fully compatible with Flash Attention 2, thanks to a recent PR and the new DataCollatorWithFlattening. The method is compatible with models that use position_ids. Benefits are seen in throughput and peak memory usage during training, with no degradation in training convergence. Actual throughput and memory improvements depend on the model and the distribution of example lengths in the training data. Training with data that has a wide variation of example lengths sees the greatest benefit, relative to padding, from using DataCollatorWithFlattening. SFTTrainer users in the TRL library can use the same feature by setting the new padding_free=True flag when calling DataCollatorForCompletionOnlyLM.
For a more detailed analysis, see the paper at https://huggingface.co/papers/2407.09105.