
Improving Hugging Face Training Efficiency Through Packing with Flash Attention 2

March 17, 2025

Training with packed instruction tuning examples (no padding) is now compatible with Flash Attention 2 in Hugging Face, thanks to a recent PR and the new DataCollatorWithFlattening.

It can improve training throughput by up to 2x while maintaining convergence quality. Read on for the details!

Introduction

Padding input sequences in mini-batches is the usual way to collate inputs during training. However, this introduces inefficiency because of the irrelevant padding tokens. Packing examples without padding, and using the token position information, is a more efficient alternative. However, previous packing implementations did not consider example boundaries when using Flash Attention 2, resulting in undesired cross-example attention that reduces quality and convergence.
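To see why boundary awareness matters, the following toy sketch (a simplified illustration that ignores causal masking, not the library's actual implementation) builds the block-diagonal attention mask that packed examples require, so that tokens cannot attend across example boundaries:

import torch

# lengths of three packed examples in one flattened sequence
seq_lens = [4, 2, 3]
total = sum(seq_lens)
mask = torch.zeros(total, total, dtype=torch.bool)
offset = 0
for n in seq_lens:
    # each example may attend only to positions within its own block
    mask[offset:offset + n, offset:offset + n] = True
    offset += n
print(mask.int())  # block-diagonal pattern; naive packing would effectively use an all-ones mask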

Hugging Face Transformers now addresses this with a new feature that maintains boundary awareness during packing, alongside the introduction of a new data collator, DataCollatorWithFlattening.

By selecting DataCollatorWithFlattening, Hugging Face Trainer users can now seamlessly concatenate sequences into a single tensor while accounting for sequence boundaries during Flash Attention 2 computations. This is achieved through flash_attn_varlen_func, which requires the cumulative sequence lengths in each mini-batch (cu_seqlens).
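To make the boundary bookkeeping concrete, here is a minimal sketch (an illustration only, not the actual Transformers internals) of how cumulative sequence lengths of the kind passed as cu_seqlens can be derived from packed position_ids, assuming each example's positions restart at 0:

import torch

# three packed examples of lengths 4, 2, and 3
position_ids = torch.tensor([0, 1, 2, 3, 0, 1, 0, 1, 2])
starts = torch.nonzero(position_ids == 0).flatten()                          # where each example begins: [0, 4, 6]
seq_lens = torch.diff(starts, append=torch.tensor([position_ids.numel()]))   # per-example lengths: [4, 2, 3]
cu_seqlens = torch.nn.functional.pad(torch.cumsum(seq_lens, dim=0), (1, 0))  # cumulative lengths: [0, 4, 6, 9]
print(cu_seqlens)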

The same feature is available to Hugging Face SFTTrainer users in the TRL library by setting a new flag, padding_free=True, when calling the data collator DataCollatorForCompletionOnlyLM.

Up to 2x throughput increase

Using this feature with the new DataCollatorWithFlattening, we see a significant improvement in training throughput. The following figure shows the throughput measured in tokens/second during training. In this example, the throughput is the per-GPU average over 8 A100-80GB GPUs, over one epoch of a 20K-sample random selection from each of two instruction tuning datasets, FLAN and OrcaMath.

FLAN has short sequences on average but a large variance in sequence length, so example lengths in each batch can vary widely. This means that padded FLAN batches can incur significant overhead from unused padding tokens. Training on the FLAN dataset therefore shows a large throughput benefit from the new DataCollatorWithFlattening: the models shown here, llama2-7B, mistral-7B, and granite-8B-code, all see a 2x throughput increase.

OrcaMath has longer examples and a lower variance in example length, so the improvement from packing is smaller. Our experiments show a 1.4x throughput increase when training with this form of packing on the OrcaMath dataset across these three models.

Memory

Memory usage also improves through packing with the new data collator. The following figure shows the peak memory usage of the same three models trained on the same two datasets. Peak memory is reduced by 20% on the FLAN dataset, which benefits considerably from packing.

The peak memory reduction is 6% on the OrcaMath dataset, with its more uniform example lengths.

Packing examples can harm training convergence when it reduces the number of optimization steps. This new feature, however, retains the mini-batches, and hence the same number of optimization steps as would be used with padded examples. Thus there is no impact on training convergence, as the following figure shows: the validation loss is identical for the same three models trained on the same two datasets.

(Figure: validation loss curves for the three models on both datasets.)

How it works

Consider a batch of data with a batch size of 4, where the four sequences are as follows:

(Figure: a batch of four padded input sequences.)

After concatenating the examples, the padding-free collator returns the input_ids, labels, and position_ids of each example, so the collator provides this batch of data:

(Figure: the flattened batch returned by the padding-free collator.)
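To make this concrete, the small example below runs DataCollatorWithFlattening on a toy batch containing only input_ids (the token ids are made up, and the exact label handling shown in the comments is our understanding of the collator; consult the documentation for details):

from transformers import DataCollatorWithFlattening

# a toy batch of three variable-length examples
features = [
    {"input_ids": [10, 11, 12, 13]},
    {"input_ids": [20, 21]},
    {"input_ids": [30, 31, 32]},
]
collator = DataCollatorWithFlattening()
batch = collator(features)
print(batch["input_ids"])     # one flattened tensor, no padding tokens
print(batch["position_ids"])  # positions restart at 0 at each example boundary
print(batch["labels"])        # labels with example boundaries handled to avoid cross-example loss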

The required modifications are lightweight and are limited to providing the position_ids to Flash Attention 2.

This relies, however, on the model exposing position_ids. At the time of writing, 14 models expose them and are supported by the solution: Llama 2 and 3, Mistral, Mixtral, Granite, DBRX, Falcon, Gemma, OLMo, Phi 1, 2, and 3, Qwen 2 and Qwen 2 MoE, StableLM, and StarCoder 2.
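As a rough sanity check (a heuristic under our own assumptions, not an official support list), you can inspect whether a model class's forward signature accepts position_ids; note that accepting position_ids does not by itself guarantee that the packed, padding-free path is wired through to Flash Attention 2:

import inspect
from transformers import MistralForCausalLM

# True if the forward signature accepts position_ids; cross-check against the supported-model list above
print("position_ids" in inspect.signature(MistralForCausalLM.forward).parameters)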

Get started

Reaping the benefits of packing with position_ids is easy.

If you are using the Hugging Face Trainer from Transformers, only two steps are required:

  • Instantiate the model with Flash Attention 2
  • Use the new DataCollatorWithFlattening

If you are using the Hugging Face SFTTrainer from TRL with DataCollatorForCompletionOnlyLM, the two required steps are:

  • Instantiate the model with Flash Attention 2
  • Set padding_free=True when calling DataCollatorForCompletionOnlyLM, for example: collator = DataCollatorForCompletionOnlyLM(response_template_ids, tokenizer=tokenizer, padding_free=True)

How to use it

For Trainer users, the following example shows how to use the new feature:

import torch

# load the model with Flash Attention 2
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
    "instructlab/merlinite-7b-lab", torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2"
)

# read the dataset as usual
from datasets import load_dataset
train_dataset = load_dataset("json", data_files="path/to/my/dataset")["train"]

# use the new DataCollatorWithFlattening
from transformers import DataCollatorWithFlattening
data_collator = DataCollatorWithFlattening()

# train
from transformers import TrainingArguments, Trainer
train_args = TrainingArguments(output_dir="/save/path")
trainer = Trainer(args=train_args, model=model, train_dataset=train_dataset, data_collator=data_collator)
trainer.train()

For TRL users, the following example shows how to use the new feature with SFTTrainer:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer, DataCollatorForCompletionOnlyLM

dataset = load_dataset("lucasmccabe-lmi/CodeAlpaca-20k", split="train")
model = AutoModelForCausalLM.from_pretrained(
    "instructlab/merlinite-7b-lab", torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2"
)
tokenizer = AutoTokenizer.from_pretrained("instructlab/merlinite-7b-lab")
tokenizer.pad_token = tokenizer.eos_token

def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example["instruction"])):
        output_texts.append(f"### Question: {example['instruction'][i]}\n ### Answer: {example['output'][i]}")
    return output_texts

response_template = " ### Answer:"
response_template_ids = tokenizer.encode(response_template, add_special_tokens=False)[2:]
# the new flag: padding_free=True enables the packed, padding-free path
collator = DataCollatorForCompletionOnlyLM(response_template_ids, tokenizer=tokenizer, padding_free=True)

trainer = SFTTrainer(
    model, train_dataset=dataset,
    args=SFTConfig(output_dir="./tmp", gradient_checkpointing=True, per_device_train_batch_size=8),
    formatting_func=formatting_prompts_func, data_collator=collator,
)
trainer.train()

Conclusion

Packing instruction tuning examples, instead of padding, is now fully compatible with Flash Attention 2, thanks to a recent PR and the new DataCollatorWithFlattening. The method works with models that expose position_ids. Benefits are seen in both throughput and peak memory usage during training, with no degradation of training convergence. Actual throughput and memory improvements depend on the model and on the distribution of example lengths in the training data. Training data with widely varying example lengths benefits the most, relative to padding, from the flattening data collator. SFTTrainer users in the TRL library get the same feature by setting the new padding_free=True flag when calling DataCollatorForCompletionOnlyLM.

For a more detailed analysis, see the paper at https://huggingface.co/papers/2407.09105.
