Liger GRPO meets TRL

By versatileai | May 26, 2025
TL;DR: Liger supercharges the Group Relative Policy Optimization (GRPO) trainer in TRL, reducing memory usage by 40% with zero degradation in model quality. It also adds support for FSDP and PEFT, making it easier than ever to scale GRPO across multiple GPUs.

Motivation

Fine-tuning language models with Reinforcement Learning (RL) is a key step in a model's training lifecycle, steering it toward more complex, desirable behaviors than can be reached with typical supervised fine-tuning. RL has traditionally been applied to large language models (LLMs) using the Proximal Policy Optimization (PPO) algorithm. This approach, usually part of Reinforcement Learning from Human Feedback (RLHF), uses a separately trained reward model to guide the fine-tuning of the primary model.

However, RLHF with PPO is a very resource-hungry approach: PPO needs to keep multiple models in memory (policy, value, reward, and reference models), and it also requires fine-tuning a reward model and running several iterations over the base model. The success of RLHF further depends on the reward model's ability to reliably distinguish desired from undesired behavior.

Group Relative Policy Optimization (GRPO) has recently become highly popular, alongside DeepSeek's R1 model. GRPO relies on a verifiable reward function, avoiding the pre-trained reward and value models used in RLHF: the correctness of the model's output is checked in closed form, with no need for an external reward model. This has led to significant improvements over PPO when fine-tuning models in domains where outputs can be easily verified.

The following diagram compares the GRPO and PPO training pipelines (ref: Figure 4 of DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models):

[Figure: PPO vs. GRPO training pipelines]

That being said, RL training still eats up a lot of GPU memory, so there is plenty of room for optimization. In this blog post, we discuss the optimizations we recently added to TRL that reduce peak memory usage by 40% during GRPO training, and we explain how to scale GRPO to multiple GPUs and nodes without losing performance or accuracy.

How the Liger kernel reduces GRPO memory

We extended the Liger chunked-loss approach to the GRPO loss, which eliminates the need to store the full logits in memory at each training step. Computing the logits in the model's output head contributes significantly to peak memory usage, especially with large vocabularies, long sequence lengths, or large batch sizes. To address this, we chunk the input to the lm_head across the batch dimension and process one chunk at a time.

However, a naive implementation would not actually reduce peak memory, because all the logits would still have to be kept in GPU memory for the backward pass. To avoid this, we compute the gradients of each loss chunk (with respect to the input chunk and the lm_head weights) during the forward pass and accumulate them as we iterate over the chunks.
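
The snippet below is a minimal, self-contained sketch of that idea in plain PyTorch. It is not the Liger implementation (which is a fused Triton kernel operating on the GRPO loss rather than the simple cross-entropy used here), and all names are illustrative; it only shows how running each chunk's backward pass inside the forward loop keeps the full logits from ever being materialized.

import torch
import torch.nn.functional as F

def chunked_loss_and_grads(hidden, lm_head_weight, labels, chunk_size=4):
    # hidden: [batch, seq, dim], lm_head_weight: [vocab, dim], labels: [batch, seq]
    total_tokens = labels.numel()
    total_loss = 0.0
    grad_hidden = torch.zeros_like(hidden)
    grad_weight = torch.zeros_like(lm_head_weight)
    for start in range(0, hidden.size(0), chunk_size):
        # Only this chunk's logits are ever materialized.
        h = hidden[start:start + chunk_size].detach().requires_grad_(True)
        w = lm_head_weight.detach().requires_grad_(True)
        logits = h @ w.T
        loss = F.cross_entropy(
            logits.flatten(0, 1),
            labels[start:start + chunk_size].flatten(),
            reduction="sum",
        ) / total_tokens
        # The chunk's backward pass runs inside the forward loop, so its
        # logits can be freed before the next chunk is processed.
        loss.backward()
        grad_hidden[start:start + chunk_size] = h.grad
        grad_weight += w.grad
        total_loss += loss.item()
    return total_loss, grad_hidden, grad_weight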

A visualization of this optimization (ref):

[Figure: Liger chunked loss]

Plug-and-Play Integration with TRL

We recently integrated Liger GRPO into TRL in PR #3184, so you can simply set use_liger_loss to True in GRPOConfig to enjoy the memory savings.

Heads up: these features are not in the latest TRL release yet, so for now you will need to install TRL from source:

pip install "trl[liger] @ git+https://github.com/huggingface/trl.git"

And you can use it like this:

from trl import GRPOConfig, GRPOTrainer
from datasets import load_dataset

train_dataset = load_dataset("trl-lib/tldr", split="train")
training_args = GRPOConfig(output_dir="Qwen3-0.6B-GRPO", use_liger_loss=True)

def reward_len(completions, **kwargs):
    return [-abs(20 - len(completion)) for completion in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B-Instruct",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()

Benchmarks

We ran a number of GRPO experiments, with and without the Liger GRPO loss, to see how the two compare. The policy model is Qwen3-0.6B, and we varied the batch size. All experiments were run on the GSM8K dataset using its reward function.
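
For context, a verifiable GSM8K-style reward simply extracts the final number from each completion and compares it to the reference answer. The exact reward function used in these runs may differ; the version below is only an illustrative sketch, assuming a dataset column named answer that holds the gold final answer and completions that are plain strings.

import re

def gsm8k_reward(completions, answer, **kwargs):
    # Reward 1.0 when the last number in the completion matches the gold
    # answer, otherwise 0.0.
    rewards = []
    for completion, gold in zip(completions, answer):
        numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
        predicted = numbers[-1] if numbers else None
        rewards.append(1.0 if predicted == str(gold).strip() else 0.0)
    return rewards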

Below are plots of peak memory usage versus batch size for both FP32 and BF16 training. As expected, the memory savings improve at larger batch sizes, since we chunk along the batch dimension: as the batch size grows, the Liger chunked loss uses up to 40% less memory than the regular (non-Liger) version.

Quick note: we currently only support FP32, but we are working on open-sourcing BF16 support for Liger GRPO in TRL. The BF16 results shown here come from an internal patch we are testing.

[Figure: peak memory vs. batch size, FP32]

[Figure: peak memory vs. batch size, BF16]

We also verified that the Liger loss is virtually lossless in terms of accuracy. As the plot below shows, the reward over the course of training stays roughly the same as with the standard TRL implementation.

[Figure: reward vs. training step]

Scaling even more with FSDP and PEFT

We also added FSDP and PEFT support to the Liger GRPO loss in PR #3260 and PR #3355, allowing users to easily scale their experiments across multiple GPUs or nodes. PEFT techniques such as LoRA and QLoRA reduce the number of trainable parameters by tuning only small adapter weights on top of the original model, which cuts memory pressure significantly because gradients, activations, and optimizer states no longer need to be held for the full model. In addition, PEFT in GRPO avoids loading a separate reference model during training: simply disabling the LoRA adapters recovers the original, unmodified model.
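
As a rough sketch (not the exact configuration from our runs), enabling PEFT amounts to passing a LoraConfig to the trainer, while FSDP sharding is selected through your Accelerate config; the LoRA hyperparameters below are illustrative.

from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

def reward_len(completions, **kwargs):
    return [-abs(20 - len(completion)) for completion in completions]

dataset = load_dataset("trl-lib/tldr", split="train")
training_args = GRPOConfig(output_dir="Qwen3-0.6B-GRPO-LoRA", use_liger_loss=True)
trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
    # Only the LoRA adapter weights are trained; disabling the adapters
    # recovers the reference model, so no separate copy is needed.
    peft_config=LoraConfig(r=16, lora_alpha=32, target_modules="all-linear"),
)
trainer.train()

Launched with accelerate launch and an Accelerate config that enables FSDP, the same script shards the model across GPUs.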

Below is a plot from multi-GPU GRPO training with FSDP and PEFT, comparing the maximum possible training batch size with and without the Liger loss across different Qwen3 model sizes. We found that Liger let us increase the batch size by roughly 1.5x to 1.8x.

[Figure: maximum batch size vs. model size with PEFT]

Scaling further with vLLM

The Liger loss can also be combined with TRL's integrated vLLM server to speed up text generation during training. This significantly accelerates the collection of rollout data with minimal overhead and provides a seamless, integrated experience.

Here’s how to set it up:

Start the vLLM server: first, start the vLLM server, which handles generation requests from the training script. Open a terminal and run:

CUDA_VISIBLE_DEVICES=1 trl vllm-serve --model "Qwen/Qwen3-0.6B"

Note: setting CUDA_VISIBLE_DEVICES=1 runs the vLLM server on a specific GPU (here, GPU 1), leaving the other GPUs free for training.

Configure and run the training script: next, modify the training script to use the vLLM server. The key change is setting use_vllm=True in GRPOConfig.

from trl import GRPOConfig, GRPOTrainer
from datasets import load_dataset

def reward_len(completions, **kwargs):
    return [-abs(20 - len(completion)) for completion in completions]

dataset = load_dataset("trl-lib/tldr", split="train[:1%]")
training_args = GRPOConfig(
    output_dir="Qwen3-0.6B-GRPO",
    use_liger_loss=True,
    use_vllm=True,
    logging_steps=10,
)
trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()

Launch training: finally, run the training script with accelerate launch (or plain python if you are not using Accelerate for multi-GPU/distributed training), targeting a GPU other than the one occupied by the vLLM server.

CUDA_VISIBLE_DEVICES=0 accelerate launch train.py

(This runs training on GPU 0, assuming the script is named train.py.)

By following these steps, you can leverage vLLM for faster generation turnaround during GRPO training with the Liger loss.

Conclusion

Liger GRPO is now integrated into TRL, and with FSDP and PEFT support, fine-tuning language models with GRPO is more memory-efficient and scalable than ever. We encourage the community to try out these new features and share feedback to help further improve RL training for LLMs.
