Liger GRPO meets TRL

By versatileai | May 26, 2025
TL;DR: Liger supercharges the Group Relative Policy Optimization (GRPO) trainer in TRL, reducing memory usage by 40% with zero degradation in model quality. It also adds support for FSDP and PEFT, making it easier than ever to scale GRPO across multiple GPUs.

Motivation

Fine-tuning language models with Reinforcement Learning (RL) is a key step in a model's training lifecycle, steering it toward more complex, desirable behaviors than can be reached with typical supervised fine-tuning. RL has traditionally been applied to large language models (LLMs) using the Proximal Policy Optimization (PPO) algorithm. This approach, usually part of Reinforcement Learning from Human Feedback (RLHF), uses a separately trained reward model to guide the fine-tuning of the primary model.

However, RLHF with PPO is a very resource-hungry approach: PPO needs to keep multiple models in memory (policy, value, reward, and reference models), and it also requires fine-tuning a reward model and running several iterations over the base model. The success of RLHF further depends on the reward model's ability to reliably distinguish desired from undesired behavior.

Group Relative Policy Optimization (GRPO) has recently become highly popular, alongside DeepSeek's R1 model. GRPO relies on a verifiable reward function, avoiding the pre-trained reward and value models used in RLHF: the correctness of the model's output is checked in closed form, with no need for an external reward model. This has led to significant improvements over PPO when fine-tuning models in domains where outputs can be easily verified.

The following diagram compares the GRPO and PPO training pipelines (ref: Figure 4 of DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models):

[Figure: PPO vs. GRPO training pipelines]

That being said, RL training still eats up a lot of GPU memory, so there is plenty of room for optimization. In this blog post, we discuss the optimizations we recently added to TRL that reduce peak memory usage by 40% during GRPO training, and we explain how to scale GRPO to multiple GPUs and nodes without losing performance or accuracy.

How the Liger kernel reduces GRPO memory

We extended the Liger chunked-loss approach to the GRPO loss, which eliminates the need to store the full logits in memory at each training step. Computing the logits in the model's output head contributes significantly to peak memory usage, especially with large vocabularies, long sequence lengths, or large batch sizes. To address this, we chunk the input to the lm_head across the batch dimension and process one chunk at a time.

However, a naive implementation would not actually reduce peak memory, because all the logits would still have to be kept in GPU memory for the backward pass. To avoid this, we compute the gradients of each loss chunk (with respect to the input chunk and the lm_head weights) during the forward pass and accumulate them as we iterate over the chunks.
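
The snippet below is a minimal, self-contained sketch of that idea in plain PyTorch. It is not the Liger implementation (which is a fused Triton kernel operating on the GRPO loss rather than the simple cross-entropy used here), and all names are illustrative; it only shows how running each chunk's backward pass inside the forward loop keeps the full logits from ever being materialized.

import torch
import torch.nn.functional as F

def chunked_loss_and_grads(hidden, lm_head_weight, labels, chunk_size=4):
    # hidden: [batch, seq, dim], lm_head_weight: [vocab, dim], labels: [batch, seq]
    total_tokens = labels.numel()
    total_loss = 0.0
    grad_hidden = torch.zeros_like(hidden)
    grad_weight = torch.zeros_like(lm_head_weight)
    for start in range(0, hidden.size(0), chunk_size):
        # Only this chunk's logits are ever materialized.
        h = hidden[start:start + chunk_size].detach().requires_grad_(True)
        w = lm_head_weight.detach().requires_grad_(True)
        logits = h @ w.T
        loss = F.cross_entropy(
            logits.flatten(0, 1),
            labels[start:start + chunk_size].flatten(),
            reduction="sum",
        ) / total_tokens
        # The chunk's backward pass runs inside the forward loop, so its
        # logits can be freed before the next chunk is processed.
        loss.backward()
        grad_hidden[start:start + chunk_size] = h.grad
        grad_weight += w.grad
        total_loss += loss.item()
    return total_loss, grad_hidden, grad_weight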

A visualization of this optimization (ref):

[Figure: Liger chunked loss]

Plug-and-Play Integration with TRL

We recently integrated Liger GRPO into TRL in PR #3184, so you can simply set use_liger_loss to True in GRPOConfig to enjoy the memory savings.

Heads up: these features are not in the latest TRL release yet, so for now you will need to install TRL from source:

pip install "trl[liger] @ git+https://github.com/huggingface/trl.git"

And you can use it like this:

from trl import GRPOConfig, GRPOTrainer
from datasets import load_dataset

train_dataset = load_dataset("trl-lib/tldr", split="train")
training_args = GRPOConfig(output_dir="Qwen3-0.6B-GRPO", use_liger_loss=True)

def reward_len(completions, **kwargs):
    return [-abs(20 - len(completion)) for completion in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B-Instruct",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()

Benchmarks

We ran a number of GRPO experiments, with and without the Liger GRPO loss, to see how the two compare. The policy model is Qwen3-0.6B, and we varied the batch size. All experiments were run on the GSM8K dataset using its reward function.
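
For context, a verifiable GSM8K-style reward simply extracts the final number from each completion and compares it to the reference answer. The exact reward function used in these runs may differ; the version below is only an illustrative sketch, assuming a dataset column named answer that holds the gold final answer and completions that are plain strings.

import re

def gsm8k_reward(completions, answer, **kwargs):
    # Reward 1.0 when the last number in the completion matches the gold
    # answer, otherwise 0.0.
    rewards = []
    for completion, gold in zip(completions, answer):
        numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
        predicted = numbers[-1] if numbers else None
        rewards.append(1.0 if predicted == str(gold).strip() else 0.0)
    return rewards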

Below are plots of peak memory usage versus batch size for both FP32 and BF16 training. As expected, the memory savings improve at larger batch sizes, since we chunk along the batch dimension: as the batch size grows, the Liger chunked loss uses up to 40% less memory than the regular (non-Liger) version.

Quick note: we currently only support FP32, but we are working on open-sourcing BF16 support for Liger GRPO in TRL. The BF16 results shown here come from an internal patch we are testing.

[Figure: peak memory vs. batch size, FP32]

[Figure: peak memory vs. batch size, BF16]

We also verified that the Liger loss is virtually lossless in terms of accuracy. As the plot below shows, the reward over the course of training stays roughly the same as with the standard TRL implementation.

[Figure: reward vs. training step]

Scaling even more with FSDP and PEFT

We also added FSDP and PEFT support to the Liger GRPO loss in PR #3260 and PR #3355, allowing users to easily scale their experiments across multiple GPUs or nodes. PEFT techniques such as LoRA and QLoRA reduce the number of trainable parameters by tuning only small adapter weights on top of the original model, which cuts memory pressure significantly because gradients, activations, and optimizer states no longer need to be held for the full model. In addition, PEFT in GRPO avoids loading a separate reference model during training: simply disabling the LoRA adapters recovers the original, unmodified model.
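
As a rough sketch (not the exact configuration from our runs), enabling PEFT amounts to passing a LoraConfig to the trainer, while FSDP sharding is selected through your Accelerate config; the LoRA hyperparameters below are illustrative.

from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

def reward_len(completions, **kwargs):
    return [-abs(20 - len(completion)) for completion in completions]

dataset = load_dataset("trl-lib/tldr", split="train")
training_args = GRPOConfig(output_dir="Qwen3-0.6B-GRPO-LoRA", use_liger_loss=True)
trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
    # Only the LoRA adapter weights are trained; disabling the adapters
    # recovers the reference model, so no separate copy is needed.
    peft_config=LoraConfig(r=16, lora_alpha=32, target_modules="all-linear"),
)
trainer.train()

Launched with accelerate launch and an Accelerate config that enables FSDP, the same script shards the model across GPUs.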

Below is a plot from multi-GPU GRPO training with FSDP and PEFT, comparing the maximum possible training batch size with and without the Liger loss across different Qwen3 model sizes. We found that Liger let us increase the batch size by roughly 1.5x to 1.8x.

[Figure: maximum batch size vs. model size with PEFT]

Scaling further with vLLM

The Liger loss can also be combined with TRL's integrated vLLM server to speed up text generation during training. This significantly accelerates the collection of rollout data with minimal overhead and provides a seamless, integrated experience.

Here’s how to set it up:

Start the vLLM server: first, start the vLLM server, which handles generation requests from the training script. Open a terminal and run:

CUDA_VISIBLE_DEVICES=1 trl vllm-serve --model "Qwen/Qwen3-0.6B"

Note: setting CUDA_VISIBLE_DEVICES=1 runs the vLLM server on a specific GPU (here, GPU 1), leaving the other GPUs free for training.

Configure and run the training script: next, modify the training script to use the vLLM server. The key change is setting use_vllm=True in GRPOConfig.

from trl import GRPOConfig, GRPOTrainer
from datasets import load_dataset

def reward_len(completions, **kwargs):
    return [-abs(20 - len(completion)) for completion in completions]

dataset = load_dataset("trl-lib/tldr", split="train[:1%]")
training_args = GRPOConfig(
    output_dir="Qwen3-0.6B-GRPO",
    use_liger_loss=True,
    use_vllm=True,
    logging_steps=10,
)
trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()

Launch training: finally, run the training script with accelerate launch (or plain python if you are not using Accelerate for multi-GPU/distributed training), targeting a GPU other than the one occupied by the vLLM server.

CUDA_VISIBLE_DEVICES=0 accelerate launch train.py

(This runs training on GPU 0, assuming the script is named train.py.)

By following these steps, you can leverage vLLM for faster generation turnaround during GRPO training with the Liger loss.

Conclusion

Liger GRPO is now integrated into TRL, and with FSDP and PEFT support, fine-tuning language models with GRPO is more memory-efficient and scalable than ever. We encourage the community to try out these new features and share feedback to help further improve RL training for LLMs.
