Reinforcement learning from human feedback (RLHF) has become the de facto final training step for LLMs such as GPT-4 and Claude, ensuring that the language model's outputs align with human expectations such as chattiness and safety. However, it brings some of the complexity of RL into NLP: we need to build a good reward function and train the model to estimate the value of states, while at the same time being careful that the model does not stray too far from the original and start producing gibberish instead of meaningful text. Such a process is complex, involves many intricate moving parts, and is not always easy to get right.
The recent paper Direct Preference Optimization by Rafailov, Sharma, Mitchell et al. proposes casting the RL-based objective used by existing methods into an objective that can be optimized directly via a simple binary cross-entropy loss, which greatly simplifies this process of refining an LLM.
In this blog post, we introduce the Direct Preference Optimization (DPO) method, now available in the TRL library, and show how to fine-tune the recent Llama v2 7B-parameter model on the Stack Exchange Preference dataset, which contains ranked answers to questions from various Stack Exchange portals.
DPO vs. PPO
In the traditional approach to optimizing over human-derived preferences via RL, the go-to method is to fit an auxiliary reward model and fine-tune the target model to maximize this reward via the machinery of RL. Intuitively, the reward model provides feedback to the model being optimized so that it produces high-reward samples more often and low-reward samples less often. At the same time, a frozen reference model is used to ensure that the generations do not deviate too much and continue to maintain generation diversity. This is typically done by adding a KL penalty with respect to the reference model to the full reward-maximization objective, which prevents the model from learning to cheat or exploit the reward model.
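Concretely, the KL-penalized objective amounts to maximizing the reward minus a scaled KL divergence between the policy and the reference model. A minimal sketch of how the penalized reward might be computed per sequence (function and variable names are hypothetical, not TRL's API):

```python
import torch

def kl_penalized_reward(reward, policy_logprobs, ref_logprobs, kl_coef=0.2):
    """Reward minus a KL penalty keeping the policy close to the reference.

    policy_logprobs / ref_logprobs: per-token log-probabilities of the
    sampled response under the fine-tuned policy and the frozen reference.
    """
    # per-sequence KL estimate on the sampled tokens
    kl = (policy_logprobs - ref_logprobs).sum(dim=-1)
    return reward - kl_coef * kl

# toy example: the policy has drifted slightly from the reference
reward = torch.tensor([1.0])
policy_lp = torch.tensor([[-0.5, -0.7]])
ref_lp = torch.tensor([[-0.6, -0.9]])
print(kl_penalized_reward(reward, policy_lp, ref_lp, kl_coef=0.2))
```

The larger the drift from the reference model, the more the effective reward is reduced, which is what discourages reward hacking.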
The DPO formulation bypasses the reward modeling step and directly optimizes the language model on preference data via a key insight: the analytical mapping from a reward function to the optimal RL policy lets the authors transform the RL loss over the reward and reference models into a loss over the reference model only. This mapping intuitively measures how well a given reward function aligns with the given preference data. DPO thus starts from the optimal solution to the RLHF objective and, via a change of variables, derives a loss that involves only the reference model.
This direct likelihood objective can therefore be optimized without requiring a reward model or performing the potentially fiddly RL-based optimization.
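The resulting objective is a simple binary cross-entropy over preference pairs. A minimal sketch of the per-pair DPO loss (the function name is hypothetical; TRL's DPOTrainer implements this internally):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Binary cross-entropy form of the DPO objective.

    Each argument is the summed log-probability of the chosen/rejected
    response under the policy or the frozen reference model.
    """
    # implicit rewards: beta-scaled log-ratios of policy vs. reference
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # maximize the margin between chosen and rejected implicit rewards
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# toy example: the policy already slightly prefers the chosen response
loss = dpo_loss(torch.tensor([-10.0]), torch.tensor([-12.0]),
                torch.tensor([-11.0]), torch.tensor([-11.0]))
print(loss)
```

The loss decreases as the implicit reward of the chosen response grows relative to that of the rejected one, which is exactly the preference signal the reward model would otherwise have provided.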
Training with TRL
As mentioned earlier, an RLHF pipeline typically consists of the following different parts:
1. a supervised fine-tuning (SFT) step
2. the process of annotating data with preference labels
3. training a reward model on the preference data
4. the RL optimization step
Although the TRL library comes with helpers for all of these parts, DPO training does away with the tasks of reward modeling and RL (steps 3 and 4) and directly optimizes the DPO objective on the preference-annotated data.
You still need to perform step 1, but instead of steps 3 and 4 you provide TRL's DPOTrainer with the preference data from step 2, which has a very specific format: a dictionary with the following three keys:
- prompt: the context prompt given to the model at inference time for text generation
- chosen: the preferred generated response for the corresponding prompt
- rejected: the response that is not preferred, or should not be the sampled response, for the given prompt
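For instance, a single entry in this format might look like the following (the answer texts here are made up for illustration):

```python
sample = {
    "prompt": "Question: How do I sort a list in Python?\n\nAnswer: ",
    "chosen": "Use the built-in sorted() function, e.g. sorted(my_list).",
    "rejected": "You have to write your own bubble sort by hand.",
}
print(sample["prompt"])
```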
As an example, for the Stack Exchange paired preference dataset, you can map the dataset entries to return the desired dictionary and drop all the original columns via the following helper:
```python
from datasets import load_dataset

def return_prompt_and_responses(samples) -> dict[str, list[str]]:
    return {
        "prompt": [
            "Question: " + question + "\n\nAnswer: "
            for question in samples["question"]
        ],
        "chosen": samples["response_j"],    # preferred response
        "rejected": samples["response_k"],  # rejected response
    }

dataset = load_dataset(
    "lvwerra/stack-exchange-paired",
    split="train",
    data_dir="data/rl",
)
original_columns = dataset.column_names

dataset = dataset.map(
    return_prompt_and_responses,
    batched=True,
    remove_columns=original_columns,
)
```
At a high level, the DPOTrainer requires the base model you want to optimize as well as a reference model: once the dataset is in the required format, the DPO loss is essentially a supervised loss that obtains an implicit reward via the reference model.
```python
dpo_trainer = DPOTrainer(
    model,                  # base model from the SFT pipeline
    model_ref,              # typically a copy of the SFT-trained base model
    beta=0.1,               # temperature hyperparameter of the DPO loss
    train_dataset=dataset,  # the dataset prepared above
    tokenizer=tokenizer,
    args=training_args,     # training arguments, e.g. batch size, lr
)
```
Here, the beta hyperparameter is the temperature parameter of the DPO loss, typically in the range 0.1 to 0.5. It controls how much attention we pay to the reference model, in the sense that the smaller the beta, the more we ignore the reference model. After initializing the trainer, we can train it on the dataset with the given training_args by simply calling:
```python
dpo_trainer.train()
```
Try Llama v2
The advantage of implementing the DPO trainer in TRL is that you can take advantage of all the extra features for training large LLMs that come with TRL and its dependent libraries, such as PEFT and Accelerate. Among other things, these libraries let you train Llama v2 models with the QLoRA technique provided by the bitsandbytes library.
Supervised fine-tuning
The process described above includes a supervised fine-tuning step, using QLoRA on the 7B Llama v2 model on the SFT split of the data, via TRL's SFTTrainer:
```python
# load the base model in 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    script_args.model_name,
    quantization_config=bnb_config,
    device_map={"": 0},
    trust_remote_code=True,
    use_auth_token=True,
)
base_model.config.use_cache = False

# add LoRA layers on top of the quantized base model
peft_config = LoraConfig(
    r=script_args.lora_r,
    lora_alpha=script_args.lora_alpha,
    lora_dropout=script_args.lora_dropout,
    target_modules=["q_proj", "v_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
...
trainer = SFTTrainer(
    model=base_model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    peft_config=peft_config,
    packing=True,
    max_seq_length=None,
    tokenizer=tokenizer,
    args=training_args,  # HF Trainer arguments
)
trainer.train()
```
DPO training
Once SFT has finished, you can save the resulting model and proceed to DPO training. As is typical, we use the model saved from the previous SFT step as both the base model and the reference model for DPO. These are then used to train the model with the DPO objective shown above on the Stack Exchange preference data. Since the model was trained with LoRA adapters, we load it via PEFT's AutoPeftModelForCausalLM helper:
```python
model = AutoPeftModelForCausalLM.from_pretrained(
    script_args.model_name_or_path,  # location of the saved SFT model
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    load_in_4bit=True,
    is_trainable=True,
)
model_ref = AutoPeftModelForCausalLM.from_pretrained(
    script_args.model_name_or_path,  # same model as the main one
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    load_in_4bit=True,
)
...
dpo_trainer = DPOTrainer(
    model,
    model_ref,
    args=training_args,
    beta=script_args.beta,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
)
dpo_trainer.train()
dpo_trainer.save_model()
```
As you can see, we load the model in a 4-bit configuration and train it with the QLoRA method via the peft_config argument. During training, the trainer also evaluates progress on the evaluation dataset and reports a number of key metrics, such as the implicit reward, which can be logged and displayed via WandB. You can then push the final trained model to the Hugging Face Hub.
Conclusion
The complete source code for the SFT and DPO training scripts can be found in the examples/stack_llama_2 directory, and the trained model with the merged adapters can be found on the HF Hub.
The WandB logs for the DPO training run can be found here. During training and evaluation, the DPOTrainer records the following reward metrics:
- rewards/chosen: the mean difference between the log probabilities of the policy model and the reference model for the chosen responses, scaled by beta
- rewards/rejected: the mean difference between the log probabilities of the policy model and the reference model for the rejected responses, scaled by beta
- rewards/accuracies: the mean frequency with which the chosen reward is higher than the corresponding rejected reward
- rewards/margins: the mean difference between the chosen reward and the corresponding rejected reward
Intuitively, during training we want the margins to increase and the accuracies to approach 1.0; in other words, the chosen reward should become higher than the rejected reward (equivalently, the margin should be greater than zero). These metrics can also be computed on an evaluation dataset.
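Under the same assumptions as before (summed log-probabilities per response, beta-scaled log-ratios as implicit rewards), these metrics could be sketched as follows; the function name is hypothetical and TRL computes them internally:

```python
import torch

def dpo_reward_metrics(policy_chosen_logps, policy_rejected_logps,
                       ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Compute the implicit-reward metrics logged during DPO training."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return {
        "rewards/chosen": chosen_rewards.mean().item(),
        "rewards/rejected": rejected_rewards.mean().item(),
        "rewards/accuracies": (chosen_rewards > rejected_rewards)
                              .float().mean().item(),
        "rewards/margins": (chosen_rewards - rejected_rewards)
                           .mean().item(),
    }

# toy batch of two preference pairs
metrics = dpo_reward_metrics(
    torch.tensor([-10.0, -8.0]), torch.tensor([-12.0, -7.5]),
    torch.tensor([-11.0, -8.0]), torch.tensor([-11.0, -8.0]),
)
print(metrics)
```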
We hope the release of this code makes it easier for readers to try out this method of aligning large language models on their own datasets; we can't wait to see what you build. And if you want to try the model yourself, you can do so at trl-lib/stack-llama.

