Supplementary note
After consulting with the authors of the IPO paper, we discovered that the implementation of IPO in TRL was incorrect; in particular, the log-likelihoods of the completions need to be averaged instead of summed. We added a fix in this PR and re-ran the experiments. The results are now in line with the paper, with IPO on par with DPO and performing better than KTO in the paired preference setting. We have updated the post to reflect these new results.
tl;dr
We evaluate three promising methods for aligning language models without reinforcement learning (i.e. preference tuning) across a range of models and hyperparameter settings. In particular, we train with a variety of hyperparameters to evaluate direct preference optimization (DPO), identity preference optimization (IPO), and Kahneman-Tversky optimization (KTO).
Introduction
In this post, we perform an empirical evaluation of three promising LLM alignment algorithms: Direct Preference Optimization (DPO), Identity Preference Optimization (IPO), and Kahneman-Tversky Optimization (KTO). We conducted our experiments on two high-quality 7B LLMs that have undergone a supervised fine-tuning step but no preference alignment. We find that while some algorithms clearly outperform others, there are key hyperparameters that must be tuned to achieve the best results.
Alignment without reinforcement learning
Direct Preference Optimization (DPO) has emerged as a promising alternative for aligning large language models (LLMs) to human or AI preferences. Unlike traditional alignment methods based on reinforcement learning, DPO recasts the alignment formulation as a simple loss function that can be optimized directly on a dataset of preferences $\{(x, y_w, y_l)\}$, where $x$ is a prompt and $y_w, y_l$ are the preferred and dispreferred responses respectively.
Sample of a preference tuning dataset.
This makes DPO practical, and it has been successfully applied to train models such as Zephyr and Intel's NeuralChat.
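For reference, the DPO objective can be written as follows (this is the standard form from the DPO paper, where $\pi_\theta$ is the model being trained, $\pi_{\mathrm{ref}}$ is the frozen reference model, $\sigma$ is the logistic function, and $\beta$ is a scaling hyperparameter we return to below):

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
$$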
The success of DPO has prompted researchers to develop new loss functions that generalize the method in two main directions:
- Robustness: One drawback of DPO is its tendency to quickly overfit on the preference dataset. To avoid this, researchers at Google DeepMind introduced Identity Preference Optimization (IPO), which adds a regularization term to the DPO loss and enables one to train models to convergence without requiring tricks like early stopping. A sketch of the IPO objective is shown after this list.
- Dispensing with paired preference data altogether: Like most alignment methods, DPO requires a dataset of paired preferences $\{(x, y_w, y_l)\}$, where annotators label which response is better according to a set of criteria like helpfulness or harmfulness. In practice, creating these datasets is a time-consuming and costly endeavour. ContextualAI recently proposed an interesting alternative called Kahneman-Tversky Optimization (KTO), which defines the loss function entirely in terms of individual examples that have been labelled as "good" or "bad" (for example, the 👍 or 👎 icons one sees in chat UIs). These labels are much easier to acquire in practice, and KTO is a promising way to continually update chat models running in production environments.
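For intuition, here is a rough sketch of the IPO objective, written in the same notation as the DPO loss above (TRL's $\beta$ plays the role of the paper's regularization parameter $\tau$; the completion log-probabilities inside each term are the ones that should be averaged rather than summed, per the addendum at the top of this post):

$$
\mathcal{L}_{\mathrm{IPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
    \left[
      \left(
        \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
        - \frac{1}{2\beta}
      \right)^{2}
    \right]
$$

Unlike the DPO loss, which keeps pushing the log-ratio margin apart, this squared regression target pulls the margin toward a fixed value, which is what gives IPO its resistance to overfitting.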
At the same time, these methods come with hyperparameters, the most important being $\beta$, which controls how much to weight the preference of the reference model. With these alternatives now available in practitioners' arsenals through libraries like 🤗 TRL, a natural question is which of these methods and hyperparameters produce the best chat model?
This post aims to answer this question by performing an empirical analysis of the three methods. We will sweep over key hyperparameters like $\beta$, run the training procedure, and evaluate the performance of the resulting models via MT-Bench, a common benchmark for measuring chat model capabilities.
We provide open-source code to reproduce these results.
Let’s get started!
Links
Important links related to the analysis are as follows:
Experiment setup
There are two main ingredients to consider when performing alignment experiments: the model to optimize and the alignment dataset. To get more independent data points, we considered two models, OpenHermes-2.5-Mistral-7B and Zephyr-7b-beta-sft, and two alignment datasets: Intel's orca_dpo_pairs and the ultrafeedback-binarized dataset.
In the first experiment, we used OpenHermes-2.5-Mistral-7B, as it is one of the best 7B-parameter chat models that has not been subjected to any alignment techniques. We then used Intel's orca_dpo_pairs dataset, which consists of 13k prompts where the chosen response is generated by GPT-4 and the undesired response is generated by Llama-Chat 13b. This is the dataset behind NeuralChat and NeuralHermes-2.5-Mistral-7B. Since KTO doesn't require pairwise preferences per se, we simply treated the GPT-4 responses as "good" labels and the Llama-Chat 13b responses as "bad" ones. While the GPT-4 responses are likely to be preferred over those of Llama-Chat 13b, there may be some cases where Llama-Chat 13b produces a better response; we consider this to be a small minority of examples.
In the second experiment, we performed preference alignment on the Zephyr-7b-beta-sft model with the ultrafeedback-binarized dataset, which contains 66k prompts with pairs of chosen and rejected responses. This dataset was used to train the original Zephyr model, which at the time was the best-in-class 7B model on numerous automated benchmarks and human evaluations.
Experimental structure
The Alignment Handbook provides a simple way to configure a single experiment; these parameters are used to configure the run_dpo.py script.
```yaml
# Model arguments
model_name_or_path: teknium/OpenHermes-2.5-Mistral-7B
torch_dtype: null

# Data training arguments
dataset_mixer:
  HuggingFaceH4/orca_dpo_pairs: 1.0
dataset_splits:
- train_prefs
- test_prefs
preprocessing_num_workers: 12

# Training arguments
bf16: true
beta: 0.01
loss_type: sigmoid
do_eval: true
do_train: true
evaluation_strategy: steps
eval_steps: 100
gradient_accumulation_steps: 2
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
hub_model_id: HuggingFaceH4/openhermes-2.5-mistral-7b-dpo
hub_model_revision: v1.0
learning_rate: 5.0e-7
logging_steps: 10
lr_scheduler_type: cosine
max_prompt_length: 512
num_train_epochs: 1
optim: adamw_torch
output_dir: data/openhermes-2.5-mistral-7b-dpo-v1.0
per_device_train_batch_size: 8
per_device_eval_batch_size: 8
push_to_hub_revision: true
save_strategy: "steps"
save_steps: 100
save_total_limit: 1
seed: 42
warmup_ratio: 0.1
```
We created a similar base configuration file for the Zephyr experiments.
The chat template was automatically inferred from the base chat model, with OpenHermes-2.5 using the ChatML format and Zephyr using the H4 chat template. Alternatively, if you want to use your own chat format, the 🤗 tokenizers library now supports user-defined chat templates via Jinja-style strings:
```
"{% for message in messages %}\n{% if message['role'] == 'user' %}\n{{ '<|user|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'system' %}\n{{ '<|system|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'assistant' %}\n{{ '<|assistant|>\n' + message['content'] + eos_token }}\n{% endif %}\n{% if loop.last and add_generation_prompt %}\n{{ '<|assistant|>' }}\n{% endif %}\n{% endfor %}"
```
This formats conversations with `<|system|>`, `<|user|>`, and `<|assistant|>` turns, each terminated by the EOS token.
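As an illustrative sketch (not part of the original experiments), here is how such a template can be rendered with a transformers tokenizer; the model id and messages below are placeholder examples:

```python
# Illustrative example: rendering a chat template with a transformers tokenizer.
# The model id and messages are placeholders, not the experiment data.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

messages = [
    {"role": "system", "content": "You are a friendly chatbot."},
    {"role": "user", "content": "How do I align a language model without RL?"},
]

# With the Zephyr/H4 template above this produces roughly:
# <|system|>
# You are a friendly chatbot.</s>
# <|user|>
# How do I align a language model without RL?</s>
# <|assistant|>
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```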
Hyperparameter sweep
We trained the DPO, IPO, and KTO methods via the loss_type argument of TRL's DPOTrainer, with $\beta$ ranging over 0.01, 0.1, 0.2, ..., 0.9. We included 0.01 because we observed that some alignment algorithms are especially sensitive to this parameter. All experiments were trained for one epoch, and all other hyperparameters were kept the same across runs, including the random seed.
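To make the mapping concrete, below is a minimal sketch (not the handbook's run_dpo.py) of how these sweep axes correspond to DPOTrainer arguments. It assumes an older TRL release (circa 0.7.x) where beta and loss_type are passed to the trainer directly; newer releases move them into DPOConfig. It also assumes the dataset has already been preprocessed into plain-text prompt, chosen, and rejected columns, which the handbook handles via its own data pipeline.

```python
# Minimal sketch of the sweep axes (beta, loss_type) with TRL's DPOTrainer.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_id = "teknium/OpenHermes-2.5-Mistral-7B"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# In practice the handbook applies its own chat templating / preprocessing first.
train_dataset = load_dataset("HuggingFaceH4/orca_dpo_pairs", split="train_prefs")

training_args = TrainingArguments(
    output_dir="data/openhermes-2.5-mistral-7b-dpo-v1.0",
    num_train_epochs=1,                # every run is trained for one epoch
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=5.0e-7,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    bf16=True,
    logging_steps=10,
    seed=42,                           # kept fixed across all runs
)

trainer = DPOTrainer(
    model,
    ref_model=None,          # TRL keeps a frozen copy of `model` as the reference
    args=training_args,
    beta=0.01,               # swept over 0.01, 0.1, 0.2, ..., 0.9
    loss_type="sigmoid",     # "sigmoid" = DPO, "ipo" = IPO, "kto_pair" = paired KTO
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    max_prompt_length=512,
)
trainer.train()
```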
We then launched the sweep on the Hugging Face cluster using the base configurations above. #gpurich
```shell
# Define the base configs and the sweep axes
configs=("zephyr" "openhermes")
loss_types=("sigmoid" "kto_pair" "ipo")
betas=("0.01" "0.1" "0.2" "0.3" "0.4" "0.5" "0.6" "0.7" "0.8" "0.9")

for config in "${configs[@]}"; do
    for loss_type in "${loss_types[@]}"; do
        for beta in "${betas[@]}"; do
            # Determine the job name and model revision from the loss type and beta
            job_name="${config}_${loss_type}_beta_${beta}"
            model_revision="${loss_type}-${beta}"
            # Submit the Slurm job with the hyperparameter overrides
            sbatch --job-name=${job_name} recipes/launch.slurm dpo pref_align_scan config_$config deepspeed_zero3 \
            "--beta=${beta} --loss_type=${loss_type} --output_dir=data/$config-7b-align-scan-${loss_type}-beta-${beta} --hub_model_revision=${model_revision}"
        done
    done
done
```
Results
We evaluated all models with MT-Bench, a multi-turn benchmark that uses GPT-4 to judge model performance in eight different categories: writing, roleplay, reasoning, math, coding, extraction, STEM, and humanities. Although imperfect, MT-Bench is a good way to evaluate conversational LLMs.
Zephyr-7b-beta-sft
MT-Bench scores for the Zephyr model across different values of $\beta$.
For the Zephyr model, we observed that the best performance was achieved with the lowest $\beta$ value, 0.01. This is consistent across all three algorithms tested; an interesting follow-up experiment for the community would be a fine-grained scan in the range 0.0-0.2. While DPO achieves the highest MT-Bench score, we found that KTO (paired) achieves better results in all but one setting. IPO, despite having stronger theoretical guarantees, appears to be worse than the base model in all but one setting.
Break down of the best Zephyr model for each algorithm across MT-Bench categories.
We can break down the best results for each algorithm across the categories that MT-Bench evaluates to identify the strengths and weaknesses of these models. There is still a large area for improvement along the reasoning, coding, and math axes.
OpenHermes-7B-2.5
While the observations about each algorithm remain the same for OpenHermes, i.e. DPO > KTO > IPO, the sweet spot for $\beta$ varies wildly from algorithm to algorithm: the best choices of $\beta$ for DPO, KTO, and IPO are 0.6, 0.3, and 0.01 respectively.
MT-Bench scores for the OpenHermes model across different values of $\beta$.
OpenHermes-7b-2.5 is clearly a stronger base model, with only a 0.3 improvement in MT-Bench score after preference alignment.
Break down of the best OpenHermes model for each algorithm across MT-Bench categories.
Summary and insights
In this post, we have highlighted the importance of choosing the right set of hyperparameters when performing preference alignment. We have demonstrated empirically that DPO and IPO can achieve comparable results and can outperform KTO in the paired preference setting.
All the code and configuration files to replicate these results are available in the Alignment Handbook. The best-performing models and datasets can be found in this collection.
What’s next?
We will continue implementing new preference alignment algorithms in TRL and evaluating their performance. It seems, at least for the time being, that DPO is the most robust and best-performing LLM alignment algorithm. KTO remains an interesting development, since both DPO and IPO require paired preference data, whereas KTO can be applied to any dataset where responses are rated positively or negatively.
We look forward to the new tools and techniques that will be developed in 2024!