Preference Tuning LLMs with Direct Preference Optimization Methods

By versatileai · August 5, 2025

Supplementary note

After consulting with the authors of the IPO paper, we discovered that the implementation of IPO in TRL was incorrect. In particular, the log-likelihoods of the completions in the loss need to be averaged over tokens instead of summed. We added a fix in this PR and reran the experiments. The results are now in line with the paper, with IPO on par with DPO and performing better than KTO in the paired-preference setting. We have updated the post to reflect these new results.

tl;dr

We evaluate three promising methods for aligning language models without reinforcement learning (or preference tuning) across a number of models and hyperparameter settings. In particular, we train with a range of hyperparameters and evaluate: Direct Preference Optimization (DPO), Identity Preference Optimization (IPO), and Kahneman-Tversky Optimization (KTO).

Introduction

In this post, we perform an empirical evaluation of three promising LLM alignment algorithms: Direct Preference Optimization (DPO), Identity Preference Optimization (IPO), and Kahneman-Tversky Optimization (KTO). We conducted our experiments on two high-quality 7B LLMs that have undergone a supervised fine-tuning step but no preference alignment. We find that while some algorithms clearly outperform others, there are key hyperparameters that must be tuned to achieve the best results.

Alignment without reinforcement learning

Direct Preference Optimization (DPO) has emerged as a promising alternative for aligning large language models (LLMs) to human or AI preferences. Unlike traditional alignment methods based on reinforcement learning, DPO recasts the alignment task as a simple loss function that can be optimized directly on a dataset of preferences {(x, y_w, y_l)}, where x is a prompt and y_w, y_l are the preferred and dispreferred responses.

A sample from a preference-tuning dataset.
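
To make this concrete, here is a minimal sketch of the DPO loss in PyTorch, assuming you have already computed the log-probabilities of each completion under the policy and under a frozen reference model; the function and variable names below are illustrative, not part of any library:

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Sketch of the DPO loss over a batch of preference pairs.

    Each argument is a 1-D tensor of log-probabilities log pi(y|x) of the
    chosen/rejected completions under the policy or the reference model.
    """
    # How much more (or less) likely each completion is under the policy
    # than under the reference model
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps

    # DPO widens the margin between chosen and rejected log-ratios,
    # scaled by beta and squashed through a sigmoid
    margin = chosen_logratios - rejected_logratios
    return -F.logsigmoid(beta * margin).mean()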

This makes DPO simple to use in practice, and it has been successfully applied to train models such as Zephyr and Intel's NeuralChat.

The success of DPO has prompted researchers to develop new loss functions that generalize the method in two main directions:

Robustness: One drawback of DPO is that it tends to quickly overfit the preference dataset. To avoid this, researchers at Google DeepMind introduced Identity Preference Optimization (IPO), which adds a regularization term to the DPO loss and allows the model to be trained to convergence without tricks like early stopping.

Dispensing with paired preference data altogether: Like most alignment methods, DPO requires a dataset of paired preferences {(x, y_w, y_l)}, where annotators label which response is better according to criteria such as helpfulness or harmfulness. In practice, creating these datasets is a time-consuming and costly endeavour. ContextualAI recently proposed an interesting alternative called Kahneman-Tversky Optimization (KTO), which defines the loss function entirely in terms of individual examples that have been labeled "good" or "bad" (for example, the 👍 or 👎 icons one sees in chat UIs). These labels are much easier to obtain in practice, and KTO is a promising way to continually update chat models running in production environments.
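
For contrast with the DPO loss sketched above, here is how the IPO objective looks on the same quantities, following the formulation in the IPO paper (with β playing the role of the paper's τ, as in TRL); again, the names are only illustrative:

def ipo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Sketch of the IPO loss over a batch of preference pairs.

    Per the supplementary note above, the per-completion log-probabilities
    should be averaged over tokens (not summed) before being passed in.
    """
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    margin = chosen_logratios - rejected_logratios

    # Instead of pushing the margin to infinity, IPO regresses it towards
    # 1/(2*beta); this is the regularization that guards against over-fitting
    return ((margin - 1.0 / (2.0 * beta)) ** 2).mean()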

At the same time, these methods come with hyperparameters, the most important of which is β, which controls how much weight is given to the preferences of the reference model. With these alternatives now available to practitioners through libraries like 🤗 TRL, the natural question is which of these methods and hyperparameters produce the best chat model.

This post aims to answer that question by performing an empirical analysis of the three methods. We sweep key hyperparameters such as β, run the training procedure, and evaluate the resulting models on MT-Bench, a common benchmark for measuring chat model capabilities.

We provide open-source code to reproduce these results.

Let’s get started!

Links

Important links related to the analysis are as follows:

Experiment setup

There are two main ingredients to consider when performing alignment experiments: the model to optimize and the alignment dataset. To get more independent data points, we examined two models, OpenHermes-2.5-Mistral-7B and Zephyr-7b-beta-sft, and two alignment datasets, Intel's orca_dpo_pairs and the ultrafeedback-binarized dataset.

For the first experiment, we used OpenHermes-2.5-Mistral-7B, because it is one of the best 7B-parameter chat models that has not been subjected to any alignment techniques. We then used Intel's orca_dpo_pairs dataset, which consists of 13k prompts where the chosen response is generated by GPT-4 and the rejected response is generated by Llama-Chat 13b. This is the dataset behind NeuralChat and NeuralHermes-2.5-Mistral-7B. Since KTO doesn't require pairwise preferences per se, we simply treat the GPT-4 responses as "good" labels and the Llama-Chat 13b responses as "bad" labels. While GPT-4's responses are likely to be preferred over those of Llama-Chat 13b, there may be cases where Llama-Chat 13b produces the better response; we consider this to be a small minority of the examples.
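
As a rough illustration of that relabelling, the snippet below loads the dataset with 🤗 Datasets; the column names follow the public dataset card and may differ in other versions:

from datasets import load_dataset

# Intel's orca_dpo_pairs: ~13k prompts where "chosen" is a GPT-4 response
# and "rejected" comes from Llama-Chat 13b
ds = load_dataset("Intel/orca_dpo_pairs", split="train")

sample = ds[0]
print(sample["question"])  # the prompt
print(sample["chosen"])    # GPT-4 response -> treated as the "good" label for KTO
print(sample["rejected"])  # Llama-Chat 13b response -> treated as the "bad" label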

For the second experiment, we performed preference alignment on the Zephyr-7b-beta-sft model with the ultrafeedback-binarized dataset, which contains 66k prompts with pairs of chosen and rejected responses. This dataset was used to train the original Zephyr model, which at the time was the best-in-class 7B model on numerous automated benchmarks and human evaluations.

Configuring the experiments

The Alignment Handbook provides a simple way to configure a single experiment; the parameters below configure the run_dpo.py script.

# Model arguments
model_name_or_path: teknium/OpenHermes-2.5-Mistral-7B
torch_dtype: null

# Data training arguments
dataset_mixer:
  HuggingFaceH4/orca_dpo_pairs: 1.0
dataset_splits:
- train_prefs
- test_prefs
preprocessing_num_workers: 12

# Training arguments
bf16: true
beta: 0.01
loss_type: sigmoid
do_eval: true
do_train: true
evaluation_strategy: steps
eval_steps: 100
gradient_accumulation_steps: 2
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
hub_model_id: HuggingFaceH4/openhermes-2.5-mistral-7b-dpo
hub_model_revision: v1.0

learning_rate: 5.0e-7
logging_steps: 10
lr_scheduler_type: cosine
max_prompt_length: 512
num_train_epochs: 1
optim: adamw_torch
output_dir: data/openhermes-2.5-mistral-7b-dpo-v1.0
per_device_train_batch_size: 8
per_device_eval_batch_size: 8
push_to_hub_revision: true
save_strategy: "steps"
save_steps: 100
save_total_limit: 1
seed: 42
warmup_ratio: 0.1

We created a similar base configuration file for the Zephyr experiments.

The chat template was automatically inferred from the base chat model, with OpenHermes-2.5 using the ChatML format and Zephyr using the H4 chat template. Alternatively, if you want to use your own chat format, the 🤗 tokenizers library now supports user-defined chat templates via Jinja-format strings:

"{% for message in messages %}\n{% if message['role'] == 'user' %}\n{{ '<|user|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'system' %}\n{{ '<|system|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'assistant' %}\n{{ '<|assistant|>\n' + message['content'] + eos_token }}\n{% endif %}\n{% if loop.last and add_generation_prompt %}\n{{ '<|assistant|>' }}\n{% endif %}\n{% endfor %}"

This wraps each turn with its role tag (<|system|>, <|user|>, or <|assistant|>), followed by the message content and the EOS token.
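
For instance, a custom template can be attached to a tokenizer and applied to a conversation roughly as follows (the model ID and messages here are only illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

# A custom Jinja-format string can be assigned directly; here we rely on the
# template that already ships with the tokenizer.
# tokenizer.chat_template = "{% for message in messages %}..."

messages = [
    {"role": "system", "content": "You are a friendly chatbot."},
    {"role": "user", "content": "Which alignment method should I try first?"},
]

# Render the conversation into the model's chat format as a plain string
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)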

Hyperparameter sweep

We trained the DPO, IPO, and KTO methods via the loss_type argument of TRL's DPOTrainer, sweeping β over 0.01, 0.1, 0.2, ..., 0.9. We included 0.01 because we observed that some alignment algorithms are especially sensitive to this parameter. All experiments were trained for one epoch, and all other hyperparameters were kept the same across runs, including the random seed.
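
The sweep itself is driven by the Alignment Handbook config above and the Slurm launch script below, but at the TRL level each run boils down to roughly the following sketch (dataset and argument values are illustrative, and the DPOTrainer signature assumed here is the one from TRL versions contemporary with this post, where beta and loss_type are constructor arguments):

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_id = "teknium/OpenHermes-2.5-Mistral-7B"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# A paired preference dataset; DPOTrainer expects plain-text "prompt",
# "chosen" and "rejected" columns, so the chat template needs to be
# applied first (the handbook scripts handle this preprocessing).
train_dataset = load_dataset("HuggingFaceH4/orca_dpo_pairs", split="train_prefs")

training_args = TrainingArguments(
    output_dir="data/openhermes-2.5-mistral-7b-dpo-beta-0.01",
    per_device_train_batch_size=8,
    learning_rate=5.0e-7,
    num_train_epochs=1,
    bf16=True,
)

trainer = DPOTrainer(
    model,
    ref_model=None,        # TRL builds a frozen reference copy of the model
    args=training_args,
    beta=0.01,             # one point of the 0.01 ... 0.9 sweep
    loss_type="sigmoid",   # "sigmoid" = DPO, "ipo" = IPO, "kto_pair" = paired KTO
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    max_prompt_length=512,
    max_length=1024,
)
trainer.train()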

We then launched the sweep on the Hugging Face cluster using the base configurations above. #GPURich

#!/bin/bash
# Define the configurations, loss types, and beta values to sweep over
configs=("zephyr" "openhermes")
loss_types=("sigmoid" "kto_pair" "ipo")
betas=("0.01" "0.1" "0.2" "0.3" "0.4" "0.5" "0.6" "0.7" "0.8" "0.9")

# Launch one Slurm job per (config, loss_type, beta) combination
for config in "${configs[@]}"; do
    for loss_type in "${loss_types[@]}"; do
        for beta in "${betas[@]}"; do
            # Name the job and model revision after the loss type and beta
            job_name="${config}_${loss_type}_beta_${beta}"
            model_revision="${loss_type}-${beta}"

            # Submit the job
            sbatch --job-name=${job_name} recipes/launch.slurm dpo pref_align_scan config_$config deepspeed_zero3 \
            "--beta=${beta} --loss_type=${loss_type} --output_dir=data/$config-7b-align-scan-${loss_type}-beta-${beta} --hub_model_revision=${model_revision}"
        done
    done
done

Results

All models were evaluated with MT-Bench, a multi-turn benchmark that uses GPT-4 to judge model performance in eight different categories: writing, roleplay, reasoning, math, coding, extraction, STEM, and humanities. Although imperfect, MT-Bench is a good way to evaluate conversational LLMs.

Zephyr-7b-beta-sft

MT-Bench scores for the Zephyr model across different values of β.

For the Zephyr model, we observed that the best performance was achieved with the lowest β value, 0.01. This is consistent across all three algorithms tested; an interesting follow-up experiment for the community would be a fine-grained scan in the range 0.0-0.2. While DPO can achieve the highest MT-Bench score, we found that KTO (paired) achieves better results in all but one setting. IPO, despite having stronger theoretical guarantees, appears to be worse than the base model in all but one setting.

Breakdown of the best Zephyr models for each algorithm across the MT-Bench categories.

We can break down the best results for each algorithm across the categories that MT-Bench evaluates to identify the strengths and weaknesses of these models. There is still a large margin for improvement on the reasoning, coding, and math axes.

OpenHermes-7B-2.5

While the observations about each algorithm remain the same with OpenHermes, that is, DPO > KTO > IPO, the sweet spot for β varies wildly from algorithm to algorithm: the best choices of β for DPO, KTO, and IPO are 0.6, 0.3, and 0.01 respectively.

MT-Bench scores for the OpenHermes model across different values of β.

OpenHermes-7b-2.5 is clearly a stronger base model, with only a 0.3 improvement in MT-Bench score after preference alignment.

Breakdown of the best OpenHermes models for each algorithm across the MT-Bench categories.

Summary and insights

In this post, we have highlighted the importance of choosing the right set of hyperparameters when performing preference alignment. We have demonstrated empirically that DPO and IPO can achieve comparable results, and that both can outperform KTO in the paired-preference setting.

All the code and configuration files needed to replicate these results are now available in the Alignment Handbook, along with a collection of the best-performing models and datasets.

What’s next?

We will continue implementing new preference alignment algorithms in TRL and evaluating their performance. It seems, at least for the time being, that DPO is the most robust and best-performing LLM alignment algorithm. KTO remains an interesting development, because both DPO and IPO require paired preference data, whereas KTO can be applied to any dataset where responses are rated positively or negatively.

We look forward to the new tools and techniques that will be developed in 2024!
