Diffusion models (e.g., DALL-E 2, Stable Diffusion) are a class of generative models that have been widely successful, most notably at generating photorealistic images. However, the images these models produce are not always aligned with human preferences or human intentions. Thus arises the alignment problem: how do we ensure that a model's outputs match human preferences such as "quality," or match intentions that are hard to express through a prompt? This is where reinforcement learning comes into the picture.
In the world of large language models (LLMs), reinforcement learning (RL) has proven to be a highly effective tool for aligning models with human preferences. It is one of the main ingredients behind the strong performance of systems like ChatGPT. More precisely, RL is a central piece of Reinforcement Learning from Human Feedback (RLHF), and it is a key part of what makes ChatGPT chat like a human.
In Training Diffusion Models with Reinforcement Learning, Black et al. show how to use RL to fine-tune diffusion models with respect to an objective function via a method named Denoising Diffusion Policy Optimization (DDPO).
In this blog post, we briefly explain why DDPO came about, how it works, and how it can be incorporated into an RLHF workflow. We then switch gears and show how to apply DDPO to your models with the newly integrated DDPOTrainer from the TRL library, and discuss our findings from running DDPO on Stable Diffusion.
Benefits of DDPO
DDPO is not the only working answer to the question of how to fine-tune a diffusion model using RL.
Before diving in, there are two key points to keep in mind when it comes to understanding the advantages of one RL solution over another:
Computational efficiency is key: the more complicated the data distribution, the higher the computational cost. Approximations can be liberating, but because approximations are not the real thing, the associated errors stack up.
Before DDPO, Reward-Weighted Regression (RWR) was the established method for fine-tuning diffusion models using reinforcement learning. RWR reuses the denoising loss function of the diffusion model, along with training data sampled from the model itself and a per-sample loss weighting that depends on the reward associated with the final sample. This algorithm ignores the intermediate denoising steps/samples. While this works, two things should be noted:
Optimizing by weighing the associated loss, which is a maximum likelihood objective, is an approximate optimization; moreover, the associated loss is not an exact maximum likelihood objective but an approximation derived from a reweighed variational bound.
These two levels of approximation have a significant impact on both performance and the ability to handle complex objectives.
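To make the RWR idea concrete, here is a minimal sketch of a reward-weighted loss, assuming per-sample denoising losses and rewards are already available as tensors. The function name and the exponential weighting scheme are illustrative, not the exact formulation from prior work:

```python
import torch

def rwr_weighted_loss(per_sample_denoising_loss, rewards, temperature=1.0):
    # Reweight the standard diffusion denoising loss by a per-sample weight
    # derived from the reward of the corresponding final sample.
    # Intermediate denoising steps play no role here, only the final reward.
    weights = torch.exp(rewards / temperature)  # higher reward -> larger weight
    weights = weights / weights.sum()           # normalize over the batch
    return (weights * per_sample_denoising_loss).sum()
```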
DDPO uses this method as a starting point. Rather than viewing the denoising process as a single step that focuses only on the final sample, DDPO frames the whole denoising process as a multi-step Markov Decision Process (MDP) where the reward is received at the very end. This formulation, in addition to using a fixed sampler, guarantees that the agent policy is an isotropic Gaussian, as opposed to an arbitrarily complicated distribution. So instead of using the approximate likelihood of the final sample (the path RWR takes), here we have the exact likelihood of each denoising step, which is extremely easy to compute:

$$\ell(\mu, \sigma^2; x) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2$$
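As a quick illustration, the formula above is straightforward to evaluate in code. The following is a minimal sketch (the function name is ours) that computes the exact log-likelihood of a sample under an isotropic Gaussian, the form each denoising step takes under a fixed sampler:

```python
import math
import torch

def isotropic_gaussian_log_likelihood(x, mu, sigma_sq):
    # Evaluates l(mu, sigma^2; x) from the formula above for a sample x
    # under an isotropic Gaussian N(mu, sigma^2 * I).
    n = x.numel()
    return (
        -0.5 * n * math.log(2 * math.pi)
        - 0.5 * n * torch.log(sigma_sq)
        - (0.5 / sigma_sq) * ((x - mu) ** 2).sum()
    )

# Example: log-likelihood of a 4x64x64 latent under a predicted mean.
x = torch.randn(4, 64, 64)
mu = torch.randn(4, 64, 64)
print(isotropic_gaussian_log_likelihood(x, mu, torch.tensor(0.5)))
```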
If you want to learn more about DDPO, we recommend checking out the original paper and accompanying blog posts.
A summary of the DDPO algorithm
Given the continuous nature of the denoising process and the considerations that follow, the tool of choice for tackling the optimization problem is a policy gradient method, specifically Proximal Policy Optimization (PPO). The whole DDPO algorithm is pretty much the same as PPO; the heavily customized portion is the trajectory collection part of PPO.
Here is a diagram to summarise the flow.
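To complement the diagram, here is a self-contained toy illustration (dummy tensors, illustrative names; not the TRL implementation) of the PPO clipped objective that DDPO applies to the per-step denoising log-probabilities, with the final reward's advantage broadcast across every step of the trajectory:

```python
import torch

torch.manual_seed(0)
num_steps, batch_size = 50, 3                        # denoising steps x sampled images
old_log_probs = torch.randn(num_steps, batch_size)   # per-step log-probs at sampling time
new_log_probs = old_log_probs + 0.01 * torch.randn(num_steps, batch_size)  # after updates
rewards = torch.randn(batch_size)                    # one reward per final image
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

clip_range = 0.2
ratio = (new_log_probs - old_log_probs).exp()        # per-step importance ratios
unclipped = ratio * advantages                       # advantage broadcast over steps
clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
loss = -torch.min(unclipped, clipped).mean()         # PPO clipped surrogate loss
print(loss)
```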
DDPO and RLHF: A mix to enforce aesthetics
General training aspects of RLHF can be roughly categorized into the following steps:
1. Supervised fine-tuning of a "base" model, i.e., getting the model to learn the distribution of some new data
2. Gathering preference data and training a reward model with it
3. Fine-tuning the model with reinforcement learning, using the reward model as a signal
It should be noted that preference data is the primary source for capturing human feedback in the context of RLHF.
When DDPO is added to the mix, the workflow morphs into the following:
1. Starting with a pretrained diffusion model
2. Gathering preference data and training a reward model with it
3. Fine-tuning the model with DDPO, using the reward model as a signal
Note that step 1 of the typical RLHF workflow, supervised fine-tuning of the base model, is missing from the latter list: we start directly from a pretrained diffusion model. This is because it has been shown empirically (as you will see for yourself) that it is not needed.
On to our venture: take a diffusion model and get it to output images aligned with the human-perceived notion of what it means to be aesthetically pleasing, via the following steps:
1. Starting with a pretrained Stable Diffusion (SD) model
2. Training a frozen CLIP model with a trainable regression head on the Aesthetic Visual Analysis (AVA) dataset, to predict how much humans like an input image on average
3. Fine-tuning the SD model with DDPO, using the aesthetic predictor as the reward signal
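To make step 2 concrete, here is a minimal sketch of what such an aesthetic predictor could look like: a frozen CLIP image encoder with a small trainable regression head. The class name, head architecture, and checkpoint below are our assumptions, not the exact model used:

```python
import torch
import torch.nn as nn
from transformers import CLIPModel

class AestheticScorer(nn.Module):
    """Frozen CLIP image encoder + trainable linear regression head that
    predicts a scalar aesthetic score per image (illustrative sketch)."""

    def __init__(self, clip_name="openai/clip-vit-large-patch14"):
        super().__init__()
        self.clip = CLIPModel.from_pretrained(clip_name)
        for p in self.clip.parameters():   # freeze CLIP; only the head is trained
            p.requires_grad_(False)
        self.head = nn.Linear(self.clip.config.projection_dim, 1)

    def forward(self, pixel_values):
        emb = self.clip.get_image_features(pixel_values=pixel_values)
        emb = emb / emb.norm(dim=-1, keepdim=True)  # unit-normalize embeddings
        return self.head(emb).squeeze(-1)           # one score per image
```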
Keep these steps in mind as you move on to actually doing these runs as described in the next section.
Training Stable Diffusion with DDPO
Setup
To get started with this implementation of DDPO, you need at least an A100 NVIDIA GPU on the hardware side of things. Anything below this GPU type will quickly run into out-of-memory issues.
Install the TRL library along with the diffusers extra using pip:
pip install trl[diffusers]
This installs the main library. The following dependencies are for tracking and image logging. After installing wandb, be sure to log in so that your results are saved to your personal account:
pip install wandb torchvision
Note: you can choose to use tensorboard instead of wandb, in which case you should install the tensorboard package via pip.
Walkthrough
The main classes in the TRL library responsible for DDPO training are the DDPOTrainer and DDPOConfig classes. For more information on DDPOTrainer and DDPOConfig, see the documentation. The TRL repository contains an example training script, which uses both of these classes in tandem with default implementations of the required inputs and default parameters to fine-tune a default pretrained Stable Diffusion model from RunwayML.
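As a rough sketch of how the two classes fit together (the prompt and reward functions below are illustrative stand-ins; consult the example script and documentation for the exact expected signatures):

```python
import random
import torch
from trl import DDPOConfig, DDPOTrainer, DefaultDDPOStableDiffusionPipeline

def prompt_fn():
    # The trainer requests one (prompt, metadata) pair at a time.
    return random.choice(["lion", "otter", "llama"]), {}

def reward_fn(images, prompts, metadata):
    # Stand-in reward: one random score per image, plus a metadata dict.
    # In practice this is replaced by the aesthetic reward model.
    return torch.randn(len(images)), {}

config = DDPOConfig(num_epochs=200, train_batch_size=3, sample_batch_size=6)
pipeline = DefaultDDPOStableDiffusionPipeline("runwayml/stable-diffusion-v1-5")
trainer = DDPOTrainer(config, reward_fn, prompt_fn, pipeline)
trainer.train()
```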
The example script uses wandb for logging and an aesthetic reward model whose weights are read from a public Hugging Face repository (meaning the data collection and reward model training have already been done for you). The default prompt dataset used is a list of animal names.
There is only one command-line flag argument the user needs to supply to get things up and running. Additionally, the user is expected to have a Hugging Face user access token, which is used to upload the fine-tuned model to the Hugging Face Hub.
The following bash command gets things running:
python ddpo.py --hf_user_access_token <token>
The following table contains key hyperparameters that are directly correlated with positive outcomes.
| Parameter | Description | Recommended value for single-GPU training (as of now) |
| --- | --- | --- |
| num_epochs | The number of epochs to train for | 200 |
| train_batch_size | The batch size to use for training | 3 |
| sample_batch_size | The batch size to use for sampling | 6 |
| sample_num_batches_per_epoch | The number of batches to sample per epoch | 4 |
| per_prompt_stat_tracking | Whether to track stats per prompt. If false, advantages are calculated using the mean and std of the entire batch, as opposed to tracking per prompt | True |
| per_prompt_stat_tracking_buffer_size | The size of the buffer used for tracking stats per prompt | 32 |
| mixed_precision | Mixed precision training | True |
| train_learning_rate | Learning rate | 3e-4 |
The provided script is merely a starting point. Feel free to adjust the hyperparameters, or even overhaul the script to accommodate different objective functions. For instance, one could integrate a function that evaluates JPEG compressibility, or one that evaluates visual-text alignment using a multimodal model, among other possibilities; a sketch of the former follows.
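Here is a hedged sketch of a JPEG-compressibility reward in the same shape as the reward function above. It assumes the trainer hands the function a batch of float image tensors in [0, 1] with shape (C, H, W) per image; check these assumptions against the example script:

```python
import io
import numpy as np
import torch
from PIL import Image

def jpeg_compressibility_reward(images, prompts, metadata):
    # Reward images that compress well: smaller JPEG file -> higher reward.
    # Assumes float tensors in [0, 1] with shape (C, H, W); verify against
    # the actual tensor format the trainer passes in.
    sizes = []
    for img in images:
        arr = (img.cpu().permute(1, 2, 0).numpy() * 255).clip(0, 255).astype(np.uint8)
        buf = io.BytesIO()
        Image.fromarray(arr).save(buf, format="JPEG", quality=95)
        sizes.append(len(buf.getvalue()))
    return -torch.tensor(sizes, dtype=torch.float32) / 1000.0, {}
```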
Lessons learned
The results seem to generalize over a wide variety of prompts despite the minimal size of the training prompt set. This has been thoroughly examined for the aesthetic objective function, including explicit attempts to generalize by increasing the training prompt set size and varying the prompts. Non-LoRA fine-tuning appears to produce relatively more intricate images than LoRA; however, getting the right hyperparameters for a stable non-LoRA run is significantly more challenging. Our recommendations for the non-LoRA configuration: set the learning rate relatively low (something around 1e-5 does the trick) and set mixed_precision to "no" (see the sketch below).
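Translated into configuration, that advice might look like the following (a sketch; the use_lora flag and exact parameter names should be checked against the current TRL documentation):

```python
from trl import DDPOConfig, DefaultDDPOStableDiffusionPipeline

# Non-LoRA run: disable LoRA on the pipeline, lower the learning rate,
# and turn off mixed precision, per the recommendations above.
pipeline = DefaultDDPOStableDiffusionPipeline(
    "runwayml/stable-diffusion-v1-5", use_lora=False
)
config = DDPOConfig(train_learning_rate=1e-5, mixed_precision="no")
```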
Results
Below are pre-fine-tuning (left) and post-fine-tuning (right) outputs for the prompts "bear", "heaven", and "dune" (each row shows the outputs for a single prompt):
Limitations
Currently, TRL's DDPOTrainer is limited to fine-tuning vanilla SD models. In our experiments we focused mainly on LoRA, which works very well. We also ran a few experiments with full training, which can lead to better quality, but finding the right hyperparameters there is more difficult.
Conclusion
Diffusion models like Stable Diffusion, when fine-tuned using DDPO, can deliver significant improvements in the quality of generated images as perceived by humans, or by any other metric, as long as it can be expressed as an objective function.
The computational efficiency of DDPO, and its ability to optimize without relying on approximations, make it a suitable candidate for fine-tuning diffusion models, especially compared with earlier methods aimed at the same goal.
The DDPOTrainer in the TRL library implements DDPO for fine-tuning SD models.
Our experimental findings highlight DDPO's ability to generalize to a wide range of prompts, although explicit attempts at generalization through prompt variation yielded mixed results. The difficulty of finding suitable hyperparameters for non-LoRA setups also emerged as an important lesson.
DDPO is a promising technique for aligning diffusion models with any reward function, and we hope that its release in TRL makes it more accessible to the community.
Acknowledgments
Thanks to Chunte Lee for the thumbnail of this blog post.

