Direct Preference Optimization Explained In-depth
Simpler preference-tuning without reinforcement learning
April 2024
With my first blog post, I want to cover an excellent paper that was published last year: Direct Preference Optimization: Your Language Model is Secretly a Reward Model by Rafailov et al.
Commonly referred to as DPO, this method of preference tuning is an alternative to Reinforcement Learning from Human Feedback (RLHF) that avoids the actual reinforcement learning. In this blog post, I will explain DPO from first principles; readers do not need an understanding of RLHF. However, fair warning that there will be some math involved - mostly probability, algebra, and optimization - but I will do my best to explain everything clearly.
Training, tuning, and aligning LLMs
To contextualize DPO, and preference-tuning in general, let’s review the modern process for creating language models such as ChatGPT or Claude. The following steps are sequential, with each one building upon the previous:
1. Pre-train a base model on internet-scale data. Given a snippet of text, this model is trained to predict the immediate next word. This conceptually simple task scales up extremely well and allows LLMs to encode a huge amount of knowledge from their training data. Examples of base models include GPT-3, Llama 3, and Mistral.
2. Take a pre-trained base model and fine-tune it on a task-specific dataset of demonstrations. For example, if you are trying to create a helpful dialog model like ChatGPT, you would want to tune your model on a dataset of conversational dialog, so that your model’s outputs sound more like parts of a conversation and less like a Wikipedia page. In this stage, we still use the next word prediction task, and the fine-tuning procedure updates our model to make predictions that more closely align with the high-quality task-specific examples we are feeding it. Examples of fine-tuned models in this stage are Alpaca, Vicuna and Mistral-Instruct.
3. Finally, we fine-tune the model based on human preferences. Human preferences are powerful because they are so easily and cheaply expressed. Think of how easy it is to compare two movies and pick a favorite, yet how difficult it would be to make a film that embodies the qualities that drive you to visit a theater. Similarly, it is challenging to describe exactly how we want our model to behave (as we attempt to do in step 2), but given examples of model behavior it is straightforward to indicate a preference for a specific type of behavior. For a while, this sort of preference-tuning was done using RLHF. Recently, RLHF has been somewhat supplanted by DPO due to the relative simplicity of the latter. LLMs that have been tuned using human preferences include Llama 3 Instruct, ChatGPT-4, Claude 3 Opus, and Gemini Ultra.
The Gemini whitepaper provides a nice visual representation of these stages:
Tuning LLMs on preference data
It is hard and time-consuming work to create high-quality demonstrations of the behavior we want our LLM to mimic, and it would be expensive to hire labelers to help us create such data. However, once we have a model that is “good enough” at demonstrating desired behavior, we can shift into high gear. Given a prompt, we can sample two different responses from our LLM by injecting a small amount of randomness (this is typically done by generating text with a temperature that is greater than zero). Now, it is cheap and easy to have a labeler express a preference for one of the two completions.
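To make the role of temperature concrete, here is a minimal sketch of temperature sampling over a toy next-token distribution. The function name `sample_with_temperature` and the logit values are made up for illustration; real LLM decoding code differs in many details.

```python
import math
import random

def sample_with_temperature(logits, temperature, rng):
    """Sample an index from softmax(logits / temperature)."""
    if temperature <= 0:
        # Assumption: temperature 0 is treated as greedy decoding (argmax).
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [value / temperature for value in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(value - largest) for value in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

rng = random.Random(0)
toy_logits = [2.0, 1.5, 0.3]  # hypothetical scores for three candidate tokens

# With temperature 0 every sample is identical; with temperature 1 they vary,
# which is what lets us draw two different completions for the same prompt.
greedy_samples = [sample_with_temperature(toy_logits, 0.0, rng) for _ in range(5)]
random_samples = [sample_with_temperature(toy_logits, 1.0, rng) for _ in range(200)]
print(greedy_samples, sorted(set(random_samples)))
```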
While using ChatGPT or Gemini, you may have noticed that you will occasionally be asked to choose between two similar answers from which to continue your conversation. This preference is recorded and used to improve the model in a future round of preference-tuning. Similarly, Chatbot Arena collects preference data for the purpose of rating LLMs based on human assessments:
There are many publicly available preference datasets, such as LMSys’ Chatbot Arena Conversations dataset, OpenAI’s WebGPT Comparisons dataset, and Anthropic’s Helpfulness-Harmlessness RLHF dataset (explicit/offensive content warning).
Formally, these datasets can be expressed as follows:
$$\mathcal{D} = \left\{ x^{(i)}, y_w^{(i)}, y_l^{(i)} \right\}_{i=1}^{N}$$

where $x$ is the context/prompt, $y_w$ is the preferred completion, and $y_l$ is the less desirable completion.
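In code, such a dataset is often just a list of (prompt, preferred, rejected) triples. The sketch below is one possible representation; the class name `PreferencePair`, its field names, and the example strings are all hypothetical and not taken from any of the datasets mentioned above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PreferencePair:
    prompt: str    # x: the context/prompt
    chosen: str    # y_w: the preferred completion
    rejected: str  # y_l: the less desirable completion

# Two made-up examples, purely for illustration.
dataset = [
    PreferencePair(
        prompt="What is the capital of France?",
        chosen="The capital of France is Paris.",
        rejected="France is a country in Europe.",
    ),
    PreferencePair(
        prompt="Give me a synonym for 'happy'.",
        chosen="A synonym for 'happy' is 'joyful'.",
        rejected="Happy is an adjective.",
    ),
]

for pair in dataset:
    print(pair.prompt, "->", pair.chosen)
```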
The Bradley-Terry Model
So what do we do with all this preference data? We want to leverage it to modify our LLM to output responses that better conform to the preferences. To begin, let us explore a simple probability model:
$$p^*(i \succ j) = \frac{s_i}{s_i + s_j}$$

This is the Bradley-Terry model, which is a model for the outcome of pairwise comparisons. In plain English, it says "We model the true probability that outcome i is preferred to outcome j as the score of i over the combined scores of i and j". (The “star” in $p^*$ indicates that we are modeling the true underlying distribution of human preferences. Likewise, shortly we will see $r^*$, which indicates the true underlying reward function that grades our completions, and $\pi^*$, which indicates the optimal policy we want our LLM to mimic.)

Readers may be familiar with the Bradley-Terry model from the context of Elo scores, which are popular in chess and other competitive games. The Bradley-Terry model is a generalization of the Elo rating system, where the probability of player A beating player B is given by $p(A \succ B) = \frac{1}{1 + 10^{(R_B - R_A)/400}} = \frac{s_A}{s_A + s_B}$. Here $R$ indicates a player’s rating and $s = 10^{R/400}$. (So if player A’s Elo rating is 2000 and player B’s is 1600, then player A is expected to be 10 times more likely to win than player B, because $p(A \succ B) = \frac{1}{1 + 10^{(1600 - 2000)/400}} = 10/11$.)
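The Elo example can be checked numerically. The following is a small sketch (the function names are my own) showing that the Elo formula and the Bradley-Terry formula with $s = 10^{R/400}$ give the same win probability for ratings of 2000 and 1600.

```python
def bradley_terry(score_i, score_j):
    """Probability that i is preferred to j, given positive scores."""
    return score_i / (score_i + score_j)

def elo_win_probability(rating_a, rating_b):
    """Elo's expected win probability of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_score(rating):
    """Bradley-Terry score implied by an Elo rating: s = 10^(R / 400)."""
    return 10 ** (rating / 400)

p_elo = elo_win_probability(2000, 1600)
p_bt = bradley_terry(elo_score(2000), elo_score(1600))
print(p_elo, p_bt)  # both are approximately 10/11
```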
Under the Bradley-Terry model, it is common to parameterize the score as $s = e^r$, where $r$ stands for reward. The term “reward” is borrowed from the world of reinforcement learning, where greater rewards are received for a more desirable series of actions - similar to achieving a higher score for performing better in a video game.

With this parameterization, our model starts to look pretty nice - a simple difference in reward values passed through the logistic function. (The logistic function is an S-shaped (or sigmoid) function commonly denoted using $\sigma(x)$. It frequently appears when working with probabilities because it can “squash” values in $\mathbb{R}$ (the set of all real numbers) into $(0, 1)$ (the set of probability values, excluding exactly 0 and 1).)

$$p^*(i \succ j) = \frac{s_i}{s_i + s_j} = \frac{e^{r_i^*}}{e^{r_i^*} + e^{r_j^*}} = \frac{1}{1 + e^{-(r_i^* - r_j^*)}} = \sigma(r_i^* - r_j^*)$$

Applying the Bradley-Terry Model to LLMs
Now, we want to take the Bradley-Terry model and leverage it alongside a dataset of preferences in order to improve our LLM’s generated outputs.
In our preference dataset (D), each example contains two completions for the same prompt, and we want to model the probability of one completion being preferred over the other. In a sense, each completion elicits some reward based on its quality, and our ultimate goal will be to nudge our LLM to produce completions that are of higher quality. Therefore, we will parameterize the reward using our LLM. We will call this reward r∗(x,y), which just means that the reward is a function of the context/prompt (x) and the completion (y).
So after adapting our preference model to use our parameterized reward function, we have:
$$p^*(y_1 \succ y_2 \mid x) = \sigma\left(r^*(x, y_1) - r^*(x, y_2)\right)$$

But talking in terms of optimal solutions and rewards does us no good, since we do not have access to the optimal reward function. In practice, it is common to learn a reward model $r_\phi(x, y)$ that mimics the optimal reward function. We can estimate the parameters $\phi$ of this reward model by framing this as a binary classification problem where our objective is to minimize the following negative log-likelihood loss function on our preference dataset $\mathcal{D}$. (The notation $\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}[f(x, y_w, y_l)]$ is just a formal way of saying "the expected value of function f on data points sampled from our preference dataset".)

$$\mathcal{L}_R(r_\phi, \mathcal{D}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]$$

Under the RLHF framework, we could leverage this learned reward model in a reinforcement learning setting to optimize an LLM to output completions that achieve high rewards. However, DPO takes a different tack - instead of the two-stage RLHF process, DPO reparameterizes the Bradley-Terry model so that we can use a similar loss function to directly optimize the parameters of our LLM such that it produces outputs that are preferred by human observers.
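As a rough illustration of the reward-model loss above, the sketch below evaluates the negative log-likelihood on a few hand-picked reward values. It assumes the rewards for the preferred and rejected completions have already been computed by some reward model; the numbers and function names are invented for illustration.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def reward_model_loss(reward_pairs):
    """Mean of -log sigmoid(r_w - r_l) over (r_w, r_l) reward pairs."""
    total = sum(log_sigmoid(r_w - r_l) for r_w, r_l in reward_pairs)
    return -total / len(reward_pairs)

# Hypothetical rewards: (reward of preferred, reward of rejected).
agrees_with_labels = reward_model_loss([(2.0, -1.0), (0.5, -0.5)])
indifferent = reward_model_loss([(0.0, 0.0)])
disagrees_with_labels = reward_model_loss([(-1.0, 2.0), (-0.5, 0.5)])

# The loss is low when preferred completions get higher rewards,
# log(2) when the model is indifferent, and high when it gets the order wrong.
print(agrees_with_labels, indifferent, disagrees_with_labels)
```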
The probability of a completion
At this point, the idea of optimizing LLMs based on preferences or rewards may feel fairly abstract. So we’re going to take a moment to introduce a new probability function, π(y∣x), that represents the literal output of our LLM. In reinforcement learning notation, π indicates a policy (i.e. a strategy), and policies are optimized to maximize reward. Specifically, πθ(y∣x) is the probability of generating the completion y based on an LLM with parameters θ given that we start with prompt x.
What do we mean by "the probability of generating the completion y"? Our LLM is an auto-regressive text generator, and, upon each auto-regressive step, it computes a probability value for every word in its vocabulary. (In practice, modern LLMs operate on tokens, not words. For our purposes, the difference doesn’t really matter. You can learn more by playing with an online tokenizer demo or digging through Karpathy’s minbpe repo.)

So - proceeding in order through every word in completion y - we compute the probability of the next word in the completion given all of the preceding words. Now, we have a probability value for every word in the completion! So we can compute the joint probability of generating the sequence of words as the product of the individual probabilities of observing each word along the way:

$$\pi_\theta(y \mid x) = \prod_{t=0}^{|y|} p_{\text{LLM}_\theta}(y_t \mid x, y_{0:t})$$

(Multiplying probabilities can result in numerical underflow. It is common to instead work with logprobs: $\prod_i p_i = e^{\sum_i \log p_i}$. Since every term in the summation of logprobs increases the magnitude of its output, underflow is avoided. OpenAI has a nice guide to using token logprobs returned by an LLM.)

Another way to think about it is that there is a tree of possible completions and we are computing the probability of tracing one specific path from the root (end of the prompt) to a leaf (stop-token).

When training, we know the entire text completion ahead of time, so, by applying a causal attention mask, we can calculate all of the individual next-word probabilities (and thus $\pi_\theta(y \mid x)$) via a single forward pass through our LLM.
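Here is a minimal sketch of the sequence probability computation, assuming we already have the per-token probabilities that the model assigned to each token of the completion (the values below are invented). It also shows why summing logprobs is preferred over multiplying probabilities.

```python
import math

def sequence_logprob(token_probs):
    """log pi(y | x) as the sum of per-token log-probabilities."""
    return sum(math.log(p) for p in token_probs)

# Hypothetical probabilities assigned to each token of a short completion.
token_probs = [0.5, 0.25, 0.8]
direct_product = math.prod(token_probs)
logp = sequence_logprob(token_probs)
print(direct_product, math.exp(logp))  # the two agree for short sequences

# For a long completion the direct product underflows to zero in float64,
# while the sum of logprobs remains a perfectly usable finite number.
long_token_probs = [0.01] * 400
underflowed_product = math.prod(long_token_probs)
long_logp = sequence_logprob(long_token_probs)
print(underflowed_product, long_logp)
```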
Optimizing our LLM based on preferences
Ok, so now we’ve got our framework in place. Let us remind ourselves of our goal: to improve the outputs of our LLM. Stated another way, we want the completion (y) our LLM provides for a prompt (x) to generate a large reward r(x,y). With this in mind, we can formulate an optimization problem where we want to find the parameters of our LLM (θ) that maximize our expected reward for prompts similar to those we see in practice. (The notation $\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)}[r(x, y)]$ is just a formal way of saying "the expected reward attained by completions generated/sampled from our model ($y \sim \pi_\theta(y \mid x)$) based on prompts sampled from our dataset ($x \sim \mathcal{D}$)".)

$$\max_\theta \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)}\left[r(x, y)\right]$$

This is a bit too simplistic, however. In practice, we start with the parameters of our fine-tuned base model, and we have some belief that the outputs generated by our fine-tuned base model are pretty good, so we don’t want the outputs of our model to change too much unless they improve the reward significantly. With that in mind, we amend our optimization problem to include a regularization constraint to help enforce this belief.

$$\max_\theta \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)}\left[r(x, y)\right] - \beta D_{KL}\left[\pi_\theta(y \mid x) \,\|\, \pi_{\text{ref}}(y \mid x)\right]$$

$D_{KL}[P \,\|\, Q]$ is the Kullback-Leibler divergence, a statistical distance measure. It quantifies how the probability distribution P differs from probability distribution Q. (KL divergence is one of many traditional methods for regularizing an RL agent’s policy. In the cases of DPO and RLHF, it is a natural choice because we begin with a strong reference policy at hand - the LLM output by our fine-tuning procedure.) This constraint based on the KL divergence just encodes the idea that we want to penalize outputs from our model ($\pi_\theta$) based on how much they differ from outputs from the fine-tuned model (i.e. the reference model) we started with ($\pi_{\text{ref}}$). $\beta$ is a scalar hyperparameter that controls the strength of the constraint.
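To get a feel for how β trades off reward against staying close to the reference model, here is a toy sketch with a single prompt and only three possible completions, so that every quantity can be computed exactly. The distributions, rewards, and function names are all made up for illustration.

```python
import math

def kl_divergence(p, q):
    """D_KL[P || Q] for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def regularized_objective(policy, reference, rewards, beta):
    """Expected reward under the policy minus beta times KL(policy || reference)."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - beta * kl_divergence(policy, reference)

reference = [0.5, 0.3, 0.2]             # hypothetical pi_ref over 3 completions
rewards = [1.0, 0.0, 2.0]               # hypothetical r(x, y) for each completion
reward_chaser = [0.001, 0.001, 0.998]   # puts almost all mass on the best reward

results = {}
for beta in (0.1, 5.0):
    results[beta] = {
        "stay": regularized_objective(reference, reference, rewards, beta),
        "move": regularized_objective(reward_chaser, reference, rewards, beta),
    }
    print(beta, results[beta])
```

With the small β, abandoning the reference to chase reward scores higher; with the large β, the KL penalty dominates and staying at the reference scores higher.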
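Now, we want to derive the optimal solution to this optimization problem. This will rely on Gibbs' inequality - the fact that $D_{KL}[P \,\|\, Q] \geq 0$ and $D_{KL}[P \,\|\, Q] = 0$ if and only if $P = Q$. (The intuition here is that the KL-divergence is a distance measure (kind of), and there is no distance between P and Q if they are equal, and there must be some distance if they are not equal.)

$$
\begin{aligned}
&\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)}\left[r(x, y)\right] - \beta D_{KL}\left[\pi_\theta(y \mid x) \,\|\, \pi_{\text{ref}}(y \mid x)\right] \\
&= \max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)}\left[r(x, y)\right] - \beta \, \mathbb{E}_{y \sim \pi_\theta(y \mid x)}\left[\log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}\right] \\
&= \max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D}} \, \mathbb{E}_{y \sim \pi_\theta(y \mid x)}\left[r(x, y) - \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}\right] \\
&= \min_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D}} \, \mathbb{E}_{y \sim \pi_\theta(y \mid x)}\left[\log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)} - \frac{1}{\beta} r(x, y)\right] \\
&= \min_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D}} \, \mathbb{E}_{y \sim \pi_\theta(y \mid x)}\left[\log \frac{\pi_\theta(y \mid x)}{\frac{1}{Z(x)} \pi_{\text{ref}}(y \mid x) e^{\frac{1}{\beta} r(x, y)}} - \log Z(x)\right] \\
&= \dots
\end{aligned}
$$

where $Z(x) = \sum_y \pi_{\text{ref}}(y \mid x) e^{\frac{1}{\beta} r(x, y)}$. Importantly, this $Z(x)$ term depends only on $x$ and $\pi_{\text{ref}}$ and not on $y$ or $\pi_\theta$. This lets us do a bit of reorganizing from where we just left off.

$$
\begin{aligned}
\dots &= \min_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D}}\left[\mathbb{E}_{y \sim \pi_\theta(y \mid x)}\left[\log \frac{\pi_\theta(y \mid x)}{\frac{1}{Z(x)} \pi_{\text{ref}}(y \mid x) e^{\frac{1}{\beta} r(x, y)}}\right] - \log Z(x)\right] \\
&= \min_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D}}\left[D_{KL}\left(\pi_\theta(y \mid x) \,\Big\|\, \frac{1}{Z(x)} \pi_{\text{ref}}(y \mid x) e^{\frac{1}{\beta} r(x, y)}\right) - \log Z(x)\right]
\end{aligned}
$$

And we have nearly arrived! Since $Z(x)$ does not depend on $\pi_\theta$, we can just ignore it when deriving the optimal solution. We can now use Gibbs' inequality as mentioned above: $D_{KL}\left(\pi_\theta(y \mid x) \,\|\, \frac{1}{Z(x)} \pi_{\text{ref}}(y \mid x) e^{\frac{1}{\beta} r(x, y)}\right)$ is minimized at zero if, and only if, the two distributions on either side of $\|$ are identical. So, the optimal solution (denoted as $\pi^*$) to our optimization problem for all $x \in \mathcal{D}$ is:

$$\pi^*(y \mid x) = \pi_\theta(y \mid x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y \mid x) e^{\frac{1}{\beta} r(x, y)}$$

Direct Preference Optimization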
So we know the optimal solution to our optimization problem, but can we access it? No. The term $Z(x) = \sum_y \pi_{\text{ref}}(y \mid x) e^{\frac{1}{\beta} r(x, y)}$ is intractable - computing it requires summing over every possible string of words.
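In a toy setting with only three possible completions, Z(x) is trivial to compute, which lets us sanity-check the closed-form solution derived above. The sketch below (all numbers and names are invented for illustration) builds the closed-form policy and compares its regularized objective against many randomly drawn policies; it also checks that the attained maximum equals β log Z(x), which follows from the last line of the derivation.

```python
import math
import random

def kl_divergence(p, q):
    """D_KL[P || Q] for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def objective(policy, reference, rewards, beta):
    """Expected reward minus beta times KL(policy || reference)."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - beta * kl_divergence(policy, reference)

def closed_form_policy(reference, rewards, beta):
    """pi*(y|x) = pi_ref(y|x) * exp(r(x,y) / beta) / Z(x), plus Z(x) itself."""
    weights = [q * math.exp(r / beta) for q, r in zip(reference, rewards)]
    z = sum(weights)  # tractable only because this toy has 3 completions
    return [w / z for w in weights], z

reference = [0.5, 0.3, 0.2]  # hypothetical pi_ref
rewards = [1.0, 0.0, 2.0]    # hypothetical rewards
beta = 0.5

pi_star, z = closed_form_policy(reference, rewards, beta)
best = objective(pi_star, reference, rewards, beta)

# Draw random alternative policies; none should beat the closed-form one.
rng = random.Random(0)
best_random = -math.inf
for _ in range(1000):
    raw = [rng.random() for _ in reference]
    candidate = [value / sum(raw) for value in raw]
    best_random = max(best_random, objective(candidate, reference, rewards, beta))

print(pi_star, best, best_random, beta * math.log(z))
```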
Instead, we can reorganize the optimal solution from above such that we express the reward function in terms of the optimal policy πθ, the reference policy πref, and the intractable function Z:
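$$r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$$

This same reorganization can be applied using the underlying ground-truth reward $r^*$ and its corresponding optimal policy $\pi^*$.

$$r^*(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$$

Now here comes the clever trick noticed by the authors of DPO. We can use this reorganized expression of the optimal solution to our optimization problem to reparameterize the Bradley-Terry preference model from above so that it is expressed in terms of an optimal policy $\pi^*$ and not in terms of an underlying reward function! And even better, once we plug everything in, we notice that the intractable $Z(x)$ function cancels out!

$$
\begin{aligned}
p^*(y_1 \succ y_2 \mid x) &= \sigma\left(r^*(x, y_1) - r^*(x, y_2)\right) \\
&= \sigma\left(\beta \log \frac{\pi^*(y_1 \mid x)}{\pi_{\text{ref}}(y_1 \mid x)} + \beta \log Z(x) - \left(\beta \log \frac{\pi^*(y_2 \mid x)}{\pi_{\text{ref}}(y_2 \mid x)} + \beta \log Z(x)\right)\right) \\
&= \sigma\left(\beta \log \frac{\pi^*(y_1 \mid x)}{\pi_{\text{ref}}(y_1 \mid x)} - \beta \log \frac{\pi^*(y_2 \mid x)}{\pi_{\text{ref}}(y_2 \mid x)}\right)
\end{aligned}
$$

Now, with our reparameterized Bradley-Terry model, we can use supervised learning to directly learn a policy that mimics the optimal policy. We can minimize a negative log-likelihood loss function over our preference dataset $\mathcal{D}$ to estimate the parameters of our policy $\pi_\theta$:

$$
\begin{aligned}
\mathcal{L}_{DPO}(\pi_\theta; \pi_{\text{ref}}) &= -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right] \\
&= -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(\beta \left(\log \frac{\pi_\theta(y_w \mid x)}{\pi_\theta(y_l \mid x)} - \log \frac{\pi_{\text{ref}}(y_w \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right)\right]
\end{aligned}
$$

Recall that above we optimized a negative log-likelihood loss to estimate the parameters of a reward model that was then used downstream by RLHF to estimate the parameters of a policy model. But now we are directly optimizing the parameters of our LLM policy model based on human preferences! Thus, Direct Preference Optimization.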
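The DPO loss is simple enough to sketch in a few lines. The version below is a plain-Python illustration, not the paper's reference implementation: it assumes we have already computed sequence log-probabilities for the preferred and rejected completions under both the policy and the reference model, and the log-probability values are invented.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(policy_logps, reference_logps, beta):
    """Mean DPO loss over a batch.

    Each argument is a list of (logp_chosen, logp_rejected) pairs, i.e. the
    sequence log-probabilities of y_w and y_l under the respective model.
    """
    total = 0.0
    for (pol_w, pol_l), (ref_w, ref_l) in zip(policy_logps, reference_logps):
        # beta * (log-ratio of the chosen completion - log-ratio of the rejected one)
        margin = beta * ((pol_w - ref_w) - (pol_l - ref_l))
        total += log_sigmoid(margin)
    return -total / len(policy_logps)

reference_logps = [(-12.0, -11.0)]  # hypothetical log pi_ref(y_w|x), log pi_ref(y_l|x)

# At initialization the policy equals the reference, so the loss is log(2).
loss_at_init = dpo_loss(reference_logps, reference_logps, beta=0.1)
# Raising the chosen completion's probability and lowering the rejected one's helps.
loss_improved = dpo_loss([(-10.0, -13.0)], reference_logps, beta=0.1)
# Moving in the wrong direction hurts.
loss_worse = dpo_loss([(-14.0, -9.0)], reference_logps, beta=0.1)
print(loss_at_init, loss_improved, loss_worse)
```

In a real training loop these log-probabilities would come from forward passes through the two models, and the loss would be minimized with respect to the policy's parameters only, with the reference model held fixed.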
To be explicit about the benefits of DPO over RLHF:
- We avoid the need to train a reward model to estimate human preferences.
- We avoid needing to perform any type of reinforcement learning, which is notoriously difficult and requires a lot of tribal knowledge to get right.
- We can directly optimize our LLM on human preferences using supervised learning, which is a much more straightforward and well-understood process.
The avoidance of reinforcement learning is particularly important. DPO has made preference-tuning a much more accessible process for practitioners who may not have the time, resources, or expertise to navigate the complexities of reinforcement learning.
Properties and Caveats of DPO
One of the key properties of DPO is that when the Bradley-Terry model perfectly fits our preference data and RLHF learns the optimal reward function, then the global optimizer of RLHF and DPO is the same.
This is an important equivalence result; however, in practice:
- The Bradley-Terry model often does not perfectly fit the preference data. (For example, a preference cycle would cause the Bradley-Terry model to fail to perfectly fit the data. The Bradley-Terry model assumes transitive preferences: if A≻B and B≻C then it expects that A≻C. But if instead C≻A, then there is a cycle and transitivity is broken.)
- The reward function learned by RLHF will not be the optimal reward function.
- Gradient descent on a highly non-convex loss landscape - such as that of an LLM - does not find the global optimizer.
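The first caveat, about preference cycles, can be made concrete with a small check. Under the Bradley-Terry model the log-odds of "i is preferred to j" equal $r_i - r_j$, so the log-odds summed around any cycle A→B→C→A must cancel to zero. The sketch below (function names and probabilities are my own) shows that transitive preferences satisfy this, while a cycle in which each option beats the next 90% of the time cannot be fit by any choice of rewards.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log-odds of a probability; the inverse of the sigmoid."""
    return math.log(p / (1.0 - p))

def cycle_logit_sum(p_ab, p_bc, p_ca):
    """Sum of log-odds around the cycle A -> B -> C -> A.

    Under Bradley-Terry this is (r_A - r_B) + (r_B - r_C) + (r_C - r_A) = 0.
    """
    return logit(p_ab) + logit(p_bc) + logit(p_ca)

# Preferences generated from rewards r_A = 2, r_B = 1, r_C = 0 are consistent.
consistent = cycle_logit_sum(sigmoid(2 - 1), sigmoid(1 - 0), sigmoid(0 - 2))

# A cycle: A beats B, B beats C, and C beats A, each 90% of the time.
cyclic = cycle_logit_sum(0.9, 0.9, 0.9)

print(consistent, cyclic)  # approximately zero versus clearly non-zero
```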
Another weakness of DPO is that it is prone to overfitting, because its KL regularization can become ineffective in practice. Azar et al. provide a compelling example (the original notation of the quote has been adjusted slightly to match the rest of this post):

"Consider the simple example where we have two actions $y_1$ and $y_2$ such that $p^*(y_1 \succ y_2) = 1$, i.e., $y_1$ is always preferred to $y_2$. Then the Bradley-Terry model would require that $(r(y_1) - r(y_2)) \to +\infty$ to [be satisfied]. If we plug this into the optimal policy then we would get that $\frac{\pi^*(y_2)}{\pi^*(y_1)} = 0$ (i.e. $\pi^*(y_2) = 0$) … Thus the strength of the KL-regularization becomes weaker and weaker the more deterministic the preferences."

They also point out that, in practice, we have a finite amount of preference data. Therefore, we are likely to empirically estimate $\hat{p}(y_1 \succ y_2) = 1$ simply because we’ve only seen a small number of comparisons between $y_1$ and $y_2$. Therefore the empirical optimal policy would push $\pi(y_2) = 0$ regardless of the regularization term that is attempting to keep the policy similar to our reference policy.
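This failure mode can be seen directly in the shape of the per-pair DPO loss. The sketch below (my own illustration, not code from Azar et al.) shows that when a pair is always labeled the same way, the loss keeps decreasing as the margin grows and never reaches a finite minimizer, and that for any margin the implied ratio π(y2)/π(y1) shrinks towards zero, with smaller β values shrinking it faster.

```python
import math

def dpo_pair_loss(margin):
    """-log sigmoid(margin) for a single pair (written for margin >= 0).

    Here margin = beta * (log-ratio of y1 minus log-ratio of y2), where each
    log-ratio is log pi(y) - log pi_ref(y).
    """
    return math.log1p(math.exp(-margin))

def implied_ratio(margin, beta, reference_ratio=1.0):
    """pi(y2) / pi(y1) implied by a margin, given pi_ref(y2) / pi_ref(y1)."""
    return reference_ratio * math.exp(-margin / beta)

margins = [0.0, 1.0, 5.0, 10.0, 20.0]
losses = [dpo_pair_loss(m) for m in margins]

# The loss is strictly decreasing in the margin and only approaches zero,
# so gradient descent is always rewarded for pushing the margin higher.
for m, loss in zip(margins, losses):
    print(m, loss, implied_ratio(m, beta=0.1), implied_ratio(m, beta=10.0))
```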
Despite these shortcomings, DPO is a highly effective tool; at the time of writing, many of the most successful and performant open-source LLMs were instruction-tuned using DPO.
Interested in learning more?
I highly recommend reading the DPO paper. In this post, we’ve done a deep dive into the derivation of the DPO objective, but the paper covers other points of interest, such as experimental results and additional theoretical properties.
And if you’re interested in learning more about preference-tuning in general, here are additional resources that provide a deeper dive into the topic:
References
[1] Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv. https://arxiv.org/abs/2305.18290.
[2] Bertrand, Q., Czarnecki, W. M., & Gidel, G. (2023). On the limitations of Elo: Real-world games are transitive, not additive. arXiv. https://arxiv.org/abs/2206.12301.
[3] Azar, M. G., Rowland, M., Piot, B., Guo, D., Calandriello, D., Valko, M., & Munos, R. (2023). A General Theoretical Paradigm to Understand Learning from Human Preferences. arXiv. https://arxiv.org/abs/2310.12036.
[4] Jitkrittum, W. (2013). Log-Sum-Exp Trick to Prevent Numerical Underflow. http://wittawat.com/posts/log-sum_exp_underflow.html
[5] Gemini Team (2024). Gemini: A Family of Highly Capable Multimodal Models. arXiv. https://arxiv.org/abs/2312.11805.
[6] Andrychowicz, M., Raichuk, A., Stańczyk, P., Orsini, M., Girgin, S., Marinier, R., Hussenot, L., Geist, M., Pietquin, O., Michalski, M., Gelly, S., & Bachem, O. (2020). What Matters In On-Policy Reinforcement Learning? A Large-Scale Empirical Study. arXiv. https://arxiv.org/abs/2006.05990.