Direct Preference Optimization Explained In-depth
Simpler preference-tuning without reinforcement learning
April 2024
With my first blog post, I want to cover an excellent paper that was published last year: Direct Preference Optimization: Your Language Model is Secretly a Reward Model by Rafailov et al.
Commonly referred to as DPO, this method of preference tuning is an alternative to Reinforcement Learning from Human Feedback (RLHF) that avoids the actual reinforcement learning. In this blog post, I will explain DPO from first principles; readers do not need an understanding of RLHF. However, fair warning: there will be some math involved (mostly probability, algebra, and optimization), but I will do my best to explain everything clearly.
Training, tuning, and aligning LLMs
To contextualize DPO, and preference-tuning in general, let’s review the modern process for creating language models such as ChatGPT or Claude. The following steps are sequential, with each one building upon the previous:

1. Pre-train a base model on internet-scale data. Given a snippet of text, this model is trained to predict the immediate next word. This conceptually simple task scales up extremely well and allows LLMs to encode a huge amount of knowledge from their training data. Examples of base models include GPT-3, Llama 3, and Mistral.

2. Take a pre-trained base model and fine-tune it on a task-specific dataset of demonstrations. For example, if you are trying to create a helpful dialog model like ChatGPT, you would want to tune your model on a dataset of conversational dialog, so that your model’s outputs sound more like parts of a conversation and less like a Wikipedia page. In this stage, we still use the next word prediction task, and the fine-tuning procedure updates our model to make predictions that more closely align with the high-quality task-specific examples we are feeding it. Examples of fine-tuned models in this stage are Alpaca, Vicuna, and Mistral-Instruct.

3. Finally, we fine-tune the model based on human preferences. Human preferences are powerful because they are so easily and cheaply expressed. Think of how easy it is to compare two movies and pick a favorite. Yet how difficult it would be to make a film that embodies the qualities that drive you to visit a theater. Similarly, it is challenging to describe exactly how we want our model to behave (as we attempt to do in step 2), but given examples of model behavior it is straightforward to indicate a preference for a specific type of behavior. For a while, this sort of preference-tuning was done using RLHF. Recently, RLHF has been somewhat supplanted by DPO due to the relative simplicity of the latter. LLMs that have been tuned using human preferences include Llama 3 Instruct, ChatGPT-4, Claude 3 Opus, and Gemini Ultra.
The Gemini whitepaper provides a nice visual representation of these stages.
Tuning LLMs on preference data
It is hard and time-consuming work to create high-quality demonstrations of the behavior we want our LLM to mimic, and it would be expensive to hire labelers to help us create such data. However, once we have a model that is “good enough” at demonstrating desired behavior, we can shift into high gear. Given a prompt, we can sample two different responses from our LLM by injecting a small amount of randomness (this is typically done by generating text with a temperature that is greater than zero). Now, it is cheap and easy to have a labeler express a preference for one of the two completions.
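To make the role of temperature concrete, here is a minimal Python sketch. The three-word vocabulary and its scores are made up for illustration; the mechanics are what matter: divide the model's raw scores by the temperature, apply a softmax, and sample from the result.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a higher temperature flattens the distribution."""
    scaled = [score / temperature for score in logits]
    largest = max(scaled)  # subtract the max before exponentiating, for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(probs, rng):
    """Draw one index at random, weighted by probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical next-word scores for a tiny three-word vocabulary.
vocab = ["great", "good", "fine"]
logits = [2.0, 1.0, 0.1]

cold = softmax_with_temperature(logits, 0.1)  # almost all mass on the top word
warm = softmax_with_temperature(logits, 1.0)  # mass spread across the words

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [vocab[sample_index(warm, rng)] for _ in range(5)]
print(cold, warm, samples)
```

With a low temperature the two sampled responses would almost always be identical, which is why a temperature above zero is needed to obtain two distinct completions to compare.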
While using ChatGPT or Gemini, you may have noticed that you will occasionally be asked to choose between two similar answers from which to continue your conversation. This preference is recorded and used to improve the model in a future round of preference-tuning. Similarly, Chatbot Arena collects preference data for the purpose of rating LLMs based on human assessments.
There are many publicly available preference datasets, such as LMSys’ Chatbot Arena Conversations dataset, OpenAI’s WebGPT Comparisons dataset, and Anthropic’s Helpfulness-Harmlessness RLHF dataset (explicit/offensive content warning).
Formally, these datasets can be expressed as follows:
$$\mathcal{D} = \left\{\left(x^{(i)}, y_w^{(i)}, y_l^{(i)}\right)\right\}_{i=1}^{N}$$
where $x$ is the context/prompt, $y_w$ is the preferred completion, and $y_l$ is the less desirable completion.
The Bradley-Terry Model
So what do we do with all this preference data? We want to leverage it to modify our LLM to output responses that better conform to the preferences. To begin, let us explore a simple probability model:
$$p^*(i \succ j) = \frac{s_i}{s_i + s_j}$$
This is the Bradley-Terry model, which is a model for the outcome of pairwise comparisons. In plain English, it says "We model the true probability that outcome $i$ is preferred to outcome $j$ as the score of $i$ over the combined scores of $i$ and $j$". The word "true" is the reason for the “star” in $p^*$: it indicates that we are modeling the true underlying distribution of human preferences. Likewise, shortly we will see $r^*$, which indicates the true underlying reward function that grades our completions, and $\pi^*$, which indicates the optimal policy we want our LLM to mimic.
Readers may be familiar with the Bradley-Terry model from the context of Elo scores, which are popular in chess and other competitive games. The Bradley-Terry model is a generalization of the Elo rating system, where the probability of player A beating player B is given by $p(A \succ B) = \frac{1}{1 + 10^{(R_B - R_A)/400}} = \frac{s_A}{s_A + s_B}$. Here $R$ indicates a player’s rating and $s = 10^{R/400}$. So if player A’s Elo rating is 2000 and player B’s is 1600, then player A is expected to be 10 times more likely to win than player B, because $p(A \succ B)=\frac{1}{1 + 10^{(1600 - 2000)/400}}=10/11$.
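The Elo arithmetic is easy to check numerically. The short sketch below (the function names are my own) computes the win probability both ways, once from the rating difference and once from the Bradley-Terry scores $s = 10^{R/400}$, for the 2000 versus 1600 example.

```python
def elo_win_probability(rating_a, rating_b):
    """Probability that A beats B, written in terms of the rating difference."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_score(rating):
    """Bradley-Terry score implied by an Elo rating: s = 10^(R / 400)."""
    return 10 ** (rating / 400)

def bradley_terry_probability(score_a, score_b):
    """Probability that A beats B, written as A's score over the combined scores."""
    return score_a / (score_a + score_b)

p_elo = elo_win_probability(2000, 1600)
p_bt = bradley_terry_probability(elo_score(2000), elo_score(1600))
print(p_elo, p_bt)  # both forms agree, and both are 10/11 up to rounding
```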
Under the Bradley-Terry model, it is common to parameterize the score as $s=e^r$, where $r$ stands for reward. The term “reward” is borrowed from the world of reinforcement learning, where greater rewards are received for a more desirable series of actions, similar to achieving a higher score for performing better in a video game.
With this parameterization, our model starts to look pretty nice: a simple difference in reward values passed through the logistic function.
$$p^*(i \succ j) = \frac{e^{r_i}}{e^{r_i} + e^{r_j}} = \frac{1}{1 + e^{-(r_i - r_j)}} = \sigma(r_i - r_j)$$
The logistic function is an S-shaped (or sigmoid) function commonly denoted using $\sigma(x)$. It frequently appears when working with probabilities because it can “squash” values in $\mathbb{R}$ (the set of all real numbers) into $(0, 1)$ (the set of probability values, excluding exactly 0 or 1).
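As a quick sanity check, the sketch below confirms numerically that the score form with $s=e^r$ and the logistic form give the same probability. The reward values are arbitrary numbers chosen for illustration. It also shows a property that becomes important later in this post: adding the same constant to both rewards leaves the preference probability unchanged, because only the difference in rewards matters.

```python
import math

def sigmoid(x):
    """The logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

def bradley_terry_from_rewards(r_i, r_j):
    """Score form of the model, with scores parameterized as s = e^r."""
    s_i, s_j = math.exp(r_i), math.exp(r_j)
    return s_i / (s_i + s_j)

r_i, r_j = 1.5, -0.5  # arbitrary example rewards

p_scores = bradley_terry_from_rewards(r_i, r_j)
p_logistic = sigmoid(r_i - r_j)

# Shifting both rewards by the same constant does not change the probability.
p_shifted = sigmoid((r_i + 3.0) - (r_j + 3.0))
print(p_scores, p_logistic, p_shifted)
```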
Applying the Bradley-Terry Model to LLMs
Now, we want to take the Bradley-Terry model and leverage it alongside a dataset of preferences in order to improve our LLM’s generated outputs.
In our preference dataset ($\mathcal{D}$), we have two completions for each prompt, and we want to model the probability of one completion being preferred over the other. In a sense, each completion elicits some reward based on its quality, and our ultimate goal will be to nudge our LLM to produce completions that are of higher quality. Therefore, we will parameterize the reward using our LLM. We will call this reward $r^*(x, y)$, which just means that the reward is a function of the context/prompt ($x$) and the completion ($y$).
So after adapting our preference model to use our parameterized reward function, we have:
$$p^*(y_1 \succ y_2 \mid x) = \sigma\left(r^*(x, y_1) - r^*(x, y_2)\right)$$
But talking in terms of optimal solutions and rewards does us no good, since we do not have access to the optimal reward function. In practice, it is common to learn a reward model $r_\phi(x, y)$ that mimics the optimal reward function. We can estimate the parameters $\phi$ of this reward model by framing this as a binary classification problem where our objective is to minimize the following negative log-likelihood loss function on our preference dataset $\mathcal{D}$:
$$\mathcal{L}_R(r_\phi, \mathcal{D}) = -\mathbb{E}_{(x,y_w,y_l)\sim \mathcal{D}}\left[\log \sigma\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]$$
Here, $\mathbb{E}_{(x,y_w,y_l)\sim \mathcal{D}}[f(x,y_w,y_l)]$ is just a formal way of saying "the expected value of function $f$ on data points sampled from our preference dataset".
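Here is a minimal sketch of that negative log-likelihood, assuming we already have the scalar rewards a reward model assigned to each preferred and less desirable completion. The reward numbers are invented for illustration; in a real setup they would come from a neural network and the loss would be minimized with automatic differentiation.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def reward_model_nll(reward_pairs):
    """Mean negative log-likelihood over (reward of preferred, reward of rejected) pairs."""
    losses = [-log_sigmoid(r_w - r_l) for r_w, r_l in reward_pairs]
    return sum(losses) / len(losses)

# Hypothetical reward-model outputs for two preference pairs.
agrees_with_labels = [(2.0, -1.0), (0.5, -0.5)]     # preferred completion scored higher
contradicts_labels = [(-1.0, 2.0), (-0.5, 0.5)]     # preferred completion scored lower
uninformed = [(0.0, 0.0), (0.0, 0.0)]               # no opinion: probability 1/2 each

loss_good = reward_model_nll(agrees_with_labels)
loss_bad = reward_model_nll(contradicts_labels)
loss_uninformed = reward_model_nll(uninformed)  # equals log(2)
print(loss_good, loss_uninformed, loss_bad)
```

The loss falls as the reward model ranks the preferred completion higher, which is exactly the binary classification framing described above.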
Under the RLHF framework, we could leverage this learned reward model in a reinforcement learning setting to optimize an LLM to output completions that achieve high rewards. However, DPO takes a different tack: instead of the two-stage RLHF process, DPO reparameterizes the Bradley-Terry model so that we can use a similar loss function to directly optimize the parameters of our LLM such that it produces outputs that are preferred by human observers.
The probability of a completion
At this point, the idea of optimizing LLMs based on preferences or rewards may feel fairly abstract. So we’re going to take a moment to introduce a new probability function, $\pi(y \mid x)$, that represents the literal output of our LLM. In reinforcement learning notation, $\pi$ indicates a policy (i.e. a strategy), and policies are optimized to maximize reward. Specifically, $\pi_\theta(y \mid x)$ is the probability of generating the completion $y$ based on an LLM with parameters $\theta$ given that we start with prompt $x$.
What do we mean by "the probability of generating the completion $y$"? Our LLM is an autoregressive text generator, and, upon each autoregressive step, it computes a probability value for every word in its vocabulary. (In practice, modern LLMs operate on tokens, not words. For our purposes, the difference doesn’t really matter. You can learn more by playing with an online tokenizer demo or digging through Karpathy’s minbpe repo.)
So, proceeding in order through every word in completion $y$, we compute the probability of the next word in the completion given all of the preceding words. Now, we have a probability value for every word in the completion! So we can compute the joint probability of generating the sequence of words as the product of the individual probabilities of observing each word along the way:
$$\pi_\theta(y \mid x) = \prod_{t=1}^{|y|} p_\theta(y_t \mid x, y_{<t})$$
where $y_t$ is the $t$-th word of the completion and $y_{<t}$ denotes the words that come before it. Multiplying probabilities can result in numerical underflow. It is common to instead work with logprobs: $\prod_i p_i=e^{\sum_i \log p_i}$. Since every term in the summation of logprobs increases the magnitude of its output, underflow is avoided. OpenAI has a nice guide to using token logprobs returned by an LLM.
Another way to think about it is that there is a tree of possible completions and we are computing the probability of tracing one specific path from the root (end of the prompt) to a leaf (stop token).
When training, we know the entire text completion ahead of time, so, by applying a causal attention mask, we can calculate all of the individual next-word probabilities (and thus $\pi_\theta(y \mid x)$) via a single forward pass through our LLM.
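The sketch below illustrates the joint probability of a completion and why logprobs are the safer representation. The per-word probabilities are invented; in practice they would be read off the model's output distribution at each position.

```python
import math

def sequence_logprob(word_probs):
    """Log of the joint probability: the sum of the per-word log-probabilities."""
    return sum(math.log(p) for p in word_probs)

# Hypothetical next-word probabilities for a three-word completion.
short_completion = [0.5, 0.25, 0.8]
joint = math.prod(short_completion)            # direct product of probabilities
logprob = sequence_logprob(short_completion)   # same quantity in log space

# A long completion made of unlikely words: the direct product underflows,
# while the sum of logprobs remains an ordinary finite number.
long_completion = [0.01] * 400
underflowed = math.prod(long_completion)
long_logprob = sequence_logprob(long_completion)
print(joint, math.exp(logprob), underflowed, long_logprob)
```

The direct product for the long completion is far smaller than the smallest positive number a 64-bit float can represent, so it collapses to zero, whereas the logprob is still usable in a loss function.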
Optimizing our LLM based on preferences
OK, now that we’ve got our framework in place, let us remind ourselves of our goal: to improve the outputs of our LLM. Stated another way, we want the completion ($y$) our LLM provides for a prompt ($x$) to generate a large reward $r(x, y)$. With this in mind, we can formulate an optimization problem where we want to find the parameters of our LLM ($\theta$) that maximize our expected reward for prompts similar to those we see in practice:
$$\max_{\pi_\theta} \mathbb{E}_{x\sim \mathcal{D},y\sim \pi_\theta(y \mid x)}\left[r(x, y)\right]$$
Here, $\mathbb{E}_{x\sim \mathcal{D},y\sim \pi_\theta(y \mid x)}[r(x, y)]$ is just a formal way of saying "the expected reward attained by completions generated/sampled from our model ($y\sim \pi_\theta(y \mid x)$) based on prompts sampled from our dataset ($x\sim \mathcal{D}$)".
This is a bit too simplistic, however. In practice, we start with the parameters of our fine-tuned base model, and we have some belief that the outputs generated by our fine-tuned base model are pretty good, so we don’t want the outputs of our model to change too much unless they improve the reward significantly. With that in mind, we amend our optimization problem to include a regularization constraint to help enforce this belief:
$$\max_{\pi_\theta} \mathbb{E}_{x\sim \mathcal{D},y\sim \pi_\theta(y \mid x)}\left[r(x, y)\right] - \beta\,\mathbb{D}_{KL}\left[\pi_\theta(y \mid x) \Vert \pi_{ref}(y \mid x)\right]$$
$\mathbb{D}_{KL}[P \Vert Q]$ is the Kullback-Leibler divergence, a statistical distance measure. It quantifies how the probability distribution $P$ differs from probability distribution $Q$. This constraint based on the KL divergence just encodes the idea that we want to penalize outputs from our model ($\pi_\theta$) based on how much they differ from outputs from the fine-tuned model (i.e. the reference model) we started with ($\pi_{ref}$). $\beta$ is a scalar hyperparameter that controls the strength of the constraint. KL divergence is one of many traditional methods for regularizing an RL agent’s policy. In the cases of DPO and RLHF, it is a natural choice because we begin with a strong reference policy at hand: the LLM output by our fine-tuning procedure.
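To get a feel for how $\beta$ trades reward against staying close to the reference model, here is a toy sketch for a single prompt with only three possible completions. All of the numbers are made up, and a real LLM has far too many possible completions to enumerate like this.

```python
import math

def kl_divergence(p, q):
    """D_KL[P || Q] for two discrete distributions given as lists of probabilities."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

def regularized_objective(policy, reference, rewards, beta):
    """Expected reward minus beta times the KL penalty, for a single prompt."""
    expected_reward = sum(p_i * r_i for p_i, r_i in zip(policy, rewards))
    return expected_reward - beta * kl_divergence(policy, reference)

# Hypothetical toy setup: three candidate completions for one prompt.
reference = [0.5, 0.3, 0.2]      # pi_ref(y | x)
rewards = [0.0, 1.0, 2.0]        # r(x, y)
aggressive = [0.01, 0.01, 0.98]  # a policy that chases the highest reward

for beta in (0.1, 2.0):
    stay = regularized_objective(reference, reference, rewards, beta)
    move = regularized_objective(aggressive, reference, rewards, beta)
    print(f"beta={beta}: stay at reference={stay:.3f}, aggressive policy={move:.3f}")
```

With a small $\beta$ the aggressive policy scores higher because the penalty is cheap; with a large $\beta$ the penalty dominates and staying at the reference policy scores higher.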
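Now, we want to derive the optimal solution to this optimization problem. This will rely on Gibbs’ inequality: the fact that $\mathbb{D}_{KL}[P \Vert Q]\geq0$, with $\mathbb{D}_{KL}[P \Vert Q]=0$ if and only if $P=Q$. The intuition here is that the KL divergence is a distance measure (kind of), and there is no distance between $P$ and $Q$ if they are equal, and there must be some distance if they are not equal. Writing the KL divergence as an expectation, dividing through by $-\beta$ (which turns the maximization into a minimization), and then multiplying and dividing inside the logarithm by a normalizing term $Z(x)$, we get:
$$\begin{aligned}
&\max_{\pi_\theta} \mathbb{E}_{x\sim \mathcal{D},y\sim \pi_\theta(y \mid x)}\left[r(x, y)\right] - \beta\,\mathbb{D}_{KL}\left[\pi_\theta(y \mid x) \Vert \pi_{ref}(y \mid x)\right] \\
&= \max_{\pi_\theta} \mathbb{E}_{x\sim \mathcal{D}}\mathbb{E}_{y\sim \pi_\theta(y \mid x)}\left[r(x, y) - \beta\log\frac{\pi_\theta(y \mid x)}{\pi_{ref}(y \mid x)}\right] \\
&= \min_{\pi_\theta} \mathbb{E}_{x\sim \mathcal{D}}\mathbb{E}_{y\sim \pi_\theta(y \mid x)}\left[\log\frac{\pi_\theta(y \mid x)}{\pi_{ref}(y \mid x)} - \frac{1}{\beta}r(x, y)\right] \\
&= \min_{\pi_\theta} \mathbb{E}_{x\sim \mathcal{D}}\mathbb{E}_{y\sim \pi_\theta(y \mid x)}\left[\log\frac{\pi_\theta(y \mid x)}{\frac{1}{Z(x)}\pi_{ref}(y \mid x)e^{\frac{1}{\beta}r(x,y)}} - \log Z(x)\right]
\end{aligned}$$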
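where $Z(x)=\sum_y\pi_{ref}(y \mid x)e^{\frac{1}{\beta}r(x,y)}$. Importantly, this $Z(x)$ term depends only on $x$ and $\pi_{ref}$ and not on $y$ or $\pi_\theta$. This lets us do a bit of reorganizing from where we just left off:
$$\min_{\pi_\theta} \mathbb{E}_{x\sim \mathcal{D}}\left[\mathbb{D}_{KL}\left(\pi_\theta(y \mid x)\ \Vert\ \frac{1}{Z(x)}\pi_{ref}(y \mid x)e^{\frac{1}{\beta}r(x,y)}\right) - \log Z(x)\right]$$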
And we have nearly arrived! Since $Z(x)$ does not depend on $\pi_\theta$, we can just ignore it when deriving the optimal solution. We can now use Gibbs’ inequality as mentioned above: $\mathbb{D}_{KL}\left(\pi_\theta(y \mid x)\ \Vert\ \frac{1}{Z(x)}\pi_{ref}(y \mid x)e^{\frac{1}{\beta}r(x,y)}\right)$ is minimized at zero if, and only if, the two distributions on either side of $\Vert$ are identical. So, the optimal solution (denoted as $\pi^*$) to our optimization problem for all $x \in \mathcal{D}$ is:
$$\pi^*(y \mid x) = \frac{1}{Z(x)}\pi_{ref}(y \mid x)e^{\frac{1}{\beta}r(x,y)}$$
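We can sanity check this result numerically on a toy problem where the sum over completions is small enough to compute. The sketch below, with made-up rewards and a made-up reference policy over three completions, builds the closed-form solution (the reference policy reweighted by $e^{r/\beta}$ and normalized by $Z(x)$) and compares its objective value against a thousand randomly drawn policies. This is only an illustration, not a proof.

```python
import math
import random

def kl_divergence(p, q):
    """D_KL[P || Q] for two discrete distributions given as lists of probabilities."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

def objective(policy, reference, rewards, beta):
    """Expected reward minus beta times the KL divergence from the reference policy."""
    expected_reward = sum(p_i * r_i for p_i, r_i in zip(policy, rewards))
    return expected_reward - beta * kl_divergence(policy, reference)

def closed_form_policy(reference, rewards, beta):
    """Reweight the reference policy by exp(r / beta) and normalize by Z."""
    weights = [q_i * math.exp(r_i / beta) for q_i, r_i in zip(reference, rewards)]
    z = sum(weights)
    return [w / z for w in weights], z

def random_policy(n, rng):
    """A random probability distribution over n completions."""
    raw = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(raw)
    return [v / total for v in raw]

# Hypothetical toy setup: one prompt, three possible completions.
reference = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
beta = 0.5

pi_star, z = closed_form_policy(reference, rewards, beta)
best = objective(pi_star, reference, rewards, beta)

rng = random.Random(0)
challengers = [objective(random_policy(3, rng), reference, rewards, beta) for _ in range(1000)]
print(best, max(challengers), beta * math.log(z))
```

None of the random policies beats the closed-form one, and the objective value it attains equals $\beta \log Z(x)$, which is what remains of the objective once the KL term is driven to zero.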
Direct Preference Optimization
So we know the optimal solution to our optimization problem, but can we access it? No. The term $Z(x)=\sum_y\pi_{ref}(y \mid x)e^{\frac{1}{\beta}r(x,y)}$ is intractable: computing it requires summing over every possible string of words.
Instead, we can reorganize the optimal solution from above such that we express the reward function in terms of its optimal policy $\pi^*$, the reference policy $\pi_{ref}$, and the intractable function $Z$:
$$r(x, y) = \beta\log\frac{\pi^*(y \mid x)}{\pi_{ref}(y \mid x)} + \beta\log Z(x)$$
This same reorganization can be applied using the underlying ground-truth reward $r^*$ and its corresponding optimal policy $\pi^*$.
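Now here comes the clever trick noticed by the authors of DPO. We can use this reorganized expression of the optimal solution to our optimization problem to reparameterize the Bradley-Terry preference model from above so that it is expressed in terms of an optimal policy $\pi^*$ and not in terms of an underlying reward function! And even better, once we plug everything in, we notice that the intractable $Z(x)$ function cancels out!
$$\begin{aligned}
p^*(y_1 \succ y_2 \mid x) &= \sigma\left(r^*(x, y_1) - r^*(x, y_2)\right) \\
&= \sigma\left(\beta\log\frac{\pi^*(y_1 \mid x)}{\pi_{ref}(y_1 \mid x)} + \beta\log Z(x) - \beta\log\frac{\pi^*(y_2 \mid x)}{\pi_{ref}(y_2 \mid x)} - \beta\log Z(x)\right) \\
&= \sigma\left(\beta\log\frac{\pi^*(y_1 \mid x)}{\pi_{ref}(y_1 \mid x)} - \beta\log\frac{\pi^*(y_2 \mid x)}{\pi_{ref}(y_2 \mid x)}\right)
\end{aligned}$$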
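Now, with our reparameterized Bradley-Terry model, we can use supervised learning to directly learn a policy that mimics the optimal policy. We can minimize a negative log-likelihood loss function over our preference dataset $\mathcal{D}$ to estimate the parameters of our policy $\pi_\theta$:
$$\mathcal{L}_{DPO}(\pi_\theta; \pi_{ref}) = -\mathbb{E}_{(x,y_w,y_l)\sim \mathcal{D}}\left[\log \sigma\left(\beta\log\frac{\pi_\theta(y_w \mid x)}{\pi_{ref}(y_w \mid x)} - \beta\log\frac{\pi_\theta(y_l \mid x)}{\pi_{ref}(y_l \mid x)}\right)\right]$$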
Recall that above we optimized a negative log-likelihood loss to estimate the parameters of a reward model that was then used downstream by RLHF to estimate the parameters of a policy model. But now we are directly optimizing the parameters of our LLM policy model based on human preferences! Thus, Direct Preference Optimization.
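To show how little machinery the DPO objective needs, here is a minimal sketch of the loss for a single preference pair. It assumes we already have the sequence logprobs of the preferred and less desirable completions under both the policy and the reference model; the specific numbers and the function names are invented for illustration. In a real training loop these logprobs would come from forward passes through the two LLMs, and a deep learning framework would compute the gradients with respect to the policy's parameters.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta):
    """DPO loss for one preference pair, computed from sequence log-probabilities."""
    preferred_logratio = policy_logp_w - ref_logp_w  # log pi_theta(y_w|x) - log pi_ref(y_w|x)
    rejected_logratio = policy_logp_l - ref_logp_l   # log pi_theta(y_l|x) - log pi_ref(y_l|x)
    return -log_sigmoid(beta * (preferred_logratio - rejected_logratio))

# Hypothetical sequence logprobs under the reference model.
ref_w, ref_l = -12.0, -10.0

# Policy identical to the reference: both log-ratios are zero, so the loss is log(2).
loss_at_start = dpo_loss(ref_w, ref_l, ref_w, ref_l, beta=0.1)

# Policy that has made the preferred completion more likely and the other less likely.
loss_improved = dpo_loss(-11.0, -11.0, ref_w, ref_l, beta=0.1)

# Policy that has moved in the wrong direction.
loss_worse = dpo_loss(-13.0, -9.0, ref_w, ref_l, beta=0.1)
print(loss_improved, loss_at_start, loss_worse)
```

Notice that no reward model and no sampling from the policy appear anywhere: the loss depends only on logprobs of completions that are already in the dataset.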
To be explicit about the benefits of DPO over RLHF:
 We avoid the need to train a reward model to estimate human preferences.
 We avoid needing to perform any type of reinforcement learning, which is notoriously difficult and requires a lot of tribal knowledge to get right.
 We can directly optimize our LLM on human preferences using supervised learning, which is a much more straightforward and well-understood process.
The avoidance of reinforcement learning is particularly important. DPO has made preference-tuning a much more accessible process for practitioners who may not have the time, resources, or expertise to navigate the complexities of reinforcement learning.
Properties and Caveats of DPO
One of the key properties of DPO is that when the Bradley-Terry model perfectly fits our preference data and RLHF learns the optimal reward function, then the global optimizer of RLHF and DPO is the same.
This is an important equivalence result; however, in practice:
 The Bradley-Terry model often does not perfectly fit the preference data. For example, a preference cycle would cause the Bradley-Terry model to fail to perfectly fit the data. The Bradley-Terry model assumes transitive preferences. For example, if $A \succ B$ and $B \succ C$ then it expects that $A \succ C$. But if instead $C \succ A$, then there is a cycle and transitivity is broken.
 The reward function learned by RLHF will not be the optimal reward function.
 Gradient descent on a highly non-convex loss landscape, such as that of an LLM, is not guaranteed to find the global optimizer.
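The preference cycle mentioned in the first point can be illustrated with a small experiment. In the sketch below (a toy random search, not a proof), we try many random reward assignments for three outcomes and record the best "worst case" probability across the three cyclic preferences $A \succ B$, $B \succ C$, and $C \succ A$. Because the three reward differences always sum to zero, at least one of them is never positive, so at least one of the three probabilities can never exceed one half.

```python
import math
import random

def sigmoid(x):
    """The logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

def cycle_probabilities(r_a, r_b, r_c):
    """Bradley-Terry probabilities for the cyclic preferences A > B, B > C, C > A."""
    return (sigmoid(r_a - r_b), sigmoid(r_b - r_c), sigmoid(r_c - r_a))

rng = random.Random(0)
best_worst_case = 0.0
for _ in range(10000):
    rewards = [rng.uniform(-5.0, 5.0) for _ in range(3)]
    # The weakest of the three probabilities under this reward assignment.
    worst = min(cycle_probabilities(*rewards))
    best_worst_case = max(best_worst_case, worst)

print(best_worst_case)  # never above 0.5, no matter which rewards we try
```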
Another weakness of DPO is that it is prone to overfitting due to a lack of regularization. Azar et al. provide a compelling example (the original notation of the quote has been adjusted slightly to match the rest of this post):
Consider the simple example where we have two actions $y_1$ and $y_2$ such that $p^*(y_1 \succ y_2)=1$, i.e., $y_1$ is always preferred to $y_2$. Then the Bradley-Terry model would require that $(r(y_1)-r(y_2))\rightarrow+\infty$ to [be satisfied]. If we plug this into the optimal policy then we would get that $\frac{\pi^*(y_2)}{\pi^*(y_1)}=0$ (i.e. $\pi^*(y_2)=0$) … Thus the strength of the KL-regularization becomes weaker and weaker the more deterministic the preferences.
They also point out that, in practice, we have a finite amount of preference data. Therefore, we are likely to empirically estimate $\hat{p}(y_1 \succ y_2)=1$ simply because we’ve only seen a small number of comparisons between $y_1$ and $y_2$. The empirical optimal policy would then push $\pi(y_2)$ to $0$ regardless of the regularization term that is attempting to keep the policy similar to our reference policy.
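This behavior is easy to reproduce in a toy setting. The sketch below is my own construction rather than an experiment from Azar et al.: it assumes a policy over just two actions, parameterized by a single number $\theta$ with $\pi(y_1)=\sigma(\theta)$, a uniform reference policy, and a dataset in which $y_1$ is always preferred. Under those assumptions the DPO loss simplifies to $-\log\sigma(\beta\theta)$, whose gradient never reaches zero, so plain gradient descent keeps shrinking $\pi(y_2)$ for as long as we train.

```python
import math

def sigmoid(x):
    """The logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

def train_two_action_policy(beta, steps, learning_rate=1.0):
    """Gradient descent on the DPO loss when y_1 is always preferred to y_2.

    Toy assumptions: pi(y_1) = sigmoid(theta), pi(y_2) = sigmoid(-theta), and a
    uniform reference policy, so the loss is -log(sigmoid(beta * theta)).
    Returns pi(y_2) after training.
    """
    theta = 0.0  # start at the reference policy
    for _ in range(steps):
        gradient = -beta * (1.0 - sigmoid(beta * theta))  # d(loss) / d(theta)
        theta -= learning_rate * gradient
    return sigmoid(-theta)

results = {}
for beta in (0.1, 0.5):
    results[beta] = [train_two_action_policy(beta, steps) for steps in (10, 100, 1000)]
    print(f"beta={beta}: pi(y_2) after 10, 100, 1000 steps = {results[beta]}")
```

For both values of $\beta$, the probability of the dispreferred action keeps falling as training continues, instead of settling at a value near the reference policy's $0.5$.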
Despite these shortcomings, DPO is a highly effective tool; at the time of writing, many of the most successful and performant open-source LLMs were instruction-tuned using DPO.
Interested in learning more?
I highly recommend reading the DPO paper. In this post, we’ve done a deep dive into the derivation of the DPO objective, but the paper covers other points of interest, such as experimental results and additional theoretical properties.
And if you’re interested in learning more about preference-tuning in general, here are additional resources that provide a deeper dive into the topic:
 OpenAI’s post on aligning language models to follow human instructions (and the InstructGPT paper)
 HuggingFace’s post on fine-tuning Llama 2 with DPO
 Direct Nash Optimization, a recently proposed approach, avoids using the Bradley-Terry model altogether since the Bradley-Terry model fails to express complex intransitive or cyclic preference relations.
References
[1] Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv. https://arxiv.org/abs/2305.18290.
[2] Bertrand, Q., Czarnecki, W. M., & Gidel, G. (2023). On the limitations of Elo: Real-world games are transitive, not additive. arXiv. https://arxiv.org/abs/2206.12301.
[3] Azar, M. G., Rowland, M., Piot, B., Guo, D., Calandriello, D., Valko, M., & Munos, R. (2023). A General Theoretical Paradigm to Understand Learning from Human Preferences. arXiv. https://arxiv.org/abs/2310.12036.
[4] Jitkrittum, W. (2013). Log-Sum-Exp Trick to Prevent Numerical Underflow. http://wittawat.com/posts/logsum_exp_underflow.html
[5] Gemini Team (2024). Gemini: A Family of Highly Capable Multimodal Models. arXiv. https://arxiv.org/abs/2312.11805.
[6] Andrychowicz, M., Raichuk, A., Stańczyk, P., Orsini, M., Girgin, S., Marinier, R., Hussenot, L., Geist, M., Pietquin, O., Michalski, M., Gelly, S., & Bachem, O. (2020). What Matters In On-Policy Reinforcement Learning? A Large-Scale Empirical Study. arXiv. https://arxiv.org/abs/2006.05990.