Direct Preference Optimization Explained In-depth

Simpler preference-tuning without reinforcement learning

April 2024

With my first blog post, I want to cover an excellent paper that was published last year: Direct Preference Optimization: Your Language Model is Secretly a Reward Model by Rafailov et al.

Commonly referred to as DPO, this method of preference tuning is an alternative to Reinforcement Learning from Human Feedback (RLHF) that avoids the actual reinforcement learning. In this blog post, I will explain DPO from first principles; readers do not need an understanding of RLHF. However, fair warning that there will be some math involved - mostly probability, algebra, and optimization - but I will do my best to explain everything clearly.

Training, tuning, and aligning LLMs

To contextualize DPO, and preference-tuning in general, let’s review the modern process for creating language models such as ChatGPT or Claude. The following steps are sequential, with each one building upon the previous:

  1. Pre-train a base model on internet-scale data. Given a snippet of text, this model is trained to predict the immediate next word. This conceptually simple task scales up extremely well and allows LLMs to encode a huge amount of knowledge from their training data. Examples of base models include GPT-3, Llama 3, and Mistral.

  2. Take a pre-trained base model and fine-tune it on a task-specific dataset of demonstrations. For example, if you are trying to create a helpful dialog model like ChatGPT, you would want to tune your model on a dataset of conversational dialog, so that your model’s outputs sound more like parts of a conversation and less like a Wikipedia page. In this stage, we still use the next word prediction task, and the fine-tuning procedure updates our model to make predictions that more closely align with the high-quality task-specific examples we are feeding it. Examples of fine-tuned models in this stage are Alpaca, Vicuna and Mistral-Instruct.

  3. Finally, we fine-tune the model based on human preferences. Human preferences are powerful because they are so easily and cheaply expressed. Think of how easy it is to compare two movies and pick a favorite, yet how difficult it would be to make a film that embodies the qualities that draw you to the theater. Similarly, it is challenging to describe exactly how we want our model to behave (as we attempt to do in step 2), but given examples of model behavior it is straightforward to indicate a preference for a specific type of behavior. For a while, this sort of preference-tuning was done using RLHF. Recently, RLHF has been somewhat supplanted by DPO due to the relative simplicity of the latter. LLMs that have been tuned using human preferences include Llama 3 Instruct, GPT-4, Claude 3 Opus, and Gemini Ultra.

The Gemini whitepaper provides a nice visual representation of these stages:

LLM Training Stages

Tuning LLMs on preference data

It is hard, time-consuming work to create high-quality demonstrations of the behavior we want our LLM to mimic, and it would be expensive to hire labelers to help us create such data. However, once we have a model that is “good enough” at demonstrating the desired behavior, we can shift into high gear. Given a prompt, we can sample two different responses from our LLM by injecting a small amount of randomness (typically by generating text with a temperature greater than zero; there are lovely interactive demos that visualize how temperature affects model outputs). Now it is cheap and easy to have a labeler express a preference for one of the two completions.
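To make this concrete, here is a minimal sketch of sampling two candidate completions for the same prompt with the Hugging Face transformers library. The checkpoint name is a placeholder for whatever fine-tuned model you are collecting preferences for.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint name; any causal LM checkpoint works the same way.
model_name = "my-finetuned-model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Explain why the sky is blue."
inputs = tokenizer(prompt, return_tensors="pt")

# Sample two different completions by enabling temperature-based sampling.
outputs = model.generate(
    **inputs,
    do_sample=True,          # sample instead of greedy decoding
    temperature=0.8,         # temperature > 0 injects randomness
    max_new_tokens=128,
    num_return_sequences=2,  # two candidate completions to compare
)

# Strip the prompt tokens and decode the two completions.
completion_a, completion_b = tokenizer.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
# completion_a / completion_b can now be shown to a labeler for comparison.
```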

While using ChatGPT or Gemini, you may have noticed that you will occasionally be asked to choose between two similar answers from which to continue your conversation. This preference is recorded and used to improve the model in a future round of preference-tuning. Similarly, Chatbot Arena collects preference data for the purpose of rating LLMs based on human assessments:

LMSys Chatbot Arena, a head-to-head comparison tool for instruction-tuned LLMs

There are many publicly available preference datasets, such as LMSys’ Chatbot Arena Conversations dataset, OpenAI’s WebGPT Comparisons dataset, and Anthropic’s Helpfulness-Harmlessness RLHF dataset (explicit/offensive content warning).

Formally, these datasets can be expressed as follows:

$$\mathcal{D}=\{x^{(i)},y_w^{(i)},y_l^{(i)}\}_{i=1}^N$$

where $x$ is the context/prompt, $y_w$ is the preferred completion, and $y_l$ is the less desirable completion.
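For concreteness, a single record in such a dataset might look like the following sketch (the field names are illustrative and do not correspond to any particular dataset's schema):

```python
# A minimal sketch of what one record in a preference dataset D might look like.
preference_example = {
    "prompt": "Summarize the plot of Hamlet in two sentences.",      # x
    "chosen": "Prince Hamlet seeks revenge ... (preferred answer)",   # y_w
    "rejected": "Hamlet is a play by Shakespeare. The end.",          # y_l
}

# The full dataset is just a collection of N such (x, y_w, y_l) triples.
dataset = [preference_example]
```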

The Bradley-Terry Model

So what do we do with all this preference data? We want to leverage it to modify our LLM to output responses that better conform to the preferences. To begin, let us explore a simple probability model:

$$p^*(i \succ j) = \frac{s_i}{s_i + s_j}$$

This is the Bradley-Terry model, a model for the outcomes of pairwise comparisons. In plain English, it says: we model the true probability that outcome $i$ is preferred to outcome $j$ as the score of $i$ over the combined scores of $i$ and $j$. (The “star” in $p^*$ indicates that we are modeling the true underlying distribution of human preferences. Likewise, we will shortly see $r^*$, the true underlying reward function that grades our completions, and $\pi^*$, the optimal policy we want our LLM to mimic.)

Readers may be familiar with the Bradley-Terry model from the context of Elo ratings, which are popular in chess and other competitive games. The Bradley-Terry model is a generalization of the Elo rating system, in which the probability of player A beating player B is given by $p(A \succ B) = \frac{1}{1 + 10^{(R_B-R_A)/400}} = \frac{s_A}{s_A + s_B}$, where $R$ is a player's rating and $s = 10^{R/400}$. (For example, if player A's Elo rating is 2000 and player B's is 1600, then player A is expected to be 10 times more likely to win than player B, because $p(A \succ B)=\frac{1}{1 + 10^{(1600-2000)/400}}=10/11$.)

Under the Bradley-Terry model, it is common to parameterize the score as $s=e^r$, where $r$ stands for reward. The term “reward” is borrowed from the world of reinforcement learning, where greater rewards are received for a more desirable series of actions - similar to achieving a higher score for playing better in a video game.

With this parameterization, our model starts to look pretty nice - a simple difference in reward values passed through the logistic function. (The logistic function is an S-shaped, or sigmoid, function commonly denoted $\sigma(x)$. It frequently appears when working with probabilities because it “squashes” values in $\mathbb{R}$, the set of all real numbers, into $(0, 1)$, the set of probability values excluding exactly 0 and 1.)

$$p^*(i \succ j) = \frac{s_i}{s_i + s_j} = \frac{e^{r^*_i}}{e^{r^*_i} + e^{r^*_j}} = \frac{1}{1+e^{-(r^*_i-r^*_j)}} = \sigma(r^*_i - r^*_j)$$
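Here is a quick numerical sanity check of that equivalence; the reward values are arbitrary and the helper functions are just for illustration:

```python
import math

def bradley_terry_prob(r_i: float, r_j: float) -> float:
    """P(i beats j) under Bradley-Terry with scores s = exp(r)."""
    s_i, s_j = math.exp(r_i), math.exp(r_j)
    return s_i / (s_i + s_j)

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# The two forms agree: the score ratio equals the sigmoid of the reward difference.
r_i, r_j = 1.5, 0.3
print(bradley_terry_prob(r_i, r_j))  # ~0.7685
print(sigmoid(r_i - r_j))            # ~0.7685
```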

Applying the Bradley-Terry Model to LLMs

Now, we want to take the Bradley-Terry model and leverage it alongside a dataset of preferences in order to improve our LLM’s generated outputs.

In our preference dataset ($\mathcal{D}$), each example contains two completions, and we want to model the probability of one completion being preferred over the other. In a sense, each completion elicits some reward based on its quality, and our ultimate goal will be to nudge our LLM to produce completions that are of higher quality. Therefore, we will eventually parameterize the reward using our LLM. We will call this reward $r^*(x, y)$, which just means that the reward is a function of the context/prompt ($x$) and the completion ($y$).

So after adapting our preference model to use this reward function, we have:

$$p^*(y_1 \succ y_2 | x) = \sigma(r^*(x, y_1) - r^*(x, y_2))$$

But talking in terms of optimal solutions and rewards does us no good, since we do not have access to the optimal reward function. In practice, it is common to learn a reward model $r_\phi(x, y)$ that mimics the optimal reward function. We can estimate the parameters $\phi$ of this reward model by framing this as a binary classification problem in which our objective is to minimize the following negative log-likelihood loss function over our preference dataset $\mathcal{D}$. (The notation $\mathbb{E}_{(x,y_w,y_l)\sim \mathcal{D}}[f(x,y_w,y_l)]$ is just a formal way of saying “the expected value of the function $f$ on data points sampled from our preference dataset”.)

$$\mathcal{L}_R(r_\phi, \mathcal{D}) = -\mathbb{E}_{(x,y_w,y_l)\sim \mathcal{D}}[\log(\sigma(r_\phi(x,y_w) - r_\phi(x, y_l)))]$$
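As a concrete illustration, here is a minimal PyTorch sketch of this loss. It assumes you already have scalar rewards for the chosen and rejected completion of each pair; how the reward model itself is architected is out of scope here.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """
    Negative log-likelihood of the Bradley-Terry model.

    r_chosen / r_rejected: shape (batch,), scalar rewards r_phi(x, y_w) and
    r_phi(x, y_l) produced by a reward model for each preference pair.
    """
    # -log(sigmoid(r_w - r_l)), averaged over the batch.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy example: the loss shrinks as the chosen reward pulls ahead.
print(reward_model_loss(torch.tensor([1.0]), torch.tensor([0.0])))  # ~0.313
print(reward_model_loss(torch.tensor([3.0]), torch.tensor([0.0])))  # ~0.049
```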

Under the RLHF framework, we could leverage this learned reward model in a reinforcement learning setting to optimize an LLM to output completions that achieve high rewards. However, DPO takes a different tack - instead of the two-stage RLHF process, DPO reparameterizes the Bradley-Terry model so that we can use a similar loss function to directly optimize the parameters of our LLM such that it produces outputs that are preferred by human observers.

The probability of a completion

At this point, the idea of optimizing LLMs based on preferences or rewards may feel fairly abstract. So we're going to take a moment to introduce a new probability function, $\pi(y|x)$, that represents the literal output of our LLM. In reinforcement learning notation, $\pi$ indicates a policy (i.e. a strategy), and policies are optimized to maximize reward. Specifically, $\pi_\theta(y|x)$ is the probability of generating the completion $y$ with an LLM with parameters $\theta$, given that we start with prompt $x$.

What do we mean by “the probability of generating the completion $y$”? Our LLM is an auto-regressive text generator: at each auto-regressive step, it computes a probability value for every word in its vocabulary. (In practice, modern LLMs operate on tokens, not words, but for our purposes the difference doesn't really matter. You can learn more by playing with an online tokenizer demo or digging through Karpathy's minbpe repo.)

Next Word Prediction Graphic

So - proceeding in order through every word in completion $y$ - we compute the probability of the next word given all of the preceding words. Now we have a probability value for every word in the completion, so we can compute the joint probability of generating the sequence of words as the product of the individual probabilities of observing each word along the way:

$$\pi_\theta(y|x)=\prod_{t=0}^{|y|}p_{LLM_\theta}(y_t|x,y_{0:t})$$

(Multiplying many probabilities together can result in numerical underflow. It is common to instead work with logprobs: $\prod_i p_i=e^{\sum_i \log p_i}$. Since each term in the summation of logprobs only increases the magnitude of the sum, underflow is avoided. OpenAI has a nice guide to using the token logprobs returned by an LLM.)

Another way to think about it is that there is a tree of possible completions and we are computing the probability of tracing one specific path from the root (end of the prompt) to a leaf (stop-token).

Probability of Sequence Graphic

When training, we know the entire text completion ahead of time, so, by applying a causal attention mask, we can calculate all of the individual next-word probabilities (and thus $\pi_\theta(y|x)$) via a single forward pass through our LLM.
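Here is a minimal PyTorch-style sketch of that computation. It assumes a Hugging Face-style causal LM whose forward pass returns logits of shape (batch, sequence length, vocabulary size); the function name is my own, not from the paper.

```python
import torch
import torch.nn.functional as F

def sequence_logprob(model, prompt_ids: torch.Tensor, completion_ids: torch.Tensor) -> torch.Tensor:
    """
    log pi_theta(y|x): the sum of next-token log-probabilities of the completion,
    computed with one forward pass over the concatenated (prompt, completion).
    """
    input_ids = torch.cat([prompt_ids, completion_ids], dim=1)
    logits = model(input_ids).logits           # (batch, seq_len, vocab)

    # The logit at position t predicts the token at position t+1,
    # so shift by one to line predictions up with their targets.
    logprobs = F.log_softmax(logits[:, :-1, :], dim=-1)
    targets = input_ids[:, 1:]

    # Pick out the log-probability assigned to each actual next token.
    per_token = torch.gather(logprobs, 2, targets.unsqueeze(-1)).squeeze(-1)

    # Keep only the positions that correspond to completion tokens.
    completion_len = completion_ids.shape[1]
    return per_token[:, -completion_len:].sum(dim=-1)  # log pi_theta(y|x)
```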

Optimizing our LLM based on preferences

Ok, now we have our framework in place. Let us remind ourselves of our goal: to improve the outputs of our LLM. Stated another way, we want the completion ($y$) our LLM provides for a prompt ($x$) to earn a large reward $r(x, y)$. With this in mind, we can formulate an optimization problem in which we want to find the parameters of our LLM ($\theta$) that maximize the expected reward for prompts similar to those we see in practice. (The notation $\mathbb{E}_{x\sim \mathcal{D},y\sim \pi_\theta(y|x)}[r(x, y)]$ is just a formal way of saying “the expected reward attained by completions sampled from our model, $y\sim \pi_\theta(y|x)$, for prompts sampled from our dataset, $x\sim \mathcal{D}$”.)

$$\max_{\theta}\mathbb{E}_{x\sim \mathcal{D},y\sim \pi_\theta(y|x)}[r(x, y)]$$

This is a bit too simplistic, however. In practice, we start with the parameters of our fine-tuned base model, and we have some belief that the outputs generated by our fine-tuned base model are pretty good, so we don’t want the outputs of our model to change too much unless they improve the reward significantly. With that in mind, we amend our optimization problem to include a regularization constraint to help enforce this belief.

$$\max_{\theta}\mathbb{E}_{x\sim \mathcal{D},y\sim \pi_\theta(y|x)}[r(x, y)] - \beta\mathbb{D}_{KL}[\pi_\theta(y|x) \ \Vert \ \pi_{ref}(y|x)]$$

$\mathbb{D}_{KL}[P \Vert Q]$ is the Kullback-Leibler divergence, a statistical distance measure that quantifies how the probability distribution $P$ differs from the probability distribution $Q$. (KL divergence is one of many traditional methods for regularizing an RL agent's policy. For DPO and RLHF it is a natural choice because we begin with a strong reference policy at hand - the LLM output by our fine-tuning procedure.) This KL-divergence constraint encodes the idea that we want to penalize outputs from our model ($\pi_\theta$) according to how much they differ from outputs of the fine-tuned model we started with, i.e. the reference model ($\pi_{ref}$). $\beta$ is a scalar hyperparameter that controls the strength of the constraint.
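To build a little intuition before the derivation, here is a tiny sketch (the function name and arguments are my own, not from the paper) of what this objective looks like for a single completion sampled from $\pi_\theta$: the KL term reduces to the log-ratio of the two policies evaluated at that completion, which is exactly the form that appears inside the expectation below.

```python
import torch

def regularized_objective_sample(r_xy: torch.Tensor,
                                 logp_theta: torch.Tensor,
                                 logp_ref: torch.Tensor,
                                 beta: float) -> torch.Tensor:
    """
    Single-sample (Monte Carlo) view of r(x, y) - beta * KL[pi_theta || pi_ref],
    where the KL term is estimated by log pi_theta(y|x) - log pi_ref(y|x)
    for a completion y sampled from pi_theta.
    """
    return r_xy - beta * (logp_theta - logp_ref)
```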

Now we want to derive the optimal solution to this optimization problem. The derivation relies on Gibbs' inequality: the fact that $\mathbb{D}_{KL}[P \Vert Q]\geq 0$, with $\mathbb{D}_{KL}[P \Vert Q]=0$ if and only if $P=Q$. (The intuition is that the KL divergence is, roughly, a distance measure: there is no distance between $P$ and $Q$ if they are equal, and there must be some distance if they are not.)

$$
\max_{\pi_\theta}\mathbb{E}_{x\sim \mathcal{D},y\sim \pi_\theta(y|x)}[r(x, y)] - \beta\mathbb{D}_{KL}\left[\pi_\theta(y|x) \ \Vert \ \pi_{ref}(y|x)\right] \\[10pt]
=\max_{\pi_\theta}\mathbb{E}_{x\sim \mathcal{D},y\sim \pi_\theta(y|x)}[r(x, y)] - \beta\mathbb{E}_{y\sim \pi_\theta(y|x)}\left[\log\frac{\pi_\theta(y|x)}{\pi_{ref}(y|x)}\right] \\[10pt]
=\max_{\pi_\theta}\mathbb{E}_{x\sim \mathcal{D}}\mathbb{E}_{y\sim \pi_\theta(y|x)}\left[r(x,y) - \beta\log\frac{\pi_\theta(y|x)}{\pi_{ref}(y|x)}\right] \\[10pt]
=\min_{\pi_\theta}\mathbb{E}_{x\sim \mathcal{D}}\mathbb{E}_{y\sim \pi_\theta(y|x)}\left[\log\frac{\pi_\theta(y|x)}{\pi_{ref}(y|x)} - \frac{1}{\beta}r(x,y)\right] \\[10pt]
=\min_{\pi_\theta}\mathbb{E}_{x\sim \mathcal{D}}\mathbb{E}_{y\sim \pi_\theta(y|x)}\left[\log\frac{\pi_\theta(y|x)}{\frac{1}{Z(x)}\pi_{ref}(y|x)e^{\frac{1}{\beta}r(x,y)}} - \log Z(x)\right] = \dots
$$

where $Z(x)=\sum_y\pi_{ref}(y|x)e^{\frac{1}{\beta}r(x,y)}$. Importantly, this $Z(x)$ term depends only on $x$ and $\pi_{ref}$, not on $y$ or $\pi_\theta$. This lets us do a bit of reorganizing from where we just left off.

$$
\dots = \min_{\pi_\theta}\mathbb{E}_{x\sim \mathcal{D}}\left[\mathbb{E}_{y\sim \pi_\theta(y|x)}\left[\log\frac{\pi_\theta(y|x)}{\frac{1}{Z(x)}\pi_{ref}(y|x)e^{\frac{1}{\beta}r(x,y)}}\right] - \log Z(x)\right] \\[10pt]
= \min_{\pi_\theta}\mathbb{E}_{x\sim \mathcal{D}}\left[\mathbb{D}_{KL}\left(\pi_\theta(y|x)\ \Vert\ \frac{1}{Z(x)}\pi_{ref}(y|x)e^{\frac{1}{\beta}r(x,y)}\right) - \log Z(x)\right]
$$

And we have nearly arrived! Since $Z(x)$ does not depend on $\pi_\theta$, we can simply ignore it when deriving the optimal solution. We can now use Gibbs' inequality as mentioned above: $\mathbb{D}_{KL}\left(\pi_\theta(y|x)\ \Vert\ \frac{1}{Z(x)}\pi_{ref}(y|x)e^{\frac{1}{\beta}r(x,y)}\right)$ is minimized at zero if, and only if, the two distributions on either side of $\Vert$ are identical. So the optimal solution (denoted $\pi^*$) to our optimization problem, for all $x \in \mathcal{D}$, is:

$$\pi^*(y|x)=\pi_\theta(y|x)=\frac{1}{Z(x)}\pi_{ref}(y|x)e^{\frac{1}{\beta}r(x,y)}$$

Direct Preference Optimization

So we know the optimal solution to our optimization problem, but can we access it? No. The term $Z(x)=\sum_y\pi_{ref}(y|x)e^{\frac{1}{\beta}r(x,y)}$ is intractable - computing it requires summing over every possible string of words.

Instead, we can reorganize the optimal solution from above such that we express the reward function in terms of the optimal policy $\pi_\theta$, the reference policy $\pi_{ref}$, and the intractable function $Z$:

$$r(x,y) = \beta\log{\frac{\pi_\theta(y|x)}{\pi_{ref}(y|x)}} + \beta\log{Z(x)}$$

This same reorganization can be applied using the underlying ground-truth reward $r^*$ and its corresponding optimal policy $\pi^*$.

$$r^*(x,y) = \beta\log{\frac{\pi^*(y|x)}{\pi_{ref}(y|x)}} + \beta\log{Z(x)}$$

Now here comes the clever trick noticed by the authors of DPO. We can use this reorganized expression of the optimal solution to our optimization problem to reparameterize the Bradley-Terry preference model from above so that it is expressed in terms of an optimal policy $\pi^*$ and not in terms of an underlying reward function! And even better, once we plug everything in, we notice that the intractable $Z(x)$ function cancels out!

$$
p^*(y_1 \succ y_2 | x) = \sigma(r^*(x, y_1) - r^*(x, y_2)) \\[10pt]
= \sigma\left(\beta\log{\frac{\pi^*(y_1|x)}{\pi_{ref}(y_1|x)}} + \beta\log{Z(x)} - \left(\beta\log{\frac{\pi^*(y_2|x)}{\pi_{ref}(y_2|x)}} + \beta\log{Z(x)}\right)\right) \\[10pt]
= \sigma\left(\beta\log{\frac{\pi^*(y_1|x)}{\pi_{ref}(y_1|x)}} - \beta\log{\frac{\pi^*(y_2|x)}{\pi_{ref}(y_2|x)}}\right)
$$

Now, with our reparameterized Bradley-Terry model, we can use supervised learning to directly learn a policy that mimics the optimal policy. We can minimize a negative log-likelihood loss function over our preference dataset $\mathcal{D}$ to estimate the parameters of our policy $\pi_\theta$:

$$
\mathcal{L}_{DPO}(\pi_\theta;\pi_{ref}) = -\mathbb{E}_{(y_w,y_l,x)\sim \mathcal{D}}\left[\log\left(\sigma\left(\beta\log{\frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)}} - \beta\log{\frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}}\right)\right)\right] \\[10pt]
= -\mathbb{E}_{(y_w,y_l,x)\sim \mathcal{D}}\left[\log\left(\sigma\left(\beta\left(\log{\frac{\pi_\theta(y_w|x)}{\pi_\theta(y_l|x)}} - \log{\frac{\pi_{ref}(y_w|x)}{\pi_{ref}(y_l|x)}}\right)\right)\right)\right]
$$

Recall that above we optimized a negative log-likelihood loss to estimate the parameters of a reward model that was then used downstream by RLHF to estimate the parameters of a policy model. But now we are directly optimizing the parameters of our LLM policy model based on human preferences! Thus, Direct Preference Optimization.
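To make the objective concrete, here is a minimal PyTorch sketch of the DPO loss. It assumes you have already computed sequence log-probabilities for the chosen and rejected completions under both the policy being trained and the frozen reference model (for example, with a function like the sequence_logprob sketch above).

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_theta_chosen: torch.Tensor,
             logp_theta_rejected: torch.Tensor,
             logp_ref_chosen: torch.Tensor,
             logp_ref_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """
    A minimal sketch of the DPO loss. Each argument is a (batch,) tensor of
    sequence log-probabilities log pi(y|x); the reference log-probs are
    computed once with the frozen reference model (no gradients).
    """
    chosen_ratio = logp_theta_chosen - logp_ref_chosen        # log(pi_theta / pi_ref) for y_w
    rejected_ratio = logp_theta_rejected - logp_ref_rejected  # log(pi_theta / pi_ref) for y_l

    # -log sigmoid(beta * (chosen_ratio - rejected_ratio)), averaged over the batch.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```

In practice, you would likely reach for an existing implementation, such as the DPOTrainer in Hugging Face's TRL library, rather than rolling your own.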

RLHF vs. DPO Graphic

To be explicit about the benefits of DPO over RLHF:

  1. We avoid the need to train a reward model to estimate human preferences.
  2. We avoid needing to perform any type of reinforcement learning, which is notoriously difficult and requires a lot of tribal knowledge to get right.
  3. We can directly optimize our LLM on human preferences using supervised learning, which is a much more straightforward and well-understood process.

The avoidance of reinforcement learning is particularly important. DPO has made preference-tuning a much more accessible process for practitioners who may not have the time, resources, or expertise to navigate the complexities of reinforcement learning.

Properties and Caveats of DPO

One of the key properties of DPO is that when the Bradley-Terry model perfectly fits our preference data and RLHF learns the optimal reward function, the global optimizer of RLHF and DPO is the same.

This is an important equivalence result; however, in practice:

  1. The Bradley-Terry model often does not perfectly fit the preference data. (For example, the Bradley-Terry model assumes transitive preferences: if $A \succ B$ and $B \succ C$, it expects $A \succ C$. If instead $C \succ A$, there is a preference cycle, transitivity is broken, and the model cannot fit the data perfectly.)
  2. The reward function learned by RLHF will not be the optimal reward function.
  3. Gradient descent on a highly non-convex loss landscape - such as that of an LLM - is not guaranteed to find the global optimizer.

Another weakness of DPO is that it is prone to overfitting due to a lack of regularization. Azar et al. provide a compelling example (the notation of the quote has been adjusted slightly to match the rest of this post):

Consider the simple example where we have two actions $y_1$ and $y_2$ such that $p^*(y_1 \succ y_2)=1$, i.e., $y_1$ is always preferred to $y_2$. Then the Bradley-Terry model would require that $(r(y_1)-r(y_2))\rightarrow+\infty$ to [be satisfied]. If we plug this into the optimal policy then we would get that $\frac{\pi^*(y_2)}{\pi^*(y_1)}=0$ (i.e. $\pi^*(y_2)=0$) … Thus the strength of the KL-regularization becomes weaker and weaker the more deterministic the preferences.

They also point out that, in practice, we have a finite amount of preference data. We are therefore likely to empirically estimate $\hat{p}(y_1 \succ y_2)=1$ simply because we have only seen a small number of comparisons between $y_1$ and $y_2$, and the empirically optimal policy would then push $\pi(y_2)=0$ regardless of the regularization term that is trying to keep the policy close to our reference policy.

Despite these shortcomings, DPO is a highly effective tool; at the time of writing, many of the most successful and performant open-source LLMs were preference-tuned using DPO.

Interested in learning more?

I highly recommend reading the DPO paper. In this post, we’ve done a deep dive into the derivation of the DPO objective, but the paper covers other points of interest, such as experimental results and additional theoretical properties.

And if you're interested in learning more about preference-tuning in general, the references below are a good place to dig deeper.

References

[1] Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv. https://arxiv.org/abs/2305.18290.

[2] Bertrand, Q., Czarnecki, W. M., & Gidel, G. (2023). On the limitations of Elo: Real-world games are transitive, not additive. arXiv. https://arxiv.org/abs/2206.12301.

[3] Azar, M. G., Rowland, M., Piot, B., Guo, D., Calandriello, D., Valko, M., & Munos, R. (2023). A General Theoretical Paradigm to Understand Learning from Human Preferences. arXiv. https://arxiv.org/abs/2310.12036.

[4] Jitkrittum, W. (2013). Log-Sum-Exp Trick to Prevent Numerical Underflow. http://wittawat.com/posts/log-sum_exp_underflow.html

[5] Gemini Team (2024). Gemini: A Family of Highly Capable Multimodal Models. arXiv. https://arxiv.org/abs/2312.11805.

[6] Andrychowicz, M., Raichuk, A., Stańczyk, P., Orsini, M., Girgin, S., Marinier, R., Hussenot, L., Geist, M., Pietquin, O., Michalski, M., Gelly, S., & Bachem, O. (2020). What Matters In On-Policy Reinforcement Learning? A Large-Scale Empirical Study. arXiv. https://arxiv.org/abs/2006.05990.