Thinking about Scaling Laws

How can we use scaling laws to train stronger LLMs?

January 10, 2026

Note: This post is a work in progress. I will continue to update and expand it over time.

Scaling laws are one of the few tools we have for predicting model performance before committing serious compute. When a single training run can cost millions of dollars and take months, the ability to extrapolate from small-scale experiments becomes invaluable. In this post, I want to explore how we can use scaling laws not just as descriptive summaries, but as decision-making frameworks for comparing architectural choices and planning training runs.

This post assumes familiarity with the seminal papers Scaling Laws for Neural Language Models (Kaplan et al.) and Training Compute-Optimal Large Language Models (Hoffmann et al.). If you haven’t read them, I’d recommend at least skimming Hoffmann et al. (the “Chinchilla” paper) first.

The Chinchilla Scaling Law

The scaling laws from Hoffmann et al., also known as the “Chinchilla” scaling laws, are the basis for the billions of dollars being spent on training larger models on more data. They are empirical predictions—validated across many orders of magnitude of scale—about how much compute (in terms of data and model size) is required to lower loss. Chinchilla also demonstrates that model size and data should be scaled up at roughly the same rate, giving rise to the well-known heuristic of “20 tokens per parameter” as the optimal ratio.

The Chinchilla paper uses three separate approaches to arrive at the same “compute-optimal” scaling results. In particular, the third approach models the final loss of a language model as a function of model size N and training data D in the following form:

L(N, D) = E + A/N^α + B/D^β

Where:

- E is the irreducible loss: an estimate of the entropy of natural text, which no amount of scaling can reduce.
- A and α govern the model-size term: A sets its magnitude and α how quickly it decays as N grows.
- B and β play the same roles for the training-data term as D grows.

The parameters of this function are fit to empirical data.

The fitted values from the Chinchilla paper are approximately α ≈ 0.34, β ≈ 0.28, E ≈ 1.69, A ≈ 406.4, and B ≈ 410.7.
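To make this concrete, here is a minimal sketch (in Python, using the fitted values above) of evaluating the parametric loss; the 70B/1.4T example point roughly matches Chinchilla's own training configuration:

```python
# A minimal sketch: evaluating the Chinchilla parametric loss L(N, D)
# with the approximate fitted values reported in Hoffmann et al.

E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted final training loss for a model with n_params
    parameters trained on n_tokens tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

# Example: a 70B-parameter model trained on 1.4T tokens
# (roughly the Chinchilla configuration) -> about 1.94.
print(chinchilla_loss(70e9, 1.4e12))
```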

The key insight comes from minimizing this loss function subject to a fixed compute budget C. Since compute scales roughly as C ≈ 6ND (the approximate number of FLOPs for a forward and backward pass through a model with N parameters on D tokens), we can derive the optimal allocation of compute between model size and data.

Substituting D = C/(6N), taking the derivative with respect to N, and setting it to zero, we find that the optimal scaling satisfies:

N_opt ∝ C^a,  D_opt ∝ C^b

where a = β/(α+β) and b = α/(α+β). Using the fitted values from Chinchilla (α ≈ 0.34, β ≈ 0.28), we get a ≈ 0.45 and b ≈ 0.55. This means that as compute increases, we should scale data slightly faster than model size.
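Carrying the algebra through gives a closed form: N_opt = G · (C/6)^a with G = (αA/(βB))^(1/(α+β)), and D_opt = C/(6 · N_opt). Here is a small sketch of this allocation in code; the 10^24 FLOP budget is just an illustrative number:

```python
# A minimal sketch: compute-optimal N and D for a fixed FLOP budget,
# derived from the first-order condition of L(N, D) with D = C/(6N).

A, B = 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def optimal_allocation(flops: float) -> tuple[float, float]:
    a = BETA / (ALPHA + BETA)  # ~0.45, exponent for N_opt
    G = (ALPHA * A / (BETA * B)) ** (1 / (ALPHA + BETA))
    n_opt = G * (flops / 6) ** a
    d_opt = flops / (6 * n_opt)  # enforce C = 6ND exactly
    return n_opt, d_opt

# Illustrative budget: 1e24 FLOPs.
n, d = optimal_allocation(1e24)
print(f"N_opt = {n:.3g} params, D_opt = {d:.3g} tokens")
```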

Decision Making with Scaling Laws

The reason I dive into Chinchilla’s third approach is that a fitted function capable of predicting loss as a function of N and D is incredibly powerful for making real decisions about model development.

Let’s say we have a baseline model and a modeling change or intervention we want to test. Many researchers will test the change at one to three model scales on one or two training dataset sizes, then conclude that their approach beats the baseline on the back of those results. However, this does not tell the full story. A more rigorous way to run the experiment is to fit empirical scaling laws to results across many model scales and dataset sizes, and then compare the fitted laws to determine in which regimes the approach is better or worse than the baseline.
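As a sketch of what that fitting step can look like: the Chinchilla paper fits the parametric form by minimizing a Huber loss on log-space residuals with L-BFGS from many initializations. The function below follows that recipe loosely; `fit_scaling_law` and the `(n_params, n_tokens, final_loss)` layout of `runs` are my own illustrative choices, not a standard API.

```python
# A rough sketch of fitting L(N, D) = E + A/N^alpha + B/D^beta to
# measured (N, D, loss) triples, loosely following the Chinchilla
# recipe: Huber loss on log-space residuals, L-BFGS, multiple inits.
import numpy as np
from scipy.optimize import minimize
from scipy.special import huber, logsumexp

def fit_scaling_law(runs):
    """runs: iterable of (n_params, n_tokens, final_loss) triples."""
    N, D, L = map(np.asarray, zip(*runs))

    def objective(theta):
        log_a, log_b, log_e, alpha, beta = theta
        # Predict log-loss via logsumexp for numerical stability:
        # log(exp(log_a - alpha*log N) + exp(log_b - beta*log D) + exp(log_e))
        pred = logsumexp(
            [log_a - alpha * np.log(N),
             log_b - beta * np.log(D),
             np.broadcast_to(log_e, L.shape)],
            axis=0)
        # Elementwise Huber loss (delta = 1e-3) on log-space residuals.
        return huber(1e-3, pred - np.log(L)).sum()

    # L-BFGS from several random initializations; keep the best fit.
    rng = np.random.default_rng(0)
    inits = rng.uniform([0, 0, -1, 0, 0], [10, 10, 1, 1, 1], size=(32, 5))
    best = min((minimize(objective, x0, method="L-BFGS-B") for x0 in inits),
               key=lambda r: r.fun)
    log_a, log_b, log_e, alpha, beta = best.x
    return dict(A=np.exp(log_a), B=np.exp(log_b), E=np.exp(log_e),
                alpha=alpha, beta=beta)
```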

Scaling Behavior: Efficiency vs. Scalability

It’s important to distinguish between three different aspects of model performance:

1. Offsets (the A and B coefficients): how efficiently a model turns parameters and tokens into loss reduction. Lower offsets shift the loss curve down at every scale.

2. Scalability (the α and β exponents): how quickly the loss falls as N and D grow. Higher exponents mean the advantage widens with scale.

3. Irreducible loss (E): the loss floor that neither more parameters nor more data can push below.

The Crossover Problem

A variant that has better offsets but scales worse presents an interesting tradeoff: it wins at small scale, but its advantage shrinks as compute grows, and beyond some crossover scale the baseline pulls ahead. If all of your experiments sit below the crossover, a naive comparison will pick the wrong design for the production run.
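Given fitted coefficients for both models, finding the crossover is mechanical. The sketch below uses hypothetical variant coefficients (lower A and B, slightly lower α and β) and sweeps model size along a fixed 20-tokens-per-parameter trajectory; with these made-up numbers the crossover lands between roughly 300M and 1B parameters.

```python
# A minimal sketch: locating the crossover between a baseline and a
# variant scaling law. The variant's coefficients are hypothetical:
# better offsets (lower A, B) but slightly worse exponents.
import numpy as np

def loss(n, d, E, A, B, alpha, beta):
    return E + A / n**alpha + B / d**beta

baseline = dict(E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28)
variant = dict(E=1.69, A=200.0, B=220.0, alpha=0.31, beta=0.25)

# Sweep model size from 100M to 1T params at 20 tokens per parameter.
for n in np.logspace(8, 12, num=9):
    d = 20 * n
    gap = loss(n, d, **variant) - loss(n, d, **baseline)
    verdict = "variant wins" if gap < 0 else "baseline wins"
    print(f"N = {n:.0e}: {verdict} (gap = {gap:+.4f})")
```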

When Scaling Differences Matter

Small differences in α or β compound significantly at scale: an exponent gap that is within noise at experimental scale can dominate the comparison several orders of magnitude later.
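To put numbers on this: holding A fixed, increasing α by just 0.02 shrinks the model-size term A/N^α by an extra factor of N^0.02, which is about 1.45x at N = 100M but about 1.66x at N = 100B. A gap that is barely measurable in small-scale experiments can decide the comparison at target scale.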

The table below gives practical guidance on how to interpret differences in the fitted scaling parameters when comparing a variant against a baseline.

| α, β vs. baseline | A, B vs. baseline | E vs. baseline | Verdict |
| --- | --- | --- | --- |
| Higher (scales better) | Lower (better offsets) | Lower | Best case: wins at all scales |
| Higher (scales better) | Lower (better offsets) | Higher | Wins at most scales, but baseline may catch up at very large scale due to the E floor |
| Higher (scales better) | Higher (worse offsets) | Lower | Likely wins at large scale (better scaling plus a lower floor) |
| Higher (scales better) | Higher (worse offsets) | Higher | Mixed: scaling helps but E hurts; depends on target scale |
| Similar | Lower (better offsets) | Lower | Good: consistent gains at all scales |
| Similar | Lower (better offsets) | Higher | Wins at small/medium scale, may lose at very large scale |
| Similar | Higher (worse offsets) | Lower | May recover at very large scale due to the lower E |
| Similar | Higher (worse offsets) | Higher | Bad: loses at all scales |
| Lower (scales worse) | Lower (better offsets) | Lower | Complex tradeoff: E helps at large scale but worse α/β hurts |
| Lower (scales worse) | Lower (better offsets) | Higher | Wins at small scale only; a crossover point exists |
| Lower (scales worse) | Higher (worse offsets) | Lower | Only hope is very large scale, where E dominates |
| Lower (scales worse) | Higher (worse offsets) | Higher | Worst case: loses at all scales |

Targeting Specific Model and Dataset Sizes

When pretraining models, labs tend to target a handful of standard sizes; for example, there are many 7B, 32B, and 70B parameter model families. Labs also know roughly how many useful tokens they have available for pretraining (or how many GPU hours can be allotted to a specific run).

Using a chosen model and dataset size (e.g., N = 32B, D = 10T), we can use our formula to predict the final training loss of a given model architecture based on our empirical scaling law fit. This makes decision making straightforward: whichever intervention is predicted to achieve the lowest loss at the target scale is the one we should use. However, this still ignores other important considerations, such as inference-time efficiency, long-context performance, training throughput, and training stability; a complete evaluation framework would need to weigh these factors alongside the raw scaling predictions. For example, fast inference is critical for scaling up RL post-training and directly impacts the end-user experience.
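As a sketch, the comparison at a fixed target scale is just two evaluations of the fitted law; the variant's coefficients below are hypothetical placeholders for your own fit:

```python
# A minimal sketch: comparing two fitted scaling laws at a fixed
# target scale. The variant's coefficients are hypothetical.

TARGET_N = 32e9   # 32B parameters
TARGET_D = 10e12  # 10T tokens

def predicted_loss(E, A, B, alpha, beta):
    return E + A / TARGET_N**alpha + B / TARGET_D**beta

baseline = predicted_loss(E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28)
variant = predicted_loss(E=1.69, A=380.0, B=395.0, alpha=0.33, beta=0.28)
print(f"baseline: {baseline:.4f}, variant: {variant:.4f}")
print("train the variant" if variant < baseline else "keep the baseline")
```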

Getting strong scaling law fits

Coming soon.

[Figure from the xLSTM Scaling Laws paper.]