What is In-Distribution

February 9, 2026


One of the persistent questions in model development is whether reasoning actually involves… reasoning. As in: are we seeing actual logical conclusions, or just better recall of knowledge and patterns from the training set? LLMs are trained on, roughly, the web, which makes answering that question tricky: almost everything shows up in some form. A model that appears to “reason” through a physics problem could just be pattern-matching an irritated Reddit reply it saw during training.

“On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models” takes a look at this question methodically.

To this end, we build a fully controlled framework that isolates the contributions of each training stage. Our design is based on three principles: (i) fully controllable synthetic reasoning tasks with explicit atomic operations and DAG-defined dependency structure; (ii) observable, parseable reasoning processes enabling process-level evaluation and reducing reward or evaluation hacking; and (iii) systematic manipulation of pre-/mid-/post-training distributions to attribute causal effects to each stage.

The authors break the problem of reasoning and training data down along two dimensions.

1) Breadth-wise: can the model generalize from one type of problem to another (structurally similar) one in a different domain?
2) Depth-wise: can the model reason correctly for longer, and hence solve harder problems?

Rather than train on the internet, they build synthetic math-puzzle reasoning tasks using a dependency-graph framework inspired by GSM-Infinite. By varying the depth of the reasoning chains required, and by generating structurally equivalent tasks across different domains, they try to tease apart those two aspects and investigate them separately.
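To make the setup concrete, here is a minimal sketch of that kind of generator. It is not the paper's code: the `DOMAINS` word lists, the `Problem` type and `make_problem` are all made up for illustration, and the dependency graph is just a chain (the simplest possible DAG). The point is that depth is a dial you can turn, and that the same underlying structure can be re-skinned in a different domain.

```python
import random
from dataclasses import dataclass

# Hypothetical surface vocabularies: same arithmetic structure, different wording.
DOMAINS = {
    "zoo": ["lions", "zebras", "owls", "otters", "geckos", "herons"],
    "school": ["desks", "chairs", "pens", "maps", "books", "lamps"],
}


@dataclass
class Problem:
    text: str
    answer: int
    depth: int  # number of atomic operations needed to reach the answer


def make_problem(depth: int, domain: str, rng: random.Random) -> Problem:
    """Build a chain-shaped dependency graph with `depth` atomic operations."""
    names = DOMAINS[domain]
    if not 1 <= depth < len(names):
        raise ValueError(f"depth must be between 1 and {len(names) - 1}")

    value = rng.randint(1, 9)
    lines = [f"There are {value} {names[0]}."]
    for i in range(1, depth + 1):
        # Each new quantity depends only on the previous one.
        op = rng.choice(["add", "mul"])
        k = rng.randint(2, 5)
        if op == "add":
            value += k
            lines.append(f"There are {k} more {names[i]} than {names[i - 1]}.")
        else:
            value *= k
            lines.append(f"There are {k} times as many {names[i]} as {names[i - 1]}.")
    lines.append(f"How many {names[depth]} are there?")
    return Problem(" ".join(lines), value, depth)


# Same seed, different domain: identical structure and answer, different surface form.
zoo = make_problem(3, "zoo", random.Random(0))
school = make_problem(3, "school", random.Random(0))
```

Holding the seed fixed and swapping the domain gives a breadth test; holding the domain fixed and raising `depth` gives a depth test.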

For the breadth side the model needs to generalize, to transfer learning across domains. The paper finds that the target domain has to be “in-distribution”: the model has to have seen at least some examples of it in the pre-training set. They test this using pass@128: if you give the pre-trained model 128 attempts, does it get the answer right even once? If so, you can use reinforcement learning or SFT to make the model reliably better.

It’s a bit like having studied Spanish at some point and forgotten albóndigas, the word for meatballs. If, for dietary preference reasons, you came to use that word regularly, it would likely lodge itself in your brain more easily and you’d go from a lowish chance of getting it right to a much higher one.

The paper is saying you must have this baseline in there for RL to amplify. Daniel Han of Unsloth describes this by saying that with RL “luck is all you need”. If the model never gets the answer right, there is nothing much to reinforce (and you are stuck with paella).
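A tiny sketch of why a zero success rate leaves RL with nothing to do. This assumes a group-relative baseline in the style of GRPO with a 0/1 correctness reward, which is my assumption and not something stated above; `group_advantages` is a made-up helper name.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each sample's reward against the other samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Eight attempts, none correct: every advantage is zero, so there is no update signal.
never_right = group_advantages([0.0] * 8)

# Eight attempts, one lucky hit: the hit gets a positive advantage, the misses a negative one.
one_lucky = group_advantages([0.0] * 7 + [1.0])
```

With all-zero rewards every sample looks exactly as good as every other, so the policy has no direction to move in. A single correct sample is enough to break the tie.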

Depth, on the other hand, does seem to be something we can kinda make up in post-training. Even if a model has only been pre-trained on problems up to a certain complexity, post-training on harder problems consistently enables it to solve them. The model is able to compose more complex patterns based on the simpler ones in its training set¹. To continue our tortured analogy, this is more like being reminded of several Spanish words and, over time, learning to stick them together into actual sentences.

Practically this means your pre-training data is a bet on what the model will ever be able to reason about, and post-training refines how well and how hard it can think within those domains.

That approach also gives a useful tool for identifying whether something is in-distribution. If you want to know whether a model can learn a new capability through post-training, check pass@128 first. If it never gets the answer right in 128 attempts, you probably have a pre-training gap, not an RL problem.
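As a rough sketch of that check: the estimator below is the standard unbiased pass@k formula from the Codex paper (Chen et al., 2021), which may or may not be exactly how this paper computes it. The `diagnose` helper, its labels and the `floor` threshold are made up for illustration, and the per-problem correct counts are assumed to come from your own sampling and grading loop.

```python
from math import comb
from statistics import mean


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of one problem, c of which were correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Too few failures to fill k draws, so any k draws include a correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def diagnose(correct_counts: list[int], n: int = 128, k: int = 128,
             floor: float = 0.0) -> tuple[float, str]:
    """Average pass@k over problems and attach a (hypothetical) label."""
    score = mean(pass_at_k(n, c, k) for c in correct_counts)
    if score <= floor:
        return score, "likely pre-training gap"
    return score, "candidate for post-training"


# Correct counts out of 128 samples, one entry per problem in the target domain.
print(diagnose([0, 0, 0]))     # never right on any problem
print(diagnose([0, 3, 0, 1]))  # occasionally right on half the problems
```

With n equal to k, as here, pass@k per problem collapses to "was it right at least once", which is the version described above. Sampling more than k attempts per problem gives a smoother estimate.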

¹ The paper also spends a while justifying curriculum training: giving the model problems just on the edge of its capabilities before introducing harder ones. Recent work from the FAIR Paris folks and others shows you can somewhat automate this by generating problems from the same model you are training!