Nathan Lambert of the Allen Institute writes about their (very strong) Olmo 2 32B release, and the just-released Gemma 3 model from Google. One of the many interesting points:
Comparing Gemma 3 27B to OLMo 32B, the pretraining evaluations for both are super similar, but Gemma 3 scores are way better after post-training. The ceiling on post-training expectations has been shifting extremely fast among open models.
Given that Google have about the best crawling infrastructure in the world, and that AI2 have published the complete pretraining dataset used for Olmo, I think this is slightly surprising. You can see the benchmarks in the blog and technical report: for example, Gemma 3 27B gets 78.8 on Winogrande from pretraining (a little below Gemma 2, as it happens) while Olmo 2 32B gets 78.7.
The vibes have definitely shifted to post-training for where model differentiation is coming from, opening the question of what exactly is happening there. Nathan also posted about that recently, linking to this post by Mohit Ragavendra of Scale and Georgia Tech:
The post looks at The Superficial Alignment Hypothesis, which is (largely) that post-training is just about preference tuning for behaviors the base model can already do.
[…]
It initially seems like “Less Is More” in the sense that the LIMA model response was highly preferred by the GPT-4 evaluator for Math prompts (in-line with the work’s original claim). However, these model responses were also largely incorrect – the accuracy of models fine-tuned specifically for Math was substantially better, with the same data budget. If we went by subjective win-rate comparisons, we would have picked a model that was significantly worse.
In the post (and the two linked papers) Mohit breaks down how post-training actually helps. Starting with SFT, the work shows that mimicking style happens quickly, with relatively few samples.
with just a hundred finetuning examples, the model’s formatting mistakes were virtually solved – the model was perfect at mimicking the expected style. However, the model took a lot more supervised finetuning data to get better at reasoning – the substance of the task.
They find, though, that it is largely more-is-more when it comes to SFT, with a power-law style scaling curve: big gains initially, followed by slower, marginal gains. Adding in RL doesn’t change the fundamental curve, but it does shift it, getting the model more efficiently to the reasoning capabilities they were training towards:
Preference data offers a weaker signal compared to supervised finetuning data. So, running DPO directly on the base model on reasoning tasks, is asking the model to learn a completely different response style from its reference model, with a weaker signal, while penalizing for being different from the reference model. Small amount of SFT on the base model teaches it the reasoning style and PFT can use the reward signal to focus on reasoning within the required response space.
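To make the "penalizing for being different from the reference model" part a bit more concrete, here is a minimal sketch of the standard DPO loss for a single preference pair, in plain Python. The function and argument names are mine rather than anything from Mohit's post or the papers, `beta=0.1` is just a commonly used default, and the log-probabilities in the usage lines are made-up numbers for illustration only.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # Implicit rewards: how far the policy has moved away from the
    # reference model on each response, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # The loss only sees the margin between the two implicit rewards.
    return -log_sigmoid(chosen_reward - rejected_reward)


# Hypothetical numbers: the policy starts out identical to the reference,
# so the margin is zero and the loss is log(2).
at_start = dpo_loss(-12.0, -10.0, -12.0, -10.0)

# Hypothetical numbers: the policy has shifted probability towards the
# chosen response and away from the rejected one, so the loss drops.
after_training = dpo_loss(-9.0, -13.0, -12.0, -10.0)

print(at_start, after_training)
```

Everything here is measured relative to the reference model: the only signal is the difference between two log-ratios. That is at least consistent with the quote's point that a reference which already produces the right response style gives the preference signal something more useful to work against.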
I did wonder when reading this whether the results would look different with an online process (like PPO), rather than an offline one. Luckily, Mohit links to another recent paper on this topic:
We prove that under idealized assumptions, online and offline PFT techniques should return policies of equivalent quality
but also
we observe that despite the lack of information-theoretic separation, online PFT out-performs offline PFT across different sampling distributions, labelers, and model sizes. Furthermore, it appears to be “easier” to learn a global RM than it is to learn a local RM, leading to higher validation likelihood.
The result here seems to be that the reward model is simply an easier thing to learn, and that having it helps “translate” the problem onto the distribution.
This all feels like a continuum: at some level the superficial alignment hypothesis is directionally correct, but it’s not that “superficial”: the base models have a lot of capabilities that are hard to elicit, and fine-tuning/post-training can juice them effectively, while adding some learning of its own (as more data is better!).
The best way of performing that elicitation turns out to be solving different problems at different levels: SFT for format, then RL for the deeper capability, and having a reward model effectively simplifies the learning process again.
