DeepSeek R1 & GRPO

January 25, 2025

note-to-self

DeepSeek dropped another quality release with their R1 series. The models are an exploration of how to improve reasoning capabilities in LLMs. They’ve released a crop of models which, despite some quirks in their licensing, are extremely helpful. The paper continues the trend of being clear and open.

Core Approaches for Developing Reasoning Capabilities

DeepSeek’s work on reasoning in LLMs is built on three distinct approaches:

1. Reinforcement Learning (RL) on a Base Model (DeepSeek-R1-Zero)

DeepSeek-R1-Zero applies RL directly to the base model without relying on any supervised data.

Training Approach:
- Leveraged Group Relative Policy Optimization (GRPO), earlier work by the same folks, a simplified version of PPO that avoids the need for a large critic model by basing optimization on group-level scores.
- Effectively it generates multiple completions per prompt, scoring them using a combination of reward models and reward functions:
  - Accuracy rewards: Validating correctness for problems like coding or math with deterministic solutions.
  - Format rewards: Encouraging structured reasoning traces using tags.
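- Each completion’s reward was compared against the average for its group (and scaled by the group’s standard deviation) to get an advantage, with a KL divergence penalty applied to keep the policy close to the reference (base) model.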
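
To make the scoring step concrete, here is a minimal sketch of rule-based rewards and the group-relative advantage. The `<think>`/`<answer>` tag names, the function names, and the exact-match answer check are my assumptions for illustration, not DeepSeek's code. The KL term follows the per-token estimator described in the GRPO paper, as I understand it.

```python
import math
import re
from statistics import mean, pstdev

def format_reward(completion: str) -> float:
    """1.0 if the completion is a reasoning trace followed by an answer, both tagged."""
    pattern = r"<think>.+?</think>\s*<answer>.+?</answer>"
    return 1.0 if re.fullmatch(pattern, completion.strip(), flags=re.DOTALL) else 0.0

def accuracy_reward(completion: str, expected: str) -> float:
    """1.0 if the tagged answer exactly matches the expected string (assumed check)."""
    m = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    return 1.0 if m and m.group(1).strip() == expected else 0.0

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def kl_estimate(logp_policy: float, logp_ref: float) -> float:
    """Per-token estimator: ratio - log(ratio) - 1, with ratio = pi_ref / pi_policy."""
    log_ratio = logp_ref - logp_policy
    return math.exp(log_ratio) - log_ratio - 1.0

# One prompt, three sampled completions (toy data).
completions = [
    "<think>2+2</think><answer>4</answer>",
    "<think>guess</think><answer>5</answer>",
    "4",
]
rewards = [accuracy_reward(c, "4") + format_reward(c) for c in completions]
advantages = group_relative_advantages(rewards)
```

Here the best completion gets a positive advantage, the middle one sits at zero, and the untagged one is pushed down. No critic is needed because the group itself serves as the baseline.
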
Emergent Behaviors:
- Spontaneous reflection and backtracking during training.
- Majority voting across completions further boosted the model’s reasoning accuracy.
- Over time, the model naturally extended its “thinking” process, solving increasingly complex tasks.
- I saw a bit of skepticism about the “Aha” moments (the model demonstrating backtracking), with suggestions that perhaps they had some o1 data in the training mix. Jiayi-Pan/TinyZero came out on GitHub today; it reproduces the R1-Zero recipe on Qwen-2.5 3B and sees the same behavior. While that doesn’t totally rule it out, I think it likely indicates that this method is a pretty sound way of eliciting reasoning.
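
The majority voting mentioned above can be sketched in a few lines: sample several completions, extract each final answer, and take the most common one. The function name and the string-based answer comparison are assumptions for illustration.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common non-empty answer and its share of the votes."""
    votes = Counter(a.strip() for a in answers if a and a.strip())
    if not votes:
        return None, 0.0
    answer, count = votes.most_common(1)[0]
    return answer, count / sum(votes.values())

# Final answers extracted from five sampled completions (toy data).
winner, share = majority_vote(["42", "41", "42 ", "", "42"])
```

The vote share is a rough confidence signal: a narrow win suggests the samples disagree and the answer deserves less trust.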

2. Supervised Fine-Tuning (SFT) with Chain of Thought (CoT) + RL

Building upon the groundwork of R1-Zero, DeepSeek-R1 is a supervised fine-tune of DeepSeek V3 on reasoning-trace/chain-of-thought data.

Data Preparation:
- Generated ~600k CoT examples by iteratively fine-tuning on outputs from converging models (e.g. run it, get some good traces, add those to the training set, rinse and repeat).
- Supplemented the dataset with ~200k non-CoT examples to ensure the model learns to use CoT selectively and appropriately (they used data from the preexisting DeepSeek V3 fine-tuning set).
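
The "run it, keep the good traces, repeat" loop might look roughly like this. It is a sketch under assumptions: `generate` and `is_good` are hypothetical stand-ins for a model call and a trace filter, and keeping at most one trace per prompt is my simplification.

```python
from itertools import cycle

def collect_traces(problems, generate, is_good, samples_per_prompt=4):
    """Sample up to N completions per prompt and keep the first that passes the filter."""
    kept = []
    for prompt, expected in problems:
        for _ in range(samples_per_prompt):
            completion = generate(prompt)
            if is_good(completion, expected):
                kept.append({"prompt": prompt, "completion": completion})
                break
    return kept

def build_sft_mix(cot_examples, non_cot_examples):
    """Tag and concatenate CoT and non-CoT examples into one fine-tuning set."""
    mix = [dict(ex, kind="cot") for ex in cot_examples]
    mix += [dict(ex, kind="non_cot") for ex in non_cot_examples]
    return mix

# Toy stand-ins: the "model" alternates between a wrong and a right answer.
fake_outputs = cycle(["wrong", "4"])
traces = collect_traces(
    problems=[("2+2", "4"), ("1+3", "4")],
    generate=lambda prompt: next(fake_outputs),
    is_good=lambda completion, expected: completion == expected,
)
sft_set = build_sft_mix(traces, [{"prompt": "hello", "completion": "Hi there!"}])
```

In a real pipeline the kept traces would feed a fine-tuning run, and the improved model would generate the next round.
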
Training Process:
- Fine-tuned on the combined ~800k examples.
- Applied RL exactly as with R1-Zero to refine reasoning capabilities, using diverse prompts (including some non-Chain-of-Thought ones) and classic preference alignment techniques (e.g. don’t chain of thought “hello”) to optimize response quality and coherence.
Outcome:
- This model seems both good at reasoning and generally pretty strong. It’s not everything everywhere, but this feels like a plausible recipe toward general-purpose models, though they note it is a bit worse at tool use etc. than base V3.

3. Distillation to Small Dense Models

In a really interesting extension, DeepSeek distilled the reasoning knowledge of R1 into compact versions based on models like Qwen and Llama.

Process:
- Fine-tuned smaller models on the same 800k training examples used before, but didn’t do any RL.
- The distilled models exhibited strong reasoning capabilities, outperforming earlier open-source baselines.
Observations:
- Distillation did not involve direct logit comparison, likely due to the challenges posed by differing token sets (vocabularies).
- RL stages were not applied to distilled models, but researchers suggested that doing so could enhance their performance further.
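
A toy illustration of why token set differences get in the way: two tokenizers can split the same text into different numbers of pieces, so per-position output distributions don't line up, while the plain text is always shared. The vocabularies and the greedy tokenizer here are invented for the example and don't correspond to any real model.

```python
def toy_tokenize(text, vocab):
    """Greedy longest-match tokenizer over a toy vocabulary (falls back to single chars)."""
    tokens, i = [], 0
    while i < len(text):
        match = max(
            (v for v in vocab if text.startswith(v, i)),
            key=len,
            default=text[i],
        )
        tokens.append(match)
        i += len(match)
    return tokens

teacher_vocab = {"reason", "ing", " ", "step"}
student_vocab = {"rea", "son", "ing", " ", "st", "ep"}

text = "reasoning step"
teacher_tokens = toy_tokenize(text, teacher_vocab)  # 4 tokens
student_tokens = toy_tokenize(text, student_vocab)  # 6 tokens
```

Since both token sequences decode to the same string, fine-tuning the student on the teacher's generated text sidesteps the alignment problem entirely.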

Challenges and Unsuccessful Attempts

I really like that they called out some things that didn’t work:

Process Reward Model (PRM):
- Attempted to reward reasoning steps individually rather than focusing on end results.
- Faced challenges like reward hacking and inconsistencies in defining intermediate reasoning steps.
Monte Carlo Tree Search (MCTS):
- Explored breaking problems into smaller parts and systematically searching for solutions.
- Encountered exponential search space complexity, local optima issues, and difficulties in training reliable value models to guide search steps.

Additional Observations

Few-shot Prompting:
- Observed that few-shot prompting degraded R1’s performance, a behavior also noted by folks working with OpenAI’s o1 series.
Software Engineering challenges:
- Slow evaluations limited the application of large-scale RL to software engineering tasks.
- They highlighted the need for methods like rejection sampling (have “not chosen” examples) or asynchronous evaluations to address these inefficiencies.
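
As a rough picture of what overlapping slow evaluations could look like, here is a sketch that scores a batch of completions concurrently with a thread pool. `slow_eval` is a hypothetical stand-in for something like running a test suite, and this is my illustration rather than anything described in the paper.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_eval(completion: str) -> float:
    """Stand-in for a slow check, e.g. running a test suite (hypothetical)."""
    time.sleep(0.05)
    return 1.0 if "pass" in completion else 0.0

def score_batch(completions, evaluate=slow_eval, max_workers=8):
    """Evaluate completions concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(evaluate, completions))

scores = score_batch(["pass", "fail", "pass again"])
```

Threads suit evaluations that mostly wait on subprocesses or I/O; CPU-bound checks would want processes or a separate evaluation service.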

It is extremely cool to see that pure RL can push a model to improve reasoning capabilities. The recipe of developing traces, fine-tuning, and mixing in capabilities feels very practical as well. I am really looking forward to people exploring further!