https://arxiv.org/abs/2412.18925
A good paper on creating a domain-specific (medical, in this case) reasoning LLM. I had been somewhat vague on how creating a reasoning model actually works, and this paper felt very clear (about one recipe, at least). As I read it:
Start with a base model you are fine tuning for reasoning (in their case, Qwen 2.5) and a strong general model for verification and data generation (in this case, GPT-4o).
- Gather a bunch of problems with ground-truth answers. In this case they had medical problems. Take about half of them for fine tuning data.
- Prompt the general model to think step-by-step for each problem to get some initial reasoning.
- Use the general model to evaluate whether each answer reached is correct vs ground truth.
- If the reasoning gets to the wrong answer (which is likely!), pick a search strategy at random – e.g. backtracking (start from an earlier step), critique-and-correct the existing line of reasoning, explore a new path distinct from the one given, or verify the current reasoning. Again, this is done by prompting the general model.
- Each problem gets three tries to get to a correct answer before giving up. At the end of this process we have a series of reasoning traces that get to the correct answer. Pair these with the problems and use them for the next steps, but first rewrite them into a chain of thought incorporating “hmms” and other smooth transitions between thoughts, by prompting the general model.
- Fine tune the base model on the problem traces.
- Do RL (PPO) on the fine tuned model with the other half of the problems, giving the full reward when the model reasons and reaches the correct answer (verified by the general model), a small reward when it reasons but reaches an incorrect answer, and 0 when no reasoning is provided (regardless of the answer). Also constrain with KL divergence from the fine tuned model, as is standard in RLHF etc.
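To make the recipe concrete for myself, here is a minimal Python sketch of the two pieces that were least obvious to me: the search for a correct reasoning trace and the RL reward. This is my own reading, not the paper's code. `generate` and `is_correct` are hypothetical callables standing in for prompts to the general model, the strategy names are my labels, and the `partial=0.1` and `beta=0.03` defaults are illustrative assumptions, not values taken from these notes.

```python
import random
from typing import Callable, List, Optional

# Labels for the search strategies described above (my naming, not the paper's prompts).
STRATEGIES = ["backtrack", "critique_and_correct", "explore_new_path", "verify"]


def search_for_trace(
    problem: str,
    truth: str,
    generate: Callable[[str, List[str], str], str],  # hypothetical: prompts the general model
    is_correct: Callable[[str, str], bool],          # hypothetical: general model as verifier
    max_tries: int = 3,
    rng: Optional[random.Random] = None,
) -> Optional[List[str]]:
    """Return the list of attempts ending in a correct one, or None if we give up."""
    rng = rng or random.Random()
    steps: List[str] = []
    strategy = "initial"  # first attempt is plain step-by-step reasoning
    for _ in range(max_tries):
        steps.append(generate(problem, list(steps), strategy))
        if is_correct(steps[-1], truth):
            return steps
        strategy = rng.choice(STRATEGIES)  # wrong answer: pick a strategy at random
    return None  # problem is dropped from the fine tuning data


def reward(has_reasoning: bool, answer_correct: bool, partial: float = 0.1) -> float:
    """Reward shape from the notes; `partial` is an assumed 'small reward' value."""
    if not has_reasoning:
        return 0.0  # no reasoning: zero reward regardless of the answer
    return 1.0 if answer_correct else partial


def kl_penalized(r: float, kl: float, beta: float = 0.03) -> float:
    """Standard RLHF-style shaping: subtract a KL penalty (beta is an assumption)."""
    return r - beta * kl
```

The traces returned by `search_for_trace` would still need the rewrite step into a single smooth chain of thought before being used for fine tuning.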
This seems like a pretty reproducible recipe and the results seem strong. They include the prompts they use in the appendix, helpfully, and some good ablations/practical notes.
- Effectiveness of Complex CoTs: We further examined the impact of different types of Chain-of-Thought (CoT) reasoning. The results show that direct learning of response (ŷ) performs the worst, while simple CoT (y0, e0) offers only little benefit. In contrast, Complex CoT (ŷ0, ê) significantly improves performance by an average of 4.3 points. This demonstrates the importance of teaching models to refine their answers with reflection.
One other interesting note on sourcing data: they used GPT-4o to filter a number of multiple-choice questions to produce the set of problems and ground-truth answers. They used it to evaluate whether questions were complex enough to require reasoning, and whether they had a single clear and unambiguous answer. I am guessing it is a lot easier to get multiple-choice question banks than other kinds, so this is a clever approach.
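A rough sketch of what that filter could look like, under my own assumptions: `judge` is a hypothetical callable that wraps a prompt to the general model and returns a yes/no verdict, and the `MCQ` fields and check names are made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class MCQ:
    question: str
    options: List[str]
    answer: str  # ground-truth option


# Hypothetical check names; each would map to a separate prompt for the judge model.
CHECKS = ("needs_reasoning", "single_unambiguous_answer")


def filter_questions(
    questions: Iterable[MCQ],
    judge: Callable[[MCQ, str], bool],  # hypothetical: asks the general model one yes/no check
) -> List[MCQ]:
    """Keep only questions that pass every check."""
    return [q for q in questions if all(judge(q, check) for check in CHECKS)]
```

Anything that fails either check is simply dropped, so the size of the final problem set depends on how strict the judge prompts are.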
