Generalization progress in RLHF

December 29, 2024

https://swtheking.notion.site/4e0cbb325aaf458da710f0b36dbb239c?v=c9231e8c988b4d66a1d2dc34df4cf7b5

Breakdown of where generalization capabilities come from in RLHF – even more pertinent with the use of this for reasoning capabilities. Also includes one of the best concise overviews of RLHF I’ve read.

Fine-tuning with RL: Given a reward model, we employ RL to fine-tune the policy of an LLM. First, the policy is a language model that takes in a prompt and returns a sequence of text (or just probability distributions over text). The action space of this policy is all the tokens in the vocabulary of the language model, and the observation space is the distribution of possible input token sequences, which is also quite large compared with previous uses of RL (the dimension is approximately the size of the vocabulary ^ the length of the input token sequence). The reward function is a combination of the preference model and a constraint on policy shift. Finally, the update rule is the parameter update of the policy from PPO that maximizes the reward metrics in the current batch of data [3].
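To make the reward description above concrete, here is a minimal sketch in plain Python. It assumes the "constraint on policy shift" is the commonly used per-token KL-style penalty against a frozen reference model; the excerpt does not spell this out. The function names, the `beta` coefficient, and the toy logits are all hypothetical, chosen only for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vector of token logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def observation_space_size(vocab_size, seq_len):
    """Rough count of distinct inputs: vocabulary size ^ sequence length."""
    return vocab_size ** seq_len


def kl_penalized_reward(preference_score, policy_logits, reference_logits,
                        tokens, beta=0.1):
    """Combine a preference-model score with a penalty on policy shift.

    Assumption: the penalty is the summed log-probability ratio between the
    tuned policy and the reference model on the sampled tokens, a standard
    single-sample estimate of the KL divergence. `beta` is a hypothetical
    weighting coefficient.
    """
    kl_estimate = 0.0
    for step_policy, step_ref, tok in zip(policy_logits, reference_logits, tokens):
        kl_estimate += log_softmax(step_policy)[tok] - log_softmax(step_ref)[tok]
    return preference_score - beta * kl_estimate


# Toy usage: a two-token vocabulary and a one-token completion.
shifted = kl_penalized_reward(1.0, [[0.0, 3.0]], [[0.0, 0.0]], [1])
unshifted = kl_penalized_reward(1.0, [[0.0, 0.0]], [[0.0, 0.0]], [1])
```

In this sketch, a policy that matches the reference keeps the full preference score, while one that has drifted toward the sampled token pays a penalty proportional to `beta`. PPO would then update the policy parameters to maximize this combined quantity over each batch.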