| [Scalably Solving Assistance Games | OpenReview](https://openreview.net/forum?id=xVS7dFKoMR) |
Assistance games are an RL setup in which the assistant and the human cooperate to achieve a goal and receive a shared reward for the joint effort. This paper proposes them as a better mechanism than RLHF for aligning models in post-training.
RLHF normally optimizes single responses, that is, a single “turn” of an interaction. The paper contrasts this with assistance games:
Assistance games avoid the aforementioned drawbacks of RLHF by explicitly accounting for both the interactive nature of assistance and uncertainty about the user’s goal. In particular, an assistance game is a two player game in which an assistant and a user take actions in a shared environment. The two agents share a reward function, but crucially the assistant is initially uncertain about it.
Assistance games remove incentives for deception since the assistant’s performance depends on the true latent reward function, rather than human feedback. They also incentivize the assistant to interact with the user to resolve its uncertainty about the reward function.
The paper uses building structures in Minecraft as the learning environment and gets some very positive results. It mentions possible applications to chatbot alignment as a postscript.
In practice, given a chat history h, this requires predicting:
- the next assistant message (or tool call)
- the next human message in response to that
- how satisfied the human is with the response
The algorithm does a tree search: it tries various candidate assistant replies and simulated human responses, then picks the assistant action that scored best. The paper generally sampled ~100 actions.
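To make the three predictions and the search concrete, here is a minimal sketch of a one-step lookahead over sampled actions. It is a simplification, not the paper's search procedure: the function `pick_assistant_action`, its parameters, and the toy stand-in models are all hypothetical names made up for illustration, and a real system would use learned models and search deeper than one turn.

```python
import random
from typing import Callable, List, Optional

History = List[str]

def pick_assistant_action(
    history: History,
    propose_action: Callable[[History, random.Random], str],       # next assistant message
    predict_human_reply: Callable[[History, random.Random], str],  # next human message
    predict_satisfaction: Callable[[History], float],              # how happy the human is
    num_candidates: int = 100,
    replies_per_candidate: int = 4,
    seed: int = 0,
) -> Optional[str]:
    """Sample candidate actions, simulate replies, return the best-scoring action."""
    rng = random.Random(seed)
    best_action, best_value = None, float("-inf")
    for _ in range(num_candidates):
        action = propose_action(history, rng)
        after_action = history + [action]
        # Average predicted satisfaction over several simulated human replies.
        total = 0.0
        for _ in range(replies_per_candidate):
            reply = predict_human_reply(after_action, rng)
            total += predict_satisfaction(after_action + [reply])
        value = total / replies_per_candidate
        if value > best_value:
            best_action, best_value = action, value
    return best_action

# Toy stand-ins for the three predictors (assumptions, not real models).
CANDIDATES = ["Which color do you want?", "Here is a red one.", "Here is a blue one."]

def toy_propose(history: History, rng: random.Random) -> str:
    return rng.choice(CANDIDATES)

def toy_human_reply(history: History, rng: random.Random) -> str:
    # The simulated user answers a question; otherwise they just react.
    return "Blue, please." if history[-1].endswith("?") else "Hmm."

def toy_satisfaction(history: History) -> float:
    return 1.0 if history[-1] == "Blue, please." else 0.3

best = pick_assistant_action(
    ["I need a block."], toy_propose, toy_human_reply, toy_satisfaction
)
print(best)
```

In this toy setup the clarifying question wins because the simulated user is happier after being asked, which mirrors the incentive to resolve uncertainty about the goal.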
In the Minecraft example, they can see whether a placed or removed block moves the shape towards the human target, so they can give a reward score each step; doing the same thing with conversations might need some clever goal crafting or propagating back from only a final reward signal.
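One standard way to propagate a final reward back to earlier steps is the discounted return from reinforcement learning. The sketch below assumes that approach; the notes above do not say the paper prescribes it for conversations, and the function name and `gamma` value are illustrative.

```python
from typing import List

def discounted_returns(rewards: List[float], gamma: float = 0.99) -> List[float]:
    """Return G_t = r_t + gamma * G_{t+1} for every step t, computed backwards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# A four-turn conversation where only the final turn receives a reward.
print(discounted_returns([0.0, 0.0, 0.0, 1.0], gamma=0.5))
# → [0.125, 0.25, 0.5, 1.0]
```

Earlier turns receive a smaller share of the credit, so the final reward signal still reaches every step of the conversation.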
