RLHF — Physea Wiki

RLHF teaches a model what people prefer by having humans compare its outputs, training a reward model on those comparisons, then optimizing the model against that reward. It is a large part of why modern chat assistants feel helpful.

Reinforcement learning from human feedback, or RLHF, is one answer to the proxy problem from the previous page. Some goals are hard to write down as a rule but easy for a person to recognize. RLHF turns that recognition into a training signal.

The core idea was demonstrated in 2017 by Christiano and colleagues. Instead of hand-writing a reward function for the agent, they showed people pairs of short clips of the agent’s behavior and asked which one looked better. From those comparisons they trained a separate reward model that predicts human preference, then used that model's output as the reward the agent optimizes.^[1] Strikingly, this worked on hard tasks like Atari games and simulated robot motion while “providing feedback on less than one percent of our agent’s interactions with the environment.”^[1]
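As a rough illustration of the reward-model step, the sketch below fits a linear reward to pairwise comparisons with a Bradley-Terry style logistic loss. The linear reward, the two-feature toy data, the simulated labeler, and every name in it (`fit_reward_model`, `preference_prob`, `hidden_w`) are assumptions made for illustration; the original work used neural networks and real human labels.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def reward(weights, features):
    # Hypothetical linear reward model: a weighted sum of behavior features.
    return sum(w * x for w, x in zip(weights, features))


def preference_prob(weights, a, b):
    # Bradley-Terry style probability that behavior a is preferred over b.
    return sigmoid(reward(weights, a) - reward(weights, b))


def fit_reward_model(comparisons, dim, lr=0.1, epochs=50):
    # Stochastic gradient ascent on the log-likelihood of the observed choices.
    weights = [0.0] * dim
    for _ in range(epochs):
        for preferred, rejected in comparisons:
            p = preference_prob(weights, preferred, rejected)
            for i in range(dim):
                # d/dw log p = (1 - p) * (preferred - rejected)
                weights[i] += lr * (1.0 - p) * (preferred[i] - rejected[i])
    return weights


# Toy data: a simulated labeler (an assumption, standing in for a human)
# likes feature 0 and dislikes feature 1, with some noise in each choice.
rng = random.Random(0)
hidden_w = [3.0, -2.0]
comparisons = []
for _ in range(300):
    a = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    b = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    if rng.random() < preference_prob(hidden_w, a, b):
        comparisons.append((a, b))  # (preferred, rejected)
    else:
        comparisons.append((b, a))

learned_w = fit_reward_model(comparisons, dim=2)
```

The fitted model never sees `hidden_w` directly. It only sees which item of each pair was chosen, which is the point of learning from comparisons.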

Applied to language models, the same recipe is what turns a raw text predictor into a usable assistant. People rank model responses, a reward model learns to imitate those rankings, and the model is tuned to produce answers the reward model scores highly.
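The tuning step can be sketched on a deliberately tiny scale. In the toy below, the "model" is just a softmax distribution over three canned responses, and it is updated by exact gradient ascent on the expected reward-model score. The responses, the scores, and the function names are all hypothetical. A real system samples text from a language model and uses a full reinforcement learning algorithm, so treat this only as a picture of the direction of the update.

```python
import math


def softmax(logits):
    # Convert logits to probabilities, subtracting the max for stability.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def tune_policy(logits, reward_scores, lr=0.5, steps=200):
    # Gradient ascent on expected reward under a softmax policy.
    logits = list(logits)
    for _ in range(steps):
        probs = softmax(logits)
        expected = sum(p * r for p, r in zip(probs, reward_scores))
        for i, (p, r) in enumerate(zip(probs, reward_scores)):
            # d E[reward] / d logit_i = p_i * (r_i - E[reward])
            logits[i] += lr * p * (r - expected)
    return softmax(logits)


# Hypothetical candidate responses and hypothetical reward-model scores.
responses = ["curt answer", "helpful answer", "off-topic answer"]
reward_scores = [0.2, 1.0, -0.5]

initial_probs = softmax([0.0, 0.0, 0.0])  # untuned: all equally likely
tuned_probs = tune_policy([0.0, 0.0, 0.0], reward_scores)
favored = responses[tuned_probs.index(max(tuned_probs))]
```

After tuning, most of the probability mass sits on the response the reward model scored highest, whether or not that score reflects what a person would actually want.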

The catch is that RLHF aligns the model to the reward model, and the reward model is itself a proxy for human judgment. If the reward model has blind spots, the trained model will find and exploit them. RLHF narrows the alignment gap; it does not close it.
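A contrived example makes the blind-spot problem concrete. Here a hypothetical reward model has learned only that longer answers look more thorough, so optimizing against it selects a long, wrong answer. The candidate texts, the `correct` flags, and both scoring functions are invented for illustration and do not describe any real reward model.

```python
# Hypothetical candidate answers to "What is the capital of France?"
candidates = [
    {"text": "Paris.", "correct": True},
    {"text": "The capital of France is Paris.", "correct": True},
    {
        "text": "Certainly! The capital of France is definitely, "
                "absolutely, without question Lyon.",
        "correct": False,
    },
]


def proxy_reward(candidate):
    # Hypothetical flawed reward model: it scores word count and nothing else.
    return len(candidate["text"].split())


def human_judgment(candidate):
    # Stand-in for what a person actually cares about: is the answer right?
    return 1.0 if candidate["correct"] else 0.0


# Optimizing against the proxy picks whatever the proxy scores highest.
best_by_proxy = max(candidates, key=proxy_reward)
best_by_human = max(candidates, key=human_judgment)
```

The proxy's favorite is the incorrect answer, while human judgment picks a correct one. The gap between the two choices is the part of the alignment problem that RLHF leaves open.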

References

Deep reinforcement learning from human preferences — Christiano et al., arXiv (NeurIPS 2017)

What is reinforcement learning from human feedback?
