Fine-tuning and RLHF — Physea Wiki

After pretraining, a base model is shaped with two more steps: fine-tuning on good example answers, then learning from human feedback on which answers people prefer. This is what turns a next-word predictor into a useful assistant.

A base model can predict the next word, but that is not the same as being helpful. Ask it a question and it might continue with more questions, because a list of questions is a plausible continuation of the kind of text it was trained on. Two further training steps address this.

The first is supervised fine-tuning: keep training the model, but now on a smaller, carefully chosen set of good answers. In the InstructGPT work, the team collected “a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning.”^[1] The model is still adjusting its weights to predict the next word, just on examples of the behavior we actually want.
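To make "still adjusting its weights" concrete, here is a minimal sketch of the supervised objective: next-token cross-entropy on a demonstration, reduced by gradient descent. Everything in it is an assumption for illustration (a 4-token vocabulary, a lookup table of logits standing in for the model, one made-up demonstration, and the helper names `sft_loss` and `sft_step`); it is not the InstructGPT training code.

```python
import math

def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sft_loss(table, demo):
    """Average next-token cross-entropy over one demonstration.

    table[prev] holds the logits for the token that follows `prev`;
    demo is a list of token ids. Both are toy stand-ins for a real model.
    """
    pairs = list(zip(demo, demo[1:]))
    return -sum(math.log(softmax(table[p])[n]) for p, n in pairs) / len(pairs)

def sft_step(table, demo, lr=0.5):
    """One gradient-descent step on sft_loss, updating table in place."""
    pairs = list(zip(demo, demo[1:]))
    # Compute all gradients from the current weights before changing any.
    grads = [[0.0] * len(row) for row in table]
    for prev, nxt in pairs:
        probs = softmax(table[prev])
        for tok, p in enumerate(probs):
            target = 1.0 if tok == nxt else 0.0
            # Gradient of cross-entropy w.r.t. a logit is (probability - target).
            grads[prev][tok] += (p - target) / len(pairs)
    for row, grad_row in zip(table, grads):
        for tok, g in enumerate(grad_row):
            row[tok] -= lr * g

VOCAB = 4
table = [[0.0] * VOCAB for _ in range(VOCAB)]  # untrained: uniform predictions
demo = [0, 2, 1, 3]                            # one hypothetical demonstration

loss_before = sft_loss(table, demo)
for _ in range(20):
    sft_step(table, demo)
loss_after = sft_loss(table, demo)
print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

The loss falls as the table shifts probability toward the demonstrated next tokens. A real fine-tuning run applies the same idea to a neural network and many demonstrations.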

The second step learns from human preferences, often called RLHF (reinforcement learning from human feedback). People rank different model outputs from best to worst. In the InstructGPT pipeline, those rankings train a reward model that scores outputs, and the fine-tuned model is then optimized with reinforcement learning to produce outputs that score highly. The same team collected “a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback.”^[1]
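One common way to turn rankings into a training signal, and the approach described in the InstructGPT paper, is to fit a reward model with a pairwise loss: for every pair of outputs, the preferred one should receive the higher score. The sketch below assumes scalar reward scores are already available; the function names, output labels, and scores are hypothetical and only illustrate the loss.

```python
import math
from itertools import combinations

def ranking_to_pairs(ranked):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs."""
    # combinations() keeps list order, so the first item of each pair ranks higher.
    return list(combinations(ranked, 2))

def pairwise_loss(rewards, ranked):
    """Mean of -log(sigmoid(r_preferred - r_rejected)) over all pairs."""
    pairs = ranking_to_pairs(ranked)
    total = 0.0
    for preferred, rejected in pairs:
        diff = rewards[preferred] - rewards[rejected]
        total += math.log1p(math.exp(-diff))  # equals -log(sigmoid(diff))
    return total / len(pairs)

# A labeler ranked three hypothetical outputs: "b" best, then "a", then "c".
ranked = ["b", "a", "c"]

agreeing = {"a": 0.5, "b": 2.0, "c": -1.0}     # scores that match the ranking
disagreeing = {"a": 0.5, "b": -1.0, "c": 2.0}  # scores that contradict it

print(pairwise_loss(agreeing, ranked))     # lower loss
print(pairwise_loss(disagreeing, ranked))  # higher loss
```

Minimizing this loss pushes the reward model to score outputs in the order people ranked them. The reinforcement learning stage then tunes the assistant to produce outputs that score highly.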

These two steps can matter more than raw size. In human evaluations, raters preferred the answers from a much smaller tuned model over a far larger untuned one: “outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters.”^[1] Good shaping can beat sheer scale.
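As a quick check on the quoted figure, the ratio of the two parameter counts can be computed directly. The variable names are illustrative; the counts are the ones given in the quote.

```python
small_params = 1.3e9   # the 1.3B parameter InstructGPT model
large_params = 175e9   # the 175B parameter GPT-3

ratio = large_params / small_params
print(f"GPT-3 has about {ratio:.0f}x as many parameters")
```

The ratio comes out a little above 100, so "100x" reads as an order-of-magnitude description.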

Fine-tuning tooling

Hugging Face TRL
An open library for post-training models with supervised fine-tuning, reward modeling, and reinforcement learning.

References

[1] Ouyang et al. (OpenAI), Training language models to follow instructions with human feedback, arXiv:2203.02155.
