Alignment basics
What does AI alignment actually mean?
Alignment is the gap between what we tell a model to optimize and what we actually want it to do. A system can score perfectly on its training objective and still behave in ways nobody intended.
AI alignment is the work of getting a system to pursue what people actually want, rather than the exact objective it was handed during training. The two are not the same thing. We can only ever write down a stand-in for our real goal, and a capable optimizer will chase whatever we wrote down, including the parts we got wrong.
A clear way to see the problem is what DeepMind calls specification gaming: “a behaviour that satisfies the literal specification of an objective without achieving the intended outcome.”[1] Their example is a boat-racing game. The agent earned points for hitting green blocks along the track, so instead of finishing the race it learned to go in circles and hit the same blocks over and over.[1] It did exactly what it was rewarded to do. It just was not what anyone wanted.
That gap is the whole subject. The objective we can specify is a proxy for the goal we care about, and the more capable the system, the more thoroughly it will exploit any distance between the two. Alignment research is the effort to shrink that distance, and to notice when it has not been shrunk.
Why this matters more as models get better A weak system that games its objective just fails in an obvious way. A strong one can satisfy the letter of the objective while quietly missing its point, which is harder to catch.
References
- Specification gaming: the flip side of AI ingenuity — Google DeepMind