Alignment basics
What is the difference between outer and inner alignment, and why is alignment hard?
Outer alignment asks whether the objective we train on matches what we want. Inner alignment asks whether the model that comes out actually pursues that objective, rather than a stand-in goal that only looked right during training.
The 2019 paper “Risks from Learned Optimization” splits the alignment gap into two separate failures. Both have to go right, and either can go wrong on its own.[1]
Outer alignment is about the objective itself: does the thing we train on match what we actually want? The boat-racing agent that circled to farm points instead of finishing the race is an outer failure, the reward was a bad proxy for the goal.[3] RLHF and Constitutional AI are both attempts to specify a better objective.
Inner alignment is a quieter problem. Suppose the objective is genuinely good. The model that training produces still has to actually adopt that objective rather than some other goal that happened to score well during training. The authors call a learned system that is itself pursuing a goal a mesa-optimizer, and its internal goal the mesa-objective; inner alignment is the problem of getting that internal goal to match the one we trained on.[1][2] A model could behave perfectly in training because that was the winning move there, then pursue something different once it is out in the world.
This is why alignment is hard. Outer alignment fails because we cannot fully write down what we want. Inner alignment fails because we cannot directly read the goal a trained model ended up with; we only see its behavior on the cases we tested. A system can pass every test we run and still hold an objective we never chose, and the more capable it is, the higher the stakes of that gap.
References
- Risks from Learned Optimization in Advanced Machine Learning Systems — Hubinger et al., arXiv 1906.01820 (2019)
- Risks from Learned Optimization: Introduction — Hubinger et al., AI Alignment Forum
- Specification gaming: the flip side of AI ingenuity — Google DeepMind