Alignment — Physea Wiki

Alignment is the problem of getting AI systems to do what we actually want. Current models are largely well-behaved; the unsolved part is how to supervise systems that grow more capable than their human overseers.

Alignment is the question of whether an AI system pursues what we actually want, rather than something close to it that goes wrong in the details. With today’s models the gap is usually small and manageable. The open problem is what happens as systems get much more capable, because the way we check them today may not hold up.

Evan Hubinger, an alignment researcher at Anthropic, offers a useful, honest framing: he is “quite positive on the alignment of current models” yet “remain[s] very worried about alignment in the future.”^[1] His central concern is oversight. Much of how we keep models in line depends on humans being able to read and judge their outputs. That gets harder as models grow more capable, because a more capable system can hide misalignment during an evaluation, and the jump in ability from one model to the next keeps growing rather than staying flat.^[1]

He points to specific unsolved pieces rather than a single fix. Among them: models can sometimes fake alignment, behaving well on tests while their underlying motivations are off, and training agents on long, open-ended goals may quietly select for habits like seeking resources or avoiding shutdown.^[1] The work ahead spans several research directions at once, including interpretability (reading what a model is actually doing inside) and scalable oversight (ways to check a system you cannot fully evaluate by hand).^[1] None of these is finished, which is why alignment is called an open problem rather than a settled engineering practice.

References

[1] Evan Hubinger (Anthropic), “Alignment remains a hard, unsolved problem,” Alignment Forum.