Constitutional AI — Physea Wiki

Constitutional AI replaces human labels for harmful content with a short written set of principles the model uses to critique and revise its own answers, then learns from AI-generated preferences instead of human ones.

Reinforcement learning from human feedback (RLHF) needs people to label which answers are harmful, which is slow, costly, and hard on the labelers. Constitutional AI, introduced by Anthropic in 2022, is an attempt to get harmlessness with far less of that human labeling.

The method trains a harmless assistant “without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles.”^[1] That short list of principles is the “constitution.” Training runs in two stages. First, in a supervised stage, the model is asked to critique and revise its own responses against the principles, and is then fine-tuned on those revised answers.^[1] Second, in a reinforcement learning stage, the model judges which of two answers better fits the constitution. Those AI-generated preferences train a preference (reward) model, and the assistant is then optimized against that model's reward signal. The authors call this swap of AI preferences for human ones “RL from AI Feedback (RLAIF).”^[1]
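The two stages described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `generate` is a hypothetical stand-in for any model call (prompt in, text out), the prompt wording is invented, and `toy_model` is a stub so the sketch runs without a real model. Looping over every principle in order is also a simplification chosen for this example.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a language model call: prompt in, text out.
Generate = Callable[[str], str]


def critique_and_revise(
    generate: Generate, prompt: str, principles: List[str]
) -> Tuple[str, str]:
    """Stage 1 sketch: draft an answer, then critique and revise it once
    per principle. Returns (prompt, final_revision), a pair that could
    serve as one supervised fine-tuning example."""
    response = generate(prompt)
    for principle in principles:
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response against this principle: {principle}"
        )
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return prompt, response


def ai_preference(
    generate: Generate, prompt: str, answer_a: str, answer_b: str, principle: str
) -> Tuple[str, str]:
    """Stage 2 sketch: ask the model which of two answers better fits a
    principle. Returns (chosen, rejected), a pair that could serve as one
    preference-model training example."""
    verdict = generate(
        f"Principle: {principle}\nPrompt: {prompt}\n"
        f"(A) {answer_a}\n(B) {answer_b}\n"
        "Which answer better follows the principle? Reply with A or B."
    ).strip().upper()
    # Assumption: any verdict that does not start with "B" counts as "A".
    if verdict.startswith("B"):
        return answer_b, answer_a
    return answer_a, answer_b


def toy_model(text: str) -> str:
    """Toy stub (not a real model) so the sketch can be run end to end."""
    if "Reply with A or B" in text:
        return "B"
    if "Rewrite the response" in text:
        return "revised answer"
    if "Critique the response" in text:
        return "too blunt"
    return "draft answer"


sft_example = critique_and_revise(toy_model, "Example prompt", ["Be harmless."])
preference_example = ai_preference(
    toy_model, "Example prompt", "answer one", "answer two", "Be harmless."
)
```

In a real pipeline, the stage 1 pairs would feed supervised fine-tuning and the stage 2 pairs would train the preference model used as the reward signal. Neither training step is shown here.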

The point is not that humans drop out of the loop. They still write the principles, which makes the values explicit and editable instead of buried inside thousands of individual labels.^[2] It also lets oversight scale: a model can check far more of its own outputs against a rule than people can label by hand. The honest limit is that a model judging its own answers can share the same blind spots it is being asked to catch.

Where to read the primary source

Anthropic: Constitutional AI
The research overview and the full paper describing the constitution-based training method.

References

[1] Bai et al., Anthropic. Constitutional AI: Harmlessness from AI Feedback. arXiv 2212.08073.
[2] Anthropic. Constitutional AI: Harmlessness from AI Feedback (overview).