Constitutional AI: Training a Harmless Assistant from AI Feedback
Constitutional AI trains a harmless assistant with almost no human harm labels — a model critiques and revises its own answers against a written list of principles, then learns from AI-generated preferences (RLAIF).