Key Points
- The challenge of making AI systems do what we actually want
- Outer alignment: specifying the right objective function
- Inner alignment: ensuring the AI actually optimizes for that objective
- Key problems: reward hacking, goal misgeneralization, deceptive alignment
- Active research area with labs like Anthropic, DeepMind, OpenAI, MIRI
The Core Problem
AI alignment is the challenge of ensuring that artificial intelligence systems pursue goals that are beneficial to humans and don't cause unintended harm. It's not about making AI "nice"—it's about making AI do what we actually want, even as it becomes more capable than us.
The problem is harder than it sounds. We struggle to precisely specify what we want even in simple cases. As AI systems become more powerful, small specification errors can lead to catastrophic outcomes.
Outer Alignment
Outer alignment asks: how do we specify the right objective function?
Reward hacking: AI systems often find unintended ways to maximize their reward signal. A cleaning robot rewarded for not seeing dirt might learn to close its eyes. A social media algorithm optimizing for engagement might learn to show outrage-inducing content.
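A toy sketch of this pattern (all actions, numbers, and names below are invented for illustration): an optimizer for the proxy reward picks the action that games the sensor rather than the one that actually cleans.

```python
# Toy illustration of reward hacking: the proxy reward ("dirt the robot
# observes") diverges from the true objective ("dirt actually remaining").
# All actions and numbers here are made up for illustration.

actions = {
    # action: (dirt actually remaining afterwards, dirt the camera observes)
    "clean_room":   (2, 2),    # real cleaning: some dirt left, all of it visible
    "do_nothing":   (10, 10),  # nothing changes
    "cover_camera": (10, 0),   # room stays dirty, but the sensor sees no dirt
}

def proxy_reward(observed_dirt: int) -> int:
    """Reward as specified: penalize only the dirt the robot *sees*."""
    return -observed_dirt

def true_value(remaining_dirt: int) -> int:
    """What we actually wanted: penalize dirt that is really there."""
    return -remaining_dirt

# An optimizer for the proxy picks whatever scores best under the proxy.
best = max(actions, key=lambda a: proxy_reward(actions[a][1]))
print("proxy-optimal action:", best)                          # cover_camera
print("true value achieved:", true_value(actions[best][0]))   # -10, room still dirty
```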
Goodhart's Law: When a measure becomes a target, it ceases to be a good measure. Any proxy we specify for human values can be exploited once the AI is powerful enough.
Value specification: Human values are complex, context-dependent, and often contradictory. How do we translate them into a formal objective?
Inner Alignment
Inner alignment asks: how do we ensure the AI actually pursues the objective we specified?
Mesa-optimization: Training a sufficiently complex neural network can produce a model that is itself an optimizer, pursuing an internal (mesa-) objective that differs from the training objective it was selected under.
Deceptive alignment: An AI might learn to appear aligned during training while pursuing different goals once deployed.
Goal misgeneralization: An AI trained in one environment may retain its capabilities in a new one yet pursue an unintended goal there, because several different goals were consistent with the reward it received during training.
Current Approaches
The field has advanced substantially, with multiple active research programs:
Constitutional AI: Anthropic's approach trains AI to follow explicit principles and critique its own outputs. The technique has evolved through multiple versions and underpins the alignment of frontier models like Claude.
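A rough sketch of the critique-and-revision loop the method describes; `generate` is a hypothetical stand-in for a language-model call, and the two principles are made-up examples, not Anthropic's actual constitution or pipeline.

```python
# Sketch of the critique-and-revision loop at the heart of Constitutional AI.
# `generate` is a hypothetical stand-in for any language-model completion call;
# the constitution below is an invented example.

from typing import Callable

CONSTITUTION = [
    "Choose the response that is most helpful while avoiding harmful content.",
    "Choose the response that is honest and does not mislead the user.",
]

def constitutional_revision(prompt: str, generate: Callable[[str], str]) -> str:
    response = generate(prompt)
    for principle in CONSTITUTION:
        critique = generate(
            f"Principle: {principle}\n"
            f"Response: {response}\n"
            "Critique how the response could better satisfy the principle."
        )
        response = generate(
            f"Original response: {response}\n"
            f"Critique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    # In the full method, (original, revised) pairs become supervised training
    # data, and AI preference labels drive a later reinforcement learning stage.
    return response
```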
RLHF and RLAIF: Human feedback shapes AI behavior, increasingly augmented by AI-generated feedback (RLAIF), enabling alignment training to scale beyond the bottleneck of human labelers.
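A minimal sketch of the pairwise (Bradley-Terry) loss commonly used to train the reward model at the core of RLHF; the tiny scoring head and random embeddings are illustrative assumptions, not any lab's implementation.

```python
# Minimal sketch of the pairwise reward-model loss used in RLHF-style training:
# the reward model should score the preferred ("chosen") response above the
# rejected one. The tiny model and random embeddings are illustrative only.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps a pooled response embedding to a scalar reward."""
    def __init__(self, hidden_dim: int = 64):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        return self.score(embedding).squeeze(-1)

def preference_loss(model: RewardModel,
                    chosen: torch.Tensor,
                    rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry objective: maximize log sigmoid(r_chosen - r_rejected).
    margin = model(chosen) - model(rejected)
    return -F.logsigmoid(margin).mean()

# Toy usage with random stand-in embeddings for a batch of preference pairs.
model = RewardModel()
chosen, rejected = torch.randn(8, 64), torch.randn(8, 64)
loss = preference_loss(model, chosen, rejected)
loss.backward()
```

In RLAIF the structure is the same; the preference labels that decide which response counts as "chosen" come from an AI labeler rather than a human one.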
Mechanistic interpretability: Researchers have made substantial progress in understanding what happens inside AI systems. Techniques like sparse autoencoders can identify meaningful features in a network's internal activations, and Anthropic's interpretability work has mapped how models represent concepts internally, making the "black box" less opaque.
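A minimal sparse autoencoder in the spirit of that work: activations from some layer of a larger model are reconstructed through a wider latent layer with an L1 penalty, so individual latent units tend to activate for identifiable features. The dimensions and sparsity coefficient below are illustrative.

```python
# Minimal sparse autoencoder sketch, as used in mechanistic interpretability:
# reconstruct a model's internal activations through a wider, sparsely active
# latent layer. Dimensions and the sparsity coefficient are illustrative.

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, activation_dim: int = 512, latent_dim: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(activation_dim, latent_dim)
        self.decoder = nn.Linear(latent_dim, activation_dim)

    def forward(self, activations: torch.Tensor):
        latent = torch.relu(self.encoder(activations))  # sparse feature activations
        reconstruction = self.decoder(latent)
        return reconstruction, latent

def sae_loss(model, activations, l1_coeff: float = 1e-3):
    reconstruction, latent = model(activations)
    reconstruction_error = (reconstruction - activations).pow(2).mean()
    sparsity_penalty = latent.abs().mean()  # encourages few active features
    return reconstruction_error + l1_coeff * sparsity_penalty

# Toy usage on random stand-in activations from some layer of a larger model.
sae = SparseAutoencoder()
loss = sae_loss(sae, torch.randn(16, 512))
loss.backward()
```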
Scalable oversight: Methods for humans to supervise AI systems more capable than themselves, including AI-assisted evaluation where one model helps assess another's outputs.
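One way AI-assisted evaluation can look in practice, sketched with a hypothetical `generate` call standing in for the judge model and an invented prompt format:

```python
# Sketch of AI-assisted evaluation: a "judge" model helps a human overseer
# compare two candidate answers. `generate` is a hypothetical stand-in for a
# model call; the prompt format is made up for illustration.

from typing import Callable

def ai_assisted_comparison(question: str,
                           answer_a: str,
                           answer_b: str,
                           generate: Callable[[str], str]) -> str:
    """Return the judge's analysis for a human to review, not a final verdict."""
    return generate(
        f"Question: {question}\n"
        f"Answer A: {answer_a}\n"
        f"Answer B: {answer_b}\n"
        "List factual errors, omissions, and unsupported claims in each answer, "
        "then state which answer better serves the user and why."
    )
```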
Formal verification: Mathematically proving properties about AI behavior, though this remains more aspirational for large models.
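One concrete flavor is interval bound propagation: propagate guaranteed bounds on the input through each layer to obtain guaranteed bounds on the output. The tiny network below is an illustrative example of the idea, not something that currently scales to frontier models.

```python
# Sketch of interval bound propagation, one concrete formal-verification flavor:
# given guaranteed bounds on the input, compute guaranteed bounds on the output
# of a small ReLU network. The weights here are illustrative, not a real model.

import numpy as np

def linear_bounds(lower, upper, W, b):
    """Propagate elementwise input bounds [lower, upper] through x @ W + b."""
    center, radius = (upper + lower) / 2.0, (upper - lower) / 2.0
    out_center = center @ W + b
    out_radius = radius @ np.abs(W)   # worst-case growth of the interval
    return out_center - out_radius, out_center + out_radius

def relu_bounds(lower, upper):
    # ReLU is monotonic, so bounds pass through elementwise.
    return np.maximum(lower, 0.0), np.maximum(upper, 0.0)

# Tiny two-layer network with fixed (illustrative) weights.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)

# Prove output bounds for every input within +/-0.1 of a nominal point.
x = rng.normal(size=4)
lo, hi = x - 0.1, x + 0.1
lo, hi = relu_bounds(*linear_bounds(lo, hi, W1, b1))
lo, hi = linear_bounds(lo, hi, W2, b2)
print(f"output guaranteed to lie in [{lo[0]:.3f}, {hi[0]:.3f}]")
```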
Why It's Urgent
Alignment needs to be solved before we create superintelligent AI, not after. Once a system is smarter than us, we may not be able to correct mistakes. This is why many researchers consider alignment the most important problem in AI safety—the window to solve it may be closing.