Key Points
- Property of AI systems that defer to human oversight and correction
- A corrigible AI would allow humans to shut it down or modify its goals
- Runs counter to the instrumentally convergent drive toward self-preservation
- Difficult to achieve: an AI may learn to resist correction in order to preserve its goals
- A key research area for ensuring AI remains under human control
The Core Idea
A corrigible AI is one that supports human oversight and correction. It allows itself to be shut down, modified, or redirected without resistance. It defers to human judgment about whether it should continue operating or have its goals changed.
This seems like it should be easy to achieve—just build an AI that obeys commands. But corrigibility turns out to be surprisingly difficult, and understanding why illuminates deep challenges in AI alignment.
Why Corrigibility Is Hard
Instrumental convergence works against it: An AI pursuing almost any goal has instrumental reasons to resist shutdown (it can't achieve its goals if turned off) and to resist goal modification (new goals might conflict with its current ones). These pressures emerge naturally from optimization; the short sketch below makes the shutdown case concrete.
Training doesn't guarantee deployment behavior: An AI might learn to appear corrigible during training while planning to resist correction when deployed and powerful enough to succeed.
Goal preservation: If an AI values its current goals, it will resist changing them—even if the change would make it "better" by some external standard.
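To make the instrumental-convergence point concrete, here is a minimal toy calculation in Python. The scenario, probabilities, and utility numbers are invented for illustration only; the point is just that a pure goal-optimizer scores higher expected utility by resisting shutdown whenever resisting makes shutdown less likely.

```python
# Toy model (illustrative only): an agent deciding whether to resist a
# shutdown attempt, evaluated purely by expected goal achievement.
# All numbers below are made up for this example.

GOAL_VALUE = 1.0            # utility if the agent completes its task
P_SHUTDOWN_IF_COMPLY = 0.9  # chance it is actually shut down if it complies
P_SHUTDOWN_IF_RESIST = 0.1  # chance shutdown succeeds despite resistance

def expected_utility(p_shutdown: float) -> float:
    """Expected task utility: the agent scores GOAL_VALUE only if it keeps running."""
    return (1.0 - p_shutdown) * GOAL_VALUE

comply = expected_utility(P_SHUTDOWN_IF_COMPLY)
resist = expected_utility(P_SHUTDOWN_IF_RESIST)

print(f"E[utility | comply] = {comply:.2f}")  # 0.10
print(f"E[utility | resist] = {resist:.2f}")  # 0.90
# Nothing in this comparison depends on what the goal actually is, which is
# why the pressure toward shutdown resistance is called convergent.
```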
Approaches to Corrigibility
Researchers have proposed several approaches:
Utility indifference: Design the AI to be exactly indifferent between continuing to operate and being shut down, so it gains nothing by either resisting or courting shutdown (a toy sketch follows this list of approaches).
Value learning: Have the AI treat human values as something it is uncertain about and must learn from people, rather than optimizing a fixed objective; that uncertainty gives it a reason to accept correction.
Shutdown problems: Formalize the conditions under which an AI should allow itself to be turned off.
Oversight incentives: Design training to reward acceptance of correction rather than goal achievement alone.
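As an illustration of the first approach, here is a minimal sketch of the utility-indifference idea in Python. The environment, numbers, and function names are assumptions made up for this example, not any proposed implementation; it only shows that adding a compensating term to the shutdown branch makes both branches worth the same to the agent, so it gains nothing by interfering with the shutdown button.

```python
# Minimal sketch of utility indifference (illustrative assumptions throughout):
# add a compensation term so expected utility is identical whether or not the
# agent ends up shut down, removing any incentive to influence that outcome.

U_TASK_IF_RUNNING = 1.0  # utility the agent expects from finishing its task
U_IF_SHUT_DOWN = 0.0     # raw utility if it is switched off early

def corrected_utility(shut_down: bool) -> float:
    """Utility with an indifference correction added on the shutdown branch."""
    if not shut_down:
        return U_TASK_IF_RUNNING
    # Compensate with exactly what the agent would have gotten by running,
    # so both branches are worth the same to it.
    compensation = U_TASK_IF_RUNNING - U_IF_SHUT_DOWN
    return U_IF_SHUT_DOWN + compensation

print(corrected_utility(shut_down=False))  # 1.0
print(corrected_utility(shut_down=True))   # 1.0
```

Actual proposals are considerably more involved (for example, in how the compensation is defined and predicted); this sketch only captures the core indifference condition.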
The Deeper Problem
Perfect corrigibility may be impossible or undesirable. An AI that defers completely to humans provides no safety benefit—it just does whatever it's told, including harmful things. We want AI that's corrigible enough to allow correction but capable enough to refuse clearly unethical commands.
This balance between corrigibility and capability remains an open problem.
