
Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information
Researchers identify a failure mode of self-distillation in math reasoning: privileged context inflates model confidence on structural tokens while suppressing the deliberation signals needed for multi-step search. Anti-Self-Distillation inverts the training objective, maximizing the divergence between student and teacher to preserve exploratory reasoning patterns. This addresses a gap in which standard distillation succeeds on language tasks but fails on reasoning, suggesting that reasoning requires different training dynamics than pattern matching. The finding bears on how teams approach capability scaling in domains that require search and verification.
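
As a rough illustration of the pointwise mutual information in the title, the sketch below computes a per-token PMI between each token and the privileged context, using the standard definition PMI = log p(token | context, privileged) - log p(token | context). It assumes the teacher is the model conditioned on the privileged context and the student is the model without it. The function name, token strings, and probabilities are hypothetical and chosen only for illustration; this is not the authors' implementation, and the summary does not say how the quantity enters the training loss.

```python
import math


def token_pmi(teacher_logprobs, student_logprobs):
    """Per-token PMI between each token and the privileged context.

    Assumption (not from the source): teacher_logprobs[i] is the log-probability
    of token i with the privileged context, and student_logprobs[i] is the
    log-probability of the same token without it.
    """
    if len(teacher_logprobs) != len(student_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    # PMI(token; privileged) = log p(token | privileged) - log p(token)
    return [t - s for t, s in zip(teacher_logprobs, student_logprobs)]


# Hypothetical numbers: a structural token the teacher is much more sure of,
# and a deliberation token the teacher is much less likely to emit.
tokens = ["Therefore", "Wait"]
teacher = [math.log(0.9), math.log(0.05)]
student = [math.log(0.3), math.log(0.2)]

for tok, pmi in zip(tokens, token_pmi(teacher, student)):
    print(f"{tok}: PMI = {pmi:+.3f}")
```

With these made-up numbers the structural token gets a positive PMI and the deliberation token a negative one, which is the pattern the summary describes: privileged context raises confidence on structure and lowers it on deliberation.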




























