
Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why
Researchers have developed a training-free diagnostic framework that addresses a blind spot in on-policy distillation, a technique increasingly used to train reasoning models with dense token-level supervision. Rather than relying on aggregate metrics, the framework pinpoints when teacher guidance helps or hurts individual predictions, and whether the optimal teacher should be selected token by token. This targets a practical bottleneck for teams scaling reasoning models: current evaluation requires expensive training runs that obscure failure modes. Because the framework works at per-token, per-question resolution and needs no training, it lets labs iterate on distillation strategies faster and without costly experimentation.




















