Same Evidence, Different Answers: Canonical-Context On-Policy Distillation for Multi-Turn Language Models

Researchers identify a fundamental failure mode in multi-turn LLM reasoning: models drift from correct answers when information arrives incrementally rather than all at once, even when the total evidence is identical. The root cause is self-anchored drift, where partial-context responses embed unsupported assumptions that contaminate downstream reasoning. Canonical-Context On-Policy Distillation (CCOPD) addresses this by training a student model against a teacher conditioned on complete context, forcing consistency across conversation trajectories. This work matters because production LLMs routinely operate in multi-turn settings where information unfolds gradually, and the gap between single-prompt and incremental performance directly impacts reliability in real-world deployments.
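To make the training setup concrete, here is a minimal sketch of what a canonical-context distillation loss could look like. It is an illustration under stated assumptions, not the paper's implementation: the summary does not say which divergence CCOPD uses, so reverse KL (a common choice in on-policy distillation) is assumed here, and the function names are hypothetical. The student's logits come from the incremental, multi-turn context; the teacher's logits score the same student-sampled tokens but are conditioned on the complete context.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one token position."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single token position.

    Assumption: reverse KL is used; the source does not name the divergence.
    """
    lp = log_softmax(student_logits)
    lq = log_softmax(teacher_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))

def canonical_context_loss(student_logits_per_token, teacher_logits_per_token):
    """Mean per-token divergence over one student-sampled response.

    student_logits_per_token: logits computed with the partial, turn-by-turn context.
    teacher_logits_per_token: logits for the same tokens, computed with the
    complete context presented at once (the "canonical" context).
    """
    pairs = list(zip(student_logits_per_token, teacher_logits_per_token))
    if not pairs:
        return 0.0
    return sum(reverse_kl(s, t) for s, t in pairs) / len(pairs)
```

The loss is zero when the student under partial context already matches the teacher under full context, and grows as the two distributions diverge, which is the consistency pressure the summary describes.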

Modelwire context

Explainer

The key detail the summary leaves implicit is that CCOPD is a training-time intervention, not an inference-time patch. The model isn't being given better prompts or retrieval at runtime; it's being shaped during training to resist the drift that partial context induces, which means the fix travels with the deployed model rather than requiring architectural changes at serving time.

This connects directly to the coherence cluster forming in recent coverage. 'Locally Coherent, Globally Incoherent' (also from late May) identified a structurally similar failure: components that look valid in isolation producing outputs that violate consistency at the system level. CCOPD is essentially attacking the same failure mode one layer down, at the single-model turn boundary rather than the multi-agent boundary. Both papers are converging on the same uncomfortable finding: LLM reasoning is highly sensitive to the order and completeness of information presentation, not just its content.

The real test is whether CCOPD-trained models hold their consistency gains on adversarial multi-turn benchmarks where the incremental information is deliberately misleading rather than merely incomplete. If gains collapse under that condition, the method is correcting for information ordering but not for the deeper anchoring problem.
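One simple way to quantify whether consistency gains hold is to compare, case by case, the answer a model gives with all evidence in a single prompt against the answer it gives when the same evidence arrives over several turns. The sketch below assumes exact-match comparison after basic normalization; the function name and the matching rule are hypothetical and are not taken from the paper.

```python
def consistency_rate(full_context_answers, incremental_answers):
    """Fraction of cases where the single-prompt and multi-turn answers agree.

    Assumption: answers are short strings compared after trimming whitespace
    and lowercasing; a real benchmark may need a task-specific matcher.
    """
    if len(full_context_answers) != len(incremental_answers):
        raise ValueError("answer lists must have the same length")
    if not full_context_answers:
        return 0.0
    matches = sum(
        a.strip().lower() == b.strip().lower()
        for a, b in zip(full_context_answers, incremental_answers)
    )
    return matches / len(full_context_answers)
```

Running the same measurement on a benign incremental set and on a deliberately misleading one would show whether the rate drops under adversarial ordering.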

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: CCOPD · LLM

Read the full story at arXiv cs.CL (arxiv.org).

Modelwire Editorial

This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.
