Decoding in Order-Agnostic Language Models: Chain-Rule Deviation and Uniform Spreading

Researchers studying order-agnostic language models report a fundamental gap between training objectives and inference behavior. When likelihood is computed along different token reveal orders, scores shift by up to 0.49 nats per token, indicating that the learned conditionals do not form a coherent joint distribution. The finding matters because path-dependent artifacts contaminate standard evaluation metrics, mixing genuine content difficulty with order-specific noise. The work also shows that confidence-first decoding, although free to reveal tokens in any order, gravitates toward left-to-right generation on content tokens. For practitioners building or evaluating discrete diffusion models, this suggests that current scoring methods may misrepresent model quality and that the choice of decoding strategy carries hidden structural consequences.

Modelwire context

Explainer

The paper's central claim is that the conditionals learned by order-agnostic models do not combine into a single coherent joint distribution over tokens. A likelihood computed along one reveal order therefore reflects that path as well as the content, which calls into question how these models are currently scored.
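To make the coherence point concrete, here is a minimal two-token sketch. It is an illustration only: the probability tables, the function names (`order_gap`, `conditionals_from_joint`), and the binary vocabulary are all assumptions made up for this example, not values or code from the paper. It shows that conditionals derived from one joint table give the same chain-rule score in both reveal orders, while arbitrary conditionals generally do not.

```python
import math

def bern_logp(p_one, value):
    """Log-probability of a binary value, given P(value = 1)."""
    return math.log(p_one if value == 1 else 1.0 - p_one)

def order_gap(cond, x1, x2):
    """Per-token gap (nats) between the two reveal orders for the pair (x1, x2)."""
    # Reveal x1 first, then x2 given x1.
    l2r = bern_logp(cond["x1"], x1) + bern_logp(cond["x2|x1"][x1], x2)
    # Reveal x2 first, then x1 given x2.
    r2l = bern_logp(cond["x2"], x2) + bern_logp(cond["x1|x2"][x2], x1)
    return abs(l2r - r2l) / 2

def conditionals_from_joint(joint):
    """Derive all four conditionals from one joint table joint[(x1, x2)]."""
    p1 = joint[(1, 0)] + joint[(1, 1)]  # P(x1 = 1)
    p2 = joint[(0, 1)] + joint[(1, 1)]  # P(x2 = 1)
    return {
        "x1": p1,
        "x2": p2,
        "x2|x1": {0: joint[(0, 1)] / (1 - p1), 1: joint[(1, 1)] / p1},
        "x1|x2": {0: joint[(1, 0)] / (1 - p2), 1: joint[(1, 1)] / p2},
    }

# Made-up numbers: one coherent set (from a joint), one incoherent set.
coherent = conditionals_from_joint(
    {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.05, (1, 1): 0.45}
)
incoherent = {
    "x1": 0.5,
    "x2": 0.5,
    "x2|x1": {0: 0.2, 1: 0.9},
    "x1|x2": {0: 0.3, 1: 0.6},
}

print("coherent gap:", order_gap(coherent, 1, 1))
print("incoherent gap:", order_gap(incoherent, 1, 1))
```

For the coherent set the gap is zero up to floating-point error. For the incoherent set the two orders disagree by log(1.5)/2, about 0.20 nats per token, even though every individual conditional is a valid probability.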

This finding bears directly on two recent stories about masked diffusion language models. The DSL-LLaDA work from May 31st showed that continuous denoising can sidestep tradeoffs in few-step decoding, and the D3IM paper from the same day identified preservation bias as a structural model limitation. The new analysis suggests the problem runs deeper: the conditionals these models learn are not mutually consistent, so both the efficiency gains and the self-correction challenges may be harder to reason about than current approaches assume. The confidence-first decoding result also hints that even samplers with no fixed order gravitate toward left-to-right patterns, suggesting path dependence is baked into model behavior, not just evaluation.
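For readers unfamiliar with confidence-first decoding, the following is a generic sketch of the idea: at each step, reveal the masked position whose top token has the highest probability. Everything here is hypothetical scaffolding rather than the paper's implementation. The `predict` interface, the `toy_predict` stub with fixed made-up confidences, and the `left_to_right_fraction` measure are assumptions introduced for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical model interface: given a partially revealed sequence
# (None marks a masked position), return a token distribution for
# every masked position.
Predict = Callable[[List[Optional[str]]], Dict[int, Dict[str, float]]]

def confidence_first_decode(predict: Predict, length: int) -> Tuple[List[Optional[str]], List[int]]:
    """Reveal one token per step, always at the most confident masked position."""
    seq: List[Optional[str]] = [None] * length
    order: List[int] = []
    while any(tok is None for tok in seq):
        dists = predict(seq)
        # Pick the masked position whose best token has the highest probability.
        pos = max(dists, key=lambda i: max(dists[i].values()))
        # Commit the most probable token at that position.
        seq[pos] = max(dists[pos], key=lambda t: dists[pos][t])
        order.append(pos)
    return seq, order

def left_to_right_fraction(order: List[int]) -> float:
    """Fraction of steps that revealed the leftmost still-masked position.

    An illustrative measure only, not the paper's metric.
    """
    remaining = sorted(order)
    hits = 0
    for pos in order:
        if pos == remaining[0]:
            hits += 1
        remaining.remove(pos)
    return hits / len(order)

def toy_predict(seq: List[Optional[str]]) -> Dict[int, Dict[str, float]]:
    """Stub model with fixed, made-up per-position confidences."""
    conf = [0.6, 0.9, 0.7]
    return {
        i: {"a": conf[i], "b": 1.0 - conf[i]}
        for i, tok in enumerate(seq)
        if tok is None
    }

tokens, order = confidence_first_decode(toy_predict, 3)
print(tokens, order, left_to_right_fraction(order))
```

With a real model in place of `toy_predict`, a fraction close to 1.0 would mean the sampler is behaving almost like a left-to-right decoder despite having no order imposed on it.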

If researchers retrain order-agnostic models with explicit joint distribution constraints (e.g., via importance weighting across permutations) and show that the spread in likelihood across reveal orders drops below 0.1 nats per token while generation quality holds, that would point to a training objective problem rather than an inference artifact. If the spread persists, it suggests the models fundamentally cannot learn order-invariant distributions at scale.
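A check of this kind could be sketched as follows: score the same sequence along several random reveal orders and compare the spread with the 0.1 nats per token figure. This is a sketch under stated assumptions. The `cond_logprob` scorer interface is hypothetical, both toy scorers are made up, and because the text does not pin down which statistic is meant, the sketch reports both the range and the standard deviation.

```python
import math
import random
import statistics

def per_token_loglik(cond_logprob, tokens, order):
    """Chain-rule log-likelihood along one reveal order, averaged per token."""
    revealed = set()
    total = 0.0
    for pos in order:
        # cond_logprob is a hypothetical scorer returning
        # log P(tokens[pos] | tokens at the revealed positions), in nats.
        total += cond_logprob(tokens, pos, frozenset(revealed))
        revealed.add(pos)
    return total / len(tokens)

def order_spread(cond_logprob, tokens, num_orders=32, seed=0):
    """Range and standard deviation of per-token scores over random orders."""
    rng = random.Random(seed)
    positions = list(range(len(tokens)))
    scores = []
    for _ in range(num_orders):
        order = positions[:]
        rng.shuffle(order)
        scores.append(per_token_loglik(cond_logprob, tokens, order))
    return {"range": max(scores) - min(scores), "stdev": statistics.pstdev(scores)}

def independent_scorer(tokens, pos, revealed):
    """Toy coherent model: every token is a fair coin, whatever is revealed."""
    return math.log(0.5)

def path_dependent_scorer(tokens, pos, revealed):
    """Toy incoherent model: penalized for each revealed position to its right."""
    later_revealed = sum(1 for r in revealed if r > pos)
    return math.log(0.5) - 0.1 * later_revealed

THRESHOLD_NATS = 0.1  # figure quoted in the text above

sentence = ["the", "cat", "sat", "down"]
for scorer in (independent_scorer, path_dependent_scorer):
    spread = order_spread(scorer, sentence)
    print(scorer.__name__, spread, spread["range"] < THRESHOLD_NATS)
```

The independent scorer yields zero spread because its conditionals come from a single joint distribution. The path-dependent scorer yields a nonzero spread, since its score depends on how many tokens were revealed out of left-to-right order.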

Coverage we drew on

Revise, Don't Freeze: Sampler-Matched Training for Self-Correcting Masked Diffusion Language Models · arXiv cs.CL

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: LLaDA-2.1 · discrete diffusion language models · order-agnostic language models

Read the full story at arXiv cs.CL (arxiv.org).

Modelwire Editorial
