Modelwire

UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models


Researchers propose UDM-GRPO, the first framework to combine Uniform Discrete Diffusion Models with reinforcement learning for stable training. The method treats final samples as actions, reconstructs diffusion trajectories so they align with the pretraining distribution, and introduces efficiency strategies that outperform a naive GRPO integration.
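The paper's exact objective isn't reproduced in this summary, but the "group relative" part of GRPO refers to normalizing each sample's reward against the statistics of a group of samples drawn for the same prompt, rather than training a separate value network. A minimal sketch of that advantage computation (all names here are illustrative, not from the paper):

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Compute GRPO-style advantages for one group of samples.

    Each sample's reward is standardized against the group's mean and
    standard deviation, so samples better than the group average get a
    positive advantage and worse ones a negative advantage.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: rewards for a group of 4 samples generated from one prompt.
rewards = np.array([1.0, 0.5, 0.0, 0.5])
adv = group_relative_advantages(rewards)
# Advantages sum to ~0 within the group; the best sample gets the
# largest positive weight in the policy-gradient update.
```

In UDM-GRPO these advantages would weight the likelihood of the final sample (the "action"), with the diffusion trajectory reconstructed afterward rather than stored during rollout.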

Mentions: UDM-GRPO · Uniform Discrete Diffusion Model · GRPO · Reduced-Step · CFG-Free

Modelwire summarizes — we don’t republish. The full article lives on arxiv.org. If you’re a publisher and want a different summarization policy for your work, see our takedown page.
