Research Tools & Code · arXiv cs.LG · 5d ago

HPO: Hysteretic Policy Optimization for Stable and Efficient Training under Sparse-Reward Regime

Researchers identify two sources of instability in GRPO-style reinforcement learning on sparse rewards: early in training, negative-advantage responses are weighted too heavily, and per-response length normalization skews gradient magnitudes toward longer outputs. Hysteretic Policy Optimization (HPO) addresses both by downweighting disadvantageous updates and switching to mean-length normalization. An adaptive variant tunes the hysteretic coefficient automatically from batch statistics. The fix is minimal but targets a real failure mode in policy training at scale, one that matters more as sparse-reward RL becomes standard for training language models on verifiable tasks.
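To make the two changes concrete, here is a minimal sketch of how per-token loss weights might differ between the two normalization schemes. It is not the paper's implementation: the function names, the `beta` parameter, and the specific form of the downweighting (scaling negative advantages by a constant in [0, 1]) are assumptions for illustration, and the advantage here is simply reward minus the group mean.

```python
from statistics import mean


def group_advantages(rewards):
    """Group-relative advantage: reward minus the group mean (std scaling omitted)."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def per_token_weights(rewards, lengths, beta=0.5, mean_length_norm=True):
    """Weight applied to every token of each response in one sampled group.

    beta is a hypothetical hysteretic coefficient: 1.0 leaves negative
    advantages untouched, smaller values shrink them.
    """
    advantages = group_advantages(rewards)
    # Assumed hysteretic step: keep positive advantages, shrink negative ones.
    scaled = [a if a >= 0 else beta * a for a in advantages]
    # Normalizer: one shared mean length, or each response's own length.
    shared = mean(lengths)
    denoms = [shared if mean_length_norm else n for n in lengths]
    return [a / d for a, d in zip(scaled, denoms)]


# Sparse-reward group: one correct response out of four, with varying lengths.
rewards = [1.0, 0.0, 0.0, 0.0]
lengths = [10, 20, 30, 40]
print(per_token_weights(rewards, lengths, mean_length_norm=False))
print(per_token_weights(rewards, lengths, mean_length_norm=True))
```

In this toy group, per-response normalization gives the three failed responses different per-token weights purely because their lengths differ, while the shared mean-length normalizer gives them identical weights.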

Modelwire context

Explainer

The paper isolates two distinct sources of training collapse in GRPO that have likely gone undiagnosed in practice: negative-advantage responses dominating early gradients, and per-response length normalization systematically biasing the model toward verbose outputs. On the paper's account, these are not theoretical edge cases but concrete pathologies in GRPO-style policy training.

This work sits directly upstream of the In-Context Reward Adaptation paper from the same day. That research assumes reward models can be robust enough to adapt dynamically to new preference distributions. HPO addresses a prerequisite: making sure the training process itself doesn't collapse or skew toward spurious correlations (like length bias) in the first place. Without stable base training, in-context adaptation has less solid ground to work from. Together, the two papers sketch a more complete picture of what robust RLHF pipelines need to survive deployment.

If teams running GRPO-style training on sparse-reward tasks report measurable reductions in training instability or length-bias artifacts after adopting HPO's downweighting and mean-length normalization, that would support the diagnosis. Conversely, if the instability persists with these fixes applied, the real problem likely lies elsewhere (e.g., in how advantage estimates themselves are computed), and the paper has misidentified the bottleneck.

Coverage we drew on

In-Context Reward Adaptation for Robust Preference Modeling · arXiv cs.LG

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: GRPO · Hysteretic Policy Optimization · HPO · Adaptive HPO · TeleLogs · Countdown
