Research · arXiv cs.CL · 2d ago

P²-DPO: Grounding Hallucination in Perceptual Processing via Calibration Direct Preference Optimization

Researchers propose P2-DPO, a refinement to Direct Preference Optimization that targets a specific failure mode in vision-language models: hallucination rooted in weak perceptual grounding rather than language-level errors. The method generates preference pairs on-policy and focuses training on visual robustness in degraded image conditions, addressing a gap in existing DPO approaches that treat vision and language alignment generically. This work matters because it reframes hallucination as a perception problem first, shifting how teams should debug and train multimodal systems, particularly for applications requiring reliable visual understanding under real-world image quality variation.

Modelwire context

Explainer

P2-DPO isolates hallucination as a perceptual calibration failure rather than a language modeling failure. The key insight is that existing DPO methods treat vision and language as interchangeable alignment problems, missing that models can fail to ground what they see before they fail to describe it.
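For reference, the standard DPO objective that P2-DPO refines can be sketched for a single preference pair as below. This is the generic DPO loss only, not the paper's calibrated variant, which this summary does not specify. The function name, argument names, and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Generic DPO loss for one (chosen, rejected) response pair.

    Inputs are summed token log-probabilities of each response under the
    policy being trained and under a frozen reference model.
    NOTE: this is plain DPO, not P2-DPO's calibrated objective.
    """
    # How much more the policy favors each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled preference margin; beta is an assumed default, not from the paper.
    margin = beta * (chosen_logratio - rejected_logratio)

    # loss = -log(sigmoid(margin)) = log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss falls as the policy raises the probability of the chosen response relative to the rejected one, measured against the reference model. In a vision-language setting, both log-probabilities would also be conditioned on the image, which is where a perception-focused variant would presumably intervene.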

This connects directly to the hallucination rejection sampling work from earlier today (SHARS), which tackles hallucination propagation at inference time. Where SHARS filters unreliable segments mid-generation, P2-DPO addresses the upstream problem: weak visual grounding that seeds hallucinations before language generation even begins. The two approaches are complementary rather than competitive. P2-DPO also echoes the ProtoAda paper's insight that multimodal systems need task-specific routing beyond surface similarity, here applied to the specific task of robust visual perception under degraded conditions.

A useful test: if P2-DPO shows measurable gains on vision-language benchmarks that explicitly test degraded image quality (low resolution, compression artifacts, occlusion) but little or no improvement on standard clean-image benchmarks, that supports the claim that the method improves perception robustness specifically rather than general alignment. If the same gains appear on both, the improvement may simply reflect better general alignment, and the perception-specific contribution is narrower than claimed.
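The check described above can be written as a small helper. This is a minimal sketch: the function name, the verdict labels, and the `min_gain` threshold are hypothetical, and the scores would come from whatever clean and degraded benchmarks a team actually runs.

```python
def diagnose_gains(base_clean, tuned_clean, base_degraded, tuned_degraded,
                   min_gain=1.0):
    """Compare benchmark gains on clean vs. degraded images.

    Scores are accuracy-style numbers where higher is better.
    min_gain is an assumed threshold for calling a gain meaningful.
    Returns (clean_gain, degraded_gain, verdict).
    """
    clean_gain = tuned_clean - base_clean
    degraded_gain = tuned_degraded - base_degraded

    if degraded_gain < min_gain:
        # No meaningful improvement where the method is supposed to help.
        verdict = "no robustness gain"
    elif clean_gain < min_gain:
        # Gains only under degradation: consistent with perception robustness.
        verdict = "perception-specific"
    else:
        # Gains everywhere: cannot be separated from general alignment.
        verdict = "general alignment"
    return clean_gain, degraded_gain, verdict


# Placeholder scores for illustration only; these are not results from the paper.
print(diagnose_gains(70.0, 70.5, 50.0, 58.0))
```

A single threshold is a simplification. In practice the comparison would need confidence intervals across seeds before reading much into either verdict.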

Coverage we drew on

Building Reliable Long-Form Generation via Hallucination Rejection Sampling · arXiv cs.CL

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: Large Vision-Language Models · Direct Preference Optimization · P2-DPO


