
Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs
Researchers have identified a critical failure mode in multimodal large language models where visual reasoning tokens become semantically rich during training but are systematically ignored during inference, a phenomenon termed Silenced Visual Latents. The model defaults to shortcuts using direct visual input rather than leveraging the latent reasoning space, undermining the efficiency gains of continuous latent-space reasoning over explicit chain-of-thought. This work exposes a fundamental optimization pathology in how shared parameter spaces handle competing objectives, with implications for how future MLLMs should architect their reasoning pathways to prevent learned representations from being suppressed by simpler input shortcuts.









