
UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs
UniVLR addresses a core inefficiency in multimodal reasoning: the fragmentation of thought across separate text and vision pathways. Rather than interleaving chain-of-thought text with visual tokens, this framework unifies both into a shared visual workspace, compressing the combined representation into compact latent tokens that the model reasons through at inference time. This shift from dual-channel to unified latent reasoning could meaningfully reduce computational overhead and improve coherence in vision-language tasks, signaling a maturing approach to how LLMs integrate reasoning across modalities.
