Modelwire

OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation


Researchers propose OneVL, a vision-language model that compresses chain-of-thought (CoT) reasoning into latent tokens for real-time autonomous driving. The approach pairs a vision-language-action (VLA) model with a world model so the latent tokens capture causal dynamics of the scene rather than purely linguistic descriptions, addressing the latency bottleneck of current CoT methods.

Modelwire context

Explainer

The key move here isn't just speed: by routing reasoning through latent tokens tied to a world model, OneVL encodes causal structure about how the environment evolves, not just linguistic descriptions of it. That's a meaningful architectural choice, not merely a compression trick.
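To make the "latent tokens instead of a text chain" idea concrete, here is a minimal toy sketch. It is not OneVL's actual architecture (the paper's module names, dimensions, and training setup are not in the summary above); it only illustrates the general pattern of producing a fixed set of latent reasoning tokens in a single cross-attention pass, rather than autoregressively decoding a chain-of-thought in text.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy dimensions: D = embedding width, K = number of latent reasoning
# tokens, T = length of the visual feature sequence from the encoder.
D, K, T = 32, 4, 16

# Learned queries standing in for reasoning slots. In a trained model
# these would be optimized end-to-end; here they are random placeholders.
latent_queries = rng.normal(size=(K, D))
visual_feats = rng.normal(size=(T, D))  # stand-in for image tokens

def one_step_latent_reasoning(queries, feats):
    """One cross-attention pass: each latent query pools over the visual
    features, yielding K latent 'thought' tokens in a single forward
    step instead of an autoregressive text chain."""
    attn = softmax(queries @ feats.T / np.sqrt(D))  # (K, T) weights
    return attn @ feats                             # (K, D) latents

latents = one_step_latent_reasoning(latent_queries, visual_feats)
# A downstream planner head would consume `latents` directly: no token
# decoding loop, so latency is one pass regardless of reasoning depth.
```

The latency argument falls out of the shape of the computation: a text CoT costs one forward pass per generated token, while this pattern costs one pass total, with K fixed in advance.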

The latent-compression angle connects directly to K-Token Merging (covered April 16), which similarly collapses token sequences in embedding space to cut inference overhead. Both papers are converging on the same intuition: that the vocabulary layer is a bottleneck worth bypassing for certain tasks. Where K-Token Merging targets general LLM inference, OneVL applies the same pressure specifically to the planning loop in autonomous driving, where milliseconds matter in ways they don't for a chatbot. Also worth noting: SpecGuard (April 16) attacked latency from the decoding side via speculative verification. OneVL attacks it from the representation side. These are complementary approaches to the same wall researchers keep hitting when trying to run chain-of-thought in real-time settings.
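The shared intuition, collapsing token sequences in embedding space rather than at the vocabulary layer, can be sketched in a few lines. This is an illustrative toy (mean-pooling fixed windows), not K-Token Merging's actual algorithm, whose merge criterion is not described in the summary above.

```python
import numpy as np

def merge_adjacent(embs: np.ndarray, k: int) -> np.ndarray:
    """Collapse each window of k adjacent token embeddings into their
    mean, shrinking a length-T sequence to ceil(T / k) tokens. A toy
    version of merging in embedding space instead of decoding to text."""
    T, D = embs.shape
    pad = (-T) % k
    if pad:
        embs = np.vstack([embs, np.zeros((pad, D))])
        # Weight the last window so padded zeros don't dilute its mean.
        counts = np.full(((T + pad) // k, 1), k, dtype=float)
        counts[-1] = k - pad
    else:
        counts = np.full((T // k, 1), k, dtype=float)
    return embs.reshape(-1, k, D).sum(axis=1) / counts

seq = np.arange(12, dtype=float).reshape(6, 2)  # 6 tokens, dim 2
merged = merge_adjacent(seq, 3)                 # 6 tokens -> 2 tokens
```

Downstream layers then attend over 2 merged vectors instead of 6 originals, which is where the inference savings come from in both papers' settings.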

The credibility test is whether OneVL's latency gains hold on closed-loop driving benchmarks like nuPlan or CARLA under distribution shift, not just the offline evaluations typical in arXiv papers. If an autonomous driving lab picks this up for on-vehicle testing within the next six months, the world-model framing has legs; if it stays in simulation, the causal-dynamics claim remains unverified.

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: OneVL · Vision-Language Models · Chain-of-Thought reasoning · Autonomous driving

Modelwire summarizes — we don’t republish. The full article lives on arxiv.org. If you’re a publisher and want a different summarization policy for your work, see our takedown page.
