Modelwire
Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval

ToolMerge is a decomposition-based approach to keyframe retrieval for long-form video QA: an LLM planner breaks a user query into discrete tool calls and specifies how their rankings combine via Boolean logic. This addresses a limitation of existing systems, which either treat the query as a single monolithic unit or force it into a rigid schema. The authors validate the method on Molmo-2 Moments, a newly constructed benchmark that grounds each question to specific temporal intervals, so retrieval accuracy can be measured directly. The work signals growing sophistication in multimodal reasoning pipelines, where query understanding and tool orchestration become first-class concerns rather than afterthoughts in video understanding systems.
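To make "direct measurement of retrieval accuracy" concrete, here is a minimal sketch of how interval-grounded questions could be scored. It is an illustration, not the benchmark's actual protocol: the `(start, end)` interval format in seconds, the function names, and the choice of hit@k and precision@k as metrics are all assumptions.

```python
from typing import List, Tuple

# Hypothetical format: a ground-truth interval as (start_sec, end_sec).
Interval = Tuple[float, float]


def frame_in_intervals(t: float, intervals: List[Interval]) -> bool:
    """True if timestamp t falls inside any ground-truth interval (inclusive)."""
    return any(start <= t <= end for start, end in intervals)


def hit_at_k(ranked_times: List[float], intervals: List[Interval], k: int) -> float:
    """1.0 if at least one of the top-k retrieved frames lands in an interval."""
    return 1.0 if any(frame_in_intervals(t, intervals) for t in ranked_times[:k]) else 0.0


def precision_at_k(ranked_times: List[float], intervals: List[Interval], k: int) -> float:
    """Fraction of the top-k retrieved frames that land in an interval."""
    top = ranked_times[:k]
    if not top:
        return 0.0
    return sum(frame_in_intervals(t, intervals) for t in top) / len(top)


# Illustrative data only: frame timestamps in ranked order, one grounded interval.
ranked = [12.0, 340.5, 95.0, 101.0]
truth = [(90.0, 110.0)]
print(hit_at_k(ranked, truth, k=1), hit_at_k(ranked, truth, k=3))
print(precision_at_k(ranked, truth, k=4))
```

Because each question is tied to intervals rather than only to a final answer, a score like this isolates the retriever from the downstream answering model.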

Modelwire context

Explainer

The key insight is that ToolMerge treats query understanding as a planning problem, not a retrieval problem. Rather than asking 'what frames match this question,' the system asks 'what sub-questions does this query contain, and how should their results combine.' This inversion matters because it moves the LLM from passenger (ranking frames) to architect (designing the retrieval strategy itself).
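A rough sketch of what "sub-questions plus a combination rule" could look like in code. Everything here is hypothetical: the nested-tuple plan format, the tool names, the per-frame scores in [0, 1], and the use of min / max / complement as soft versions of AND / OR / NOT are assumptions made for illustration, not the paper's actual formulation.

```python
from typing import Dict, List, Union

# Hypothetical plan format: either a tool-call name (str) or a tuple
# ("and" | "or" | "not", sub_plan, ...), as an LLM planner might emit.
Plan = Union[str, tuple]


def combine(plan: Plan, scores: Dict[str, List[float]]) -> List[float]:
    """Evaluate a plan into one score per frame.

    Assumes each tool returned a score in [0, 1] for every frame.
    AND -> min, OR -> max, NOT -> 1 - score (a common fuzzy-logic choice).
    """
    if isinstance(plan, str):
        return scores[plan]
    op, *args = plan
    parts = [combine(arg, scores) for arg in args]
    if op == "not":
        return [1.0 - s for s in parts[0]]
    if op == "and":
        return [min(col) for col in zip(*parts)]
    if op == "or":
        return [max(col) for col in zip(*parts)]
    raise ValueError(f"unknown operator: {op!r}")


def top_k(frame_scores: List[float], k: int) -> List[int]:
    """Indices of the k highest-scoring frames, best first."""
    order = sorted(range(len(frame_scores)), key=lambda i: frame_scores[i], reverse=True)
    return order[:k]


# Illustrative query: "the dog in the kitchen when nobody is around".
tool_scores = {
    "dog": [0.9, 0.8, 0.1, 0.7],
    "kitchen": [0.2, 0.9, 0.9, 0.6],
    "person": [0.1, 0.7, 0.0, 0.1],
}
plan = ("and", "dog", "kitchen", ("not", "person"))
print(top_k(combine(plan, tool_scores), k=2))
```

The point of the sketch is the division of labor: the planner only produces `plan`, the tools only produce per-frame scores, and the final ranking falls out of evaluating the plan over those scores.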

This connects directly to ETCHR's core finding from earlier this month: task decomposition and specialized components outperform unified end-to-end systems for fine-grained reasoning. Where ETCHR separated image editing from language understanding, ToolMerge separates query planning from frame ranking. Both papers reject the assumption that a single model should handle heterogeneous subtasks. The difference is scope: ETCHR works on static images, while ToolMerge scales decomposition to temporal reasoning over hours of video, suggesting the principle generalizes beyond single-frame problems.

If ToolMerge's gains carry over to existing video QA benchmarks (such as NExT-QA or Temporal Reasoning for Video Understanding), without relying on the new Molmo-2 Moments dataset, that would indicate the decomposition strategy itself is robust. If the gains vanish outside Molmo-2 Moments, the contribution is primarily a better benchmark rather than a better method.

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: ToolMerge · Molmo-2 Moments · LLM

Modelwire Editorial

This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.

Modelwire summarizes, we don’t republish. The full content lives on arxiv.org. If you’re a publisher and want a different summarization policy for your work, see our takedown page.
