
From Scenes to Elements: Multi-Granularity Evidence Retrieval for Verifiable Multimodal RAG
A new benchmark and retrieval framework address a critical gap in multimodal RAG systems: current architectures retrieve at scene-level granularity, which obscures the specific visual elements that support a generated claim. GranuVistaVQA and GranuRAG shift retrieval to element-level units, enabling finer-grained evidence attribution and verifiability. This matters because, as multimodal systems move into high-stakes domains, coarse retrieval creates both accuracy and auditability problems. The work signals growing pressure on RAG builders to decompose visual evidence into interpretable, attributable components rather than treating images as atomic units.
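
To make the scene-level versus element-level distinction concrete, here is a minimal toy sketch. It is not GranuRAG's implementation: the `Element` record, the text descriptions standing in for visual elements, the bag-of-words cosine scoring, and the function names are all assumptions chosen for illustration. It only shows why indexing smaller units lets a retriever return an identifier for the specific piece of evidence rather than for the whole image.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """Hypothetical element-level unit: one described region of one scene."""
    scene_id: str
    element_id: str
    description: str


def _bow(text):
    # Bag-of-words vector; a stand-in for a real multimodal embedding.
    return Counter(text.lower().split())


def _cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_scenes(query, elements, k=1):
    """Scene-level baseline: each scene is one atomic unit of evidence."""
    scenes = {}
    for e in elements:
        scenes.setdefault(e.scene_id, []).append(e.description)
    q = _bow(query)
    scored = [(_cosine(q, _bow(" ".join(d))), sid) for sid, d in scenes.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(sid, s) for s, sid in scored[:k] if s > 0]


def retrieve_elements(query, elements, k=2):
    """Element-level retrieval: every hit carries its own element_id."""
    q = _bow(query)
    scored = [(_cosine(q, _bow(e.description)), e) for e in elements]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(e, s) for s, e in scored[:k] if s > 0]


# Made-up example data, not drawn from GranuVistaVQA.
ELEMENTS = [
    Element("scene_1", "scene_1/e1", "red stop sign on the corner"),
    Element("scene_1", "scene_1/e2", "blue bicycle leaning on a fence"),
    Element("scene_2", "scene_2/e1", "wooden bench in a park"),
]

if __name__ == "__main__":
    query = "blue bicycle"
    print("scene-level:", retrieve_scenes(query, ELEMENTS))
    for element, score in retrieve_elements(query, ELEMENTS):
        print("element-level:", element.element_id, round(score, 3))
```

In this toy setup the scene-level baseline can only say that the evidence is somewhere in `scene_1`, while the element-level retriever points at `scene_1/e2`, an identifier that a generated claim could cite and an auditor could check.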





















