
VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents
Conversational AI has largely ignored the visual and gestural layer of human interaction, treating dialogue as speech-only. VideoFDB addresses this gap by introducing the first benchmark for evaluating agents that must both perceive and generate nonverbal cues alongside audio in real-time two-way exchanges. The dataset spans 237 video call clips annotated for 11 distinct conversational dynamics, paired with a rubric-based evaluation framework that separates perception from generation tasks. This work signals a maturation in multimodal agent design, pushing the field beyond speech-centric full-duplex systems toward embodied conversational intelligence that mirrors human social presence.























