Attributed Question Answering (AQA) aims to enhance the reliability of AI-generated answers by including references for each statement, helping users validate the provided information. However, existing work on AQA has primarily focused on text-only input and has largely overlooked the role of multimodality. We introduce MAVis, the first benchmark designed to evaluate end-to-end systems on understanding user intent behind visual questions, retrieving evidence from multimodal documents, and generating answers with citations. Our dataset comprises 157K visual QA instances, where each answer is annotated with sentence-level citations referring to multimodal documents. We develop automatic metrics along three dimensions -- informativeness, groundedness, and fluency -- and demonstrate their strong correlation with human judgments. Our key findings are threefold: (1) LVLMs in a multimodal RAG setting generate more informative and fluent answers than in unimodal RAG, but exhibit weak groundedness for image documents, a gap that is amplified in multimodal settings. (2) Given the same multimodal documents, there is a trade-off between informativeness and groundedness across different prompting methods. (3) Our proposed method highlights that mitigating contextual bias in interpreting image documents is a crucial direction for future research.
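The abstract mentions validating automatic metrics against human judgments. As a minimal illustrative sketch (not the paper's actual protocol, and with hypothetical scores), such a validation can be done by correlating per-answer metric scores with human ratings, e.g. via a Pearson correlation:

```python
# Illustrative sketch: checking how well an automatic metric (e.g. a
# groundedness score) tracks human ratings, using Pearson correlation.
# All scores below are hypothetical, not taken from the paper.

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-answer scores: automatic metric vs. 1-5 human ratings.
auto_scores = [0.9, 0.4, 0.7, 0.2, 0.8]
human_scores = [5, 2, 4, 1, 4]

r = pearson(auto_scores, human_scores)
print(round(r, 3))  # → 0.992
```

A high coefficient on held-out annotations is the usual evidence that an automatic metric can stand in for human evaluation at scale.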