Audio descriptions (ADs) are indispensable for blind or visually impaired individuals (BVIs), enabling them to understand the narrative and appreciate the visual diversity of movies. Interest in automatic AD generation for trimmed clips has exploded, and many new metrics have been proposed. However, these metrics typically compare a prediction against a single ground-truth AD. We posit that ADs should not be treated as independent captions and pivot to a video-level evaluation. We propose ADQA, a question-answering benchmark that evaluates whether generated ADs would help BVIs appreciate and understand the story. We motivate the QA framework by quantifying the subjective nature of ADs through an alignment between two AD sources for the same movie. ADQA features visual appreciation (VA) questions about specific visual facts and narrative understanding (NU) questions created from plot sentences associated with the videos. Evaluation of current AD generation methods shows a large gap to human performance, which we estimate using the second AD source. Based on our findings, we provide several recommendations for future work on AD generation.
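To make the video-level QA idea concrete, here is a minimal, hypothetical sketch (not the authors' implementation): concatenate a movie's generated ADs, answer multiple-choice questions from that text, and report accuracy. The toy answerer below just picks the option with the highest word overlap against the AD text; a real benchmark would use a much stronger QA model, and the example ADs and questions are invented for illustration.

```python
def answer(ad_text: str, options: list[str]) -> int:
    """Pick the option sharing the most words with the AD text (toy QA model)."""
    words = set(ad_text.lower().split())
    overlaps = [len(words & set(opt.lower().split())) for opt in options]
    return overlaps.index(max(overlaps))

def qa_accuracy(ads: list[str], questions: list[dict]) -> float:
    """Video-level score: fraction of questions answered correctly
    from the concatenated ADs. Each question is
    {'options': [str, ...], 'answer': correct_index}."""
    text = " ".join(ads)
    correct = sum(answer(text, q["options"]) == q["answer"] for q in questions)
    return correct / len(questions)

# Invented example data for illustration only.
ads = ["A woman in a red coat boards a night train.",
       "She hides a letter inside her coat pocket."]
questions = [{"options": ["a red coat", "a blue hat"], "answer": 0},
             {"options": ["a letter", "a key"], "answer": 0}]
print(qa_accuracy(ads, questions))  # 1.0
```

The key design point, reflecting the abstract's argument, is that the unit of evaluation is the full set of ADs for a video rather than a clip-level caption-to-reference comparison.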