The embodiment of emotional reactions in body parts carries rich information about our affective experiences. We propose a framework that utilizes state-of-the-art large vision language models (LVLMs) to generate Embodied LVLM Emotion Narratives (ELENA). These are well-defined, multi-layered text outputs, mainly comprising narrative-based descriptions focused on the salient body parts involved in emotional reactions. We also employ attention maps and observe that contemporary models exhibit a persistent bias toward the facial region. Despite this limitation, we find that LVLMs can effectively recognize embodied emotions in face-masked images, outperforming naive baseline prompts. They achieve this without any fine-tuning when guided by the ELENA framework. ELENA charts a new trajectory for embodied-emotion analysis in the vision modality and enriches modeling in an affect-aware setting.