
Composed Image Retrieval (CIR) aims to retrieve specific images using a hybrid-modality query consisting of a reference image and a relative caption that describes the user intent. Recent studies use Vision-Language Pre-training Models (VLPMs) with various fusion strategies to address the task. However, these approaches typically fail to meet two requirements of CIR at the same time: comprehensively extracting visual information and faithfully following the user intent, both of which are crucial for obtaining the desired query information. In this work, we propose CIR-LVLM, a novel framework that leverages a large vision-language model (LVLM) as a powerful user-intent-aware encoder to better meet these requirements. Our motivation is to exploit the advanced reasoning and instruction-following capabilities of LVLMs to accurately understand and respond to the user intent. Notably, we design a novel intent instruction module that provides explicit intent guidance at two levels: (1) a task prompt clarifies the task requirement and helps the model discern user intent at the task level; (2) an instance-specific soft prompt is adaptively selected from a learnable prompt pool, which enables the model to comprehend user intent at the instance level better than a single universal prompt shared by all instances. CIR-LVLM achieves state-of-the-art performance across three prominent benchmarks. We believe this study provides fundamental insights into CIR-related fields.
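
To make the two-level intent instruction more concrete, the following is a minimal sketch of how an instance-specific soft prompt might be picked from a prompt pool and combined with a task prompt. The abstract does not specify the selection mechanism, so everything here is an assumption for illustration: the key-matching scheme based on cosine similarity, the names `select_prompts` and `build_input`, the `top_k` parameter, and the task prompt wording are all hypothetical. Plain Python lists stand in for what would be learnable tensors in a real model.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_prompts(
    query_feature: Sequence[float],
    pool_keys: List[Sequence[float]],
    pool_prompts: List[str],
    top_k: int = 1,
) -> Tuple[List[int], List[str]]:
    """Pick the top_k pool entries whose keys best match the query feature.

    Assumption: each pool entry has a key vector, and selection is by
    cosine similarity. The source only says the prompt is "adaptively
    selected"; this matching rule is a stand-in.
    """
    scores = [cosine(query_feature, key) for key in pool_keys]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = ranked[:top_k]
    return chosen, [pool_prompts[i] for i in chosen]


def build_input(task_prompt: str, soft_prompts: List[str], caption: str) -> List[str]:
    """Assemble the encoder input: task-level prompt, instance-level prompts, caption."""
    return [task_prompt, *soft_prompts, caption]


# Hypothetical task prompt; the wording used by CIR-LVLM is not given in the source.
TASK_PROMPT = "Modify the reference image according to the caption."

# Toy pool with three entries; the strings are placeholders for learnable embeddings.
keys = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
prompts = ["<soft_0>", "<soft_1>", "<soft_2>"]

indices, selected = select_prompts([1.0, 0.0], keys, prompts, top_k=1)
model_input = build_input(TASK_PROMPT, selected, "make the dress red")
```

In this toy setup the query feature is closest to the second key, so `<soft_1>` is selected and placed between the task prompt and the relative caption. A trained system would learn both the keys and the prompt embeddings end to end rather than use fixed values.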