
AAAI 2025

February 28, 2025

Philadelphia, United States


Composed Image Retrieval (CIR) aims to retrieve specific images using a hybrid-modality query consisting of a reference image and a relative caption that describes the user intent. Recent studies attempt to address the task with Vision-Language Pre-training Models (VLPMs) combined with various fusion strategies. However, these approaches typically fail to meet two requirements of CIR simultaneously: comprehensively extracting visual information and faithfully following the user intent, both of which are crucial for obtaining the desired query information. In this work, we propose CIR-LVLM, a novel framework that leverages a large vision-language model (LVLM) as a powerful user intent-aware encoder to better meet these requirements. Our motivation is to exploit the advanced reasoning and instruction-following capabilities of LVLMs to accurately understand and respond to the user intent. Notably, we design a novel intent instruction module to provide explicit intent guidance at two levels: (1) A task prompt clarifies the task requirement and helps the model discern user intent at the task level. (2) An instance-specific soft prompt is adaptively selected from a learnable prompt pool, which enables the model to better comprehend the user intent at the instance level than a universal prompt shared by all instances. CIR-LVLM achieves state-of-the-art performance across three prominent benchmarks. We believe this study provides fundamental insights into CIR-related fields.
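
The instance-level selection from a prompt pool can be illustrated with a small sketch. The abstract does not say how the selection is performed, so everything below is an assumption made for illustration: each pool entry is given a key vector, the query feature is compared to the keys by cosine similarity, and the best-matching soft prompt is returned. The names `cosine`, `select_soft_prompt`, `pool_keys`, and `pool_prompts` are hypothetical and are not taken from CIR-LVLM. In a trained model the keys and prompts would be learnable parameters rather than fixed lists.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_soft_prompt(query_feat, pool_keys, pool_prompts):
    """Return (index, prompt) of the pool entry whose key best matches the query.

    Assumption: key-query cosine matching. The abstract only states that the
    prompt is "adaptively selected" from a learnable pool.
    """
    scores = [cosine(query_feat, key) for key in pool_keys]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, pool_prompts[best]


# Toy pool with three entries. Each prompt is a short list of token embeddings.
pool_keys = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
pool_prompts = [
    [[0.1, 0.1], [0.2, 0.2]],
    [[0.3, 0.3], [0.4, 0.4]],
    [[0.5, 0.5], [0.6, 0.6]],
]

# Hypothetical feature of one (reference image, relative caption) query.
query_feat = [0.1, 0.9, 0.2]

best_index, soft_prompt = select_soft_prompt(query_feat, pool_keys, pool_prompts)
```

Here the query feature is closest to the second key, so the second prompt is selected. A universal prompt would correspond to a pool of size one, which returns the same prompt for every query.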
