Composed Image Retrieval (CIR) enables image search by combining a reference image with modification text, providing a flexible interface that adapts to personalized visual preferences. Unfortunately, intrinsic noise in CIR triplets introduces training uncertainty and threatens model robustness. Although probabilistic learning approaches exist for multi-modal retrieval, they fall short for CIR due to their instance-level holistic modeling and their homogeneous treatment of queries and targets. In this paper, we propose a Heterogeneous Uncertainty-Guided (HUG) paradigm to overcome these limitations. HUG employs a fine-grained probabilistic learning framework in which queries and targets are represented by Gaussian embeddings that capture detailed concepts and uncertainties. We customize heterogeneous uncertainty estimation for multi-modal queries and uni-modal targets. In particular, given a query, we capture uncertainty not only in uni-modal content quality but also in multi-modal coordination, followed by a provable dynamic weighting mechanism that derives the comprehensive query uncertainty. We further design uncertainty-guided objectives, including a query-target holistic contrast and fine-grained contrasts with comprehensive negative sampling strategies, which effectively enhance discriminative learning. Experiments on benchmarks demonstrate HUG's effectiveness beyond state-of-the-art baselines, with faithful analyses justifying each technical contribution. More encouragingly, the learned representations correlate intuitively with certain image and text attributes, with uncertainty magnitudes reflecting degrees of ambiguity, offering interpretability insights. We will release the code, detailed instructions, and training files.
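To make the idea of an uncertainty-guided contrastive objective concrete, the sketch below shows one plausible form: each query is represented by a Gaussian (a mean vector plus a per-dimension log-variance), and its contribution to an InfoNCE-style loss is down-weighted when its estimated uncertainty is high. The function name, the use of the mean log-variance as the uncertainty score, and the exponential weighting are all illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def uncertainty_weighted_contrastive(q_mu, q_logvar, t_mu, temperature=0.07):
    """Illustrative InfoNCE-style contrast where each query's loss term is
    down-weighted by its estimated uncertainty.

    q_mu:     (B, D) query mean embeddings (rows assumed L2-normalized)
    q_logvar: (B, D) query log-variances (larger => more ambiguous query)
    t_mu:     (B, D) target mean embeddings; matched pairs share a row index
    """
    logits = q_mu @ t_mu.T / temperature                  # (B, B) similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    per_query = -np.diag(log_probs)                       # NLL of the matched target
    # Hypothetical weighting: confident (low-variance) queries count more.
    weight = np.exp(-q_logvar.mean(axis=1))
    return float((weight * per_query).mean())
```

A target-side analogue would use uni-modal (image-only) uncertainty in the same role, reflecting the abstract's point that query and target uncertainties are estimated heterogeneously.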
