United States

Although progress has been made in Composed Image Retrieval (CIR), we empirically find that a certain percentage of failed retrieval results are not consistent with their relative captions. To address this issue, this work provides a Visual Question Answering (VQA) perspective to boost the performance of CIR. The resulting VQA4CIR is a post-processing approach and can be directly plugged into existing CIR methods. Given the top-C images retrieved by a CIR method, VQA4CIR aims to decrease the adverse effect of failed retrieval results that are inconsistent with the relative caption. To find the retrieved images inconsistent with the relative caption, we resort to a "QA generation → VQA" self-verification pipeline. For QA generation, we suggest fine-tuning an LLM (e.g., LLaMA) to generate several pairs of questions and answers from each relative caption. We then fine-tune an LVLM (e.g., LLaVA) to obtain the VQA model. By feeding the retrieved image and question to the VQA model, one can identify an image as inconsistent with the relative caption when the answer given by the VQA model is inconsistent with the answer in the QA pair. Consequently, the CIR performance can be boosted by modifying the ranks of inconsistently retrieved images. Experimental results show that our proposed method outperforms state-of-the-art CIR methods on the CIRR and Fashion-IQ datasets.
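The re-ranking idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `rerank_by_vqa`, the `vqa_answer` callable (a stand-in for the fine-tuned LVLM), the exact-match answer comparison, and the rule of moving inconsistent images behind the consistent ones within the top-C are all assumptions, since the abstract only says that the ranks of inconsistent images are modified.

```python
from typing import Callable, List, Sequence, Tuple

# (question, expected answer) pair generated from the relative caption.
QAPair = Tuple[str, str]


def rerank_by_vqa(
    ranked_images: Sequence[str],
    qa_pairs: Sequence[QAPair],
    vqa_answer: Callable[[str, str], str],
    top_c: int,
) -> List[str]:
    """Demote top-C images whose VQA answers contradict the QA pairs.

    Assumption: an image counts as consistent only if every VQA answer
    matches the expected answer (case-insensitive exact match), and
    inconsistent images are placed after the consistent ones but still
    ahead of everything ranked below the top-C.
    """
    head = list(ranked_images[:top_c])
    tail = list(ranked_images[top_c:])
    consistent: List[str] = []
    inconsistent: List[str] = []
    for image in head:
        ok = all(
            vqa_answer(image, question).strip().lower() == answer.strip().lower()
            for question, answer in qa_pairs
        )
        (consistent if ok else inconsistent).append(image)
    # Original order is preserved inside each group.
    return consistent + inconsistent + tail
```

In practice `vqa_answer` would wrap a call to the VQA model, and the strict all-answers-must-match rule could be relaxed to a vote or a score threshold.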

AAAI 2025

VQA4CIR: Boosting Composed Image Retrieval with Visual Question Answering

multimodal vision

poster

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

### [Invited Speakers](https://aaai.org/conference/aaai/aaai-25/aaai-25-invited-speakers/)

Register [here](https://aaai.org/conference/aaai/aaai-25/registration/)

Person Re-IDentification (ReID) aims to identify specific persons from non-overlapping cameras. Recently, some works have suggested using large-scale pre-trained vision-language models like CLIP to boost ReID performance. Unfortunately, existing methods still struggle to address two key issues simultaneously: efficiently transferring the knowledge learned from CLIP and comprehensively extracting the context information from images or videos. To address these issues, we introduce CLIMB-ReID, a pioneering hybrid framework that synergizes the impressive power of CLIP with the remarkable computational efficiency of Mamba. Specifically, we first propose a novel Multi-Memory Collaboration (MMC) strategy to transfer CLIP's knowledge in a parameter-free and prompt-free form. Then, we design a Multi-Temporal Mamba (MTM) to capture multi-granular spatiotemporal information in videos. Finally, with Importance-aware Reorder Mamba (IRM), information from various scales is combined to produce robust sequence features. Extensive experiments show that our proposed method outperforms other state-of-the-art methods on both image and video person ReID benchmarks. We will release the source code for reproduction.

CLIMB-ReID: A Hybrid CLIP-Mamba Framework for Person Re-Identification

Concept-based methods have emerged as a promising direction to develop interpretable neural networks in standard supervised settings. However, most works that study them in incremental settings assume either a static concept set across all experiences or assume that each experience relies on a distinct set of concepts. In this work, we study concept-based models in a more realistic, dynamic setting where new classes may rely on older concepts in addition to introducing new concepts themselves. We show that concepts and classes form a complex web of relationships, which is susceptible to degradation and needs to be preserved and augmented across experiences. We introduce new metrics to show that existing concept-based models cannot preserve these relationships even when trained using methods to prevent catastrophic forgetting, since they cannot handle forgetting at concept, class, and concept-class relationship levels simultaneously. To address these issues, we propose a novel method - $\textbf{MuCIL}$ - that uses multimodal concepts to perform classification without increasing the number of trainable parameters across experiences. The multimodal concepts are aligned to concepts provided in natural language, making them interpretable by design. Through extensive experimentation, we show that our approach obtains state-of-the-art classification performance compared to other concept-based models, achieving over 2$\times$ the classification performance in some cases. We also study the ability of our model to perform interventions on concepts, and show that it can localize visual concepts in input images, providing post-hoc interpretations.

Walking the Web of Concept-Class Relationships in Incrementally Trained Interpretable Models

Blended-target domain adaptation (BTDA) leverages learned source knowledge to adapt the model to a blended-target domain that is composed of multiple unlabeled sub-target domains with distinct statistical characteristics. The existing BTDA methods usually overlook semantic correlation information across multiple domains and domain shifts among sub-target domains, resulting in suboptimal adaptation performance. To fully harness semantic knowledge and alleviate domain shifts in hybrid data distribution, we propose a collaborative semantic consistency alignment (CSCA) method for BTDA. Specifically, we achieve distribution alignment by minimizing the sliced Wasserstein distance between the source and target feature distributions. To alleviate complex domain shifts among all sub-target domains in the hybrid feature space, we design graph networks to propagate and share semantic knowledge across domains, which reduces semantic discrepancies among multiple domains. Additionally, we propose a double consistency regularization method to reduce the susceptibility of the model to domain-specific information, further facilitating semantic alignment and alleviating domain shifts. Extensive experiments on several datasets show that CSCA achieves promising classification performance.
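The sliced Wasserstein distance mentioned in this abstract can be estimated as in the Python sketch below. It is a generic Monte Carlo estimator, not the CSCA code: the function name, the restriction to equally sized batches, the number of projections, and the choice of the 2-Wasserstein variant are assumptions made for illustration.

```python
import math
import random
from typing import Sequence

Vector = Sequence[float]


def sliced_wasserstein(
    source: Sequence[Vector],
    target: Sequence[Vector],
    num_projections: int = 64,
    seed: int = 0,
) -> float:
    """Monte Carlo estimate of the sliced 2-Wasserstein distance.

    Assumption: both feature sets contain the same number of vectors,
    so sorted 1-D projections can be matched element by element.
    """
    if len(source) != len(target):
        raise ValueError("this sketch assumes equally sized batches")
    dim = len(source[0])
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_projections):
        # Draw a random unit direction by normalising a Gaussian sample.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(d * d for d in direction))
        direction = [d / norm for d in direction]
        # Project both sets onto the direction and sort; in 1-D the optimal
        # transport plan pairs the sorted samples in order.
        proj_s = sorted(sum(x * d for x, d in zip(v, direction)) for v in source)
        proj_t = sorted(sum(x * d for x, d in zip(v, direction)) for v in target)
        total += sum((a - b) ** 2 for a, b in zip(proj_s, proj_t)) / len(proj_s)
    return math.sqrt(total / num_projections)
```

Used as a training loss, the same computation would be written with a differentiable tensor library so that gradients flow back to the feature extractor.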

Collaborative Semantic Consistency Alignment for Blended-Target Domain Adaptation

Gaze estimation methods suffer significant performance deterioration when evaluated across different domains because of the domain gap between the testing and training data. Existing methods try to solve this issue by reducing the deviation of the data distribution; however, they ignore the label deviation in the data caused by the acquisition mechanism of the gaze label and individual physiological differences. In this paper, we first point out that the influence of the label deviation cannot be ignored, and propose a gaze label alignment algorithm (GLA) to eliminate the label distribution deviation. Specifically, we first train the feature extractor on all domains to get domain-invariant features, and then select an anchor domain to train the gaze regressor. We predict the gaze labels on the remaining domains and use a mapping function to align the labels. Finally, these aligned labels can be used to train gaze estimation models. Therefore, our method can be combined with any existing method. Experimental results show that our GLA method can effectively alleviate the label distribution shift, and that SOTA gaze estimation methods can be further improved.
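The label-alignment step can be illustrated with a small Python sketch. The abstract does not specify the form of the mapping function, so this example assumes a per-angle affine map fitted by least squares from a domain's original labels to the anchor-domain regressor's predictions; the function names `fit_affine_map` and `align_labels` are hypothetical.

```python
from typing import List, Sequence, Tuple


def fit_affine_map(
    labels: Sequence[float], anchor_preds: Sequence[float]
) -> Tuple[float, float]:
    """Least-squares fit of anchor_pred ~ scale * label + offset.

    Assumption: one scalar gaze angle (e.g. yaw) per sample; `labels` are
    the domain's original annotations and `anchor_preds` are predictions
    from a regressor trained on the anchor domain.
    """
    n = len(labels)
    mean_y = sum(labels) / n
    mean_p = sum(anchor_preds) / n
    var_y = sum((y - mean_y) ** 2 for y in labels)
    if var_y == 0.0:
        raise ValueError("labels must not all be identical")
    cov = sum((y - mean_y) * (p - mean_p) for y, p in zip(labels, anchor_preds))
    scale = cov / var_y
    offset = mean_p - scale * mean_y
    return scale, offset


def align_labels(labels: Sequence[float], scale: float, offset: float) -> List[float]:
    """Map a domain's original labels into the anchor domain's label space."""
    return [scale * y + offset for y in labels]
```

The aligned labels returned by `align_labels` would then replace the original annotations when training a gaze estimator on that domain.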

Gaze Label Alignment: Alleviating Domain Shift for Gaze Estimation

Multi-task learning (MTL) has emerged as a successful strategy in industrial-scale recommender systems, offering significant advantages such as capturing diverse users’ interests and accurately detecting different behaviors like “click” or “dwell time”. However, negative transfer and the seesaw phenomenon pose challenges to MTL models due to the complex and often contradictory task correlations in real-world recommendations. To address the problem while making better use of personalized information, we propose a personalized Direct Routing Gradient framework (DRGrad), which consists of three key components: a router, an updater, and a personalized gate network. DRGrad judges the stakes between tasks during training, so that it can leverage all valid gradients for the respective task to reduce conflicts. We evaluate the efficiency of DRGrad on complex MTL using a real-world recommendation dataset with 15 billion samples. The results show DRGrad’s superior performance over competing state-of-the-art MTL models, especially in terms of AUC (Area Under the Curve) metrics, indicating that it effectively manages task conflicts in multi-task learning environments without increasing model complexity, while also addressing the deficiencies in noise processing. Moreover, experiments on the public Census-income dataset and a synthetic dataset demonstrate the capability of DRGrad in judging and routing the stakes between tasks with varying degrees of correlation and personalization.

Direct Routing Gradient (DRGrad): A Personalized Information Surgery for Multi-Task Learning (MTL) Recommendations

Cryo-Electron Tomography (cryo-ET) is a 3D imaging technology that facilitates the study of macromolecular structures at near-atomic resolution. Recent volumetric segmentation approaches on cryo-ET images have drawn widespread interest in the biological sector. However, existing methods heavily rely on manually labeled data, which requires highly professional skills, thereby hindering the adoption of fully-supervised approaches for cryo-ET images. Some unsupervised domain adaptation (UDA) approaches have been designed to enhance the segmentation network performance using unlabeled data. However, applying these methods directly to cryo-ET image segmentation tasks remains challenging due to two main issues: 1) the source dataset, usually obtained through simulation, contains a fixed level of noise, while the target dataset, collected directly from raw data in real-world scenarios, has unpredictable noise levels; 2) the source data used for training typically consist of known macromolecules, whereas the target domain data are often unknown, causing the model to be biased towards the known macromolecules and leading to a domain shift problem. To address such challenges, in this work, we introduce a voxel-wise unsupervised domain adaptation approach, termed Vox-UDA, specifically for cryo-ET subtomogram segmentation. Vox-UDA incorporates a noise generation module to simulate target-like noise in the source dataset for cross-noise level adaptation. Additionally, we propose a denoised pseudo-labeling strategy based on the improved Bilateral Filter to alleviate the domain shift problem. More importantly, we construct the first UDA cryo-ET subtomogram segmentation benchmark on three experimental datasets. Extensive experimental results on multiple benchmarks and newly curated real-world datasets demonstrate the superiority of our proposed approach compared to state-of-the-art UDA methods.

Vox-UDA: Voxel-wise Unsupervised Domain Adaptation for Cryo-Electron Subtomogram Segmentation with Denoised Pseudo-Labeling

Infrared and visible image fusion (IVIF) is a crucial technique for enhancing visual performance by integrating unique information from different modalities into one fused image. Existing methods pay more attention to conducting fusion with undisturbed data, while overlooking the impact of deliberate interference on the effectiveness of fusion results. To investigate the robustness of fusion models, in this paper, we propose a novel adversarial attack resilient network, called $\textrm{A}^{\textrm{2}}$RNet, for infrared and visible image fusion. Specifically, we develop an adversarial paradigm with an anti-attack loss function for IVIF to implement adversarial attacks and training. It is constructed based on the intrinsic nature of IVIF and provides a robust foundation for future research advancements. We adopt a U-Net as the pipeline with a transformer-based defensive refinement module (DRM) under this paradigm, which guarantees fused image quality in a robust coarse-to-fine manner. Compared to previous work, our method mitigates the adverse effects of adversarial perturbations, consistently maintaining high-fidelity fusion results. Furthermore, the performance of downstream tasks can also be well maintained under adversarial attacks.

A²RNet: Adversarial Attack Resilient Network for Robust Infrared and Visible Image Fusion

Image change captioning (ICC) poses great challenges stemming from describing subtle differences between two similar images in natural language, significantly increasing the complexity of feature extraction and cross-modal learning compared to the image captioning task. Existing ICC methods often suffer from two key challenges: 1) massive irrelevant information in uni-image features leads to suboptimal visual difference representations; 2) imprecise inter-modality correspondence degrades the quality of generated captions. Motivated by the excellent performance of diffusion models in image/text generation, this paper proposes a Difference-aware Contrastive Diffusion Model with Adversarial Perturbations (DECIDER) for ICC. Technically, difference-aware cross-modal learning is developed to suppress irrelevant information and learn compact yet robust visual difference representations. This is achieved by optimizing a novel objective mathematically derived from the information bottleneck principle that excels in filtering redundant features and highlighting differences. Furthermore, we propose to dynamically generate "hard" positive and negative samples via adversarial perturbations, which are involved in contrastive diffusion training with a tighter variational bound. This design encourages our DECIDER to excavate and construct complex correspondences between visual differences and captions, thereby improving generalization performance. Extensive experiments on four datasets demonstrate that DECIDER significantly exceeds state-of-the-art performance. The code will be publicly available upon acceptance.

DECIDER: Difference-aware Contrastive Diffusion Model with Adversarial Perturbations for Image Change Captioning

Significant disparities between the features of natural images and those inherent to histopathological images make it challenging to directly apply and transfer pre-trained models from natural images to histopathology tasks. Moreover, the frequent lack of annotations in histopathology patch images has driven researchers to explore self-supervised learning methods like mask reconstruction for learning representations from large amounts of unlabeled data. Crucially, previous mask-based efforts in self-supervised learning have often overlooked the spatial interactions among entities, which are essential for constructing accurate representations of pathological entities. To address these challenges, constructing graphs of entities is a promising approach. In addition, the diffusion reconstruction strategy has recently shown superior performance through its random intensity noise addition technique, which enhances the robustness of the learned representation. Therefore, we introduce $\textbf{H-MGDM}$, a novel self-supervised $\textbf{H}$istopathology image representation learning method through the Dynamic Entity-$\textbf{M}$asked $\textbf{G}$raph $\textbf{D}$iffusion $\textbf{M}$odel. Specifically, we propose to use complementary subgraphs as latent diffusion conditions and self-supervised targets respectively during pre-training. We note that the graph can embed entities' topological relationships and enhance representation. Dynamic conditions and targets can improve pathological fine reconstruction. We conduct pre-training experiments on three large histopathological datasets. The advanced predictive performance and interpretability of H-MGDM are clearly evaluated on comprehensive downstream tasks such as classification and survival analysis on six datasets, with an average improvement of 5.18%. Our code will be publicly available.

Dynamic Entity-Masked Graph Diffusion Model for Histopathology Image Representation Learning

Existing unsupervised distillation-based methods rely on the differences between encoded and decoded features to locate abnormal regions in test images. However, the decoder trained only on normal samples still reconstructs abnormal patch features well, degrading performance. This issue is particularly pronounced in the unsupervised multi-class anomaly detection task. We attribute this behavior to 'over-generalization' (OG) of the decoder: the significantly increasing diversity of patch patterns in multi-class training enhances the model's generalization on normal patches, but also inadvertently broadens its generalization to abnormal patches. To mitigate 'OG', we propose a novel approach that leverages class-agnostic learnable prompts to capture common textual normality across various visual patterns, and then applies them to guide the decoded features towards a 'normal' textual representation, suppressing 'over-generalization' of the decoder on abnormal patterns. To further improve performance, we also introduce a gated mixture-of-experts module to specialize in handling diverse patch patterns and reduce mutual interference between them in multi-class training. Our method achieves competitive performance on the MVTec AD and VisA datasets, demonstrating its effectiveness. The code will be released soon.
