United States

Despite the efficiency of prompt learning in transferring vision-language models (VLMs) to downstream tasks, existing methods mainly learn the prompts in a coarse-grained manner, where the learned prompt vectors are shared across all categories. Consequently, the tailored prompts often fail to discern class-specific visual concepts, hindering transfer performance on classes that share similar or complex visual attributes. Recent advances mitigate this challenge by leveraging external knowledge from Large Language Models (LLMs) to furnish class descriptions, but they incur notable inference costs. In this paper, we introduce TextRefiner, a plug-and-play method that refines the text prompts of existing methods by leveraging the internal knowledge of VLMs. In particular, TextRefiner builds a novel local cache module to encapsulate fine-grained visual concepts derived from local tokens within the image branch. By aggregating and aligning the cached visual descriptions with the original output of the text branch, TextRefiner can efficiently refine and enrich the learned prompts from existing methods without relying on any external expertise. For example, it improves the performance of CoOp from 71.66% to 76.96% on 11 benchmarks, surpassing CoCoOp, which introduces instance-wise features for text prompts. Equipped with TextRefiner, PromptKD achieves state-of-the-art performance while keeping inference efficient.

AAAI 2025

TextRefiner: Internal Visual Feature as Efficient Refiner for Vision-Language Models Prompt Tuning

multimodal vision

poster
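
The TextRefiner abstract describes a local cache that stores fine-grained visual concepts from local image tokens and is then aggregated with the text-branch output. The following is a minimal sketch of that idea, not the authors' implementation: the class name `LocalCache`, the nearest-entry moving-average update, the softmax aggregation, and the `momentum` and `temperature` values are all assumptions made for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v] if n > 0 else list(v)

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

class LocalCache:
    """Hypothetical cache of K concept vectors fed by local image tokens."""

    def __init__(self, entries, momentum=0.9):
        self.entries = [normalize(e) for e in entries]
        self.momentum = momentum  # assumed moving-average update rule

    def update(self, local_tokens):
        # Assign each local token to its most similar entry and blend it in.
        for tok in local_tokens:
            tok = normalize(tok)
            k = max(range(len(self.entries)),
                    key=lambda i: dot(self.entries[i], tok))
            blended = [self.momentum * e + (1 - self.momentum) * t
                       for e, t in zip(self.entries[k], tok)]
            self.entries[k] = normalize(blended)

    def refine(self, text_feature, temperature=0.1):
        # Attention-style aggregation of cached concepts, added to the text feature.
        t = normalize(text_feature)
        weights = softmax([dot(t, e) / temperature for e in self.entries])
        agg = [sum(w * e[d] for w, e in zip(weights, self.entries))
               for d in range(len(t))]
        return normalize([a + b for a, b in zip(t, agg)])

cache = LocalCache(entries=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
cache.update(local_tokens=[[0.9, 0.1, 0.0], [0.1, 0.8, 0.3]])
refined = cache.refine(text_feature=[0.7, 0.0, 0.7])
```

In this sketch, refinement needs only the cached vectors and one weighted sum per class prompt, which is consistent with the abstract's point that no external model is queried at inference time.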

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

### [Invited Speakers](https://aaai.org/conference/aaai/aaai-25/aaai-25-invited-speakers/)

Register [here](https://aaai.org/conference/aaai/aaai-25/registration/)

Efficient tracking has garnered extensive attention as it aims to operate on resource-constrained platforms for real-world deployment, not solely on desktop GPUs. Current efficient trackers mainly follow precision-oriented trackers, adopting a one-stream framework with lightweight modules. However, blindly adhering to the one-stream network may not be optimal for efficient tracking, as incorporating template computation in every frame leads to significant redundancy, and the pervasive semantic interaction between template and search region places substantial stress on edge devices. In this work, we propose a novel asymmetric Siamese tracker named AsymTrack for efficient tracking. AsymTrack disentangles the template and search streams into separate branches, with template computation occurring only once during initialization to generate modulation signals for interaction. Based on this architecture, we devise an efficient template modulation mechanism to inject crucial cues into the search features, and design an object perception enhancement module that integrates abstract semantics and local details to overcome the limited representation of lightweight trackers. Extensive experiments demonstrate that AsymTrack offers superior speed-precision trade-offs across different platforms compared to the current state of the art. For instance, AsymTrack-T achieves 60.8% AUC on LaSOT and 224/81/84 FPS on GPU/CPU/AGX, surpassing HiT-Tiny by 6.0% AUC with higher speeds. Code will be available.

Two-stream Beats One-stream: Asymmetric Siamese Network for Efficient Visual Tracking
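
The asymmetry described in the AsymTrack abstract (template computation once at initialization, search computation per frame) can be illustrated with a toy sketch. Everything here is an assumption for illustration: the class `AsymmetricTracker`, the per-channel mean used as the "modulation signal", and the element-wise scaling stand in for the paper's actual modules.

```python
class AsymmetricTracker:
    """Toy two-branch tracker: template branch runs once, search branch runs per frame."""

    def __init__(self):
        self.template_calls = 0
        self.modulation = None

    def _encode_template(self, template):
        # Stand-in for the template branch: one scale per channel (assumed form).
        self.template_calls += 1
        return [sum(channel) / len(channel) for channel in template]

    def initialize(self, template):
        self.modulation = self._encode_template(template)

    def track(self, search):
        # Per-frame work touches only the search features and the cached modulation.
        if self.modulation is None:
            raise RuntimeError("call initialize() before track()")
        return [[scale * value for value in channel]
                for scale, channel in zip(self.modulation, search)]

tracker = AsymmetricTracker()
tracker.initialize(template=[[1.0, 3.0], [0.5, 0.5]])  # two channels
frames = [[[1.0, 2.0], [4.0, 4.0]], [[0.0, 1.0], [2.0, 6.0]]]
outputs = [tracker.track(frame) for frame in frames]
```

However many frames are processed, `tracker.template_calls` stays at 1, which is the redundancy saving the abstract contrasts with one-stream designs that re-process the template in every frame.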

High-quality, high-resolution medical imaging is essential for clinical care. Raman-based biomedical optical imaging uses non-ionizing infrared radiation to evaluate human tissues in real time and is used for early cancer detection, brain tumor diagnosis, and intraoperative tissue analysis. Unfortunately, optical imaging is vulnerable to image degradation due to laser scattering and absorption, which can result in diagnostic errors and misguided treatment. Restoration of optical images is a challenging computer vision task because the sources of image degradation are multi-factorial, stochastic, and tissue-dependent, preventing a straightforward method to obtain paired low-quality/high-quality data. Here, we present Restorative Step-Calibrated Diffusion (RSCD): an unpaired diffusion-based image restoration method that uses a step calibrator model to dynamically determine the number of steps required to complete the reverse diffusion process for image restoration. RSCD outperforms other widely used unpaired image restoration methods on both image quality and perceptual evaluation metrics for restoring optical images. Medical imaging experts consistently prefer images restored using RSCD in blinded comparison experiments and report minimal to no hallucinations. Finally, we show that RSCD improves performance on downstream clinical imaging tasks, including automated brain tumor diagnosis and deep tissue imaging.

Step-Calibrated Diffusion for Biomedical Optical Image Restoration
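
The central idea in the RSCD abstract is that a calibrator decides how many restoration steps each input needs. The sketch below shows only that control flow on a 1-D signal and is not the authors' method: the roughness-based `calibrate_steps` is a hypothetical heuristic standing in for a learned step calibrator, and `restore_step` is simple smoothing standing in for one reverse-diffusion step.

```python
def roughness(signal):
    # Mean absolute difference between neighbours: a crude degradation proxy.
    return sum(abs(a - b) for a, b in zip(signal, signal[1:])) / (len(signal) - 1)

def calibrate_steps(signal, steps_per_unit=10, max_steps=50):
    # Hypothetical step calibrator: more degraded input -> more restoration steps.
    return min(max_steps, round(steps_per_unit * roughness(signal)))

def restore_step(signal):
    # Stand-in for one reverse-diffusion step: light 3-tap smoothing.
    padded = [signal[0]] + list(signal) + [signal[-1]]
    return [(padded[i - 1] + 2 * padded[i] + padded[i + 1]) / 4
            for i in range(1, len(padded) - 1)]

def restore(signal):
    steps = calibrate_steps(signal)
    for _ in range(steps):
        signal = restore_step(signal)
    return signal, steps

clean = [0.0, 0.0, 0.0, 0.0]
mild = [0.0, 0.2, 0.0, 0.2]
noisy = [0.0, 1.0, 0.0, 1.0]
results = {name: restore(sig)
           for name, sig in [("clean", clean), ("mild", mild), ("noisy", noisy)]}
```

In this toy setting the clean input receives zero steps and the noisier input receives more steps than the mildly degraded one, so already-good inputs are not over-processed.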

Image-to-point cloud registration methods typically determine the rigid transformation using a coarse-to-fine pipeline. However, directly and uniformly matching image patches with point cloud patches may cause the matching to focus on noisy, incorrect patches while ignoring key ones. Moreover, due to the significant differences between the image and point cloud modalities, it may be challenging to bridge the domain gap without specific design improvements. To address these issues, we propose the Uncertainty-aware Hierarchical Matching Module (UHMM) and the Adversarial Modal Alignment Module (AMAM). Within the UHMM, we model the uncertainty of critical information in image patches and facilitate multi-level fusion interactions between image and point cloud features. In the AMAM, we design an adversarial approach to reduce the domain gap between image and point cloud. Extensive experiments and ablation studies on the RGB-D Scenes V2 and 7-Scenes benchmarks demonstrate the superiority of our method, making it a state-of-the-art approach for image-to-point cloud registration tasks. Our code will be released.

Bridge 2D-3D: Uncertainty-aware Hierarchical Registration Network with Domain Alignment

In multi-view multi-label classification (MVML), each object is described by several heterogeneous views and annotated with multiple related labels. The key to learning from such complicated data lies in how to fuse cross-view features and explore multi-label correlations while obtaining correct assignments between each object and its corresponding labels. In this paper, we propose an advanced MVML method named VAMS, which treats each object as a bag of views and reformulates the MVML task as a “view-label” matching selection problem. Specifically, we first construct an object graph and a label graph. In the object graph, nodes represent the multi-view representation of an object, and each view node is connected to its K nearest neighbors within its own view. In the label graph, nodes represent the semantic representation of a label. Then, we connect each view node with all labels to generate the unified “view-label” matching graph. Afterwards, a graph network block is introduced to aggregate and update all nodes and edges on the matching graph, generating structural representations that fuse multi-view heterogeneity and multi-label correlations for each view and label. Finally, we derive a prediction score for each view-label matching and select the optimal matching by optimizing a weighted cross-entropy loss. Extensive results on various datasets verify that VAMS achieves superior or comparable performance against state-of-the-art methods.

Multi-View Multi-Label Classification via View-Label Matching Selection
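
The graph construction in the VAMS abstract (K-nearest-neighbor edges within each view, plus an edge from every view node to every label) can be sketched as follows. This covers only the graph-building step; the function names, the squared Euclidean distance, and the toy features are assumptions, and the graph network block and loss are not modeled.

```python
def knn_edges(features, k):
    """Connect each node to its k nearest neighbours (squared Euclidean) within one view."""
    edges = []
    for i, fi in enumerate(features):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(fi, fj)), j)
            for j, fj in enumerate(features) if j != i
        )
        edges.extend((i, j) for _, j in dists[:k])
    return edges

def build_matching_graph(views, num_labels, k):
    """views[v][i] is the feature of object i in view v (views may differ in dimension)."""
    object_edges = {v: knn_edges(feats, k) for v, feats in enumerate(views)}
    # Every (view, object) node is linked to every label node.
    view_label_edges = [((v, i), label)
                        for v, feats in enumerate(views)
                        for i in range(len(feats))
                        for label in range(num_labels)]
    return object_edges, view_label_edges

views = [
    [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]],  # view 0: 2-D features for 3 objects
    [[1.0], [4.0], [4.2]],                 # view 1: 1-D features for the same objects
]
object_edges, view_label_edges = build_matching_graph(views, num_labels=4, k=1)
```

Because neighbors are searched only inside a view, heterogeneous feature dimensions never have to be compared directly; the cross-view coupling comes from the shared label nodes.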

Vehicle re-identification aims to match vehicles across non-overlapping camera views. Many existing methods extract features from a single image and therefore lack view-invariance when comparing vehicles of different orientations. As a result, discriminative parts obscured by viewpoint changes cannot contribute effectively to matching. This work presents a novel keypoint-based framework for vehicle Re-ID. We propose to explicitly model the intrinsic structural relationships between vehicle components via a knowledge graph. By establishing connections between keypoints, our approach leverages this prior to match vehicles even when some parts are not directly comparable due to orientation inconsistencies. Specifically, given query and gallery images, we first detect visible keypoints. Then, a transformer-based model infers features for non-overlapping keypoints by conditioning on visible correspondences defined in the knowledge graph. The final representation integrates visible and inferred features. Extensive experiments demonstrate that our method outperforms state-of-the-art methods on standard benchmarks under cross-view matching scenarios. To our knowledge, this is the first work introducing structural priors via keypoint knowledge graphs for view-invariant vehicle re-identification.

Infer the Whole from a Glimpse of a Part: Keypoint-Based Knowledge Graph for Vehicle Re-Identification

Recently, anchor-based incomplete multi-view clustering (IMVC) has been widely adopted for fast clustering, but most existing approaches still encounter some issues: (1) They generally rely on the observed samples to construct anchor graphs, ignoring the potentially useful information of missing instances. (2) Most methods attempt to learn a consensus anchor graph, failing to fully exploit the complementary information and high-order correlations across views. (3) They generally apply post-processing to the learned anchor graph to obtain latent embeddings, so the result is not globally optimal. To address these issues, this paper proposes a novel fast IMVC approach with Adaptive Similarity Completion and Reconstruction (ASCR), which unifies anchor learning, anchor-sample similarity construction and completion, and latent multi-view embedding learning in a joint framework. Specifically, ASCR learns an anchor-sample similarity graph for each view, and the missing values are filled in to mitigate their adverse effects. To explore the consistent and complementary information across views, ASCR simultaneously seeks the view-specific anchor embeddings and sample embeddings in a latent subspace by similarity reconstruction, which not only preserves the semantic information in the latent embeddings but also enhances the low-rank property of the similarity graphs, achieving a reliable graph completion process. Furthermore, the high-order cross-view correlations are explored with tensor-based regularization. Finally, the multi-view sample embeddings are combined for clustering. Extensive experimental results demonstrate the superiority and efficiency of ASCR compared with state-of-the-art approaches.

Fast Incomplete Multi-view Clustering with Adaptive Similarity Completion and Reconstruction

Existing few-shot medical image segmentation (FSMIS) models fail to address a practical issue in medical imaging: the domain shift caused by different imaging techniques, which limits their applicability to current FSMIS tasks. To overcome this limitation, we focus on the cross-domain few-shot medical image segmentation (CD-FSMIS) task, aiming to develop a generalized model capable of adapting to a broader range of medical image segmentation scenarios with limited labeled data from the novel target domain. Inspired by the similarity of frequency-domain characteristics across different domains, we propose a Frequency-aware Matching Network (FAMNet), which includes two key components: a Frequency-aware Matching (FAM) module and a Multi-Spectral Fusion (MSF) module. The FAM module tackles two problems during the meta-learning phase: 1) intra-domain variance caused by the inherent support-query bias, due to the different appearances of organs and lesions, and 2) inter-domain variance caused by different medical imaging techniques. Additionally, we design an MSF module to integrate the different frequency features decoupled by the FAM module and further mitigate the impact of inter-domain variance on the model's segmentation performance. Combining these two modules, our FAMNet surpasses existing FSMIS models and Cross-domain Few-shot Semantic Segmentation models on three cross-domain datasets, achieving state-of-the-art performance in the CD-FSMIS task.

FAMNet: Frequency-aware Matching Network for Cross-domain Few-shot Medical Image Segmentation

The Audio-Visual Video Parsing task aims to recognize and temporally localize all events occurring in either the audio or visual stream, or both. Capturing accurate event semantics for each audio/visual segment is vital. Prior works directly utilize the extracted holistic audio and visual features for intra- and cross-modal temporal interactions. However, each segment may contain multiple events, resulting in semantically mixed holistic features that can lead to semantic interference during intra- or cross-modal interactions: the event semantics of one segment may incorporate semantics of unrelated events from other segments. To address this issue, our method begins with a Class-Aware Feature Decoupling (CAFD) module, which explicitly decouples the semantically mixed features into distinct class-wise features, including multiple event-specific features and a dedicated background feature. The decoupled class-wise features enable our model to selectively aggregate useful semantics for each segment from clearly matched classes contained in other segments, preventing semantic interference from irrelevant classes. Specifically, we further design a Fine-Grained Semantic Enhancement module for encoding intra- and cross-modal relations. It comprises a Segment-wise Event Co-occurrence Modeling (SECM) block and a Local-Global Semantic Fusion (LGSF) block. The SECM exploits inter-class dependencies of concurrent events within the same timestamp with the aid of a novel event co-occurrence loss. The LGSF further enhances the event semantics of each segment by incorporating relevant semantics from more informative global video features. Extensive experiments validate the effectiveness of the proposed modules and loss functions, resulting in a new state-of-the-art parsing performance.

Multimodal Class-aware Semantic Enhancement Network for Audio-Visual Video Parsing

In this paper, we present CAD2Program, a new method for reconstructing 3D parametric models from 2D CAD drawings. Our proposed method is inspired by recent successes in vision-language models (VLMs), and departs from traditional methods which rely on task-specific data representations and/or algorithms. Specifically, on the input side, we simply treat the 2D CAD drawing as a raster image, regardless of its original format, and encode the image with a standard ViT model. We show that such an encoding scheme achieves competitive performance against existing methods that operate on vector-graphics inputs, while imposing substantially fewer restrictions on the 2D drawings. On the output side, our method auto-regressively predicts a general-purpose language describing 3D parametric models in text form. Compared to other sequence modeling methods for CAD which use domain-specific sequence representations with fixed-size slots, our text-based representation is more flexible, and can be easily extended to arbitrary geometric entities and semantic or functional properties. Experimental results on a large-scale dataset of cabinet models demonstrate the effectiveness of our method.

From 2D CAD Drawings to 3D Parametric Models: A Vision-Language Approach

Customized video generation aims to generate high-quality videos guided by text prompts and a subject's reference images. However, since it is trained only on static images, the fine-tuning process of subject learning disrupts the abilities of video diffusion models (VDMs) to combine concepts and generate motions. To restore these abilities, some methods use additional videos similar to the prompt to fine-tune or guide the model. This requires frequent changes of guiding videos and even re-tuning of the model when generating different motions, which is very inconvenient for users. In this paper, we propose CustomCrafter, a novel framework that preserves the model's motion generation and conceptual combination abilities without requiring additional videos or fine-tuning to recover them. To preserve the conceptual combination ability, we design a plug-and-play module that updates a few parameters in VDMs, enhancing the model's ability to capture appearance details and to combine concepts for new subjects. For motion generation, we observe that VDMs tend to restore the motion of the video in the early stage of denoising, while focusing on the recovery of subject details in the later stage. Therefore, we propose a Dynamic Weighted Video Sampling Strategy. Using the pluggability of our subject learning modules, we reduce the impact of this module on motion generation in the early stage of denoising, preserving the VDM's ability to generate motion. In the later stage of denoising, we restore this module to repair the appearance details of the specified subject, thereby ensuring the fidelity of the subject's appearance. Experimental results show that our method significantly improves over previous methods.
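
The sampling strategy described in the CustomCrafter abstract reduces the subject-learning module's influence early in denoising and restores it later. The sketch below is a hedged illustration of such a schedule, not the paper's implementation: the function names, the hard switch at `early_steps`, and the values `early_steps=3` and `early_weight=0.2` are assumptions.

```python
def module_weight(step, total_steps, early_steps=3, early_weight=0.2):
    """Weight of the subject-learning module at a denoising step (0 = first step)."""
    if not 0 <= step < total_steps:
        raise ValueError("step must be in [0, total_steps)")
    # Early steps: damp the module so the base model's motion prior dominates.
    if step < early_steps:
        return early_weight
    # Later steps: restore the module fully to repair subject appearance.
    return 1.0

def blend(base_output, module_delta, step, total_steps):
    # Combine the base model output with the weighted contribution of the module.
    w = module_weight(step, total_steps)
    return [b + w * d for b, d in zip(base_output, module_delta)]

schedule = [module_weight(s, total_steps=10) for s in range(10)]
early = blend([1.0, 1.0], [0.5, -0.5], step=0, total_steps=10)
late = blend([1.0, 1.0], [0.5, -0.5], step=9, total_steps=10)
```

A smooth ramp could replace the hard switch; the abstract specifies only that the module's impact is reduced early and restored late.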
