United States

Foundational vision-language models like CLIP are emerging as a promising paradigm in vision due to their excellent generalization. However, adapting these models for downstream tasks while maintaining their generalization remains challenging. In the literature, one branch of methods adapts CLIP by learning prompts using images. While effective, these methods often rely on image-label data, which is not always practical, and struggle to generalize to new datasets due to overfitting on few-shot source data. Another approach explores training-free methods by generating class captions from large language models (LLMs) and performing prompt ensembling, but these methods often produce static, class-specific prompts that cannot be transferred to new classes and incur additional costs by generating LLM descriptions for each class separately.
In this work, we aim to combine the strengths of both approaches by learning prompts using only text data derived from LLMs. As supervised training of prompts in the image-free setup is non-trivial, we develop a language-only efficient training approach that enables prompts to distill rich contextual knowledge from LLM data. Furthermore, by mapping the LLM contextual text data within the learned prompts, our approach enables zero-shot transfer of prompts to new classes and datasets, potentially reducing the LLM prompt engineering cost. To the best of our knowledge, this is the first work that learns generalized and transferable prompts for image tasks using only text data.
We perform evaluations on four benchmarks, where our method improves over prior ensembling methods while being competitive with those utilizing labeled images. Our code will be made public.
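
The prompt ensembling mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' method or the CLIP API: the three-dimensional vectors are hypothetical stand-ins for text-encoder and image-encoder outputs, and the helper names (`normalize`, `ensemble_class_embedding`, `classify`) are made up for illustration. The sketch assumes the common recipe of averaging unit-normalized prompt embeddings per class and picking the class with the highest cosine similarity to the image embedding.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def ensemble_class_embedding(prompt_embeddings):
    """Average the unit-normalized embeddings of several prompts for one class."""
    unit = [normalize(e) for e in prompt_embeddings]
    dim = len(unit[0])
    mean = [sum(e[i] for e in unit) / len(unit) for i in range(dim)]
    return normalize(mean)

def classify(image_embedding, class_embeddings):
    """Return the best class by cosine similarity, plus all similarity scores."""
    img = normalize(image_embedding)
    scores = {
        name: sum(a * b for a, b in zip(img, emb))
        for name, emb in class_embeddings.items()
    }
    return max(scores, key=scores.get), scores

# Toy vectors standing in for encoder outputs (hypothetical values).
class_embeddings = {
    "cat": ensemble_class_embedding([[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]),
    "dog": ensemble_class_embedding([[0.1, 0.9, 0.0], [0.2, 0.8, 0.1]]),
}
label, scores = classify([0.7, 0.3, 0.05], class_embeddings)
print(label)
```

Because the class embeddings are fixed per class in this recipe, adding a new class requires new descriptions for it, which is the transfer limitation the abstract points out.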

AAAI 2025

Learning to Prompt with Text Only Supervision for Vision-Language Models

poster

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

### [Invited Speakers](https://aaai.org/conference/aaai/aaai-25/aaai-25-invited-speakers/)

Register [here](https://aaai.org/conference/aaai/aaai-25/registration/)


While Large Language Models (LLMs) show promise for Text-Attributed Graphs (TAGs) learning, their deployment is hindered by computational demands. Graph Neural Networks (GNNs) are efficient but struggle with TAGs' complex semantics. We propose LinguGKD, a novel LLM-to-GNN knowledge distillation framework that enables transferring both local semantic details and global structural information from LLMs to GNNs. First, it introduces TAG-oriented instruction tuning, enhancing LLMs with graph-specific knowledge through carefully designed prompts. Next, it develops a layer-adaptive multi-scale contrastive distillation strategy aligning LLM and GNN features at multiple granularities, from node-level to graph-level. Finally, the distilled GNNs combine the semantic richness of LLMs with the computational efficiency of traditional GNNs. Experiments demonstrate that LinguGKD outperforms existing graph distillation frameworks and that the distilled simple GNNs achieve comparable or superior performance to more complex GNNs and teacher LLMs while maintaining computational efficiency. This work bridges the gap between LLMs and GNNs, facilitating advanced graph learning in resource-constrained environments and providing a framework to leverage ongoing LLM advancements for GNN improvement.
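
To make the idea of contrastive feature alignment concrete, here is a minimal sketch of a generic InfoNCE loss between student and teacher node features. It is an assumption for illustration only: the abstract does not specify the loss, and this single-scale version is not LinguGKD's layer-adaptive multi-scale strategy. The function names and the toy two-dimensional features are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(student_feats, teacher_feats, temperature=0.1):
    """Mean InfoNCE loss: student node i should match teacher node i
    and mismatch every other teacher node in the batch."""
    losses = []
    for i, s in enumerate(student_feats):
        logits = [cosine(s, t) / temperature for t in teacher_feats]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)

# Toy features: "aligned" pairs each student node with its own teacher node.
teacher = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
shuffled = [aligned[1], aligned[2], aligned[0]]

print(info_nce(aligned, teacher) < info_nce(shuffled, teacher))
```

The loss is lower when each student feature is closest to its own teacher feature, which is the behavior a distillation objective of this kind rewards.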

Large Language Model Meets Graph Neural Network in Knowledge Distillation

The advancement of Spatial Transcriptomics (ST) has facilitated the spatially-aware profiling of gene expressions based on histopathology images. Although ST data offers valuable insights into the micro-environment of tumors, its acquisition remains expensive. Therefore, directly predicting ST expressions from digital pathology images is desirable. Current methods usually adopt existing regression backbones along with patch-sampling for this task, which ignores the inherent multi-scale information embedded in the pyramidal data structure of digital pathology images and wastes the inter-spot visual information crucial for accurate gene expression prediction. To address these limitations, we propose M2OST, a many-to-one regression Transformer that can accommodate the hierarchical structure of pathology images via a decoupled multi-scale feature extractor. Unlike traditional models that are trained with one-to-one image-label pairs, M2OST uses multiple images from different levels of the digital pathology image to jointly predict the gene expressions in their common corresponding spot. Built upon our many-to-one scheme, M2OST can be easily scaled to fit different numbers of inputs, and its network structure inherently incorporates nearby inter-spot features, enhancing regression performance. We have tested M2OST on three public ST datasets, and the experimental results show that M2OST can achieve state-of-the-art performance with fewer parameters and floating-point operations (FLOPs). The code will be released upon acceptance.

M2OST: Many-to-one Regression for Predicting Spatial Transcriptomics from Digital Pathology Images

Diffusion-based generative models have recently excelled in generating molecular conformations but struggled with the generalization issue -- models trained on one dataset may produce meaningless conformations on out-of-distribution molecules. 
On the other hand, distance geometry serves as a generalizable tool for the traditional computational chemistry methods of molecular conformation, which is predicated on the assumption that it is possible to adequately define the set of all potential conformations of any non-rigid molecular system using purely geometric constraints.
In this work, we explicitly incorporate, for the first time, distance geometry constraints into the pretraining phase of diffusion-based molecular generation models to improve their generalizability.
Inspired by the classical distance geometry solution designed for solving the molecular distance geometry problem, we propose $\textbf{MiGDiff}$, a $\textbf{M}$etrization-$\textbf{I}$nformed $\textbf{G}$eometric $\textbf{Diff}$usion framework. 
$\textbf{MiGDiff}$ injects distance geometry constraints by pretraining the deep geometric diffusion backbone within the $\textbf{Metrization}$ sampling approach, yielding a "$\textbf{Metrization}$-driven pretraining + Data-driven finetuning" paradigm.
Experimental results demonstrate that $\textbf{MiGDiff}$ outperforms state-of-the-art methods and possesses strong generalization capabilities, particularly on generating previously unseen molecules, revealing the vast untapped potential of combining traditional computational methods with deep generative models for 3D molecular generation.
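
As a small illustration of what a purely geometric constraint looks like, the sketch below tightens upper bounds on interatomic distances with the triangle inequality, a classical step in distance geometry. It is not MiGDiff and not the full Metrization sampling procedure; the three-atom bound matrix and the function name are hypothetical.

```python
def smooth_upper_bounds(upper):
    """Tighten upper distance bounds so that u[i][j] <= u[i][k] + u[k][j]
    holds for every triple of atoms (a Floyd-Warshall style pass)."""
    n = len(upper)
    u = [row[:] for row in upper]  # work on a copy
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if u[i][k] + u[k][j] < u[i][j]:
                    u[i][j] = u[i][k] + u[k][j]
    return u

INF = float("inf")
# Hypothetical chain of three atoms: bonded pairs are at most 1.5 apart,
# and the bound between the two end atoms is initially unknown.
upper = [
    [0.0, 1.5, INF],
    [1.5, 0.0, 1.5],
    [INF, 1.5, 0.0],
]
smoothed = smooth_upper_bounds(upper)
print(smoothed[0][2])
```

The unknown bound between the end atoms becomes the sum of the two bond bounds, so any sampled distance matrix that respects the smoothed bounds is easier to keep geometrically consistent.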

Enhancing Generalizability in Molecular Conformation Generation with $\textbf{Metrization}$-Informed Geometric Diffusion Pretraining

Low-light image enhancement (LIE) aims at precisely and efficiently recovering an image degraded in poor illumination environments. Recent advanced LIE techniques use deep neural networks, which require many low-light/normal-light image pairs, many network parameters, and substantial computational resources. As a result, their practicality is limited. In this work, we devise a novel unsupervised LIE framework based on diffusion priors and lookup tables (DPLUT) to achieve efficient low-light image recovery. The proposed approach comprises two critical components: a light adjustment lookup table (LLUT) and a noise suppression lookup table (NLUT). LLUT is optimized with a set of unsupervised losses. It aims at predicting pixel-wise curve parameters for the dynamic range adjustment of a specific image. NLUT is designed to remove the noise that is amplified when the image is brightened. As diffusion models are sensitive to noise, diffusion priors are introduced to achieve high-performance noise suppression. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods in terms of visual quality and efficiency.
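
The phrase "pixel-wise curve parameters" can be illustrated with a minimal sketch. It assumes the quadratic enhancement curve x + αx(1 − x) used by earlier curve-based methods such as Zero-DCE; the abstract does not say which curve or table layout DPLUT uses, so the one-dimensional table, its values, and the function names are hypothetical.

```python
def apply_curve(x, alpha, iterations=4):
    """Iteratively apply x <- x + alpha * x * (1 - x) to an intensity in [0, 1]."""
    for _ in range(iterations):
        x = x + alpha * x * (1.0 - x)
    return x

def enhance(pixels, alpha_lut):
    """Look up one curve parameter per pixel from a 1D table indexed by
    quantized intensity, then apply the curve to that pixel."""
    bins = len(alpha_lut)
    out = []
    for p in pixels:
        idx = min(int(p * bins), bins - 1)
        out.append(apply_curve(p, alpha_lut[idx]))
    return out

# Hypothetical table: brighten dark pixels strongly, leave bright ones untouched.
alpha_lut = [0.9, 0.6, 0.3, 0.0]
dark_row = [0.05, 0.20, 0.45, 0.90]
bright_row = enhance(dark_row, alpha_lut)
print(bright_row)
```

For α in [0, 1] the curve never pushes an intensity outside [0, 1] and never darkens it, which is why a table lookup followed by this curve is cheap and safe to apply per pixel.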

DPLUT: Unsupervised Low-light Image Enhancement with Lookup Tables and Diffusion Priors

Existing low-light image enhancement (LIE) methods have achieved noteworthy success in solving synthetic distortions, yet they often fall short in practical applications. The limitations arise from two inherent challenges in real-world LIE: 1) the collection of distorted/clean image pairs is often impractical and sometimes impossible, and 2) accurately modeling complex degradations presents a non-trivial problem. To overcome them, we propose the Attribute Guidance Diffusion framework (AGLLDiff), a training-free method for effective real-world LIE. Instead of specifically defining the degradation process, AGLLDiff shifts the paradigm and models the desired attributes, such as image exposure, structure and color of normal-light images. These attributes are readily available and impose no assumptions about the degradation process, and they guide the diffusion sampling process toward a reliable, high-quality solution space. Extensive experiments demonstrate that our approach outperforms the current leading unsupervised LIE methods across benchmarks in terms of distortion-based and perceptual-based metrics, and it performs well even in sophisticated wild degradation.

AGLLDiff: Guiding Diffusion Models Towards Unsupervised Training-free Real-world Low-light Image Enhancement

Facial expression recognition (FER) has suffered from label ambiguity due to the inherent subjectivity of facial expressions. Additionally, the class imbalance prevalent in real-world scenarios further complicates the challenges in FER. Although many studies have shown impressive improvements, they typically address only one of these issues, leading to suboptimal results. To address both challenges simultaneously, we propose a novel framework called Navigating Label Ambiguity (NLA), which is robust in real-world conditions. The core idea behind NLA is to adaptively assign weights based on sample ambiguity while minimizing the impact of noise. To achieve this, NLA consists of two main components: Noise-aware Adaptive Weighting (NAW) and consistency regularization. Specifically, NAW adjusts weights by assigning higher importance to ambiguous samples and lower importance to noisy ones, based on the relationship between the intermediate prediction scores for the ground truth and the nearest negative. We also enhance the reliability of our NLA by incorporating a regularization term that ensures consistent latent distributions. Consequently, NLA effectively handles not only noise but also class imbalance by allowing the model to progressively focus on more challenging ambiguous samples that mainly belong to the minority class. Extensive experiments demonstrate that NLA outperforms existing methods in both overall and mean accuracy, confirming its robustness against noise and class imbalance. To the best of our knowledge, we are the first to tackle both problems within a single framework.

Navigating Label Ambiguity for Facial Expression Recognition in the Wild

Knowledge Distillation (KD) is essential in transferring dark knowledge from a large teacher to a small student network, such that the student can be much more efficient than the teacher but with comparable accuracy. Existing KD methods, however, rely on a large teacher trained specifically for the target task, which is both inflexible and inefficient. In this paper, we argue that an SSL-pretrained model can effectively act as the teacher and that its dark knowledge can be captured by the coordinate system, or linear subspace, in which the features lie. We then need only one forward pass of the teacher to tailor the coordinate system (TCS) for the student network. Our TCS method is teacher-free, applies to diverse architectures, works well for KD and practical few-shot learning, and allows cross-architecture distillation with a large capacity gap. Experiments show that TCS achieves significantly higher accuracy than state-of-the-art KD methods, while only requiring roughly half of their training time and GPU memory costs.

All You Need in Knowledge Distillation Is a Tailored Coordinate System

Recovering a 4D world from monocular video is a crucial yet challenging task.
Conventional methods usually rely on the assumptions of multi-view videos, known camera parameters, or static scenes.
In this paper, we relax all these constraints and tackle a highly ambitious but practical task: With only one monocular video without camera parameters, we aim to recover the dynamic 3D world alongside the camera poses.
To solve this, we introduce **GFlow**, a new framework that utilizes only 2D priors (depth and optical flow) to lift a video to a 4D scene, as a flow of 3D Gaussians through space and time. 
GFlow starts by segmenting the video into still and moving parts, then alternates between optimizing camera poses and the dynamics of the 3D Gaussian points.
This method ensures consistency among adjacent points and smooth transitions between frames.
Since dynamic scenes continually introduce new visual content, we present prior-driven initialization and pixel-wise densification strategies for Gaussian points to integrate new content.
By combining all those techniques, GFlow transcends the boundaries of 4D recovery from causal videos; it naturally enables tracking of points and segmentation of moving objects across frames.
Additionally, GFlow estimates the camera poses for each frame, enabling novel view synthesis by changing camera pose. This capability facilitates extensive scene-level or object-level editing, highlighting GFlow's versatility and effectiveness.

GFlow: Recovering 4D World from Monocular Video

Cascade ranking architecture, composed of matching, pre-ranking, ranking, and re-ranking stages, is usually adopted to balance efficiency and effectiveness in real-world recommendation systems (RS). As the middle stage of RS, pre-ranking aims to quickly filter out the low-quality items selected at the matching stage and forward high-quality items to the ranking stage. Existing pre-ranking approaches mainly suffer from two problems: 1) the Sample Selection Bias (SSB) problem, which heavily limits improvement in filtering out low-quality items because the data flow between stages is ignored; and 2) the Ranking Consistency (RC) problem, which may cause the ranked lists of the ranking stage and the preceding pre-ranking stage to be inconsistent. As a result, competitive items with high scores at the ranking stage may not be selected because of low scores at the pre-ranking stage. Both problems may cause sub-optimal performance, but previous works usually focus on only one of them. In this paper, we propose a novel Sample Debias and Ranking Consistency Joint Learning Framework (SDCL) to jointly alleviate the SSB and RC problems. SDCL consists of two main modules: 1) the Multi-Task Distillation Module (MTD), which enhances the ability to identify high-quality items by distilling knowledge across all tasks simultaneously from the more complex ranking model, which is jointly trained with the pre-ranking model; and 2) the Adaptive Negative Sample Learning Module (ANSL), which improves the filtering of low-quality items by adaptively adjusting the learning weights of negative samples based on the current performance of the model. SDCL seamlessly integrates the two modules in an end-to-end multi-task learning framework. Evaluations on both real-world large-scale traffic logs and online A/B tests demonstrate the efficacy and superiority of SDCL.
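
The Ranking Consistency problem described in this abstract can be made concrete with a small sketch. It is not SDCL: it only measures how many of the ranking model's top items survive the pre-ranking cut. The item ids, scores, and function names are hypothetical.

```python
def top_k(scores, k):
    """Return the k item ids with the highest scores."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

def recall_of_ranking_top(pre_scores, rank_scores, pre_k, rank_k):
    """Fraction of the ranking model's top rank_k items that survive
    the pre-ranking stage's top pre_k cut."""
    survivors = set(top_k(pre_scores, pre_k))
    ideal = top_k(rank_scores, rank_k)
    return sum(item in survivors for item in ideal) / rank_k

# Hypothetical scores for five candidate items from the matching stage.
pre_scores = {"a": 0.9, "b": 0.8, "c": 0.3, "d": 0.7, "e": 0.1}
rank_scores = {"a": 0.95, "b": 0.40, "c": 0.90, "d": 0.60, "e": 0.05}

consistency = recall_of_ranking_top(pre_scores, rank_scores, pre_k=3, rank_k=2)
print(consistency)
```

Here item `c` scores highly in the ranking model but is dropped by pre-ranking, so only half of the ranking model's top two items reach the ranking stage. A metric of this kind is one way to quantify the inconsistency between the two stages.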

Both Supply and Precision: Sample Debias and Ranking Consistency Joint Learning for Large Scale Pre-Ranking System

Driven by the rapid development of deep learning technology, the YOLO series has set a new benchmark for real-time object detectors. Additionally, transformer-based structures have emerged as the most powerful solution in the field, greatly extending the model's receptive field and achieving significant performance improvements. However, this improvement comes at a cost, as the quadratic complexity of the self-attention mechanism increases the computational burden of the model. To address this problem, we introduce a simple yet effective baseline approach called Mamba YOLO. Our contributions are as follows: 1) We propose the ODMamba backbone, which introduces a State Space Model (SSM) with linear complexity to address the quadratic complexity of self-attention. Unlike other Transformer-based and SSM-based methods, ODMamba is simple to train without pretraining. 2) To meet real-time requirements, we design the macro structure of ODMamba and determine the optimal stage ratio and scaling size. 3) We design the RG Block, which employs a multi-branch structure to model the channel dimensions and addresses the possible limitations of SSM in sequence modeling, such as insufficient receptive fields and weak image localization. This design captures localized image dependencies more accurately. Extensive experiments on the publicly available COCO benchmark dataset show that Mamba YOLO achieves state-of-the-art performance compared to previous methods. Specifically, a tiny version of Mamba YOLO achieves a 7.5% improvement in mAP on a single 4090 GPU with an inference time of 1.5 ms.
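
The linear complexity claimed for state space models comes from their recurrent form, which touches each sequence element once. The sketch below is a scalar, time-invariant toy recurrence, not ODMamba or Mamba's selective scan (which uses input-dependent parameters and vector states); the parameter names `a`, `b`, `c` follow the usual SSM notation and the values are hypothetical.

```python
def ssm_scan(xs, a, b, c):
    """Run h_t = a * h_{t-1} + b * x_t and y_t = c * h_t over a sequence.
    One pass over the input, so the cost grows linearly with its length."""
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys

# Impulse input: the output decays geometrically with factor a.
ys = ssm_scan([1.0, 0.0, 0.0, 0.0], a=0.5, b=1.0, c=2.0)
print(ys)  # [2.0, 1.0, 0.5, 0.25]
```

Self-attention, by contrast, compares every position with every other position, which is the source of the quadratic cost the abstract refers to.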
