United States

Multimodal large language models (MLLMs) enhance their perceptual capabilities by integrating visual and textual information. However, the massive number of visual tokens incurs a significant computational cost. Existing analysis of the MLLM attention mechanisms is unfortunately shallow, leading to coarsely specified token pruning strategies that are unable to strike a balance between speed and accuracy. In this paper, we conduct a comprehensive investigation of MLLM attention mechanisms with LLaVA as the case study subject. We find that numerous visual tokens and partial attention computations are ineffective during the decoding process. Based on empirical insights, we propose Spatial-Temporal Visual Token Trimming ($\textbf{ST}^{3}$) with two primary components: 1) $\textit{Spatial}$: Progressive Visual Token Pruning ($\textbf{PVTP}$) and 2) $\textit{Temporal}$: Visual Token Annealing ($\textbf{VTA}$). ${\bf PVTP}$ eliminates inattentive visual tokens as layers deepen, while ${\bf VTA}$ dynamically reduces the number of visual tokens in each layer as the generated tokens grow. Together, these techniques achieve around $\mathbf{2\times}$ faster inference with only about $\mathbf{30}$% KV cache memory compared to the original LLaVA, while maintaining consistent performance across various datasets. 
Crucially, our proposed mechanisms are designed to be plug-and-play, allowing seamless integration with existing pre-trained MLLMs without incurring any additional training costs.
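The two components can be pictured with a small sketch. The abstract does not give the actual pruning schedules, so everything here is an assumption made for illustration: the function names, the linear decay of the kept fraction with layer depth (standing in for PVTP), the linear decay with the number of generated tokens (standing in for VTA), and the default ratios are all hypothetical.

```python
def keep_count(num_visual, layer, num_layers, step, max_steps,
               min_layer_ratio=0.25, min_step_ratio=0.5):
    """Hypothetical budget: visual tokens kept at a given layer and decoding step."""
    # Spatial part (PVTP-like): the kept fraction shrinks linearly as layers deepen.
    layer_ratio = 1.0 - (1.0 - min_layer_ratio) * layer / max(num_layers - 1, 1)
    # Temporal part (VTA-like): the kept fraction shrinks as more tokens are generated.
    step_ratio = 1.0 - (1.0 - min_step_ratio) * min(step, max_steps) / max_steps
    return max(1, int(num_visual * layer_ratio * step_ratio))


def prune_visual_tokens(attention, k):
    """Indices of the k visual tokens with the highest attention, in original order."""
    ranked = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)
    return sorted(ranked[:k])


# Toy attention scores received by 8 visual tokens (made-up values).
attn = [0.05, 0.30, 0.02, 0.25, 0.08, 0.10, 0.15, 0.05]
k = keep_count(len(attn), layer=3, num_layers=4, step=0, max_steps=10)
kept = prune_visual_tokens(attn, k)
```

Because pruned tokens no longer need keys and values in later layers or later decoding steps, a budget of this kind is what would shrink the KV cache; the real method may rank and schedule tokens differently.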

AAAI 2025

ST3: Accelerating Multimodal Large Language Model by Spatial-Temporal Visual Token Trimming

language and vision

poster

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.





Learning with softmax cross-entropy on one-hot labels often leads to overconfidence on the correct class. While label smoothing regulates this overconfidence by redistributing $\alpha$ confidence from the correct class to the incorrect classes, it compromises the information that the logits carry about the similarity between samples of different classes, and it may hurt calibration if a larger $\alpha$ is required for high accuracy. To overcome these limitations, we propose a Virtual Smoothing label that redistributes some confidence from the correct class to additional Virtual Smoothing (VS) classes to regularize overconfidence. In VS labels, the VS class nodes act as adversaries to the original class nodes, enforcing regularization by clustering samples across all classes. The zero confidence of each incorrect class also allows the incorrect logits to differ from each other, so information about sample similarities is not erased. The prediction probability can still approach 1 when softmax is applied to the logits of the original real classes, so calibration is not harmed and is in fact consistently improved. Experiments show that VS labels consistently improve accuracy and calibration while providing better logits for improved knowledge distillation. Additionally, VS labels are effective in improving adversarial training, robust distillation, and out-of-distribution detection.

Training Deep Neural Networks with Virtual Smoothing Classes
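A minimal sketch of how a VS label differs from a label-smoothing target, based only on the description in the abstract above. The function names are hypothetical, and splitting $\alpha$ evenly among the VS classes is an assumption; the abstract does not say how the confidence is divided.

```python
import math


def label_smoothing(correct, num_real, alpha):
    """Label smoothing: alpha is moved from the correct class to the incorrect classes."""
    label = [alpha / (num_real - 1)] * num_real
    label[correct] = 1.0 - alpha
    return label


def vs_label(correct, num_real, num_virtual, alpha):
    """VS-style target: alpha is moved to extra virtual classes (even split assumed)."""
    label = [0.0] * (num_real + num_virtual)  # incorrect real classes stay at zero
    label[correct] = 1.0 - alpha
    for j in range(num_real, num_real + num_virtual):
        label[j] = alpha / num_virtual
    return label


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_real(logits, num_real):
    """At inference, softmax is applied to the real-class logits only."""
    return softmax(logits[:num_real])


target = vs_label(correct=1, num_real=3, num_virtual=2, alpha=0.2)
probs = predict_real([0.5, 4.0, -1.0, 2.0, 2.0], num_real=3)
```

In the label-smoothing target every incorrect class receives the same nonzero confidence, whereas in the VS-style target the incorrect real classes stay at zero and only the virtual classes absorb $\alpha$.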

High-resolution Vision-Language Models (VLMs) have been widely used in multimodal tasks to enhance accuracy by preserving detailed image information. However, these models often generate excessive visual tokens because they encode multiple partitions of the input image. Processing these excessive visual tokens is computationally challenging, especially in resource-constrained environments with commodity GPUs. To support high-resolution images while meeting resource constraints, we propose High-Resolution Early Dropping (HiRED), a token-dropping scheme that operates within a fixed token budget before the Large Language Model (LLM) stage. HiRED can be integrated with existing high-resolution VLMs in a plug-and-play manner, as it requires no additional training while still maintaining superior accuracy. We strategically use the vision encoder’s attention in the initial layers to assess the visual content of each image partition and allocate the token budget accordingly. Then, using the attention in the final layer, we select the most important visual tokens from each partition within the allocated budget, dropping the rest. Empirically, when applied to LLaVA-Next-7B on an NVIDIA Tesla P40 GPU, HiRED with a 20% token budget increases token generation throughput by 4.7$\times$, reduces first-token generation latency by 15 seconds, and saves 2.3 GB of GPU memory for a single inference.

HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models
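The two-stage selection described in the HiRED abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the per-partition scores and per-token scores are taken as given inputs (in the abstract they come from the vision encoder's initial-layer and final-layer attention, respectively), and proportional allocation with leftover tokens going to the highest-scoring partitions is an assumed rule.

```python
def allocate_budget(partition_scores, total_budget):
    """Split a fixed token budget across partitions in proportion to their scores."""
    total = sum(partition_scores)
    budgets = [int(total_budget * s / total) for s in partition_scores]
    # Tokens lost to rounding down go to the highest-scoring partitions.
    leftover = total_budget - sum(budgets)
    order = sorted(range(len(budgets)), key=lambda i: partition_scores[i], reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets


def select_tokens(token_scores, budget):
    """Indices of the top-`budget` tokens in one partition, in original order."""
    ranked = sorted(range(len(token_scores)), key=lambda i: token_scores[i], reverse=True)
    return sorted(ranked[:budget])


def drop_tokens(partition_scores, token_scores_per_partition, total_budget):
    """Kept token indices per partition; everything else is dropped before the LLM."""
    budgets = allocate_budget(partition_scores, total_budget)
    return [select_tokens(scores, b)
            for scores, b in zip(token_scores_per_partition, budgets)]


# Three partitions with made-up content scores and per-token importance scores.
kept = drop_tokens(
    partition_scores=[5, 3, 2],
    token_scores_per_partition=[[0.1, 0.9, 0.4], [0.7, 0.2], [0.3, 0.6, 0.5]],
    total_budget=5,
)
```

Because the dropping happens before the LLM stage, the language model only ever sees the kept tokens, which is where the throughput and memory savings reported in the abstract would come from.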

Quantum computing promises to revolutionize various fields, yet the execution of quantum programs necessitates an effective compilation process. This involves strategically mapping quantum circuits onto the physical qubits of a quantum processor. The qubits' arrangement, or topology, is pivotal to the circuit's performance, and its complexity often defies traditional heuristic or manual optimization methods. In this study, we introduce a novel approach that leverages reinforcement learning to dynamically tailor qubit topologies to the unique specifications of individual quantum circuits. It guides algorithm-driven quantum processor topology design to reduce the depth of the mapped circuit, which is particularly critical for output accuracy on noisy quantum processors. Our method marks a significant departure from previous methods, which have been constrained to mapping circuits onto a fixed processor topology. Experiments demonstrate notable enhancements in circuit performance, with a reduction in circuit depth of at least 20% in 60% of the cases examined and a maximum reduction of 46%. Furthermore, the benefits of our approach in reducing circuit depth become increasingly evident as the scale of the quantum circuits increases, demonstrating the scalability of our method in terms of problem size. This work advances the co-design of quantum processor architecture and algorithm mapping, offering a promising avenue for future research and development in the field.

AI-Powered Algorithm-Centric Quantum Processor Topology Design

Text-attributed graphs have recently garnered significant attention due to their wide range of applications in web domains. Existing methodologies employ word embedding models to acquire text representations as node features, which are subsequently fed into Graph Neural Networks (GNNs) for training. The advent of Large Language Models (LLMs) has brought powerful capabilities in information retrieval and text generation, which can greatly enhance the text attributes of graph data. Furthermore, the acquisition and labeling of extensive datasets are both costly and time-consuming. Consequently, few-shot learning has emerged as a crucial problem in graph learning tasks. To tackle this challenge, we propose a lightweight paradigm called LLM4NG, which adopts a plug-and-play approach to establish supervision signals by leveraging LLMs for node generation. Specifically, we utilize LLMs to extract semantic information from the labels and generate samples that belong to these categories as exemplars. Subsequently, we employ an edge predictor to capture the structural information inherent in the raw dataset and integrate the newly generated samples into the original graph. This approach harnesses LLMs to enhance class-level information and seamlessly introduces labeled nodes and edges without modifying the raw dataset, thereby facilitating the node classification task in few-shot scenarios. Extensive experiments demonstrate the outstanding performance of our proposed paradigm, particularly in low-shot scenarios. For instance, in the 1-shot setting of the ogbn-arxiv dataset, LLM4NG achieves a 76% improvement over the baseline model.

Leveraging Large Language Models for Node Generation in Few-Shot Learning on Text-Attributed Graphs

Source-Free Domain Adaptation (SFDA) aims to transfer a pre-trained source model to the unlabeled target domain without accessing the source data, thereby addressing both labeled data dependency and domain shift. However, the SFDA setting faces a bottleneck due to the absence of supervisory information. To mitigate this problem, Active Learning (AL) is combined with SFDA to actively label a small set of the highest-quality target points, so that models with satisfactory performance can be obtained at an acceptable cost. Nevertheless, several issues remain unresolved, namely when to query new labels during training, what kind of samples deserve labeling to ensure rich information, and where the labels should be distributed to guarantee diversity. We therefore develop ActiveSFDA to comprehensively address the “When, What, and Where” problems of active point querying in Source-Free Domain Adaptation for cross-modal 3D semantic segmentation. The method consists of three main components: Query Decider, Point Ranker, and Budget Slicer. The Query Decider determines the optimal timing to query new points by fitting the validation curves during training. The Point Ranker nominates points for annotation by calculating the ambiguity of neighboring points in the feature space. The Budget Slicer allocates the annotation quota, i.e., the labeling percentage of the point cloud, to different semantic regions by utilizing the advanced 2D semantic segmentation capabilities of the Segment Anything Model (SAM). Extensive experiments demonstrate the effectiveness of our proposed method, which achieves up to 99.64% of fully supervised performance with only 3% of labels and consistently outperforms comparison methods across various scenarios.

Omni-Query Active Learning for Source-Free Domain Adaptive Cross-Modality 3D Semantic Segmentation

Unsupervised domain adaptation (UDA) is a machine learning approach designed to minimize reliance on labeled data by aligning features between a labeled source domain and an unlabeled target domain, thereby reducing feature discrepancies; it is effective for multivariate time series (MTS) prediction. However, most MTS UDA methods focus solely on aligning intra-series temporal features, overlooking the valuable information in inter-series dependencies. Research has highlighted that analyzing decomposed frequency dependencies in time series can reveal significant trends, noise patterns, and intricate temporal details. To address these unexplored frequency dependencies, we introduce the Frequency Graph Discovery Module (FGD), which uncovers and aligns shared frequency information and correlations across domains. Additionally, we propose a Frequency-Contextual Contrastive Learning (FCCL) framework to better capture and align frequency-contextual representations in multivariate time series, ensuring the extraction of label-invariant information for prediction. Furthermore, because existing models overlook the valuable and abundant information outside the source and target datasets, we enhance the MTS UDA prediction model with a Language-guided Adversary Alignment (LAA) module, which leverages the capabilities of Large Language Models (LLMs) to obtain text-encoded label embeddings and align the classification features, thereby improving prediction accuracy. Empirically, our model achieves state-of-the-art results on three public multivariate time-series datasets for unsupervised domain adaptation.

Enhancing Multivariate Time-Series Domain Adaptation via Contrastive Frequency Graph Discovery and Language-Guided Adversary Alignment

Graph contrastive learning (GCL) has drawn much research attention for its ability to learn node representations in a self-supervised manner. However, the homophily assumption inherent in GNN encoders limits the direction (macro-level) and the process (micro-level) of message passing in current GCL frameworks, impairing the expressive power of GCL in non-homophilous graphs. This paper presents a novel framework that employs Macro and Micro Message Passing in GCL ($\mathrm{M}^3\mathrm{P}\text{-}\mathrm{GCL}$) to overcome these limitations and advance performance in both homophilous and non-homophilous graphs. Specifically, at the macro-level, we integrate both structural and attribute views to enhance the direction of message passing, and employ an Aligned Priority-Supporting View Encoding (APS-VE) strategy to facilitate contrastive training; at the micro-level, we propose an Adaptive Self-Propagation (ASP) strategy based on role segmentation of self-loops to diversify the process of message passing in the encoder. These enhancements effectively address the limitations imposed by the homophily assumption. Experiments demonstrate that $\mathrm{M}^3\mathrm{P}\text{-}\mathrm{GCL}$ outperforms both supervised and unsupervised baselines in the node classification task on various datasets with different levels of homophily.

Beyond Homophily: Graph Contrastive Learning with Macro-Micro Message Passing

Compared to fully supervised object detection, training with sparse annotations typically leads to a decline in performance due to insufficient feature diversity. Existing sparsely annotated object detection (SAOD) methods often rely on pseudo-labeling strategies, but these pseudo-labels tend to introduce noise under extreme sparsity. To simultaneously avoid the impact of pseudo-label noise and enhance feature diversity, we propose a novel Adaptive Feature Generation (AdaptFG) model that generates features based on class names. This model integrates a pre-trained CLIP into a VAE-based feature generator, with its core innovation being an Adaptor that adaptively maps CLIP’s semantic embeddings to the object detector domain. Additionally, we introduce inter-class relationship reasoning in the detector, which effectively mitigates misclassifications stemming from similar features. Extensive experimental results demonstrate that AdaptFG consistently outperforms state-of-the-art SAOD methods on the PASCAL VOC and MS COCO benchmarks. The code is provided in the supplementary materials.

As Pseudo-Label Free as Possible: Leveraging Adaptive Feature Generation for Sparsely Annotated Object Detection

Recommender systems are increasingly used to provide personalized suggestions and enhance user satisfaction. Typical recommendation models encode users and items as embeddings, and generate recommendations by assessing the similarity between these embeddings. Despite their effectiveness, these embedding-based models struggle to model user uncertainty and to capture diverse user interests with a single fixed user embedding. Recent studies have begun to explore a user-distribution paradigm that learns distributions for users. However, this approach employs a single distribution per user, which fails to effectively delineate semantic boundaries, resulting in sub-optimal recommendations. To this end, we propose GCDR, a Guided Conditional Diffusion Recommender model, to learn multiple distributions for each user. Specifically, GCDR addresses two major challenges: 1) learning disentangled distributions, and 2) learning personalized distributions. GCDR captures inter-user and intra-user distribution properties through conditional and guided diffusion, respectively. It maintains user-specific embeddings to encode long-term interests for conditional diffusion, while for guided diffusion, it incorporates short-term interests encoded from recent interactions with category preferences. To align the diffusion model with the recommendation task, we train GCDR with three loss functions: the user loss, the recommendation loss, and the diffusion loss. Extensive experiments on four real-world datasets show that GCDR is able to learn effective user distributions and is superior to thirteen state-of-the-art baseline methods. The source code is available at https://shorturl.at/Pru2Q.

Learning Multiple User Distributions for Recommendation via Guided Conditional Diffusion

Acquiring pairwise noisy-clean training data is challenging. Consequently, some self-supervised denoising methods utilize noisy image pairs as both input and target for network training. However, a major issue with these methods is the gap between the clean images underlying the input and the target. In this paper, we achieve high-quality image denoising by reducing or even eliminating this gap. Our method, Zero-Shot Noise2Mean, requires no training data or prior knowledge of the noise distribution. It consists of two lightweight networks that can be trained using only a single noisy test image. Specifically, we propose a random mask-based downsampler that generates multiple pairs of downsampled noisy images, which are similar but distinct. These image pairs serve as the input for the first network, with the mean image of each pair used as the target. This initially reduces the gap between the clean images underlying the input and the target. Notably, in our method, the clean counterpart of the first network's target (i.e., the mean image) can be obtained. We then train a second network using the mean image as input and its clean counterpart as the target. This effectively eliminates the gap and achieves better denoising results. Extensive experiments demonstrate that our method outperforms existing methods in both denoising performance and efficiency.
