With the advancement of large-scale language modeling techniques, large multimodal models combining visual encoders with large language models have demonstrated exceptional performance in various visual tasks. Most current large-scale multimodal models achieve this by mapping visual features obtained from the visual encoder into a large language model and using them as inputs alongside text for downstream tasks. Therefore, the number of visual tokens directly affects the training and inference speed of the model. There has been significant work on token pruning for visual transformers, but for large multimodal models, relying only on visual information for token pruning or compression may lead to significant loss of important information. On the other hand, the textual input in the form of a question may contain valuable information that can aid in answering the question, providing additional knowledge to the model. To address the potential oversimplification and excessive pruning that can occur with most purely visual token pruning methods, we propose a training-free, text information-guided dynamic visual token recovery mechanism. This mechanism leverages the similarity between the question text and visual tokens to recover visually meaningful tokens with important text information while merging other less important tokens. Experimental results demonstrate that our proposed method achieves comparable performance to the original approach while compressing the visual tokens to an average of 10% of the original quantity. Our source code will be made publicly available following acceptance.
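The recover-and-merge idea described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say which similarity measure is used, so cosine similarity is assumed here, and the function name `recover_and_merge`, the fixed `keep_ratio`, and the choice to average all remaining tokens into a single merged token are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def recover_and_merge(visual_tokens, text_embedding, keep_ratio=0.1):
    """Keep the visual tokens most similar to the text and average the rest.

    Assumptions (not from the abstract): cosine similarity, a fixed
    keep_ratio, and one merged token for everything that is not kept.
    """
    n_keep = max(1, round(len(visual_tokens) * keep_ratio))
    scores = [cosine(tok, text_embedding) for tok in visual_tokens]
    ranked = sorted(range(len(visual_tokens)), key=lambda i: scores[i], reverse=True)
    kept_idx = sorted(ranked[:n_keep])  # restore the original token order
    rest_idx = ranked[n_keep:]
    kept = [visual_tokens[i] for i in kept_idx]
    if rest_idx:
        dim = len(text_embedding)
        merged = [
            sum(visual_tokens[i][d] for i in rest_idx) / len(rest_idx)
            for d in range(dim)
        ]
        kept.append(merged)  # one averaged token stands in for the pruned ones
    return kept_idx, kept


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.0, -1.0]]
    question = [1.0, 0.0]
    indices, compressed = recover_and_merge(tokens, question, keep_ratio=0.5)
    print(indices, len(compressed))
```

A dynamic variant, as the abstract suggests, would choose the number of kept tokens from the score distribution rather than from a fixed ratio.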

AAAI 2025

Recoverable Compression: A Multimodal Vision Token Recovery Mechanism Guided by Text Information

multimodal vision

poster

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

### [Invited Speakers](https://aaai.org/conference/aaai/aaai-25/aaai-25-invited-speakers/)

Register [here](https://aaai.org/conference/aaai/aaai-25/registration/)

As the demands for superior agents grow, the training complexity of Deep Reinforcement Learning (DRL) becomes higher. Thus, accelerating training of DRL has become a major research focus. Dividing the DRL training process into sub-tasks and using parallel computation can effectively reduce training costs. However, current DRL training systems lack sufficient parallelization due to data assignment between sub-task components. This assignment issue has been ignored, but addressing it can further boost training efficiency. Therefore, we propose a high-throughput distributed RL training system called TianJi. It relaxes assignment dependencies between sub-task components and enables event-driven asynchronous communication. Meanwhile, TianJi maintains clear boundaries between sub-task components. To address convergence uncertainty from relaxed assignment dependencies, TianJi proposes a distributed strategy based on the balance of sample production and consumption. The strategy controls the staleness of samples to correct their quality, ensuring convergence. We conducted extensive experiments. TianJi achieves a convergence time acceleration ratio of up to 4.37 compared to related comparison frameworks. When scaled to eight computational nodes, TianJi shows a convergence time speedup of 1.6 and a throughput speedup of 7.13 relative to XingTian, demonstrating both its ability to accelerate training and its scalability. In data transmission efficiency experiments, TianJi significantly outperforms other frameworks, approaching hardware limits. TianJi also shows effectiveness in on-policy algorithms, achieving convergence time acceleration ratios of 4.36 and 2.95 compared to RLlib and XingTian.

Highly Parallelized Reinforcement Learning Training with Relaxed Assignment Dependencies

Neural networks remain as black-box systems, {\em unsure} about their outputs, and their performances may drop unpredictably in real applications. An open question is how to qualitatively extend neural networks that are {\em sure} about their reasoning results, or {\em reason for sure}. In symbolic logic, the validity of {\em reasoning} is guaranteed by {\em a proof} that is a sequence of true-false statements, either assumed true in the task or deduced from previous statements, where statements are either features of or relations between/among entities. 
Here, we introduce set-theoretic relations explicitly and seamlessly into neural networks by extending vector embedding into a sphere, so that part-whole relations can explicitly encode set-theoretic relations through sphere boundaries. {\em A neural proof} turns out to be a process of model construction in the form of a sequential transformation from the premise to the conclusion sphere configurations. We propose the criterion of {\em neural reasoning for sure} and apply it to Aristotelian syllogistic reasoning, the fundamental reasoning system. We implement Hyperbolic Sphere Neural Network (HSphNN), the first neural network that has a theoretical proof of reasoning {\em for sure} with syllogistic statements. In experiments, HSphNN reached the {\em reason-for-sure} criterion for all types of syllogistic reasoning and successfully checked both decisions and explanations of ChatGPT (gpt-3.5-turbo and gpt-4o). Through sending feedback using prompts, HSphNN improved the performance of ChatGPT-3.5-turbo from 42.19\% to 63.28\%, and of ChatGPT-4o from 82.42\% to 85.16\%. We show ways to extend HSphNN for logical reasoning and statistical reasoning, and to seamlessly integrate with traditional neural networks.
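To make concrete the idea that sphere boundaries can encode set-theoretic relations, here is a minimal sketch in ordinary Euclidean space. It is an assumption-laden illustration only: HSphNN works with hyperbolic spheres and constructs the configurations by neural optimization, neither of which is shown here, and the names `Sphere`, `inside`, and `disjoint` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Sphere:
    center: tuple
    radius: float


def inside(a, b):
    """True if sphere a lies entirely within sphere b ("all a are b")."""
    return math.dist(a.center, b.center) + a.radius <= b.radius


def disjoint(a, b):
    """True if the two spheres do not overlap ("no a are b")."""
    return math.dist(a.center, b.center) >= a.radius + b.radius


if __name__ == "__main__":
    # Premises: all greeks are humans, all humans are mortals.
    greeks = Sphere((0.0, 0.0), 1.0)
    humans = Sphere((0.5, 0.0), 2.0)
    mortals = Sphere((0.0, 0.0), 5.0)
    print(inside(greeks, humans), inside(humans, mortals))
    # Conclusion read off the same configuration: all greeks are mortals.
    print(inside(greeks, mortals))
```

Once a configuration satisfying the premises exists, the conclusion is checked geometrically on that same configuration, which is the sense in which a constructed model can serve as a proof.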

Neural Reasoning for Sure Through Constructing Explainable Models

Grasping the intricacies of human motion, which involve perceiving spatio-temporal dependence and multi-scale effects, is essential for predicting human motion.
While humans inherently possess the requisite skills to navigate this issue, it proves to be markedly more challenging for machines to emulate. 
To bridge the gap, we propose the Human-like Vision and Inference System (HVIS) for human motion prediction, which is designed to emulate human observation and forecast future movements.
HVIS comprises two components: the human-like vision encode (HVE) module and the human-like motion inference (HMI) module.
The HVE module mimics and refines the human visual process, incorporating a retina-analog component that captures spatiotemporal information separately to avoid unnecessary crosstalk. Additionally, a visual cortex-analog component is designed to hierarchically extract and treat complex motion features, focusing on both global and local features of human poses.
The HMI is employed to simulate the multi-stage learning model of the human brain. The spontaneous learning network simulates the neuronal fracture generation process for the adversarial generation of future motions. Subsequently, the deliberate learning network is optimized for hard-to-train joints to prevent misleading learning.
Experimental results demonstrate that our method achieves new state-of-the-art performance, significantly outperforming existing methods by 19.8\% on Human3.6M, 15.7\% on CMU Mocap, and 11.1\% on G3D. Our code is anonymously released.

HVIS: A Human-like Vision and Inference System for Human Motion Prediction

Federated learning is often used in environments with many unverified participants. Therefore, federated learning under adversarial attacks receives significant attention. This paper proposes an algorithmic framework for list-decodable federated learning, where a central server maintains a list of models, with at least one guaranteed to perform well. The framework has no strict restriction on the fraction of honest workers, extending the applicability of Byzantine federated learning to the scenario with more than half adversaries. Under proper assumptions on the loss function, we prove a convergence theorem for our method. Experimental results, including image classification tasks with both convex and non-convex losses, demonstrate that the proposed algorithm can withstand the malicious majority under various attacks.

LiD-FL: Towards List-Decodable Federated Learning

Backward error analysis finds a modified loss function that the parameter updates actually follow under the influence of an optimization method. The additional loss terms included in this modified function are called the implicit regularizer. In this paper, we attempt to find the implicit regularizer for various federated learning algorithms on non-IID data distributions, and explain why each method shows different convergence behavior. We first show that the implicit regularizer of FedAvg disperses the gradient of each client from the average gradient, thus increasing the gradient variance. We also empirically show that the implicit regularizer hampers its convergence. Similarly, we compute the implicit regularizers of FedSAM and SCAFFOLD, and explain why they converge better. While existing convergence analyses only point out the advantages of FedSAM and SCAFFOLD, our approach can explain their limitations in complex non-convex settings. Specifically, we demonstrate that FedSAM can partially remove the bias in the first-order term of the implicit regularizer in FedAvg, whereas SCAFFOLD can fully eliminate the bias in the first-order term, but not in the second-order term. Consequently, the implicit regularizer can provide useful insight into the convergence behavior of federated learning from a different theoretical perspective.
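As background for the notion of an implicit regularizer, the standard backward error analysis of plain full-batch gradient descent can be written out as below. This is the textbook single-machine case, not the federated results of the paper; the symbols $h$ (step size), $\theta$ (parameters), and $\tilde{L}$ (modified loss) are notation assumed here.

```latex
% Gradient descent with step size h on a loss L:
\theta_{k+1} = \theta_k - h \,\nabla L(\theta_k)

% Ansatz: the iterates follow a modified flow
\dot{\theta} = f(\theta) + h\, f_1(\theta), \qquad f(\theta) = -\nabla L(\theta)

% Taylor expansion of the flow over one step of length h:
\theta(t+h) = \theta + h f + h^2 \left( f_1 + \tfrac{1}{2}\, (\nabla f)\, f \right) + O(h^3)

% Matching the gradient descent step requires the h^2 term to vanish:
f_1 = -\tfrac{1}{2}\, \nabla^2 L \,\nabla L = -\nabla \left( \tfrac{1}{4} \lVert \nabla L \rVert^2 \right)

% Hence the modified loss, with the implicit regularizer as the second term:
\tilde{L}(\theta) = L(\theta) + \frac{h}{4}\, \lVert \nabla L(\theta) \rVert^2
```

The second term penalizes large gradient norms. The abstract's analysis derives the analogous extra terms for FedAvg, FedSAM, and SCAFFOLD.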

Convergence Analysis of Federated Learning Methods Using Backward Error Analysis

Inspired by the human brain's ability to adapt to new tasks without erasing prior knowledge, we develop spiking neural networks (SNNs) with dynamic structures for Class Incremental Learning (CIL). Our analytical experiments reveal that limited datasets introduce biases in logits distributions among tasks. Fixed features from frozen past-task extractors can cause overfitting and hinder the learning of new tasks.
To address these challenges, we propose the ALADE-SNN framework, which includes adaptive logit alignment for balanced feature representation and OtoN suppression to manage weights mapping frozen old features to new classes during training, releasing them during fine-tuning. This approach dynamically adjusts the network architecture based on analytical observations, improving feature extraction and balancing performance between new and old tasks.
Experimental results show that ALADE-SNN achieves an average incremental accuracy of 75.42 ± 0.74% on the CIFAR100-B0 dataset over 10 incremental steps. ALADE-SNN not only matches the performance of DNN-based methods but also surpasses state-of-the-art SNN-based continual learning algorithms. This advancement enhances continual learning in neuromorphic computing, offering a brain-inspired, energy-efficient solution for real-time data processing.

ALADE-SNN: Adaptive Logit Alignment in Dynamically Expandable Spiking Neural Networks for Class Incremental Learning

We propose HYBOOD, a hybrid out-of-distribution model based on normalizing flow followed by a simple linear classification model. In real-world settings, data corruption is known to strongly degrade model performance; examples include image-quality corruptions such as noise and blur, and geometric corruptions such as translation, scaling, and rotation. MNIST-C and CIFAR10-C are commonly used synthesized datasets for measuring model performance and corruption difficulty in terms of covariate and semantic shifts.
HYBOOD shows that the separability between in-distribution, covariate shift, and semantic shift can be represented by distribution distance and log-scale density $\log(p(x))$. We also find that the attributes of covariate shifts are ordered by corruption difficulty ranking (CDR) for the datasets. To the best of our knowledge, this is the first method to measure data corruption difficulty with generative models using Wasserstein Distance, Mutual Information and Minimal Description Length. In this paper, we present experimental results showing that the MNIST-C trained generative model is most deteriorated by the fog, impulse noise and stripe corruption types. This suggests that these attributes are challenging corruptions for the generative model in terms of uncertainty and complexity. Trained on in-distribution data only, HYBOOD detects out-of-distribution samples under distinguishable covariate and semantic shifts and quantifies the covariate shift ranking.

HYBOOD: A Hybrid Generative Model for Out-of-Distribution Detection with Corruption Estimation

Pretrained visual-language models such as CLIP have made significant advancements in multimodal tasks, including image-text retrieval. However, a major challenge in image-text matching lies in language bias, where models predominantly rely on language priors and neglect to adequately consider the visual content. We thus present Multimodal ASsociation Score (MASS), a training-free framework that reduces the reliance on language priors for better visual accuracy in image-text matching problems. Our method can be seamlessly incorporated into existing visual-language models without necessitating additional training. Our experiments have shown that MASS effectively reduces color, number, and gender biases. Also, it maintains linguistic understanding capability, as demonstrated in numerical results on image-text matching tests with linguistic complexities. Overall, MASS offers a promising solution for enhancing image-text matching performance in visual-language models.

MASS: Overcoming Language Bias in Image-Text Matching

Optical neural networks (ONNs) have attracted great attention due to their low power consumption and high-speed processing.
When training an ONN implemented on a chip with possible fabrication variations, the well-known backpropagation algorithm cannot be executed accurately because the perfect information inside the chip cannot be observed.
Instead, we employ a black-box optimization method such as zeroth-order (ZO) optimization.
In this paper, we first discuss how ONN parameters should be perturbed to search for better values in a black-box manner.
Conventionally, parameter perturbations are sampled from a normal distribution with an identity covariance matrix.
This is plausible if the parameters are not interrelated in a module, like a linear module of an ordinary neural network. 
However, this is not the best way for ONN modules with layered parameters, which are interrelated by optical paths.
We then propose to perturb the parameters by a normal distribution with a special covariance matrix computed by our novel method.
The covariance matrix is designed so that the perturbations appearing at the module output caused by the parameter perturbations become as isotropic as possible to uniformly search for better values.
Experimental results show that the proposed method using the special covariance matrix significantly outperformed conventional methods.
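The general scheme of zeroth-order optimization with correlated perturbations can be sketched as follows. This is a generic two-point estimator under stated assumptions, not the paper's method: how the special covariance matrix is computed is not described in the abstract, so the covariance factor is simply passed in, and the names `sample_perturbation`, `zo_gradient`, `zo_minimize`, and all default values are hypothetical.

```python
import random


def sample_perturbation(factor, rng):
    """Draw u = factor @ z with z ~ N(0, I), so Cov(u) = factor @ factor^T."""
    z = [rng.gauss(0.0, 1.0) for _ in range(len(factor[0]))]
    return [sum(a * b for a, b in zip(row, z)) for row in factor]


def zo_gradient(loss, theta, factor, rng, mu=1e-3, n_samples=20):
    """Two-point zeroth-order gradient estimate using loss evaluations only."""
    grad = [0.0] * len(theta)
    for _ in range(n_samples):
        u = sample_perturbation(factor, rng)
        plus = loss([t + mu * ui for t, ui in zip(theta, u)])
        minus = loss([t - mu * ui for t, ui in zip(theta, u)])
        scale = (plus - minus) / (2.0 * mu * n_samples)
        grad = [g + scale * ui for g, ui in zip(grad, u)]
    return grad


def zo_minimize(loss, theta, factor, steps=200, lr=0.05, seed=0):
    """Plain descent on the zeroth-order estimate; the loss is a black box."""
    rng = random.Random(seed)
    for _ in range(steps):
        grad = zo_gradient(loss, theta, factor, rng)
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta


if __name__ == "__main__":
    def black_box_loss(params):
        # Stand-in for a measured on-chip loss; only its value is observable.
        return sum((p - 1.0) ** 2 for p in params)

    identity = [[1.0, 0.0], [0.0, 1.0]]  # the conventional isotropic choice
    print(zo_minimize(black_box_loss, [0.0, 0.0], identity))
```

Replacing `identity` with a factor of a non-identity covariance changes which directions are explored most. The abstract's proposal is to choose that covariance so the induced perturbations at the module output are as isotropic as possible.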

Layered-Parameter Perturbation for Zeroth-Order Optimization of Optical Neural Networks

3D Human Pose Estimation (HPE) is a one-to-many problem by nature, making it challenging to estimate an accurate 3D pose from a single 2D pose. Some prior works have attempted to tackle this problem with a conditional generative network. They generate 3D poses from a given 2D pose with noise drawn from a standard Gaussian distribution, even though the depth distribution depends on each posture and is more complex than a standard Gaussian. This mismatch may lead to inaccurate distribution learning. In this paper, we propose a probabilistic framework called ProPose to address this issue. ProPose employs a Pose Instance-Level Gaussian Distribution (PILGD), derived from 3D pose-based self-representation learning, to obtain a reliable distribution that captures the pose-dependent depth distribution. To access this PILGD, we utilize a normalizing flow, which learns a mapping function between the PILGD and a 2D Pose-Adaptive Gaussian Distribution (PAGD). This converts the problem of directly estimating 3D poses from 2D poses into a mapping problem between the PILGD and the PAGD. Extensive experiments show the advantages of utilizing the PILGD and PAGD. ProPose achieves performance comparable to previous state-of-the-art methods in a multi-hypothesis setting. Notably, in the single-hypothesis setting, ProPose demonstrates generalization ability comparable to existing state-of-the-art deterministic methods for the first time.
