United States

Multi-modal object Re-IDentification (ReID) aims to retrieve specific objects by utilizing complementary information from various modalities. Existing multi-modal ReID methods primarily focus on the fusion of heterogeneous features. However, they often overlook the dynamic quality changes in multi-modal imaging. In addition, the shared information between different modalities can weaken modality-specific information. To address these issues, we propose a novel feature learning framework called DeMo for multi-modal object ReID, which adaptively balances decoupled features using a mixture of experts. To be specific, we first deploy a Patch-Integrated Feature Extractor (PIFE) to extract multi-granularity and multi-modal features. Then, we introduce a Hierarchical Decoupling Module (HDM) to decouple multi-modal features into non-overlapping forms, preserving the modality uniqueness and increasing the feature diversity. Finally, we propose an Attention-Triggered Mixture of Experts (ATMoE), which replaces traditional gating with dynamic attention weights derived from decoupled features. Additionally, the multi-head mechanism allows the model to assign more dynamic weights to the decoupled experts. With these modules, our DeMo can generate more robust multi-modal features. Extensive experiments on three multi-modal object ReID benchmarks (i.e., RGBNT201, RGBNT100 and MSVR310) verify the effectiveness of our methods.

AAAI 2025

DeMo: Decoupled Feature-Based Mixture of Experts for Multi-Modal Object Re-Identification

poster

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

### [Invited Speakers](https://aaai.org/conference/aaai/aaai-25/aaai-25-invited-speakers/)

Register [here](https://aaai.org/conference/aaai/aaai-25/registration/)



Missing values in multivariate time series data can harm machine learning performance and introduce bias. These gaps arise from sensor malfunctions, blackouts, and human error. Previous work has addressed missing at random, complete blackouts, and forecasting scenarios. This paper addresses a more general missing pattern, termed $\textbf{partial blackout}$, where a subset of features is missing for consecutive time steps. This scenario is more common in real-world applications. We introduce a two-stage imputation process using self-attention and diffusion processes to model feature and temporal correlations. Our model outperforms state-of-the-art models in partial blackout scenarios and offers better scalability, promising practical data imputation solutions.
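To make the missing pattern concrete, here is a minimal sketch of how a partial blackout mask could be simulated: a random subset of features is hidden over one window of consecutive time steps. The function name `partial_blackout_mask` and its parameters are hypothetical and chosen for illustration; this is not the paper's data pipeline.

```python
import random

def partial_blackout_mask(n_steps, n_features, n_missing_features, blackout_len, seed=0):
    """Return (mask, dark, start); mask[t][f] is 1 if observed, 0 if missing."""
    rng = random.Random(seed)
    mask = [[1] * n_features for _ in range(n_steps)]
    # Choose which features go dark and where the blackout window begins.
    dark = sorted(rng.sample(range(n_features), n_missing_features))
    start = rng.randrange(n_steps - blackout_len + 1)
    # Hide the chosen features for `blackout_len` consecutive time steps.
    for t in range(start, start + blackout_len):
        for f in dark:
            mask[t][f] = 0
    return mask, dark, start

mask, dark, start = partial_blackout_mask(
    n_steps=8, n_features=4, n_missing_features=2, blackout_len=3
)
for row in mask:
    print(row)
```

Setting `n_missing_features` equal to `n_features` reduces this to a complete blackout, which is why the partial case can be seen as the more general pattern.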

Self-attention-based Diffusion Model for Time-series Imputation in Partial Blackout Scenarios

With the prevalence of Multimodal Large Language Models (MLLMs), autonomous driving has encountered new opportunities and challenges.
In particular, multi-modal video understanding is critical for interactively analyzing what will happen during autonomous driving.
However, videos of such dynamic scenes often contain complex spatial-temporal movements,
which restricts the generalization capacity of existing MLLMs in this field.
To bridge the gap, we propose a novel Hierarchical Mamba Adaptation (H-MBA) framework to fit the complicated motion changes in autonomous driving videos.
Specifically, our H-MBA consists of two distinct modules,
including Context Mamba (C-Mamba) and Query Mamba (Q-Mamba).
First, C-Mamba contains various types of structured state space models,
which can effectively capture multi-granularity video context at different temporal resolutions.
Second, Q-Mamba flexibly transforms the current frame into a learnable query
and attentively selects multi-granularity video context into the query.
Consequently, it can adaptively integrate all the video contexts of multi-scale temporal resolutions to enhance video understanding.
Via a plug-and-play paradigm in MLLMs,
our H-MBA shows remarkable performance on multi-modal video tasks in autonomous driving;
e.g., for risk object detection, it outperforms the previous SOTA method with a 5.5% mIoU improvement.

H-MBA: Hierarchical MamBa Adaptation for Multi-Modal Video Understanding in Autonomous Driving

Face anti-spoofing (FAS) plays a pivotal role in ensuring the security and reliability of face recognition systems. With advancements in vision-language pretrained (VLP) models, recent two-class FAS techniques have leveraged the advantages of using VLP guidance, while this potential remains unexplored in one-class FAS methods. One-class FAS focuses on learning intrinsic liveness features solely from live training images to differentiate between live and spoof faces. However, the lack of spoof training data can lead one-class FAS models to inadvertently incorporate domain information irrelevant to the live/spoof distinction (e.g., facial content), causing performance degradation when tested on a new application domain. To address this issue, we propose a novel framework called Spoof-aware one-class face anti-spoofing with Language Image Pretraining (SLIP). Given that live faces should ideally not be obscured by any spoof-attack-related objects (e.g., paper or masks) and are assumed to yield zero spoof cue maps, we first propose an effective language-guided spoof cue map estimation to enhance one-class FAS models by simulating whether the underlying faces are covered by attack-related objects and generating corresponding nonzero spoof cue maps. Next, we introduce a novel prompt-driven liveness feature disentanglement to alleviate live/spoof-irrelevant domain variations by disentangling live/spoof-relevant and domain-dependent information. Finally, we design an effective augmentation strategy by fusing latent features from live images and spoof prompts to generate spoof-like image features and thus diversify latent spoof features to facilitate the learning of one-class FAS. Our extensive experiments and ablation studies support that SLIP consistently outperforms previous one-class FAS methods.

SLIP: Spoof-Aware One-Class Face Anti-Spoofing with Language Image Pretraining

In this study, we consider an optimization problem with uncertainty dependent on decision variables, which has recently attracted attention due to its importance in applications. In this problem, the gradient of the objective function cannot be obtained explicitly because the decision-dependent distribution is unknown. Therefore, several zeroth-order methods have been proposed, which obtain noisy objective values by sampling and use them to update the iterates. Although these existing methods have theoretical convergence guarantees for optimization problems with decision-dependent uncertainty, they require strong assumptions about the function and distribution or exhibit large variances in their gradient estimators. To overcome these issues, we propose two zeroth-order methods under mild assumptions. First, we develop a zeroth-order method with a new one-point gradient estimator including a variance reduction parameter. The proposed method updates the decision variables while adjusting the variance reduction parameter. Second, we develop a zeroth-order method with a two-point gradient estimator. There are situations where only one-point estimators can be used, but if both one-point and two-point estimators are available, it is more practical to use the two-point estimator. As theoretical results, we show the convergence of our methods to stationary points and provide the worst-case iteration and sample complexity analysis. Our simulation experiments with real data on a retail service application show that our methods output solutions with lower objective values than the conventional zeroth-order methods.
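As a rough illustration of the zeroth-order idea, the sketch below uses the textbook two-point estimator with a random direction on the unit sphere, applied to a toy objective whose noise distribution shifts with the decision. The oracle `noisy_objective`, the step size, and the smoothing radius `delta` are assumptions made for this example; the paper's own estimators, including the variance-reduced one-point variant, are not reproduced here.

```python
import math
import random

def unit_direction(dim, rng):
    """Draw a direction uniformly from the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def noisy_objective(x, rng):
    """Toy oracle (hypothetical): the noise xi is drawn from D(x), whose mean depends on x."""
    xi = [rng.gauss(0.5 * a, 0.1) for a in x]
    return sum((a - 1.0) ** 2 + a * b for a, b in zip(x, xi))

def two_point_gradient(x, delta, rng):
    """Estimate the gradient from two noisy objective values only."""
    dim = len(x)
    u = unit_direction(dim, rng)
    f_plus = noisy_objective([a + delta * c for a, c in zip(x, u)], rng)
    f_minus = noisy_objective([a - delta * c for a, c in zip(x, u)], rng)
    scale = dim * (f_plus - f_minus) / (2.0 * delta)
    return [scale * c for c in u]

rng = random.Random(0)
x = [3.0, -2.0]
for _ in range(2000):
    g = two_point_gradient(x, delta=0.1, rng=rng)
    x = [a - 0.01 * b for a, b in zip(x, g)]  # plain gradient step
print(x)
```

For this toy problem the expected objective is the sum over coordinates of (a - 1)^2 + 0.5 a^2, whose minimizer is a = 2/3 in each coordinate, so the iterates should settle near that point up to sampling noise.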

Zeroth-Order Methods for Nonconvex Stochastic Problems with Decision-Dependent Distributions

Whole-body multimodal motion generation, controlled by text, speech, or music, has numerous applications including video generation and character animation. However, employing a unified model to achieve various generation tasks with different condition modalities presents two main challenges: motion distribution drifts across different tasks (e.g., co-speech gestures and text-driven daily actions) and the complex optimization of mixed conditions with varying granularities (e.g., text and audio). Additionally, inconsistent motion formats across different tasks and datasets hinder effective training toward multimodal motion generation. In this paper, we propose $\textbf{MotionCraft}$, a unified diffusion transformer that crafts whole-body motion with plug-and-play multimodal control. Our framework employs a coarse-to-fine training strategy, starting with the first stage of text-to-motion semantic pre-training, followed by the second stage of multimodal low-level control adaptation to handle conditions of varying granularities. To effectively learn and transfer motion knowledge across different distributions, we design $\textbf{MC-Attn}$ for parallel modeling of static and dynamic human topology graphs. To overcome the motion format inconsistency of existing benchmarks, we introduce $\textbf{MC-Bench}$, the first available multimodal whole-body motion generation benchmark based on the unified SMPL-X format. Extensive experiments show that $\textbf{MotionCraft}$ achieves state-of-the-art performance on various standard motion generation tasks.

MotionCraft: Crafting Whole-Body Motion with Plug-and-Play Multimodal Controls

This paper proposes a new principled multi-task representation learning framework (InfoMTL) to extract noise-invariant sufficient representations for all tasks. It promotes shared representations to preserve necessary information for all tasks and task-specific representations to eliminate redundant information for each task, which can enhance language understanding of pre-trained language models (PLMs) under the multi-task paradigm. Firstly, a shared information maximization principle is proposed to learn more sufficient shared representations for all target tasks. It takes the opposite approach to existing IB-based MTL to avoid the insufficiency issue. Secondly, a task-specific information minimization (TIMin) principle is proposed to mitigate the negative effect of potentially redundant features in the shared representations for the target task. It alleviates the redundancy issue during the decoding process by compressing task-irrelevant redundant information while preserving the information relevant to the target for multi-task prediction. Experiments on six classification benchmarks demonstrate that our method outperforms comparative multi-task learning approaches with different PLMs. Extensive experiments show that the shared and output representations are more sufficient, data-efficient, and robust.

An Information-theoretic Multi-task Representation Learning Framework for Natural Language Understanding

The performance of Large Language Models (LLMs) is intrinsically linked to the quality of their training data. Although several studies have proposed methods for high-quality data selection, they do not consider the importance of knowledge richness in text corpora. In this paper, we propose a novel and gradient-free High-Knowledge Scorer (HKS) to select high-quality data from the dimension of knowledge, to alleviate the problem of knowledge scarcity in the pre-trained corpus. We propose a comprehensive multi-domain knowledge element pool and introduce knowledge density and coverage as metrics to assess the knowledge content of the text. Based on this, we propose a comprehensive knowledge scorer to select data with intensive knowledge, which can also be utilized for domain-specific high-knowledge data selection by restricting knowledge elements to the specific domain. We train a 1.1B model on a high-knowledge bilingual dataset, and experimental results demonstrate that our scorer improves the model's performance in knowledge-intensive and general comprehension tasks, and is effective in enhancing both the generic and domain-specific capabilities of the model.

Enhancing LLMs via High-Knowledge Data Selection

Data sharing is necessary for AI to be widely used, but sharing sensitive data with others poses privacy risks.
To solve this problem, it is necessary to synthesize realistic tabular data.
In many cases, tabular data contains a mixture of continuous, mixed, and categorical columns.
Moreover, columns of the same type may have multimodal distributions or be highly imbalanced.
These issues make it challenging to synthesize tabular data.
The synthesized tabular data should reflect the relational meaning between columns of tabular data, so modeling the probability distribution of tabular data is a non-trivial task.
Traditional tabular data synthesizing models are based on GAN or diffusion models and are built using fully connected or convolutional layers.
However, fully connected layers have the disadvantage of low inductive bias, and convolutional layers are not invariant to the column order of tabular data.
Therefore, we assume that converting tabular data into graph-structured data and using a graph neural network would produce better synthetic data than using fully connected layers or convolutional layers.
Our study aims to show that GANs constructed with graph neural networks can outperform existing GAN models using fully connected layers or convolutional layers.
We propose CG-TGAN, a conditional GAN built using graph neural networks. To learn how to synthesize realistic data, the graph neural networks in the discriminator and generator learn graph-level tasks and node-level tasks together.
The discriminator of CG-TGAN learns a graph-level task to distinguish between real and synthetic data and node-level tasks to predict the value of the target node.
CG-TGAN’s generator learns a graph-level task to synthesize an overall graph similar to real data and node-level tasks to learn how to synthesize a fake graph with the proper relation between nodes.
In this paper, we show that CG-TGAN outperforms GAN-based models and is comparable to diffusion-based models.

CG-TGAN: Conditional Generative Adversarial Networks with Graph Neural Networks for Tabular Data Synthesizing

Offline reinforcement learning (RL) methods aim to learn optimal policies with access only to trajectories in a fixed dataset. Policy constraint methods formulate policy learning as an optimization problem that balances maximizing reward with minimizing deviation from the behavior policy. Closed-form solutions to this problem can be derived as weighted behavioral cloning objectives that, in theory, must compute an intractable partition function. Reinforcement learning has gained popularity in language modeling to align models with human preferences; some recent works consider paired completions that are ranked by a preference model, after which the likelihood of the preferred completion is directly increased. We adapt this approach of paired comparison. By reformulating the paired-sample optimization problem, we fit the maximum-mode of the Q function while maximizing behavioral consistency of policy actions. This yields our algorithm, Behavior Preference Regression for offline RL (BPR). We empirically evaluate BPR on the widely used D4RL Locomotion and Antmaze datasets, as well as the more challenging V-D4RL suite, which operates in image-based state spaces. BPR demonstrates state-of-the-art performance over all domains. Our on-policy experiments suggest that BPR takes advantage of the stability of on-policy value functions with minimal performance degradation on Locomotion datasets.
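For readers unfamiliar with the closed form referred to above, one common way it is written in the policy-constraint literature is sketched below. It assumes a KL penalty toward the behavior policy $\mu$ with temperature $\lambda > 0$; these symbols are assumptions for illustration and are not necessarily the notation or the exact objective used by BPR.

$$
\pi^{*}(a \mid s) = \frac{1}{Z(s)}\,\mu(a \mid s)\,\exp\!\left(\frac{Q(s,a)}{\lambda}\right),
\qquad
Z(s) = \int \mu(a \mid s)\,\exp\!\left(\frac{Q(s,a)}{\lambda}\right)\,da
$$

$$
\max_{\theta}\;\mathbb{E}_{(s,a)\sim\mathcal{D}}\left[\frac{1}{Z(s)}\exp\!\left(\frac{Q(s,a)}{\lambda}\right)\log \pi_{\theta}(a \mid s)\right]
$$

The second expression is the weighted behavioral cloning objective obtained by projecting $\pi^{*}$ onto a parametric policy $\pi_{\theta}$. The partition function $Z(s)$ integrates over all actions, which is what makes it intractable in continuous action spaces.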

Behaviour Preference Regression for Offline Reinforcement Learning

Scientific discovery serves as the cornerstone for advances in various fields, from the fundamental laws of physics to the intricate mechanisms of biology. 
However, the two existing mainstream methods, symbolic regression and dimensional analysis, are significantly limited in this task: the former suffers from low computational efficiency due to the vast search space and often produces formulas without physical meaning; the latter provides a useful theoretical framework but also struggles to search a huge space because it lacks an effective analysis of the latent variables.
To address this issue, here we propose a framework for efficiently discovering underlying formulas in data, named FIND. We draw inspiration from Buckingham’s Pi theorem, imposing dimensional constraints on the input and output, thereby ensuring discovered expressions possess physical meaning. Additionally, we propose a theoretical scheme for identifying the latent structure as well as a coarse-to-fine framework, significantly reducing the search space of latent variables. This framework not only improves computational efficiency but also enhances model interpretability. From comprehensive experimental validation, FIND showcases its potential to uncover meaningful scientific insights across various domains, providing a robust tool for advancing our understanding of unknown systems.
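The dimensional constraint behind Buckingham's Pi theorem can be illustrated with a small sketch: dimensionless groups correspond to null-space vectors of the matrix of unit exponents. The pendulum variables and the helper `nullspace` below are assumptions chosen for illustration; this shows the classical theorem only and is not FIND's latent-structure search.

```python
from fractions import Fraction

def nullspace(matrix):
    """Return a basis of the null space of `matrix` using exact row reduction."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    n_cols = len(rows[0])
    pivots = []  # pivot column of each reduced row, in order
    r = 0
    for c in range(n_cols):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue  # no pivot here, so column c is a free variable
        rows[r], rows[pivot] = rows[pivot], rows[r]
        lead = rows[r][c]
        rows[r] = [v / lead for v in rows[r]]
        for i in range(len(rows)):
            if i != r and rows[i][c] != 0:
                factor = rows[i][c]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        pivots.append(c)
        r += 1
        if r == len(rows):
            break
    basis = []
    for free in (c for c in range(n_cols) if c not in pivots):
        vec = [Fraction(0)] * n_cols
        vec[free] = Fraction(1)
        for row_idx, p in enumerate(pivots):
            vec[p] = -rows[row_idx][free]
        basis.append(vec)
    return basis

# Columns: candidate variables. Rows: exponents of the base units.
names = ["period", "length", "gravity", "mass"]
dimension_matrix = [
    [0, 0, 0, 1],   # mass exponent
    [0, 1, 1, 0],   # length exponent
    [1, 0, -2, 0],  # time exponent
]
groups = nullspace(dimension_matrix)
for vec in groups:
    print(" * ".join(f"{n}^{e}" for n, e in zip(names, vec) if e != 0))
```

For the pendulum, the only dimensionless group is period^2 * gravity / length, and mass drops out entirely. Any physically meaningful law relating these variables must be expressible through that group, which is how dimensional constraints shrink the search space.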
