AAAI 2026

January 22, 2026

Singapore


Cross-modal retrieval is a fundamental application of multi-modal learning and has achieved remarkable success with large-scale, well-paired data. In practice, however, collecting such data is costly. To reduce the dependence on paired data, this paper studies a practical learning paradigm: semi-paired cross-modal learning (SPL), which exploits a small amount of paired data together with a large amount of unpaired data to enhance cross-modal learning directly. Taking image-text retrieval as an example, we propose a novel Robust Cross-modal Semi-paired Learning method (RCSL) that addresses two challenges. First, to overcome the under-optimization caused by the scarcity of paired data, we present Semi-paired Discriminative Learning (SDL), which fully learns visual-semantic associations from the few available image-text pairs by preserving the alignment and uniformity of modality representations. Second, to mine visual-semantic correspondences from unpaired data, RCSL constructs pseudo-paired correlations across modalities via nearest-neighbor association. Inaccurate pseudo signals, however, may introduce noisy correspondences (NCs) that degrade the model's performance. To tackle NCs, we devise Robust Cross-correlation Mining (RCM), based on the risk-minimization criterion, to robustly and explicitly learn visual-semantic associations from pseudo-paired data, thereby boosting cross-modal learning. Finally, we conduct extensive experiments on four datasets, i.e., three widely used benchmarks (Flickr30K, MS-COCO, and CC152K) and a newly constructed real-world dataset, Drone-SP, to demonstrate the effectiveness of RCSL under semi-paired and noisy settings.
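The abstract's two core ingredients, preserving alignment and uniformity of modality representations and constructing pseudo-pairs by nearest-neighbor association, can be sketched in a generic form. The loss formulations below follow the standard alignment/uniformity definitions from the contrastive-learning literature, and all function names are illustrative assumptions rather than the paper's actual implementation:

```python
import numpy as np

def alignment_loss(img_emb, txt_emb):
    # Alignment: embeddings of matched image-text pairs should be close.
    # Mean squared distance between corresponding rows.
    return np.mean(np.sum((img_emb - txt_emb) ** 2, axis=1))

def uniformity_loss(emb, t=2.0):
    # Uniformity: embeddings should spread out on the unit hypersphere.
    # Log of the mean Gaussian potential over all distinct pairs.
    sq_dists = np.sum((emb[:, None, :] - emb[None, :, :]) ** 2, axis=-1)
    mask = ~np.eye(emb.shape[0], dtype=bool)  # exclude self-pairs
    return np.log(np.mean(np.exp(-t * sq_dists[mask])))

def pseudo_pair(img_emb, txt_emb):
    # Nearest-neighbor association: assign each unpaired image the index of
    # its most similar text embedding under cosine similarity. The resulting
    # pseudo-pairs may contain noisy correspondences, which is what a
    # robust mining step would then have to down-weight.
    img_n = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt_n = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sim = img_n @ txt_n.T
    return np.argmax(sim, axis=1)
```

For perfectly matched embeddings the alignment loss is zero, while uniformity rewards spreading points apart; the pseudo-pairing step is where label noise enters, motivating the robust mining component described above.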
