Joint classification of multimodal remote sensing images has made significant progress. However, existing methods primarily focus on designing modality-specific networks, which lack the ability to generalize adaptively to the diverse and dynamic modality combinations encountered in real-world scenarios. Inspired by the generalization capabilities of visual foundation models in downstream tasks, we propose a unified Text-guided Arbitrary Modality Prompting (T-APT) framework, which leverages complementary fused features to drive the foundation model and employs text-guided, modality-specific prior knowledge as cross-modal prompts to fine-tune a pretrained Vision Transformer (ViT). Specifically, a Mamba-Based Arbitrary Modal-Focused Feature Capture (MAMF-FC) module is designed to extract complementary joint features and modality-specific prior knowledge from arbitrary modalities through a shared-specific scanning encoder-decoder architecture. Subsequently, a Text-Guided Modality-Aware Prompt Tuning (TMPT) module is proposed to adapt the fused features to the foundation model, enabling remote sensing image classification over arbitrary modality combinations. Extensive experiments on public datasets spanning multispectral (MS), hyperspectral (HS), light detection and ranging (LiDAR), and synthetic aperture radar (SAR) modalities demonstrate that T-APT achieves classification performance comparable to specialized networks across arbitrary modal combinations.
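To make the prompt-tuning idea concrete, below is a minimal PyTorch sketch of the text-guided prompt-injection step: text-derived modality embeddings are projected into prompt tokens and prepended to the fused patch tokens of a frozen ViT-style encoder. This is not the authors' code; the class name `TextGuidedPromptViT`, the generic transformer backbone, and all dimensions (`vit_dim`, `text_dim`, `n_prompts`) are illustrative assumptions, and the MAMF-FC fusion stage is abstracted away as a precomputed token sequence.

```python
# Illustrative sketch of text-guided prompt tuning (assumed design, not the
# authors' implementation): only the text-to-prompt projection is trained,
# while the pretrained encoder stays frozen.
import torch
import torch.nn as nn

class TextGuidedPromptViT(nn.Module):
    def __init__(self, vit_dim=768, text_dim=512, n_prompts=4, depth=12, n_heads=12):
        super().__init__()
        # Stand-in for a pretrained ViT encoder (frozen foundation model).
        layer = nn.TransformerEncoderLayer(d_model=vit_dim, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        for p in self.encoder.parameters():
            p.requires_grad = False  # keep the foundation model frozen
        # Trainable projection: text embedding -> cross-modal prompt tokens.
        self.prompt_proj = nn.Linear(text_dim, n_prompts * vit_dim)
        self.n_prompts, self.vit_dim = n_prompts, vit_dim

    def forward(self, patch_tokens, text_emb):
        # patch_tokens: (B, N, vit_dim) fused features from arbitrary modalities
        # text_emb:     (B, text_dim) modality-specific prior from a text encoder
        prompts = self.prompt_proj(text_emb).view(-1, self.n_prompts, self.vit_dim)
        tokens = torch.cat([prompts, patch_tokens], dim=1)  # prepend prompts
        return self.encoder(tokens)[:, self.n_prompts:]     # drop prompt outputs

# Usage: only prompt_proj receives gradients during fine-tuning.
model = TextGuidedPromptViT()
out = model(torch.randn(2, 196, 768), torch.randn(2, 512))
print(out.shape)  # torch.Size([2, 196, 768])
```

Freezing the backbone and training only the prompt projection is what makes this kind of adaptation lightweight: the same pretrained weights can serve any modality combination, with the text-conditioned prompts carrying the modality-specific prior.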
