Singapore

Optical satellites, with their diverse band layouts and ground sampling distances, supply indispensable evidence for tasks ranging from ecosystem surveillance to emergency response. However, significant discrepancies in band composition and spatial resolution across different optical sensors present major challenges for existing Remote Sensing Foundation Models (RSFMs). These models are typically pretrained on fixed band configurations and resolutions, making them vulnerable to real-world scenarios involving missing bands, cross-sensor fusion, and unseen spatial scales, thereby limiting their generalization and practical deployment.
To address these limitations, we propose Any-Optical-Model ($AOM$), the first universal RSFM explicitly designed to accommodate arbitrary band compositions, sensor types, and resolution scales. To preserve distinctive spectral characteristics even when bands are missing or newly introduced, $AOM$ introduces a spectrum-independent tokenizer that assigns each channel a dedicated band embedding, enabling explicit encoding of spectral identity. To effectively capture texture and contextual patterns from sub-meter to hundred-meter imagery, we design a multi-scale adaptive patch embedding mechanism that dynamically modulates the receptive field. Furthermore, to maintain global semantic consistency across varying resolutions, $AOM$ incorporates a multi-scale semantic alignment mechanism alongside a channel-wise self-supervised masking and reconstruction pretraining strategy that jointly models spectral-spatial relationships.
Extensive experiments on over 10 public datasets, including those from Sentinel-2, Landsat, and HLS, demonstrate that $AOM$ consistently achieves state-of-the-art (SOTA) performance under challenging conditions such as band-missing, cross-sensor, and cross-resolution settings. These results highlight $AOM$ as a crucial step toward building truly general-purpose RSFMs.
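
The spectrum-independent tokenizer described above can be illustrated with a minimal sketch: each available band is split into patches, and a per-band embedding is added to every patch token, so any subset of bands yields a valid token sequence. The band names (`B02`, `B03`, `B08`), the helper names, and the choice of adding the embedding directly to flattened pixel values are illustrative assumptions, not $AOM$'s actual implementation.

```python
import random


def make_band_embeddings(band_names, dim, seed=0):
    """Create one (here random, normally learned) embedding vector per band."""
    rng = random.Random(seed)
    return {name: [rng.gauss(0.0, 0.02) for _ in range(dim)] for name in band_names}


def patchify(channel, patch):
    """Split an H x W nested list into flattened, non-overlapping patch vectors."""
    height, width = len(channel), len(channel[0])
    patches = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            patches.append(
                [channel[top + i][left + j] for i in range(patch) for j in range(patch)]
            )
    return patches


def tokenize(bands, band_embeddings, patch):
    """Turn a dict {band_name: H x W channel} into (band_name, token) pairs.

    Assumption: the embedding dimension equals patch * patch, so the band
    embedding can be added element-wise to the flattened patch.
    """
    tokens = []
    for name, channel in bands.items():
        embedding = band_embeddings[name]
        for vector in patchify(channel, patch):
            tokens.append((name, [v + e for v, e in zip(vector, embedding)]))
    return tokens


# Hypothetical usage: the same tokenizer handles two or three bands unchanged.
embeddings = make_band_embeddings(["B02", "B03", "B08"], dim=4)
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
two_band_tokens = tokenize({"B02": image, "B08": image}, embeddings, patch=2)
three_band_tokens = tokenize({"B02": image, "B03": image, "B08": image}, embeddings, patch=2)
```

Because the band identity travels with each token, a missing band only shortens the sequence rather than changing the input layout.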

AAAI 2026

Any-Optical-Model: A Universal Foundation Model for Optical Remote Sensing

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.

Multimodal emotion recognition plays a crucial role in enhancing the intelligence of human-computer interaction and emotional understanding. However, conventional approaches face challenges such as scarcity of annotated data, significant modality heterogeneity, and temporal misalignment. To address these issues, we propose DHCM-CACL, a novel self-supervised emotion recognition framework integrating EEG and facial expressions. During the pre-training phase, we propose a Dynamic Hierarchical Cross-modal Mamba module (DHCM), which models long-term dependencies through dynamic state matrices, incorporates forgetting gates for noise suppression, and constructs a hierarchical cross-modal interaction structure, effectively achieving cross-modal temporal alignment and mitigating modality heterogeneity. Subsequently, we propose a Confidence-Adaptive Contrastive Learning module (CACL) that dynamically adjusts sample weights using gated confidence signals derived from DHCM to compute loss, prioritizing reliable samples while suppressing noisy instances through adaptive weighting, thereby enhancing representation reliability and low-sample generalization. During the fine-tuning phase, we integrate a cross-modal attention gating mechanism to reinforce temporal associations and adopt an evidence-aware joint optimization objective, providing probabilistic credibility outputs for emotion prediction. Experimental results on the DEAP and MAHNOB-HCI datasets demonstrate that our approach achieves state-of-the-art performance in emotion classification under both subject-dependent and subject-independent settings.

DHCM-CACL: Dynamic Hierarchical Cross-modal Mamba with Confidence-Adaptive Contrastive Learning for Multimodal Emotion Recognition
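
The confidence-adaptive weighting idea in the DHCM-CACL abstract can be sketched as a per-sample contrastive loss averaged with confidence weights. This is only an illustration under stated assumptions: the InfoNCE form, the function names, and the way `confidence` is supplied (here as plain numbers rather than gated signals from DHCM) are hypothetical and not taken from the paper.

```python
import math


def info_nce_per_sample(sim, temperature=1.0):
    """Per-sample InfoNCE loss.

    sim[i][j] is the similarity between EEG sample i and facial sample j;
    the matching pair for sample i sits on the diagonal (j == i).
    """
    losses = []
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])
    return losses


def confidence_weighted_loss(sim, confidence, temperature=1.0):
    """Average the per-sample losses, weighting each by its confidence."""
    losses = info_nce_per_sample(sim, temperature)
    total = sum(confidence)
    if total == 0:
        return 0.0
    return sum(c * l for c, l in zip(confidence, losses)) / total


# Hypothetical usage: sample 0 is well aligned, sample 1 is ambiguous.
similarity = [[10.0, 0.0], [0.0, 0.0]]
trusted_only = confidence_weighted_loss(similarity, [1.0, 0.0])
ambiguous_only = confidence_weighted_loss(similarity, [0.0, 1.0])
```

Setting a sample's confidence to zero removes its contribution entirely, which is the limiting case of suppressing a noisy instance.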

Recently, End-to-End Speech Translation (E2E-ST) methods leveraging large language models (LLMs) have demonstrated strong generalization capabilities and excellent scalability by integrating pre-trained speech encoders with LLMs, where Low-Rank Adaptation (LoRA) is commonly used for parameter-efficient fine-tuning to reduce training costs. However, LoRA's low-rank assumption often fails in multilingual tasks, as the inherent complexity of cross-lingual semantic relationships and syntactic variations exceeds the representational capacity of low-rank matrices. This leads to parameter conflicts across languages, resulting in suboptimal performance. To address this issue, we propose Mixture of Low-Rank Adaptations (MoLoRA), which integrates the Mixture of Experts (MoE) mechanism with LoRA. MoLoRA effectively enhances the model's expressive capacity while maintaining parameter efficiency during training. Specifically, we treat multiple LoRA modules as low-rank experts and introduce a routing mechanism to dynamically activate language-specific experts. Additionally, shared experts are incorporated and consistently activated to model cross-lingual general knowledge. Furthermore, to enhance the robustness and accuracy of speech representations, we propose a Multi-Granularity Representation Fusion module (MGRF). This module mitigates local distortions in frame-level speech representations caused by noise by fusing frame-level and sentence-level features, thereby providing the LLM with more accurate high-level semantic information. We conduct multilingual experiments on the MuST-C and CoVoST-2 datasets. Our method achieves an average BLEU score of 32.2 across eight language pairs on the MuST-C dataset and an average of 36.3 across three language pairs on the CoVoST-2 dataset, establishing a new state-of-the-art (SOTA) performance.

MoLoRA: Boosting LLM-based End-to-end Speech Translation with Mixture of Low-rank Experts
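
To make the MoLoRA abstract's structure concrete, here is a minimal sketch of a linear layer whose frozen base weight is augmented by an always-active shared LoRA expert plus the top-k routed LoRA experts. The softmax router, the top-k selection, the gate renormalization, and all names are assumptions made for illustration; the abstract does not specify these details.

```python
import math


def matvec(matrix, vector):
    """Multiply a nested-list matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def lora_delta(expert, x):
    """Low-rank update B (A x); expert = (A, B) with A: r x d_in, B: d_out x r."""
    down, up = expert
    return matvec(up, matvec(down, x))


def molora_forward(x, base_w, shared, experts, router_w, top_k=1):
    """Frozen base output + shared expert + gated top-k routed experts."""
    y = matvec(base_w, x)
    y = [yi + di for yi, di in zip(y, lora_delta(shared, x))]
    gates = softmax(matvec(router_w, x))
    chosen = sorted(range(len(experts)), key=lambda i: gates[i], reverse=True)[:top_k]
    norm = sum(gates[i] for i in chosen)  # renormalize over the selected experts
    for i in chosen:
        weight = gates[i] / norm
        y = [yi + weight * di for yi, di in zip(y, lora_delta(experts[i], x))]
    return y, chosen


# Hypothetical toy setup: two rank-1 experts, each acting on one input dimension.
base = [[1.0, 0.0], [0.0, 1.0]]
shared_expert = ([[0.0, 0.0]], [[0.0], [0.0]])
routed_experts = [
    ([[1.0, 0.0]], [[1.0], [0.0]]),
    ([[0.0, 1.0]], [[0.0], [1.0]]),
]
router = [[1.0, 0.0], [0.0, 1.0]]
out_a, picked_a = molora_forward([2.0, 0.0], base, shared_expert, routed_experts, router)
out_b, picked_b = molora_forward([0.0, 3.0], base, shared_expert, routed_experts, router)
```

Only the selected experts' low-rank matrices are evaluated per input, so adding experts increases capacity without a proportional increase in per-token computation.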

Recommending event schedules is a key issue in Event-based Social Networks (EBSNs) for maintaining user activity. An effective recommendation must maximize the user's preference subject to both time and geographical constraints. Existing methods face an inherent trade-off among efficiency, effectiveness, and generalization, due to the NP-hard nature of the problem. This paper proposes the Chain-of-Scheduling (CoS) framework, which activates the event scheduling capability of Large Language Models (LLMs) through a guided, efficient scheduling process. CoS enhances LLMs by decomposing the scheduling task into three atomic stages: exploration, verification, and integration. We then enable the LLMs to generate CoS autonomously via Knowledge Distillation (KD). Experimental results show that CoS achieves near-theoretical optimal effectiveness with high efficiency on three real-world datasets in an interpretable manner. Moreover, it demonstrates strong zero-shot learning ability on out-of-domain data.

CoS: Towards Optimal Event Scheduling via Chain-of-Scheduling

Rejoining fragment images of precious artifacts is a meaningful task because complete artifacts could provide valuable clues for the research of human civilization. However, existing rejoining methods face several challenges, including time-consuming manual annotation, insufficient rejoining accuracy, and prohibitive computation cost. For rejoining fragment images of bone sticks (a precious artifact), we propose a lightweight vision graph neural network called RejoinViG to address these challenges. First, our method avoids time-consuming manual annotation of ballast contour data by experts. Specifically, our method directly takes a pair of fragment images as input and then determines whether the image pair is rejoinable. Second, our method improves rejoining accuracy by exploiting contour, script, and texture cues through dynamically constructed local and global graphs. Third, our method improves rejoining accuracy while reducing computation cost by introducing a new attention mechanism named node self-attention. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art methods. For example, the Top-1 accuracy of our method is 3.9 times that of SFF-Siam. Surprisingly, our method successfully rejoins a pair of previously unknown but rejoinable fragment images of bone sticks in a real-world scenario.

Rejoining Precious Artifacts: Efficiently Bone Stick Rejoining Based Massive Fragment Images by Contour, Script, and Texture

Neuro-symbolic learning has emerged as a promising paradigm for interpretable visual reasoning, where mapping natural language questions to executable programs plays a central role. However, most existing methods focus exclusively on the forward program generation from questions while overlooking the reverse process of reconstructing questions from programs. In this paper, we propose BiPaR (Bidirectional Parsing and Reconstruction), a Transformer-based framework that jointly models both program parsing and question reconstruction within a unified architecture. Unlike previous approaches that only perform forward parsing, BiPaR introduces reverse program-to-question reconstruction as a powerful auxiliary signal, which improves program generation quality and accelerates convergence, particularly under limited supervision. We further provide a theoretical analysis showing how reverse reconstruction facilitates faster optimization during training. The bidirectional modeling makes BiPaR well-suited for both supervised and semi-supervised learning scenarios. We present two architectural variants: BiPaR-Full, which employs encoder-decoder Transformers for both modules; and BiPaR-DOnly, a lightweight variant that employs a decoder-only structure for question reconstruction, reducing model complexity. Experiments on CLEVR and a GQA subset demonstrate that BiPaR significantly outperforms standard Transformer baselines. Furthermore, in the semi-supervised learning setting, BiPaR achieves notable improvements by leveraging additional questions without program annotations.

Learning to Parse and Reconstruct: Bidirectional Modeling of Question-to-Program Mapping

Deep learning models excel in visual recognition but suffer severe performance drops when training labels are corrupted by noise. Under label noise, prior work cannot learn accurate similarities and thus misguides the learning process. In this paper, we uncover a complementary and novel phenomenon, Dissimilarity Invariance, whereby semantic dissimilarity between unrelated samples remains stable despite label noise. Leveraging this insight, we propose NegScale, a plug-and-play framework that shifts focus from fragile similarity to robust dissimilarity. NegScale integrates: (1) Structured Negative Orthogonality Penalty (SNOP), enforcing subspace orthogonality for unrelated samples; and (2) Dissimilarity-Calibrated Similarity Adjustment (DCSA), suppressing spurious similarity using dissimilarity anchors. We also provide a theoretical analysis that proves Dissimilarity Invariance and the effectiveness of NegScale. Empirical results demonstrate that NegScale consistently outperforms state-of-the-art baselines, establishing new benchmarks on CIFAR with synthetic noise and real-world datasets.

Leveraging Dissimilarity Invariance as a Robust Anchor for Learning with Noisy Labels
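
One simple way to picture an orthogonality penalty on unrelated samples, as named in the NegScale abstract, is to penalize the squared cosine similarity between feature vectors of pairs judged unrelated. The sketch below is an assumption-laden illustration: the squared-cosine form, the function names, and the explicit `negative_pairs` argument (how unrelated pairs are chosen is not described in the abstract) are all hypothetical and are not the paper's SNOP definition.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def negative_orthogonality_penalty(features, negative_pairs):
    """Mean squared cosine similarity over the given unrelated index pairs.

    The penalty is 0 when every unrelated pair is orthogonal and 1 when
    every unrelated pair is perfectly aligned (or anti-aligned).
    """
    if not negative_pairs:
        return 0.0
    total = sum(cosine(features[i], features[j]) ** 2 for i, j in negative_pairs)
    return total / len(negative_pairs)


# Hypothetical usage with three 2-D feature vectors.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
orthogonal_case = negative_orthogonality_penalty(feats, [(0, 1)])
aligned_case = negative_orthogonality_penalty(feats, [(0, 2)])
```

Because the penalty depends only on pairs treated as unrelated, it does not require trusting which samples share a (possibly noisy) label.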

Imitation learning for robotic manipulation faces a fundamental challenge: the scarcity of large-scale, high-quality robot demonstration data. Recent robotic foundation models often pre-train on cross-embodiment robot datasets to increase data scale, but they face significant limitations because the diverse morphologies and action spaces across different robot embodiments make unified training challenging. In this paper, we present H-RDT (Human to Robotics Diffusion Transformer), a novel approach that leverages human manipulation data to enhance robot manipulation capabilities. Our key insight is that large-scale egocentric human manipulation videos with paired 3D hand pose annotations provide rich behavioral priors that capture natural manipulation strategies and can benefit robotic policy learning. We introduce a two-stage training paradigm: (1) pre-training on large-scale egocentric human manipulation data, and (2) cross-embodiment fine-tuning on robot-specific data with modular action encoders and decoders. Built on a diffusion transformer architecture with 2B parameters, H-RDT uses flow matching to model complex action distributions. The modular design of action encoder and decoder components enables effective knowledge transfer from the unified human embodiment to diverse robot platforms through efficient fine-tuning. Extensive evaluations encompassing both simulation and real-world experiments, single-task and multitask scenarios, as well as few-shot learning and robustness assessments, demonstrate that H-RDT outperforms training from scratch and existing state-of-the-art methods, including $\boldsymbol{\pi}_0$ and RDT, achieving significant improvements of 13.9% and 40.5% over training from scratch in simulation and real-world experiments, respectively. The results validate our core hypothesis that human manipulation data can serve as a powerful foundation for learning bimanual robotic manipulation policies.

H-RDT: Human Manipulation Enhanced Bimanual Robotic Manipulation
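
The H-RDT abstract states that flow matching is used to model action distributions. A common formulation, assumed here for illustration, interpolates linearly between a noise sample and a target action and regresses the constant velocity between them. The linear path, the mean-squared-error objective, and the function names below are assumptions; the abstract does not specify which flow matching variant H-RDT uses.

```python
def flow_matching_pair(noise, action, t):
    """Build one training example for linear-path flow matching.

    Returns the interpolated point x_t = (1 - t) * noise + t * action and the
    target velocity (action - noise) that a network would be trained to predict.
    """
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action)]
    velocity = [a - n for n, a in zip(noise, action)]
    return x_t, velocity


def flow_matching_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    squared = [(p - q) ** 2 for p, q in zip(predicted_velocity, target_velocity)]
    return sum(squared) / len(squared)


# Hypothetical usage: a 2-D action, halfway along the path.
point, target = flow_matching_pair([0.0, 0.0], [2.0, 4.0], t=0.5)
perfect = flow_matching_loss(target, target)
untrained = flow_matching_loss([0.0, 0.0], target)
```

At inference time, a model trained this way would integrate its predicted velocity from a noise sample at t = 0 toward an action at t = 1.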

Anchor-based 3D Gaussian Splatting (GS), exemplified by Scaffold-GS, achieves remarkable storage efficiency through a hybrid explicit-implicit representation. However, these methods' reliance on a single, monolithic network to decode anchor features imposes a severe bottleneck on model capacity, often resulting in blurred details and view-dependent artifacts in complex scenes. To break this bottleneck, we introduce the concept of Scene Experts: a strategy that decomposes the task of modeling a complex scene across a collection of specialized sub-models. To realize this paradigm, we propose MoE-GS. Our approach designs the decoder as a Sparsely-Gated Mixture of Experts (MoE), which dramatically increases the model's total capacity while maintaining comparable inference cost via sparse activation. To effectively train this high-capacity model, we propose two key innovations: (1) a progressive curriculum learning strategy that first trains all experts on a robust baseline before encouraging them to specialize on different scene components; and (2) a novel opacity-aware regularization that penalizes inactive neural Gaussians, ensuring the expanded capacity is efficiently used. Extensive experiments demonstrate that MoE-GS substantially outperforms state-of-the-art methods on diverse benchmarks, significantly improving reconstruction fidelity while requiring a smaller or comparable Gaussian model size. The code is included in the supplementary material.

Scene Experts: Specializing in 3D Gaussian Splatting with Adaptive Decomposition

Recent advances in Chain-of-Thought (CoT) prompting have substantially improved the reasoning capabilities of large language models (LLMs), but have also introduced their computational efficiency as a new attack surface. In this paper, we propose BadThink, the first backdoor attack designed to deliberately induce "overthinking" behavior in CoT-enabled LLMs while ensuring stealth. When activated by carefully crafted trigger prompts, BadThink manipulates the model to generate inflated reasoning traces—producing unnecessarily redundant thought processes while preserving the consistency of final outputs. This subtle attack vector creates a covert form of performance degradation that significantly increases computational costs and inference time while remaining difficult to detect through conventional output evaluation methods. We implement this attack through a sophisticated poisoning-based fine-tuning strategy, employing a novel LLM-based iterative optimization process to embed the behavior by generating highly naturalistic poisoned data. Our experiments on multiple state-of-the-art models and reasoning tasks show that BadThink consistently increases reasoning trace lengths—achieving an over 17× increase on the MATH-500 dataset—while remaining stealthy and robust. This work reveals a critical, previously unexplored vulnerability where reasoning efficiency can be covertly manipulated, demonstrating a new class of sophisticated attacks against CoT-enabled systems.

BadThink: Triggered Overthinking Attacks on Chain-of-Thought Reasoning in Large Language Models

Human-Object Interaction (HOI) detection aims to localize human-object pairs and recognize their interactions in images. Although DETR-based methods have recently emerged as the mainstream framework for HOI detection, they still suffer from a key limitation: Randomly initialized queries lack explicit semantics, leading to suboptimal detection performance. To address this challenge, we propose QueryCraft, a novel plug-and-play HOI detection framework that incorporates semantic priors and guided feature learning through transformer-based query initialization. Central to our approach is **ACTOR** (**A**ction-aware **C**ross-modal **T**ransf**OR**mer), a cross-modal Transformer encoder that jointly attends to visual regions and textual prompts to extract action-relevant features. Rather than merely aligning modalities, ACTOR leverages language-guided attention to infer interaction semantics and produce semantically meaningful query representations. To further enhance object-level query quality, we introduce a **P**erceptual **D**istilled **Q**uery **D**ecoder (**PDQD**), which distills object category awareness from a pre-trained detector to serve as object query initiation. This dual-branch query initialization enables the model to generate more interpretable and effective queries for HOI detection. Extensive experiments on HICO-Det and V-COCO benchmarks demonstrate that our method achieves state-of-the-art performance and strong generalization. Code will be released upon publication.
