Robust Multimodal Learning (RML) aims to address the issue of unreliable predictions in multimodal models. However, previous RML methods often struggle to distinguish between categories that rely on identical intra-modal cues, yielding ambiguous predictions.
We define this degree of uncertainty in extracting discriminative features of a multimodal model as vagueness.
Neglecting such vagueness, as previous RML works commonly do, undermines a multimodal model's ability to extract the unique semantics of each category, and further degrades robustness under disturbances that affect semantic representations. Moreover, this vagueness steers parameter updates toward unreliable fusion, diverting the model's learning process away from the unique features of each category.
Based on the above insight, we propose a novel robust multimodal learning approach, termed Hyper-Opinion Quantifying Vagueness (HOQV).
Specifically, we first introduce the hyper-opinion to capture and quantify the vagueness of a multimodal model in discriminating the representations of different categories.
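The abstract does not detail how the hyper-opinion is constructed. As a rough illustration only: in subjective logic, a hyper-opinion assigns belief mass not just to single classes but also to composite sets of classes, and the mass on composite sets captures exactly the kind of inter-class ambiguity the paper calls vagueness. The function and variable names below are hypothetical, not from the paper.

```python
def vagueness(hyper_belief):
    """Total belief mass assigned to composite (non-singleton) subsets.

    hyper_belief: dict mapping frozenset of class indices -> belief mass.
    In subjective logic, mass on a composite subset means the evidence
    cannot discriminate among the classes in that subset -- the notion
    of 'vagueness' the abstract refers to (illustrative sketch only).
    """
    return sum(b for subset, b in hyper_belief.items() if len(subset) > 1)


# Hypothetical 2-class example: 0.4 of the belief mass is shared
# between classes 0 and 1, i.e. the model cannot tell them apart.
belief = {
    frozenset({0}): 0.3,      # evidence specific to class 0
    frozenset({1}): 0.2,      # evidence specific to class 1
    frozenset({0, 1}): 0.4,   # ambiguous evidence shared by both classes
}
# The remaining 0.1 would be the overall uncertainty mass u = 1 - sum(beliefs).
```

Under this reading, a standard (multinomial) opinion would force the 0.4 of shared evidence onto single classes or into generic uncertainty, whereas the hyper-opinion keeps it separately quantifiable as vagueness.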
Moreover, to mitigate the interference of unreliable, high-vagueness representations in parameter updating, we design a Hyper-Opinion Gradient Modulation to guide the optimization process.
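The abstract does not specify the modulation rule. One plausible reading is that each sample's gradient contribution is down-weighted by its estimated vagueness, so that ambiguous samples drive the fusion parameters less. The sketch below is a minimal stand-in for that idea, not the paper's actual scheme; all names are hypothetical.

```python
def modulate_gradients(per_sample_grads, per_sample_vagueness):
    """Scale each sample's gradient by (1 - vagueness).

    High-vagueness samples (ambiguous between categories) are
    down-weighted so they steer parameter updates less -- a minimal
    sketch of vagueness-aware gradient modulation, assuming a simple
    linear re-weighting (the paper's rule may differ).
    """
    return [g * (1.0 - v) for g, v in zip(per_sample_grads, per_sample_vagueness)]


# Hypothetical scalar gradients for two samples: the first is highly
# vague (v = 0.5) and is halved; the second is confident and untouched.
grads = modulate_gradients([1.0, 2.0], [0.5, 0.0])
```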
We evaluate HOQV on six datasets under different disturbances, including noise and adversarial attacks, and demonstrate that our proposed method consistently achieves state-of-the-art performance.
