Pansharpening is a powerful technique for generating high-resolution multispectral (HRMS) images by fusing paired low-resolution multispectral (LRMS) and texture-rich panchromatic (PAN) data, effectively circumventing the physical constraints of satellite sensors. While recent generative diffusion models have delivered impressive performance gains in this domain, their prohibitive computational demands and training costs hinder practicality on resource-constrained remote sensing satellite systems. In this work, we propose NODiff, a novel diffusion framework that replaces the conventional attention-based denoising backbone with a neural operator, seamlessly integrating operator learning and generative modeling into an efficient yet effective solution for pansharpening. In practice, we implement our approach through a two-stage learning paradigm: first, we pretrain the proposed neural operator-based diffusion model to learn the high-resolution texture priors essential for pansharpening; afterward, we freeze the pretrained parameters and design a lightweight conditional detail guidance adapter to enable efficient fine-tuning for generating the desired HRMS images. Meanwhile, a time-aware low-rank adaptation is introduced to dynamically refine high-frequency details potentially affected by spectral mode truncation. Extensive experiments on multiple benchmark datasets demonstrate that NODiff achieves competitive pansharpening performance while significantly reducing training and inference costs. Beyond pansharpening, our method offers new insights into building resource-efficient generative models.
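To make the "spectral mode truncation" in a neural-operator backbone concrete, the following is a minimal numpy sketch of the core Fourier-layer operation found in typical neural operators: transform to the frequency domain, keep only the lowest `modes` frequencies, mix them with learned weights, and transform back. Every name and shape here is illustrative, not the paper's actual implementation; in particular, discarding the higher modes is exactly what can suppress high-frequency detail and motivate a refinement mechanism.

```python
import numpy as np

def spectral_conv_2d(x, weights, modes):
    """Sketch of a neural-operator spectral layer (assumed form, not NODiff's code).
    Keeps only the lowest `modes` x `modes` frequency block and mixes it with
    complex learned weights; everything above is truncated."""
    h, w = x.shape
    x_ft = np.fft.rfft2(x)                       # complex spectrum, shape (h, w//2 + 1)
    out_ft = np.zeros_like(x_ft)
    # spectral mode truncation: only the low-frequency block survives
    out_ft[:modes, :modes] = x_ft[:modes, :modes] * weights
    return np.fft.irfft2(out_ft, s=(h, w))       # back to the spatial domain

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))                            # toy single-channel input
w = rng.standard_normal((8, 8)) + 1j * rng.standard_normal((8, 8))
y = spectral_conv_2d(x, w, modes=8)
```

Because the layer's parameters live on a fixed, truncated set of Fourier modes, its cost is independent of attention-style pairwise interactions, which is the efficiency argument sketched above.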

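The time-aware low-rank adaptation mentioned in the abstract can be pictured with a small numpy sketch: a frozen weight matrix receives a rank-r correction B @ A whose strength is gated by the diffusion timestep. The linear gate `t / T` and all shapes below are assumptions for illustration only, not the paper's actual schedule or parameterization.

```python
import numpy as np

def time_aware_lora(W, A, B, t, T):
    """Hypothetical timestep-gated low-rank update: the pretrained weight W
    stays frozen, and only the rank-r product B @ A (scaled by a gate that
    depends on the diffusion timestep t) adapts the layer."""
    gate = t / T                   # illustrative time-dependent scale in [0, 1]
    return W + gate * (B @ A)      # frozen backbone plus low-rank correction

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))           # frozen pretrained weight
A = rng.standard_normal((4, 64)) * 0.01     # rank-4 adaptation factors
B = rng.standard_normal((64, 4)) * 0.01
W_early = time_aware_lora(W, A, B, t=900, T=1000)  # strong correction early
W_late = time_aware_lora(W, A, B, t=10, T=1000)    # near-identity late
```

Only A and B (and whatever parameterizes the gate) would be trained during fine-tuning, which is why this style of adaptation keeps training costs low relative to updating the full backbone.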