Singapore

Feed-forward 3D reconstruction from sparse, low-resolution (LR) images is a crucial capability for real-world applications, such as autonomous driving and embodied AI. However, existing methods often fail to recover fine texture details. This limitation stems from the inherent lack of high-frequency information in LR inputs. To address this, we propose SRSplat, a feed-forward framework that reconstructs high-resolution 3D scenes from only a few LR views. Our main insight is to compensate for the deficiency of texture information by jointly leveraging external high-quality reference images and internal texture cues. We first construct a scene-specific reference gallery, generated for each scene using Multimodal Large Language Models (MLLMs) and diffusion models. To integrate this external information, we introduce the Reference-Guided Feature Enhancement (RGFE) module, which aligns and fuses features from the LR input images and their reference twin image. Subsequently, we train a decoder to predict the Gaussian primitives using the multi-view fused feature obtained from RGFE. To further refine predicted Gaussian primitives, we introduce Texture-Aware Density Control (TADC), which adaptively adjusts Gaussian density based on the internal texture richness of the LR inputs. Extensive experiments demonstrate that our SRSplat outperforms existing methods on various datasets, including RealEstate10K, ACID, and DTU, and exhibits strong cross-dataset and cross-resolution generalization capabilities. Our code and video demos can be found in the supplementary materials.
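The abstract does not specify how TADC measures texture richness or maps it to Gaussian density, so the following is only a minimal sketch of the general idea. It assumes richness is the mean absolute gradient of a grayscale image and that the number of Gaussians per pixel grows linearly with it. The names `texture_richness` and `gaussians_per_pixel`, and the parameters `low`, `high`, and `saturation`, are hypothetical and not taken from the paper.

```python
def texture_richness(img):
    """Mean absolute forward-difference gradient of a 2D grayscale image.

    `img` is a list of rows with values in [0, 1]. This measure is an
    assumption for illustration, not the paper's definition.
    """
    h, w = len(img), len(img[0])
    total, count = 0.0, 0
    for y in range(h):
        for x in range(w):
            if x + 1 < w:  # horizontal neighbor
                total += abs(img[y][x + 1] - img[y][x])
                count += 1
            if y + 1 < h:  # vertical neighbor
                total += abs(img[y + 1][x] - img[y][x])
                count += 1
    return total / count if count else 0.0


def gaussians_per_pixel(richness, low=1, high=4, saturation=0.25):
    """Map richness linearly to an integer density in [low, high].

    `saturation` (hypothetical) is the richness at which density peaks.
    """
    t = min(richness / saturation, 1.0)
    return low + round(t * (high - low))


flat = [[0.5] * 4 for _ in range(4)]                          # no texture
checker = [[(x + y) % 2 for x in range(4)] for y in range(4)]  # max texture
print(gaussians_per_pixel(texture_richness(flat)))
print(gaussians_per_pixel(texture_richness(checker)))
```

In this sketch a flat region gets the minimum density and a checkerboard gets the maximum, which is the qualitative behavior the abstract describes.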

AAAI 2026

SRSplat: Feed-Forward Super-Resolution Gaussian Splatting from Sparse Multi-View Images

vision for robotics & autonomous driving

image & video synthesis

3d computer vision

computational photography

technical paper

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.


Image Aesthetics Assessment (IAA) evaluates visual quality through user-centered perceptual analysis and can guide various applications. Recent advances in Multimodal Large Language Models (MLLMs) have sparked interest in adapting them for IAA. However, two critical limitations persist in applying MLLMs to IAA: 1) the tokenization strategy leads to insensitivity to scores, and 2) the classification-based decoding mechanisms introduce score quantization errors. Current MLLM-based IAA methods treat the task as coarse rating classification followed by probability-to-score mapping, which loses fine-grained information. To address these challenges, we propose ROC4MLLM, offering complementary solutions from two perspectives: 1) Representation: We separate scores from the word token space to avoid tokenizing scores as text. An independent position token bridges these spaces, improving the sensitivity of the model to score positions in text. 2) Computation: We apply distinct loss functions for text and score predictions to enhance the sensitivity of the model to score gradients. Decoupling scores from text ensures effective supervision while preventing interference between scores and text in the loss computation. Extensive experiments across five datasets demonstrate that ROC4MLLM achieves state-of-the-art performance without requiring additional training data. Additionally, its plug-and-play design ensures seamless integration with existing MLLMs, boosting their IAA performance. All resources are available here.

Regression over Classification: Assessing Image Aesthetics via Multimodal Large Language Models

Large Language Models (LLMs) are increasingly integral to recommendation systems, offering sophisticated language understanding and generation capabilities. However, their practical application is often hindered by challenges such as data sparsity, the generation of unreliable or hallucinated recommendations, and a general lack of transparency in their decision-making processes. Existing mitigation strategies frequently introduce significant complexity or computational overhead. To address these limitations, particularly the critical gap in quantifying the confidence of LLM-generated recommendations, we propose **GUIDER**: Uncertainty Guided Dynamic Re-ranking for Large Language Models Based Recommender Systems. This new framework innovatively leverages the logits produced by LLMs as evidence for recommended items. By employing a Dirichlet distribution, GUIDER decomposes the total predictive uncertainty into distinct Data Uncertainty (DU), reflecting inherent data ambiguity, and Model Uncertainty (MU), indicating the model's own conviction. This principled decomposition, achieved with a single inference pass, enhances transparency and trustworthiness. Based on the quantified DU and MU levels, our system dynamically adapts its recommendation strategy---adjusting output diversity, explanation depth, or invoking fallback mechanisms---through a four-quadrant analysis that tailors responses to specific uncertainty profiles. Extensive experiments conducted in zero-shot recommendation settings validate the effectiveness of our approach. GUIDER consistently outperforms existing methods in reliability-aware scenarios, demonstrably improving recommendation quality. This framework not only advances the practical deployment of LLM-based recommenders by making them more dependable but also provides a robust foundation for future research into uncertainty-aware generative systems.
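The abstract does not give GUIDER's formulas, so the sketch below only illustrates the general evidential pattern it describes: logits become Dirichlet evidence in a single pass, two uncertainty scores are read off, and a four-quadrant rule picks a strategy. Several choices are assumptions made for illustration: softplus as the evidence function, vacuity (K / S from subjective logic) as a proxy for model uncertainty, normalized entropy of the expected probabilities as a proxy for data uncertainty, the 0.5 threshold, and the strategy labels.

```python
import math


def dirichlet_uncertainty(logits):
    """Return (data_u, model_u) proxies from one set of item logits.

    Requires at least two candidate items. All formulas here are
    illustrative assumptions, not the paper's decomposition.
    """
    # Numerically stable softplus turns each logit into non-negative evidence.
    evidence = [max(z, 0.0) + math.log1p(math.exp(-abs(z))) for z in logits]
    alpha = [e + 1.0 for e in evidence]      # Dirichlet concentration
    strength = sum(alpha)                    # total Dirichlet strength S
    k = len(alpha)
    probs = [a / strength for a in alpha]    # expected item probabilities
    model_u = k / strength                   # vacuity: high when evidence is scarce
    entropy = -sum(p * math.log(p) for p in probs)
    data_u = entropy / math.log(k)           # normalized to [0, 1]
    return data_u, model_u


def quadrant(data_u, model_u, threshold=0.5):
    """Hypothetical four-quadrant policy over the two scores."""
    if data_u < threshold and model_u < threshold:
        return "recommend directly"
    if data_u >= threshold and model_u < threshold:
        return "diversify output"
    if data_u < threshold and model_u >= threshold:
        return "add explanation"
    return "invoke fallback"


du, mu = dirichlet_uncertainty([0.0, 0.0, 0.0, 0.0])
print(quadrant(du, mu))
```

With uniform, weak logits both proxies are high, so this sketch falls back; stronger evidence for one item lowers the vacuity term.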

GUIDER: Uncertainty Guided Dynamic Re-ranking for Large Language Models Based Recommender Systems

Online continual learning requires models to learn from non‑stationary data streams while retaining prior knowledge. We identify an overlooked phenomenon—knowledge fragility—where correctly learned instances are rapidly forgotten after minor parameter updates. Our analysis attributes this fragility to a temporal–spatial dual mechanism: temporal instability, high-frequency parameter oscillations cause forgetting to outpace adaptation; and spatial vulnerability, fragile instances lie in sharp, high‑curvature regions of the loss landscape that are extremely sensitive to optimization noise. These insights motivate PDFK (Perturbing to Defend Fragile Knowledge), a unified framework that defends fragile knowledge along both dimensions. Temporally, we apply exponential moving averaging to smooth parameter evolution and stabilize long‑term memory. Spatially, we inject minimal structured perturbations with a consistency constraint to flatten sharp regions and enhance robustness. PDFK requires no task‑boundary annotations. Extensive experiments demonstrate that PDFK substantially improves knowledge retention and outperforms strong baselines under diverse and challenging continual learning settings.
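The temporal half of this idea, exponential moving averaging of parameters, can be sketched in a few lines. This is a generic EMA update rather than PDFK's implementation; the function name `ema_update`, the flat list representation of parameters, and the `decay` value are assumptions for illustration.

```python
def ema_update(ema_params, params, decay=0.99):
    """One EMA step: keep `decay` of the running average, blend in the rest."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]


# A parameter that oscillates between +1 and -1 on every update.
ema = [0.0]
for step in range(100):
    raw = [1.0 if step % 2 == 0 else -1.0]
    ema = ema_update(ema, raw, decay=0.9)

print(ema)  # stays close to 0 while the raw parameter keeps swinging
```

The averaged copy damps high-frequency oscillation, which is the smoothing effect the abstract attributes to the temporal component.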

Perturbing to Preserve: Defending Fragile Knowledge in Online Continual Learning

While current state-of-the-art Remote Sensing Change Detection (RSCD) methods can achieve impressive results on individual datasets, they become unreliable in unseen environments and imaging conditions, with performance metrics declining by as much as 60% to 80%. Simultaneously, variable environments and complex imaging conditions are the main characteristics of remote sensing data, calling for generalizable RSCD methods. To address this issue, we propose a novel RSCD method capable of domain generalization—CDDGNet. This method is based on causal decoupling theory, which progressively decouples invariant change features from variable domain features to extract generalizable characteristics. This enables a network trained on a single domain to accurately identify change regions in other domains. Specifically, firstly, the Causal Feature Adaptation Module is proposed to preliminarily decouple and simplify feature information during the encoding process by using wavelet transformation and feature energy spectralization methods. Secondly, the Causal Feature Fusion Module is presented to fully decouple features and aggregate significant change features during the decoding process through frequency domain processing and feature re-attention mechanisms. Thirdly, the Decoupling Effect Loss Function is proposed to optimize the process by evaluating the effectiveness of causal decoupling. Extensive experiments have shown that our model significantly outperforms existing methods across multiple groups of generalization tasks with varying levels of difficulty.

Causal Decoupling Domain Generalization for Remote Sensing Change Detection

With the rapid advancement of large language models (LLMs), their deployment in real-world applications has become increasingly widespread. LLMs are expected to deliver robust performance across diverse tasks, user preferences, and practical scenarios. However, as demands grow, ensuring that LLMs produce responses aligned with human intent remains a foundational challenge. In particular, aligning model behavior effectively and efficiently during inference, without costly retraining or extensive supervision, is both a critical requirement and a non-trivial technical endeavor. To address the challenge, we propose SDA (Steering-Driven Distribution Alignment), a training-free and model-agnostic alignment framework designed for open-source LLMs. SDA dynamically redistributes model output probabilities based on user-defined alignment instructions, enhancing alignment between model behavior and human intents without fine-tuning. The method is lightweight, resource-efficient, and compatible with a wide range of open-source LLMs. It can function independently during inference or be integrated with training-based alignment strategies. Moreover, SDA supports personalized preference alignment, enabling flexible control over the model’s response behavior. Empirical results demonstrate that SDA consistently improves alignment performance across 8 open-source LLMs with varying scales and diverse origins, evaluated on three key alignment dimensions, helpfulness, harmlessness, and honesty (3H). Specifically, SDA achieves average gains of 64.4% in helpfulness, 30% in honesty and 11.5% in harmlessness across the tested models, indicating its effectiveness and generalization across diverse models and application scenarios.

SDA: Steering-Driven Distribution Alignment for Open LLMs Without Fine-Tuning

Large Vision-Language Models (LVLMs) excel in multimodal reasoning and have shown impressive performance on various multimodal benchmarks. However, most of these benchmarks evaluate models primarily through multiple-choice or short-answer formats, which do not take the reasoning process into account. Although some benchmarks assess the reasoning process, their methods are often overly simplistic and only examine reasoning when answers are incorrect. This approach overlooks scenarios where flawed reasoning leads to correct answers. In addition, these benchmarks do not consider the impact of intermodal relationships on reasoning. To address this issue, we propose the Reasoning Process Tree Score (RPTS), a tree structure-based metric to assess reasoning processes. Specifically, we organize the reasoning steps into a reasoning tree and leverage its hierarchical information to assign weighted faithfulness scores to each reasoning step. By dynamically adjusting these weights, RPTS not only evaluates the overall correctness of the reasoning, but also pinpoints where the model fails in the reasoning. To validate RPTS in real-world multimodal scenarios, we construct a new benchmark, RPTS-Eval, comprising 374 images and 390 reasoning instances. Each instance includes reliable visual-textual clues that serve as leaf nodes of the reasoning tree. Furthermore, we define three types of intermodal relationships to investigate how intermodal interactions influence the reasoning process. We evaluated representative LVLMs (e.g., GPT4o, Llava-Next), uncovering their limitations in multimodal reasoning and highlighting the differences between open-source and closed-source commercial LVLMs. We believe that this benchmark will contribute to the advancement of research in the field of multimodal reasoning.

RPTS: Tree-Structured Reasoning Process Scoring for Faithful Multimodal Evaluation

Graph-based incomplete multi-view clustering algorithms have gathered much attention due to their impressive clustering performance. However, existing methods primarily leverage intra-view correlation from observed views, while ignoring the exploration of explicit compensation relationships between different views. Moreover, these methods need post-processing to get labels, and the separate steps lack negotiation, which may lead to sub-optimal solutions. To address these issues, we propose a Cross-view Anchor Graph Learning and Factorization (AGLF) method. AGLF develops an Anchor Graph Completion (AGC) framework that explicitly learns the missing subgraph structures. Instead of requiring post-processing, AGC directly produces soft labels. By establishing a third-order tensor of soft labels, it employs the tensor Schatten $p$-norm to enhance anchor graph learning and factorization. To significantly improve the quality of subgraph learning, AGLF incorporates compensation subgraphs from supplementary views into the AGC framework, enabling the construction of a better anchor graph for label learning. An optimization algorithm is devised to solve the objective function. Experimental results across various datasets demonstrate the effectiveness of our method.

Cross-view Anchor Graph Learning and Factorization for Incomplete Multi-view Clustering

Event cameras provide microsecond latency and high dynamic range, making them ideal for 3D perception tasks in traffic scenes with challenging lighting conditions. Yet existing methods often struggle to generalize to out-of-domain environments due to the limited availability of diverse training data. While synthetic data offers an easily accessible alternative, it introduces a significant sim-to-real gap, particularly in motion patterns. We tackle this challenge by introducing Motion-Adaptation Mamba (MA-Mamba), a dual-track framework that advances both architecture and data augmentation. At the architectural level, we introduce a lightweight Spatio-Temporal Association module that captures motion-induced appearance variations at arbitrary scales, and an Adaptive Memory Balancing module, built on the Mamba state-space framework, that adaptively filters memory updates to maintain stable scene context under diverse dynamics. At the data level, we design event-oriented augmentations that simulate varied motion patterns and apply priority-based masked sequence modeling to strengthen long-range spatio-temporal reasoning. Trained solely on synthetic data, MA-Mamba delivers substantial zero-shot gains on multiple real-world benchmarks, demonstrating strong robustness and generalizability.

Towards Robust Event-Based Depth Estimation: Bridging Synthetic and Real Domains with Motion Adaptation

The Minimum Consistent Subset (MCS) problem arises naturally in the context of supervised clustering and instance selection, both of which are critical in enabling scalable and interpretable learning on large datasets. In supervised clustering, one aims to infer a meaningful partitioning of data using a small labeled subset, where classification is typically performed via nearest neighbor rules. However, the sheer volume of training data in modern applications poses a significant computational challenge. The MCS problem formalizes this goal: given a labeled dataset $\mathcal{X}$ in a metric space, the task is to compute a smallest subset $S \subseteq \mathcal{X}$ such that every point in $\mathcal{X}$ shares its label with at least one of its nearest neighbors in $S$.
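To make the definition concrete, here is a brute-force sketch that checks consistency and searches subsets in order of increasing size. It runs in exponential time and is meant only to illustrate the problem statement, not the algorithms of the paper. The function names and the toy data are assumptions for illustration.

```python
from itertools import combinations


def is_consistent(points, labels, subset, dist):
    """True if every point shares its label with one of its nearest neighbors in `subset`."""
    for i in range(len(points)):
        d_min = min(dist(points[i], points[j]) for j in subset)
        nearest = [j for j in subset if dist(points[i], points[j]) == d_min]
        if not any(labels[j] == labels[i] for j in nearest):
            return False
    return True


def minimum_consistent_subset(points, labels, dist):
    """Return indices of a smallest consistent subset (exhaustive search)."""
    n = len(points)
    for size in range(1, n + 1):
        for subset in combinations(range(n), size):
            if is_consistent(points, labels, subset, dist):
                return list(subset)
    return list(range(n))


# Toy example: five points on a line, two labels.
points = [0, 1, 2, 10, 11]
labels = ["a", "a", "a", "b", "b"]
print(minimum_consistent_subset(points, labels, lambda p, q: abs(p - q)))
```

In the toy example one representative per label suffices, because each cluster is far from the other.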

Recently, the MCS problem has been extended to $\textit{graph metrics}$, where distances are defined by shortest paths. Prior work has shown that MCS remains NP-hard even on simple graph classes like trees, though an algorithm with runtime $\mathcal{O}(2^{6c} \cdot n^6)$ is known for trees, where $c$ is the number of colors and $n$ the number of vertices. This raises the challenge of identifying graph classes that admit algorithms efficient in both $n$ and $c$.

In this work, we study the Minimum Consistent Subset problem on graphs, focusing on two well-established measures: the vertex cover number ($vc$) and the neighborhood diversity ($nd$). Specifically, we design efficient algorithms for graphs exhibiting small $vc$ or small $nd$, which frequently arise in real-world domains characterized by local sparsity or repetitive structure.
These parameters are particularly relevant because they capture structural properties that often correlate with the tractability of otherwise hard problems. Graphs with small vertex cover are "almost independent sets", representing sparse interactions, while graphs with small neighborhood diversity exhibit a high degree of symmetry and regularity. Importantly, small neighborhood diversity can occur even in dense graphs, a property frequently observed in domains such as social networks with modular communities or knowledge graphs with repeated relational patterns. Thus, algorithms designed to work efficiently for graphs with small neighborhood diversity are capable of efficiently solving MCS in complex settings where small vertex covers may not exist.
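As a small illustration of the second parameter, the sketch below computes neighborhood diversity using the standard definition: two vertices have the same type when their neighborhoods agree after removing each other, and the neighborhood diversity is the number of types. The adjacency-dictionary representation and the function name are assumptions for illustration.

```python
def neighborhood_diversity(adj):
    """Count vertex types in an undirected simple graph.

    `adj` maps each vertex to the set of its neighbors. Vertices u and v
    share a type when N(u) - {v} == N(v) - {u}. Because this relation is an
    equivalence, comparing against one representative per type is enough.
    """
    representatives = []
    for v in adj:
        for u in representatives:
            if adj[u] - {v} == adj[v] - {u}:
                break  # v belongs to the type represented by u
        else:
            representatives.append(v)  # v starts a new type
    return len(representatives)


# A complete graph on 4 vertices is dense but has a single vertex type.
k4 = {v: {u for u in range(4) if u != v} for v in range(4)}
# A star has two types: the center and the leaves.
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
print(neighborhood_diversity(k4), neighborhood_diversity(star))
```

The complete graph shows how a dense graph can still have small neighborhood diversity, while its vertex cover number grows with the number of vertices.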

We develop an algorithm with running time $vc^{O(vc)}\cdot \text{Poly}(n,c)$, and another algorithm with runtime $nd^{O(nd)}\cdot \text{Poly}(n,c)$. In the language of parameterized complexity, this implies that MCS is fixed-parameter tractable (FPT) parameterized by the vertex cover number and the neighborhood diversity. Notably, our algorithms remain efficient for arbitrarily many colors, as their complexity is polynomially dependent on the number of colors.

Learning with Structure: Computing Consistent Subsets on Structurally-Regular Graphs

Aiming to overcome distribution shift and label sparsity that hinder cross-domain generalization of Graph Neural Networks, Unsupervised Graph Domain Adaptation (UGDA) transfers knowledge from a label-rich source to an unlabeled target graph. Yet in practice, strict privacy protocols often withhold the source graph entirely, reducing UGDA to the far more constrained Source-Free UGDA (SFUGDA) where only a pre-trained source GNN remains. Despite recent progress, existing source-free UGDA methods remain hampered by source-knowledge absence: deprived of source graphs, they lose the reference distribution needed to gauge domain shift and must lean on noisy target cues, incurring biased adaptation and catastrophic forgetting. To overcome this drawback, this paper devises SFGAR, a two-stage SFUGDA framework that first generates pseudo-source graphs to recover the source distribution encoded in a frozen pre-trained GNN, then adversarially aligns these synthetic graphs with the unlabeled target. Theoretical analysis shows that this proxy alignment tightly bounds the target-domain generalization error. Extensive experiments on public benchmarks validate the state-of-the-art performance of SFGAR.
