Singapore

While instruction-based image editing is emerging, extending it to 360° panoramas introduces additional challenges. Existing methods often produce implausible results in both equirectangular projections (ERP) and perspective views. To address these limitations, we propose SE360, a novel framework for multi-condition guided object editing in 360° panoramas. At its core is a coarse-to-fine data generation pipeline that runs autonomously, without manual intervention. This pipeline leverages a Vision-Language Model (VLM) and adaptive projection adjustment for hierarchical analysis, ensuring the holistic segmentation of objects and their physical context. The resulting data pairs are both semantically meaningful and geometrically consistent, even when sourced from unlabeled panoramas. Furthermore, we introduce a cost-effective, two-stage data refinement strategy to improve data realism and mitigate model overfitting to erasing artifacts. Based on the constructed dataset, we train a Transformer-based diffusion model to allow flexible object editing guided by text, a mask, or a reference image in 360° panoramas. Our experiments demonstrate that our method outperforms existing methods in both visual quality and semantic accuracy.
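For readers unfamiliar with the equirectangular projection (ERP) mentioned in the abstract, the sketch below shows the standard mapping between an ERP pixel and a viewing direction on the unit sphere, which is the geometric relation underlying any conversion between ERP and perspective views. It is a minimal illustration, not the SE360 pipeline: the function names, the pixel-centre convention, and the axis convention (y up, z forward) are assumptions chosen for this example.

```python
import math


def erp_pixel_to_direction(u, v, width, height):
    """Map an ERP pixel (u, v) to a unit direction (x, y, z).

    Assumed convention (hypothetical, for illustration only):
    longitude spans [-pi, pi) across the image width, latitude spans
    [+pi/2, -pi/2] from top to bottom, y is up and z is forward.
    """
    lon = ((u + 0.5) / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - (v + 0.5) / height) * math.pi
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return x, y, z


def direction_to_erp_pixel(x, y, z, width, height):
    """Inverse mapping: a direction (any non-zero length) to ERP pixel coordinates."""
    norm = math.sqrt(x * x + y * y + z * z)
    lon = math.atan2(x, z)
    # Clamp to guard against rounding slightly outside [-1, 1].
    lat = math.asin(max(-1.0, min(1.0, y / norm)))
    u = (lon / (2.0 * math.pi) + 0.5) * width - 0.5
    v = (0.5 - lat / math.pi) * height - 0.5
    return u, v


if __name__ == "__main__":
    d = erp_pixel_to_direction(100, 50, 512, 256)
    print(d, direction_to_erp_pixel(*d, 512, 256))
```

Because horizontal stretching grows toward the poles under this mapping, objects near the top and bottom of an ERP image look distorted, which is one reason edits made directly in ERP space can appear implausible when viewed in perspective.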

AAAI 2026

SE360: Semantic Edit in 360° Panoramas via Hierarchical Data Construction

diffusion models for vision

multi-modal vision

image & video synthesis

computational photography

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.


Representation learning serves as a foundational component of medical vision-language models (MVLMs), enabling cross-modal alignment, semantic consistency, and enhanced generalization capabilities for downstream tasks. As generalist models rapidly evolve, there is a pressing need to unify diverse downstream tasks, such as diagnosis, segmentation, report generation, and multiple-choice question answering, within a cohesive framework, which demands more efficient and versatile visual representation learning. However, current MVLMs predominantly follow CLIP-style vision pretraining and fail to leverage heterogeneous data resources with multi-dimensional imaging and diverse annotation forms. Moreover, a systematic analysis of efficient vision encoder design across varied downstream applications, including diagnosis, segmentation, and text generation tasks, is lacking, particularly for volumetric imaging such as Computed Tomography (CT). In addition, current MVLMs exhibit constrained voxel-level capabilities and lack an effective multi-task instruction tuning framework capable of achieving robust performance across various downstream tasks. To address these challenges, we propose CTInstruct, a novel MVLM employing a hybrid ResNet-ViT encoder with multi-granular vision-language pretraining for efficient heterogeneous data modeling, and unified instruction tuning that jointly optimizes discriminative, generative, and voxel-level reasoning for volumetric medical imaging. CTInstruct achieves SOTA performance across 8 CT benchmarks, setting a new standard for data-efficient multimodal learning in medical imaging.

Versatile Vision-Language Model for 3D Computed Tomography

Human Novel View Synthesis (HNVS) aims to synthesize photorealistic human images from novel viewpoints given observations from known views. Despite significant advances achieved by existing methods such as NeRF, diffusion models, and 3DGS, they still face substantial challenges in achieving stable modeling from a single image. In this paper, we introduce Dual-Constraint Human Gaussian Splatting (DcSplat), a novel, simple, and efficient 3D Gaussian-based framework for single-view 3D human reconstruction. To address occlusion-induced texture missing and depth ambiguities, we introduce two key components: a Latent Multi-View Consistency Constraint Mechanism and a Geometric Constraint Module. The former employs a Latent-space Appearance Transformer (LatentFormer) to learn semantically coherent, view-consistent appearance priors via SMPL-guided pseudo-view fusion. The latter refines noisy SMPL-based depth through a U-Net-like structure conditioned on latent appearance features. These two modules are jointly optimized to generate high-quality Gaussian parameters in a unified latent space. Extensive experiments demonstrate that DcSplat outperforms existing SOTA methods in both geometry and texture quality, while achieving fast inference and lower computational cost.

DcSplat: Dual-Constraint Human Gaussian Splatting with Latent Multi-View Consistency

Large Language Models (LLMs) have demonstrated remarkable performance in code generation, offering new possibilities for translating natural language into executable programs. To further enhance LLMs’ code generation capabilities, Retrieval-Augmented Generation (RAG) has emerged as a promising strategy by retrieving code examples aligned with the generation intent to guide the process. However, existing RAG-based methods often suffer from unnecessary augmentation, preference misalignment, and surface-level mimicry, which undermine the effectiveness of retrieved examples in guiding LLMs toward accurate code generation. To address these challenges, we propose SRACG, a Selective Retrieval-Augmented Code Generation framework. SRACG begins with a necessity-aware selection mechanism to identify generation intents that genuinely require retrieval support, thereby avoiding degradation from indiscriminate augmentation. For intents identified as needing enhancement, it first employs a multi-objective retrieval strategy to select examples that are semantically aligned with the intent. These candidates are then further filtered by assessing their consistency with the LLM’s inherent generation preferences, ensuring alignment in both style and structure. Finally, it extracts execution plans from the filtered examples to uncover their underlying logic, guiding the LLM to better comprehend the examples instead of merely mimicking surface-level content. Experimental results on widely used benchmarks show that SRACG significantly improves the success rate of LLM-generated code and outperforms existing approaches. (The code is provided in the supplementary material.)

SRACG: A Code Generation Framework with Selective Retrieval Augmentation

Backdoor attacks pose a severe threat to federated graph learning (FGL), where malicious clients can inject hidden triggers into the global model without being detected. Defending against such attacks is particularly challenging due to the complex graph structures and the stealthy nature of trigger patterns. In this work, we propose MultiKD, a novel backdoor mitigation method based on attention-guided multi-teacher distillation. Unlike existing defenses that focus on detecting suspicious clients or restricting backdoor activation, MultiKD directly purifies the global model on the server side by exploiting intermediate representations. It integrates knowledge from multiple client models and guides the global model to suppress backdoor behaviors by aligning attention maps and preserving inter-layer relational consistency. Our defensive intuition enables MultiKD to retain task-relevant information while mitigating malicious patterns, even when some teacher models are compromised. Extensive experiments on four real-world datasets demonstrate the effectiveness of our approach in significantly reducing attack success rate (≤ 8%) with minimal impact on utility (≤ 5%).

MultiKD: Backdoor Defense in Federated Graph Learning via Attention-Guided Multi-Teacher Distillation

In-context learning (ICL) has emerged as an effective solution for few-shot learning with large language models (LLMs). However, how LLMs leverage demonstrations to specify a task and learn a corresponding computational function through ICL is underexplored. Drawing from the way humans learn from content-label mappings in demonstrations, we categorize the tokens in an ICL prompt into content, stopword, and template tokens. Our goal is to identify the types of tokens whose representations directly influence an LLM's performance, a property we refer to as being performance-critical. By ablating representations from the attention of the test example, we find that the representations of informative content tokens have less influence on performance than those of template and stopword tokens, which contrasts with the human tendency to attend to informative words. We give evidence that the representations of performance-critical tokens aggregate information from the content tokens. Moreover, we demonstrate experimentally that lexical meaning, repetition, and structural cues are the main distinguishing characteristics of these tokens. Our work sheds light on how large language models learn to perform tasks from demonstrations and deepens our understanding of the roles different types of tokens play in large language models.

Identifying and Analyzing Performance-Critical Tokens in Large Language Models

Effectively modeling multimodal spatial omics data is critical for understanding tissue complexity and underlying biological mechanisms. While spatial transcriptomics, proteomics, and epigenomics capture molecular features, they lack pathological morphological context. Integrating these omics with histopathological images is thus critical for comprehensive disease tissue analysis. However, substantial heterogeneity across omics, imaging, and spatial modalities poses significant challenges. Naive fusion of semantically distinct sources often leads to ambiguous representations. Additionally, the resolution mismatch between high-resolution histology images and lower-resolution sequencing spots complicates spatial alignment. Biological perturbations during sample preparation further distort modality-specific signals, hindering accurate integration. To address these challenges, we propose Graph-guided Representation of Omics and Vision with Expert Regulation for Adaptive Spatial Multi-omics Fusion (GROVER), a novel framework for adaptive integration of spatial multi-omics data. GROVER leverages a Graph Convolutional Network encoder based on Kolmogorov–Arnold Networks to capture the nonlinear dependencies between each modality and its associated spatial structure, thereby producing expressive, modality-specific embeddings. To align these representations, we introduce a spot-feature-pair contrastive learning strategy that explicitly optimizes the correspondence across modalities at each spot. Furthermore, we design a dynamic expert routing mechanism that adaptively selects informative modalities for each spot while suppressing noisy or low-quality inputs. Experiments on real-world spatial omics datasets demonstrate that GROVER outperforms state-of-the-art baselines, providing a robust and reliable solution for multimodal integration.

GROVER: Graph-guided Representation of Omics and Vision with Expert Regulation for Adaptive Spatial Multi-omics Fusion

In-context learning (ICL) enhances large language models (LLMs) by incorporating demonstration examples, yet its effectiveness heavily depends on the quality of selected examples. Current methods typically use text embeddings to measure semantic similarity, which often introduces bias in multi-step reasoning tasks. This occurs because text embeddings contain irrelevant semantic information and lack deeper reasoning structures. To address this, we propose **GraphIC**, a graph-based retrieval model that leverages a reasoning-aware representation and a specialized similarity metric for in-context example retrieval. GraphIC first constructs *thought graphs* (directed, node-attributed graphs that explicitly model reasoning steps and their dependencies) for candidate examples and queries. This approach filters out superficial semantics while preserving essential reasoning processes. Next, GraphIC retrieves examples using a novel similarity metric tailored for these graphs, capturing sequential reasoning patterns and asymmetry between examples. Comprehensive evaluations across mathematical reasoning, code generation, and logical reasoning tasks demonstrate that GraphIC outperforms 10 baseline methods. Our results highlight the importance of reasoning-aware retrieval in ICL, offering a robust solution for enhancing LLM performance in multi-step reasoning scenarios.

GraphIC: A Graph-Based In-Context Example Retrieval Model for Multi-Step Reasoning

Generative recommendation as a new paradigm is influencing the current development of recommender systems. It aims to assign identifiers that capture richer semantic and collaborative information to items, and subsequently predict item identifiers via autoregressive generation using Large Language Models (LLMs). Existing approaches primarily tokenize item text into codebooks with preserved semantic IDs through RQ-VAE, or separately tokenize different modality features of items. However, existing tokenization methods face two major challenges: **(1)** Learning decoupled multi-modal features limits the quality of the semantic representation. **(2)** Ignoring collaborative signals from interaction history limits the comprehensiveness of identifiers. To address these limitations, we propose a **mu**lti-modal **s**emantic-enhanced **i**dentifier with **c**ollaborative signals for generative **rec**ommendation, named MusicRec. In MusicRec, we propose a tokenization approach based on shared-specific modal fusion, enabling the generated identifiers to preserve semantic information more comprehensively from all modalities. In addition, we incorporate collaborative signals from user interactions to guide identifier generation, preserving collaborative patterns in the semantic representation space. Extensive experiments on three public datasets demonstrate that MusicRec achieves state-of-the-art performance compared to existing baseline methods.

MusicRec: Multi-modal Semantic-Enhanced Identifier with Collaborative Signals for Generative Recommendation

While much research has recently focused on generating physics-based adversarial samples, a critical yet often overlooked category originates from physical failures within on-board cameras, components essential to the perception systems of autonomous vehicles. Camera failures, whether due to external stresses causing hardware breakdown or internal component faults, can directly jeopardize the safety and reliability of autonomous driving systems. First, we motivate the study with two separate real-world experiments showing that glass failures do cause detection-based neural network models to fail. Second, we develop a simulation-based study that uses the physical process of glass breakage to create perturbed scenarios, representing a realistic class of physics-based adversarial samples. Using a finite element model (FEM)-based approach, we generate surface cracks on the camera image by applying a stress field defined by particles within a triangular mesh. Lastly, we use physically-based rendering (PBR) techniques to provide realistic visualizations of these physically plausible fractures. To assess the safety implications, we apply the simulated broken glass effects as image filters to two autonomous driving datasets, KITTI and BDD100K, as well as the large-scale image detection dataset MS-COCO. We then evaluate detection failure rates for critical object classes using CNN-based object detection models (YOLOv8 and Faster R-CNN) and a transformer-based architecture with Pyramid Vision Transformers. To further investigate the distributional impact of these visual distortions, we compute the Kullback-Leibler (K-L) divergence between three distinct data distributions, applying various broken glass filters to a custom dataset (captured through a cracked windshield), as well as the KITTI and Kaggle cats and dogs datasets. The K-L divergence analysis suggests that these broken glass filters do not introduce significant distributional shifts. Our goal is to provide a robust, physics-based methodology for generating adversarial samples that reflect real-world camera failures, with the overarching aim of improving the resilience and safety of autonomous driving systems against such physical threats.
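As a point of reference for the Kullback-Leibler analysis mentioned above, the sketch below computes the K-L divergence between two discrete distributions given as histogram counts. The abstract does not say how the data distributions were estimated, so treating them as normalized histograms (for example, of pixel intensities) and the small smoothing constant `eps` are assumptions made only for this illustration; the function names are hypothetical.

```python
import math


def normalize(counts, eps=1e-12):
    """Turn non-negative counts into a probability vector.

    eps is an assumed smoothing constant that keeps empty bins from
    producing log(0) or division by zero.
    """
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive value")
    smoothed = [c / total + eps for c in counts]
    norm = sum(smoothed)
    return [s / norm for s in smoothed]


def kl_divergence(p_counts, q_counts):
    """D_KL(P || Q) = sum_i p_i * log(p_i / q_i), in nats."""
    if len(p_counts) != len(q_counts):
        raise ValueError("histograms must have the same number of bins")
    p = normalize(p_counts)
    q = normalize(q_counts)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


if __name__ == "__main__":
    clean = [10, 30, 40, 20]      # hypothetical histogram of unfiltered images
    filtered = [12, 28, 38, 22]   # hypothetical histogram after a filter
    print(kl_divergence(clean, filtered))
```

A value near zero indicates that the two histograms are close, which is the sense in which a filter would be said to introduce no significant distributional shift. Note that the measure is asymmetric, so `kl_divergence(p, q)` and `kl_divergence(q, p)` generally differ.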

Fractured Glass, Failing Cameras: Simulating Physics-Based Adversarial Samples for Autonomous Driving Systems

Model-based reinforcement learning (MBRL) is a crucial approach to enhance the generalization capabilities and improve the sample efficiency of RL algorithms. However, current MBRL methods focus primarily on building world models for single tasks and rarely address generalization across different scenarios. Building on the insight that dynamics within the same simulation engine share inherent properties, we attempt to construct a unified world model capable of generalizing across different scenarios, named Meta-Regularized Contextual World-Model (MrCoM). This method first decomposes the latent state space into various components based on the dynamic characteristics, thereby enhancing the accuracy of world-model prediction. Further, MrCoM adopts meta-state regularization to extract a unified representation of scenario-relevant information, and meta-value regularization to align world-model optimization with policy learning across diverse scenario objectives. We theoretically analyze the generalization error upper bound of MrCoM in multi-scenario settings. We systematically evaluate our algorithm's generalization ability across diverse scenarios, demonstrating significantly better performance than previous state-of-the-art methods.
