Singapore

Vision language models (VLMs) that enable natural language interaction with satellite imagery can democratize Earth observation by accelerating expert workflows, making data accessible to non-specialists, and enabling planet-scale automation. However, existing datasets focus mainly on short-term, high-resolution imagery from a limited number of satellites, overlooking low-resolution, multi-satellite, long-term archives, such as Landsat, that are essential for affordable and bias-robust global monitoring. We address this gap with Landsat30-AU, a large-scale vision-language dataset built from 30-meter resolution imagery collected by four Landsat satellites (5, 7, 8, and 9) over Australia, spanning more than 36 years. The dataset includes two components: Landsat30-AU-Cap, containing 196,262 image-caption pairs, and Landsat30-AU-VQA, comprising 17,725 human-verified visual question answering (VQA) samples across eight remote sensing domains. Both datasets are curated through a bootstrapped pipeline that leverages generic VLMs with iterative refinement and human verification to ensure quality. Our evaluation of eight VLMs on our benchmark reveals that off-the-shelf models struggle to understand satellite imagery. The open-source remote-sensing VLM EarthDial achieves only 0.07 SPIDEr in captioning and a VQA accuracy of 0.48, highlighting the limitations of current approaches. Encouragingly, lightweight fine-tuning of Qwen2.5-VL-7B on Landsat30-AU improves captioning performance from 0.11 to 0.31 SPIDEr and boosts VQA accuracy from 0.74 to 0.87. Code and data are available at https://github.com/papersubmit1/landsat30-au.

AAAI 2026

Landsat30-AU: A Vision-Language Dataset for Australian Landsat Imagery

cv: remote sensing / geospatial ai

cv: large vision models

cv: language and vision

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.<br><br>

To access this event page, you need to log in with the **email address you registered with**. <br>Access credentials will be sent to your email from Underline -  subject line "Welcome to AAAI 2026". Please be sure to check your spam email folder if you do not see an email confirmation right away.

Please log in

To access this event page, you are required to register.
Please complete your registration to continue.

We recommend reading [**the registration information**](https://aaai.org/conference/aaai/aaai-26/registration/) first.

**Online Registration Form**: https://aaai.getregistered.net/conference-2026 

Registration Required

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The Transformer model, renowned for its powerful attention mechanism, has achieved state-of-the-art performance in various artificial intelligence tasks but faces challenges with quantum data. With a growing focus on leveraging quantum machine learning for quantum data, particularly in quantum chemistry, we propose the Molecular Quantum Transformer (MQT) for modeling interactions in molecular quantum systems. By utilizing quantum circuits to implement the attention mechanism on the molecular configurations, MQT can efficiently calculate ground-state energies for all configurations. Numerical demonstrations show that in calculating ground-state energies for $H_2$, $LiH$, $BeH_2$, and $H_4$, MQT outperforms the classical Transformer, highlighting the promise of quantum effects in Transformer structures. Furthermore, its pretraining capability on diverse molecular data facilitates the efficient learning of new molecules, extending its applicability to complex molecular systems with minimal additional effort. Our method offers an alternative to existing quantum algorithms for estimating ground-state energies, opening new avenues in quantum chemistry and materials science.

Quantum Transformer for Molecular Learning: Multi-Configuration Ground-State Energy Prediction

Transfer learning of diffusion models to smaller target domains is challenging, as naively fine-tuning the model often results in poor generalization. Test-time guidance methods help mitigate this by offering controllable improvements in image fidelity through a trade-off with sample diversity. However, this benefit comes at a high computational cost, typically requiring dual forward passes during sampling. We propose the \underline{Do}main-\underline{g}uided \underline{Fi}ne-\underline{t}uning (DogFit) method, an effective guidance mechanism for diffusion transfer learning that maintains controllability without incurring additional computational overhead. DogFit injects a domain-aware guidance offset into the training loss, effectively internalizing the guided behavior during the fine-tuning process. The domain-aware design is motivated by our observation that during fine-tuning, the unconditional source model offers a stronger marginal estimate than the target model. To support efficient controllable fidelity–diversity trade-offs at inference, we encode the guidance strength value as an additional model input through a lightweight conditioning mechanism. We further investigate the optimal placement and timing of the guidance offset during training and propose two simple scheduling strategies, i.e., \textit{late-start} and \textit{cut-off}, which improve generation quality and training stability. Experiments \footnote{Code is provided in suppl. materials and will be made public.} on DiT and SiT backbones across six diverse target domains show that DogFit can outperform prior guidance methods in transfer learning in terms of FID and $\text{FD}_{\text{DINOV2}}$ while requiring up to 2× fewer sampling TFLOPS.

DogFit: Domain-guided Fine-tuning for Efficient Transfer Learning of Diffusion Models

Generative models have become a powerful tool for synthesizing training data in computer vision tasks. Current approaches solely focus on aligning generated images with the target real dataset distributions. As a result, they only captured the common features in the real dataset and merely generated 'easy samples', which are already well-learned from real data. In contrast, those rare 'hard samples', with atypical features but crucial for enhancing performance, cannot be effectively generated. Consequently, these approaches must synthesize large volumes of data to yield appreciable performance gains, yet the upper bound remains limited. To overcome this limitation, we present a novel methodology that can learn to control the learning difficulty of samples during generation, in addition to domain alignment. Thus, it can efficiently generate valuable `hard samples' that yield significant performance improvements for target tasks. This is achieved by incorporating learning difficulty as a new condition in generative models with a designed encoder structure, training and generation strategy. Experimental results across multiple datasets show that our method can achieve higher performance with less generation cost. Specifically, we can get the best performance with only 10\% addtional synthetic data, saving 63.4 GPU hours of generation than previous SOTA on ImageNet. Moreover, our method also offers insightful visualizations of category-specific hard factors, serving as a tool for analyzing the datasets.

Difficulty Controlled Diffusion Model for Synthesizing Effective Training Data

While instruction-based image editing is emerging, extending it to 360° panorama introduces additional challenges. Existing methods often produce implausible results in both equirectangular projections (ERP) and perspective views. To address these limitations, we propose SE360, a novel framework for multi-condition guided object editing in 360° panoramas. At its core is a novel coarse-to-fine autonomous data generation pipeline without manual intervention. This pipeline leverages a Vision-Language Model (VLM) and adaptive projection adjustment for hierarchical analysis, ensuring the holistic segmentation of objects and their physical context. The resulting data pairs are both semantically meaningful and geometrically consistent, even when sourced from unlabeled panoramas. Furthermore, we introduce a cost-effective, two-stage data refinement strategy to improve data realism and mitigate model overfitting to erasing artifacts. Based on the constructed dataset, we train a Transformer-based diffusion model to allow flexible object editing guided by text, mask, or reference image in 360° panoramas. Our experiments demonstrate that our method outperforms existing methods in both visual quality and semantic accuracy.

SE360: Semantic Edit in 360° Panoramas via Hierarchical Data Construction

Representation learning serves as a foundational component of medical vision-language models (MVLMs), enabling cross-modal alignment, semantic consistency, and enhanced generalization capabilities for downstream tasks. As generalist models rapidly evolve, there is a pressing need to unify diverse downstream tasks, such as diagnosis, segmentation, report generation, and multiple choice within a cohesive framework, demanding more efficient and versatile visual representation learning. However, current MVLMs predominately follow CLIP-style vision pretraining, failing to leverage heterogeneous data resources with multi-dimensional imaging and diverse annotation forms. And there lacks systematic analysis of efficient vision encoder design across varied downstream applications, including diagnosis, segmentation, and text generation tasks, particularly for volumetric imaging like Computed Tomography (CT). Besides, current MVLMs exhibit constrained voxel-level capabilities, lacking effective multi-task instruction tuning framework capable of achieving robust performance across various downstream tasks. To address these challenges, we propose CTInstruct, a novel MVLM employing a hybrid ResNet-ViT encoder with multi-granular vision-language pretraining for efficient heterogeneous data modeling, and unified instruction tuning that jointly optimizes discriminative, generative, and voxel-level reasoning for volumetric medical imaging. CTInstruct achieves SOTA performance across 8 CT benchmarks, setting a new standard for data-efficient multimodal learning in medical imaging.

Versatile Vision-Language Model for 3D Computed Tomography

Human Novel View Synthesis (HNVS) aims to synthesize photorealistic human images from novel viewpoints given observations from known views. Despite significant advances achieved by existing methods such as NeRF, diffusion models, and 3DGS, they still face substantial challenges in achieving stable modeling from a single image. In this paper, we introduce \textit{Dual-Constraint Human Gaussian Splatting (\textbf{DcSplat})}, a novel, simple, and efficient 3D Gaussian-based framework for single-view 3D human reconstruction. To address occlusion-induced texture missing and depth ambiguities, we introduce two key components: a Latent Multi-View Consistency Constraint Mechanism and a Geometric Constraint Module. The former employs a Latent-space Appearance Transformer (LatentFormer) to learn semantically coherent, view-consistent appearance priors via SMPL-guided pseudo-view fusion. The latter refines noisy SMPL-based depth through a U-Net-like structure conditioned on latent appearance features. These two modules are jointly optimized to generate high-quality Gaussian parameters in a unified latent space. Extensive experiments demonstrate that DcSplat outperforms existing SOTA methods in both geometry and texture quality, while achieving fast inference and lower computational cost.

DcSplat: Dual-Constraint Human Gaussian Splatting with Latent Multi-View Consistency

Large Language Models (LLMs) have demonstrated remarkable performance in code generation, offering new possibilities for translating natural language into executable programs. To further enhance LLMs’ code generation capabilities, Retrieval-Augmented Generation (RAG) has emerged as a promising strategy by retrieving code examples aligned with the generation intent to guide the process. However, existing RAG-based methods often suffer from unnecessary augmentation, preference misalignment, and surface-level mimicry, which undermine the effectiveness of retrieved examples in guiding LLMs toward accurate code generation. To address these challenges, we propose SRACG, a Selective Retrieval-Augmented Code Generation framework. SRACG begins with a necessity-aware selection mechanism to identify generation intents that genuinely require retrieval support, thereby avoiding degradation from indiscriminate augmentation. For intents identified as needing enhancement, it first employs a multi-objective retrieval strategy to select examples that are semantically aligned with the intent. These candidates are then further filtered by assessing their consistency with the LLM’s inherent generation preferences, ensuring alignment in both style and structure. Finally, it extracts execution plans from the filtered examples to uncover their underlying logic, guiding the LLM to better comprehend the examples instead of merely mimicking surface-level content. Experimental results on widely used benchmarks show that SRACG significantly improves the success rate of LLM-generated code and outperforms existing approaches. \footnote{The code is provided in the supplementary material.}

SRACG: A Code Generation Framework with Selective Retrieval Augmentation

Backdoor attacks pose a severe threat to federated graph learning (FGL), where malicious clients can inject hidden triggers into the global model without being detected. Defending against such attacks is particularly challenging due to the complex graph structures and the stealthy nature of trigger patterns. In this work, we propose MultiKD, a novel backdoor mitigation method based on attention-guided multi-teacher distillation. Unlike existing defenses that focus on detecting suspicious clients or restricting backdoor activation, MultiKD directly purifies the global model on the server side by exploiting intermediate representations. It integrates knowledge from multiple client models and guides the global model to suppress backdoor behaviors by aligning attention maps and preserving inter-layer relational consistency. Our defensive intuition enables MultiKD to retain task-relevant information while mitigating malicious patterns, even when some teacher models are compromised. Extensive experiments on four real-world datasets demonstrate the effectiveness of our approach in significantly reducing attack success rate ($\leq$ 8\%) with minimal impact on utility ($\leq$ 5\%).

MultiKD: Backdoor Defense in Federated Graph Learning via Attention-Guided Multi-Teacher Distillation

In-context learning (ICL) has emerged as an effective solution for few-shot learning with large language models (LLMs). However, how LLMs leverage demonstrations to specify a task and learn a corresponding computational function through ICL is underexplored. Drawing from the way humans learn from content-label mappings in demonstrations, we categorize the tokens in an ICL prompt into content, stopword, and template tokens. Our goal is to identify the types of tokens whose representations directly influence LLM's performance, a property we refer to as being performance-critical. By ablating representations from the attention of the test example, we find that the representations of informative content tokens have less influence on performance compared to template and stopword tokens, which contrasts with the human attention to informative words. We give evidence that the representations of performance-critical tokens aggregate information from the content tokens. Moreover, we demonstrate experimentally that lexical meaning, repetition, and structural cues are the main distinguishing characteristics of these tokens. Our work sheds light on how large language models learn to perform tasks from demonstrations and deepens our understanding of the roles different types of tokens play in large language models.

Identifying and Analyzing Performance-Critical Tokens in Large Language Models

Effectively modeling multimodal spatial omics data is critical for understanding tissue complexity and underlying biological mechanisms. While spatial transcriptomics, proteomics, and epigenomics capture molecular features, they lack pathological morphological context. Integrating these omics with histopathological images is thus critical for comprehensive disease tissue analysis. However, substantial heterogeneity across omics, imaging, and spatial modalities poses significant challenges. Naive fusion of semantically distinct sources often leads to ambiguous representations. Additionally, the resolution mismatch between high-resolution histology images and lower-resolution sequencing spots complicates spatial alignment. Biological perturbations during sample preparation further distort modality-specific signals, hindering accurate integration. To address these challenges, we propose Graph-guided Representation of Omics and Vision with Expert Regulation for Adaptive Spatial Multi-omics Fusion (GROVER), a novel framework for adaptive integration of spatial multi-omics data. GROVER leverages a Graph Convolutional Network encoder based on Kolmogorov–Arnold Networks to capture the nonlinear dependencies between each modality and its associated spatial structure, thereby producing expressive, modality-specific embeddings. To align these representations, we introduce a spot-feature-pair contrastive learning strategy that explicitly optimizes the correspondence across modalities at each spot. Furthermore, we design a dynamic expert routing mechanism that adaptively selects informative modalities for each spot while suppressing noisy or low-quality inputs. Experiments on real-world spatial omics datasets demonstrate that GROVER outperforms state-of-the-art baselines, providing a robust and reliable solution for multimodal integration.

Downloads

Next from AAAI 2026

Quantum Transformer for Molecular Learning: Multi-Configuration Ground-State Energy Prediction

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES

.css-70qvj9{display:-webkit-box;display:-webkit-flex;display:-ms-flexbox;display:flex;-webkit-align-items:center;-webkit-box-align:center;-ms-flex-align:center;align-items:center;}Downloads

Next from AAAI 2026

Quantum Transformer for Molecular Learning: Multi-Configuration Ground-State Energy Prediction

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES

Downloads