Singapore

Despite the remarkable progress of recent work on 3D generation, scaling these methods to geographic extents, such as modeling thousands of square kilometers of Earth’s surface, remains an open challenge.
We address this through a dual innovation in data infrastructure and model architecture.
First, we introduce Aerial-Earth3D, the largest 3D aerial dataset to date, consisting of 50k curated scenes (each measuring 600m$\times$600m) captured across the U.S. mainland, comprising 45M multi-view Google Earth frames.
Each scene provides multi-view images with camera poses, depth maps, normals, and semantic segmentation, with explicit quality control to ensure terrain diversity.
Building on this foundation, we propose EarthCrafter, a tailored framework for large-scale 3D Earth generation via sparse-decoupled latent diffusion. Our architecture separates structural and textural generation:
1) Dual sparse 3D-VAEs compress high-resolution geometric voxels and textural 2D Gaussian Splats (2DGS) into compact latent spaces, substantially reducing the computational cost incurred by vast geographic scales while preserving critical information.
2) We propose condition-aware flow matching models trained on mixed inputs (semantics, images, or neither) to flexibly model latent geometry and texture features independently.
Extensive experiments demonstrate that EarthCrafter performs substantially better in extremely large-scale generation.
The framework further supports versatile applications, from semantic-guided urban layout generation to unconditional terrain synthesis, while maintaining geographic plausibility through our rich data priors from Aerial-Earth3D.

AAAI 2026

EarthCrafter: Scalable 3D Earth Generation via Dual-Sparse Latent Diffusion

diffusion models for vision

vision for robotics & autonomous driving

3d computer vision

technical paper

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.


We recommend reading [**the registration information**](https://aaai.org/conference/aaai/aaai-26/registration/) first.

**Online Registration Form**: https://aaai.getregistered.net/conference-2026 


Recently, Chain-of-Thought (CoT) reasoning has significantly enhanced the capabilities of large language models (LLMs), but Vision–Language Models (VLMs) still struggle with multi-step reasoning tasks because multimodal reasoning data is limited. To bridge this gap, researchers have explored methods to transfer CoT reasoning from LLMs to VLMs. However, existing approaches either incur high training costs or require architectural alignment. In this paper, we use Linear Artificial Tomography (LAT) to empirically show that LLMs and VLMs share similar low-frequency latent representations of CoT reasoning despite architectural differences. Based on this insight, we propose **L2V-CoT**, a novel training-free latent intervention approach that transfers CoT reasoning from LLMs to VLMs. **L2V-CoT** extracts and resamples low-frequency CoT representations from LLMs in the frequency domain, enabling dimension matching and latent injection into VLMs during inference to enhance reasoning capabilities. Extensive experiments demonstrate that our approach consistently outperforms training-free baselines and even surpasses supervised methods.

L2V-CoT: Cross-Modal Transfer of Chain-of-Thought Reasoning via Latent Intervention
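The abstract above mentions resampling low-frequency representations in the frequency domain so that vectors of different widths can be matched. As a rough illustration only, the sketch below shows generic Fourier resampling of a 1D vector: keep the lowest `keep` frequencies and invert at a new length. It is an assumption about what such a step could look like, not the authors' implementation; the function names `dft` and `lowpass_resample` and the parameter `keep` are hypothetical.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (O(n^2)), adequate for a small demo."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
        for f in range(n)
    ]


def lowpass_resample(x, target_len, keep):
    """Keep frequencies -keep..keep of x and re-synthesise at target_len samples.

    Hypothetical helper: a generic Fourier resampling step, assumed here
    as a stand-in for 'dimension matching' between two hidden sizes.
    """
    n = len(x)
    if not 0 <= 2 * keep < min(n, target_len):
        raise ValueError("keep must stay below the Nyquist limit of both lengths")
    spectrum = dft(x)
    scale = target_len / n  # preserves amplitude when the length changes
    resized = [0j] * target_len
    for f in range(-keep, keep + 1):
        # Negative f wraps to the end of the list, matching DFT bin layout.
        resized[f % target_len] = spectrum[f % n] * scale
    # Inverse DFT; the kept bins are conjugate-symmetric, so the result is real.
    return [
        sum(
            resized[f] * cmath.exp(2j * math.pi * f * t / target_len)
            for f in range(target_len)
        ).real / target_len
        for t in range(target_len)
    ]
```

For example, a single cosine cycle sampled at 8 points is mapped to the same cycle sampled at 16 points, while any higher-frequency content would be discarded.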

Sparse Mixture-of-Experts (SMoE) architectures have enabled a new frontier in scaling Large Language Models (LLMs), offering superior performance by activating only a fraction of their total parameters during inference. However, their practical deployment is severely hampered by substantial static memory overhead, as all experts must be loaded into memory. Existing post-training pruning methods, while reducing model size, often derive their pruning criteria from a single, general-purpose corpus. This leads to a critical limitation: catastrophic performance degradation when the pruned model is applied to other domains, necessitating costly re-pruning for each new domain. To address this generalization gap, we introduce Mosaic Pruning (MoP). The core idea of MoP is to construct a functionally comprehensive set of experts through a structured "cluster-then-select" process. This process uses a similarity metric that captures expert performance across different task domains to cluster the experts by function, and then selects the most representative expert from each cluster based on our proposed Activation Variability Score. Unlike methods that optimize for a single corpus, Mosaic Pruning ensures that the pruned model retains a functionally complementary set of experts, much like the tiles of a mosaic that together form a complete picture of the original model's capabilities, enabling it to handle diverse downstream tasks. Extensive experiments on various MoE models demonstrate the superiority of our approach. MoP significantly outperforms prior work, achieving a 7.24% gain on general tasks and 8.92% on specialized tasks such as math reasoning and code generation. Code will be made available in the supplementary materials.

Mosaic Pruning: A Hierarchical Framework for Generalizable Pruning of Mixture-of-Experts Models

The evaluation of large language models (LLMs) via benchmarks is widespread, yet inconsistencies between different leaderboards and poor separability among top models raise concerns about whether benchmarks accurately reflect authentic model capabilities. This paper provides a critical analysis of benchmark effectiveness, examining mainstream LLM benchmarks using results from diverse models. We first propose Pseudo-Siamese Network for Item Response Theory (PSN-IRT), an enhanced Item Response Theory framework that incorporates a rich set of item parameters within an IRT-grounded architecture. PSN-IRT provides accurate and reliable estimates of item characteristics and model abilities. Based on PSN-IRT, we conduct extensive analysis on 11 LLM benchmarks comprising 41,871 items, revealing significant and varied shortcomings in their measurement quality. Furthermore, we demonstrate that PSN-IRT can be used to construct smaller benchmarks that maintain stronger alignment with human preference.

Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory
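For readers unfamiliar with Item Response Theory, the sketch below shows the classical four-parameter logistic item model that IRT frameworks build on. It illustrates the general idea of item parameters and model ability only; it is not the PSN-IRT architecture, and the function names and example parameter values are assumptions made for illustration.

```python
import math


def irt_4pl(theta, a, b, c=0.0, d=1.0):
    """Probability of a correct response under the classical 4PL model.

    theta: ability of the test taker (here, a model)
    a: discrimination, b: difficulty,
    c: lower asymptote (guessing), d: upper asymptote.
    """
    return c + (d - c) / (1.0 + math.exp(-a * (theta - b)))


def item_information_2pl(theta, a, b):
    """Fisher information of an item in the 2PL special case (c=0, d=1)."""
    p = irt_4pl(theta, a, b)
    return a * a * p * (1.0 - p)


# A high-discrimination item separates two strong models far better
# than a low-discrimination item of the same difficulty.
gap_sharp = irt_4pl(1.2, a=3.0, b=1.0) - irt_4pl(0.8, a=3.0, b=1.0)
gap_flat = irt_4pl(1.2, a=0.3, b=1.0) - irt_4pl(0.8, a=0.3, b=1.0)
```

Under this classical model, an item is most informative when its difficulty sits near the abilities being compared, which is one way to think about the poor separability among top models that the abstract describes.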

Although the deep neural models used for perception in autonomous driving have proven vulnerable to adversarial examples, known attacks often rely on 2D patches and mostly target monocular perception. The effectiveness of Physical Adversarial Examples (PAEs) against stereo-based binocular depth estimation therefore remains largely unexplored. To this end, we propose the first texture-enabled physical adversarial attack against stereo matching models in the context of autonomous driving. Our method employs a 3D PAE with a global camouflage texture rather than a local 2D patch, ensuring both visual consistency and attack effectiveness across the different viewpoints of stereo cameras. To cope with the disparity effect of these cameras, we also propose a new 3D stereo matching rendering module that allows the PAE to be aligned with real-world positions and headings in binocular vision. We further propose a novel merging attack that seamlessly blends the target into the environment through fine-grained PAE optimization. This attack is significantly stealthier and more damaging than existing hiding attacks, which fail to merge seamlessly into the background. Extensive evaluations show that our PAEs can successfully fool the stereo models into producing erroneous depth information.

Cheating Stereo Matching in Full-Scale: Physical Adversarial Attack Against Binocular Depth Estimation in Autonomous Driving

In this paper, we investigate the application of heuristics based on Graph Neural Networks (GNNs) to lifted numeric planning problems, an area that has been relatively unexplored. Building upon the GNN approach for learning general policies proposed by Staahlberg et al., we extend the architecture to make it sensitive to the numeric components inherent in the planning problems we address. We achieve this by observing that, although the state space of a numeric planning problem is infinite, the finite subgoal structure of the problem can be incorporated into the architecture, enabling the construction of a finite structure. Instead of learning general policies, we train our models to serve as heuristics within a best-first search algorithm. We explore various configurations of this architecture and demonstrate that the resulting heuristics are highly informative and, in certain domains, offer a better trade-off between guidance and computational cost compared to state-of-the-art heuristics.

Learning Heuristic Functions with Graph Neural Networks for Numeric Planning

Theory of Mind (ToM) refers to the ability to infer others' mental states, which is an essential capability for embodied AI agents to effectively collaborate and interact with humans. While improving Large Language Models' ability to reason about characters' mental states in text-based stories and dialogues has been extensively studied, enhancing Multimodal Large Language Models' ToM capabilities, particularly in egocentric video from an embodied perspective, remains unexplored. In this paper, we propose a contrastive Reinforcement Learning (RL) paradigm that explicitly encourages models to leverage temporal and causal evolutionary patterns in user action sequences to infer the user's mental states (goals, beliefs, and potential next actions). Evaluation results in in-domain and out-of-domain settings demonstrate that our method achieves performance improvements of (+30.00%, +2.00%) and (+5.83%, +5.00%) compared to the backbone model and the vanilla Group Relative Policy Optimization (GRPO) model, respectively. Additionally, we compare the performance of two post-training paradigms (Supervised Fine-Tuning and RL) and systematically analyze the reasoning trajectories across the base model, the vanilla GRPO model, and our proposed method.

Reality vs Counterfactual: Multi-World Contrastive Reinforcement Learning for Enhancing MLLM’s Theory of Mind in Egocentric Videos

Knockout tournaments are a widely used competition format in sports, elections, and decision-making processes. In such tournaments, players compete in successive rounds, with losers eliminated and winners advancing until a single champion remains. Given a tournament digraph $D$, which encodes the outcomes of all possible matches, and a designated player $v^* \in V(D)$, the Tournament Fixing problem (TFP) asks whether the tournament can be scheduled in a way that guarantees $v^*$ emerges as the winner. TFP is known to be NP-hard in general (AAAI'14), but is _fixed-parameter tractable_ (FPT) when parameterized by structural measures such as the feedback arc set (fas) or feedback vertex set (fvs) number of the tournament digraph (AAAI'17; IJCAI'18; AAAI'23). In this paper, we introduce and study two new structural parameters: the number of players who can defeat $v^*$ (i.e., the in-degree of $v^*$, denoted $d^+$) and the number of players that $v^*$ can defeat (i.e., the out-degree of $v^*$, denoted $d^-$). These parameters are motivated by the observation that when either the in-degree or out-degree is zero, the problem becomes trivial. This leads to a natural question: can TFP be efficiently solved when $d^+$ or $d^-$ is small? We answer this question affirmatively by showing that TFP is FPT when parameterized by either the in-degree or out-degree of $v^*$. Our algorithm for the in-degree parameterization is particularly involved and technically intricate. Notably, the in-degree $d^+$ can remain small even when other structural parameters such as fas or fvs are large. Hence, our results offer a new perspective and significantly broaden the parameterized algorithmic understanding of the Tournament Fixing problem.

How Hard Is It to Rig a Tournament When Few Players Can Beat or Be Beaten by the Favorite?
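To make the Tournament Fixing problem statement concrete, here is a minimal brute-force sketch that decides whether some bracket lets a given player win. It enumerates every way to split a group into two halves, so it is exponential and usable only on tiny instances; it is not the FPT algorithm from the paper. The representation `beats` (a dict from each player to the set of players it defeats) and the function name `can_win` are assumptions made for this example.

```python
from functools import lru_cache
from itertools import combinations


def can_win(beats, players, favorite):
    """Return True if some knockout bracket over `players` lets `favorite` win.

    beats[x] is the set of players that x defeats; for every pair exactly
    one direction is assumed to be present (a tournament digraph).
    """
    if len(players) & (len(players) - 1):
        raise ValueError("number of players must be a power of two")

    @lru_cache(maxsize=None)
    def winners(group):
        # Set of players who can win a balanced sub-bracket on `group`.
        if len(group) == 1:
            return group
        members = sorted(group)
        anchor, rest = members[0], members[1:]
        half = len(members) // 2
        result = set()
        # Fix `anchor` in the left half so each split is tried once.
        for others in combinations(rest, half - 1):
            left = frozenset((anchor,) + others)
            right = group - left
            for x in winners(left):
                for y in winners(right):
                    result.add(x if y in beats[x] else y)
        return frozenset(result)

    return favorite in winners(frozenset(players))


# Four players: "a" beats only "b", so its out-degree is 1.
beats = {"a": {"b"}, "b": {"c", "d"}, "c": {"a", "d"}, "d": {"a"}}
players = list(beats)
```

In this instance `can_win(beats, players, "a")` is `False`: a four-player bracket requires two wins, but "a" can defeat only one opponent. This is the kind of degree-based observation that motivates the parameters studied in the abstract.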

Determining and verifying product provenance remains a critical challenge in global supply chains, particularly as geopolitical conflicts and shifting borders create new incentives for misrepresentation of commodities, such as hiding the origin of illegally harvested timber or stolen agricultural products. Stable Isotope Ratio Analysis (SIRA), combined with Gaussian process regression-based isoscapes, has emerged as a powerful tool for geographic origin verification. While these models are now actively deployed in operational settings supporting regulators, certification bodies, and companies, they remain constrained by data scarcity and suboptimal dataset selection. In this work, we introduce a novel deployed data valuation framework designed to enhance the selection and utilization of training data for machine learning models applied in SIRA. By quantifying the marginal utility of individual samples using Shapley values, our method guides strategic, cost-effective, and robust sampling campaigns within active monitoring programs. By prioritizing highly informative samples, our approach improves model robustness and predictive accuracy across diverse datasets and geographies. Our framework has been implemented and validated in a live provenance verification system currently used by enforcement agencies, demonstrating tangible, real-world impact. Through extensive experiments and deployment in this system, we show that the framework significantly enhances provenance verification, mitigates fraudulent trade practices, and strengthens regulatory enforcement of global supply chains.

Optimizing Product Provenance Verification Using Data Valuation Methods
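The Shapley-value idea mentioned in the abstract can be illustrated with a small exact computation: a sample's value is its average marginal contribution to a utility function over all orderings of the samples. The sketch below is a toy under stated assumptions, not the deployed framework. The utility used here (number of distinct regions covered) is a stand-in chosen for clarity, whereas the abstract describes Gaussian process models; the names `exact_shapley`, `regions`, and `coverage` are hypothetical.

```python
from itertools import permutations


def exact_shapley(items, utility):
    """Exact Shapley values by enumerating all orderings (small inputs only)."""
    values = {item: 0.0 for item in items}
    orderings = list(permutations(items))
    for order in orderings:
        chosen = []
        previous = utility(chosen)
        for item in order:
            chosen.append(item)
            current = utility(chosen)
            # Marginal contribution of `item` given what came before it.
            values[item] += current - previous
            previous = current
    return {item: total / len(orderings) for item, total in values.items()}


# Toy utility (an assumption): how many distinct regions the samples cover.
regions = {"s0": "north", "s1": "north", "s2": "south"}


def coverage(subset):
    return len({regions[s] for s in subset})


sample_values = exact_shapley(list(regions), coverage)
```

Here the two redundant northern samples share credit (0.5 each), while the only southern sample receives 1.0, which matches the intuition that sampling campaigns should prioritize regions that are not yet covered. Exact enumeration grows factorially, so practical systems typically rely on approximations.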

In emerging clinical applications such as ultrasound-based burn assessment, the lack of domain-specific data presents a significant challenge for developing robust AI systems. Vision-language models (VLMs) have shown strong performance in general computer vision tasks, yet their application to medical imaging remains limited, particularly due to insufficient reasoning capabilities and the scarcity of high-quality training data. We introduce AURA (Automated Unified Reasoning for Burn Assessment), a multi-modal approach that integrates pre-trained VLMs with symbolic first-order logic (FOL) reasoning to improve diagnostic accuracy and interpretability in this data-limited setting. For this study, we collected real-patient data over a one-year period at a U.S. burn center, performing all experiments in a real clinical setting to ensure practical relevance. The dataset includes both conventional B-Mode ultrasound and Tissue Doppler Imaging (TDI), with TDI introduced here for the first time in burn assessment, underscoring the emerging nature of this work. Beyond burn severity classification, we assess the system’s ability to produce expert-level surgical insight directly from imaging data. On the retrospective dataset, it achieves up to 93% accuracy in surgical classification and 87% in fine-grained burn depth prediction, comparable to expert-informed predictions and substantially exceeding the 70% accuracy of traditional visual inspection by human experts. These results, obtained from a novel multi-modal dataset collected in a real clinical burn center setting, highlight the potential of this approach to improve decision-making in burn care. To further support future deployment, we demonstrate a prototype integration with an Electronic Medical Record (EMR) system that aligns with clinical workflows and supports scalable, real-world implementation.

Automated Unified Reasoning with Vision-Language Models for Multi-modal Burn Assessment

Recent advances in machine learning have driven a step-change in robot perception with modalities such as vision, where large amounts of training data are readily available or cheap to collect. However, in tactile perception, the relatively high cost of data collection still largely impedes the adoption of such data-driven learning solutions. In this article, we introduce TactGen, a novel, cross-modal framework to tackle this challenge. In particular, using a two-step data generation pipeline, we leverage easily accessible vision data to synthesise artificial tactile data for downstream classifier training. Specifically, we use readily collected video data of objects of interest to efficiently learn neural radiance field (NeRF) representations. The NeRF models are then used to render red–green–blue-depth (RGBD) images from any desired vantage points. In the second stage, the RGBD images are translated into corresponding tactile images typically generated by camera-based tactile sensors using a conditional generative adversarial network (cGAN). The cGAN model is itself trained with a large set of visuo-tactile images collected in simulation, and can be transferred into the real world without fine-tuning or additional data collection. We extensively validate this approach in the context of tactile object classification, showing that it effectively reduces data collection time by a factor of 20 while achieving similar performance to training a classifier on manually collected real data.
