Singapore

Aerial Vision-and-Language Navigation (AVLN) requires Unmanned Aerial Vehicle (UAV) agents to localize targets in complex urban environments using linguistic instructions. While successful navigation demands both global environmental reasoning and fine-grained scene comprehension, existing UAV agents typically rely on single-step planning paradigms that struggle to balance these two aspects. To address this limitation, this work proposes a History-Enhanced Two-Stage Transformer (HETT) framework, which navigates in a coarse-to-fine manner. Specifically, HETT first predicts coarse-grained target positions using spatial landmarks and historical context, then refines actions through fine-grained visual analysis. Moreover, a historical grid map is designed to dynamically aggregate and organize visual features into a structured spatial memory, enhancing comprehensive scene awareness. Additionally, the CityNav dataset annotations are manually refined to enhance data quality. Experimental results demonstrate that HETT achieves state-of-the-art Success Rate (SR) on our refined CityNav dataset. The refined dataset and code will be released.
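The idea of a historical grid map that aggregates visual features into a structured spatial memory can be illustrated with a minimal sketch. It is not the HETT implementation: the class name `HistoryGridMap`, the square 2D cells, and the running-mean aggregation are all assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Cell = Tuple[int, int]


class HistoryGridMap:
    """Hypothetical spatial memory: averages the features observed in each grid cell."""

    def __init__(self, cell_size: float, feature_dim: int) -> None:
        self.cell_size = cell_size
        self.feature_dim = feature_dim
        # Per-cell running sums and observation counts (assumed aggregation rule).
        self._sums: Dict[Cell, List[float]] = defaultdict(lambda: [0.0] * feature_dim)
        self._counts: Dict[Cell, int] = defaultdict(int)

    def cell_of(self, x: float, y: float) -> Cell:
        """Map a world position to the integer index of its grid cell."""
        return (int(x // self.cell_size), int(y // self.cell_size))

    def update(self, x: float, y: float, feature: Sequence[float]) -> None:
        """Fold one observed feature vector into the cell that contains (x, y)."""
        if len(feature) != self.feature_dim:
            raise ValueError("feature has the wrong dimension")
        cell = self.cell_of(x, y)
        sums = self._sums[cell]
        for i, value in enumerate(feature):
            sums[i] += value
        self._counts[cell] += 1

    def read(self, cell: Cell) -> List[float]:
        """Return the mean feature of a cell, or zeros if it was never observed."""
        count = self._counts.get(cell, 0)
        if count == 0:
            return [0.0] * self.feature_dim
        return [s / count for s in self._sums[cell]]
```

A planner could call `update` at every step and `read` the cells around a candidate target, so that earlier observations stay available in later decisions.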

AAAI 2026

History-Enhanced Two-Stage Transformer for Aerial Vision-and-Language Navigation

global environmental memory

two-stage transformer

urban embodied intelligence

aerial vision-and-language navigation

technical paper

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.

A safe Pareto improvement (SPI) [41] is a modification of a game that leaves all players better off with certainty. 
SPIs are typically proven under qualitative assumptions about the way different games are played. 
For example, we assume that strictly dominated strategies can be iteratively removed and that isomorphic games are played isomorphically.
In this work, we study SPIs achieved through three types of \textit{ex post} verifiable commitments -- promises about player behavior from which deviations can be detected by observing the game. 
First, we consider disarmament -- commitments not to play certain actions. 
Next, we consider SPIs based on \textit{token games}. A token game is a game played by simply announcing an action (via cheap talk). As such, its outcome is intrinsically meaningless. However, we assume the players commit in advance to play specific (pure or correlated) strategy profiles in the original game as a function of the token game outcome. Under such commitments, the token game becomes a new, meaningful normal-form game.
Finally, we consider default-conditional commitment: SPIs in settings where the players' default ways of playing the original game can be credibly revealed and hence the players can commit to act as a function of this default. 
We characterize the complexity of deciding whether SPIs exist in all three settings, giving a mixture of characterizations, efficient algorithms, and \NP- and \textsc{Graph Isomorphism}-hardness results.
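The assumption that strictly dominated strategies can be iteratively removed can be made concrete with a small sketch for two-player normal-form games. It considers only domination by pure strategies, which is a simplifying assumption; the function name and the payoff-matrix layout are illustrative and not taken from the paper.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def iterated_strict_dominance(payoff_row: Matrix, payoff_col: Matrix) -> Tuple[List[int], List[int]]:
    """Return the row and column actions that survive iterated removal.

    payoff_row[r][c] and payoff_col[r][c] are the payoffs of the row and
    column player when row action r meets column action c.
    Only domination by another pure action is checked (an assumption).
    """
    rows = list(range(len(payoff_row)))
    cols = list(range(len(payoff_row[0])))
    changed = True
    while changed:
        changed = False
        # Remove row actions strictly dominated against all surviving columns.
        for r in list(rows):
            if any(all(payoff_row[o][c] > payoff_row[r][c] for c in cols)
                   for o in rows if o != r):
                rows.remove(r)
                changed = True
        # Remove column actions strictly dominated against all surviving rows.
        for c in list(cols):
            if any(all(payoff_col[r][o] > payoff_col[r][c] for r in rows)
                   for o in cols if o != c):
                cols.remove(c)
                changed = True
    return rows, cols


if __name__ == "__main__":
    # Prisoner's Dilemma: action 0 = cooperate, action 1 = defect.
    row_payoffs = [[3, 0], [5, 1]]
    col_payoffs = [[3, 5], [0, 1]]
    print(iterated_strict_dominance(row_payoffs, col_payoffs))  # ([1], [1])
```

In the Prisoner's Dilemma example, defection strictly dominates cooperation for both players, so only the action pair (1, 1) survives.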

Promises Made, Promises Kept: Safe Pareto Improvements via Ex Post Verifiable Commitments

A fundamental challenge in visual reinforcement learning (RL) is achieving robust generalization across environments with varying visual distractions. Current RL methods struggle with generalization because they cannot differentiate foreground and background features during augmentation, while their Q-consistency mechanisms rely on outdated actions from replay buffers that drift from the current policy. In this paper, we present PQDA, a novel framework that addresses generalization challenges in RL through two key innovations: (1) Foreground-Background Decoupled Augmentation leverages Gaussian mixture model-based segmentation to efficiently generate and cache masks in replay buffers, applying differentiated augmentation strategies to foreground and background regions, thereby enhancing data diversity while maintaining task-relevant features. (2) Policy-Aligned Q-Consistency enforces policy alignment by sampling actions from the current policy for Q-regularization, achieving faster and more stable convergence. Notably, PQDA eliminates auxiliary tasks entirely through a unified architecture that co-optimizes the encoder and RL components directly. Extensive experiments on DMControl benchmarks (including our newly proposed CVDMC benchmark) and robotic manipulation tasks demonstrate PQDA's superior generalization performance, outperforming state-of-the-art methods. The code and new CVDMC benchmark will be released to facilitate reproducibility.
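The foreground-background decoupling described above can be sketched as follows. This is not the PQDA code: the mask is assumed to be given (the Gaussian mixture model segmentation step is omitted), and the two example augmentations, `mild_brightness` and `random_background`, are hypothetical stand-ins for whatever strategies the method actually applies.

```python
import random
from typing import Callable, List

Image = List[List[float]]   # grayscale image with pixel values in [0, 1]
Mask = List[List[bool]]     # True marks a foreground (task-relevant) pixel


def decoupled_augment(image: Image, fg_mask: Mask,
                      fg_aug: Callable[[Image], Image],
                      bg_aug: Callable[[Image], Image]) -> Image:
    """Apply fg_aug to foreground pixels and bg_aug to background pixels."""
    fg_view = fg_aug(image)
    bg_view = bg_aug(image)
    return [
        [fg_view[i][j] if fg_mask[i][j] else bg_view[i][j] for j in range(len(row))]
        for i, row in enumerate(image)
    ]


def mild_brightness(image: Image, scale: float = 1.1) -> Image:
    """Weak augmentation (assumed): small brightness change, clipped at 1.0."""
    return [[min(1.0, p * scale) for p in row] for row in image]


def random_background(image: Image, rng: random.Random) -> Image:
    """Strong augmentation (assumed): overwrite every pixel with uniform noise."""
    return [[rng.random() for _ in row] for row in image]


if __name__ == "__main__":
    rng = random.Random(0)
    frame = [[0.5, 0.5], [0.5, 0.5]]
    mask = [[True, False], [False, True]]
    augmented = decoupled_augment(frame, mask, mild_brightness,
                                  lambda im: random_background(im, rng))
    print(augmented)
```

Caching the mask alongside each transition, as the abstract describes, would let this function run at sampling time without repeating the segmentation.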

PQDA: Policy-Aligned Q-Consistency Meets Decoupled Augmentation for Generalizable Visual RL

Hypergraph contrastive learning has emerged as a powerful unsupervised paradigm for hypergraph representation learning. Traditional hypergraph contrastive learning methods typically leverage a neighbor aggregation strategy to obtain entity (node and hyperedge) representations within each connected component, and then utilize contrastive losses (e.g., node- or hyperedge-level) to update the encoders. However, since entities are usually weighted equally in their respective losses, large connected components with numerous entities tend to dominate the whole learning process, which inevitably hinders the effective learning of entity representations within small connected components. To address this issue, we propose a novel Connected-Component-Aware Hypergraph Contrastive Learning method (CCAHCL). Different from previous methods that only construct node or hyperedge representations, our method additionally constructs connected component representations, and accordingly designs a hierarchical contrastive loss to balance the model's focus on different scales of connected components. Specifically, we first use the traditional neighbor aggregation strategy to aggregate and update entity (node and hyperedge) representations. Then, these entity representations are further aggregated to generate the connected component representations, where entity features are incorporated into connected components and their structural information is propagated back to enrich their corresponding entities. Afterwards, we employ node-level and hyperedge-level losses to learn the enriched entity representations, and further propose a novel connected-component-level contrastive loss to balance the model's focus across all connected components, naturally avoiding the learning bias toward large connected components. Extensive experiments on various datasets demonstrate that our proposed model achieves superior performance against other state-of-the-art methods.
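Any component-aware method first needs the connected components of the hypergraph. A minimal union-find sketch is shown here; the function name and the list-of-lists hyperedge encoding are assumptions, and the contrastive losses themselves are not reproduced.

```python
from typing import Dict, List, Sequence


def hypergraph_components(num_nodes: int, hyperedges: Sequence[Sequence[int]]) -> List[List[int]]:
    """Group nodes 0..num_nodes-1 into connected components.

    Two nodes are connected when a chain of hyperedges links them.
    """
    parent = list(range(num_nodes))

    def find(x: int) -> int:
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # All nodes sharing a hyperedge belong to the same component.
    for edge in hyperedges:
        for node in edge[1:]:
            root_a, root_b = find(edge[0]), find(node)
            if root_a != root_b:
                parent[root_b] = root_a

    groups: Dict[int, List[int]] = {}
    for node in range(num_nodes):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())


if __name__ == "__main__":
    # One large component, one singleton hyperedge, one isolated node.
    print(hypergraph_components(6, [[0, 1, 2], [2, 3], [4]]))
    # [[0, 1, 2, 3], [4], [5]]
```

With the components in hand, a loss can be averaged per component before averaging across components, which is one simple way to keep a four-node component from outweighing a single-node one.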

CCAHCL: Multi-Level Hypergraph Contrastive Learning for Connected Component Awareness

With the advancement of meteorological instruments, abundant data has become available. 
However, due to instruments’ intrinsic limitations such as environmental sensitivity and orbital constraints, raw data often suffer from temporal or spatial gaps, making it urgent to leverage data synthesis techniques to fill in missing information. 
Current approaches typically focus on single-variable, single-region tasks and primarily rely on deterministic modeling. 
This limits unified synthesis across variables and regions, overlooks cross-variable complementarity and often leads to over-smoothed results. 
To address the above challenges, we introduce SynWeather, the first dataset designed for \textbf{Unified Multi-region and Multi-variable Weather Observation Data Synthesis}. 
SynWeather covers four representative regions (the Continental United States, Europe, East Asia, and Tropical Cyclone regions) and provides high-resolution observations of key weather variables, including Composite Radar Reflectivity, Hourly Precipitation, Visible Light, and Microwave Brightness Temperature. 
In addition, we introduce SynWeatherDiff, a general and probabilistic weather synthesis model built upon the Diffusion Transformer framework to address the over-smoothing problem. 
Experiments on the SynWeather dataset demonstrate the effectiveness of our network compared with both task-specific and general models. 
Moreover, SynWeatherDiff is able to generate results that are both fine-grained and accurate in high-value regions.
Through the dataset and baseline model, we aim to advance meteorological downstream tasks and promote the development of general models for weather variable synthesis.

SynWeather: Weather Observation Data Synthesis Across Multiple Regions and Variables via a General Diffusion Transformer

Graph neural networks (GNNs) excel at processing non-Euclidean data, but privacy concerns often hinder data sharing, leading to data isolation. Although federated learning (FL) provides a privacy-preserving framework for distributed GNN training, combining FL with GNNs presents challenges such as data heterogeneity and client resource constraints. This paper introduces DA-DFGAS, a federated graph neural network architecture search algorithm that minimises resource consumption while maintaining search flexibility through a directed tree topology and path constraint mechanisms. Additionally, it employs a self-attention joint aggregation mechanism based on predicted probability distributions to reduce the discrepancy between the client and global output distributions. A client selection strategy combined with a global-local two-level objective optimisation balances global consistency and local flexibility. Experimental results demonstrate that DA-DFGAS surpasses existing baseline methods across multiple datasets, achieving an accuracy enhancement of 0.5–3.0\% compared to graph learning baselines and 0.5–5.0\% compared to other federated graph learning baselines.

DA-DFGAS: Differentiable Federated Graph Neural Architecture Search with Distribution-Aware Attentive Aggregation

With the rapid advancement of 3D visualization, 3D Gaussian Splatting (3DGS) has emerged as a leading technique for real-time, high-fidelity rendering. While prior research has emphasized algorithmic performance and visual fidelity, the perceptual quality of 3DGS-rendered content, especially under varying reconstruction conditions, remains largely underexplored. In practice, factors such as viewpoint sparsity, limited training iterations, point downsampling, noise, and color distortions can significantly degrade visual quality, yet their perceptual impact has not been systematically studied. To bridge this gap, we present 3DGS-QA, the first subjective quality assessment dataset for 3DGS. It comprises 225 degraded reconstructions across 15 object types, enabling a controlled investigation of common distortion factors. Based on this dataset, we introduce a no-reference quality prediction model that directly operates on native 3D Gaussian primitives, without requiring rendered images or ground-truth references. Our model extracts spatial and photometric cues from the Gaussian representation to estimate perceived quality in a structure-aware manner. We further benchmark existing quality assessment methods, spanning both traditional and learning-based approaches. Experimental results show that our method consistently achieves superior performance, highlighting its robustness and effectiveness for 3DGS content evaluation. The dataset and code will be released on GitHub upon acceptance to facilitate future research in 3DGS quality assessment.

Perceptual Quality Assessment of 3D Gaussian Splatting: A Subjective Dataset and Prediction Metric

Barrier certificates play an important role in verifying the safety of continuous-time systems, including autonomous driving, robotic manipulators and other critical applications. Recently, ReLU neural barrier certificates (barrier certificates represented by ReLU neural networks) have attracted significant attention in the safe control community due to their promising performance. However, because of the approximate nature of neural networks, rigorous verification methods are required to ensure the correctness of these certificates. This paper presents a necessary and sufficient condition for verifying the correctness of ReLU neural barrier certificates. The proposed condition can be encoded as either a Satisfiability Modulo Theories (SMT) problem or an optimization problem, enabling both verification and falsification. To the best of our knowledge, this is the first approach capable of falsifying ReLU neural barrier certificates. Numerical experiments demonstrate the validity and effectiveness of the proposed method in both verifying and falsifying such certificates.

Efficient Verification and Falsification of ReLU Neural Barrier Certificates

We introduce RacketVision, a novel dataset and benchmark for advancing computer vision in sports analytics, covering table tennis, tennis, and badminton. The dataset is the first to provide large-scale, fine-grained annotations for racket pose alongside traditional ball positions, enabling research into complex human-object interactions. It is designed to tackle three interconnected tasks: fine-grained ball tracking, articulated racket pose estimation, and predictive ball trajectory forecasting. Our evaluation of established baselines reveals a critical insight for multi-modal fusion: while naively concatenating racket pose features degrades performance, a Cross-Attention mechanism is essential to unlock their value, leading to trajectory prediction results that surpass strong unimodal baselines. RacketVision provides a versatile resource and a strong starting point for future research in dynamic object tracking, conditional motion forecasting, and multi-modal analysis in sports.
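The contrast between naive concatenation and cross-attention can be illustrated with a bare scaled dot-product cross-attention, in which ball-trajectory tokens act as queries over racket-pose tokens. This sketch assumes no learned projections and uses plain Python lists; it is not the benchmark's model, and the function name is illustrative.

```python
import math
from typing import List

Vector = List[float]


def cross_attention(queries: List[Vector], keys: List[Vector], values: List[Vector]) -> List[Vector]:
    """Scaled dot-product attention: each query mixes the values by key similarity."""
    dim = len(keys[0])
    outputs: List[Vector] = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        # Numerically stable softmax over the keys.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs


if __name__ == "__main__":
    ball_tokens = [[1.0, 0.0]]                  # hypothetical ball feature (query)
    racket_keys = [[1.0, 0.0], [0.0, 1.0]]      # hypothetical racket-pose keys
    racket_values = [[2.0, 0.0], [0.0, 4.0]]    # hypothetical racket-pose values
    print(cross_attention(ball_tokens, racket_keys, racket_values))
```

Unlike concatenation, the attention weights let each ball token select which racket-pose tokens matter at that moment instead of treating every pose feature as equally relevant.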

RacketVision: A Multiple Racket Sports Benchmark for Unified Ball and Racket Analysis

Large Language Models (LLMs) are widely used in legal judgment prediction tasks, which aim to enhance judicial efficiency. However, the length of legal fact descriptions poses a significant challenge to the application of LLMs. Long inputs not only introduce noise, affecting output quality, but also increase processing time. While existing text compression methods, such as generating summaries or training models to implicitly reduce text dimensionality, can shorten input length, they often suffer from slow generation and limited interpretability. To address these issues, and inspired by information bottleneck-based text compression, we propose the Zipped Information Processor for Legal Judgment Prediction (ZipLJP). By effectively integrating legal knowledge into the compression process, ZipLJP not only reduces input length but also improves processing efficiency and prediction quality. Experiments show that our approach outperforms previous methods on two widely used open-source, real-world datasets.

ZipLJP: Zipped Information Processor for Legal Judgment Prediction

Verbatim memorization in Large Language Models (LLMs) is a multifaceted phenomenon involving distinct underlying mechanisms. We introduce a novel method to analyze the different forms of memorization described by the existing taxonomy. Specifically, we train Convolutional Neural Networks (CNNs) on the attention weights of the LLM and evaluate the alignment between this taxonomy and the attention weights involved in decoding.

We find that the existing taxonomy performs poorly and fails to reflect distinct mechanisms within the attention blocks. We propose a new taxonomy that maximizes alignment with the attention weights, consisting of three categories: memorized samples that are guessed using language modeling abilities, memorized samples that are recalled due to high duplication in the training set, and non-memorized samples. Our results reveal that few-shot verbatim memorization does not correspond to a distinct attention mechanism. We also show that a significant proportion of extractable samples are in fact guessed by the model and should therefore be studied separately. Finally, we develop a custom visual interpretability technique to localize the regions of the attention weights involved in each form of memorization.
