Due to their ability to follow natural language instructions, vision-language-action (VLA) models are increasingly prevalent in the embodied AI arena, following the widespread success of their precursors, LLMs and VLMs. In this paper, we discuss 10 principal milestones in the ongoing development of VLA models: multimodality, reasoning, data, evaluation, cross-robot action generalization, efficiency, whole-body coordination, safety, agents, and coordination with humans. Furthermore, we discuss the emerging trends of using spatial understanding, modeling world dynamics, post-training, and data synthesis, all aimed at reaching these milestones. Through these discussions, we hope to bring attention to the research avenues that may accelerate the development of VLA models toward wider adoption.
