Motivated by applications in forecasting, we study chronological reasoning in LLMs. We test LLMs' ability to understand and enforce chronological order in three types of tasks: sorting randomly shuffled historical events; conditional sorting of events selected by given conditions; and anachronism detection based on intersections of multiple timelines. Our experiments use events that we first confirm are known to the LLM; this ensures that we test chronological understanding on an LLM's pretrained internal knowledge. Across three LLM families (GPT-4.1, a standard model; GPT-5, a hybrid-reasoning model; and Claude 3.7 Sonnet, a large-reasoning model evaluated with and without Extended Thinking), we find that performance degrades rapidly with problem complexity but improves greatly for reasoning models with extended test-time reasoning.
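To make the sorting task concrete, the sketch below scores a model's ordering of known events against the true chronology with a pairwise Kendall-tau-style metric. This is an illustrative scoring choice, not necessarily the metric used in the paper; the event list and function names are hypothetical.

```python
from itertools import combinations

def kendall_tau(true_order, model_order):
    """Pairwise agreement between a model's ordering and the true chronology.

    Returns 1.0 for a perfectly sorted list, -1.0 for a fully reversed one.
    Hypothetical metric for illustration; the paper's exact scoring may differ.
    """
    pos = {event: i for i, event in enumerate(model_order)}
    n = len(true_order)
    concordant = discordant = 0
    for a, b in combinations(range(n), 2):
        # true_order is chronological, so (a, b) is a correctly ordered pair
        if pos[true_order[a]] < pos[true_order[b]]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical events with well-known dates
events = ["Moon landing (1969)", "Fall of the Berlin Wall (1989)", "Launch of ChatGPT (2022)"]
print(kendall_tau(events, events))        # perfect order  -> 1.0
print(kendall_tau(events, events[::-1]))  # reversed order -> -1.0
```

A pairwise score like this gives partial credit for nearly correct orderings, which matters once shuffled lists grow long enough that exact-match accuracy collapses.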
