Singapore

Large-scale Chinese spelling correction (CSC) remains critical for real-world text processing, yet existing LLMs and supervised methods lack robustness to novel errors and rely on costly annotations. We introduce CEC-Zero, a zero-supervision reinforcement learning framework that addresses this by enabling LLMs to correct their own mistakes. CEC-Zero synthesizes errorful inputs from clean text, computes cluster-consensus rewards via semantic similarity and candidate agreement, and optimizes the policy with PPO. It outperforms supervised baselines by 10–13 F1 points and strong LLM fine-tunes by 5–8 points across 9 benchmarks, with theoretical guarantees of unbiased rewards and convergence. CEC-Zero establishes a label-free paradigm for robust, scalable CSC, unlocking LLM potential in noisy text pipelines.
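To make the idea of a consensus reward concrete, here is a minimal sketch, not the authors' implementation. It assumes the policy has sampled several candidate corrections for one noisy input, and it rewards each candidate by its mean similarity to the others. The function names `similarity` and `consensus_rewards` are hypothetical, and character-level `difflib` matching stands in for the semantic similarity model, which the abstract does not specify.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1].

    Assumption: a stand-in for a learned semantic similarity model.
    """
    return SequenceMatcher(None, a, b).ratio()


def consensus_rewards(candidates: list[str]) -> list[float]:
    """Reward each candidate by its mean similarity to all other candidates."""
    n = len(candidates)
    if n < 2:
        # No peers to agree with, so no consensus signal is available.
        return [0.0] * n
    rewards = []
    for i, cand in enumerate(candidates):
        others = [similarity(cand, candidates[j]) for j in range(n) if j != i]
        rewards.append(sum(others) / len(others))
    return rewards


# Three sampled corrections for one noisy sentence; the last keeps a wrong character.
candidates = ["我喜欢吃苹果", "我喜欢吃苹果", "我喜欢吃平果"]
rewards = consensus_rewards(candidates)
```

In this toy case the two agreeing candidates receive a higher reward than the outlier, which is the signal a policy-gradient method such as PPO could then optimize.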

AAAI 2026

CEC-Zero: Zero-Supervision Character Error Correction with Self-Generated Rewards

nlp: fact-checking / misinformation detection (nlp focus)

nlp: conversational ai/dialog systems

nlp: (large) language models

nlp: applications

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.



Integrating Ordinary Differential Equations (ODEs) with U-shaped neural networks has emerged as a novel direction in medical image segmentation. Current networks predominantly employ discretization methods incorporating ODEs. However, these methods face inherent trade-offs between model compactness, computational accuracy, and efficiency. Continuous ODE solutions have rarely been studied because they face three limitations: high computational costs, long training time, and poor generalization ability. To address these limitations, we propose an innovative Continuous Neural Memory ODE UNet (CNM-UNet), which replaces all hierarchical decoder layers in vanilla UNet with a single Continuous Neural Memory ODEs Block (CNM-Block) decoder, significantly reducing computation costs and improving training efficiency. CNM-UNet leverages ODEs' dynamic properties to establish continuous temporal feature extraction. To alleviate the generalization problem, a DUal SElf-updated (DUSE) strategy based on test-time adaptation principles is introduced to enhance cross-domain generalization. Experimental results demonstrate CNM-UNet's comprehensive advantages in computational capacity, convergence speed, and cross-domain adaptability, offering new insights for practical deployment of continuous ODE methodologies for medical image segmentation.

CNM-UNet: Continuous Ordinary Differential Equations for Medical Image Segmentation

Video generation models have achieved remarkable progress, particularly excelling in realistic scenarios; however, their performance degrades notably in imaginative scenarios. These prompts often involve rarely co-occurring concepts with long-distance semantic relationships, falling outside training distributions. Existing methods typically apply test-time scaling for improving video quality, but their fixed search spaces and static reward designs limit adaptability to imaginative scenarios. To fill this gap, we propose ImagerySearch, a dynamic test-time scaling law strategy inspired by imagery that adaptively adjusts the inference search space and reward guided by prompts, effectively enhancing generation quality in imaginative scenarios. Furthermore, we introduce LDT-Bench, the first benchmark targeting long-distance semantic prompts, designed to evaluate the creativity of video generation models. It comprises 2,839 challenging concept pairs from diverse recognition datasets and incorporates an automatic evaluation protocol to assess creative capacity. Extensive experiments on LDT-Bench demonstrate that our approach consistently outperforms general generation models and test-time scaling approaches. Additionally, ImagerySearch achieves strong performance on VBench, confirming its effectiveness in improving video generation quality under diverse conditions. We will release LDT-Bench and codes.

ImagerySearch: Adaptive Test-Time Search for Video Generation Beyond Semantic Dependency Constraints

Graph-based clustering algorithms aim to construct an affinity graph that accurately captures the intrinsic structure of a dataset. To achieve this goal, these algorithms often use the k-nearest-neighbor (k-nn) method to build a graph regularizer for the required affinity graph, enabling it to have a grouping effect. However, due to the complex nature of real-world data, the k-nn method often fails to capture the true neighborhood relationships of a dataset, which in turn limits the quality of the learned affinity graph. Motivated by the insight that a learned affinity graph itself can more effectively reflect the underlying data structure, we propose a new graph-based clustering framework, termed Self-learned Graph Regression (SGR). Unlike traditional approaches, SGR constructs its graph regularizer directly from the affinity graph being learned, allowing the graph to adaptively capture more accurate structural information. To solve the proposed problem, we develop an optimization algorithm along with an acceleration strategy. We further analyze the convergence and computational complexity of the proposed algorithm. Extensive clustering experiments on various benchmark datasets demonstrate that our method outperforms the state-of-the-art graph-based clustering algorithms. The codes of SGR are available at https://github.com/weilyshmtu/SGR.
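For readers unfamiliar with the k-nn baseline this abstract contrasts against, the following is a minimal sketch of a conventional symmetric k-nn affinity graph. It illustrates the traditional construction only, not SGR itself. The function name `knn_affinity`, the binary edge weights, and the toy points are all assumptions made for illustration.

```python
import math


def knn_affinity(points: list[tuple[float, ...]], k: int) -> list[list[float]]:
    """Build a symmetric binary k-nn affinity matrix from Euclidean distances."""
    n = len(points)
    affinity = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Sort every other point by its distance to point i.
        neighbors = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for _, j in neighbors[:k]:
            # Symmetrize: connect i and j if either selects the other.
            affinity[i][j] = 1.0
            affinity[j][i] = 1.0
    return affinity


# Two well-separated pairs of points (hypothetical toy data).
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
W = knn_affinity(points, k=1)
```

With `k=1` each point links only to its partner, so the graph splits cleanly into two groups. On real data the fixed `k` and the raw distances are what can misjudge neighborhoods, which is the limitation the abstract describes.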

Clustering with Self-Learned Graph Regression

Diffusion models have shown superior performance in real-world video super-resolution (VSR). However, the slow processing speeds and heavy resource consumption of diffusion models hinder their practical application and deployment. Quantization offers a potential solution for compressing the VSR model. Nevertheless, quantizing VSR models is challenging due to their temporal characteristics and high fidelity requirements. To address these issues, we propose QuantVSR, an effective low-bit quantization model for real-world VSR. We propose a spatio-temporal complexity aware (STCA) mechanism, where we first utilize the calibration dataset to measure both spatial and temporal complexities for each layer. Based on these statistics, we allocate layer-specific ranks to the low-rank full-precision (FP) auxiliary branch. Subsequently, we jointly refine the FP and low-bit branches to achieve simultaneous optimization. In addition, we propose a learnable bias alignment (LBA) module to reduce the biased quantization errors. Extensive experiments on synthetic and real-world datasets demonstrate that our method obtains comparable performance with the FP model and significantly outperforms recent leading low-bit quantization methods. All models and results will be made public.

QuantVSR: Low-Bit Post-Training Quantization for Real-World Video Super-Resolution

Egocentric point tracking aims to localize points on object surfaces from a first-person perspective and serves as a critical step toward embodied intelligence. Recent methods rely on video input, tracking query points through feature matching across consecutive frames. 
However, these methods struggle in highly dynamic settings—a common challenge in first-person perspectives, where the head-mounted camera undergoes frequent and abrupt rotations, resulting in high angular velocities, motion blur, and large inter-frame displacements.
In contrast, event cameras capture motion at microsecond temporal resolution, naturally avoiding blur and delivering low-latency, high-fidelity cues crucial for egocentric point tracking.
Moreover, rapid egocentric motion disrupts local smoothness, breaking the assumption that spatially adjacent regions share similar motion. Event dynamics expose global motion trends, guiding coherent modeling and consistent feature flow.
Therefore, this paper proposes a Mamba-based tracking framework that constructs feature modeling paths aligned with the dominant motion trend extracted from events, and modulates feature propagation along these paths based on local motion intensity, enhancing stability by suppressing unreliable signals and emphasizing consistent cues.
Additionally, a motion-adaptive suppression module enhances temporal robustness by adaptively suppressing correlation features based on motion intensity variations, mitigating the effects of intensity fluctuations and partial observability.
To facilitate research in this domain, a multimodal dataset named DVS-EgoPoints with both events and videos for egocentric point tracking is collected. Experiments on the DVS-EgoPoints dataset and a simulation benchmark demonstrate superior performance over state-of-the-art methods, especially under challenging motion and occlusion conditions.

E-MaT: Event-oriented Mamba for Egocentric Point Tracking

Visual Place Recognition (VPR) has advanced significantly with high-capacity foundation models like DINOv2, achieving remarkable performance. Nonetheless, their substantial computational cost makes deployment on resource-constrained devices impractical. In this paper, we introduce an efficient asymmetric VPR framework that incorporates a high-capacity gallery model for offline feature extraction with a lightweight query network for online processing. A key challenge in this setting is ensuring compatibility between these heterogeneous networks, which conventional approaches address through computationally expensive k-NN-based compatible training. To overcome this, we propose a geographical memory bank that structures gallery features using geolocation metadata inherent in VPR databases, eliminating the need for exhaustive k-NN computations. Additionally, we introduce an implicit embedding augmentation technique that enhances the query network to model feature variations despite its limited capacity. Extensive experiments demonstrate that our method not only significantly reduces computational costs but also outperforms existing asymmetric retrieval techniques, establishing a new aspect for VPR in resource-limited environments.

Towards Test-time Efficient Visual Place Recognition via Asymmetric Query Processing

How far are deep models from real-world video anomaly understanding (VAU)? Current works typically emphasize detecting unexpected occurrences that deviate from normal patterns or comprehending anomalous events with interpretable descriptions. However, they exhibit only a superficial comprehension of real-world anomalies, with limited breadth in the complex principles and subtle context that distinguish anomalies from normal events, e.g., climbing cliffs with safety gear vs. without it. To this end, we introduce CueBench, the first-of-its-kind Benchmark devoted to Context-aware video anomalies within a Unified Evaluation framework. We comprehensively establish an event-centric hierarchical taxonomy that anchors two core event types: 14 conditional and 18 absolute anomaly events, defined by their refined semantics from diverse contexts across 174 scenes and 198 attributes. Based on this, we propose to unify and benchmark context-aware VAU with various challenging tasks across recognition, temporal grounding, detection, and anticipation. This also serves as a rigorous and fair probing evaluation suite for generative-discriminative as well as generalized-specialized vision-language models (VLMs). To address the challenges underlying CueBench, we further develop Cue-R1 based on R1-style reinforcement fine-tuning with verifiable, task-aligned, and hierarchy-refined rewards in a unified generative manner. Extensive results on CueBench reveal that existing VLMs are still far from satisfactory in real-world anomaly understanding, while our Cue-R1 surpasses these state-of-the-art approaches by over 24% on average.

CueBench: Advancing Unified Understanding of Context-Aware Video Anomalies in Real-World

Vision-Language Navigation (VLN) enables agents to navigate in complex environments by following natural language instructions grounded in visual observations. Although most existing work has focused on ground-based robots or outdoor Unmanned Aerial Vehicles (UAVs), indoor UAV-based VLN remains underexplored, despite its relevance to real-world applications such as inspection, delivery, and search-and-rescue in confined spaces. 
To bridge this gap, we introduce **IndoorUAV**, a novel benchmark and method specifically tailored for VLN with indoor UAVs. We begin by curating over 1,000 diverse and structurally rich 3D indoor scenes from the Habitat simulator. Within these environments, we simulate realistic UAV flight dynamics to collect diverse 3D navigation trajectories manually, further enriched through data augmentation techniques. Furthermore, we design an automated annotation pipeline to generate natural language instructions of varying granularity for each trajectory. This process yields over 16,000 high-quality trajectories, comprising the **IndoorUAV-VLN** subset, which focuses on long-horizon VLN.
To support short-horizon planning, we segment long trajectories into sub-trajectories by selecting semantically salient keyframes and regenerating concise instructions, forming the **IndoorUAV-VLA** subset.
Finally, we introduce **IndoorUAV-Agent**, a novel navigation model designed for our benchmark, leveraging task decomposition and multimodal reasoning.
We hope IndoorUAV serves as a valuable resource to advance research on vision-language embodied AI in the indoor aerial navigation domain.

IndoorUAV: Benchmarking Vision-Language UAV Navigation in Continuous Indoor Environments

While recent 3D head avatar creation methods attempt to animate facial dynamics, they often fail to capture personalized details, limiting realism and expressiveness.
To fill this gap, we present DipGuava (Disentangled and Personalized Gaussian UV Avatar), a novel 3D Gaussian head avatar creation method that successfully generates avatars with personalized attributes from monocular video.
DipGuava is the first method to explicitly disentangle facial appearance into two complementary components, trained in a structured two-stage pipeline that significantly reduces learning ambiguity and enhances reconstruction fidelity.
In the first stage, we learn a stable geometry-driven base appearance that captures global facial structure and coarse expression-dependent variations.
In the second stage, the personalized residual details not captured in the first stage are predicted, including high-frequency components and nonlinearly varying features such as wrinkles and subtle skin deformations.
These components are fused via dynamic appearance fusion that integrates residual details after deformation, ensuring spatial and semantic alignment.
This disentangled design enables DipGuava to generate photorealistic, identity-preserving avatars, consistently outperforming prior methods in both visual quality and quantitative performance, as demonstrated in extensive experiments.

DipGuava: Disentangling Personalized Gaussian Features for 3D Head Avatars from Monocular Video

Weakly supervised phrase localization (WSPL) aims to localize the visual objects mentioned by given phrases while learning without human-annotated bounding boxes. Previous works struggle in multi-object scenarios, where objects in the background often appear simultaneously with the target objects. To this end, we propose a Diffusion-Assisted PrOgressive learning framework (DAPO) for the WSPL task.
Specifically, we score the difficulty of training samples based on the quantity of objects and the level of semantic alignment. These samples are then incorporated progressively during training, in order of their difficulty scores. To address the sample imbalance problem, we propose a Generation-Assisted Tuning (GAT) method for the grounding network. First, to enrich the samples from few-object scenarios, we leverage Stable Diffusion (SD) to generate images with phrases. Second, we introduce an attention-driven scheme to guide SD's attention toward the mentioned objects. Finally, we design a diffusion-guided loss, which helps the grounding network learn the objects' layouts. Extensive experiments show that our DAPO framework outperforms the strong baselines on benchmark datasets. The source code will be publicly available on GitHub after the double-blind phase.
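The progressive schedule described above can be sketched as a simple curriculum. This is an illustration under stated assumptions, not the DAPO implementation. The abstract does not give the scoring formula, so the `difficulty` function below (object count plus a weighted alignment penalty) is hypothetical, as are the `Sample` fields and the equal-sized stage split.

```python
import math
from dataclasses import dataclass


@dataclass
class Sample:
    name: str
    num_objects: int   # number of objects in the image
    alignment: float   # phrase-image semantic alignment in [0, 1]; higher is better


def difficulty(sample: Sample, weight: float = 1.0) -> float:
    """Hypothetical score: more objects and weaker alignment mean a harder sample."""
    return sample.num_objects + weight * (1.0 - sample.alignment)


def curriculum_stages(samples: list[Sample], num_stages: int) -> list[list[Sample]]:
    """Return cumulative training pools, easiest samples first."""
    ordered = sorted(samples, key=difficulty)
    stages = []
    for stage in range(1, num_stages + 1):
        # Each stage keeps the earlier samples and admits the next-hardest slice.
        cutoff = math.ceil(len(ordered) * stage / num_stages)
        stages.append(ordered[:cutoff])
    return stages


samples = [
    Sample("a", num_objects=1, alignment=0.9),
    Sample("b", num_objects=4, alignment=0.5),
    Sample("c", num_objects=2, alignment=0.8),
    Sample("d", num_objects=6, alignment=0.2),
]
stages = curriculum_stages(samples, num_stages=2)
stage_names = [[s.name for s in stage] for stage in stages]
```

Making the pools cumulative keeps the easy samples in play while harder ones are admitted. Other schedules, such as disjoint stages, would fit the same description equally well.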
