Wide-angle cameras, despite their popularity for content creation, suffer from distortion-induced facial stretching, especially at the edges of the lens, which degrades visual appeal. To address this issue, we propose a structure-to-detail portrait correction model named ImagePC. It integrates the long-range awareness of transformers and the multi-step denoising of diffusion models into a unified framework, achieving global structural robustness and local detail refinement. In addition, given the high cost of obtaining video labels, we repurpose ImagePC for unlabeled wide-angle videos (termed VideoPC) through spatiotemporal diffusion adaptation with spatial consistency and temporal smoothness constraints. For the former, we encourage the denoised image to approximate pseudo labels that follow the wide-angle distortion distribution; for the latter, we derive rectification trajectories from backward optical flows and smooth them. Compared with ImagePC, VideoPC maintains high-quality facial corrections in space and mitigates potential temporal jitter in blind scenarios. Finally, to establish an evaluation benchmark and train the framework, we build a video portrait dataset with large diversity in the number of people, lighting conditions, and backgrounds. Experiments demonstrate that the proposed methods outperform existing solutions both quantitatively and qualitatively, contributing to high-fidelity wide-angle videos with stable and natural portraits. The code and dataset will be made available.
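The temporal smoothness constraint above can be illustrated with a minimal sketch: propagate the previous frame's correction field along the backward optical flow and blend it with the current frame's field. This is not the paper's implementation; the flow convention, the nearest-neighbor warp, the exponential blending weight `alpha`, and the function name `smooth_rectification_fields` are all illustrative assumptions.

```python
import numpy as np

def smooth_rectification_fields(fields, flows, alpha=0.8):
    """Temporally smooth per-frame rectification displacement fields.

    fields: list of (H, W, 2) arrays, per-frame correction displacements
    flows:  list of (H, W, 2) backward optical flows; flows[t] is assumed
            to map pixels of frame t to frame t-1 (hypothetical convention)
    alpha:  exponential blending weight on the propagated estimate (assumed)
    """
    H, W, _ = fields[0].shape
    ys, xs = np.mgrid[0:H, 0:W]
    smoothed = [fields[0]]
    for t in range(1, len(fields)):
        # Follow the backward flow to find where each pixel came from,
        # using a simple nearest-neighbor warp clamped to the image bounds.
        px = np.clip(xs + flows[t][..., 0], 0, W - 1).astype(int)
        py = np.clip(ys + flows[t][..., 1], 0, H - 1).astype(int)
        propagated = smoothed[-1][py, px]  # previous field, warped forward
        # Blend the propagated estimate with the current frame's field.
        smoothed.append(alpha * propagated + (1 - alpha) * fields[t])
    return smoothed
```

With identical per-frame fields and zero flows, the output equals the input, i.e. a sequence that is already temporally consistent is left unchanged.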