Recent studies have explored the capabilities of large language models (LLMs) in solving knowledge-intensive mathematical reasoning problems. However, existing benchmarks predominantly involve static theorems that LLMs have encountered during pretraining, making it difficult to assess whether these models can incorporate new or evolving knowledge into their reasoning processes. In this work, we introduce TaxReasoning, a novel benchmark designed to evaluate LLMs’ abilities in real-world tax calculation scenarios. These tasks require not only mathematical reasoning and numerical computation, but also the extraction and application of complex, frequently updated tax regulations. Through extensive experiments with state-of-the-art LLMs using diverse prompting strategies and knowledge augmentation techniques, we uncover substantial limitations in their ability to handle dynamic, knowledge-intensive questions—primarily due to missing domain-specific knowledge and ineffective retrieval. Even the best-performing models fall significantly short of human-level performance. Our analysis points to key avenues for improvement, including enhancing LLMs’ reasoning capabilities, developing more effective knowledge summarization techniques, and improving retrieval strategies. TaxReasoning offers a challenging new testbed for advancing LLMs toward more reliable reasoning in real-world, evolving, and knowledge-intensive domains.
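To make concrete the kind of numerical computation such a task entails, the sketch below applies progressive (marginal) tax brackets to a taxable income. It is a minimal illustration only: the bracket thresholds, rates, and function name are hypothetical assumptions, not figures from any real tax code or from the benchmark itself.

```python
# A minimal sketch of the kind of computation a tax-calculation task
# demands: applying progressive (marginal) brackets to taxable income.
# All thresholds and rates below are hypothetical placeholders.

def progressive_tax(income: float, brackets: list[tuple[float, float]]) -> float:
    """Compute tax owed under marginal brackets.

    `brackets` is a list of (upper_bound, rate) pairs in ascending
    order; the final bracket uses float("inf") as its upper bound.
    """
    tax = 0.0
    lower = 0.0
    for upper, rate in brackets:
        if income <= lower:
            break
        # Only the slice of income falling inside this bracket is
        # taxed at this bracket's rate.
        taxed_portion = min(income, upper) - lower
        tax += taxed_portion * rate
        lower = upper
    return tax

# Hypothetical brackets: 10% up to 10,000; 20% up to 40,000; 30% above.
BRACKETS = [(10_000, 0.10), (40_000, 0.20), (float("inf"), 0.30)]

print(progressive_tax(55_000, BRACKETS))  # 1000 + 6000 + 4500 = 11500.0
```

The arithmetic itself is simple; what makes the benchmark's tasks hard, per the abstract, is that the bracket structure and applicable rules must first be extracted from complex, frequently updated regulations rather than hard-coded as above.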