Singapore

The Speaker Diarization and Recognition (SDR) task aims to predict "who spoke when and what" within an audio clip, which is a crucial task in various real-world multi-speaker scenarios such as meeting transcription and dialogue systems. Existing SDR systems typically adopt a cascaded framework, combining multiple modules such as speaker diarization (SD) and automatic speech recognition (ASR). The cascaded systems suffer from several limitations, such as error propagation, difficulty in handling overlapping speech, and lack of joint optimization for exploring the synergy between SD and ASR tasks. To address these limitations, we introduce SpeakerLM, a unified multimodal large language model for SDR that jointly performs SD and ASR in an end-to-end manner. Moreover, to facilitate diverse real-world scenarios, we incorporate a flexible speaker registration mechanism into SpeakerLM, enabling SDR under different speaker registration settings. SpeakerLM is progressively developed with a multi-stage training strategy on large-scale real data. Extensive experiments show that SpeakerLM demonstrates strong data scaling capability and generalizability, outperforming state-of-the-art cascaded baselines on both in-domain and out-of-domain public SDR benchmarks. Furthermore, experimental results show that the proposed speaker registration mechanism effectively ensures robust SDR performance of SpeakerLM across diverse speaker registration conditions and varying numbers of registered speakers.

AAAI 2026

SpeakerLM: End-to-End Versatile Speaker Diarization and Recognition with Multimodal Large Language Models

nlp: speech


poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.



With the widespread use of location-tracking technologies, large volumes of trajectory data are continuously generated. Trajectory similarity computation is a core task in trajectory mining with broad applications. However, existing methods still face two key challenges: (1) difficulty in balancing efficiency and representation quality, and (2) reliance on a single training paradigm, which limits the ability to capture both pairwise similarity and batch-level coherence. To address the challenges mentioned above, we propose a trajectory similarity computation framework, named TrajAgg. Specifically, our framework incorporates a novel aggregation transformer that efficiently aggregates GPS and grid features through two stages of direct interaction and enhances the expressiveness of the resulting trajectory embeddings. In addition, by integrating two distinct training paradigms, our model captures both fine-grained pairwise relationships and global structural consistency. We further analyze its effectiveness from the perspective of mutual information. Extensive experiments on three publicly available datasets show that TrajAgg consistently outperforms state-of-the-art baselines. Our method achieves average improvements of 15.11%, 16.49%, and 40.15% in HR@1 under three distance measures across three datasets, respectively. The code of our model is provided in the appendix.

TrajAgg: Dual-Scale Feature Aggregation with Hybrid Training for Trajectory Similarity Computation in Free Space
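The TrajAgg abstract reports gains in HR@1. As a rough illustration of how a hit-ratio metric of this kind is commonly computed for similarity search, here is a minimal sketch. It assumes HR@k is the average overlap between the top-k neighbours under the predicted distances and the top-k under the ground-truth distances; the paper's exact definition may differ, and the function name and the distance matrices are hypothetical.

```python
def hit_ratio_at_k(pred_dist, true_dist, k=1):
    """Average overlap between predicted and ground-truth top-k neighbours.

    pred_dist and true_dist are square lists of lists, where entry [q][j]
    is the distance from query trajectory q to trajectory j.
    This definition of HR@k is an assumption, not taken from the paper.
    """
    total = 0.0
    for q, (pred_row, true_row) in enumerate(zip(pred_dist, true_dist)):
        # A trajectory is never counted as its own neighbour.
        candidates = [j for j in range(len(true_row)) if j != q]
        pred_top = set(sorted(candidates, key=lambda j: pred_row[j])[:k])
        true_top = set(sorted(candidates, key=lambda j: true_row[j])[:k])
        total += len(pred_top & true_top) / k
    return total / len(true_dist)


# Toy example with three trajectories (illustrative numbers only).
true_dist = [[0.0, 1.0, 5.0], [1.0, 0.0, 2.0], [5.0, 2.0, 0.0]]
pred_dist = [[0.0, 2.0, 3.0], [1.5, 0.0, 1.0], [4.0, 1.0, 0.0]]
hr1 = hit_ratio_at_k(pred_dist, true_dist, k=1)
```

In this toy example, queries 0 and 2 retrieve the correct nearest neighbour and query 1 does not, so `hr1` is two thirds.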

Downsampling is essential in semantic segmentation for reducing computational cost and guiding the learning of class-discriminative features. Existing models typically rely on strided convolutions or patch splitting to obtain features with lower resolution. However, we observe that such operations often introduce edge jagging and texture degradation; the underlying cause is that aliasing of high-frequency components induces phase distortion. We conducted a systematic analysis of phase distortion and identified two key properties: spatial non-uniformity (concentrated near boundaries) and directional sparsity (accumulated along a few dominant directions). These properties cause crucial high-frequency cues to be misrepresented or lost during sampling. To address this issue, we propose a frequency-aware filter consisting of two complementary modules: a dynamic Gaussian kernel (DGK) and a learnable Gabor-based frequency selector (LFS). To mitigate spatial non-uniformity, the DGK predicts edge normals from gradients, applies strong low-pass filtering along the normal direction, and leaves the tangential direction virtually untouched, thereby suppressing phase distortion while preserving contour continuity. To handle directional sparsity, the LFS then performs directional band-pass filtering to attenuate residual aliasing peaks and adaptively boost informative texture. We further introduce phase-error energy (PE) to quantify distortion severity. Visualization and quantitative results demonstrate that the frequency-aware filter offers a plug-and-play remedy for aliasing, yielding sharper boundaries and consistent gains across datasets.

Revisiting Downsampling in Semantic Segmentation: Fighting Aliasing with Dynamic Gaussian and Gabor Frequency Filters
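For readers unfamiliar with the directional band-pass filtering mentioned in the downsampling abstract, the sketch below builds a single fixed, real-valued Gabor kernel: a Gaussian envelope multiplied by a cosine wave oriented along angle `theta`. It is only the textbook Gabor form; the paper's selector is learnable, and the function name and parameter values here are assumptions for illustration.

```python
import math


def gabor_kernel(size, sigma, theta, wavelength):
    """Return a size x size real Gabor kernel as a list of rows.

    sigma      - standard deviation of the isotropic Gaussian envelope
    theta      - orientation of the cosine carrier, in radians
    wavelength - period of the cosine carrier, in pixels
    """
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates so the carrier oscillates along theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr * xr + yr * yr) / (2.0 * sigma * sigma))
            row.append(envelope * math.cos(2.0 * math.pi * xr / wavelength))
        kernel.append(row)
    return kernel


# A 5x5 kernel that responds to horizontal oscillations with a 4-pixel period.
k = gabor_kernel(size=5, sigma=2.0, theta=0.0, wavelength=4.0)
```

Convolving a feature map with kernels at several values of `theta` gives one band-pass response per direction, which is the general idea behind direction-selective filtering.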

Lens flare is a common nighttime artifact caused by strong light sources scattering within camera lenses, leading to hazy streaks, halos, and glare that degrade visual quality. However, existing methods usually fail to effectively address nonuniform scattered flares, which severely reduces their applicability to complex real-world scenarios with diverse lighting conditions. To address this issue, we propose SLCFormer, a novel spectral-local context transformer framework for effective nighttime lens flare removal. SLCFormer integrates two key modules: the Frequency Fourier and Excitation Module (FFEM), which captures efficient global contextual representations in the frequency domain to model flare characteristics, and the Directionally-Enhanced Spatial Module (DESM) for local structural enhancement and directional features in the spatial domain for precise flare removal. Furthermore, we introduce a ZernikeVAE-based scatter flare generation pipeline to synthesize physically realistic scatter flares with spatially varying PSFs, bridging optical physics and data-driven training. Extensive experiments on the Flare7K++ dataset demonstrate that our method achieves state-of-the-art performance, outperforming existing approaches in both quantitative metrics and perceptual visual quality, and generalizing robustly to real nighttime scenes with complex flare artifacts.

SLCFormer: Spectral-Local Context Transformer with Physics-Grounded Flare Synthesis for Nighttime Flare Removal

Precise and controllable image editing, especially object removal and insertion, represents one of the most common demands in image manipulation. However, existing methods suffer from severe limitations. Mask-based inpainting often introduces visual artifacts and semantic inconsistencies, while instruction-based approaches lack accurate spatial control and tend to unintentionally modify background regions. To address these issues, we propose two key contributions. First, we develop a fully automated and self-improving pipeline for synthetic data generation. This pipeline utilizes a Large Language Model (LLM) to generate diverse prompts, a Diffusion Transformer (DiT) fine-tuned evolutionarily to synthesize high-quality images, and a Multimodal LLM (MLLM) combined with an open-set object detector for automated quality control and annotation. This process produces the Remove/Add Dataset (RAD), consisting of over 514,510 high-quality image pairs, each richly annotated with bounding boxes, segmentation masks, and a variety of editing instructions. Second, based on RAD, we introduce Remove/Add Anything (RAA), a novel editing framework with precise spatial control. Built upon a diffusion-based inpainting model, RAA achieves high editing accuracy by conditioning on both textual instructions and an explicitly defined region of interest (ROI), enabling efficient fine-tuning while maintaining global visual coherence. Extensive experiments demonstrate that RAA significantly outperforms existing open-source methods on both addition and removal tasks, and even slightly surpasses costly proprietary models.

RAA: Achieving Interactive Remove/Add Anything via Fully Synthetic Data

Facial Emotion Analysis (FEA) extends traditional facial emotion recognition by incorporating explainable, fine-grained reasoning. 
The task integrates three subtasks—emotion recognition, facial Action Unit (AU) recognition, and AU-based emotion reasoning—to jointly model affective states. 
While recent approaches leverage Vision-Language Models (VLMs) and achieve promising results, they face two critical limitations: 
(1) hallucinated reasoning, where VLMs generate plausible but inaccurate explanations due to insufficient emotion-specific knowledge; 
and (2) misalignment between emotion reasoning and recognition, caused by fragmented connections between observed facial features and final labels.
We propose Facial-R1, a three-stage alignment framework that effectively addresses both challenges with minimal supervision. 
First, we employ instruction fine-tuning to establish basic emotional reasoning capability for reducing hallucinations. 
Second, we introduce reinforcement training guided by emotion and AU labels as reward signals, which explicitly aligns the generated reasoning process with the predicted emotion. 
Third, we design a data synthesis pipeline that iteratively leverages the prior stages to expand the training dataset, enabling scalable self-improvement of the model.
Built upon this framework, we introduce FEA-20K, a benchmark dataset comprising 17,737 training and 1,688 test samples with fine-grained emotion analysis annotations. 
Extensive experiments across eight standard benchmarks demonstrate that Facial-R1 achieves state-of-the-art performance in FEA, with strong generalization and robust interpretability.
The code will be made available.

Facial-R1: Aligning Reasoning and Recognition for Facial Emotion Analysis

Multi-task reinforcement learning (MTRL) seeks to learn a unified policy for diverse tasks, but often suffers from gradient conflicts across tasks. Existing masking-based methods attempt to mitigate such conflicts by assigning task-specific parameter masks. However, our empirical study shows that coarse-grained binary masks tend to over-suppress key conflicting parameters, hindering knowledge sharing across tasks. Moreover, different tasks exhibit varying conflict levels, yet existing methods use a one-size-fits-all fixed sparsity strategy to maintain training stability and performance, which proves inadequate. These limitations hinder the model’s generalization and learning efficiency.

To address these issues, we propose SoCo-DT, a Soft Conflict-resolution method based on parameter importance. By leveraging Fisher information, mask values are dynamically adjusted to retain important parameters while suppressing conflicting ones. In addition, we introduce a dynamic sparsity adjustment strategy based on the Interquartile Range (IQR), which constructs task-specific thresholding schemes using the distribution of conflict and harmony scores during training. To enable adaptive sparsity evolution throughout training, we further incorporate an asymmetric cosine annealing schedule to continuously update the threshold. Experimental results on the Meta-World benchmark show that SoCo-DT outperforms the state-of-the-art method by 7.6% on MT50 and by 10.5% on the suboptimal dataset, demonstrating its effectiveness in mitigating gradient conflicts and improving overall multi-task performance.

Soft Conflict-Resolution Decision Transformer for Offline Multi-Task Reinforcement Learning
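The SoCo-DT abstract mentions an IQR-based threshold and a cosine annealing schedule without giving formulas. The sketch below shows one plausible shape for both, purely as an illustration: the rule `Q1 + scale * IQR` is a hypothetical thresholding rule, and the schedule is ordinary symmetric cosine annealing rather than the asymmetric variant the abstract names. All function names and values are assumptions.

```python
import math


def quartiles(values):
    """Return (Q1, Q3) using linear interpolation between order statistics."""
    xs = sorted(values)

    def quantile(p):
        pos = p * (len(xs) - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(xs) - 1)
        return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

    return quantile(0.25), quantile(0.75)


def iqr_threshold(scores, scale):
    """Hypothetical rule: place the threshold at Q1 + scale * IQR."""
    q1, q3 = quartiles(scores)
    return q1 + scale * (q3 - q1)


def cosine_anneal(step, total_steps, start, end):
    """Standard (symmetric) cosine annealing from start to end."""
    weight = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return end + (start - end) * weight


# Toy per-task conflict scores (illustrative numbers only).
scores = [1.0, 2.0, 3.0, 4.0, 5.0]
scale_now = cosine_anneal(step=5, total_steps=10, start=1.0, end=0.0)
threshold = iqr_threshold(scores, scale_now)
```

Because the threshold is derived from each task's own score distribution, tasks with widely spread scores get a wider margin than tasks whose scores cluster tightly, which is the intuition behind a task-specific sparsity level.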

Real-world scientific applications frequently encounter incomplete observational data due to sensor limitations, geographic constraints, or measurement costs. Although neural operators have significantly advanced PDE solving in terms of computational efficiency and accuracy, their underlying assumption of fully-observed spatial inputs severely restricts applicability in real-world applications. We introduce the first systematic framework for learning neural operators from partial observation. We identify and formalize two fundamental obstacles: (i) the supervision gap in unobserved regions that prevents effective learning of physical correlations, and (ii) the dynamic spatial mismatch between incomplete inputs and complete solution fields. Specifically, our proposed LANO (Latent Autoregressive Neural Operator) introduces two novel components designed explicitly to address the core difficulties of partial observations: (i) a mask-to-predict training strategy that creates artificial supervision by strategically masking observed regions, and (ii) a Physics-Aware Latent Propagator that reconstructs solutions through boundary-first autoregressive generation in latent space. Additionally, we develop POBench-PDE, a dedicated and comprehensive benchmark designed specifically for evaluating neural operators under partial observation conditions across three PDE-governed tasks. LANO achieves state-of-the-art performance with 18-69% relative L2 error reduction across all benchmarks under patch-wise missingness with less than 50% missing rate, including real-world climate prediction. Our approach effectively addresses practical scenarios involving up to 75% missing rate, to some extent bridging the existing gap between idealized research settings and the complexities of real-world scientific computing.

Learning Neural Operators from Partial Observations via Latent Autoregressive Modeling
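The mask-to-predict idea in the LANO abstract, hiding part of what was observed so it can serve as a training target, can be illustrated with a small sketch. This version hides a uniformly random subset of observed points, whereas the abstract describes the masking as strategic, so treat the selection rule, the function name, and the parameters as assumptions.

```python
import random


def mask_to_predict_split(observed_indices, hide_fraction, rng):
    """Split observed points into model inputs and artificial targets.

    A random subset of the observed indices is hidden from the model;
    because their true values are known, they can supervise training.
    Uniform random selection is an assumption made for this sketch.
    """
    indices = list(observed_indices)
    rng.shuffle(indices)
    n_hidden = int(len(indices) * hide_fraction)
    hidden = sorted(indices[:n_hidden])
    visible = sorted(indices[n_hidden:])
    return visible, hidden


# Ten observed grid points; hide 30% of them as supervision targets.
visible, hidden = mask_to_predict_split(range(10), 0.3, random.Random(0))
```

The model would then receive only the `visible` points as input and be penalized on its predictions at the `hidden` points, giving it a learning signal for regions it cannot see.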

Latent Diffusion Models (LDMs) have achieved remarkable success in image generation tasks, yet their low barrier to customization poses severe threats related to art plagiarism. As a countermeasure, adversarial methods have been proposed to protect artworks from plagiarism. However, current methods suffer from limited effectiveness, high cost, and complex optimization. Moreover, their exploration and exploitation of LDM vulnerabilities remain limited, restricting effectiveness and applicability. To address this issue, we conduct an in-depth analysis of the VAE and U-Net components within LDMs, revealing their inherent vulnerabilities. Specifically, we study the response of U-Net to specific structural and frequency patterns in the latent space and find that it is susceptible to high-frequency and periodic latent features. Furthermore, we observe significant channel correlations during the VAE encoding process. Inspired by these findings, we propose QRShield, an efficient protection method that exploits the vulnerabilities of LDMs. By constructing consistent high-frequency and periodic features across latent channels and combining them with a momentum-based translation-invariant attack strategy, QRShield achieves stronger and more efficient protection. QRShield significantly improves protection performance in various fine-tuning settings, with over 10% gains in multiple metrics, a threefold increase in generation speed, and nearly 50% reduction in memory usage. Our work deeply reveals the vulnerabilities of LDMs and proposes a more practical tool to prevent AI art plagiarism.

QRShield: Exploiting Vulnerabilities of Latent Diffusion Models for Preventing AI Art Plagiarism

Atmospheric turbulence severely degrades video quality by introducing distortions such as geometric warping, blur, and temporal flickering, posing significant challenges to both visual clarity and temporal consistency. Current state-of-the-art methods are based on transformer and 3D architectures and require multi-frame input, but their large computational cost and memory usage limit real-time deployment, especially in resource-constrained scenarios. In this work, we propose RMFAT, a Recurrent Multi-scale Feature Atmospheric Turbulence Mitigator designed for efficient and temporally consistent video restoration under atmospheric turbulence (AT) conditions. RMFAT adopts a lightweight recurrent framework that restores each frame using only two inputs at a time, significantly reducing temporal window size and computational burden. It further integrates multi-scale feature encoding and decoding with temporal warping modules at both encoder and decoder stages to enhance spatial detail and temporal coherence. Extensive experiments conducted on synthetic and real-world atmospheric turbulence datasets demonstrate that RMFAT not only outperforms existing methods in terms of clarity restoration (with nearly a 9% improvement in SSIM) but also achieves significantly faster inference (a more than fourfold reduction in runtime), making it particularly suitable for real-time atmospheric turbulence suppression tasks.

RMFAT: Recurrent Multi-scale Feature Atmospheric Turbulence Mitigator

Monte Carlo random walk methods are widely used in capacitance extraction for their mesh-free formulation and inherent parallelism. However, modern semiconductor technologies with densely packed structures present significant challenges in unbiasedly sampling transition domains in walk steps with multiple high-contrast dielectric materials. We present DeepRWCap, a machine learning-guided random walk solver that predicts the transition quantities required to guide each step of the walk. These include Poisson kernels, gradient kernels, and the signs and magnitudes of weights. DeepRWCap employs a two-stage neural architecture that decomposes structured outputs into face-wise distributions and spatial kernels on cube faces. It uses 3D convolutional networks to capture volumetric dielectric interactions and 2D depthwise separable convolutions to model localized kernel behavior. The design incorporates grid-based positional encodings and structural design choices informed by cube symmetries to reduce learning redundancy and improve generalization. Trained on 100,000 procedurally generated dielectric configurations, DeepRWCap achieves a mean relative error of $1.24 \pm 0.53$% when benchmarked against the commercial Raphael solver on the self-capacitance estimation of 10 industrial designs spanning 12 to 55 nm nodes. Compared to the state-of-the-art stochastic difference method Microwalk, DeepRWCap achieves an average 23% speedup. On complex designs with runtimes over 10 s, it reaches an average 49% acceleration.
