With the development of Large Language Models (LLMs), there is growing interest in applying their knowledge to tasks beyond text generation and question answering. Speech processing, a field of interest for decades, has recently seen successful applications of Transformer-based architectures, such as the Whisper models. While significant progress has been made in improving LLM capabilities through Supervised Fine-Tuning (SFT), reasoning, and alignment with Reinforcement Learning (RL), their application to the speech domain remains somewhat underexplored. Recent work has demonstrated that fine-tuning LoRA (Low-Rank Adaptation) adapters for LLMs enables them to perform Automatic Speech Recognition (ASR) natively, leveraging existing LLM capabilities and bypassing the pre-training stage. However, no approach has yet successfully applied LLM knowledge in a similar fashion to other speech processing tasks such as speaker diarisation. Current approaches use LLMs as a post-processing step on the outputs of a dedicated speaker diarisation model, but no LLM-based model can yet perform speaker diarisation natively. This research proposal therefore explores how LoRA adapters can enable LLMs to perform speaker diarisation natively, and how the speech domain's reliance on annotated data can be overcome.
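As a rough illustration of the LoRA idea referenced above, the sketch below shows the core low-rank update in NumPy: the pretrained weight matrix stays frozen while only two small matrices are trained. The dimensions, rank, and scaling factor here are illustrative choices, not values from the proposal.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 4        # hidden size and LoRA rank (illustrative values)
alpha = 8.0         # LoRA scaling factor (illustrative value)

W = rng.standard_normal((d, d))         # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                    # trainable up-projection, zero-initialised

def lora_forward(x):
    # Frozen path plus low-rank update: only A and B are trained,
    # adding 2*d*r parameters instead of updating all d*d weights.
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.standard_normal((2, d))
# With B zero-initialised, the adapted layer initially matches the frozen one.
assert np.allclose(lora_forward(x), x @ W.T)
```

Because the frozen model is untouched, the same base LLM could in principle host separate adapters for ASR and for speaker diarisation, which is the kind of reuse the proposal aims at.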