Learning representative embeddings for different types of speaking styles, such as emotion, age, and gender, is critical for both recognition tasks (e.g., cognitive computing and human-computer interaction) and generative tasks (e.g., style-controllable speech generation). In this work, we introduce ParaMETA, a unified and flexible framework for learning and controlling speaking styles directly from speech. Unlike existing methods that rely on single-task models or cross-modal alignment, ParaMETA learns disentangled, task-specific embeddings by projecting speech into dedicated subspaces for each style type. This design reduces inter-task interference, mitigates negative transfer, and allows a single model to handle multiple paralinguistic tasks such as emotion, gender, age, and nationality classification. Beyond recognition, ParaMETA enables fine-grained style control in text-to-speech (TTS) generative models. It supports both speech- and text-based prompting and allows users to modify one speaking style while preserving others. Extensive experiments demonstrate that ParaMETA outperforms strong baselines in classification accuracy and generates more natural and expressive speech, while maintaining a lightweight and efficient model suitable for real-world applications.
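To make the core idea concrete, below is a minimal sketch of what "projecting speech into dedicated subspaces for each style type" could look like: a shared utterance embedding is mapped through one projection per paralinguistic task, and each resulting task-specific embedding feeds its own classifier. All names, dimensions, and class counts here are illustrative assumptions, not ParaMETA's actual implementation.

```python
import torch
import torch.nn as nn

# Assumed class counts per paralinguistic task (illustrative only).
TASKS = {"emotion": 7, "gender": 2, "age": 5, "nationality": 10}

class MultiSubspaceStyleModel(nn.Module):
    """Hypothetical multi-subspace classifier in the spirit of the abstract."""

    def __init__(self, feat_dim=768, subspace_dim=128):
        super().__init__()
        # One projection per task: maps the shared speech embedding into a
        # dedicated, task-specific subspace to limit inter-task interference.
        self.projections = nn.ModuleDict(
            {task: nn.Linear(feat_dim, subspace_dim) for task in TASKS}
        )
        # One lightweight classification head per subspace.
        self.heads = nn.ModuleDict(
            {task: nn.Linear(subspace_dim, n_cls) for task, n_cls in TASKS.items()}
        )

    def forward(self, speech_features):
        # speech_features: (batch, feat_dim) utterance-level embeddings from
        # any pretrained speech encoder (e.g., a wav2vec2-style model).
        logits = {}
        for task, proj in self.projections.items():
            style_emb = proj(speech_features)        # task-specific embedding
            logits[task] = self.heads[task](style_emb)  # per-task class logits
        return logits

# Usage: a single forward pass yields predictions for all style tasks at once.
model = MultiSubspaceStyleModel()
feats = torch.randn(4, 768)  # stand-in for encoder output
out = model(feats)
print({task: t.shape for task, t in out.items()})
```

In this reading, disentanglement comes from keeping each style type in its own low-dimensional subspace, so a single backbone can serve all tasks while each head sees only the features relevant to it.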