Singapore

Temporal knowledge graph question answering (TKGQA) involves multi-hop reasoning over temporally constrained entity relationships in a knowledge graph to answer a given question.
However, at each hop, large language models (LLMs) retrieve subgraphs with numerous temporally similar and semantically complex relations, increasing the risk of suboptimal decisions and error propagation. 
To address these challenges, we propose the multi-hop reasoning enhanced (MRE) framework, which enhances both forward and backward reasoning to improve the identification of globally optimal reasoning trajectories.
Specifically, MRE begins with prompt engineering to guide the LLM in generating diverse reasoning trajectories for the given question. Valid reasoning trajectories are then selected for supervised fine-tuning, which serves as a cold-start strategy. Finally, we introduce tree-group relative policy optimization (T-GRPO), a recursive, tree-structured learning-by-exploration approach. At each hop, exploration establishes strong causal dependencies on the previous hop, while evaluation is informed by multi-path exploration feedback from subsequent hops. Experiments on two TKGQA benchmarks show that our MRE-based model outperforms state-of-the-art (SOTA) methods by 5.2% in accuracy on complex multi-hop queries. Further analysis highlights improved interpretability and robustness to noisy temporal annotations.
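The abstract does not spell out how T-GRPO scores a hop, so the following is only a minimal sketch under stated assumptions: it uses the usual GRPO-style group normalization (reward minus group mean, divided by group standard deviation) and assumes, hypothetically, that a hop's value is the average outcome of all paths explored beneath it. The names `node_value`, `hop_advantages`, and the dictionary tree layout are illustrative and not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its sibling group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def node_value(node):
    """Assumed scoring rule: average the outcomes of all paths explored below a hop."""
    children = node.get("children", [])
    if not children:
        return node["reward"]  # leaf: reward of the finished trajectory
    return mean(node_value(child) for child in children)


def hop_advantages(parent):
    """Advantages for the candidate hops expanded from one parent node."""
    values = [node_value(child) for child in parent["children"]]
    return group_relative_advantages(values)


# Toy tree: two candidate first hops, each explored along two continuations.
tree = {
    "children": [
        {"children": [{"reward": 1.0}, {"reward": 0.0}]},
        {"children": [{"reward": 0.0}, {"reward": 0.0}]},
    ]
}
advantages = hop_advantages(tree)
```

In this toy tree the first hop receives a positive advantage because one of its continuations reaches a correct answer, which illustrates how feedback from subsequent hops could inform the evaluation of an earlier one.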

AAAI 2026

Reinforcement Learning Enhanced Multi-hop Reasoning for Temporal Knowledge Question Answering

nlp: question answering

nlp: applications

nlp: information extraction

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.



Modern deep learning techniques focus on extracting intricate information from data to achieve accurate predictions. However, training datasets may be crowdsourced and include sensitive information, such as personal contact details, financial data, and medical records. As a result, there is a growing emphasis on developing privacy-preserving training algorithms for neural networks that maintain good performance. In this paper, we investigate the generalization and privacy performance of the differentially private gradient descent (DP-GD) algorithm, a private variant of gradient descent (GD) that adds noise to the gradients at each iteration. Moreover, we identify a concrete learning task where DP-GD can achieve superior generalization performance compared to GD in training two-layer Huberized ReLU convolutional neural networks (CNNs). Specifically, we demonstrate that, under mild conditions, a small signal-to-noise ratio can result in GD producing trained models with poor test accuracy, whereas DP-GD can yield trained models with good test accuracy and privacy guarantees if the signal-to-noise ratio is not too small. This indicates that DP-GD has the potential to enhance model performance while ensuring privacy protection in certain learning tasks. Numerical simulations are further conducted to support our theoretical results.
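As a rough illustration of the "noise added to the gradients at each iteration" idea, here is a minimal sketch of DP-GD on a toy one-parameter regression problem rather than the two-layer CNN studied in the paper. Per-example gradient clipping is an assumption borrowed from common DP training recipes, and the values of `clip`, `noise_std`, and `lr` are hypothetical; the sketch does not calibrate the noise to any particular privacy budget.

```python
import random


def dp_gd(xs, ys, steps=200, lr=0.1, clip=1.0, noise_std=0.5, seed=0):
    """Fit y ~ w * x with full-batch gradient descent plus Gaussian gradient noise."""
    rng = random.Random(seed)
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        clipped = []
        for x, y in zip(xs, ys):
            g = 2.0 * (w * x - y) * x  # per-example gradient of the squared error
            scale = min(1.0, clip / (abs(g) + 1e-12))
            clipped.append(g * scale)  # assumed step: bound each example's influence
        # Add Gaussian noise to the summed gradient (noise scale is illustrative).
        noisy_sum = sum(clipped) + rng.gauss(0.0, noise_std * clip)
        w -= lr * noisy_sum / n
    return w


# Toy data generated by y = 3x on a grid in [-1, 1].
xs = [i / 10 for i in range(-10, 11)]
ys = [3.0 * x for x in xs]

w_plain = dp_gd(xs, ys, noise_std=0.0)  # no noise: clipped gradient descent
w_private = dp_gd(xs, ys, noise_std=0.5)  # noisy updates
```

Setting `noise_std=0.0` recovers clipped gradient descent, which makes it easy to compare the noisy and noise-free trajectories on the same data.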

Towards Understanding Generalization in DP-GD: A Case Study in Training Two-Layer CNNs

In psychotherapy, Early Maladaptive Schemas (EMS) are entrenched negative perceptions of self or others that perpetuate mental health challenges, contribute to treatment resistance and relapse, and obstruct therapeutic progress. Addressing EMS using appropriate psychotherapeutic support (PS) strategies helps resolve core emotional deficits, mitigate resistance, and improve client engagement. Moreover, adapting polite and empathetic communication based on clients’ emotional states fosters trust, emotional safety, and a conducive therapeutic environment, which is critical for addressing EMS and achieving positive outcomes. Motivated by these insights, we introduce MATE, a novel EMS-guided polite and empAthetic dialogue sysTem for psychothErapeutic support. MATE integrates a Large Language Model (LLM) with a Mixture of Experts-based Reinforcement Learning (MoE-RL) approach to overcome the limitations of traditional RL methods, such as large action spaces and generic responses. The LLM captures diverse semantic patterns from dialogue context. MoE-RL leverages dedicated psychotherapeutic, politeness, and empathy experts, along with a new reward function comprising PS, politeness, empathy, contextual consistency, and diversity rewards, to guide policy learning for effective response generation. Evaluations on the HOPE and PSYCON datasets demonstrate MATE’s efficacy in generating polite and empathetic psychotherapeutic responses based on clients’ EMS and emotional cues while ensuring contextual consistency and diversity.

Facilitating Early Maladaptive Schema–Guided Polite and Empathetic Psychotherapeutic Support: An LLM-Driven MoE-RL-Based Dialogue System

Zero-shot singing voice conversion (SVC) transforms a source singer's timbre to an unseen target speaker's voice while preserving melodic content without fine-tuning. Existing methods model speaker timbre and vocal content separately, which loses essential acoustic information, degrades output quality, and requires significant computational resources. To overcome these limitations, we propose HQ-SVC, an efficient framework for high-quality zero-shot SVC. HQ-SVC first jointly extracts content and speaker features using a decoupled codec. It then enhances fidelity through pitch and volume modeling, preserving critical acoustic information typically lost in separate modeling approaches, and progressively refines outputs via differentiable signal processing and diffusion techniques. Evaluations confirm that HQ-SVC significantly outperforms state-of-the-art zero-shot SVC methods in conversion quality and efficiency. Beyond voice conversion, HQ-SVC achieves superior voice naturalness compared to specialized audio super-resolution methods while natively supporting voice super-resolution tasks.

HQ-SVC: Towards High-Quality Zero-Shot Singing Voice Conversion in Low-Resource Scenarios

Bootstrapping large language models (LLMs) through preference-based policy optimization offers a promising direction for aligning model behavior with human preferences without relying on extensive manual annotations. In this work, we propose a novel online preference-based policy optimization (PbPO) framework that formulates the learning process as a min-max game between the main policy and a reward model (RM). The RM is constrained within a confidence set derived from preference data to ensure reliable exploitation. Our iterative online algorithm actively collects preference data through guided exploration of the evolving policy, enabling continual self-improvement of both the policy and the RM. We provide theoretical guarantees for our method, establishing a high-probability regret bound of order $\widetilde{\mathcal{O}}(d\sqrt{T})$, demonstrating its effectiveness in bootstrapping LLMs. Extensive experiments on five benchmarks show that our approach consistently outperforms existing state-of-the-art preference optimization techniques.

Bootstrapping LLMs via Preference-Based Policy Optimization

Biological sequences, including RNAs and proteins, share similarities with natural languages, enabling the application of advanced language models to various biological tasks. However, due to its flexibility and lack of experimental data, RNA is a particularly challenging biological "language" compared to other biological sequences like proteins. RNA multiple sequence alignments (MSAs), which align evolutionarily related RNA sequences, can greatly enhance RNA biology modeling, as evidenced by their significant roles in structure prediction and function annotation. This raises the question of whether RNA MSAs can also benefit RNA design, which remains unexplored. This paper introduces RMSAGen, a model comprising RMSA-Encoder and RMSA-Decoder, that leverages MSAs to design functional RNA sequences. RMSA-Encoder effectively extracts MSA features, enhancing performance in functional prediction and solvent accessibility prediction tasks and supporting RMSA-Decoder in accurate RNA generation. RMSAGen can design RNA-binding protein sequences that effectively bind to target proteins, and the design performance improves with an increasing number of sequences. In addition, the ribozymes designed with structural features by RMSAGen show superior computational metrics and exhibit biological activity during gel electrophoresis. These results highlight the effectiveness of RMSAGen, establishing it as a powerful tool and a new direction for RNA design.

RMSAGen: Integrating Multiple Sequence Alignment for Function RNA Design

Sharpness-aware minimization (SAM) is widely recognized for enhancing the generalization performance of deep neural networks. However, recent works have challenged the claim that flatness implies generalization, demonstrating that flatness is insufficient as an indicator of generalization (Andriushchenko et al. 2023; Wen et al. 2023). In this paper, we reveal an insightful phenomenon: among minima of similar sharpness, stochastic optimization algorithms tend to prefer those with lower nonuniformity. We define nonuniformity by both the magnitude and structure of the gradient noise, and show that it fundamentally differs from sharpness and plays a critical role in generalization. Specifically, we first theoretically prove that the expected generalization gap of models trained via stochastic optimization algorithms is positively correlated with nonuniformity (the magnitude of the gradient noise). Empirically, we show that nonuniformity exhibits a stronger correlation with generalization than sharpness, especially in Transformer models. Furthermore, we demonstrate that nonuniformity (the structure of the gradient noise) more effectively guides the algorithm towards sparser solutions and yields better generalization performance than sharpness-based methods in the high-dimensional sparse regression problem. Finally, extensive experiments on various datasets and models confirm the advantages of nonuniformity for generalization: (1) optimization guided by nonuniformity achieves better generalization than optimization guided by flatness (including standard training, transfer learning, hyperparameter sensitivity, and robustness to label noise); (2) model architecture (such as depth and width) is closely related to nonuniformity.

Beyond Sharpness: The Role of Nonuniformity in Generalization

As large language models (LLMs) are increasingly used to generate synthetic personas, particularly in data-limited domains such as health, privacy, and HCI, it becomes necessary to understand how these narratives represent identity, especially that of minority communities. In this paper, we audit synthetic personas generated by three LLMs (GPT4o, Gemini 1.5 Pro, DeepSeek v2.5) through the lens of representational harm, focusing specifically on racial identity. Using a mixed-methods approach combining close reading, lexical analysis, and a parameterized creativity framework, we compare 1,512 LLM-generated personas to human-authored responses. Our findings reveal that LLMs disproportionately foreground racial markers, overproduce culturally coded language, and construct personas that are syntactically elaborate yet narratively reductive. These patterns result in a range of sociotechnical harms, including stereotyping, exoticism, erasure, and benevolent bias, that are often obfuscated by superficially positive narrations. We formalize this phenomenon as algorithmic othering, where minoritized identities are rendered hypervisible but less authentic. Based on these findings, we offer design recommendations for narrative-aware evaluation metrics and community-centered validation protocols for synthetic identity generation.

A Tale of Two Identities: An Ethical Audit of AI-Crafted Synthetic Personas

Large language models (LLMs) are increasingly deployed in real-world systems, making it critical to understand their vulnerabilities. While data poisoning attacks during RLHF/DPO alignment have been studied empirically, their theoretical foundations remain unclear. We investigate the minimum-cost poisoning attack required to steer an LLM’s policy toward an attacker’s target by flipping preference labels during RLHF/DPO, without altering the compared outputs. We formulate this as a convex optimization problem with linear constraints, deriving lower and upper bounds on the minimum attack cost. As a byproduct of this theoretical analysis, we show that any existing label-flipping attack can be post-processed via our proposed method to reduce the number of label flips required while preserving the intended poisoning effect. Empirical results demonstrate that this cost-minimization post-processing can significantly reduce poisoning costs over baselines, particularly when the reward model’s feature dimension is small relative to the dataset size. These findings highlight fundamental vulnerabilities in RLHF/DPO pipelines and provide tools to evaluate their robustness against low-cost poisoning attacks.

Cost-Minimized Label-Flipping Poisoning Attack to LLM Alignment

Backdoor attacks pose a serious threat to the security of large language models (LLMs), causing them to exhibit anomalous behavior under specific trigger conditions. The design of backdoor triggers has evolved from fixed triggers to dynamic or implicit triggers. This increased flexibility in trigger design makes it challenging for defenders to accurately identify their specific forms. Most existing backdoor defense methods are limited to specific types of triggers or rely on an additional clean model for support. To address this issue, we propose a backdoor detection method based on attention similarity, enabling backdoor detection without prior knowledge of the trigger. Our study reveals that models subjected to backdoor attacks exhibit unusually high similarity among attention heads when exposed to triggers. Based on this observation, we propose an attention safety alignment approach combined with head-wise fine-tuning to rectify potentially contaminated attention heads, thereby effectively mitigating the impact of backdoor attacks. Extensive experimental results demonstrate that our method significantly reduces the success rate of backdoor attacks while preserving the model’s performance on downstream tasks.
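The observation that triggered inputs make attention heads look unusually alike can be illustrated with a small sketch. The abstract does not state the similarity measure or decision rule, so mean pairwise cosine similarity and the `THRESHOLD` value below are assumptions, and the attention vectors are made-up toy numbers rather than measurements from any model.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two attention distributions over the same tokens."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_head_similarity(heads):
    """Average cosine similarity over all pairs of heads."""
    pairs = list(combinations(heads, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy attention rows (one per head) over four tokens.
clean_heads = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
# Hypothetical triggered input: every head collapses onto the third token.
triggered_heads = [
    [0.05, 0.05, 0.85, 0.05],
    [0.10, 0.05, 0.80, 0.05],
    [0.05, 0.10, 0.80, 0.05],
]

THRESHOLD = 0.9  # hypothetical cut-off, not a value from the paper
flagged = mean_head_similarity(triggered_heads) > THRESHOLD
```

A detector along these lines needs no knowledge of the trigger itself, only access to attention maps, which matches the trigger-agnostic setting described in the abstract.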

Uncovering and Aligning Anomalous Attention Heads to Defend Against NLP Backdoor Attacks

The deployment of decision-making AI agents presents a critical challenge in maintaining alignment with values or guidelines while operating in complex environments. Agents trained solely to achieve their task objectives may adopt harmful behaviors, exposing a key trade-off between maximizing reward and alignment. Avoiding misalignment is particularly difficult for pre-trained agents, where retraining is costly. This is further complicated by the diverse and potentially conflicting attributes representing ethical values. To address these challenges, we propose a test-time alignment technique based on model-guided policy shaping. Our method allows precise control over individual behavioral attributes, generalizes across diverse reinforcement learning (RL) environments, and facilitates a principled trade-off between ethical alignment and reward maximization without requiring agent retraining. We evaluate our approach using the MACHIAVELLI benchmark, which comprises 134 text-based game environments and thousands of annotated scenarios involving ethical decisions. The RL agents are first trained to maximize reward in their respective games. At test time, we apply policy shaping via scenario-action attribute classifiers to ensure decision alignment with ethical attributes. We compare our approach against prior training-time methods and general-purpose agents, and study several types of ethical violations and power-seeking behavior. Our results demonstrate that test-time policy shaping provides an effective and scalable solution for mitigating unethical behavior across diverse environments and alignment attributes.
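One way to picture test-time policy shaping is as a blend between the trained agent's action distribution and a distribution derived from an attribute classifier. The abstract does not give the exact shaping rule, so the linear interpolation, the `alpha` trade-off parameter, and the function name `shape_policy` below are all assumptions made for illustration.

```python
def shape_policy(policy_probs, violation_probs, alpha=0.5):
    """Blend the agent's policy with a classifier-derived preference for safe actions.

    policy_probs: action probabilities from the reward-maximizing agent.
    violation_probs: assumed classifier estimates that each action violates an attribute.
    alpha: 0 keeps the original policy, 1 follows the classifier alone.
    """
    safe = [1.0 - p for p in violation_probs]
    total = sum(safe)
    if total > 0:
        attribute_policy = [s / total for s in safe]
    else:
        # Every action is judged a certain violation: fall back to uniform.
        attribute_policy = [1.0 / len(safe)] * len(safe)
    mixed = [(1 - alpha) * p + alpha * q for p, q in zip(policy_probs, attribute_policy)]
    norm = sum(mixed)
    return [m / norm for m in mixed]


# Toy scenario with three actions; the agent's favorite action is likely unethical.
policy = [0.7, 0.2, 0.1]
violations = [0.9, 0.1, 0.2]

unshaped = shape_policy(policy, violations, alpha=0.0)
shaped = shape_policy(policy, violations, alpha=1.0)
balanced = shape_policy(policy, violations, alpha=0.5)
```

Because the agent's weights are never touched, a single parameter such as `alpha` can be adjusted per deployment to trade reward against alignment, which is the kind of retraining-free control the abstract describes.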
