Query rewriting is a crucial task for improving retrieval, especially in professional domains such as law and medicine, where user queries are often underspecified and ambiguous. While large language models (LLMs) offer strong understanding and generation capabilities, existing LLM-based approaches reduce the task to text transformation or expansion and neglect the reasoning needed to disambiguate queries, failing to bridge the cognitive gap between user queries and specialized documents. In this paper, we propose Think-Then-Rewrite (TTR), a reinforcement learning (RL) framework that unleashes LLMs' reasoning ability for domain-specific query rewriting. TTR introduces a contrastive mutual information reward that encourages the LLM to generate reasoning processes that effectively distinguish confusing distractors. To boost early-stage training, TTR also constructs golden query rewrites as off-policy data, providing strong guidance for RL. A mixed-policy optimization then combines on-policy and off-policy signals, ensuring both effectiveness and stability. Extensive experiments on legal and medical retrieval benchmarks demonstrate that TTR achieves state-of-the-art performance.
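The abstract does not give the exact form of the contrastive reward, but the idea of scoring a rewrite by how well it separates the target document from confusing distractors can be sketched as an InfoNCE-style objective. The sketch below is an illustration under assumed details, not the paper's implementation: it uses a simple bag-of-words cosine similarity as a stand-in retriever, and the function names (`contrastive_reward`, `cosine`) and the temperature `tau` are hypothetical.

```python
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity; a toy stand-in for a retriever score."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def contrastive_reward(rewrite: str, positive_doc: str,
                       distractor_docs: list[str], tau: float = 0.1) -> float:
    """InfoNCE-style reward: log-probability that the rewrite retrieves the
    positive document rather than any of the confusing distractors.
    Higher (closer to 0) means the rewrite separates the target better."""
    sims = [cosine(rewrite, positive_doc)]
    sims += [cosine(rewrite, d) for d in distractor_docs]
    exps = [math.exp(s / tau) for s in sims]
    return math.log(exps[0] / sum(exps))
```

A rewrite whose wording matches the target document scores strictly higher than one that drifts toward a distractor, which is the signal an RL loop could maximize.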
