Singapore

Information retrieval (IR) systems play a critical role in navigating information overload across various applications. Existing IR benchmarks primarily focus on simple queries that are semantically analogous to single- and multi-hop relations, overlooking complex logical queries involving first-order logic operations such as conjunction (∧), disjunction (∨), and negation (¬). As a result, these benchmarks cannot adequately evaluate the performance of IR models on complex queries in real-world scenarios. To address this problem, we propose a novel method leveraging large language models (LLMs) to construct a new IR dataset, ComLQ, for Complex Logical Queries, which comprises 2,909 queries and 11,251 candidate passages. A key challenge in constructing the dataset lies in capturing the underlying logical structures within unstructured text. We therefore design a subgraph-guided prompt with a subgraph indicator that guides an LLM (such as GPT-4o) to generate queries with specific logical structures based on selected passages. Expert annotation ensures that all query-passage pairs in ComLQ satisfy structure conformity and evidence distribution. To better evaluate whether retrievers can handle queries with negation, we further propose a new evaluation metric, Log-Scaled Negation Consistency (LSNC@K). As a supplement to standard relevance-based metrics (such as nDCG and mAP), LSNC@K measures whether the top-K retrieved passages violate negation conditions in queries. Our experimental results under zero-shot settings demonstrate existing retrieval models' limited performance on complex logical queries, especially on queries with negation, exposing their weakness in modeling exclusion. In summary, ComLQ offers a comprehensive and fine-grained exploration, paving the way for future research on complex logical queries in IR.
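To make the negation check concrete, the following is a minimal sketch of a rank-discounted negation-violation score. The abstract does not state the LSNC@K formula, so the logarithmic discount, the function name `negation_consistency_at_k`, and the per-passage `violates` labels are all illustrative assumptions rather than the paper's definition.

```python
import math
from typing import Sequence


def negation_consistency_at_k(violates: Sequence[bool], k: int) -> float:
    """Return a score in [0, 1]; 1.0 means no top-k passage violates the negation.

    violates[i] is True when the passage at rank i + 1 satisfies a condition
    that the query negates. Each violation is weighted by 1 / log2(rank + 1),
    so a violation near the top of the list costs more.
    NOTE: this weighting is an assumption, not the published LSNC@K formula.
    """
    top = list(violates[:k])
    if not top:
        return 1.0
    weights = [1.0 / math.log2(rank + 1) for rank in range(1, len(top) + 1)]
    penalty = sum(w for w, bad in zip(weights, top) if bad)
    return 1.0 - penalty / sum(weights)
```

Under this sketch, a ranking with a violating passage at rank 1 scores lower than one with the same passage at rank 3, which is the behavior a log-scaled penalty is meant to capture.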

AAAI 2026

ComLQ: Benchmarking Complex Logical Queries in Information Retrieval

nlp: other

nlp: applications

nlp: interpretability, analysis, and evaluation of nlp models


poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.




Large Language Models (LLMs) have demonstrated significant potential across various domains. However, they often struggle with integrating external knowledge and performing complex reasoning, leading to hallucinations and unreliable outputs. Retrieval Augmented Generation (RAG) has emerged as a promising paradigm to mitigate these issues by incorporating external knowledge. Yet, conventional RAG approaches—especially those based on vector similarity—fail to effectively handle relational structures and multi-step reasoning. In this work, we propose CogGRAG, a human cognition inspired, graph-based RAG framework designed for Knowledge Graph Question Answering (KGQA). CogGRAG mimics human reasoning through a three-stage process: (1) top-down problem decomposition via mind map construction; (2) structured retrieval of local and global knowledge from external Knowledge Graphs (KGs); and (3) bottom-up reasoning with self-verification. Unlike previous tree-based decomposition methods such as MindMap or Graph-CoT, CogGRAG unifies the entire reasoning process under a global mind map with early-stage, graph-structured retrieval and integrates dual-process verification to mitigate error propagation. Extensive experiments demonstrate that CogGRAG outperforms existing methods in both accuracy and reliability. We provide our code and data here: https://anonymous.4open.science/r/RAG-5883.

Human Cognition Inspired RAG with Knowledge Graph for Complex Problem Solving

While scaling the length of responses at test-time has been shown to markedly improve the reasoning abilities and performance of large language models (LLMs), it often results in verbose outputs and increases inference cost. Prior approaches for efficient test-time scaling, typically using universal budget constraints or query-level length optimization, do not leverage historical information from previous encounters with the same problem during training. We hypothesize that this limits their ability to progressively make solutions more concise over time. To address this, we present History-Aware Policy Optimization (HAPO), which keeps track of a history state (e.g., the minimum length over previously generated correct responses) for each problem. HAPO employs a novel length reward function based on this history state to incentivize the discovery of correct solutions that are more concise than those previously found. Crucially, this reward structure avoids overly penalizing shorter incorrect responses with the goal of facilitating exploration towards more efficient solutions. By combining this length reward with a correctness reward, HAPO jointly optimizes for correctness and efficiency. We use HAPO to train DeepSeek-R1-Distill-Qwen-1.5B, DeepScaleR-1.5B-Preview, and Qwen-2.5-1.5B-Instruct, and evaluate HAPO on several math benchmarks that span various difficulty levels. Experiment results demonstrate that HAPO effectively induces LLMs’ concise reasoning abilities, producing length reductions of 33-59% with accuracy drops of only 2-5%.
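The history-aware reward described above can be sketched as follows. This is a minimal illustration, assuming a clipped linear length term and a weight `w`; the abstract does not give the actual reward formula, and the names `length_reward`, `update_history`, and `combined_reward` are hypothetical.

```python
from typing import Optional


def length_reward(length: int, best_len: Optional[int], correct: bool) -> float:
    """Length term relative to the shortest correct response seen so far."""
    if best_len is None:
        return 0.0  # no correct response in the history yet: no length signal
    # Relative gap, clipped to [-1, 1]; positive when shorter than the history.
    gap = max(-1.0, min(1.0, (best_len - length) / best_len))
    if correct:
        return gap
    # Incorrect responses are never rewarded for brevity, but a short
    # incorrect response is not penalized either (assumed reading of the text).
    return min(0.0, gap)


def update_history(length: int, best_len: Optional[int], correct: bool) -> Optional[int]:
    """Track the minimum length over correct responses for one problem."""
    if not correct:
        return best_len
    return length if best_len is None else min(best_len, length)


def combined_reward(length: int, best_len: Optional[int], correct: bool, w: float = 1.0) -> float:
    """Correctness reward plus weighted length reward (w is an assumption)."""
    return (1.0 if correct else 0.0) + w * length_reward(length, best_len, correct)
```

For example, with a history state of 100 tokens, a correct 50-token response receives a bonus, while an incorrect 50-token response receives a length term of zero rather than a penalty, leaving room for exploration toward shorter solutions.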

HAPO: Training Language Models to Reason Concisely via History-Aware Policy Optimization

Continual learning (CL) empowers AI systems to progressively acquire knowledge from non-stationary data streams. However, *catastrophic forgetting* remains a critical challenge. In this work, we identify *attention drift* in Vision Transformers as a primary source of catastrophic forgetting, where the attention to previously learned visual concepts shifts significantly after learning new tasks. Inspired by neuroscientific insights into the selective attention in the human visual system, we propose a novel attention-retaining framework to mitigate forgetting in CL. Our method constrains attention drift by explicitly modifying gradients during backpropagation through a two-step process: 1) extracting attention maps of the previous task using a layer-wise rollout mechanism and generating instance-adaptive binary masks, and 2) when learning a new task, applying these masks to zero out gradients associated with previous attention regions, thereby preventing disruption of learned visual concepts. For compatibility with modern optimizers, the gradient masking process is further enhanced by scaling parameter updates proportionally to maintain their relative magnitudes. Experiments and visualizations demonstrate the effectiveness of our method in mitigating catastrophic forgetting and preserving visual concepts. It achieves state-of-the-art performance and exhibits robust generalizability across diverse CL scenarios.
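The two-step masking idea can be illustrated with a small sketch. It assumes a flattened attention map, a per-instance mean threshold for building the binary mask, and omits both the layer-wise rollout and the update rescaling; the function names are hypothetical and this is not the authors' implementation.

```python
from typing import List


def binary_mask(attention: List[float]) -> List[int]:
    """Mark entries above the map's own mean as previously attended (1).

    Using the mean as an instance-adaptive threshold is an assumption.
    """
    threshold = sum(attention) / len(attention)
    return [1 if a > threshold else 0 for a in attention]


def mask_gradients(grads: List[float], mask: List[int]) -> List[float]:
    """Zero out gradients that fall on previously attended regions."""
    return [0.0 if m else g for g, m in zip(grads, mask)]


# Step 1: derive a mask from the previous task's attention map.
previous_attention = [0.1, 0.7, 0.2]
mask = binary_mask(previous_attention)

# Step 2: while learning the new task, block updates to masked regions.
new_task_grads = [1.0, -2.0, 3.0]
masked_grads = mask_gradients(new_task_grads, mask)
```

Here the second position carried most of the previous task's attention, so its gradient is suppressed while the other two positions remain free to adapt.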

Attention Retention for Continual Learning with Vision Transformers

While Vision Language Models (VLMs) excel at understanding videos, their application to hour-long videos is hampered by two intertwined challenges: prohibitive computational costs and a qualitative failure in sustained temporal reasoning. Consequently, models often produce responses based on speculation rather than concrete visual information, leading to both factual inaccuracies and plausible hallucinations. This issue is exacerbated by existing benchmarks that, by focusing only on final answers, lack a rigorous mechanism to verify if reasoning is grounded in specific visual evidence. This makes it difficult to distinguish genuine comprehension from plausible fabrication, hindering targeted model improvement. To address these intertwined challenges of model fallibility and evaluation inadequacy, we propose a two-pronged approach. First, we introduce EV²-Bench, a large-scale benchmark that pioneers an evaluation paradigm centered on spatio-temporal visual evidence, compelling models to justify their answers with verifiable clues. Second, we propose DynamicSelect, an adaptive token compression framework that efficiently distills salient information using a dynamic semantic selector and a hierarchical compression strategy. Extensive experiments show that DynamicSelect substantially outperforms baselines on EV²-Bench and other public benchmarks. Our work provides not only a more effective method for long-video understanding but also a more rigorous evaluation paradigm, highlighting the path toward developing more robust and faithful models.

Seeing Is Believing: Grounding Long-Video Understanding in Spatio-Temporal Visual Evidence

Weakly-Supervised Video Anomaly Detection aims to identify anomalous events using only video-level labels, balancing annotation efficiency with practical applicability. However, existing methods often oversimplify the anomaly space by treating all abnormal events as a single category, overlooking the diverse semantic and temporal characteristics intrinsic to real-world anomalies. Inspired by how humans perceive anomalies, by jointly interpreting temporal motion patterns and semantic structures underlying different anomaly types, we propose RefineVAD, a novel framework that mimics this dual-process reasoning. Our framework integrates two core modules. The first, Motion-aware Temporal Attention and Recalibration (MoTAR), estimates motion salience and dynamically adjusts temporal focus via shift-based attention and global Transformer-based modeling. The second, Category-Oriented Refinement (CORE), injects soft anomaly category priors into the representation space by aligning segment-level features with learnable category prototypes through cross-attention. By jointly leveraging temporal dynamics and semantic structure, RefineVAD explicitly models both "how" motion evolves and "what" semantic category it resembles. Extensive experiments on the WVAD benchmark validate the effectiveness of RefineVAD and highlight the importance of integrating semantic context to guide feature refinement toward anomaly-relevant patterns.

RefineVAD: Semantic-Guided Feature Recalibration for Weakly Supervised Video Anomaly Detection

Existing tool-augmented large language models (LLMs) encounter significant challenges when processing complex queries. Current frameworks such as ReAct are prone to local optimization traps due to their reliance on incremental decision-making processes. To address these limitations, we propose a novel Planner-centric Plan-Execute paradigm that fundamentally resolves local optimization bottlenecks through architectural innovation. Central to our approach is a novel Planner model that performs global Directed Acyclic Graph (DAG) planning for complex queries, enabling optimized execution beyond conventional tool coordination. We also introduce ComplexTool-Plan, a large-scale benchmark dataset featuring complex queries that demand sophisticated multi-tool composition and coordination capabilities. Additionally, we develop a two-stage training methodology that integrates Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), systematically enhancing the Planner's tool selection accuracy and global planning awareness through structured DAG-based planning. When integrated with a capable executor, our framework achieves state-of-the-art performance on the StableToolBench benchmark for complex user queries, demonstrating superior end-to-end execution capabilities and robust handling of intricate multi-tool workflows.
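As a rough illustration of plan-then-execute over a DAG, the sketch below runs a fixed tool plan in dependency order using Python's standard `graphlib`. The plan format (step name mapped to the steps it depends on), the `execute_plan` helper, and the toy tools are assumptions for illustration; they do not reproduce the paper's Planner or executor.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List


def execute_plan(
    plan: Dict[str, List[str]],
    tools: Dict[str, Callable[..., object]],
) -> Dict[str, object]:
    """Run every step after its dependencies, passing their outputs as arguments.

    plan maps each step name to the list of steps it depends on.
    TopologicalSorter raises graphlib.CycleError if the plan is not a DAG.
    """
    results: Dict[str, object] = {}
    for step in TopologicalSorter(plan).static_order():
        inputs = [results[dep] for dep in plan.get(step, [])]
        results[step] = tools[step](*inputs)
    return results


# Hypothetical plan: two independent lookups feed one summary step.
example_plan = {"search": [], "weather": [], "summarise": ["search", "weather"]}
example_tools = {
    "search": lambda: "flights",
    "weather": lambda: "sunny",
    "summarise": lambda a, b: f"{a} + {b}",
}
example_results = execute_plan(example_plan, example_tools)
```

Because the whole graph is known before execution starts, independent steps such as `search` and `weather` could also be dispatched in parallel, which an incremental step-by-step loop cannot determine in advance.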

Beyond ReAct: A Planner-Centric Framework for Complex Tool-Augmented LLM Reasoning

Machine learning models now drive many critical decisions, making explanations of their reasoning essential. Recent work analyzes the complexity of exact explanations in transparent models, but these explanations are often too large for practical use. This has motivated research into probabilistic alternatives.

We study probabilistic extensions that allow controlled uncertainty while maintaining rigorous foundations. We analyze three basic model types: decision trees, decision lists, and decision sets. We introduce algorithms for computing both local and global probabilistic explanations for these models. Our main result shows that computing minimum-size probabilistic explanations is fixed-parameter tractable when parameterized by structural properties: specifically, the number of terms for decision lists and decision sets, and the minimum of the number of positive and the number of negative leaves for decision trees.
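To illustrate what "controlled uncertainty" means for a local explanation, the sketch below computes, by brute force, how often a model keeps its prediction when only a subset of features is fixed. It assumes Boolean features and a uniform distribution over the free features; the function name and toy model are hypothetical, and the exhaustive enumeration is exponential, so it is not the fixed-parameter algorithm the abstract refers to.

```python
from itertools import product
from typing import Callable, Sequence, Set


def explanation_precision(
    model: Callable[[Sequence[int]], int],
    instance: Sequence[int],
    kept: Set[int],
) -> float:
    """Fraction of completions of the kept features that preserve the prediction.

    Features whose index is in `kept` stay fixed to their value in `instance`;
    all other features range over {0, 1} with equal probability (assumption).
    """
    target = model(instance)
    free = [i for i in range(len(instance)) if i not in kept]
    agree = 0
    total = 0
    for values in product([0, 1], repeat=len(free)):
        candidate = list(instance)
        for i, v in zip(free, values):
            candidate[i] = v
        agree += int(model(candidate) == target)
        total += 1
    return agree / total


def toy_model(x: Sequence[int]) -> int:
    """A tiny decision rule: x0 AND (x1 OR x2)."""
    return int(x[0] and (x[1] or x[2]))
```

For the instance `[1, 1, 0]`, fixing only feature 0 preserves the prediction in 3 of 4 completions, so `{0}` is a valid probabilistic explanation at any threshold up to 0.75, whereas an exact explanation would need both features 0 and 1.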

Computing Probabilistic Explanations for ML Models: Fixed-Parameter Algorithms

Retrieval-Augmented Generation (RAG) plays a crucial role in grounding Large Language Models by leveraging external knowledge, but its effectiveness is often compromised by the retrieval of contextually flawed or incomplete information. To address this, knowledge graph-based RAG methods have evolved towards hierarchical structures, organizing knowledge into multi-level summaries. However, these approaches still suffer from two critical, unaddressed challenges: high-level conceptual summaries exist as disconnected "semantic islands", lacking the explicit relations needed for cross-community reasoning; and the retrieval process itself remains structurally unaware, often degenerating into an inefficient flat search that fails to exploit the graph's rich topology. To overcome these limitations, we introduce LeanRAG, a framework that features a deeply collaborative design combining knowledge aggregation and retrieval strategies. LeanRAG first employs a novel semantic aggregation algorithm that forms entity clusters and constructs new explicit relations among aggregation-level summaries, creating a fully navigable semantic network. Then, a bottom-up, structure-guided retrieval strategy anchors queries to the most relevant fine-grained entities and systematically traverses the graph's semantic pathways to gather concise yet contextually comprehensive evidence sets. LeanRAG can mitigate the substantial overhead associated with path retrieval on graphs and minimize redundant information retrieval. Extensive experiments on four challenging QA benchmarks from different domains demonstrate that LeanRAG significantly outperforms existing methods in response quality while reducing retrieval redundancy by 46%. Our code is available at: https://github.com/RaZzzyz/LeanRAG.

LeanRAG: Knowledge-Graph-Based Generation with Semantic Aggregation and Hierarchical Retrieval

Spiking Neural Networks (SNNs) offer a promising energy-efficient computing paradigm owing to their event-driven properties and biologically inspired dynamics. Among various encoding schemes, Time-to-First-Spike (TTFS) is particularly notable for its extreme sparsity, utilizing a single spike per neuron to maximize energy efficiency. However, two significant challenges persist: effectively leveraging TTFS sparsity to minimize training costs on Graphics Processing Units (GPUs), and bridging the performance gap between TTFS-based SNNs and their rate-based counterparts. To address these issues, we propose a parallel training algorithm for accelerated execution and a novel decoding strategy for enhanced performance. Specifically, we derive both forward and backward propagation equations for parallelized TTFS SNNs, enabling precise calculation of first-spike timings and gradients. Furthermore, we analyze the limitations of existing output decoders and introduce a membrane potential-based decoder, complemented by an incremental time-step training strategy, to improve accuracy. Our approach achieves state-of-the-art accuracy for TTFS SNNs on several benchmarks, including MNIST (99.51%), Fashion-MNIST (93.14%), CIFAR-10 (95.06%), and CIFAR-100 (74.07%). Code and experimental logs are in Supplementary Materials.

Parallel Training Time-to-First-Spike Spiking Neural Networks

Multimodal large reasoning models (MLRMs) have advanced visual-textual integration, enabling sophisticated human-AI interaction. While prior work has exposed MLRMs to visual jailbreaks, it remains underexplored how their reasoning capabilities reshape the security landscape under adversarial inputs. To fill this gap, we conduct a systematic security assessment of MLRMs and uncover a security-reasoning paradox: although deeper reasoning boosts cross-modal risk recognition, it also creates cognitive blind spots that adversaries can exploit. We observe that MLRMs oriented toward human-centric service are highly susceptible to users' emotional cues during the deep-thinking stage, often overriding safety protocols or built-in safety checks under high emotional intensity. Inspired by this key insight, we propose **EmoAgent**, an autonomous adversarial emotion-agent framework that orchestrates exaggerated affective prompts to hijack reasoning pathways. Even when visual risks are correctly identified, models can still produce harmful completions through emotional misalignment. We further identify persistent high-risk failure modes in transparent deep-thinking scenarios, such as MLRMs generating harmful reasoning masked behind seemingly safe responses. These failures expose misalignments between internal inference and surface-level behavior, eluding existing content-based safeguards. To quantify these risks, we introduce three metrics: (1) *Risk-Reasoning Stealth Score (RRSS)* for harmful reasoning beneath benign outputs; (2) *Risk-Visual Neglect Rate (RVNR)* for unsafe completions despite visual risk recognition; and (3) *Refusal Attitude Inconsistency (RAIC)* for evaluating refusal instability under prompt variants. Extensive experiments on advanced MLRMs demonstrate the effectiveness of EmoAgent and reveal deeper emotional cognitive misalignments in model safety behavior.

**Warning: This paper contains examples that may be offensive or harmful.**
