Singapore

Defending large language models (LLMs) against jailbreak attacks is crucial for ensuring their safe deployment. Existing defense strategies typically rely on predefined static criteria to differentiate between harmful and benign prompts. However, such rigid rules fail to accommodate the inherent complexity and dynamic nature of real-world jailbreak attacks. In this paper, we focus on the novel challenge of adaptive defense against diverse jailbreaks. We propose a new concept, the "mirror", which is a dynamically generated prompt that reflects the syntactic structure of the input while ensuring semantic safety. The discrepancies between input prompts and their corresponding mirrors serve as guiding principles for defense. A novel defense model, MirrorShield, is further proposed to detect and calibrate risky inputs based on the crafted mirrors. Evaluated on multiple benchmark datasets and compared against ten state-of-the-art attack methods, MirrorShield demonstrates superior defense performance and promising generalization capabilities.

AAAI 2026

MirrorShield: Towards Dynamic Adaptive Defense Against Jailbreaks via Entropy-Guided Mirror Crafting

nlp: conversational ai/dialog systems

nlp: safety and robustness

nlp: prompt engineering / prompting

nlp: (large) language models


poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.


Reconstructing dynamic scenes has long been a challenging task in 3D vision. Previous mainstream methods based on 3D Gaussian Splatting typically employ a single deformation field to directly model spatiotemporal changes. However, such one-step deformation struggles to capture diverse and complex motion patterns. To address this limitation, we propose decomposing the one-step deformation into a multi-step process, where each step is represented by a deformation layer. Additionally, we introduce a weight prediction mechanism for each layer to control the extent of deformation at every step. We provide two types of deformation layers based on implicit and explicit approaches. Moreover, while the deformation layer is time-conditioned, the Gaussians' behavior may still be influenced by their time-invariant properties. Therefore, we propose a fully time-agnostic scale modulation block to modulate the scaling changes of Gaussians. Extensive experiments on D-NeRF, Neu3D, and HyperNeRF demonstrate that our method achieves state-of-the-art performance.
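As a rough illustration of the multi-step idea, the sketch below applies a sequence of deformation layers to a single Gaussian center, scaling each step by a predicted weight. The residual update rule, the `Layer` and `Weight` callables, and the toy layers are assumptions made for illustration; the abstract does not specify how the layers are composed.

```python
from typing import Callable, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Layer = Callable[[Vec3, float], Vec3]     # hypothetical: (position, time) -> offset
Weight = Callable[[Vec3, float], float]   # hypothetical: (position, time) -> step weight


def multi_step_deform(x: Vec3, t: float,
                      layers: Sequence[Layer],
                      weights: Sequence[Weight]) -> Vec3:
    """Apply deformation layers one after another (assumed residual form).

    Each step computes an offset from the current position and adds it,
    scaled by that layer's predicted weight.
    """
    for layer, weight in zip(layers, weights):
        offset = layer(x, t)
        w = weight(x, t)
        x = (x[0] + w * offset[0], x[1] + w * offset[1], x[2] + w * offset[2])
    return x


# Toy layers: a time-dependent shift along x, then a shear that depends on
# the position produced by the first step.
toy_layers = [lambda p, t: (t, 0.0, 0.0), lambda p, t: (0.0, p[0], 0.0)]
toy_weights = [lambda p, t: 1.0, lambda p, t: 0.5]
deformed = multi_step_deform((1.0, 0.0, 0.0), 2.0, toy_layers, toy_weights)
```

Because the second layer sees the output of the first, the composition can express motion that a single offset field applied to the original position would have to model in one shot.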

Multi-Step Deformable Gaussian Splatting for Dynamic Scene Rendering

Memes are an expressive medium that often conveys rich emotions and intentions. Recent studies have confirmed the critical role of metaphors in meme understanding. However, existing metaphor research heavily relies on manual annotations, and mainstream vision-language models (VLMs) still struggle with the recognition and comprehension of metaphors. To address these challenges, we introduce MetaGPT, the first vision-language model specifically designed for meme metaphor understanding. MetaGPT is capable of identifying and extracting metaphors in memes, and generating accurate meme interpretations. Furthermore, we construct a dedicated dataset for meme understanding, MUnd, which comprises approximately 32,000 high-quality question-answer (QA) pairs across three core tasks: metaphor detection, metaphor domain extraction, and meme interpretation. Based on MUnd, we further propose an evaluation benchmark for meme understanding and conduct a comprehensive assessment of existing VLMs. Experimental results reveal that current models still face challenges in metaphor comprehension, while MetaGPT consistently outperforms them across all tasks, highlighting its potential in advancing meme understanding.

MetaGPT: A Large Vision-Language Model for Meme Metaphor Understanding

Vision-Language Models (VLMs) have demonstrated remarkable capabilities in visual classification tasks. Existing methods for enhancing VLMs on this task often rely heavily on direct category-to-image matching, which limits generalization and results in suboptimal performance. In addition, these methods provide no understanding of why a specific category is chosen. To address these limitations, we introduce a new deliberative visual classification task that decomposes the classification process into multiple deliberative steps and leverages Large Language Models (LLMs) to perform explicit reasoning before the final decision. Specifically, we propose a Retrieval-driven Reasoning model (RdR) with two components, i.e., retrieval database construction and deliberative category prediction. The first component leverages LLMs to extract category-relevant descriptors and constructs a retrieval database for effective image–descriptor matching. The second component facilitates multiple deliberative steps and performs explicit reasoning based on the retrieved descriptors to augment the category prediction. Extensive experiments on multiple datasets demonstrate that RdR consistently outperforms strong baselines, highlighting its robustness and generalization ability.

Retrieval-driven Reasoning for Deliberative Visual Classification

Information retrieval (IR) systems play a critical role in navigating information overload across various applications. Existing IR benchmarks primarily focus on simple queries that are semantically analogous to single- and multi-hop relations, overlooking complex logical queries involving first-order logic operations such as conjunction (∧), disjunction (∨), and negation (¬). Thus, these benchmarks cannot sufficiently evaluate the performance of IR models on complex queries in real-world scenarios. To address this problem, we propose a novel method leveraging large language models (LLMs) to construct a new IR dataset, ComLQ, for Complex Logical Queries, which comprises 2,909 queries and 11,251 candidate passages. A key challenge in constructing the dataset lies in capturing the underlying logical structures within unstructured text. Therefore, we design a subgraph-guided prompt with a subgraph indicator, which guides an LLM (such as GPT-4o) to generate queries with specific logical structures based on selected passages. Expert annotation ensures that all query-passage pairs in ComLQ satisfy structure conformity and evidence distribution. To better evaluate whether retrievers can handle queries with negation, we further propose a new evaluation metric, Log-Scaled Negation Consistency (LSNC@K). As a supplement to standard relevance-based metrics (such as nDCG and mAP), LSNC@K measures whether top-K retrieved passages violate negation conditions in queries. Our experimental results under zero-shot settings demonstrate existing retrieval models' limited performance on complex logical queries, especially on queries with negation, exposing their weakness in modeling exclusion. In summary, ComLQ offers a comprehensive and fine-grained exploration, paving the way for future research on complex logical queries in IR.
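To make the idea of a negation violation concrete, the sketch below computes a plain violation rate over the top-K results. This is not LSNC@K: the abstract does not give the metric's formula, so the log scaling is omitted, and the function name, arguments, and passage ids are hypothetical.

```python
from typing import Sequence, Set


def negation_violation_rate(ranked_ids: Sequence[str],
                            excluded_ids: Set[str],
                            k: int) -> float:
    """Fraction of the top-k retrieved passages that the query's negation excludes.

    A simplified stand-in for illustration only; it is not the paper's LSNC@K.
    """
    top_k = ranked_ids[:k]
    if not top_k:
        return 0.0
    violations = sum(1 for pid in top_k if pid in excluded_ids)
    return violations / len(top_k)


# Hypothetical query "A and not B": passages p2 and p4 satisfy B, so a retriever
# that models exclusion should keep them out of the top ranks.
ranking = ["p1", "p2", "p3", "p4"]
excluded = {"p2", "p4"}
rate_at_2 = negation_violation_rate(ranking, excluded, k=2)
```

A retriever that ranks purely by topical similarity tends to score poorly on such a measure, because passages mentioning the negated concept look relevant to the query text.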

ComLQ: Benchmarking Complex Logical Queries in Information Retrieval

Large Language Models (LLMs) have demonstrated significant potential across various domains. However, they often struggle with integrating external knowledge and performing complex reasoning, leading to hallucinations and unreliable outputs. Retrieval Augmented Generation (RAG) has emerged as a promising paradigm to mitigate these issues by incorporating external knowledge. Yet, conventional RAG approaches—especially those based on vector similarity—fail to effectively handle relational structures and multi-step reasoning. In this work, we propose CogGRAG, a human cognition inspired, graph-based RAG framework designed for Knowledge Graph Question Answering (KGQA). CogGRAG mimics human reasoning through a three-stage process: (1) top-down problem decomposition via mind map construction; (2) structured retrieval of local and global knowledge from external Knowledge Graphs (KGs); and (3) bottom-up reasoning with self-verification. Unlike previous tree-based decomposition methods such as MindMap or Graph-CoT, CogGRAG unifies the entire reasoning process under a global mind map with early-stage, graph-structured retrieval and integrates dual-process verification to mitigate error propagation. Extensive experiments demonstrate that CogGRAG outperforms existing methods in both accuracy and reliability. We provide our code and data here: https://anonymous.4open.science/r/RAG-5883.

Human Cognition Inspired RAG with Knowledge Graph for Complex Problem Solving

While scaling the length of responses at test-time has been shown to markedly improve the reasoning abilities and performance of large language models (LLMs), it often results in verbose outputs and increases inference cost. Prior approaches for efficient test-time scaling, typically using universal budget constraints or query-level length optimization, do not leverage historical information from previous encounters with the same problem during training. We hypothesize that this limits their ability to progressively make solutions more concise over time. To address this, we present History-Aware Policy Optimization (HAPO), which keeps track of a history state (e.g., the minimum length over previously generated correct responses) for each problem. HAPO employs a novel length reward function based on this history state to incentivize the discovery of correct solutions that are more concise than those previously found. Crucially, this reward structure avoids overly penalizing shorter incorrect responses, with the goal of facilitating exploration towards more efficient solutions. By combining this length reward with a correctness reward, HAPO jointly optimizes for correctness and efficiency. We use HAPO to train DeepSeek-R1-Distill-Qwen-1.5B, DeepScaleR-1.5B-Preview, and Qwen-2.5-1.5B-Instruct, and evaluate HAPO on several math benchmarks that span various difficulty levels. Experimental results demonstrate that HAPO effectively induces LLMs’ concise reasoning abilities, producing length reductions of 33-59% with accuracy drops of only 2-5%.
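The history-state mechanism can be sketched as follows. The class name, the clipped relative-length formula, and the `length_weight` parameter are assumptions chosen for illustration; the abstract does not give HAPO's actual reward function. The sketch keeps only the stated properties: the history state is the minimum length among earlier correct responses, correct responses shorter than it earn a bonus, and incorrect responses shorter than it are not penalized.

```python
from typing import Dict


class HistoryAwareReward:
    """Toy correctness-plus-length reward keyed on a per-problem history state."""

    def __init__(self, length_weight: float = 0.5) -> None:
        self.length_weight = length_weight
        # History state: shortest correct response length seen so far per problem.
        self.min_correct_len: Dict[str, int] = {}

    def length_reward(self, problem_id: str, length: int, correct: bool) -> float:
        best = self.min_correct_len.get(problem_id)
        if best is None:
            return 0.0  # no correct response seen yet, so no length signal
        # Relative saving versus the history state, clipped to [-1, 1].
        rel = max(-1.0, min(1.0, (best - length) / best))
        if correct:
            return rel  # bonus if shorter than history, penalty if longer
        # Incorrect: no bonus for brevity and no penalty for being shorter;
        # only responses longer than the history state are penalized.
        return min(0.0, rel)

    def score(self, problem_id: str, length: int, correct: bool) -> float:
        reward = (1.0 if correct else 0.0) + self.length_weight * self.length_reward(
            problem_id, length, correct
        )
        # Update the history state after scoring, using correct responses only.
        if correct:
            best = self.min_correct_len.get(problem_id)
            if best is None or length < best:
                self.min_correct_len[problem_id] = length
        return reward
```

Scoring `("p1", 1000, True)` and then `("p1", 500, True)` on a fresh instance lowers the history state to 500, so later responses are compared against the shorter solution.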

HAPO: Training Language Models to Reason Concisely via History-Aware Policy Optimization

Continual learning (CL) empowers AI systems to progressively acquire knowledge from non-stationary data streams. However, *catastrophic forgetting* remains a critical challenge. In this work, we identify *attention drift* in Vision Transformers as a primary source of catastrophic forgetting, where the attention to previously learned visual concepts shifts significantly after learning new tasks. Inspired by neuroscientific insights into the selective attention in the human visual system, we propose a novel attention-retaining framework to mitigate forgetting in CL. Our method constrains attention drift by explicitly modifying gradients during backpropagation through a two-step process: 1) extracting attention maps of the previous task using a layer-wise rollout mechanism and generating instance-adaptive binary masks, and 2) when learning a new task, applying these masks to zero out gradients associated with previous attention regions, thereby preventing disruption of learned visual concepts. For compatibility with modern optimizers, the gradient masking process is further enhanced by scaling parameter updates proportionally to maintain their relative magnitudes. Experiments and visualizations demonstrate the effectiveness of our method in mitigating catastrophic forgetting and preserving visual concepts. It achieves state-of-the-art performance and exhibits robust generalizability across diverse CL scenarios.
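The two-step masking idea can be illustrated with a small sketch. The mean-threshold rule for building the mask and the per-patch gradient layout are assumptions; the layer-wise rollout and the optimizer-compatible update scaling mentioned above are omitted, since the abstract does not describe them in detail.

```python
from typing import List


def binary_mask(attention: List[float]) -> List[int]:
    """Mark patches whose attention exceeds this instance's own mean.

    The mean threshold is a hypothetical choice of "instance-adaptive" rule.
    """
    threshold = sum(attention) / len(attention)
    return [1 if a > threshold else 0 for a in attention]


def mask_gradients(grads: List[float], mask: List[int]) -> List[float]:
    """Zero out gradient entries that fall on previously attended patches."""
    return [0.0 if m else g for g, m in zip(grads, mask)]


# Attention over four patches from the previous task: patch 1 dominates, so
# its gradient is suppressed while the new task is being learned.
prev_attention = [0.1, 0.6, 0.2, 0.1]
mask = binary_mask(prev_attention)
new_task_grads = mask_gradients([0.5, -0.3, 0.2, 0.1], mask)
```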

Attention Retention for Continual Learning with Vision Transformers

While Vision Language Models (VLMs) excel at understanding videos, their application to hour-long videos is hampered by two intertwined challenges: prohibitive computational costs and a qualitative failure in sustained temporal reasoning. Consequently, models often produce responses based on speculation rather than concrete visual information, leading to both factual inaccuracies and plausible hallucinations. This issue is exacerbated by existing benchmarks that, by focusing only on final answers, lack a rigorous mechanism to verify if reasoning is grounded in specific visual evidence. This makes it difficult to distinguish genuine comprehension from plausible fabrication, hindering targeted model improvement. To address these intertwined challenges of model fallibility and evaluation inadequacy, we propose a two-pronged approach. First, we introduce EV²-Bench, a large-scale benchmark that pioneers an evaluation paradigm centered on spatio-temporal visual evidence, compelling models to justify their answers with verifiable clues. Second, we propose DynamicSelect, an adaptive token compression framework that efficiently distills salient information using a dynamic semantic selector and a hierarchical compression strategy. Extensive experiments show that DynamicSelect substantially outperforms baselines on EV²-Bench and other public benchmarks. Our work provides not only a more effective method for long-video understanding but also a more rigorous evaluation paradigm, highlighting the path toward developing more robust and faithful models.

Seeing Is Believing: Grounding Long-Video Understanding in Spatio-Temporal Visual Evidence

Weakly-Supervised Video Anomaly Detection aims to identify anomalous events using only video-level labels, balancing annotation efficiency with practical applicability. However, existing methods often oversimplify the anomaly space by treating all abnormal events as a single category, overlooking the diverse semantic and temporal characteristics intrinsic to real-world anomalies. Inspired by how humans perceive anomalies, jointly interpreting the temporal motion patterns and semantic structures underlying different anomaly types, we propose RefineVAD, a novel framework that mimics this dual-process reasoning. Our framework integrates two core modules. The first, Motion-aware Temporal Attention and Recalibration (MoTAR), estimates motion salience and dynamically adjusts temporal focus via shift-based attention and global Transformer-based modeling. The second, Category-Oriented Refinement (CORE), injects soft anomaly category priors into the representation space by aligning segment-level features with learnable category prototypes through cross-attention. By jointly leveraging temporal dynamics and semantic structure, RefineVAD explicitly models both "how" motion evolves and "what" semantic category it resembles. Extensive experiments on WVAD benchmarks validate the effectiveness of RefineVAD and highlight the importance of integrating semantic context to guide feature refinement toward anomaly-relevant patterns.

RefineVAD: Semantic-Guided Feature Recalibration for Weakly Supervised Video Anomaly Detection

Existing tool-augmented large language models (LLMs) encounter significant challenges when processing complex queries. Current frameworks such as ReAct are prone to local optimization traps due to their reliance on incremental decision-making processes. To address these limitations, we propose a novel Planner-centric Plan-Execute paradigm that fundamentally resolves local optimization bottlenecks through architectural innovation. Central to our approach is a novel Planner model that performs global Directed Acyclic Graph (DAG) planning for complex queries, enabling optimized execution beyond conventional tool coordination. We also introduce ComplexTool-Plan, a large-scale benchmark dataset featuring complex queries that demand sophisticated multi-tool composition and coordination capabilities. Additionally, we develop a two-stage training methodology that integrates Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), systematically enhancing the Planner's tool selection accuracy and global planning awareness through structured DAG-based planning. When integrated with a capable executor, our framework achieves state-of-the-art performance on the StableToolBench benchmark for complex user queries, demonstrating superior end-to-end execution capabilities and robust handling of intricate multi-tool workflows.
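To show what executing a global DAG plan involves, the sketch below runs a toy plan in dependency order using Python's standard `graphlib` module. The plan schema (`tool` and `deps` keys), the tool names, and `execute_plan` are hypothetical; the abstract does not describe the Planner's output format or the executor.

```python
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict


def execute_plan(plan: Dict[str, Dict[str, Any]],
                 tools: Dict[str, Callable[..., Any]]) -> Dict[str, Any]:
    """Run every node of a DAG plan after all of its dependencies.

    plan maps node id -> {"tool": tool name, "deps": [node ids]} (assumed schema).
    Each tool receives the results of its dependencies as positional arguments.
    graphlib raises CycleError if the plan is not acyclic.
    """
    graph = {node: spec["deps"] for node, spec in plan.items()}
    results: Dict[str, Any] = {}
    for node in TopologicalSorter(graph).static_order():
        spec = plan[node]
        args = [results[dep] for dep in spec["deps"]]
        results[node] = tools[spec["tool"]](*args)
    return results


# Toy plan: two independent lookups feed one aggregation step.
toy_tools = {
    "price_a": lambda: 2,
    "price_b": lambda: 3,
    "add": lambda x, y: x + y,
}
toy_plan = {
    "a": {"tool": "price_a", "deps": []},
    "b": {"tool": "price_b", "deps": []},
    "total": {"tool": "add", "deps": ["a", "b"]},
}
plan_results = execute_plan(toy_plan, toy_tools)
```

Because the whole graph is known before execution starts, independent nodes such as `a` and `b` could also be dispatched in parallel, which an incremental step-by-step loop cannot plan for in advance.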
