Singapore

Large Language Models (LLMs) often suffer from hallucinations and outdated or incomplete knowledge. Retrieval-Augmented Generation (RAG) is proposed to address these issues by integrating external knowledge like that in knowledge graphs (KGs) into LLMs. However, leveraging private KGs in RAG systems poses significant privacy risks due to the black-box nature of LLMs and potential insecure data transmission, especially when using third-party LLM APIs lacking transparency and control.
In this paper, we investigate the \emph{privacy-protected RAG scenario} for the first time, where entities in KGs are anonymous for LLMs, thus preventing them from accessing entity semantics. Due to the loss of semantics of entities, previous RAG systems cannot retrieve question-relevant knowledge from KGs by matching questions with the meaningless identifiers of anonymous entities. To realize an effective RAG system in this scenario, two key challenges must be addressed: (1) \emph{How can anonymous entities be converted into retrievable information}. (2) \emph{How to retrieve question-relevant anonymous entities}.

To address these challenges, we propose a novel \textbf{A}bstraction \textbf{R}easoning \textbf{o}n \textbf{G}raph (\textbf{ARoG}) framework including relation-centric abstraction and structure-oriented abstraction strategies. For challenge (1), the first strategy abstracts entities into high-level concepts by dynamically capturing the semantics of their adjacent relations. Hence, it supplements meaningful semantics which can further support the retrieval process. For challenge (2), the second strategy transforms unstructured natural language questions into structured abstract concept paths. These paths can be more effectively aligned with the abstracted concepts in KGs, thereby improving retrieval performance. In addition to guiding LLMs to effectively retrieve knowledge from KGs, these two abstraction strategies also strictly protect privacy from being exposed to LLMs. Experiments on three datasets demonstrate that ARoG achieves strong performance and privacy-robustness, establishing a new practical direction for privacy-protected RAG systems. The Code is available at https://github.com/NLPGM/ARoG.

AAAI 2026

Privacy-protected Retrieval-Augmented Generation for Knowledge Graph Question Answering

nlp: question answering

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.<br><br>

To access this event page, you need to log in with the **email address you registered with**. <br>Access credentials will be sent to your email from Underline -  subject line "Welcome to AAAI 2026". Please be sure to check your spam email folder if you do not see an email confirmation right away.

Please log in

To access this event page, you are required to register.
Please complete your registration to continue.

We recommend reading [**the registration information**](https://aaai.org/conference/aaai/aaai-26/registration/) first.

**Online Registration Form**: https://aaai.getregistered.net/conference-2026 

Registration Required

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

Image manipulation localization (IML) faces a fundamental trade-off between minimizing annotation cost and achieving fine-grained localization accuracy. 
Existing fully-supervised IML methods depend heavily on dense pixel-level mask annotations, which limits scalability to large datasets or real-world deployment. 
In contrast, the majority of existing weakly-supervised IML approaches are based on image-level labels, which greatly reduce annotation effort but typically lack precise spatial localization.
To address this dilemma, we propose a novel weakly-supervised IML framework that effectively balances annotation cost and localization performance. Specifically, we propose a coarse region annotation strategy, which can generate relatively accurate manipulation masks at lower cost. To improve model efficiency and facilitate deployment, we further design an efficient lightweight student model, which learns to perform fine-grained localization through knowledge distillation from a fixed teacher model based on the Segment Anything Model (SAM). 
Moreover, inspired by the human subconscious memory mechanism, our feature fusion module employs a dual-guidance strategy that actively contextualizes recalled prototypical patterns with real-time observational cues derived from the input. 
Instead of passive feature extraction, this strategy enables a dynamic process of knowledge recollection, where long-term memory is adapted to the specific context of the current image, significantly enhancing localization accuracy and robustness. 
Extensive experiments across both in-distribution and out-of-distribution datasets show that our method outperforms or rivals fully-supervised models, while maintaining strong generalization, low annotation cost, and efficient deployment characteristics. Our code will be released upon acceptance.

From Passive Perception to Active Memory: A Weakly Supervised Image Manipulation Localization Framework Driven by Coarse-Grained Annotations

Quantified formulas with Uninterpreted Functions (UFs) over non-linear real arithmetic pose fundamental challenges for Satisfiability Modulo Theories (SMT) solving. Traditional quantifier instantiation methods struggle because they lack semantic understanding of UF constraints, forcing them to search through unbounded solution spaces with limited guidance. We present AquaForte, a framework that leverages Large Language Models to provide semantic guidance for UF instantiation by generating instantiated candidates for function definitions that satisfy the constraints, thereby significantly reducing the search space and complexity for solvers. Our approach preprocesses formulas through constraint separation, uses structured prompts to extract mathematical reasoning from LLMs, and integrates the results with traditional SMT algorithms through adaptive instantiation. AquaForte maintains soundness through systematic validation: LLM-guided instantiations yielding SAT solve the original problem, while UNSAT results generate exclusion clauses for iterative refinement. Completeness is preserved by fallback to traditional solvers augmented with learned constraints. Experimental evaluation on SMT-COMP benchmarks demonstrates that AquaForte solves numerous instances where state-of-the-art solvers like Z3 and CVC5 timeout, with particular effectiveness on satisfiable formulas. Our work shows that LLMs can provide valuable mathematical intuition for symbolic reasoning, establishing a new paradigm for SMT constraint solving.

LLM-Guided Quantified SMT Solving over Uninterpreted Functions

Camera-based 3D semantic scene completion (SSC) provides dense geometric and semantic perception for autonomous driving and robotic navigation. However, existing methods rely on a coupled encoder to deliver both semantic and geometric priors, which forces the model to make a trade-off between conflicting demands and limits its overall performance. To tackle these challenges, we propose FoundationSSC, a novel framework that performs dual decoupling at both the source and pathway levels. At the source level, we introduce a foundation encoder that provides rich semantic feature priors for the semantic branch and high-fidelity stereo cost volumes for the geometric branch. At the pathway level, these priors are refined through specialised, decoupled pathways, yielding superior semantic context and depth distributions. Our dual-decoupling design produces disentangled and refined inputs, which are then utilised by a hybrid view transformation to generate complementary 3D features. Additionally, we introduce a novel Axis-Aware Fusion (AAF) module that addresses the often-overlooked challenge of fusing these features by anisotropically merging them into a unified representation. Extensive experiments demonstrate the advantages of FoundationSSC, achieving simultaneous improvements in both semantic and geometric metrics, surpassing prior bests by +0.23 mIoU and +2.03 IoU on SemanticKITTI. Additionally, we achieve state-of-the-art performance on SSCBench-KITTI-360, with 21.78 mIoU and 48.61 IoU.

Unleashing Semantic and Geometric Priors for 3D Scene Completion

Large language model-based multi-agent systems (LLM-MAS) effectively accomplish complex and dynamic tasks through inter-agent communication, but this reliance introduces substantial safety vulnerabilities. Existing attack methods targeting LLM-MAS either compromise agent internals or rely on direct and overt persuasion, which limit their effectiveness, adaptability, and stealthiness. In this paper, we propose MAST, a Multi-round Adaptive Stealthy Tampering framework designed to exploit communication vulnerabilities within the system. MAST integrates Monte Carlo Tree Search with Direct Preference Optimization to train an attack policy model that adaptively generates effective multi-round tampering strategies. Furthermore, to preserve stealthiness, we impose dual semantic and embedding similarity constraints during the tampering process. Comprehensive experiments across diverse tasks, communication architectures, and LLMs demonstrate that MAST consistently achieves high attack success rates while significantly enhancing stealthiness compared to baselines. These findings highlight the effectiveness, stealthiness, and adaptability of MAST, underscoring the need for robust communication safeguards in LLM-MAS.

Attack the Messages, Not the Agents: A Multi-round Adaptive Stealthy Tampering Framework for LLM-MAS

Diffusion planning is a promising method for learning high-performance policies from offline data. To avoid the impact of discrepancies between planning and reality on performance, previous works generate new plans at each time step. However, this incurs significant computational overhead and leads to lower decision frequencies, and frequent plan switching may also affect performance. In contrast, humans might create detailed short-term plans and more general, sometimes vague, long-term plans, and adjust them over time. Inspired by this, we propose the Temporal Diffusion Planner (TDP) which improves decision efficiency by distributing the denoising steps across the time dimension. TDP begins by generating an initial plan that becomes progressively more vague over time. At each subsequent time step, rather than generating an entirely new plan, TDP updates the previous one with a small number of denoising steps. This reduces the average number of denoising steps, improving decision efficiency. Additionally, we introduce an automated replanning mechanism to prevent significant deviations between the plan and reality. Experiments on D4RL show that, compared to previous works that generate new plans every time step, TDP significantly improves the decision-making frequency by 11-24.8 times while achieving higher or comparable performance.

Efficient Diffusion Planning with Temporal Diffusion

The sim-to-real gap, where agents trained in a simulator face significant performance degradation during testing, is a fundamental challenge in reinforcement learning. Extensive works adopt the framework of distributionally robust RL, to learn a policy that acts robustly under worst case environment shift. Within this framework, our objective is to devise algorithms that are sample efficient with interactive data collection and large state spaces. By assuming $d$-rectangularity of environment dynamic shift, we identify a fundamental hardness result for learning in online Markov game, and address it by adopting minimum value assumption. Then, a novel least square value iteration type algorithm, DR-CCE-LSI, with exploration bonus devised specifically for multiple agent, is proposed to find an $\\varepsilon-$approximate robust Coarse Correlated Equilibrium(CCE). To obtain sample efficient learning, we find that: when the feature mapping function satisfies certain properties, our algorithm, DR-CCE-LSI, is able to achieve $\\epsilon-$approximate CCE with a regret bound of $\\mathcal{O}\\{dH\\min\\{H,\\frac{1}{\\min\\{\\sigma_i\\}}\\}\\sqrt{K}\\}$, where $K$ is the number of interacting episodes, $H$ is the horizon length, $d$ is the feature dimension, and $\\sigma_i$ represents the uncertainty level of player $i$. Our work introduces the first sample-efficient algorithm for this setting, matches the best result so far in single agent setting, and achieves minimax optimal sample complexity in terms of the feature dimension $d$. Meanwhile, we also conduct simulation study to validate the efficacy of our algorithm in learning a robust equilibrium.

Distributionally Robust Online Markov Game with Linear Function Approximation

Multimodal keyphrase generation (MKP) aims to extract a concise set of keyphrases that capture the essential meaning of paired image–text inputs, enabling structured understanding, indexing, and retrieval of multimedia data across the web and social platforms. Success in this task demands effectively bridging the semantic gap between heterogeneous modalities. While multimodal large language models (MLLMs) achieve superior cross-modal understanding by leveraging massive pretraining on image-text corpora, we observe that they often struggle with modality bias and fine-grained intra-modal feature extraction. This oversight leads to a lack of robustness in real-world scenarios where multimedia data is noisy, along with incomplete or misaligned modalities. To address this problem, we propose AimKP, a novel framework that explicitly reinforces intra-modal semantic learning in MLLMs while preserving cross-modal alignment. AimKP incorporates two core innovations: (i) Progressive Modality Masking, which forces fine-grained feature extraction from corrupted inputs by progressively masking modality information during training; (ii) Gradient-based Filtering, that identifies and discards noisy samples, preventing them from corrupting the model’s core cross-modal learning. Extensive experiments validate AimKP’s effectiveness in multimodal keyphrase generation and its robustness across different scenarios.

Augmenting Intra-Modal Understanding in MLLMs for Robust Multimodal Keyphrase Generation

Ring artifacts are prevalent in 3D cone-beam computed tomography (CBCT) due to non-ideal responses of X-ray detectors, substantially affecting image quality and diagnostic reliability. Existing state-of-the-art (SOTA) ring artifact reduction (RAR) methods rely on supervised learning with large-scale paired CT datasets. While effective in-domain, supervised methods tend to struggle to fully capture the physical characteristics of ring artifacts, leading to pronounced performance drops in complex real-world acquisitions. Moreover, their scalability to 3D CBCT is limited by high memory demands. In this work, we propose Riner, a new unsupervised RAR method. Based on a theoretical analysis of ring artifact formation, we reformulate RAR as a multi-parameter inverse problem, where the non-ideal responses of X-ray detectors are parameterized as solvable physical variables. Using a new differentiable forward model, Riner can jointly learn the implicit neural representation of artifact-free images and estimate the physical parameters directly from CT measurements, without external training data. Additionally, Riner is memory-friendly due to its ray-based optimization, enhancing its usability in large-scale 3D CBCT. Experiments on both simulated and real-world datasets show Riner outperforms existing SOTA supervised methods. The code will be publicly released for improving reproducibility.

Unsupervised Multi-Parameter Inverse Solving for Reducing Ring Artifacts in 3D X-Ray CBCT

Internet memes serve as widely distributed multimodal social content that conveys complex ideas through metaphorical expressions, often containing harmful implications that make accurate harmful meme detection an important problem. Reasoning knowledge extracted from large language models plays a crucial role in recent advances in harmful meme detection. However, these methods only perform reasoning analysis on memes from a single opinion, ignoring that memes are essentially products of group consensus, where their true meaning interpretation highly depends on the collision and aggregation process of diverse user viewpoints. To address this problem, we propose a Social Graph of Thought Reasoning Enhancement (SGoTRE) framework for harmful meme detection. The SGoTRE contains three key steps: First, through multi-agent simulation technology, we obtain diverse chains of thought that represent the parsing logic of users from different backgrounds toward memes, authentically restoring the diversity characteristics of group cognition. Second, we construct a Social Graph of Thought (SGoT) that effectively integrates multi-chain reasoning processes and structurally expresses the consensus and diversity of viewpoints among users. Finally, we utilize the SGoT for cognitive distillation, internalizing multi-opinion reasoning logic into a single multimodal large model SGoT-R1 to achieve efficient and interpretable harmful meme detection. Experimental results show that SGoT-R1 significantly improves detection performance on mainstream datasets. Particularly on the most challenging FHM dataset, SGoT-R1 achieves an 8.9% improvement over state-of-the-art models.

SGoT-R1: Social Graph of Thought Reasoning-Enhanced Multimodal Large Language Model for Harmful Meme Detection

Text-to-image person re-identification (TIReID) aims to retrieve the most relevant pedestrian images from an image gallery based on natural language descriptions. Recent studies have achieved significant performance improvements by leveraging Masked Language Modeling (MLM) to align fine-grained information through local matching. However, in the text feature extraction, randomly masking text tokens may disrupt the semantic relationships between these local tokens, leading to feature misalignment; on the other hand, from an image feature perspective, redundant patches in pedestrian images hinder the information interaction across modalities. Moreover, the presence of noisy image-text pairs further complicates the learning process, as the model may be misled into recognizing incorrect patterns. To address these issues, we propose a robust fine-grained local alignment framework based on Key Phrase Dynamic Mask (KPDM). First, we strengthen the semantic relationships between text tokens by implementing a "adjective + noun" phrase-level masking strategy, mitigating local misalignment. Additionally, we integrate cross-layer importance estimation to highlight key pedestrian image representations while removing redundant image features. Building on this, we design a novel frequency-based masked language loss (FMLM) to supervise fine-grained semantic-level local alignment. Second, we propose a trusted consensus partitioning mechanism, utilizing intra-identity image-text similarity distributions to identify noisy pairs, enhancing the model robustness. Extensive experiments show that our method achieves 67.95\% Rank-1 and 51.88\% mAP on the RSTPReid dataset, exceeding the previous state-of-the-art by 2.6\% and 1\%. Furthermore, KPDM achieves Rank-1 accuracies of 75.97\% on the CUHK-PEDES dataset and 67.78\% on the ICFG-PEDES dataset, outperforming earlier methods.

Content not yet available

Next from AAAI 2026

From Passive Perception to Active Memory: A Weakly Supervised Image Manipulation Localization Framework Driven by Coarse-Grained Annotations

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES