With the widespread use of LLMs, preserving privacy in user prompts has become crucial, as prompts risk exposing private and sensitive data to cloud LLMs. Conventional techniques like homomorphic encryption (HE), secure multi-party computation, and federated learning (FL) are not well-suited to this scenario due to the lack of control over user participation in remote model interactions. In this paper, we propose PromptObfus, a novel method for desensitizing LLM prompts. The core idea of PromptObfus is "anti-adversarial" learning, which perturbs sensitive words in the prompt to obscure private information while retaining the stability of model predictions. Specifically, PromptObfus frames prompt desensitization as a masked language modeling task, replacing privacy-sensitive terms with a MASK token. A desensitization model is utilized to generate candidate replacements for each masked position. These candidates are subsequently selected based on gradient feedback from a surrogate model, ensuring minimal disruption to the task output. We demonstrate the effectiveness of our approach on three NLP tasks. Results show that PromptObfus effectively prevents privacy inference from remote LLMs while preserving task performance. Our code is publicly available at https://anonymous.4open.science/r/PromptObfus-BF36/.
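The masking-and-selection pipeline described above can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: the helper names (`mask_sensitive`, `select_replacements`) and the toy disruption scores are hypothetical stand-ins for the desensitization model's candidate generation and the surrogate model's gradient feedback.

```python
# Hypothetical sketch of a PromptObfus-style desensitization loop.
# A real system would generate candidates with a masked language model
# and score them via gradients from a surrogate task model.

MASK = "[MASK]"

def mask_sensitive(tokens, sensitive):
    """Replace privacy-sensitive tokens with a MASK placeholder."""
    return [MASK if t in sensitive else t for t in tokens]

def select_replacements(masked_tokens, candidates, disruption):
    """For each masked position, pick the candidate with the lowest
    task-disruption score (stand-in for surrogate gradient feedback)."""
    out = []
    for i, tok in enumerate(masked_tokens):
        if tok == MASK:
            out.append(min(candidates[i], key=lambda c: disruption(i, c)))
        else:
            out.append(tok)
    return out

# Toy example: "Alice" is the sensitive word; scores are invented.
tokens = ["Alice", "loves", "hiking"]
masked = mask_sensitive(tokens, {"Alice"})
cands = {0: ["She", "Bob", "Someone"]}
toy_scores = {"She": 0.2, "Bob": 0.5, "Someone": 0.1}
desensitized = select_replacements(masked, cands, lambda i, c: toy_scores[c])
print(" ".join(desensitized))  # lowest-disruption candidate fills the mask
```

In the full method, the disruption score for a candidate reflects how much swapping it in would perturb the surrogate model's task output, so the selected replacement hides the private term while leaving the downstream prediction stable.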
