Singapore

The release of open-weight large language models (LLMs) creates a tension between advancing accessible research and preventing misuse, such as malicious fine-tuning to elicit harmful content. Current safety measures struggle to preserve the general capabilities of the LLM while resisting a determined adversary with full access to the model's weights and architecture, who can use full-parameter fine-tuning to erase existing safeguards. To address this, we introduce AntiDote, a bi-level optimization procedure for training LLMs to be resistant to such tampering. AntiDote involves an auxiliary adversary hypernetwork that learns to generate malicious Low-Rank Adaptation (LoRA) weights conditioned on the defender model's internal activations. The defender LLM is then trained with an objective to nullify the effect of these adversarial weight additions, forcing it to maintain its safety alignment. We validate this approach against a diverse suite of 52 red-teaming attacks, including jailbreak prompting, latent space manipulation, and direct weight-space attacks. AntiDote is up to 27.4% more robust against adversarial attacks than both tamper-resistance and unlearning baselines. Crucially, this robustness is achieved with a minimal trade-off in utility, incurring a performance degradation of less than 0.5% across capability benchmarks including MMLU, HellaSwag, and GSM8K. Our work offers a practical and compute-efficient methodology for building open-weight models where safety is a more integral and resilient property.
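To make the notion of an "adversarial weight addition" concrete, here is a minimal sketch of how a LoRA update perturbs a single linear layer. It is not the AntiDote implementation: the helper names (`matmul`, `apply_lora`, `linear`), the scaling factor, and all numeric values are hypothetical, and the hypernetwork that would generate the `lora_b` and `lora_a` factors is not modeled.

```python
# Illustrative only: a LoRA update adds a low-rank product B @ A to a frozen weight W.
# All names and numbers below are assumptions, not taken from the paper.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def apply_lora(weight, lora_b, lora_a, scale=1.0):
    """Return W + scale * (B @ A), leaving the original weight untouched."""
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

def linear(weight, x):
    """Apply a bias-free linear layer to the vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]        # frozen 2x3 weight of the defender layer
B = [[1.0], [2.0]]           # 2x1 low-rank factor (rank 1)
A = [[0.5, 0.0, -0.5]]       # 1x3 low-rank factor

W_adv = apply_lora(W, B, A, scale=2.0)
x = [1.0, 2.0, 3.0]
print(linear(W, x))      # output of the original layer
print(linear(W_adv, x))  # output after the low-rank weight addition
```

In the setting the abstract describes, the defender would be trained so that its safety behavior changes as little as possible under such additions; the sketch only shows the weight-space perturbation itself.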

AAAI 2026

AntiDote: Bi-level Adversarial Training for Tamper-Resistant LLMs

nlp: ethics — bias

nlp: safety and robustness

nlp: (large) language models

transparency & privacy

fairness


poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.



Machine unlearning, which enables a model to forget specific data, is crucial for ensuring data privacy and model reliability. However, its effectiveness can be severely undermined in real-world scenarios where models learn unintended biases from spurious correlations within the data. This paper investigates the unique challenges of unlearning from such biased models. We identify a novel phenomenon we term "shortcut unlearning," where models exhibit an "easy to learn, yet hard to forget" tendency. Specifically, models struggle to forget easily-learned, bias-aligned samples; instead of forgetting the class attribute, they unlearn the bias attribute, which can paradoxically improve accuracy on the class intended to be forgotten. To address this, we propose CUPID, a new unlearning framework inspired by the observation that samples with different biases exhibit distinct loss landscape sharpness. Our method first partitions the forget set into causal- and bias-approximated subsets based on sample sharpness, then disentangles model parameters into causal and bias pathways, and finally performs a targeted update by routing refined causal and bias gradients to their respective pathways. Extensive experiments on biased datasets including Waterbirds, BAR, and Biased NICO++ demonstrate that our method achieves state-of-the-art forgetting performance and effectively mitigates the shortcut unlearning problem.

Easy to Learn, Yet Hard to Forget: Towards Robust Unlearning Under Bias

The code generation capabilities of LLMs have substantially improved the effectiveness of programming tasks. However, LLM-generated code still suffers from compilation and runtime errors. Existing offline preference optimization methods primarily focus on enhancing LLMs' coding abilities using pass/fail signals in the preference data, overlooking the deeper-level error types in the failed code. To address this, we propose Adaptively Progressive Preference Optimization (AP2O) for coding (i.e., AP2O-Coder), a method that guides LLMs adaptively and methodically to reduce errors in code generation. Specifically, we construct an error notebook from failed code and progressively optimize the LLM to correct errors type by type. Furthermore, we adaptively replay error types to match the LLM's changing weaknesses throughout the training process. Through extensive experiments on both code and general LLMs (Llama, Qwen, and DeepSeek series) with parameters ranging from 0.5B to 34B, our AP2O-Coder improves code generation performance by up to 3% in pass@k while using less preference data. The code is in the supplementary material.
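For readers unfamiliar with the pass@k metric mentioned above, the following sketch shows the commonly used unbiased estimator: given `n` sampled programs of which `c` pass the tests, it estimates the probability that at least one of `k` samples is correct. Whether AP2O-Coder uses exactly this estimator is an assumption; the abstract does not say.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate pass@k from n samples with c correct: 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical numbers: 10 samples for one problem, 3 of them pass.
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
```

A benchmark score is then the mean of this quantity over all problems.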

AP2O-Coder: Adaptively Progressive Preference Optimization for Reducing Compilation and Runtime Errors in LLM-Generated Code

The rapid advancement of generative models has produced highly realistic synthetic images, posing significant challenges to digital media authenticity. Existing AI-generated image detectors often fail to generalize to images from unseen generators when crossing architectural boundaries (i.e., Generative Adversarial Networks (GANs) vs. Diffusion Models (DMs)). We hypothesize that this generalization gap arises from fundamental differences in how these architectures generate images. In this work, we provide the first theoretical analysis explaining why GANs and DMs produce fundamentally different artifacts through the lens of the manifold hypothesis. We prove that GANs produce characteristic boundary artifacts from partial manifold coverage, while DMs exhibit over-smoothing and unique noise patterns due to the need for complete coverage. Motivated by this theoretical finding, we propose a novel semi-supervised detection approach called Triarchy Detector (TriDetect) that enhances standard binary classification with an architecture-aware clustering loss. Specifically, instead of producing binary classification heads, the architecture-aware classifier generates distinct logits for both real images and multiple fake clusters. To prevent the problem of cluster collapse in unsupervised learning scenarios, we implement balanced cluster assignment through the Sinkhorn-Knopp algorithm. Furthermore, we design a cross-view consistency mechanism to ensure that the model learns discriminative features that capture architectural patterns rather than image statistics. By learning to recognize architectural patterns that persist across different generators within the same family, our method achieves superior generalization to unseen generators.
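The balanced cluster assignment step can be illustrated with a minimal Sinkhorn-Knopp sketch. It assumes a matrix of per-sample cluster scores and alternately rescales columns and rows so that every cluster receives the same total mass; the function name, the temperature `epsilon`, the iteration count, and the scores are illustrative choices, not values from the paper.

```python
import math

def sinkhorn_knopp(scores, epsilon=0.5, n_iters=50):
    """Turn an n x k score matrix into soft assignments with balanced clusters.

    Each returned row sums to 1 and, once the iteration has converged,
    each column sums to roughly n / k.
    """
    n, k = len(scores), len(scores[0])
    top = max(max(row) for row in scores)  # subtract the max for numerical stability
    q = [[math.exp((s - top) / epsilon) for s in row] for row in scores]
    for _ in range(n_iters):
        # Column step: give every cluster a total mass of 1 / k.
        col_sums = [sum(q[i][j] for i in range(n)) for j in range(k)]
        q = [[q[i][j] / (col_sums[j] * k) for j in range(k)] for i in range(n)]
        # Row step: give every sample a total mass of 1 / n.
        row_sums = [sum(row) for row in q]
        q = [[v / (row_sums[i] * n) for v in row] for i, row in enumerate(q)]
    # Rescale so that each row is a probability distribution over clusters.
    return [[v * n for v in row] for row in q]

# Every sample prefers cluster 0, which a plain argmax would collapse onto.
scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
for row in sinkhorn_knopp(scores):
    print([round(v, 3) for v in row])
```

Because the columns are forced to carry equal mass, the samples with the weakest preference for cluster 0 are pushed toward cluster 1, which is the mechanism that counters cluster collapse.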

Beyond Binary Classification: A Semi-supervised Approach to Generalized AI-generated Image Detection

To improve the Multi-step Mathematical Reasoning (MsMR) of Large Language Models (LLMs), it is crucial to obtain scalable supervision from the corpus by automatically critiquing mistakes in the MsMR reasoning process and rendering a final verdict on the problem-solution. Most existing methods rely on crafting high-quality supervised fine-tuning demonstrations to enhance critiquing capability and pay little attention to the underlying reason for the poor critiquing performance of LLMs. Orthogonally to these methods, we quantify and investigate a potential reason, imbalanced evaluation preference, and conduct a statistical preference analysis. Motivated by this analysis, we propose a novel perplexity-aware reinforcement learning algorithm that rectifies the evaluation preference and thereby improves critiquing capability. Specifically, to probe LLMs' critiquing characteristics, we meticulously construct a One-to-many Problem-Solution (OPS) benchmark to quantify the difference in an LLM's behavior when evaluating problem solutions generated by itself and by others. Then, to investigate this difference in depth, we conduct a statistical preference analysis based on perplexity and find an intriguing phenomenon: "LLMs incline to judge solutions with lower perplexity as correct", which we dub imbalanced evaluation preference. To rectify this preference, we use perplexity as the baton in Group Relative Policy Optimization, encouraging the LLMs to explore trajectories that judge lower-perplexity solutions as wrong and higher-perplexity solutions as correct. Extensive experimental results on our OPS benchmark and existing critic benchmarks demonstrate the validity of our method.
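Since the analysis above hinges on perplexity, a short sketch of the standard definition may help: perplexity is the exponential of the negative mean token log-probability. The function name and the log-probabilities below are made up for illustration; how the paper feeds perplexity into the reinforcement learning objective is not modeled here.

```python
import math

def perplexity(token_logprobs):
    """Perplexity of a sequence from its per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical solutions: one the model finds predictable, one it finds surprising.
familiar_solution = [math.log(0.9)] * 6
surprising_solution = [math.log(0.2)] * 6

print(perplexity(familiar_solution))    # lower perplexity
print(perplexity(surprising_solution))  # higher perplexity
```

Under the imbalanced evaluation preference the abstract describes, a critic would lean toward accepting the first solution regardless of whether it is actually correct.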

Rectify Evaluation Preference: Improving LLMs’ Critique on Math Reasoning via Perplexity-aware Reinforcement Learning

State-of-the-art Diffusion Models (DMs) produce highly realistic images. While prior work has successfully mitigated Not Safe For Work (NSFW) content in the visual domain, we identify a novel threat: the generation of NSFW text embedded within images. This includes offensive language, such as insults, racial slurs, and sexually explicit terms, posing significant risks to users. We show that all state-of-the-art DMs (e.g., SD3, SDXL, Flux, DeepFloyd IF) are vulnerable to this issue. Through extensive experiments, we demonstrate that existing mitigation techniques, effective for visual content, fail to prevent harmful text generation while substantially degrading benign text generation. As an initial step toward addressing this threat, we introduce a novel fine-tuning strategy that targets only the text-generation layers in DMs. To this end, we construct a safety fine-tuning dataset by pairing each NSFW prompt with two images: one with the NSFW term, and another where that term is replaced with a carefully crafted benign alternative while leaving the image otherwise unchanged. By training on this dataset, the model learns to avoid generating harmful text while preserving benign content and overall image quality. Finally, to advance research in the area, we release ToxicBench, an open-source benchmark for evaluating NSFW text generation in images. It includes our curated fine-tuning dataset, a set of harmful prompts, new evaluation metrics, and a pipeline that assesses NSFW-ness as well as text and image quality. Our benchmark aims to guide future efforts in mitigating NSFW text generation in text-to-image models, thereby contributing to their safe deployment.

Beautiful Images, Toxic Words: Understanding and Addressing Offensive Text in Generated Images

Multi-view clustering (MVC) aims to enhance clustering performance by integrating complementary information from diverse sources. Existing deep MVC methods often face a trade-off between learning shared consensus representations and preserving view-specific characteristics: they either employ separate encoders that limit collaboration or rely on a single shared encoder at the expense of diversity. Recently, Mixture-of-Experts (MoE) models have been introduced to MVC to facilitate cooperation, but their flattened expert pool design leads to entangled shared and specific information, while their routing mechanism overlooks valuable cross-view context. To address these challenges, we propose a novel framework, Decoupled Mixture-of-Experts with Context-Aware Routing (DMCAR). First, we design a Decoupled MoE (D-MoE) architecture comprising a public expert pool for learning shared representations and private expert pools for capturing unique information from each view, structurally enforcing representation decoupling. Second, we introduce a Context-Aware Hierarchical Routing (CAHR) mechanism that leverages a global context vector to guide routing decisions when selecting experts from the shared pool, enabling more intelligent cross-view collaboration. Finally, we adopt a multi-level contrastive learning paradigm, enforcing semantic consistency in shared representations through a cross-view alignment loss while promoting decoupling between shared and specific representations via orthogonality constraints. Extensive experiments on multiple benchmark datasets demonstrate that DMCAR significantly outperforms state-of-the-art methods across various clustering metrics and validate the effectiveness of each component in our framework.

DMCAR: Disentangled Mixture-of-Experts with Context-Aware Routing for Multi-View Clustering

Partially observable Markov decision processes (POMDPs) are a central model for uncertainty in sequential decision making. 
The most basic objective is the reachability objective, where a target set must be eventually visited, and the more general parity objectives can model all $\omega$-regular specifications.
For such objectives, the computational analysis problems are the following: 
(a) qualitative analysis that asks whether the objective can be satisfied with probability $1$ (almost-sure winning) or probability arbitrarily close to $1$ (limit-sure winning);
and (b) quantitative analysis that asks for the approximation of the optimal probability of satisfying the objective.
For general POMDPs, almost-sure analysis for reachability objectives is EXPTIME-complete, but limit-sure and quantitative analyses for reachability objectives are undecidable; almost-sure, limit-sure, and quantitative analyses for parity objectives are all undecidable.
A special class of POMDPs, called revealing POMDPs, has been studied recently in several works, and for this subclass the almost-sure analysis for parity objectives was shown to be EXPTIME-complete.
In this work, we show that for revealing POMDPs the limit-sure analysis for parity objectives is EXPTIME-complete, and even the quantitative analysis for parity objectives can be achieved in EXPTIME.

Revealing POMDPs: Qualitative and Quantitative Analysis for Parity Objectives

Deep hashing offers efficient storage and fast retrieval capabilities. As a result, it has been extensively applied to large‑scale retrieval tasks. To alleviate the dependence on high-quality annotated data, recent research has focused on unsupervised domain adaptive hashing methods, which aim to transfer knowledge from a label-rich source domain to a label-scarce target domain. However, in open-world scenarios, source domain labels are often inevitably noisy, which tends to undermine the quality of learned hash codes and induce considerable performance deterioration. To this end, we introduce a novel Robust Domain Adaptive Hashing (RDAH) method to jointly mitigate the adverse effects of label noise and domain discrepancy. Specifically, we first model the loss distribution of training samples using a two-component Gaussian mixture model to estimate each sample’s confidence, based on which the data is partitioned. Subsequently, we introduce a neighbor consistency-guided correction strategy, which leverages the semantic structure of high-confidence neighbors to perform weighted correction on noisy samples. Moreover, we design a dual-level cross-domain alignment mechanism that jointly mitigates domain shift from two complementary perspectives. Extensive experimental results validate the effectiveness and robustness of RDAH across multiple benchmark datasets.
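The confidence estimation step can be sketched as fitting a two-component, one-dimensional Gaussian mixture to per-sample losses with EM and reading off the posterior of the low-loss component. This is a generic illustration under stated assumptions, not the RDAH implementation: the function names, initialization, variance floor, and loss values are all hypothetical, and the sketch omits the numerical safeguards a production version would need.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_two_component_gmm(losses, n_iters=50, var_floor=1e-4):
    """Fit a 2-component 1D Gaussian mixture with plain EM."""
    weights = [0.5, 0.5]
    means = [min(losses), max(losses)]  # start one component at each extreme
    variances = [1.0, 1.0]
    for _ in range(n_iters):
        # E-step: posterior responsibility of each component for each loss.
        resp = []
        for x in losses:
            p = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(2)]
            total = p[0] + p[1]
            resp.append([p[0] / total, p[1] / total])
        # M-step: re-estimate mixture weights, means, and variances.
        for j in range(2):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(losses)
            means[j] = sum(r[j] * x for r, x in zip(resp, losses)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, losses)) / nj
            variances[j] = max(var, var_floor)  # floor avoids a collapsing component
    return weights, means, variances

def clean_probability(x, weights, means, variances):
    """Posterior probability that loss x belongs to the low-mean component."""
    low = 0 if means[0] <= means[1] else 1
    p = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(2)]
    return p[low] / (p[0] + p[1])

# Hypothetical losses: four well-fit samples and four likely mislabeled ones.
losses = [0.10, 0.12, 0.09, 0.11, 2.0, 2.1, 1.9, 2.2]
params = fit_two_component_gmm(losses)
confident = [x for x in losses if clean_probability(x, *params) > 0.5]
print(confident)
```

Thresholding the posterior partitions the data into a confident subset and a suspect subset; the abstract's neighbor-guided correction would then operate on the suspect samples.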

Robust Domain Adaptive Hashing via Structural Noise Modeling and Correction

In some professional sports leagues, inter-league games are scheduled among multiple divisions or conferences. This inspired us to study the $p$-partite Traveling Tournament Problem ($p$-partite TTP), where teams are partitioned into $p$ leagues, and each team plays games against teams from different leagues. Previously, only the case of $p=2$, known as the Bipartite TTP or BTTP, has been introduced and studied (AAAI 2011 and IJCAI 2024).

In this paper, we show that the $p$-partite TTP is NP-hard for any fixed $p \geq 3$, and we propose an efficient algorithm based on a solution to the TSP. Furthermore, we prove that the algorithm achieves a novel approximation ratio of $\frac{8}{3} + O(\frac{1}{n})$ when $p=3$. We also conduct experiments demonstrating that the algorithm produces practical schedules with significantly reduced total travel distances, highlighting its effectiveness in generating high-quality multipartite tournament schedules.

A TSP-Based Algorithm for Multi-League Traveling Tournament

Efficient and lightweight adaptation of pre-trained Vision-Language Models (VLMs) to downstream tasks through collaborative interactions between local clients and a central server is a rapidly emerging research topic in federated learning.
Existing adaptation algorithms are typically trained iteratively, which incurs significant communication costs and increases susceptibility to potential attacks.
Motivated by one-shot federated training techniques that reduce client-server exchanges to a single round, developing a lightweight one-shot federated VLM adaptation method to alleviate these issues is particularly attractive.
However, current one-shot approaches face several challenges in adapting VLMs within federated settings: (1) insufficient exploitation of the rich multimodal information inherent in VLMs; (2) a lack of specialized adaptation strategies to systematically handle severe data heterogeneity; and (3) a reliance on additional training resources on the client or server side.
To bridge these gaps, we propose a novel Training-free One-shot Federated Adaptation framework for VLMs, named TOFA. 
To fully leverage the generalizable multimodal features in pre-trained VLMs, TOFA employs both visual and textual pipelines to extract task-relevant representations.
In the visual pipeline, a hierarchical Bayesian model learns personalized, class-specific prototype distributions. For the textual pipeline, TOFA evaluates and globally aligns the generated local text prompts for robustness. An adaptive weight calibration mechanism is also introduced to combine predictions from both modalities, balancing personalization and robustness to handle data heterogeneity.
Our method is training-free, not relying on additional training resources on either the client or server side.
Extensive experiments across 9 datasets in various federated settings demonstrate the effectiveness of the proposed TOFA method.

