AAAI 2026

January 24, 2026

Singapore, Singapore


Driven by advances in GANs and diffusion models, deepfake content has reached an unprecedented level of photorealism, causing detectors to deteriorate once they leave their training domain. Most prior studies adopt CLIP as the backbone of an image-level binary classifier, yet overlook CLIP's core strength: text-to-image semantic alignment. Moreover, captions generated by CLIP-CAP lack the high-level semantics needed to distinguish authentic from manipulated faces. Deepfake generators often fail to maintain semantic coherence, producing contradictions that purely visual models cannot capture. Existing approaches also mix all samples indiscriminately during training and thus lack a systematic, difficulty-aware curriculum. To bridge these gaps, we introduce Semantic- and Frequency-Enhanced (SAFE) deepfake detection, a two-component framework: 1) Semantic-enhanced multimodal alignment. Authenticity cues are injected into CLIP-CAP captions, and low-rank (LoRA) fine-tuning is applied to CLIP's visual branch, yielding dual supervision for text–image alignment and forgery discrimination. 2) Dual-score curriculum learning. Fourier Correlation Variance (FCV) measures local spectral consistency and, combined with the loss value, is converted into a difficulty score that ranks training samples from easy to hard, reducing training time by 23.3% and improving generalization. SAFE attains state-of-the-art performance on several cross-dataset and cross-manipulation benchmarks. Ablation studies confirm that semantic enhancement, LoRA fine-tuning, and the dual-score curriculum are complementary, jointly delivering substantial gains in open-set generalization.
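The abstract does not give the exact formula for Fourier Correlation Variance or the dual-score combination, so the following is only a hedged sketch of one plausible reading: split an image into patches, compare the patches' log-magnitude spectra via pairwise Pearson correlation, and take the variance of those correlations as a measure of local spectral (in)consistency; the difficulty score then blends this with the per-sample loss. All function names, the patch size, and the weighting `alpha` are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def fourier_correlation_variance(img, patch=32):
    """Hypothetical FCV sketch: variance of pairwise correlations
    between log-magnitude FFT spectra of non-overlapping patches."""
    H, W = img.shape
    specs = []
    for y in range(0, H - patch + 1, patch):
        for x in range(0, W - patch + 1, patch):
            p = img[y:y + patch, x:x + patch]
            mag = np.log1p(np.abs(np.fft.fft2(p))).ravel()
            # standardize so the dot product below is a Pearson correlation
            specs.append((mag - mag.mean()) / (mag.std() + 1e-8))
    S = np.stack(specs)                    # (num_patches, patch*patch)
    corr = S @ S.T / S.shape[1]            # pairwise patch-spectrum correlations
    iu = np.triu_indices_from(corr, k=1)   # upper triangle, excluding diagonal
    return float(np.var(corr[iu]))

def difficulty_score(fcv, loss, alpha=0.5):
    """Assumed dual-score form: a convex blend of spectral
    inconsistency and the current training loss."""
    return alpha * fcv + (1 - alpha) * loss
```

A curriculum scheduler would then sort the training set by `difficulty_score` in ascending order and feed easy samples first; the abstract's 23.3% training-time reduction refers to this easy-to-hard ordering, not to this particular scoring formula.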


