Singapore

The generalization capability of deepfake detectors is crucial for real-world applications. Data augmentation to generate synthetic fake faces has served as an effective strategy to enhance generalization. Interestingly, current state-of-the-art (SoTA) methods rely on fixed augmentation strategies, raising a fundamental question: Can a single static augmentation approach suffice, or does the diversity of forgery features necessitate dynamic strategies? We argue that existing methods overlook the evolving complexity of real-world forgery patterns, such as facial warping, expression manipulation, and compression artifacts, which cannot be fully simulated by fixed policies.
To bridge this gap, we propose CRDA (Curriculum Reinforcement-Learning Data Augmentation), a novel framework that guides the detector to progressively master multi-domain forgery features from simple to complex. CRDA synthesizes augmented samples using a configurable pool of forgery operations and dynamically generates adversarial samples tailored to the detector’s current learning state. 
Key to our approach is the integration of reinforcement learning (RL) and causal inference. To efficiently explore the vast augmentation space, an RL agent dynamically selects augmentation actions based on the detector’s performance, ensuring continuous adaptation to increasingly challenging forgeries. Simultaneously, the agent’s output is designed to introduce variations in action spaces, generating heterogeneous forgery patterns. These variations are guided by causal inference theory, which mitigates spurious correlations by suppressing task-irrelevant biases and forcing the model to focus on causally invariant features. This integration ensures robust generalization by decoupling synthetic augmentation patterns from the model’s learned representations.
Extensive experiments demonstrate that the proposed method significantly improves the generalizability of the detector, achieving superior performance compared to state-of-the-art methods on multiple cross-domain datasets. Code is available in the supplementary material.
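The idea of selecting augmentations according to the detector's current performance can be illustrated with a much simpler stand-in than the RL agent described in the abstract. The sketch below uses an epsilon-greedy bandit, which is a substitute technique and not the CRDA method. The operation names (`blend`, `warp`, `compress`), the class `AugmentationBandit`, and the function `simulated_detector_loss` are all hypothetical; in a real setup the reward would come from evaluating the detector on an augmented batch.

```python
import random


class AugmentationBandit:
    """Epsilon-greedy selector over a pool of augmentation operations (illustrative only)."""

    def __init__(self, operations, epsilon=0.1, seed=0):
        self.operations = list(operations)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {op: 0 for op in self.operations}
        self.values = {op: 0.0 for op in self.operations}

    def select(self):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.operations)
        return max(self.operations, key=lambda op: self.values[op])

    def update(self, operation, reward):
        # Incremental running mean of the rewards observed for this operation.
        self.counts[operation] += 1
        self.values[operation] += (reward - self.values[operation]) / self.counts[operation]


def simulated_detector_loss(operation, rng):
    # Hypothetical stand-in for the detector's loss on a sample augmented
    # with `operation`; the numbers are made up for the demonstration.
    base_loss = {"blend": 0.2, "warp": 0.5, "compress": 0.8}
    return base_loss[operation] + rng.uniform(-0.05, 0.05)


def run_selection(steps=500, seed=0):
    rng = random.Random(seed)
    bandit = AugmentationBandit(["blend", "warp", "compress"], epsilon=0.1, seed=seed)
    for _ in range(steps):
        operation = bandit.select()
        # Reward augmentations that the detector still finds hard.
        bandit.update(operation, simulated_detector_loss(operation, rng))
    return bandit
```

Calling `run_selection()` returns the bandit, whose `values` dictionary holds the estimated difficulty of each operation. Using the detector's loss as the reward is one simple way to steer sampling toward harder forgeries; the abstract's curriculum and causal components are not modeled here.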

AAAI 2026

Improving Deepfake Detection with Reinforcement Learning-Based Adaptive Data Augmentation

app: security; cv: applications; cv: bias, fairness & privacy

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.


The performance of offline reinforcement learning is significantly impacted by state distributional shift, and out-of-distribution (OOD) state correction is a popular approach to address this problem. However, previous methods correct the agent's transition distributions in a supervised way, which significantly degrades flexibility and robustness. In this paper, we propose a novel method named Density-Aware Safety Perception (DASP) for OOD state correction. Specifically, our method encourages the agent to prioritize actions that lead to outcomes with higher data density, thereby promoting operation within, or a return to, in-distribution (safe) regions. To achieve this, we optimize the objective within a variational framework that jointly considers the potential outcomes of decision-making and their density, thus providing crucial contextual information for safe decision-making. Finally, we validate the effectiveness and feasibility of our proposed method through extensive experimental evaluations on the offline MuJoCo and AntMaze suites.

Variational OOD State Correction for Offline Reinforcement Learning
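The notion of preferring actions whose outcomes fall in high-density regions of the data can be shown with a toy example. The sketch below is not the DASP objective or its variational formulation; it is a purely density-greedy rule on a 1-D state space, using a Gaussian kernel density estimate. The function names, the additive `dynamics` model, and the bandwidth are all assumptions made for illustration, and returns are ignored entirely.

```python
import math


def kde_density(query, samples, bandwidth=0.5):
    """Gaussian kernel density estimate of a 1-D `query` point under `samples`."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(
        math.exp(-0.5 * ((query - s) / bandwidth) ** 2) for s in samples
    )


def density_aware_action(state, actions, dynamics, dataset_states):
    """Pick the action whose predicted next state has the highest data density."""
    return max(
        actions,
        key=lambda a: kde_density(dynamics(state, a), dataset_states),
    )


def additive_dynamics(state, action):
    # Hypothetical known dynamics: the action shifts the state directly.
    return state + action
```

With dataset states clustered around zero, an agent starting at an OOD state such as 3.0 picks the action that moves it back toward the data, while an agent already at 0.0 stays put.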

Text-to-motion generation, which synthesizes 3D human motions from text inputs, holds immense potential for applications in gaming, film, and robotics. Recently, diffusion-based methods have been shown to generate more diverse and realistic motions. However, there exists a misalignment between text and motion distributions in diffusion models, which leads to semantically inconsistent or low-quality motions. To address this limitation, we propose Reward-guided sampling Alignment (ReAlign), comprising a step-aware reward model to assess alignment quality during the denoising sampling and a reward-guided strategy that directs the diffusion process toward an optimally aligned distribution. This reward model integrates step-aware tokens and combines a text-aligned module for semantic consistency and a motion-aligned module for realism, refining noisy motions at each timestep to balance probability density and alignment. Extensive experiments on both motion generation and retrieval tasks demonstrate that our approach significantly improves text-motion alignment and motion quality compared to existing state-of-the-art methods.

ReAlign: Text-to-Motion Generation via Step-Aware Reward-Guided Alignment

Low-rank tensor decompositions (TDs) provide an effective framework for multiway data analysis. Traditional TD methods rely on predefined structural assumptions, such as CP or Tucker decompositions. From a probabilistic perspective, these methods effectively model the relationships between latent factors and the low-rank tensor using Dirac delta distributions. However, tensor low-rank decomposition is inherently non-unique, leading to a multimodal distribution over possible solutions. Critically, such prior knowledge is rarely available in practical scenarios, particularly regarding the optimal rank structure and contraction rules. To address this issue, we propose a score-based model that eliminates the need for predefined structural or distributional assumptions, enabling the learning of compatibility between tensors and latent factors. Specifically, a neural network is designed to learn the energy function, which is optimized via score matching to capture the gradient of the joint log-probability of tensor entries and latent factors. Our method allows for modeling structures and distributions beyond the Dirac delta assumption. Moreover, integrating the block coordinate descent (BCD) algorithm with the proposed smooth regularization enables the model to perform both tensor completion and denoising. Experimental results demonstrate significant performance improvements across various tensor types, including sparse and continuous-time tensors, as well as visual data.

Score-Based Model for Low-Rank Tensor Recovery

Point cloud completion aims to recover missing geometric structures from incomplete 3D scans, which often suffer from occlusions or limited sensor viewpoints. Existing methods typically assume fixed input/output densities or rely on image-based representations, making them less suitable for real-world scenarios with variable sparsity and limited supervision. In this paper, we introduce Density-agnostic and Class-aware Network (DANCE), a novel framework that completes only the missing regions while preserving the observed geometry. DANCE generates candidate points via ray-based sampling from multiple viewpoints. A transformer decoder then refines their positions and predicts opacity scores, which determine the validity of each point for inclusion in the final surface. To incorporate semantic guidance, a lightweight classification head is trained directly on geometric features, enabling category-consistent completion without external image supervision. Extensive experiments on the PCN and MVP benchmarks show that DANCE outperforms state-of-the-art methods in accuracy and structural consistency, while remaining robust to varying input densities and noise levels.

DANCE: Density-agnostic and Class-aware Network for Point Cloud Completion

Improving the diversity of generated results while maintaining high visual quality remains a significant challenge in image generation tasks. Fractal Generative Models (FGMs) are efficient in generating high-quality images, but their inherent self-similarity limits the diversity of output images. To address this issue, we propose a novel approach based on the Hausdorff Dimension (HD), a widely recognized concept in fractal geometry for quantifying structural complexity, which aids in enhancing the diversity of generated outputs. To incorporate HD into FGM, we propose a learnable HD estimation method that predicts HD directly from image embeddings, addressing computational cost concerns and enabling efficient integration into generative modeling. Moreover, simply introducing HD as an auxiliary loss is insufficient to enhance diversity in FGMs. To this end, during training, we adopt an HD-based loss with a momentum-driven weighting strategy to progressively optimize hyperparameters and achieve the best diversity without sacrificing visual quality. In addition, during inference, we employ HD-guided rejection sampling to select geometrically richer outputs. Extensive experiments on the ImageNet dataset demonstrate that our FGM-HD framework yields a 39% improvement in output diversity compared to the baseline fractal model, while preserving comparable image quality. To our knowledge, this is the first work to introduce the Hausdorff Dimension into FGM. Our method effectively enhances the diversity of generated outputs while offering a principled theoretical contribution to the development of fractal-based generative models.

FGM-HD: Boosting Generation Diversity of Fractal Generative Models through Hausdorff Dimension Induction
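For readers unfamiliar with fractal dimension estimates, the sketch below computes the box-counting dimension of a 2-D point set, a commonly used computable proxy that upper-bounds the Hausdorff dimension. This is not the learnable HD estimator from the abstract, which predicts HD from image embeddings; the function names and the choice of box sizes are assumptions for illustration.

```python
import math


def box_count(points, box_size):
    """Number of box_size x box_size grid cells containing at least one point."""
    return len({(x // box_size, y // box_size) for x, y in points})


def box_counting_dimension(points, box_sizes):
    """Least-squares slope of log N(s) against log(1/s) over the given box sizes."""
    xs = [math.log(1.0 / s) for s in box_sizes]
    ys = [math.log(box_count(points, s)) for s in box_sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


# Two reference shapes on a 64 x 64 integer grid.
filled_square = [(x, y) for x in range(64) for y in range(64)]
horizontal_line = [(x, 0) for x in range(64)]
sizes = [1, 2, 4, 8, 16]
```

On these inputs the estimate is approximately 2.0 for the filled square and 1.0 for the line, matching their topological dimensions. Shapes with richer structure at multiple scales land between these values, which is the kind of complexity signal the abstract associates with more diverse outputs.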

We introduce Style4D-Bench, the first benchmark suite specifically designed for 4D stylization, with the goal of standardizing evaluation and facilitating progress in this emerging area. Style4D-Bench comprises: 1) a strong baseline that makes an initial attempt at 4D stylization, 2) a comprehensive evaluation protocol measuring spatial fidelity, temporal coherence, and multi-view consistency through both perceptual and quantitative metrics, and 3) a curated collection of high-resolution dynamic 4D scenes with diverse motions and complex backgrounds. To establish a strong baseline, we present Style4D, a novel framework built upon 4D Gaussian Splatting. It consists of three key components: a basic 4DGS scene representation to capture reliable geometry, a Style Gaussian Representation that leverages lightweight per-Gaussian MLPs for temporally and spatially aware appearance control, and a Holistic Geometry-Preserved Style Transfer module designed to enhance spatio-temporal consistency via contrastive coherence learning and structural content preservation. Extensive experiments on Style4D-Bench demonstrate that Style4D achieves state-of-the-art performance in 4D stylization, producing fine-grained stylistic details with stable temporal dynamics and consistent multi-view rendering. We expect Style4D-Bench to become a valuable resource for benchmarking and advancing research in stylized rendering of dynamic 3D scenes. Please refer to the supplementary material for a demo video.

Style4D-Bench: A Benchmark Suite for 4D Stylization

Understanding motion is crucial for visual object tracking in complex and dynamic motion scenarios. However, existing methods often rely on simple template updates or temporal feature propagation, neglecting the effective mining and utilization of motion information. To address this issue, we propose a motion-aware spatio-temporal framework that achieves motion perception by explicitly matching motion patterns and modeling motion relationships between frames. Specifically, our method introduces a motion pattern dictionary that encodes diverse and representative motion patterns as learnable features, enabling effective motion modeling. During tracking, features from the search region retrieve the most relevant motion patterns from the dictionary to capture current motion dynamics. The decoder then integrates temporal motion correlations for enhanced motion awareness. Additionally, we incorporate geometric cues into the search region features to enhance spatial perception, mitigate occlusion-induced ambiguity, and improve foreground-background separation. Extensive experiments on seven challenging benchmarks show that our approach consistently outperforms existing methods, confirming the effectiveness of motion pattern modeling and geometry-guided enhancement in alleviating tracking drift. Our MoDTrack achieves a 1.2% higher AUC score on the LaSOT benchmark compared to the latest state-of-the-art methods, further validating the superiority of our approach.

Motion-Aware Object Tracking via Motion and Geometry-Aware Cues

Microvascular invasion (MVI) is a critical prognostic factor that significantly impacts postoperative outcomes in hepatocellular carcinoma (HCC). As the current gold standard for the diagnosis of MVI is based on the postoperative histopathological examination of whole slide images, accurate preoperative prediction of MVI status using magnetic resonance imaging (MRI) presents both a substantial clinical imperative and a significant challenge. To discover reliable MRI-based imaging biomarkers that support clinical decision making and enhance the interpretability of deep learning-based diagnostic models, we propose a novel interpretable MVI prediction framework in which shared latent visual attributes are first learned and then used for both potential imaging biomarker extraction and MVI diagnosis. To ensure that the visual attributes of these biomarkers are generalizable across diverse patients, similarity constraints at the intra-patient level and the inter-patient level are enforced within the learned feature space, enabling intuitive biomarker discovery directly from the original image space. To guarantee semantic alignment between biomarkers and the characteristics of individual patients, we introduce a novel classification mechanism that directly links the alignment between each biomarker and patient-specific characteristics with the prediction outcome, thereby ensuring a precise prediction of MVI. Furthermore, the interpretability of the model is enhanced by integrating a mask-based visual explanation method that highlights regions in patient images that correspond to the identified biomarkers. Extensive experiments on two MVI prediction datasets, HCC-WCH and HCC-ZSH, demonstrate our method's superior performance in both classification accuracy and interpretability. Our code will be made publicly available shortly after publication.

Learning Latent Imaging Biomarkers for Interpretable Microvascular Invasion Prediction in Hepatocellular Carcinoma

Despite the tremendous success of neural networks, benign images can be corrupted by adversarial perturbations to deceive these models. Intriguingly, images differ in their attackability. Specifically, given an attack configuration, some images are easily corrupted, whereas others are more resistant. Evaluating image attackability has important applications in active learning, adversarial training, and attack enhancement. This has prompted growing interest in developing attackability measures. However, existing methods are scarce and suffer from two major limitations: (1) They rely on a model proxy to provide prior knowledge (e.g., gradients or minimal perturbation) to extract model-dependent image features. Unfortunately, in practice, many task-specific models are not readily accessible. (2) Extracted features characterizing image attackability lack visual interpretability, obscuring their direct relationship with the images. To address these limitations, we propose **Object Texture Intensity (OTI)**, a novel model-free and visually interpretable measure that quantifies image attackability as the texture intensity of the image's semantic object. Theoretically, we describe the principles of OTI from the perspectives of decision boundaries as well as the mid- and high-frequency characteristics of adversarial perturbations. Comprehensive experiments demonstrate that OTI is effective and computationally efficient. In addition, OTI provides the adversarial machine learning community with a visual understanding of attackability.

OTI: A Model-free and Visually Interpretable Measure of Image Attackability
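The abstract does not give the formula for OTI, so the following is only one plausible reading of "texture intensity of the semantic object": the mean absolute image gradient over the pixels covered by an object mask. The function name `texture_intensity`, the forward-difference gradient, and the assumption that an object mask is already available are all illustrative choices, not the authors' definition.

```python
def texture_intensity(image, mask):
    """Mean absolute forward-difference gradient over masked pixels.

    `image` is a list of rows of grayscale values; `mask` has the same
    shape and marks object pixels with a truthy value.
    """
    height, width = len(image), len(image[0])
    total, count = 0.0, 0
    for y in range(height):
        for x in range(width):
            if not mask[y][x]:
                continue
            # Forward differences, clamped at the right and bottom borders.
            dx = image[y][min(x + 1, width - 1)] - image[y][x]
            dy = image[min(y + 1, height - 1)][x] - image[y][x]
            total += abs(dx) + abs(dy)
            count += 1
    return total / count if count else 0.0


# A flat patch and a checkerboard patch, both fully covered by the mask.
flat = [[0.5] * 4 for _ in range(4)]
checkerboard = [[float((x + y) % 2) for x in range(4)] for y in range(4)]
full_mask = [[1] * 4 for _ in range(4)]
```

Under this proxy the flat patch scores 0.0 and the checkerboard scores 1.5, so heavily textured objects receive higher values. The measure needs no model gradients, which matches the model-free property the abstract emphasizes.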

High-precision scene parsing tasks, including image matting and dichotomous segmentation, aim to accurately predict masks with extremely fine details (such as hair). Most existing methods focus on salient, single foreground objects. While interactive methods allow for target adjustment, their class-agnostic design restricts generalization across different categories. Furthermore, the scarcity of high-quality annotation has led to a reliance on inharmonious synthetic data, resulting in poor generalization to real-world scenarios. To this end, we propose a Foreground Consistent Learning model, dubbed FCLM, to address the aforementioned issues. Specifically, we first introduce a Depth-Aware Distillation strategy that transfers depth-related knowledge for better foreground representation. To handle the data dilemma, we frame the processing of synthetic data as a domain adaptation problem and propose a domain-invariant learning strategy that focuses on foreground learning. To support interactive prediction, we contribute an Object-Oriented Decoder that can receive both visual and language prompts to predict the referring target. Experimental results show that our method quantitatively and qualitatively outperforms SOTA methods. The code will be open-sourced upon acceptance of the paper.
