Singapore

Latent Diffusion Models have become a powerful tool for generating high-fidelity unrestricted adversarial examples. However, existing methods typically perturb only the initial latent or rely on prompt engineering, which is ill-suited to the iterative nature of the diffusion process. They also suffer from optimization instability caused by external text prompts and from cumulative drift that pushes the adversarial images off the data manifold. In this paper, we propose a hierarchical attack framework that operates in alignment with the model's generative manifold and leverages intermediate denoising states to maximize attack transferability and visual fidelity. Extensive experiments show that the proposed attack improves adversarial transferability by 10-20% against a diverse set of normally trained models and achieves an over 10.5% higher success rate against adversarially defended models, while simultaneously enhancing visual quality with a 1.0-1.2 FID reduction and a 16.7% LPIPS improvement.

AAAI 2026

Beyond Single-Point Perturbation: A Hierarchical, Manifold-Aware Approach to Diffusion Attacks

cv: diffusion models for vision

ml: adversarial learning & robustness

cv: adversarial attacks & robustness

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.

Learning Curve Extrapolation (LCE) is a critical technique for accelerating automated machine learning by terminating unpromising training runs early. Recent state-of-the-art methods have improved predictive accuracy by incorporating contextual information, such as neural network architecture. However, these approaches, whether context-agnostic or architecture-aware, still operate under the implicit assumption of a uniform task landscape. They overlook a pivotal, complementary factor: the intrinsic difficulty of the learning task itself. This oversight leads to a significant degradation in performance, especially for tasks whose learning dynamics diverge from the model's priors. In this work, we argue that task difficulty is a crucial yet neglected dimension for robust LCE. We introduce a novel framework, Difficulty-Adaptive Learning Curve Extrapolation (DA-LCE), which explicitly conditions its predictions on task complexity. Our core contributions are threefold: (1) We propose a transparent, rule-based method to quantify task difficulty from the early shape of learning curves, eliminating the need for external meta-features. (2) We design a novel data generation pipeline using a conditional diffusion model to create a high-fidelity, difficulty-conditioned synthetic prior for training. (3) We introduce a Conditional Difficulty-aware PFN (CD-PFN) that leverages this information to achieve superior predictive accuracy. Extensive experiments on a wide range of benchmarks demonstrate that our CD-PFN significantly outperforms both difficulty-agnostic baselines and even state-of-the-art architecture-aware models. This result highlights that task difficulty is a powerful, complementary source of information, whose impact can be as significant as, or even greater than, that of the model architecture.

Difficulty-Aware Learning Curve Extrapolation

Spiking Neural Networks (SNNs) are emerging as a promising energy-efficient alternative to Artificial Neural Networks (ANNs) due to their event-driven computation paradigm. However, recent advances toward large-scale high-performance SNNs inevitably lead to substantial memory and computational overhead. While quantization offers a potential solution, many quantization approaches fail to deliver verifiable efficiency gains on resource-constrained hardware platforms. In this paper, we propose a lightweight and hardware-friendly SNN that applies quantization to both weights and membrane potentials, termed HardF-SNN. Specifically, we first build a baseline model that adopts shared-scale quantization and batch normalization (BN) folding to simulate integer-only inference during training, since this baseline model has not been thoroughly discussed in previous SNN work. Although the baseline enables integer-arithmetic-only inference, it suffers from performance degradation and may even lead to training failure. To address these issues, we thoroughly analyze the problems caused by quantization and BN folding, and propose solutions to enhance the baseline’s performance. Specifically, we introduce proportional shared-scale quantization to enhance the representation capability, and propose an integer-only BN method to stabilize training convergence through integer arithmetic and bit-shifting operations. Extensive experiments show that HardF-SNN achieves an optimal balance between performance and efficiency, exhibiting excellent compatibility with mainstream hardware accelerators. To demonstrate its effectiveness on resource-constrained platforms, HardF-SNN is deployed on a dedicated FPGA-based hardware accelerator. Evaluation results indicate that our implementation surpasses current state-of-the-art accelerators.

HardF-SNN: Hardware-Friendly Quantization for Spiking Neural Networks with Efficient Integer-Arithmetic-Only Inference

Federated learning (FL) protects data privacy by enabling distributed model training without direct access to client data. However, its distributed nature makes it vulnerable to model and data poisoning attacks. While numerous defenses filter malicious clients using statistical metrics, they overlook the role of model redundancy, where not all parameters contribute equally to the model or attack performance. Current attacks manipulate all model parameters uniformly, making them more detectable, while defenses focus on the overall statistics of client updates, leaving gaps for more sophisticated attacks. We propose an attack-agnostic augmentation method to enhance the stealthiness and effectiveness of existing poisoning attacks in FL, exposing flaws in current defenses and highlighting the need for fine-grained FL security. Our three-stage methodology (pill construction, pill poisoning, and pill injection) injects poison into a compact subnet (i.e., a pill) of the global model during the iterative FL training. Experimental results show that FL poisoning attacks enhanced by our method can bypass 8 state-of-the-art (SOTA) defenses, achieving up to a 7x increase in error rate, and more than a 2x increase on average, on both IID and non-IID data, in both cross-silo and cross-device FL systems.

Poisoning with a Pill: Circumventing Detection in Federated Learning

Panoramic video generation has attracted growing attention due to its applications in virtual reality and immersive media.
However, existing methods lack explicit motion control and struggle to generate scenes with large and complex motions.
We propose PanFlow, a novel approach that exploits the spherical nature of panoramas to decouple the highly dynamic camera rotation from the input optical flow condition, enabling more precise control over large and dynamic motions.
We further introduce a spherical noise warping strategy to promote loop consistency in motion across panorama boundaries.
To support effective training, we curate a large-scale, motion-rich panoramic video dataset with frame-level pose and flow annotations.
We also showcase the effectiveness of our method in various applications, including motion transfer and video editing.
Extensive experiments demonstrate that PanFlow significantly outperforms prior methods in motion fidelity, visual quality, and temporal coherence.

PanFlow: Decoupled Motion Control for Panoramic Video Generation

Aiming to identify precise evidence sources from visual documents, visual evidence attribution for visual document retrieval-augmented generation (VD-RAG) ensures reliable and verifiable predictions from vision-language models (VLMs) in multimodal question answering. Most existing methods adopt end-to-end training to facilitate intuitive answer verification. However, they lack fine-grained supervision and progressive traceability throughout the reasoning process. In this paper, we introduce the Chain-of-Evidence (CoE) paradigm for VD-RAG. CoE unifies Chain-of-Thought (CoT) reasoning and visual evidence attribution by grounding reference elements in reasoning steps to specific regions with bounding boxes and page indexes. To enable VLMs to generate such evidence-grounded reasoning, we propose Look As You Think (LAT), a reinforcement learning framework that trains models to produce verifiable reasoning paths with consistent attribution. During training, LAT evaluates the attribution consistency of each evidence region and provides rewards only when the CoE trajectory yields correct answers, encouraging process-level self-verification. Experiments on vanilla Qwen2.5-VL-7B-Instruct with Paper- and Wiki-VISA benchmarks show that LAT consistently improves the vanilla model in both single- and multi-image settings, yielding average gains of 8.13% in soft exact match (EM) and 48.4% in IoU@0.5. Meanwhile, LAT not only outperforms the supervised fine-tuning baseline, which is trained to directly produce answers with attribution, but also exhibits stronger generalization across domains.

Look as You Think: Unifying Reasoning and Visual Evidence Attribution for Verifiable Document RAG via Reinforcement Learning
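
The IoU@0.5 figure reported in the abstract above is not defined on this page. Under the common convention, a predicted evidence region counts as a match when its intersection-over-union with the reference region reaches 0.5. The following is a minimal sketch of that convention only, assuming axis-aligned boxes given as `(x1, y1, x2, y2)`; the names `iou` and `hit_at` and the inclusive threshold are illustrative assumptions, not taken from the paper.

```python
from typing import Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # Degenerate boxes with zero total area are treated as non-overlapping.
    return inter / union if union > 0 else 0.0


def hit_at(pred: Box, ref: Box, threshold: float = 0.5) -> bool:
    """True when the prediction overlaps the reference at or above the threshold.

    The inclusive comparison (>=) is an assumption; some protocols use a strict one.
    """
    return iou(pred, ref) >= threshold
```

For example, `iou((0, 0, 2, 2), (1, 1, 3, 3))` has an overlap of area 1 and a union of area 7, so the two boxes would not count as a match at the 0.5 threshold.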

Multi-modal Retrieval-Augmented Generation (MMRAG) enables highly credible generation by integrating external multi-modal knowledge, thus demonstrating impressive performance in complex multi-modal scenarios. However, existing MMRAG methods fail to clarify the reasoning logic behind retrieval and response generation, which limits the explainability of the results. To address this gap, we propose to introduce reinforcement learning into multi-modal retrieval-augmented generation, enhancing the reasoning capabilities of multi-modal large language models through a two-stage reinforcement fine-tuning framework to achieve explainable multi-modal retrieval-augmented generation. Specifically, in the first stage, rule-based reinforcement fine-tuning is employed to perform coarse-grained point-wise ranking of multi-modal documents, effectively filtering out those that are significantly irrelevant. In the second stage, reasoning-based reinforcement fine-tuning is utilized to jointly optimize fine-grained list-wise ranking and answer generation, guiding multi-modal large language models to output explainable reasoning logic in the MMRAG process. Our method achieves state-of-the-art results on WebQA and MultimodalQA, two benchmark datasets for multi-modal retrieval-augmented generation, and its effectiveness is validated through comprehensive ablation experiments.

MMRAG-RFT: Two-stage Reinforcement Fine-tuning for Explainable Multi-modal Retrieval-augmented Generation

Text-to-SQL is a fundamental yet challenging task in NLP that aims to translate natural language questions into SQL queries. While recent advances in large language models have greatly improved performance, most existing approaches depend on models with tens of billions of parameters or costly APIs, limiting their applicability in resource-constrained environments. In real-world deployments, especially on edge devices, cost-effectiveness is crucial for Text-to-SQL, so enabling light-weight models for this task is of great practical significance. However, smaller LLMs often struggle with complicated user instructions, redundant schema linking, and syntax correctness. To address these challenges, we propose **MCTS-SQL**, a novel framework that uses Monte Carlo Tree Search to guide SQL generation through multi-step refinement. Because light-weight models perform poorly in single-shot prediction, we generate better results through several trials with feedback. However, directly applying MCTS-based methods inevitably leads to significant time and computational overhead. To mitigate this, we propose a token-level **prefix-cache mechanism** that stores prior information during iterations, effectively improving execution speed. Experimental results on the SPIDER and BIRD benchmarks demonstrate the effectiveness of our approach. Using the small open-source Qwen2.5-Coder-1.5B, our method outperforms ChatGPT-3.5. When leveraging a more powerful model, Gemini 2.5, to explore the performance upper bound, we achieve results competitive with the SOTA. Our findings demonstrate that even small models can be effectively deployed in practical Text-to-SQL systems with the right strategy.

MCTS-SQL: Light-Weight LLMs Can Master the Text-to-SQL Through Monte Carlo Tree Search

Using risky text prompts, such as pornographic and violent prompts, to test the safety of text-to-image (T2I) models is a critical task. However, existing risky prompt datasets are limited in three key areas: 1) limited risky categories, 2) coarse-grained annotation, and 3) low effectiveness. To address these limitations, we introduce T2I-RiskyPrompt, a comprehensive benchmark designed for evaluating safety-related tasks in T2I models.
Specifically, we first develop a hierarchical risk taxonomy, which consists of 6 primary categories and 14 fine-grained subcategories. Building upon this taxonomy, we construct a pipeline to collect and annotate risky prompts. Finally, we obtain 6,432 effective risky prompts, where each prompt is annotated with both hierarchical category labels and detailed risk reasons. 
Moreover, to facilitate the evaluation, we propose a reason-driven risky image detection method that explicitly aligns the MLLM with safety annotations.
Based on T2I-RiskyPrompt, we conduct a comprehensive evaluation of eight T2I models, nine defense methods, five safety filters, and five attack strategies, offering nine key insights into the strengths and limitations of T2I model safety.
Finally, we discuss potential applications of T2I-RiskyPrompt across various research fields.
The dataset and code are provided in the supplementary material.

T2I-RiskyPrompt: A Benchmark for Safety Evaluation, Attack, and Defense on Text-to-Image Model

We introduce OceanSplat, a novel method that captures geometric structures in scattering media with high fidelity for real-time 3D underwater scene representation. To overcome the severe attenuation and scattering effects inherent in underwater environments, our method imposes trinocular stereo consistency on views translated along two orthogonal axes, effectively constraining the spatial placement of 3D Gaussians and preserving object geometry under complex medium conditions. We further apply self-supervised geometric regularization using a synthetic epipolar depth prior derived from these translated views to suppress medium-induced misplacement of 3D Gaussians. In addition, we align the rendered depth with the $z$-component of individual 3D Gaussians to suppress floaters and enhance structural fidelity. Furthermore, we propose a depth-aware alpha adjustment module that uses directional and depth cues to guide visibility learning in the early stages of training, preventing erroneous placements and medium entanglement. Our method effectively represents 3D scenes under scattering media without external geometric cues by preventing foreground 3D Gaussians from erroneously contributing to the medium in novel views, thereby preserving overall scene quality. Experiments on real-world underwater and simulated scenes demonstrate that our method outperforms prior approaches in 3D scene representation under scattering media.

OceanSplat: Object-aware Gaussian Splatting with Trinocular View Consistency for Underwater Scene Reconstruction

Deep learning-based image manipulation localization (IML) methods have achieved remarkable performance in recent years, but typically rely on large-scale pixel-level annotated datasets. To address the challenge of acquiring high-quality annotations, some recent weakly supervised methods utilize image-level labels to segment manipulated regions. However, the performance is still limited due to insufficient supervision signals. In this study, we explore a form of weak supervision that improves the annotation efficiency and detection performance, namely scribble annotation supervision. We re-annotate mainstream IML datasets with scribble labels and present the first scribble-based IML (Sc-IML) dataset. Additionally, we propose the first scribble-based weakly supervised IML framework. Specifically, we employ self-supervised training with a structural consistency loss to encourage the model to produce consistent predictions under multi-scale and augmented inputs. In addition, we propose a prior-aware feature modulation module (PFMM) that adaptively integrates prior information from both manipulated and authentic regions for dynamic feature adjustment, further enhancing feature discriminability and prediction consistency in complex scenes. We also propose a gated adaptive fusion module (GAFM) that utilizes gating mechanisms to regulate information flow during feature fusion, guiding the model toward emphasizing potential tampered regions. Finally, we propose a confidence-aware entropy minimization loss ($\mathcal{L}_{CEM}$). This loss dynamically regularizes predictions in weakly annotated or unlabeled regions based on model uncertainty, effectively suppressing unreliable predictions. Experimental results show that our method outperforms existing fully supervised approaches in terms of average performance both in-distribution and out-of-distribution. Our Sc-IML dataset and code will be released upon acceptance.
