In industrial anomaly detection, the scarcity of diverse defective samples poses a major challenge to training robust and scalable models. To address this, we propose an efficient few-shot training framework for synthesizing industrial anomalies using diffusion models. Unlike prior generative methods that rely on redundant or semantically meaningless prompts (e.g., "sks"), our method leverages only normal data with minimal textual guidance. We build upon the Stable Diffusion 3 architecture and introduce lightweight architectural adaptations and a curated training strategy guided by vision-language models (VLMs). Our method generates realistic and diverse anomalies aligned with interpretable prompts such as "scratch" or "broken component", and further allows spatial localization through prompt engineering. During inference, we adopt a multi-prompt strategy with attention modulation to enable precise and controllable anomaly synthesis. Experimental results demonstrate that our synthesized anomalies significantly enhance downstream anomaly detection performance and exhibit strong generalization across various industrial categories, even under limited supervision.
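The abstract does not detail the attention modulation used at inference, but one common form of prompt-conditioned control is cross-attention reweighting: boosting the attention mass that image queries place on the anomaly-prompt tokens (e.g. the tokens of "scratch") before renormalizing. The sketch below is an illustrative toy in numpy, not the paper's implementation; the function name, the token indices, and the `scale` factor are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def modulated_cross_attention(q, k, v, anomaly_token_idx, scale=1.5):
    """Toy cross-attention with reweighting (hypothetical sketch).

    After the usual softmax, the attention weights on the anomaly-prompt
    tokens are multiplied by `scale` and the rows renormalized, steering
    the queries (image patches) toward the anomaly concept.
    """
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)  # (n_query, n_tokens)
    attn[:, anomaly_token_idx] *= scale            # boost anomaly tokens
    attn /= attn.sum(axis=-1, keepdims=True)       # renormalize rows
    return attn @ v

# toy example: 4 image-patch queries attending over 6 prompt tokens, dim 8
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(6, 8))
v = rng.normal(size=(6, 8))
out = modulated_cross_attention(q, k, v, anomaly_token_idx=[2, 3])
print(out.shape)  # (4, 8)
```

In a real diffusion pipeline this reweighting would be applied inside the denoiser's cross-attention layers at each sampling step, with the boosted indices chosen from the tokenized anomaly prompt; the spatial-localization claim in the abstract would further restrict which query positions are modulated.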
