Text-to-video models have demonstrated impressive capabilities in producing diverse video content, yet they often lack fine-grained control over motion. We introduce MotionFlow, a novel, training-free framework for motion transfer in pre-trained video diffusion models. MotionFlow leverages cross-attention maps through a test-time optimization of latent representations that aligns the generated video's attention patterns with those extracted from a source video. This approach captures and manipulates complex spatial and temporal dynamics, enabling seamless motion transfer across diverse contexts. Unlike methods that rely on direct attention-map replacement, which can introduce artifacts, or those that require model-specific training, MotionFlow operates solely at test time and robustly handles significant scene and appearance alterations. Our qualitative and quantitative experiments demonstrate that MotionFlow significantly outperforms existing methods in motion fidelity, temporal consistency, and versatility, even under drastic scene transformations.
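To make the test-time guidance concrete, the sketch below illustrates one plausible optimization step in PyTorch: the video latents are updated by gradient descent so that the denoiser's cross-attention maps move toward those recorded from the source video. This is a minimal sketch under stated assumptions, not the authors' implementation; the function name `motion_transfer_step`, the `return_attn` flag, and the simple MSE loss are all hypothetical stand-ins (in practice, attention maps are typically collected via forward hooks, and the paper's loss and scheduling may differ).

```python
import torch
import torch.nn.functional as F

def motion_transfer_step(latents, unet, timestep, text_emb, source_attn,
                         lr=0.05, n_steps=5):
    """One hypothetical test-time guidance step: optimize the video latents so
    the denoiser's cross-attention maps align with the source video's maps."""
    # Make the latents a leaf tensor we can optimize directly.
    latents = latents.detach().requires_grad_(True)
    optimizer = torch.optim.Adam([latents], lr=lr)

    for _ in range(n_steps):
        optimizer.zero_grad()
        # Forward pass through the denoiser. `return_attn=True` is an assumed
        # hook that exposes a list of cross-attention maps, one per layer;
        # real implementations usually attach forward hooks to attention modules.
        _, attn_maps = unet(latents, timestep,
                            encoder_hidden_states=text_emb,
                            return_attn=True)
        # Guidance loss: pull each generated attention map toward the
        # corresponding map extracted from the source motion video.
        loss = sum(F.mse_loss(a, s) for a, s in zip(attn_maps, source_attn))
        loss.backward()
        optimizer.step()

    # Return the updated latents, detached for the next denoising step.
    return latents.detach()
```

Because the update touches only the latents at inference time, no model weights are modified, which is what makes this style of guidance training-free and applicable to any pre-trained video diffusion backbone that exposes its cross-attention maps.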