Singapore

Diffusion models have achieved remarkable success in image and video generation. However, their inherent multi-step inference process results in substantial computational overhead during inference, posing significant challenges for real-world deployment. Therefore, accelerating diffusion models is of great practical importance. Existing acceleration techniques include model quantization, model pruning, sampler optimization, step reduction, and compilation-level optimization.
Determining how to effectively combine multiple acceleration techniques to achieve optimal performance for a given diffusion model remains a major challenge for engineers. To address this, we propose the Diffusion Optimization Agent, an automated framework designed to generate the optimal acceleration strategy and corresponding code for any given diffusion model. Additionally, we introduce DiffBench, a comprehensive benchmark covering diverse diffusion model pipelines, combinations of optimization techniques, and acceleration tasks.
This paper presents a detailed description of the DiffBench construction process and the design principles of the Diffusion Optimization Agent. Extensive experiments demonstrate that our agent significantly outperforms current state-of-the-art large language models (LLMs) in generating effective acceleration strategies for diffusion models.

AAAI 2026

DiffBench Meets DiffAgent: End-to-End LLM-Driven Diffusion Acceleration Code Generation

nlp: applications

ml: optimization

cv: applications

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.

Survival prediction of cancers is crucial for clinical practice, as it informs mortality risks and influences treatment plans. However, a $\textit{static}$ model trained on a single dataset fails to adapt to the $\textit{dynamically evolving}$ clinical environment and continuous data streams, limiting its practical utility. While continual learning (CL) offers a solution to learn dynamically from new datasets, existing CL methods primarily focus on unimodal inputs and suffer from severe catastrophic forgetting in survival prediction. In real-world scenarios, multimodal inputs such as whole slide images and genomics often provide comprehensive and complementary information, and neglecting inter-modal correlations degrades performance. To address the two challenges of $\textit{catastrophic forgetting}$ and $\textit{complex inter-modal interactions}$ between gigapixel whole slide images and genomics, we propose $\textbf{ConSurv}$, the $\textbf{first}$ multimodal continual learning (MMCL) method for survival analysis. ConSurv incorporates two key components: Multi-staged Mixture of Experts (MS-MoE) and Feature Constrained Replay (FCR). MS-MoE captures both task-shared and task-specific knowledge at different learning stages of the network, including two modality encoders and the modality fusion component, learning inter-modal relationships. FCR further enhances learned knowledge and mitigates forgetting by restricting feature deviation of previous data at different levels, including encoder-level features of two modalities, as well as the fusion-level representations. Additionally, we introduce a new benchmark integrating four datasets, Multimodal Survival Analysis Incremental Learning (MSAIL), for comprehensive evaluation in the CL setting. Extensive experiments demonstrate that ConSurv outperforms competing methods across multiple metrics. Our code is provided in the supplementary material and will be made publicly available upon publication.
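
As a rough illustration of the feature-constrained replay idea in the ConSurv abstract above, the following sketch penalizes how far the current features of a replayed sample drift from the features stored when that sample was first learned, at the two encoder levels and at the fusion level. The mean-squared-error form, the level names (`wsi`, `genomics`, `fusion`), and the per-level weights are assumptions made for illustration; the abstract does not specify the actual loss.

```python
# Hypothetical sketch of a feature-deviation penalty; not the ConSurv implementation.
LEVELS = ("wsi", "genomics", "fusion")  # assumed names for the three feature levels


def mean_squared_deviation(stored, current):
    """Mean squared difference between two equal-length feature vectors."""
    if len(stored) != len(current):
        raise ValueError("feature vectors must have the same length")
    return sum((s - c) ** 2 for s, c in zip(stored, current)) / len(stored)


def feature_deviation_penalty(stored_feats, current_feats, weights=None):
    """Weighted sum of per-level deviations for one replayed sample."""
    weights = weights or {level: 1.0 for level in LEVELS}
    return sum(
        weights[level]
        * mean_squared_deviation(stored_feats[level], current_feats[level])
        for level in LEVELS
    )


# Toy vectors, invented purely to exercise the function.
stored = {"wsi": [1.0, 2.0], "genomics": [0.5, 0.5], "fusion": [3.0, 1.0]}
drifted = {"wsi": [1.0, 4.0], "genomics": [0.5, 0.5], "fusion": [3.0, 1.0]}
penalty = feature_deviation_penalty(stored, drifted)
```

In a training loop, a term of this kind would typically be added to the task loss for samples drawn from a replay buffer, so that updates on new datasets are discouraged from moving old representations.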

ConSurv: Multimodal Continual Learning for Survival Analysis

In the realm of autonomous driving, accurately detecting surrounding obstacles is crucial for effective decision-making. Traditional methods primarily rely on 3D bounding boxes to represent these obstacles, which often fail to capture the complexity of irregularly shaped, real-world objects. To overcome these limitations, we present GUIDE, a novel framework that utilizes 3D Gaussians for instance detection and occupancy prediction. Unlike conventional occupancy prediction methods, GUIDE also offers robust tracking capabilities. Our framework employs a sparse representation strategy, using Gaussian-to-Voxel Splatting to provide fine-grained, instance-level occupancy data without the computational demands associated with dense voxel grids. Experimental validation on the nuScenes dataset demonstrates GUIDE's performance, with an instance occupancy mAP of 21.61, marking a 50% improvement over existing methods, alongside competitive tracking capabilities. GUIDE establishes a new benchmark in autonomous perception systems, effectively combining precision with computational efficiency to better address the complexities of real-world driving environments.

GUIDE: Gaussian Unified Instance Detection for Enhanced Obstacle Perception in Autonomous Driving

Despite advancements in language-controlled reinforcement learning (LC-RL) for basic domains and straightforward commands (e.g., object manipulation and navigation), effectively extending LC-RL to comprehend and execute high-level or abstract instructions in complex, multi-agent environments, such as football games, remains a significant challenge. To address this gap, we introduce Language-Controlled Diverse Style Policies (LCDSP), a novel LC-RL paradigm specifically designed for complex scenarios. LCDSP comprises two key components: a Diverse Style Training (DST) method and a Style Interpreter (SI). The DST method efficiently trains a single policy capable of exhibiting a wide range of diverse behaviors by modulating agent actions through style parameters (SP). The SI is designed to accurately and rapidly translate high-level language instructions into these corresponding SP. Through extensive experiments in a complex 5v5 football environment, we demonstrate that LCDSP effectively comprehends abstract tactical instructions and accurately executes the desired diverse behavioral styles, showcasing its potential for complex, real-world applications.

Complex Instruction Following with Diverse Style Policies in Football Games

Diffusion models have recently been adopted for point cloud upsampling due to their effectiveness in solving ill-posed problems. However, existing upsampling methods are often inefficient because they generate dense point clouds by mapping Gaussian noise to data, overlooking the geometric information already present in sparse inputs. To address this, we propose PUFM (Point Cloud Upsampling via Flow Matching), a novel method that learns to directly transform sparse point clouds into their high-fidelity dense counterparts. Our approach first applies midpoint interpolation to densify the sparse input. Then, we construct a continuous interpolant between sparse and dense point clouds and train a neural network to estimate the velocity field for flow matching. Given the unordered nature of point clouds, we introduce a pre-alignment step based on Earth Mover's Distance (EMD) optimization to ensure coherent and meaningful interpolation between sparse and dense representations. This results in a more stable and efficient learning trajectory during flow matching. Experiments on synthetic benchmarks demonstrate that our method delivers superior upsampling quality with fewer sampling steps. Further experiments on ScanNet and KITTI also show that our approach generalizes well to real-world RGB-D and LiDAR point clouds, making it more practical for real-world applications.
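
The three steps named in the PUFM abstract above (midpoint densification, pre-alignment, and a sparse-to-dense interpolant) can be sketched on toy data as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the nearest-neighbor midpoint rule, the brute-force assignment used as a stand-in for EMD optimization (only feasible for a handful of points), and the linear interpolant with constant velocity target are all choices made here for clarity.

```python
import math
from itertools import permutations


def midpoint_densify(points):
    """Append the midpoint between each point and its nearest other point."""
    if len(points) < 2:
        raise ValueError("need at least two points")
    dense = list(points)
    for i, p in enumerate(points):
        others = [q for j, q in enumerate(points) if j != i]
        nearest = min(others, key=lambda q: math.dist(p, q))
        dense.append(tuple((a + b) / 2 for a, b in zip(p, nearest)))
    return dense


def emd_align(source, target):
    """Reorder `target` to minimize the total matching distance to `source`.

    For equal-size, equal-weight point sets this optimal one-to-one matching
    attains the Earth Mover's Distance. Brute force: toy sizes only.
    """
    def cost(perm):
        return sum(math.dist(s, t) for s, t in zip(source, perm))
    return list(min(permutations(target), key=cost))


def interpolant(x0, x1, t):
    """Assumed linear path x_t = (1 - t) * x0 + t * x1 between aligned clouds."""
    return [tuple((1 - t) * a + t * b for a, b in zip(p, q)) for p, q in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for the velocity network on a linear path: x1 - x0."""
    return [tuple(b - a for a, b in zip(p, q)) for p, q in zip(x0, x1)]


# Toy example: a densified input matched against a shuffled dense target.
x0 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
x1 = emd_align(x0, [(1.0, 0.0, 0.5), (0.0, 0.0, 0.5)])
halfway = interpolant(x0, x1, 0.5)
velocity = target_velocity(x0, x1)
```

Without the alignment step, the first source point would be paired with a far-away target point, and the interpolation paths would cross; matching first keeps each path short, which is the intuition behind the more stable trajectory the abstract mentions.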

PUFM: Efficient Point Cloud Upsampling via Flow Matching

Protein design is revolutionizing biotechnology, yet existing approaches struggle to balance structural foldability with functional performance. Structure-based models excel at generating stable protein backbones but often overlook critical functional properties, while protein language models capture evolutionary and functional signals but frequently predict sequences lacking structural stability. Integrating these complementary approaches remains challenging due to their inherently conflicting objectives.
We present MAProt, a multi-agent framework that synergistically combines structure-based and protein language model-based methods for protein design. Each agent specializes in a distinct aspect of the design objective: the structure-based agent (e.g., ProteinMPNN) ensures compatibility with the target backbone, while protein language model-based agents (e.g., ESM, SaProt) capture evolutionary plausibility and functional potential. To reconcile conflicts and achieve optimal trade-offs, we introduce a Pareto-based negotiation module that enables effective multi-objective coordination and consensus among agents.
Extensive experiments on benchmark datasets demonstrate that MAProt achieves a remarkable improvement over state-of-the-art baselines, and generalizes robustly across a range of tasks, including thermodynamic folding stability design, functional protein design, and high-affinity antibody design. These results highlight the power of collaborative optimization for advancing rational protein engineering.
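
To make the notion of a Pareto-based trade-off in the MAProt abstract above concrete, the sketch below filters candidate sequences down to those that no other candidate beats on every objective. It shows only standard Pareto dominance; the negotiation procedure itself is not described in the abstract, and the candidate names and two-objective scores (structure compatibility, language-model plausibility) are placeholder values invented for illustration.

```python
def dominates(a, b):
    """True if score tuple `a` is at least as good as `b` everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Keep candidates whose scores are not dominated by any other candidate."""
    return {
        name: scores
        for name, scores in candidates.items()
        if not any(dominates(other, scores) for other in candidates.values())
    }


# Placeholder scores (higher is better): (structure agent, language-model agent).
candidates = {
    "seq_a": (0.9, 0.4),
    "seq_b": (0.6, 0.8),
    "seq_c": (0.5, 0.3),  # worse than seq_a and seq_b on both objectives
}
front = pareto_front(candidates)
```

Here `seq_a` and `seq_b` both survive because each is better on one objective, which is exactly the situation where agents with conflicting preferences would need some further rule to reach consensus.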

Advancing Protein Design via Multi-Agent Reinforcement Learning with Pareto-Based Collaborative Optimization

Event cameras are bio-inspired sensors that capture visual information through asynchronous brightness changes, offering distinct advantages including high temporal resolution and wide dynamic range. While prior research has investigated event-based 3D reconstruction for extreme scenarios, existing methods face inherent limitations and fail to fully exploit the unique characteristics of event data.
In this paper, we present EvDiff3D, a novel two-stage 3D reconstruction framework that integrates event-based geometric constraints with an event-aware diffusion prior for appearance refinement. Our key insight lies in bridging the gap between physically grounded event-based reconstruction and data-driven appearance repair through a unified cyclical pipeline. In the first stage, we reconstruct a coarse 3D scene under supervision from event loss and event-based monocular depth constraints to preserve structural fidelity. 
The second stage fine-tunes an event-aware diffusion model based on a pretrained video diffusion model as a repair prior to enhance the appearance in under-constrained regions.
Based on the diffusion model, our pipeline operates within a reconstruction-generation cycle that progressively refines both geometry and appearance using only event data.
Extensive experiments on synthetic and real-world datasets demonstrate that EvDiff3D significantly outperforms existing methods in perceptual quality and structural consistency.

EvDiff3D: Event-Aware Diffusion Repair for High-Fidelity Event-Based 3D Reconstruction

Multi-subject video generation aims to synthesize videos from textual prompts and multiple reference images, ensuring that each subject preserves natural scale and visual fidelity. However, current methods face two challenges: scale inconsistency, where variations in subject size lead to unnatural generation, and permutation sensitivity, where the order of reference inputs causes subject distortion.
In this paper, we propose MoFu, a unified framework that tackles both challenges. For scale inconsistency, we introduce Scale-Aware Modulation (SMO), an LLM-guided module that extracts implicit scale cues from the prompt and modulates features to ensure consistent subject sizes. To address permutation sensitivity, we present a simple yet effective Fourier Fusion strategy that processes the frequency information of reference features via the Fast Fourier Transform to produce a unified representation. Besides, we design a Scale-Permutation Stability Loss to jointly encourage scale-consistent and permutation-invariant generation.
To further evaluate these challenges, we establish a dedicated benchmark with controlled variations in subject scale and reference permutation. Extensive experiments demonstrate that MoFu significantly outperforms existing methods in preserving natural scale, subject fidelity, and overall visual quality.

MoFu: Scale-Aware Modulation and Fourier Fusion for Multi-Subject Video Generation

Multimodal Retrieval-Augmented Generation (MRAG) has recently been explored to empower Large Vision Language Models (LVLMs) with more comprehensive and up-to-date contextual knowledge, aiming to compensate for their limited and coarse-grained parametric knowledge in knowledge-intensive tasks. 
However, the retrieved contextual knowledge is usually not aligned with LVLMs’ internal parametric knowledge, leading to knowledge conflicts and further unreliable or inconsistent LVLM responses. 
To tackle this issue, we design KCM, a training-free and plug-and-play framework that can effectively mitigate knowledge conflicts while incorporating MRAG for more accurate LVLM responses. 
KCM enhances contextual knowledge utilization by modifying the LVLM architecture from three key perspectives. First, KCM adaptively adjusts attention distributions among multiple attention heads, encouraging LVLMs to focus on contextual knowledge with reduced distraction. 
Second, KCM identifies and prunes knowledge-centric LVLM neurons that encode coarse-grained parametric knowledge, thereby suppressing interferences and enabling more effective integration of contextual knowledge. 
Third, KCM amplifies the information flow from the input context by injecting supplementary context logits, reinforcing its contribution to the final output. 
Extensive experiments over multiple widely adopted LVLMs and benchmarks show that KCM outperforms the state-of-the-art consistently by large margins, incurring neither extra training nor external tools. Code and data will be released.

Enhancing Retrieval-Augmented Large Vision Language Models via Knowledge Conflict Mitigation

Large language models (LLMs) concentrate substantial knowledge in specialized domains due to extensive pretraining and instruction tuning, and they are now central to commercial and scientific practice. Yet access is usually limited to costly, rate-limited interfaces, which motivates methods that can extract targeted domain knowledge with minimal querying effort. A further challenge is that the target domain may be unknown in advance, so naive or generic prompts waste queries and fail to expose the underlying concepts and relations that structure the domain.
In this work, we introduce a query-efficient approach for domain-specific knowledge stealing from black-box language models. Rather than issuing random questions or generic templates, our framework performs self-directed exploration that lets the model find the direction and mine domain knowledge by itself. Starting from a small and diverse seed, it discovers salient domain entities and induces their relations through structured question families that elicit definitional, functional, and compositional information. A feedback-driven controller analyzes the errors and uncertainty of the extracted student model and uses this signal to refine subsequent queries, all without any prior domain knowledge or external resources.
We evaluate the method in two expert-centric settings, medicine and finance, and observe consistently better performance while requiring significantly fewer queries.

Query-Efficient Domain Knowledge Stealing Against Large Language Models

Unsupervised video Object-Centric Learning (OCL) is promising as it enables object-level scene representation and dynamics modeling as we humans do.
Mainstream video OCL methods adopt a recurrent architecture: an aggregator aggregates the current video frame into object features, termed slots, under some queries; a transitioner transits the current slots to queries for the next frame.
This architecture is effective, but all existing implementations both (i1) neglect to incorporate next-frame features, the most informative source for query prediction, and (i2) fail to learn transition dynamics, the knowledge essential for query prediction.
To address these issues, we propose Random Slot-Feature pair for learning Query prediction (RandSF.Q): (t1) we design a new transitioner to incorporate both slots and features, which provides more information for query prediction; (t2) we train the transitioner to predict queries from slot-feature pairs randomly sampled from available recurrences, which drives it to learn transition dynamics.
Experiments on scene representation demonstrate that our method surpasses existing video OCL methods significantly, e.g., by up to 10 points on object discovery, setting a new state of the art. Such superiority also benefits downstream tasks like dynamics modeling.
Our core source code and training logs are available as the supplement.
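
The recurrent aggregator/transitioner loop described in the abstract above can be outlined as below, with the transitioner receiving both the current slots and the next frame's features. This is only a structural sketch: `run_recurrence`, the scalar "features", and the toy `aggregate` and `transit` functions are hypothetical stand-ins for learned modules, and the random slot-feature pair sampling used for training is not shown.

```python
def run_recurrence(frame_feats, init_queries, aggregate, transit):
    """Run the aggregator/transitioner loop over a sequence of frame features.

    aggregate(feat, queries) -> slots for the current frame
    transit(slots, next_feat) -> queries for the next frame
    """
    queries = init_queries
    slots_per_frame = []
    for t, feat in enumerate(frame_feats):
        slots = aggregate(feat, queries)
        slots_per_frame.append(slots)
        if t + 1 < len(frame_feats):
            # The transitioner sees next-frame features as well as current slots.
            queries = transit(slots, frame_feats[t + 1])
    return slots_per_frame


# Toy stand-ins for the learned modules, using scalars in place of feature maps.
def toy_aggregate(feat, queries):
    return [q + feat for q in queries]


def toy_transit(slots, next_feat):
    return [(s + next_feat) / 2 for s in slots]


slots_per_frame = run_recurrence([1.0, 2.0, 3.0], [0.0, 10.0], toy_aggregate, toy_transit)
```

A transitioner that ignores its second argument recovers the conventional design in which queries are predicted from slots alone.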
