Singapore

Developing high-performance GPU kernels is critical for AI and scientific computing, but remains challenging due to its reliance on expert crafting and poor portability. While large language models (LLMs) offer promise for automation, both general-purpose and finetuned LLMs suffer from two fundamental and conflicting limitations: correctness and efficiency. The key reason is that existing LLM-based approaches directly generate the entire optimized low-level programs, requiring exploration of an extremely vast space encompassing both optimization policies and implementation codes.

To address the challenge of exploring an intractable space, we propose Macro Thinking Micro Coding (MTMC), a hierarchical framework inspired by the staged optimization strategy of human experts. It decouples optimization strategy from implementation details, ensuring efficiency through high-level strategy and correctness through low-level implementation. Specifically, Macro Thinking employs reinforcement learning to guide lightweight LLMs in efficiently exploring and learning semantic optimization strategies that maximize hardware utilization. Micro Coding leverages general-purpose LLMs to incrementally implement the stepwise optimization proposals from Macro Thinking, avoiding full-kernel generation errors. Together, they effectively navigate the vast optimization space and intricate implementation details, enabling LLMs for high-performance GPU kernel generation.

Comprehensive results on widely adopted benchmarks demonstrate the superior performance of MTMC on GPU kernel generation in both accuracy and running time. On KernelBench, MTMC achieves near 100% and 70% accuracy at Levels 1-2 and Level 3, respectively, over 50% higher than SOTA general-purpose and domain-finetuned LLMs, with up to 7.3× speedup over LLMs and 2.2× over expert-optimized PyTorch Eager kernels. On the more challenging TritonBench, MTMC attains up to 59.64% accuracy and 34× speedup. All models and datasets will be made publicly available.
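The split between Macro Thinking and Micro Coding described in the abstract can be pictured as a propose, implement, verify loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `Kernel` class, the `optimize` function, and all `toy_*` helpers are hypothetical stand-ins (a real system would use an RL-trained policy for proposals, an LLM for code edits, and actual GPU tests and timings).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set, Tuple


@dataclass
class Kernel:
    source: str
    applied: List[str] = field(default_factory=list)  # optimization steps applied so far


def optimize(
    kernel: Kernel,
    propose_step: Callable[[Kernel, Set[str]], Optional[str]],  # "macro": picks the next strategy
    apply_step: Callable[[Kernel, str], Kernel],                # "micro": implements one step
    is_correct: Callable[[Kernel], bool],
    runtime: Callable[[Kernel], float],
    max_steps: int = 8,
) -> Tuple[Kernel, float]:
    """Apply one optimization step at a time, keeping only correct, faster kernels."""
    best, best_time = kernel, runtime(kernel)
    tried: Set[str] = set()
    for _ in range(max_steps):
        step = propose_step(best, tried)
        if step is None:
            break
        tried.add(step)
        candidate = apply_step(best, step)
        if not is_correct(candidate):
            continue  # reject a broken edit; the last correct kernel is kept
        t = runtime(candidate)
        if t < best_time:
            best, best_time = candidate, t
    return best, best_time


# --- Toy stand-ins (all hypothetical, for illustration only) ---
TOY_FACTOR = {"tile_loops": 0.5, "use_shared_memory": 0.8, "vectorize_loads": 0.9}
TOY_BROKEN = {"use_shared_memory"}  # pretend this edit fails its correctness test


def toy_propose(kernel: Kernel, tried: Set[str]) -> Optional[str]:
    return next((name for name in TOY_FACTOR if name not in tried), None)


def toy_apply(kernel: Kernel, step: str) -> Kernel:
    return Kernel(kernel.source + f"\n# applied: {step}", kernel.applied + [step])


def toy_is_correct(kernel: Kernel) -> bool:
    return not (set(kernel.applied) & TOY_BROKEN)


def toy_runtime(kernel: Kernel) -> float:
    t = 1.0  # made-up baseline time
    for step in kernel.applied:
        t *= TOY_FACTOR[step]
    return t


best, best_time = optimize(Kernel("naive kernel"), toy_propose, toy_apply, toy_is_correct, toy_runtime)
```

Because each step is verified before the next is proposed, a faulty edit costs one iteration rather than invalidating a whole generated kernel, which is the intuition behind avoiding full-kernel generation.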

AAAI 2026

QiMeng-Kernel: Macro-Thinking Micro-Coding Paradigm for LLM-Based High-Performance GPU Kernel Generation

gpu kernel

code generation

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.


Microaneurysms (MAs), the earliest pathognomonic signs of Diabetic Retinopathy (DR), present as sub-60 μm lesions in fundus images with highly variable photometric and morphological characteristics, rendering manual screening not only labor-intensive but inherently error-prone. While diffusion-based anomaly detection has emerged as a promising approach for automated MA screening, its clinical application is hindered by three fundamental limitations. First, these models often fall prey to "identity mapping", where they inadvertently replicate the input image. Second, they struggle to distinguish MAs from other anomalies, leading to high false positives. Third, their suboptimal reconstruction of normal features hampers overall performance. To address these challenges, we propose a Wavelet Diffusion Transformer framework for MA Detection (WDT-MD), which features three key innovations: a noise-encoded image conditioning mechanism to avoid "identity mapping" by perturbing image conditions during training; pseudo-normal pattern synthesis via inpainting to introduce pixel-level supervision, enabling discrimination between MAs and other anomalies; and a wavelet diffusion Transformer architecture that combines the global modeling capability of diffusion Transformers with multi-scale wavelet analysis to enhance reconstruction of normal retinal features. Comprehensive experiments on the IDRiD and e-ophtha MA datasets demonstrate that WDT-MD outperforms state-of-the-art methods in both pixel-level and image-level MA detection. This advancement holds significant promise for improving early DR screening.

WDT-MD: Wavelet Diffusion Transformers for Microaneurysm Detection in Fundus Images

Diffusion models have achieved remarkable success in image and video generation. However, their inherent multi-step inference process results in substantial computational overhead during inference, posing significant challenges for real-world deployment. Therefore, accelerating diffusion models is of great practical importance. Existing acceleration techniques include model quantization, model pruning, sampler optimization, step reduction, and compilation-level optimization.
Determining how to effectively combine multiple acceleration techniques to achieve optimal performance for a given diffusion model remains a major challenge for engineers. To address this, we propose the Diffusion Optimization Agent, an automated framework designed to generate the optimal acceleration strategy and corresponding code for any given diffusion model. Additionally, we introduce DiffBench, a comprehensive benchmark covering diverse diffusion model pipelines, combinations of optimization techniques, and acceleration tasks.
This paper presents a detailed description of the DiffBench construction process and the design principles of the Diffusion Optimization Agent. Extensive experiments demonstrate that our agent significantly outperforms current state-of-the-art large language models (LLMs) in generating effective acceleration strategies for diffusion models.

DiffBench Meets DiffAgent: End-to-End LLM-Driven Diffusion Acceleration Code Generation

Survival prediction of cancers is crucial for clinical practice, as it informs mortality risks and influences treatment plans. However, a $\textit{static}$ model trained on a single dataset fails to adapt to the $\textit{dynamically evolving}$ clinical environment and continuous data streams, limiting its practical utility. While continual learning (CL) offers a solution to learn dynamically from new datasets, existing CL methods primarily focus on unimodal inputs and suffer from severe catastrophic forgetting in survival prediction. In real-world scenarios, multimodal inputs often provide comprehensive and complementary information, such as whole slide images and genomics; and neglecting inter-modal correlations negatively impacts the performance. To address the two challenges of $\textit{catastrophic forgetting}$ and $\textit{complex inter-modal interactions}$ between gigapixel whole slide images and genomics, we propose $\textbf{ConSurv}$, the $\textbf{first}$ multimodal continual learning (MMCL) method for survival analysis. ConSurv incorporates two key components: Multi-staged Mixture of Experts (MS-MoE) and Feature Constrained Replay (FCR). MS-MoE captures both task-shared and task-specific knowledge at different learning stages of the network, including two modality encoders and the modality fusion component, learning inter-modal relationships. FCR further enhances learned knowledge and mitigates forgetting by restricting feature deviation of previous data at different levels, including encoder-level features of two modalities, as well as the fusion-level representations. Additionally, we introduce a new benchmark integrating four datasets, Multimodal Survival Analysis Incremental Learning (MSAIL), for comprehensive evaluation in the CL setting. Extensive experiments demonstrate that ConSurv outperforms competing methods across multiple metrics. Our code is provided in the supplementary material and will be made publicly available upon publication.

ConSurv: Multimodal Continual Learning for Survival Analysis

In the realm of autonomous driving, accurately detecting surrounding obstacles is crucial for effective decision-making. Traditional methods primarily rely on 3D bounding boxes to represent these obstacles, which often fail to capture the complexity of irregularly shaped, real-world objects. To overcome these limitations, we present GUIDE, a novel framework that utilizes 3D Gaussians for instance detection and occupancy prediction. Unlike conventional occupancy prediction methods, GUIDE also offers robust tracking capabilities. Our framework employs a sparse representation strategy, using Gaussian-to-Voxel Splatting to provide fine-grained, instance-level occupancy data without the computational demands associated with dense voxel grids. Experimental validation on the nuScenes dataset demonstrates GUIDE's performance, with an instance occupancy mAP of 21.61, marking a 50% improvement over existing methods, alongside competitive tracking capabilities. GUIDE establishes a new benchmark in autonomous perception systems, effectively combining precision with computational efficiency to better address the complexities of real-world driving environments.

GUIDE: Gaussian Unified Instance Detection for Enhanced Obstacle Perception in Autonomous Driving

Despite advancements in language-controlled reinforcement learning (LC-RL) for basic domains and straightforward commands (e.g., object manipulation and navigation), effectively extending LC-RL to comprehend and execute high-level or abstract instructions in complex, multi-agent environments, such as football games, remains a significant challenge. To address this gap, we introduce Language-Controlled Diverse Style Policies (LCDSP), a novel LC-RL paradigm specifically designed for complex scenarios. LCDSP comprises two key components: a Diverse Style Training (DST) method and a Style Interpreter (SI). The DST method efficiently trains a single policy capable of exhibiting a wide range of diverse behaviors by modulating agent actions through style parameters (SP). The SI is designed to accurately and rapidly translate high-level language instructions into these corresponding SP. Through extensive experiments in a complex 5v5 football environment, we demonstrate that LCDSP effectively comprehends abstract tactical instructions and accurately executes the desired diverse behavioral styles, showcasing its potential for complex, real-world applications.

Complex Instruction Following with Diverse Style Policies in Football Games

Diffusion models have recently been adopted for point cloud upsampling due to their effectiveness in solving ill-posed problems. However, existing upsampling methods often struggle with inefficiencies, as they generate dense point clouds by mapping Gaussian noise to data, overlooking the geometric information already present in sparse inputs. To address this, we propose PUFM, a novel Point Cloud Upsampling via Flow Matching, which learns to directly transform sparse point clouds into their high-fidelity dense counterparts. Our approach first applies midpoint interpolation to densify the sparse input. Then, we construct a continuous interpolant between sparse and dense point clouds and train a neural network to estimate the velocity field for flow matching. Given the unordered nature of point clouds, we introduce a pre-alignment step based on Earth Mover's Distance (EMD) optimization to ensure coherent and meaningful interpolation between sparse and dense representations. This results in a more stable and efficient learning trajectory during flow matching. Experiments on synthetic benchmarks demonstrate that our method delivers superior upsampling quality with fewer sampling steps. Further experiments on ScanNet and KITTI also show that our approach generalizes well to real-world RGB-D and LiDAR point clouds, making it more practical for real-world applications.

PUFM: Efficient Point Cloud Upsampling via Flow Matching
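The three steps named in the PUFM abstract (midpoint densification, pre-alignment, and an interpolant with a velocity target) can be illustrated on a toy example. This is a sketch under assumptions, not the paper's code: the interpolant is assumed to be linear, the alignment is a brute-force one-to-one matching that only works for tiny sets (a practical system would use a proper EMD solver), and all function names and point values are hypothetical.

```python
import itertools
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def midpoint_densify(points: Sequence[Point]) -> List[Point]:
    """Double the point count by adding each point's midpoint with its nearest neighbour.

    Requires at least two points.
    """
    dense = list(points)
    for i, p in enumerate(points):
        nearest = min(
            (q for j, q in enumerate(points) if j != i),
            key=lambda q: math.dist(p, q),
        )
        dense.append(tuple((a + b) / 2 for a, b in zip(p, nearest)))
    return dense


def align(source: Sequence[Point], target: Sequence[Point]) -> List[Point]:
    """Reorder `target` so the summed distance to `source` is minimal (brute force)."""
    best = min(
        itertools.permutations(range(len(target))),
        key=lambda perm: sum(math.dist(s, target[j]) for s, j in zip(source, perm)),
    )
    return [target[j] for j in best]


def flow_matching_pair(
    x0: Sequence[Point], x1: Sequence[Point], t: float
) -> Tuple[List[Point], List[Point]]:
    """Assumed linear interpolant x_t = (1 - t) * x0 + t * x1 and its velocity x1 - x0."""
    x_t = [tuple((1 - t) * a + t * b for a, b in zip(p, q)) for p, q in zip(x0, x1)]
    velocity = [tuple(b - a for a, b in zip(p, q)) for p, q in zip(x0, x1)]
    return x_t, velocity


# Toy data (made up): two sparse points and an unordered four-point dense target.
sparse = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
dense_target = [(2.0, 0.0, 0.0), (1.0, 0.1, 0.0), (0.0, 0.0, 0.0), (1.0, -0.1, 0.0)]

x0 = midpoint_densify(sparse)   # starting points for the flow
x1 = align(x0, dense_target)    # target reordered to pair with x0
x_half, v_target = flow_matching_pair(x0, x1, 0.5)
```

In training, a network would be regressed onto `v_target` at sampled times `t`. Without the alignment step, points would be paired arbitrarily and the velocity targets would cross over each other, which is the instability that pre-alignment is meant to reduce.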

Protein design is revolutionizing biotechnology, yet existing approaches struggle to balance structural foldability with functional performance. Structure-based models excel at generating stable protein backbones but often overlook critical functional properties, while protein language models capture evolutionary and functional signals but frequently predict sequences lacking structural stability. Integrating these complementary approaches remains challenging due to their inherently conflicting objectives.
We present MAProt, a multi-agent framework that synergistically combines structure-based and protein language model-based methods for protein design. Each agent specializes in a distinct aspect of the design objective: the structure-based agent (e.g., ProteinMPNN) ensures compatibility with the target backbone, while protein language model-based agents (e.g., ESM, SaProt) capture evolutionary plausibility and functional potential. To reconcile conflicts and achieve optimal trade-offs, we introduce a Pareto-based negotiation module that enables effective multi-objective coordination and consensus among agents.
Extensive experiments on benchmark datasets demonstrate that MAProt achieves a remarkable improvement over state-of-the-art baselines, and generalizes robustly across a range of tasks, including thermodynamic folding stability design, functional protein design, and high-affinity antibody design. These results highlight the power of collaborative optimization for advancing rational protein engineering.

Advancing Protein Design via Multi-Agent Reinforcement Learning with Pareto-Based Collaborative Optimization
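The MAProt abstract does not spell out how its Pareto-based negotiation module works, but the underlying idea of Pareto optimality is standard: keep only the candidates that no other candidate beats on every objective. The sketch below shows that filtering step alone, assuming each agent assigns a score to maximize; the function names, sequence labels, and score values are hypothetical and purely illustrative.

```python
from typing import Dict, List, Tuple

Scores = Tuple[float, ...]  # one score per agent, higher is better


def dominates(a: Scores, b: Scores) -> bool:
    """True if `a` is at least as good as `b` everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates: Dict[str, Scores]) -> List[str]:
    """Return the names of candidates not dominated by any other candidate."""
    return [
        name
        for name, scores in candidates.items()
        if not any(
            dominates(other, scores)
            for other_name, other in candidates.items()
            if other_name != name
        )
    ]


# Made-up scores: (structure-agent score, language-model-agent score).
candidates = {
    "seq_A": (0.90, 0.20),
    "seq_B": (0.60, 0.70),
    "seq_C": (0.50, 0.60),  # worse than seq_B on both objectives
    "seq_D": (0.20, 0.95),
}

front = pareto_front(candidates)
```

Here `seq_C` is dropped because `seq_B` scores higher on both objectives, while the remaining three represent different trade-offs; how the agents then settle on one of them is the part the abstract attributes to negotiation and is not modeled here.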

Event cameras are bio-inspired sensors that capture visual information through asynchronous brightness changes, offering distinct advantages including high temporal resolution and wide dynamic range. While prior research has investigated event-based 3D reconstruction for extreme scenarios, existing methods face inherent limitations and fail to fully exploit the unique characteristics of event data.
In this paper, we present EvDiff3D, a novel two-stage 3D reconstruction framework that integrates event-based geometric constraints with an event-aware diffusion prior for appearance refinement. Our key insight lies in bridging the gap between physically grounded event-based reconstruction and data-driven appearance repair through a unified cyclical pipeline. In the first stage, we reconstruct a coarse 3D scene under supervision from event loss and event-based monocular depth constraints to preserve structural fidelity. 
The second stage fine-tunes an event-aware diffusion model based on a pretrained video diffusion model as a repair prior to enhance the appearance in under-constrained regions.
Based on the diffusion model, our pipeline operates within a reconstruction-generation cycle that progressively refines both geometry and appearance using only event data.
Extensive experiments on synthetic and real-world datasets demonstrate that EvDiff3D significantly outperforms existing methods in perceptual quality and structural consistency.

EvDiff3D: Event-Aware Diffusion Repair for High-Fidelity Event-Based 3D Reconstruction

Multi-subject video generation aims to synthesize videos from textual prompts and multiple reference images, ensuring that each subject preserves natural scale and visual fidelity. However, current methods face two challenges: scale inconsistency, where variations in subject size lead to unnatural generation, and permutation sensitivity, where the order of reference inputs causes subject distortion.
In this paper, we propose MoFu, a unified framework that tackles both challenges. For scale inconsistency, we introduce Scale-Aware Modulation (SMO), an LLM-guided module that extracts implicit scale cues from the prompt and modulates features to ensure consistent subject sizes. To address permutation sensitivity, we present a simple yet effective Fourier Fusion strategy that processes the frequency information of reference features via the Fast Fourier Transform to produce a unified representation. In addition, we design a Scale-Permutation Stability Loss to jointly encourage scale-consistent and permutation-invariant generation.
To further evaluate these challenges, we establish a dedicated benchmark with controlled variations in subject scale and reference permutation. Extensive experiments demonstrate that MoFu significantly outperforms existing methods in preserving natural scale, subject fidelity, and overall visual quality.

MoFu: Scale-Aware Modulation and Fourier Fusion for Multi-Subject Video Generation

Multimodal Retrieval-Augmented Generation (MRAG) has recently been explored to empower Large Vision Language Models (LVLMs) with more comprehensive and up-to-date contextual knowledge, aiming to compensate for their limited and coarse-grained parametric knowledge in knowledge-intensive tasks. 
However, the retrieved contextual knowledge is usually not aligned with LVLMs’ internal parametric knowledge, leading to knowledge conflicts and, in turn, unreliable or inconsistent LVLM responses.
To tackle this issue, we design KCM, a training-free and plug-and-play framework that can effectively mitigate knowledge conflicts while incorporating MRAG for more accurate LVLM responses. 
KCM enhances contextual knowledge utilization by modifying the LVLM architecture from three key perspectives. First, KCM adaptively adjusts attention distributions among multiple attention heads, encouraging LVLMs to focus on contextual knowledge with reduced distraction. 
Second, KCM identifies and prunes knowledge-centric LVLM neurons that encode coarse-grained parametric knowledge, thereby suppressing interferences and enabling more effective integration of contextual knowledge. 
Third, KCM amplifies the information flow from the input context by injecting supplementary context logits, reinforcing its contribution to the final output. 
Extensive experiments over multiple widely adopted LVLMs and benchmarks show that KCM outperforms the state-of-the-art consistently by large margins, incurring neither extra training nor external tools. Code and data will be released.

