Singapore

Versatile 3D tasks (e.g., generation or editing) that distill Text-to-Image (T2I) diffusion models have attracted significant research interest because they do not rely on extensive 3D training data. However, T2I models are limited by prior view bias, which produces conflicting appearances between different views of an object. This bias causes subject-words to preferentially activate prior view features during cross-attention (CA) computation, regardless of the target view condition. To overcome this limitation, we conduct a comprehensive mathematical analysis to reveal the root cause of the prior view bias in T2I models. Moreover, we find that different UNet layers are affected by the prior view to different degrees in CA. Therefore, we propose a novel framework, TD-Attn, which addresses multi-view inconsistency via two key components: (1) the 3D-Aware Attention Guidance Module (3D-AAG) constructs a view-consistent 3D attention Gaussian for subject-words to enforce spatial consistency across attention-focused regions, thereby compensating for the limited spatial information in 2D individual view CA maps; (2) the Hierarchical Attention Modulation Module (HAM) utilizes a semantic guidance tree to direct the Semantic Response Profiler (SRP) in localizing and modulating CA layers that are highly responsive to view conditions, where the enhanced CA maps further support the construction of more consistent 3D attention Gaussians. Notably, HAM facilitates semantic-specific interventions, enabling controllable and precise 3D editing. Extensive experiments indicate that TD-Attn can serve as a universal plugin, significantly enhancing multi-view consistency across a wide range of 3D tasks.

AAAI 2026

Debiasing Diffusion Priors via 3D Attention for Consistent Gaussian Splatting

cv: 3d computer vision

cv: multi-modal vision

cv: language and vision

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.


Automatic medical report generation can greatly reduce the workload of doctors, but it is often unreliable for real-world deployment. Current methods can write fluent sentences that are nonetheless factually flawed, introducing serious medical errors known as clinical hallucinations, which make them untrustworthy for diagnosis. To bridge this gap, we introduce **HiMed-RL**, a Hierarchical Medical Reward Learning Framework designed to explicitly prioritize clinical quality. HiMed-RL moves beyond simple text matching by deconstructing reward learning into three synergistic levels: it first ensures linguistic fluency at the token-level, then enforces factual grounding at the concept-level by aligning key medical terms with expert knowledge, and finally assesses high-level diagnostic consistency at the semantic-level using a specialized LLM verifier. This hierarchical reward is implemented via a **Human-inspired Dynamic Reward Adjustment**, a strategy that first teaches the model to learn basic facts before progressing to more complex diagnostic reasoning. Experimentally, **HiMed-3B** achieves state-of-the-art performance on both in-domain and out-of-domain benchmarks, particularly on the latter, with an improvement of **10.8%** over the second-best baseline. Our work provides a robust paradigm for generating reports that improve not only fluency but also fine-grained clinical quality.

Beyond N-grams: A Hierarchical Reward Learning Framework for Clinically-Aware Medical Report Generation

Pose-agnostic Anomaly Detection (PAD) aims to detect anomalies when the poses of query images are unknown and differ from those in the training set. Therefore, accurately estimating the camera poses for the query images in the test set is critical for this task. Existing query-specific framework methods require re-optimizing a new set of parameters for each query image, limiting their generalization and increasing computational burden. To overcome these limitations, we propose a novel method, Relative Pose Estimation for Pose-agnostic Anomaly Detection (RPE-PAD), which enhances both generalization and efficiency with a query-independent framework. Specifically, we propose a Random View Synthesis Scheme (RVSS) that generates new poses by adding Gaussian perturbations to the original poses, then renders the corresponding views to augment the dataset. To estimate the relative camera pose between two input images, we introduce an Iterative Relative Pose Refinement Network (IRPRN), which incorporates a hierarchical coarse-to-fine refinement strategy. Furthermore, we employ a Multi-Pair Training Strategy (MPTS) to train the proposed IRPRN, leveraging multiple image pairs to expand the relative pose transformation space during training. Extensive experiments demonstrate that our method achieves robust anomaly detection performance while significantly improving inference efficiency.

RPE-PAD: Relative Pose Estimation for Pose-agnostic Anomaly Detection
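
The pose perturbation step of the Random View Synthesis Scheme described in the abstract above can be illustrated with a minimal sketch. The abstract only states that Gaussian perturbations are added to the original poses; everything else here is an assumption made for illustration: the 4x4 camera-to-world matrix layout, the axis-angle parameterization of the rotation noise, the function name `perturb_pose`, and the default noise scales. Rendering the perturbed views is not shown.

```python
import numpy as np


def perturb_pose(pose, rot_sigma=0.05, trans_sigma=0.02, rng=None):
    """Return a copy of a 4x4 pose with Gaussian noise on rotation and translation.

    Hypothetical helper: the noise model and scales are illustrative assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng

    # Sample a small axis-angle rotation vector and a translation offset.
    omega = rng.normal(0.0, rot_sigma, size=3)
    delta_t = rng.normal(0.0, trans_sigma, size=3)

    # Rodrigues' formula turns the axis-angle vector into a rotation matrix.
    theta = np.linalg.norm(omega)
    skew = np.array([
        [0.0, -omega[2], omega[1]],
        [omega[2], 0.0, -omega[0]],
        [-omega[1], omega[0], 0.0],
    ])
    if theta < 1e-12:
        r_delta = np.eye(3)
    else:
        r_delta = (
            np.eye(3)
            + (np.sin(theta) / theta) * skew
            + ((1.0 - np.cos(theta)) / theta**2) * (skew @ skew)
        )

    new_pose = pose.copy()
    new_pose[:3, :3] = r_delta @ pose[:3, :3]   # rotate the original orientation
    new_pose[:3, 3] = pose[:3, 3] + delta_t     # shift the camera position
    return new_pose
```

Sampling the rotation noise in axis-angle form keeps the perturbed matrix a valid rotation, which adding noise directly to the matrix entries would not guarantee.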

Large language models (LLMs) have recently empowered multi-agent systems (MAS) to achieve remarkable advances in collaborative reasoning and complex task automation. The effectiveness of these systems fundamentally depends on the design of adaptive **communication graphs**, the underlying workflows that coordinate agent interactions. However, in real-world scenarios, strict privacy constraints often silo data across organizations, and client distributions are highly non-IID, posing major challenges for synthesizing such workflows. In this work, we are **the first to systematically study distributed multi-agent workflow synthesis** under these privacy and heterogeneity constraints, and we introduce the Difficulty-Based Skew (DBS) benchmark to emulate such challenging environments. Drawing inspiration from federated graph learning (FGL), which has primarily focused on classification over static graphs, we identify a critical gap: existing FGL methods do not address the generative design of communication topologies. We reveal two fundamental obstacles to generative workflow synthesis in this setting: (i) **workflow specialization conflict**, where agents optimized for different task distributions generate incompatible communication patterns that resist meaningful aggregation, and (ii) **structural communication shift**, where locally optimal agent interaction graphs fail to compose into globally coherent multi-agent workflows. To address these challenges, we propose **DAWN**, a federated framework that integrates two key innovations: **Parametric Resonance**, which robustly aggregates heterogeneous local updates via layer-wise SVD-based denoising and alignment, and **Structural Gravity**, which regularizes local workflow generation by penalizing the Fused Gromov-Wasserstein distance to a set of prototype communication graphs, ensuring global structural coherence without stifling local adaptation. Experiments on the DBS benchmark show that **DAWN** surpasses baselines in global task success and reduces inter-client graph divergence, laying a solid foundation for privacy-preserving, adaptive MAS workflow design in heterogeneous settings.

DAWN: Distributed LLM Multi-Agent Workflow Synthesis

We present a novel framework for online learning in Stackelberg general-sum games, where two agents, the leader and follower, engage in sequential turn-based interactions. At the core of this approach is a learned diffeomorphism that maps the joint action space to a smooth spherical Riemannian manifold, referred to as the Stackelberg manifold. This mapping, facilitated by neural normalizing flows, ensures the formation of tractable isoplanar subspaces, enabling efficient techniques for online learning. Leveraging the linearity of the agents' reward functions on the Stackelberg manifold, our construct allows the application of linear bandit algorithms. We then provide a rigorous theoretical basis for regret minimization on the learned manifold and establish bounds on the simple regret for learning Stackelberg equilibrium. This integration of manifold learning into game theory uncovers a previously unrecognized potential for neural normalizing flows as an effective tool for multi-agent learning. We present empirical results demonstrating the effectiveness of our approach compared to standard baselines, with applications spanning domains such as cybersecurity and economic supply chain optimization.

Riemannian Manifold Learning for Stackelberg Games with Neural Flow Representations

Supervised learning with distributional inputs is a classic area of machine learning, and recently the two-stage sampling setup has received considerable attention. 
In this setting, the inputs (which are probability distributions) are not accessible in the learning phase, but only samples thereof. 
This problem is particularly amenable to kernel-based learning methods, where the distributions or samples are first embedded into a Hilbert space, often using kernel mean embeddings (KMEs), and then a standard kernel method like Support Vector Machines (SVMs) is applied, using a kernel defined on the embedding Hilbert space.
The case of distributional regression has received particular attention and there is by now a substantial body of theory available, including learning rates.
In contrast, the case of distributional classification is considerably less investigated, despite being relevant for applications like learning-based medical screening or causal learning.
Motivated by this, we provide a thorough analysis of classification with distributional inputs in the two-stage sampling setup using SVMs, in particular, establishing consistency and learning rate results.
Furthermore, for SVMs using the hinge loss and Gaussian kernels, we formulate a novel variant of an established noise assumption from the binary classification literature, under which we can establish learning rates.
Our results are formulated in significant generality, with many results also applicable to learning problems other than classification.
Furthermore, some of our technical tools like a new feature space for Gaussian kernels on Hilbert spaces are of independent interest.

Statistical Learning Theory for Distributional Classification
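
The two-stage sampling setup described in the abstract above can be made concrete with a small sketch: each input distribution is observed only through a sample, the sample is embedded via its empirical kernel mean embedding, and a Gaussian kernel is then evaluated on the embedding space. The function names, the Gaussian base kernel, and the bandwidth parameters `gamma` and `sigma` are assumptions chosen for illustration, not the paper's notation.

```python
import numpy as np


def base_kernel(X, Y, gamma=1.0):
    """Gaussian kernel matrix between the rows of X and the rows of Y."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def mmd_squared(X, Y, gamma=1.0):
    """Squared RKHS distance between the empirical mean embeddings of X and Y."""
    return (
        base_kernel(X, X, gamma).mean()
        + base_kernel(Y, Y, gamma).mean()
        - 2.0 * base_kernel(X, Y, gamma).mean()
    )


def embedding_kernel(X, Y, gamma=1.0, sigma=1.0):
    """Gaussian kernel on the embedding space, evaluated from two samples."""
    return np.exp(-max(mmd_squared(X, Y, gamma), 0.0) / (2.0 * sigma**2))
```

A Gram matrix built from `embedding_kernel` over all pairs of samples could then be passed to any SVM solver that accepts precomputed kernels.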

Recent advances demonstrate that reinforcement learning with verifiable rewards (RLVR) significantly enhances the reasoning capabilities of large language models (LLMs). However, standard RLVR faces challenges with reward sparsity, where zero rewards from consistently incorrect candidate answers provide no learning signal, particularly in challenging tasks. To address this, we propose **M**ulti-**E**xpert **M**utual **L**earning GRPO (MEML-GRPO), an innovative framework that utilizes diverse expert prompts as system prompts to generate a broader range of responses, substantially increasing the likelihood of identifying correct solutions. Additionally, we introduce an inter-expert mutual learning mechanism that facilitates knowledge sharing and transfer among experts, further boosting the model’s performance through RLVR. Extensive experiments across multiple reasoning benchmarks show that MEML-GRPO delivers significant improvements, achieving an average performance gain of 4.89% with Qwen and 11.33% with Llama, effectively overcoming the core limitations of traditional RLVR methods.

MEML-GRPO: Heterogeneous Multi-Expert Mutual Learning for RLVR Advancement
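
The reward sparsity problem mentioned in the abstract above can be seen in a short sketch of group-relative advantages. The abstract gives no formula, so the normalization below (reward minus group mean, divided by group standard deviation) is the commonly used GRPO form and is an assumption here, as are the function name and the `eps` constant.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one group of sampled responses.

    Assumed form: (reward - group mean) / (group std + eps).
    """
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


# Every sampled answer is wrong: all advantages are zero, so no learning signal.
all_wrong = group_relative_advantages([0, 0, 0, 0])

# One answer is correct: it is pushed up and the others are pushed down.
one_right = group_relative_advantages([0, 0, 1, 0])
```

Under this assumed form, a group needs at least one correct response before it contributes any gradient, which is why sampling a broader range of responses matters on hard tasks.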

Machine unlearning is a newly popularized technique for removing specific training data from a trained model, enabling it to comply with data deletion requests. While it protects the rights of users requesting unlearning, it also introduces new privacy risks. Prior works have primarily focused on the privacy of data that has been unlearned, while the risks to retained data remain largely unexplored. 
To address this gap, we focus on the privacy risks of retained data and, for the first time, reveal the vulnerabilities introduced by machine unlearning under the **dual-view** setting, where an adversary can query both the original and the unlearned models. From an information-theoretic perspective, we introduce the concept of privacy knowledge gain and demonstrate that the dual-view setting allows adversaries to obtain more information than querying either model alone, thereby amplifying privacy leakage. To effectively demonstrate this threat, we propose DVIA, a **D**ual-**V**iew **I**nference **A**ttack, which extracts membership information on retained data using black-box queries to both models. DVIA eliminates the need to train an attack model and employs a lightweight likelihood ratio inference module for efficient inference. Experiments across different datasets and model architectures validate the effectiveness of DVIA and highlight the privacy risks inherent in the dual-view setting.

Dual-View Inference Attack: Machine Unlearning Amplifies Privacy Exposure

Image clustering is a fundamental task in unsupervised visual learning. While recent self-supervised methods have explored various pretext tasks to generate supervision signals for clustering, they typically depend exclusively on raw images, resulting in insufficient supervision signals that are inherently constrained by limited visual semantics. In this paper, we propose a novel Semantic-Augmented image Clustering (SAC) method, which transcends the inherent limitations of purely visual representations through the integration of external knowledge. Specifically, SAC utilizes Vision-Language pre-trained Models (VLMs) to flexibly generate textual descriptions for each image, providing external semantic cues to supplement the visual information. By integrating both visual and textual information, SAC achieves image clustering through a multi-modal learning framework. To mitigate the negative impact of inaccurate textual information, SAC designs an uncertainty-driven adaptive weighting mechanism that explores both intra-modal and inter-modal neighborhood structures, and incorporates the adaptive weights into intra-modal and inter-modal contrastive learning, which improves the robustness against noisy image-text correspondences. Experiments on several popular datasets demonstrate the superiority of SAC compared to state-of-the-art methods.

Semantic-Augmented Image Clustering via Adaptive Multi-Modal Collaboration

Micro-video label prediction plays a pivotal role on contemporary video-sharing platforms, such as Kwai and TikTok. The emergence of video content lacking labels presents a formidable challenge for conventional user interest prediction methods. This paper addresses the challenge of micro-video label prediction, particularly for unseen videos, by proposing a zero-shot method called Class Semantic Relation Learning (CSRL). Unlike traditional user interest prediction models, CSRL leverages a pre-trained Large Language Model (LLM) to enhance prediction accuracy for unlabeled videos. The novelty of CSRL lies in its integration of three key components: a raw feature autoencoder, LLM-enhanced features, and a decomposed graph network. The decomposed graph network is specifically designed to disentangle the relationships between labeled and unlabeled videos, offering a significant improvement over previous methods. By fusing hidden topics with LLM-enhanced text, CSRL effectively handles sparse video features. Experiments on large-scale datasets from the Kwai platform show that CSRL achieves state-of-the-art results, with up to 44.64% improvement in Hit Ratio (HR), highlighting its superiority over existing zero-shot recommendation models in predicting user interests within the user-video network.

Zero-shot Recommendation: Towards Class Semantic Relation Learning for Inferring Labels of Unseen Micro-videos
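
The Hit Ratio (HR) metric reported in the abstract above can be sketched as follows. This uses the usual HR@K definition (the fraction of test cases whose true label appears among the top K predictions); the abstract does not state the exact evaluation protocol or the value of K, so the function name and the toy data are illustrative assumptions.

```python
def hit_ratio_at_k(ranked_lists, targets, k):
    """Fraction of cases whose target appears in the first k ranked predictions."""
    hits = sum(
        1 for ranked, target in zip(ranked_lists, targets) if target in ranked[:k]
    )
    return hits / len(targets)


# Toy data: three videos, each with a ranked list of candidate labels.
ranked = [["a", "b", "c"], ["b", "c", "a"], ["c", "a", "b"]]
truth = ["a", "a", "a"]
hr_at_2 = hit_ratio_at_k(ranked, truth, k=2)
```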

Existing open-source datasets for arbitrary-instruction image editing remain suboptimal, while a plug-and-play editing module compatible with community-prevalent generative models is notably absent. In this paper, we first introduce the X2Edit Dataset, a comprehensive dataset covering 14 diverse editing tasks, including subject-driven generation. We utilize industry-leading unified image generation models and expert models to construct the data. Meanwhile, we design reasonable editing instructions with the VLM and implement various scoring mechanisms to filter the data. As a result, we construct 3.7 million high-quality samples with balanced categories. Second, to integrate seamlessly with community image generation models, we design task-aware MoE-LoRA training based on FLUX.1, with only 8% of the parameters of the full model. To further improve the final performance, we utilize the internal representations of the diffusion model and define positive/negative samples based on image editing types to introduce contrastive learning. Extensive experiments demonstrate that the model's editing performance is competitive among many excellent models. Additionally, the constructed dataset exhibits substantial advantages over existing open-source datasets.
