Singapore

Multi-modal entity alignment aims to identify equivalent entities between two multi-modal Knowledge graphs by integrating multi-modal data, such as images and text, to enrich the semantic representations of entities. However, existing methods often overlook the structural contextual information within each modality, making them vulnerable to interference from shallow features. To address these challenges, we propose MyGram, a modality-aware graph transformer with global distribution for multi-modal entity alignment. Specifically, we develop a modality diffusion learning module to capture deep structural contextual information within modalities and enable fine-grained multi-modal fusion. In addition, we introduce a Gram Loss that acts as a regularization constraint by minimizing the volume of a 4-dimensional parallelotope formed by multi-modal features, thereby achieving global distribution consistency across modalities. We conduct experiments on five public datasets. Results show that MyGram outperforms baseline models, achieving a 4.05% improvement in Hits@1 on FBDB15K, 10.25% improvement on FBYG15K, and a 3.75% improvement on DBP15K.

AAAI 2026

MyGram: Modality-aware Graph Transformer with Global Distribution for Multi-modal Entity Alignment

krr: knowledge engineering，krr: knowledge representation languages，krr: knowledge acquisition

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.<br><br>

To access this event page, you need to log in with the **email address you registered with**. <br>Access credentials will be sent to your email from Underline -  subject line "Welcome to AAAI 2026". Please be sure to check your spam email folder if you do not see an email confirmation right away.

Please log in

To access this event page, you are required to register.
Please complete your registration to continue.

We recommend reading [**the registration information**](https://aaai.org/conference/aaai/aaai-26/registration/) first.

**Online Registration Form**: https://aaai.getregistered.net/conference-2026 

Registration Required

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

Leveraging the event-driven paradigm, Spiking Neural Networks (SNNs) offer a promising approach for constructing energy-efficient Transformer architectures. Compared to directly trained Spiking Transformers, ANN-to-SNN conversion methods bypass the high training costs. However, existing methods still suffer from notable limitations, failing to effectively handle nonlinear operations in Transformer architectures and requiring additional fine-tuning processes for pre-trained ANNs. To address these issues, we propose a high-performance and training-free ANN-to-SNN conversion framework tailored for Transformer architectures. Specifically, we introduce a Multi-basis Exponential Decay (MBE) neuron, which employs an exponential decay strategy and multi-basis encoding method to efficiently approximate various nonlinear operations. It removes the requirement for weight modifications in pre-trained ANNs. Extensive experiments across diverse tasks (CV, NLU, NLG) and mainstream Transformer architectures (ViT, RoBERTa, GPT-2) demonstrate that our method achieves near-lossless conversion accuracy with significantly lower latency. This provides a promising pathway for the efficient and scalable deployment of Spiking Transformers in real-world applications.

Training-Free ANN-to-SNN Conversion for High-Performance Spiking Transformers

The KV cache in self-attention has emerged as a major bottleneck in long-context and large-batch inference for LLMs. Existing approaches often treat sparsity prediction and compression as separate modules—relying on auxiliary index structures to select relevant tokens, and on complex quantization schemes to reduce memory usage. This fragmented design introduces redundant overhead and limits scalability.

In this paper, we propose a novel paradigm: treating the compressed key representation not merely as storage, but as a self-indexing structure that directly enables efficient sparse attention. By designing a sign-based 1-bit vector quantization (VQ) scheme, our method unifies compression and retrieval in a single, hardware-friendly format. This approach eliminates the need for external indices or learning-based predictors, offering a lightweight yet robust solution for memory-constrained inference. 

All components are designed to be hardware-efficient and easy to implement. By implementing custom CUDA kernels, our method integrates seamlessly with FlashAttention, minimizing additional runtime and memory overhead. Experimental results demonstrate that our approach delivers both effectiveness and efficiency.

Self-Indexing KVCache: Predicting Sparse Attention from Compressed Keys

Vision-and-Language Navigation (VLN) requires an agent to dynamically explore complex 3D environments following human instructions. Recent research underscores the potential of harnessing large language models (LLMs) for VLN, given their commonsense knowledge and general reasoning capabilities. Despite their strengths, a substantial gap in task completion performance persists between LLM-based approaches and domain experts, as LLMs inherently struggle to comprehend real-world spatial correlations precisely; additionally, LLM inference can make the decision-making process considerably inefficient. To address these issues, we propose a novel dual-process thinking framework dubbed $R^3$, integrating LLMs’ generalization capabilities with VLN-specific expertise in a zero-shot manner. The framework comprises three core modules: Runner, Ruminator, and Regulator. The Runner is a lightweight transformer-based expert model that ensures efficient and accurate navigation under regular circumstances. The Ruminator employs a multimodal LLM as the backbone and adopts chain-of-thought (CoT) prompting to elicit structured reasoning from the LLM. The Regulator monitors the navigation progress and controls the appropriate thinking mode according to three criteria, integrating Runner and Ruminator harmoniously. Experimental results illustrate that R$^3$ significantly outperforms other state-of-the-art methods, exceeding 3.28% and 3.30% in SPL and RGSPL respectively on the REVERIE benchmark, highlighting the effectiveness of our method in handling challenging VLN tasks.

Run, Ruminate, and Regulate: A Dual-process Thinking System for Vision-and-Language Navigation

Generating high-quality human interactions holds significant value for applications like virtual reality and robotics.
However, existing methods often fail to preserve unique individual characteristics or fully adhere to textual descriptions. 
To address these challenges, we introduce InterMoE, a novel framework built on a Dynamic Temporal-Selective Mixture of Experts. 
The core of InterMoE is a routing mechanism that synergistically uses both high-level text semantics and low-level motion context to dispatch temporal motion features to specialized experts. 
This allows experts to dynamically determine the selection capacity and focus on critical temporal features, thereby preserving specific individual characteristic identities while ensuring high semantic fidelity.
Extensive experiments show that InterMoE achieves state-of-the-art performance in individual-specific high-fidelity 3D human interaction generation, reducing FID scores by 9\% on the InterHuman dataset and 22\% on InterX.

InterMoE: Individual-Specific 3D Human Interaction Generation via Dynamic Temporal-Selective MoE

Recovering precise surface geometry from corrupted point clouds remains a core challenge in 3D vision. Although existing denoising techniques achieve remarkable success, balancing noise removal with preserving intricate geometric details continues to pose difficulties. A critical limitation of current methods is that their adaptive feature aggregation mechanisms rely heavily on intermediate network features that have not been explicitly regularized, resulting in unstable guidance signals. This instability restricts the capability of the network to optimally differentiate true geometric details from noise.
To overcome this limitation, we propose a novel deep learning framework that explicitly learns structured representations as robust priors to guide feature refinement. Our approach first derives a set of representative local structural primitives from input features by means of a learned codebook. This learned structured representation then serves as a robust conditional signal, directing a subsequent feature fusion mechanism to dynamically aggregate information in a structure-aware manner, thereby more effectively discerning noise and meticulously reconstructing geometric details. Extensive experiments on several benchmarks have demonstrated the superiority of our framework over existing advanced techniques in terms of detail preservation and noise suppression.

Guiding Point Cloud Denoising with Learned Structural Priors

Open-vocabulary semantic segmentation (OVSS) employs pixel-level vision-language alignment to associate category-related prompts with corresponding pixels. A key challenge is enhancing the multimodal dense prediction capability, specifically this pixel-level multimodal alignment. Although existing methods achieve promising results by leveraging CLIP’s vision-language alignment, they rarely investigate the performance boundaries of CLIP for dense prediction from an interpretability mechanisms perspective. In this work, we systematically investigate CLIP's internal mechanisms and identify a critical phenomenon: analogous to human distraction, CLIP diverts significant attention resources from target regions to irrelevant tokens. Our analysis reveals that these tokens arise from dimension-specific over-activation; filtering them enhances CLIP's dense prediction performance. Consequently, we propose $\underline{R}$e$\underline{F}$ocusing CLIP (RF-CLIP), a training-free approach that emulates human distraction-refocusing behavior to redirect attention from distraction tokens back to target regions, thereby refining CLIP's multimodal alignment granularity. Our method achieves SOTA performance on eight benchmarks while maintaining high inference efficiency.

Target Refocusing via Attention Redistribution for Open-Vocabulary Semantic Segmentation: An Explainability Perspective

Federated Graph Learning (FGL) has emerged as a compelling paradigm for collaboratively training a global model while preserving the privacy of multi-source graphs. Nonetheless, FGL faces a critical challenge of data heterogeneity, where semantic and structural discrepancies across clients significantly degrade its performance. Although existing methods attempt to calibrate client-specific graph distributions during the federated training, they inevitably fall short in aligning the optimization behaviors across clients due to dynamic parameter updates, thereby inducing a bottleneck in generalization improvement. To tackle this challenge, we propose a solution from a new perspective of prior refinement, which seeks to proactively harmonize client graph distributions before the federated training. In particular, we propose a Federated Graph Harmonization (FedGH) framework that exploits the generative strengths of graph diffusion models to perform prior refinement of local graphs. In a nutshell, FedGH designs a conditional diffusion mechanism on each client that synthesizes pseudo-graphs encapsulating both feature and structural priors, thereby facilitating explicit correction of inter-client distributional bias. On the server side, we employ the graph contrastive learning between various client-specific pseudo-graphs to incorporate the global information, subsequently guiding local data reconstruction. Importantly, model-agnostic FedGH can be seamlessly deployed as a plug-and-play module to be easily integrated with existing FGL architectures. Extensive experiments demonstrate that FedGH consistently outperforms state-of-the-art FGL baselines.

Prior Refinement Is Better: Diffusion-Driven Graph Harmonization for Federated Graph Learning

Cross-modal alignment is a crucial task in multimodal learning aimed at achieving semantic consistency between vision and language. This requires that image-text pairs exhibit similar semantics. Traditional algorithms pursue embedding consistency to achieve semantic consistency, ignoring the non-semantic information present in the embedding. An intuitive approach is to decouple the embeddings into semantic and modality components, aligning only the semantic component. However, this introduces two main challenges: (1) There is no established standard for distinguishing semantic and modal information. (2) The modality gap can cause semantic alignment deviation or information loss. To align the true semantics, we propose a novel cross-modal alignment algorithm via Constrained Decoupling and Distribution Sampling (CDDS). Specifically, (1) A dual-path UNet is introduced to adaptively decouple the embeddings, applying multiple constraints to ensure effective separation. (2) A distribution sampling method is proposed to bridge the modality gap, ensuring the rationality of the alignment process. Extensive experiments on various benchmarks and model backbones demonstrate the superiority of CDDS, outperforming state-of-the-art methods by 6.6\%-14.2\%.

Aligning the True Semantics: Constrained Decoupling and Distribution Sampling for Cross-Modal Alignment

Echocardiography and vascular ultrasound are essential for comprehensive cardiovascular assessment, yet manual evaluation and writing reports are labor-intensive, time-consuming, and require expertise from both cardiology and vascular surgery departments. Current automated report generation systems mainly focus on X-ray or CT, often neglecting echocardiographic modalities and critical quantitative parameters like aortic diameter and main pulmonary artery diameter, limiting their clinical utility. Moreover, the interdependence between cardiac and peripheral vascular health necessitates cross-departmental insights, which existing methods fail to incorporate. 
To address these limitations, we first propose the vision-language framework named the Echo-Cardiac-Vascular (ECV) framework, for joint cardiac and vascular ultrasound report generation and parameter measurements. ECV introduces a Mixture-of-Experts vision encoder tailored for distinct ultrasound subtypes, a structured parameter measurement module for accurate quantification, and a cross-modal attention mechanism that generates interpretable, multimodal diagnostic reports. Our framework, trained on 11,276 paired records that achieves high accuracy and fast generation speed, significantly improving diagnostic efficiency, consistency, and cross-disciplinary clinical applicability. Our model and codes will be publicly available.

Unified Mixture-of-Experts Framework for Joint Cardiac and Vascular Ultrasound Analysis and Report Generation

Recent diffusion-based models have significantly improved inpainting quality.
However, existing methods struggle with multi-task inpainting due to conflicting optimization objectives, and current datasets are typically limited to task-specific scenarios, hindering joint training.
To address these challenges, we propose \textbf{MagicPaint}, a unified diffusion-based inpainting model that supports object addition, removal, and unconditional inpainting across both text and image modalities.
MagicPaint semantically decouples operation types and target content by learnable tokens in \textbf{MMToken Module}, effectively reconciling conflicting optimization objectives and enabling robust multi-task, multi-modal inpainting.
Besides, a novel inpainting paradigm named \textbf{MagicMask}, encodes operating intent directly into the mask and applies a mask loss for spatially precise supervision.
In addition, existing inpainting datasets are insufficient for multi-task and multi-modal scenarios, limiting the capability of inpainting models. Thus, we further introduce
a new dataset comprising 2.1M image tuples. It is dedicatedly designed to support diverse inpainting scenarios and significantly improves upon existing datasets, particularly in object removal. 
Through efforts from both model and data perspectives, \textbf{MagicPaint} enables users to operate anything—add, remove or inpaint content which is specified through either text or image modalities in a seamless and unified manner.
Extensive experiments demonstrate that MagicPaint achieves state-of-the-art performance across three key tasks (i.e., text-guided addition, image-guided addition, and object removal) and produces outputs with superior visual consistency and contextual fidelity compared to existing methods. Our code and data will be publicly released.

Downloads

Next from AAAI 2026

Training-Free ANN-to-SNN Conversion for High-Performance Spiking Transformers

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES

.css-70qvj9{display:-webkit-box;display:-webkit-flex;display:-ms-flexbox;display:flex;-webkit-align-items:center;-webkit-box-align:center;-ms-flex-align:center;align-items:center;}Downloads

Next from AAAI 2026

Training-Free ANN-to-SNN Conversion for High-Performance Spiking Transformers

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES

Downloads