Large language models (LLMs) present a paradox: they can correctly answer a multi-hop factual query in a high-resource language like English, yet fail on the identical query in another language. This raises a fundamental question about the nature of multilingual knowledge: are facts missing, or merely inaccessible? The underlying mechanisms for this knowledge gap have remained largely unexplored. In this work, we resolve this question by introducing a mechanistic interpretability framework that traces the causal pathways of multi-hop knowledge reasoning. Our analysis reveals a core, non-obvious finding: cross-lingual inconsistencies do not stem from a knowledge deficit. Instead, factual knowledge is robustly stored in a set of shared, language-agnostic semantic neurons. The failure originates from misaligned attention pathways, where a common set of critical attention heads fails to correctly route information along the reasoning chain to the appropriate knowledge neurons in lower-resource languages. This mechanistic diagnosis motivates a targeted alignment strategy: surgical fine-tuning of only these critical heads. Experiments demonstrate that our method achieves significant improvements in multilingual multi-hop factuality, with positive cross-lingual transfer, while uniquely preserving general model capabilities, offering a scalable and mechanistically grounded approach to building more reliable multilingual models.
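To make the "surgical fine-tuning" idea concrete, the sketch below shows one way to restrict gradient updates to a chosen set of attention heads while freezing everything else. This is an illustration, not the authors' released code: the model name, the list of critical (layer, head) pairs, and the use of gradient hooks are all assumptions, and it presumes a Llama-style HuggingFace model with standard multi-head attention (equal numbers of query and key/value heads) whose layers expose `self_attn.{q,k,v,o}_proj` as `nn.Linear` modules.

```python
# Minimal sketch of head-level "surgical" fine-tuning (illustrative, not the paper's code).
# Assumes a Llama-style model; the checkpoint name and head list are placeholders.
from collections import defaultdict

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Hypothetical output of a causal-tracing analysis: heads whose routing is misaligned.
critical_heads = [(12, 3), (12, 7), (18, 1)]  # (layer_index, head_index)

# Freeze every parameter; attention projections containing a critical head are
# re-enabled below, with gradients masked down to that head's slice.
for p in model.parameters():
    p.requires_grad = False

cfg = model.config
head_dim = cfg.hidden_size // cfg.num_attention_heads

def mask_rows(grad, slices):
    """Keep gradient only on the output rows belonging to the selected heads (q/k/v_proj)."""
    masked = torch.zeros_like(grad)
    for lo, hi in slices:
        masked[lo:hi, :] = grad[lo:hi, :]
    return masked

def mask_cols(grad, slices):
    """Keep gradient only on the input columns belonging to the selected heads (o_proj)."""
    masked = torch.zeros_like(grad)
    for lo, hi in slices:
        masked[:, lo:hi] = grad[:, lo:hi]
    return masked

# Group the critical heads by layer and convert head indices to row/column ranges.
heads_per_layer = defaultdict(list)
for layer, head in critical_heads:
    heads_per_layer[layer].append((head * head_dim, (head + 1) * head_dim))

for layer, slices in heads_per_layer.items():
    attn = model.model.layers[layer].self_attn
    for proj in (attn.q_proj, attn.k_proj, attn.v_proj):
        proj.weight.requires_grad = True
        proj.weight.register_hook(lambda g, s=slices: mask_rows(g, s))
    attn.o_proj.weight.requires_grad = True
    attn.o_proj.weight.register_hook(lambda g, s=slices: mask_cols(g, s))

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5
)
# ...standard training loop over multilingual multi-hop examples would go here...
```

The masking relies on the fact that, in this architecture, rows of the q/k/v projection weights and columns of the output projection weight partition cleanly by head, so zeroing the other slices' gradients confines updates to the targeted heads.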
