Singapore

Recently, large language models (LLMs) have been explored widely for 3D scene understanding. Among them, training-free approaches are gaining attention for their flexibility and generalization over training-based methods. However, they typically struggle with accuracy and efficiency in practical deployment. To address the problems, we propose Sparse3DPR, a novel training-free framework for open-ended scene understanding, which leverages the reasoning capabilities of pre-trained LLMs and requires only sparse-view RGB inputs. Specifically, we introduce a hierarchical plane-enhanced scene graph that supports open vocabulary and adopts dominant planar structures as spatial anchors, which enables clearer reasoning chains and more reliable high-level inferences. Furthermore, we design a task-adaptive subgraph extraction method to filter query-irrelevant information dynamically, reducing contextual noise and improving 3D scene reasoning efficiency and accuracy. Experimental results demonstrate the superiority of Sparse3DPR, which achieves a 28.7\% EM@1 improvement and a 78.2\% speedup compared with ConceptGraphs on the Space3D-Bench. Moreover, Sparse3DPR obtains comparable performance to training-based methods on ScanQA, with additional real-world experiments confirming its robustness and generalization capability.

AAAI 2026

Sparse3DPR: Training-Free 3D Hierarchical Scene Parsing and Task-Adaptive Subgraph Reasoning from Sparse RGB Views

cv: scene analysis & understanding

cv: 3d computer vision

cv: language and vision

technical paper

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.<br><br>

To access this event page, you need to log in with the **email address you registered with**. <br>Access credentials will be sent to your email from Underline -  subject line "Welcome to AAAI 2026". Please be sure to check your spam email folder if you do not see an email confirmation right away.

Please log in

To access this event page, you are required to register.
Please complete your registration to continue.

We recommend reading [**the registration information**](https://aaai.org/conference/aaai/aaai-26/registration/) first.

**Online Registration Form**: https://aaai.getregistered.net/conference-2026 

Registration Required

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

Recent advancements in personalized Text-to-Video (T2V) generation have made significant strides in synthesizing character-specific content. However, these methods face a critical limitation: the inability to perform fine-grained control over motion intensity. This limitation stems from an inherent entanglement of action semantics and their corresponding magnitudes within coarse textual descriptions, hindering the generation of nuanced human videos and limiting their applicability in scenarios demanding high precision, such as animating virtual avatars or synthesizing subtle micro-expressions. Furthermore, existing approaches often struggle to preserve high identity fidelity when other attributes are modified. To address these challenges, we introduce MotionCharacter, a framework for high-fidelity human video generation with precise motion control. At its core, MotionCharacter explicitly decouples motion into two independently controllable components: action type and motion intensity. This is achieved through two key technical contributions: (1) a Motion Control Module that leverages textual phrases to specify the action type and a quantifiable metric derived from optical flow to modulate its intensity, guided by a region-aware loss that localizes motion to relevant subject areas; and (2) an ID Content Insertion Module coupled with an ID-Consistency loss to ensure robust identity preservation during dynamic motions. To facilitate training for such fine-grained control, we also curate Human-Motion, a new large-scale dataset with detailed annotations for both motion and facial features. Extensive experiments demonstrate that MotionCharacter achieves substantial improvements over existing methods. Our framework excels in generating videos that are not only identity-consistent but also precisely adhere to specified motion types and intensities. The code, dataset and models will be made publicly available upon acceptance.

MotionCharacter: Fine-Grained Motion Controllable Human Video Generation

Spatial transcriptomics provides unprecedented opportunities to analyze gene expression patterns while preserving spatial tissue architecture. However, traditional deep learning methods face significant challenges in multi-modal data integration, spatial dependency modeling, and biological knowledge incorporation, while existing large language models (LLMs) lack explicit spatial modeling capabilities for transcriptomic data. To address these limitations, we present ST-LLM (Spatial Transcriptomics Embedding with Large Language Models), a novel approach that transforms complex spatial graph structures into structured textual representations suitable for LLMs through innovative prompt engineering. ST-LLM features three key components: dynamic graph adjacency construction using reinforcement learning to adaptively optimize spatial relationships, graph-to-text conversion that creates hierarchical descriptions with spatial context, and comprehensive utilization of pre-trained semantic understanding to generate high-dimensional spatial-aware embeddings. Comprehensive experiments on 14 datasets demonstrate that ST-LLM consistently outperforms state-of-the-art methods in spatial domain clustering and region detection tasks. Our framework establishes LLM embeddings as a simple yet powerful paradigm for encoding spatial transcriptomics biological knowledge, opening new avenues for computational spatial biology research.

ST-LLM: Spatial Transcriptomics Embedding with Large Language Models

Experimental design is critical for evidence-based decision-making in healthcare, marketing, and public policy. However, designing efficient experiments across heterogeneous subgroups presents significant challenges. Existing methods often optimize for statistical power or overall sample efficiency, overlooking crucial fairness considerations across these different subgroups. To address this gap, we introduce a Fairness-Aware Contextual Track-and-Stop Design (F-CTSD) algorithm. The proposed F-CTSD algorithm provides statistical guarantees on subgroup fairness while minimizing required sample sizes. We quantify the fairness-efficiency trade-off and derive the sample complexity bound for the proposed F-CTSD algorithm under its fairness constraints. We further theoretically prove that the proposed F-CTSD algorithm consistently produces accurate treatment effect estimates even under fairness requirements, enhancing statistical reliability. Numerical experiments show that the proposed F-CTSD algorithm outperforms existing methods, achieving higher sample efficiency while reducing subgroup fairness violations by 4.95\%.

Fairness-Aware Design for Contextual Experiments: Guaranteeing Reliability and Equity in Heterogeneous Subgroups

Real-world knowledge graphs (KGs) contain not only standard triple-based facts, but also more complex, heterogeneous types of facts, such as hyper-relational facts with auxiliary key-value pairs, temporal facts with additional timestamps, and nested facts that imply relationships between facts. These richer forms of representation have attracted significant attention due to their enhanced expressiveness and capacity to model complex semantics in real-world scenarios. However, most existing studies suffer from two main limitations: (1) they typically focus on modeling only specific types of facts, thus making it difficult to generalize to real-world scenarios with multiple fact types; and (2) they struggle to achieve generalizable hierarchical (inter-fact and intra-fact) modeling due to the complexity of these representations. To overcome these limitations, we propose UniHR, a Unified Hierarchical Representation learning framework, which consists of a learning-optimized Hierarchical Data Representation (HiDR) module and a unified Hierarchical Structure Learning (HiSL) module. The HiDR module unifies hyper-relational KGs, temporal KGs, and nested factual KGs into triple-based representations. Then HiSL incorporates intra-fact and inter-fact message passing, focusing on enhancing both semantic information within individual facts and enriching the structural information between facts. To go beyond the unified method itself, we further explore the potential of unified representation in complex real-world scenarios. Extensive experiments on 9 datasets across 5 types of KGs demonstrate the effectiveness of UniHR and highlight the strong potential of unified representations. Code and data are available at https://github.com/zjukg/UniHR.

UniHR: Hierarchical Representation Learning for Unified Knowledge Graph Link Prediction

Domain adaptive retrieval aims to transfer knowledge from a labeled source domain to an unlabeled target domain, enabling effective retrieval while mitigating domain discrepancies. However, existing methods encounter several fundamental limitations: 1) neglecting class-level semantic alignment and excessively pursuing pair-wise sample alignment; 2) lacking either pseudo-label reliability consideration or geometric guidance for assessing label correctness; 3) directly quantizing original features affected by domain shift, undermining the quality of learned hash codes. In view of these limitations, we propose Prototype-based Semantic Consistency Alignment (PSCA), a two-stage framework for effective domain adaptive retrieval. In the first stage, a set of orthogonal prototypes directly establishes class-level semantic connections, maximizing inter-class separability while gathering intra-class samples. During the prototype learning, geometric proximity provides a reliability indicator for semantic consistency alignment through adaptive weighting of pseudo-label confidences. The resulting membership matrix and prototypes facilitate feature reconstruction, ensuring quantization on reconstructed rather than original features, thereby improving subsequent hash coding quality and seamlessly connecting both stages. In the second stage, domain-specific quantization functions process the reconstructed features under mutual approximation constraints, generating unified binary hash codes across domains. Extensive experiments validate PSCA's superior performance across multiple datasets.

Prototype-Based Semantic Consistency Alignment for Domain Adaptive Retrieval

Multi-person pose estimation in real-world scenarios remains a challenging task due to frequent occlusions, scale variations, and complex human interactions. Existing methods often rely on fixed keypoint association patterns that fail to capture the dynamic and context-dependent nature of human body topologies, leading to misalignment and false detections. In this work, we propose a topology-aware dynamic association framework that adaptively models inter-keypoint relationships conditioned on local context and pose topology. The proposed framework comprises three stages: a human-to-keypoint detection module for coarse localization, a dynamic keypoint association module that learns flexible connectivity patterns between joints, and a fine-grained refinement module for precise pose adjustment. By integrating topological priors into dynamic learning and multi-stage optimization, our proposed method effectively mitigates the issues caused by occlusions and overlapping instances. Extensive experiments on benchmark datasets demonstrate that our approach achieves state-of-the-art performance, especially in crowded and occlusion-heavy scenes.

Learning Topology-Aware Dynamic Associations for Robust Multi-Person Pose Estimation

The Multi-agent Path Finding (MAPF) problem involves finding collision-free paths for a team of agents in a known, static environment, with important applications in warehouse automation, logistics, or last-mile delivery. To meet the needs of these large-scale applications, current learning-based methods often deploy the same fully trained, decentralized network to all agents to improve scalability. However, such parameter sharing typically results in homogeneous behaviors among agents, which may prevent agents from breaking ties around symmetric conflict (e.g., bottlenecks) and might lead to live-/deadlocks. In this paper, we propose SYLPH, a novel learning-based MAPF framework aimed to mitigate the adverse effects of homogeneity by allowing agents to learn and dynamically select different social behaviors (akin to individual, dynamic roles), without affecting the scalability offered by parameter sharing. Specifically, SYLPH offers a novel hierarchical mechanism by introducing Social Value Orientation (SVO) as a temporally extended latent variable that plays a central role in both policy generation and reward assignment. To support this hierarchical decision-making process, we introduce Social-aware Multi-Policy PPO (SMP3O), a reinforcement learning method that ensures stable and effective training through a mechanism for the cross-utilization of advantages. Moreover, we design an SVO-based learning tie-breaking algorithm, allowing agents to proactively avoid collisions, rather than relying solely on post-processing techniques. As a result of this hierarchical decision-making and exchange of social preferences, SYLPH endows agents with the ability to reason about the MAPF task through more latent spaces and nuanced contexts, leading to varied responses that can help break ties around symmetric conflicts. Our comparative experiments show that SYLPH achieves state-of-the-art performance, surpassing other learning-based MAPF planners in random, room-like, and maze-like maps, while our ablation studies demonstrate the advantages of each component in SYLPH. We finally experimentally validate our trained policies on hardware in three types of maps, showing how SYLPH allows agents to find high-quality paths under real-life conditions. Our code and videos are available at: marmotlab.github.io/mapf_sylph.

Social Behavior as a Key to Learning-based Multi-Agent Pathfinding Dilemmas

Masked Image Modeling (MIM) has been widely recognized as a powerful self-supervised paradigm for learning general-purpose visual representations. However, standard MIM based on random masking tends to underperform in domain-specific tasks like Scene Text Recognition (STR), due to challenges such as information sparsity and appearance discrepancies caused by partial occlusion or distortion. To address this issue, we propose a novel pre-training framework called Appearance Discrepancy-guided Sequence Hybrid Masking (DSHM), specifically designed to learn robust representations for STR. To this end, we introduce an Appearance Discrepancy Metric that quantifies the discrepancy level of each image patch by measuring its deviation from anisotropic local discrepancy and intra-instance global style discrepancy. The resulting discrepancy scores are utilized in two key components: (1) A Sequence Hybrid Masking strategy, which prioritizes masking high-discrepancy patches in coherent block forms, thereby elevating the pretext task from simple pixel-level completion to more complex structural reasoning; (2) Discrepancy-Conditioned Tokens (DC-Tokens), which encode prior knowledge about patch difficulty into the decoder, enabling an adaptive reconstruction process and improving the model robustness under scenarios with partial occlusion or text distortion. We achieve competitive performance on multiple benchmark datasets, including common benchmarks, Union14M benchmarks, and Chinese benchmarks.

Appearance Discrepancy-guided Sequence Hybrid Masking for Robust Scene Text Recognition

Model-heterogeneous federated tuning (MHFT) enables the privacy-preserving fine-tuning of foundation models in heterogeneous systems by allowing clients and the server to adopt different model architectures. Depth partial training—where each client updates only a subset of the model's layers—alleviates system heterogeneity but exacerbates client drift, which stems from clients optimizing different objectives and therefore degrades overall performance. Beyond the well-known statistical bias—where non-IID data leads to client drift—we identify a \textbf{structural bias} arising from clients deploying only partial layers of the global model, which serves as an important cause of drift. We further provide a theoretical analysis showing that the possible range of structural bias expands linearly with the number of missing layers. To counter this effect, we introduce \texttt{FedBRICK} (Federated Bias Recovery via Inserted Calibrative Kernels), which inserts tiny \texttt{BRICK}s into each client’s subnetwork. We employ a dual-end layer-wise distillation scheme to train these blocks using both client-side local data and a small public proxy set on the server. This design effectively mitigates the structural bias caused by layer dropping, reduces client drift, and remains practical for storage-constrained devices. Extensive experiments on federated learning benchmarks confirm that \texttt{FedBRICK} delivers up to a $5\%$ average accuracy gain while requiring no more than $1.44\%$ extra storage per client.

FedBRICK: Structural Bias Aware Heterogeneous Foundation Model Federated Tuning

Digital circuit representation learning has made remarkable progress in electronic design automation, effectively supporting critical tasks such as testability analysis and logic reasoning. However, representation learning for analog circuits remains challenging due to their continuous electrical characteristics compared to the discrete states of digital circuits. This paper presents a direct current (DC) electrically equivalent-oriented analog representation learning framework, named **KCLNet**. We will open-source the dataset and code upon publication. It comprises an asynchronous graph neural network structure with electrically-simulated message passing and a representation learning method inspired by Kirchhoff's Current Law (KCL). This method maintains the orderliness of the circuit embedding space by enforcing the equality of the sum of outgoing and incoming current embeddings at each node, which significantly enhances the generalization ability of circuit embeddings. KCLNet offers a novel and effective solution for analog circuit representation learning with electrical constraints preserved. Experimental results demonstrate that our method achieves significant performance in a variety of downstream tasks, e.g., analog circuit classification, subcircuit detection, and circuit edit distance prediction.

Downloads

Next from AAAI 2026

MotionCharacter: Fine-Grained Motion Controllable Human Video Generation

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES