Point cloud quality assessment (PCQA) is essential for reliable 3D visual applications. Point-based methods struggle to characterize distortions because of the unordered nature of point clouds, while projection-based approaches are more efficient but suffer from insensitivity to geometric distortion and blind spots in texture representation. This study proposes SAF-Net, a multi-view structure-aware feature fusion network for PCQA. We first identify two key limitations of projection-based methods: insufficient perception of geometric distortion and representation blind spots (RBS) in texture images. To address these issues, SAF-Net integrates object mask maps and local binary pattern (LBP) maps. The mask maps enhance geometric distortion perception by capturing edge sharpness and curvature variations, while the LBP maps encode essential structural information that overcomes RBS and aligns with the sensitivity of the human visual system (HVS). SAF-Net employs a hybrid CNN-ViT architecture to balance local feature extraction with global context modeling, along with a progressive fusion strategy that optimizes cross-modal feature interaction. Extensive experiments demonstrate the superior performance of SAF-Net on multiple benchmarks, establishing new state-of-the-art results in PCQA.
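To make the LBP-map idea concrete, the sketch below computes a basic 8-neighbour LBP map for a grayscale projection image using only NumPy. This is the classic LBP formulation, not the paper's exact variant (the abstract does not specify radius, sampling, or rotation-invariance choices, so those are assumptions here); each pixel's 8 neighbours are compared against the centre and the comparison bits are packed clockwise into one byte, yielding a texture-structure code per pixel.

```python
import numpy as np

def lbp_map(img):
    """Basic 8-neighbour local binary pattern (LBP) map.

    For each interior pixel, compare its 8 neighbours to the centre value;
    a neighbour >= centre contributes a 1-bit. Bits are packed clockwise
    (starting top-left) into an 8-bit code, giving values in [0, 255].
    Output shape is (H-2, W-2) because border pixels lack full neighbourhoods.
    """
    img = np.asarray(img, dtype=np.float64)
    c = img[1:-1, 1:-1]  # centre pixels
    # Clockwise neighbour offsets, starting at the top-left neighbour.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        # Shifted view of the image aligned with the centre block.
        n = img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx]
        code |= (n >= c).astype(np.uint8) << bit
    return code
```

Because the code depends only on local intensity ordering, it is invariant to monotonic illumination changes, which is one reason LBP-style features are a reasonable proxy for the structural patterns the HVS is sensitive to.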
