Singapore

Recent embodied agents built on large vision-language models (LVLMs) combine multimodal perception with reasoning and excel at autonomously interacting with real or cyber worlds, helping people make intelligent decisions in complex environments. However, current approaches are normally optimized on golden action trajectories or ideal task-oriented solutions toward a definitive goal. This paradigm considers few user-oriented factors, which could explain their reduced performance across a wide range of personal assistant applications. To address this, we propose Chain-of-User-Thought (COUT), a novel embodied reasoning paradigm that follows a chain of thought from basic action thinking to explicit and implicit personalized preference thought, incorporating personalized factors into autonomous agent learning. The main challenges in achieving COUT are: 1) defining embodied personalized tasks, 2) building an embodied environment that epitomizes personalized preferences, and 3) modeling embodied personalized actions. To realize COUT, we introduce SmartAgent, an agent framework that perceives cyber environments and reasons about personalized requirements by: 1) interacting with a GUI to access an item pool, 2) generating users' explicit requirements implied by previous actions, and 3) recommending items that fulfill users' implicit requirements. To demonstrate SmartAgent's capabilities, we also create SmartSpot, a brand-new dataset that offers a full-stage, personalized, action-involved environment. To the best of our knowledge, our work is the first to formulate the COUT process, serving as a preliminary attempt toward embodied personalized agent learning. Our extensive experiments on SmartSpot demonstrate SmartAgent's functionality across a series of embodied and personalized sub-tasks. Our data and code are available at https://github.com/tsinghua-fib-lab/SmartAgent.

AAAI 2026

SmartAgent: Chain-of-User-Thought for Embodied Personalized Agent in Cyber World

hai: applications

hai: human-aware planning and behavior prediction

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.

Embodied visual navigation remains a challenging task, as agents must explore unknown environments with limited knowledge. Existing zero-shot studies have shown that incorporating memory mechanisms to support goal-directed behavior can improve long-horizon planning performance. However, they overlook visual frontier boundaries, which fundamentally dictate future trajectories and observations, and fall short of inferring the relationship between partial visual observation and navigation goals. In this paper, we propose Semantic Cognition Over Potential-based Exploration (SCOPE), a zero-shot framework that explicitly leverages frontier information to drive potential-based exploration, enabling more informed and goal-relevant decisions. SCOPE estimates exploration potential with a Vision-Language Model and organizes it into a spatio-temporal potential graph, capturing boundary dynamics to support long-horizon planning. In addition, SCOPE incorporates a self-reconsideration mechanism that revisits and refines prior decisions, enhancing reliability and reducing overconfident errors. Experimental results on two diverse embodied navigation tasks show that SCOPE outperforms state-of-the-art baselines by 4.6% in accuracy. Further analysis demonstrates that its core components lead to improved calibration, stronger generalization, and higher decision quality.

Expand Your SCOPE: Semantic Cognition over Potential-Based Exploration for Embodied Visual Navigation

Multimodal Large Language Models (MLLMs) have shown remarkable progress in temporal or spatial localization tasks, but struggle with joint spatio-temporal video grounding (STVG). We identify two fundamental bottlenecks hindering this capability: (1) the sheer number of visual tokens makes long-range and fine-grained visual modeling challenging; (2) generating a long sequence of bounding boxes in text makes it difficult to accurately align each box with its specific video frame. Distinct from prior efforts that rely on attaching complex modules, we argue for a more elegant paradigm that unlocks the inherent potential of MLLMs and leverages their strengths. To this end, we propose **SpaceVLLM**, an MLLM equipped with spatio-temporal video grounding capabilities. Specifically, we propose Spatio-Temporal Aware Queries, interleaved with video frames, to guide the MLLM in capturing both static appearance and dynamic motion features. We further present a lightweight Query-Guided Space Head that maps queries to precise spatio-temporal coordinates, bypassing the need for direct textual coordinate generation and enabling the MLLM to focus on video understanding. To further facilitate research in this area, we propose an automated data synthesis pipeline to construct the **V-STG** dataset, comprising 110K STVG instances. Extensive experiments demonstrate that SpaceVLLM achieves state-of-the-art performance on STVG benchmarks and maintains strong performance on various video understanding tasks, validating our approach's effectiveness. Our code, dataset, and model will be released.

SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability

Large language models (LLMs) have been increasingly applied across a wide range of domains. However, recent studies have identified the presence of certain glitch tokens in their vocabularies, which can trigger hallucinations and lead to unpredictable or even harmful outputs. While various methods have been proposed to detect such tokens, effectively repairing them remains a key challenge for ensuring the reliability of LLMs. In this work, we propose GlitchCleaner, a lightweight yet effective approach to mitigate the adverse effects caused by glitch tokens. GlitchCleaner introduces auxiliary branches into specific components within selected layers of the model, enabling efficient and targeted token repair. These branches are implemented using the low-rank adaptation (LoRA) technique, adding less than 0.1% additional parameters to the original model. Furthermore, a gating mechanism dynamically controls the activation of these branches based on the model’s input, ensuring precise intervention without disrupting normal inference behavior. Experimental results across multiple mainstream models demonstrate that our method achieves an average repair rate of 86.88%, representing an improvement of over 30% compared to existing approaches, while ensuring lossless preservation of the model’s baseline capabilities and causing negligible impact on inference speed.

GlitchCleaner: Lightweight Glitch Tokens Repairing by Lossless Gated LoRA in Large Language Models
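The gated low-rank branch described in the GlitchCleaner abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the sigmoid gate, the helper names (`matvec`, `gated_lora_forward`), and all weights below are assumptions chosen only to show how a gate can switch a LoRA-style update on for some inputs and leave the base layer nearly untouched for others.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_lora_forward(x, W, A, B, gate_w, gate_b):
    """Compute y = W x + g(x) * B (A x).

    W is the frozen base weight, A and B form the low-rank branch,
    and g(x) = sigmoid(gate_w . x + gate_b) is a hypothetical input-dependent gate.
    Returns the output vector and the gate value.
    """
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))  # project down to rank r, then back up
    gate = sigmoid(sum(w * xi for w, xi in zip(gate_w, x)) + gate_b)
    return [b + gate * d for b, d in zip(base, low_rank)], gate

# Toy 2-dimensional layer with a rank-1 branch (all values are illustrative).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]            # shape (r=1, d=2)
B = [[0.5], [-0.5]]         # shape (d=2, r=1)
gate_w, gate_b = [10.0, 0.0], -5.0

flagged_out, flagged_gate = gated_lora_forward([1.0, 0.0], W, A, B, gate_w, gate_b)
normal_out, normal_gate = gated_lora_forward([0.0, 1.0], W, A, B, gate_w, gate_b)
```

In this toy setup the gate opens for the first input and stays almost closed for the second, so the second input passes through the base weights nearly unchanged.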

Current federated-learning models deteriorate under heterogeneous (non-I.I.D.) client data, as their feature representations diverge and pixel- or patch-level objectives fail to capture the global topology that is essential for high-dimensional visual tasks. We propose **FedTopo**, a framework that integrates **Topology-Guided Block Screening (TGBS)** and **Topological Embedding (TE)** to leverage topological information, yielding coherently aligned cross-client representations through a **Topological Alignment Loss (TAL)**.

First, **Topology-Guided Block Screening (TGBS)** automatically identifies the most topology-informative block by selecting the layer with the highest topological separability, i.e., whose persistence-based signatures best distinguish within- versus between-class pairs, ensuring that subsequent analysis focuses on topology-rich features. Next, this block yields a compact **Topological Embedding**, which quantifies the topological information for each client. Finally, a **Topological Alignment Loss (TAL)** guides clients to maintain topological consistency with the global model during optimization, reducing representation drift across rounds.

Experiments on Fashion-MNIST, CIFAR-10, and CIFAR-100 under four non-I.I.D. partitions show that **FedTopo** accelerates convergence and improves accuracy over strong baselines. Code is available in Supplementary Materials.

FedTopo: Topology-Informed Representation Alignment in Federated Learning Under Non-I.I.D. Conditions

Vision foundation models (VFMs) have demonstrated remarkable capabilities in learning universal visual representations. However, adapting these models to downstream tasks conventionally requires parameter updates, with even parameter-efficient fine-tuning methods necessitating the modification of thousands to millions of weights. In this paper, we investigate the redundancies in the segment anything model (SAM) and then propose a novel parameter-free fine-tuning method. Unlike traditional fine-tuning methods that adjust parameters, our method emphasizes **selecting**, **reusing**, and **enhancing** pre-trained features, offering a new perspective on fine-tuning foundation models. Specifically, we introduce a channel selection algorithm based on the model's output difference to identify redundant and effective channels. By selectively replacing the redundant channels with more effective ones, we filter out less useful features and reuse more task-relevant features for downstream tasks, thereby enhancing the task-specific feature representation. Experiments on both out-of-domain and in-domain datasets demonstrate the efficiency and effectiveness of our method in different vision tasks (e.g., image segmentation, depth estimation, and image classification). Notably, our approach can seamlessly integrate with existing fine-tuning strategies (e.g., LoRA, Adapter), further boosting the performance of already fine-tuned models. Moreover, since our channel selection involves only model inference, our method significantly reduces GPU memory overhead.

Parameter-Free Fine-tuning via Redundancy Elimination for Vision Foundation Models

In this paper, we develop a novel local graph pooling method, namely the Separated Subgraph-based Hierarchical Pooling (SSHPool), for graph classification. We commence by assigning the nodes of a sample graph to different clusters, resulting in a family of separated subgraphs. We then apply local graph convolution units to each subgraph individually, compressing it into a coarsened node and thereby transforming the original graph into a coarsened graph. Since these subgraphs are separated by different clusters and structural information cannot be propagated between them, the local convolution operation can largely avoid the over-smoothing problem caused by message passing through edges in most existing Graph Neural Networks (GNNs). By hierarchically performing the proposed procedures on the resulting coarsened graph, the proposed SSHPool can effectively extract the hierarchical global features of the original graph structure, encapsulating rich intrinsic structural characteristics. Furthermore, we develop an end-to-end GNN framework associated with the SSHPool module for graph classification. Experimental results demonstrate the superior performance of the proposed model on real-world datasets. Our code is available at https://anonymous.4open.science/r/SSHPool-FB16.

SSHPool: The Separated Subgraph-based Hierarchical Pooling

Task-specific data selection, which aims to identify the most relevant training instances from a large corpus to optimize performance on a target task, is a critical challenge in modern AI. Prevailing methods typically rely on either representation clustering or gradient-based influence estimation. However, these approaches have notable limitations. Representation-based methods rely on static features; they measure semantic proximity but are agnostic to the process of learning. Conversely, influence-based methods, while capturing optimization directions, often focus narrowly on aligning with the validation loss, which may not fully correlate with the desired capabilities. To address these issues, we propose TRACE, a novel algorithm that simultaneously considers data consistency in the optimization direction and the representation space, and performs TRajectory-based Activation Change Estimation to select instruction data. Specifically, TRACE first performs a targeted weight update using the validation set. It then captures the optimization trajectory by calculating the change in neuron activations for each sample before and after this update. By selecting data whose activation changes are most similar to those of the validation set, TRACE ensures alignment in both the representational and optimization domains. Our experiments demonstrate that TRACE outperforms baseline methods across various tasks, particularly in complex, data-scarce scenarios.

TRACE: Trajectory-based Activation Change Estimation for Task-specific Data Selection
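The selection step described in the TRACE abstract (rank candidates by how closely their activation change matches that of the validation set) can be sketched as follows. This is an illustrative reading only: cosine similarity as the matching score, the function names, and the toy activation vectors are all assumptions, and the weight update that produces the "before" and "after" activations is taken as given.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def activation_change(before, after):
    """Element-wise change in activations caused by the weight update."""
    return [a - b for b, a in zip(before, after)]

def select_by_activation_change(candidates, validation_change, k):
    """Return the k candidate names whose activation change best matches the validation set.

    candidates maps a sample name to a (before, after) pair of activation vectors.
    """
    scored = [
        (cosine(activation_change(before, after), validation_change), name)
        for name, (before, after) in candidates.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]

# Hypothetical activations recorded before and after the targeted update.
validation_change = [1.0, 0.0]
candidates = {
    "a": ([0.0, 0.0], [2.0, 0.1]),   # moves in nearly the same direction
    "b": ([0.0, 0.0], [0.0, 1.0]),   # moves orthogonally
    "c": ([1.0, 1.0], [0.0, 1.0]),   # moves in the opposite direction
}
selected = select_by_activation_change(candidates, validation_change, k=2)
```

Here `selected` keeps the two samples whose activations shift most like the validation set and drops the one that shifts in the opposite direction.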

Understanding how localized changes in one variable affect others in multivariate time series is essential for diagnostics and decision-making in complex systems. Existing models often fail to capture realistic inter-feature dynamics when simulating "what-if" scenarios, leading to inaccurate or uncorrelated reconstructions. We propose CFORVAE, a variational autoencoder framework that explicitly addresses this limitation by combining temporal decomposition with frequency-domain feature correlation modeling. Our architecture uses a dual-path encoding of trend and seasonal components, each projected into attention-pooled latent spaces, and applies Fourier Neural Operators (FNO) to capture cross-feature dependencies in the spectral domain. This decomposition-correlation design enables component-specific latent manipulation and ensures that local modifications propagate realistically across correlated variables. Through extensive experiments, we show that CFORVAE outperforms state-of-the-art baselines in preserving temporal and feature-level dependencies, especially under adjustment-based reconstructions, making it a powerful tool for interpretable "what-if" analysis and diagnostics.

Intervention-Aware Time Series Modeling: Capturing and Evaluating Feature Dependencies

Metasurfaces are ultrathin, engineered materials composed of nanostructures that manipulate light in ways unattainable by natural materials. Recent advances have leveraged computational optimization, machine learning, and deep learning to automate their design. However, existing approaches exhibit two fundamental limitations: (1) they often restrict the model to generating only a subset of design parameters, and (2) they rely on heavily downsampled spectral targets, which compromises both the novelty and accuracy of the resulting structures. The core challenge lies in developing a generative model capable of exploring a large, unconstrained design space while precisely capturing the intricate physical relationships between material parameters and their high-resolution spectral responses. In this paper, we introduce MetaDiT, a novel framework for high-fidelity metasurface design that addresses these limitations. Our approach leverages a robust spectrum encoder pretrained with contrastive learning, providing strong conditional guidance to a Diffusion Transformer-based backbone. Experiments demonstrate that MetaDiT outperforms existing baselines in spectral accuracy, and we further validate our method through extensive ablation studies. Our code and model weights will be open-sourced to facilitate future research.

MetaDiT: Enabling Fine-grained Constraints in High-Degree-of-Freedom Metasurface Design

Recent advances in the field of sequential recommendation have highlighted the potential of Large Language Models (LLMs) in enhancing item embeddings and improving user understanding. However, existing approaches face three major limitations: 1) insufficient understanding of the reasons behind users' purchase decisions, 2) poor compatibility between the high-dimensional embeddings directly produced by LLMs and traditional low-dimensional ID embeddings, and 3) reliance on additional fine-tuning and high inference overhead to adapt LLMs to the recommendation task. In this paper, we propose MoMoREC, a simple yet effective user-understanding-based recommendation strategy. This method leverages the intrinsic comprehension capabilities of LLMs combined with residual semantic IDs to better understand users. Specifically, starting from common user purchasing behaviors and incorporating item characteristics, we employ a multi-agent framework in which LLMs analyze user shopping motivations and extract high-dimensional dense embeddings. These embeddings are then transformed into low-dimensional IDs using a residual semantic ID approach via clustering and residual dimensionality reduction, and the resulting IDs can be fed into the recommendation model. MoMoREC effectively integrates the understanding power of LLMs with the strengths of recommendation systems, preserving rich semantic language embeddings while reducing or eliminating the need for auxiliary trainable modules. As a result, it seamlessly adapts to any sequential recommendation framework. Experiments on three benchmark datasets show that MoMoREC significantly improves traditional recommendation models, demonstrating its effectiveness and flexibility.
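The idea of turning a dense embedding into a short sequence of discrete IDs can be sketched with generic residual quantization. This is an assumption about what a "residual semantic ID" step might look like, not the method from the abstract: the codebooks below are hand-written stand-ins for cluster centroids (for example from k-means), and the function names are hypothetical.

```python
def nearest(codebook, vector):
    """Index of the codebook entry closest to the vector (squared Euclidean distance)."""
    def sqdist(center):
        return sum((c - v) ** 2 for c, v in zip(center, vector))
    return min(range(len(codebook)), key=lambda i: sqdist(codebook[i]))

def residual_semantic_id(embedding, codebooks):
    """Quantize an embedding level by level.

    At each level the nearest centroid index is recorded, and the centroid is
    subtracted so the next level only has to encode what is left over.
    Returns the list of IDs and the final residual.
    """
    residual = list(embedding)
    ids = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        ids.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return ids, residual

# Two illustrative levels of centroids for 2-dimensional embeddings.
codebooks = [
    [[0.0, 0.0], [4.0, 4.0]],               # coarse level
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],   # finer level applied to the residual
]
ids, residual = residual_semantic_id([4.9, 4.1], codebooks)
```

Each additional level refines the previous one, so a handful of small integer IDs can stand in for a high-dimensional vector when feeding an ID-based recommender.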
