This paper explores the challenges of integrating tactile sensing into intelligent systems for multimodal reasoning, particularly in enabling commonsense reasoning about the open-ended physical world. We identify two key challenges: modality discrepancy, where existing touch-language models treat touch as a mere sub-modality of language and leave the semantic differences between the two modalities unaddressed, and open-ended tactile data scarcity, where current datasets lack the diversity, open-endedness, and complexity needed for reasoning. To overcome these challenges, we introduce STOLA, a Self-Adaptive Touch-Language framework. STOLA uses a Mixture-of-Experts (MoE) architecture to dynamically process, unify, and manage the tactile and language modalities while preserving their distinct characteristics. Crucially, we also present a comprehensive tactile commonsense reasoning dataset and benchmark featuring free-form questions and responses, 8 physical properties, 4 interactive characteristics, and diverse commonsense knowledge. Experiments show that STOLA achieves competitive performance against existing models on the PHYSICLEAR benchmark and our self-constructed datasets, demonstrating the effectiveness of the MoE architecture for multimodal management and its advantages on open-scenario tactile commonsense reasoning tasks.
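The abstract names a Mixture-of-Experts layer as the mechanism that unifies tactile and language tokens, but gives no implementation detail. The PyTorch sketch below illustrates one common way such a layer can route tokens from two modalities through a shared pool of experts via a learned gate; every class name, dimension, and the soft (dense) routing scheme here is an assumption for illustration, not STOLA's actual design.

```python
# Minimal sketch of an MoE layer routing tactile and language tokens to a
# shared pool of experts. All names, dimensions, and the soft routing scheme
# are illustrative assumptions, not STOLA's published architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """A simple feed-forward expert (hypothetical; STOLA's experts may differ)."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class TouchLanguageMoE(nn.Module):
    """Soft-routing MoE over a shared expert pool for both modalities.

    A learned gate scores each token against every expert, so tactile and
    language tokens can specialize to different experts while still being
    mixed in one unified representation space.
    """

    def __init__(self, dim: int = 512, num_experts: int = 4, hidden: int = 1024):
        super().__init__()
        self.experts = nn.ModuleList(Expert(dim, hidden) for _ in range(num_experts))
        self.gate = nn.Linear(dim, num_experts)  # per-token routing scores

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq, dim) -- tactile and language token embeddings
        # already projected to a common dimension and concatenated.
        weights = F.softmax(self.gate(tokens), dim=-1)  # (B, S, E)
        expert_outs = torch.stack(
            [expert(tokens) for expert in self.experts], dim=-2
        )  # (B, S, E, D)
        # Weighted mixture of expert outputs per token.
        return (weights.unsqueeze(-1) * expert_outs).sum(dim=-2)  # (B, S, D)


if __name__ == "__main__":
    moe = TouchLanguageMoE()
    tactile = torch.randn(2, 16, 512)   # dummy tactile token embeddings
    language = torch.randn(2, 32, 512)  # dummy language token embeddings
    fused = moe(torch.cat([tactile, language], dim=1))
    print(fused.shape)  # torch.Size([2, 48, 512])
```

In this toy setup the gate is free to send tactile and language tokens to different experts, which is one way an MoE can respect modality discrepancy while still producing a single fused sequence; production systems typically add sparse top-k routing and load-balancing losses, which are omitted here for brevity.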