Singapore

Multi-modal object re-identification (ReID) aims to retrieve specific targets by leveraging complementary cues from different sensing modalities. Despite recent progress, two key challenges remain:
(1) the limited ability to jointly address both modality and viewpoint discrepancies, and
(2) the difficulty of effectively leveraging reliable target-domain data to improve generalization.
To address these challenges, we propose Proxy-driven Test-Time Training (ProxyTTT), a unified framework that enhances both multi-modal identity representation learning and model generalization. During training, we propose a Multi-Proxy Learning (MPL) mechanism to address the representation bias across different views and modalities. MPL disentangles fine-grained modality-specific and modality-common identity proxies as semantic anchors to align identity features across diverse perspectives and sensing modalities. This alignment strategy enables the model to learn robust and discriminative global identity representations under heterogeneous modality conditions.
At test time, to reliably exploit target domain data, we propose Proxy-guided Entropy-based Selective Adaptation (PESA) for test-time training. Specifically, PESA leverages the semantic structure encoded by identity proxies to estimate prediction uncertainty via entropy, and selectively adapts the model using only high-confidence samples. This selective adaptation effectively mitigates the domain shift between training and deployment environments, improving the model’s generalization in real-world scenarios.
Extensive experiments on four public multi-modal ReID benchmarks (RGBNT201, RGBNT100, MSVR310, and WMVeID863) demonstrate the effectiveness of ProxyTTT.
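
The selection step of PESA described above can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: cosine similarity to identity proxies, a softmax temperature, a fixed entropy threshold, and the hypothetical function names. The model update on the selected samples is not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_confident(features, proxies, max_entropy, temperature=0.1):
    """Return indices of test samples whose proxy-based prediction
    has entropy at or below max_entropy (assumed selection rule)."""
    selected = []
    for idx, feat in enumerate(features):
        sims = [cosine(feat, proxy) for proxy in proxies]
        probs = softmax(sims, temperature)
        if entropy(probs) <= max_entropy:
            selected.append(idx)
    return selected

# Hypothetical toy data: two identity proxies, two test features.
proxies = [[1.0, 0.0], [0.0, 1.0]]
features = [[1.0, 0.01], [1.0, 1.0]]  # the second is ambiguous between proxies
print(select_confident(features, proxies, max_entropy=0.3))
```

In this toy setting only the first sample would be kept for adaptation, since the second lies equally close to both proxies and its prediction entropy is maximal.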

AAAI 2026

ProxyTTT: Proxy-driven Test-Time Training for Multi-modal Re-identification

proxy learning

re-identification

multi-modal

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.


Implicit neural representations (INRs) have achieved remarkable success in image representation and compression, but they require substantial training time and memory. Meanwhile, recent 2D Gaussian Splatting (GS) methods (e.g., GaussianImage) offer promising alternatives through efficient primitive-based rendering. However, these methods require an excessive number of Gaussian primitives to maintain high visual fidelity. To exploit the potential of GS-based approaches, we present GaussianImage++, which uses a limited number of Gaussian primitives to achieve impressive representation and compression performance. First, we introduce a distortion-driven densification mechanism that progressively allocates the budget of Gaussian primitives according to signal intensity. Second, we employ context-aware Gaussian filters for each primitive, which assist the densification in optimizing Gaussian primitives based on varying image content. Third, we integrate attribute-separated learnable scalar quantizers and quantization-aware training, enabling efficient compression of primitive attributes. Experimental results demonstrate the effectiveness of our method. In particular, GaussianImage++ outperforms GaussianImage and the INR-based COIN in representation and compression performance while maintaining real-time decoding and low memory usage. Our code will be released soon.

GaussianImage++: Boosted Image Representation and Compression with 2D Gaussian Splatting

We introduce a new notion of deterministic stable solution for non-cooperative games, termed subsidized equilibrium. It assumes that an amount of money can be used as a pool of subsidies to stabilize a strategy profile that otherwise would not be accepted by (some of) the players. Roughly speaking, for a given amount of money, a strategy profile is a subsidized equilibrium if the total payoff loss incurred by players not playing best responses does not exceed that amount, i.e., there is enough money to refund all players experiencing regret. Compared with many other solution concepts in the literature, the notion of subsidized equilibrium has important advantages. Specifically, for a sufficiently large amount of money, a subsidized equilibrium always exists and can even be computed in polynomial time; also, the existence of an efficient subsidized equilibrium can be guaranteed. Thus, determining for which amounts of money existence, polynomial-time computability, and efficiency can or cannot be achieved becomes an intriguing question. We provide initial results in this direction for some widely studied classes of games.

Compensate to Not Deviate: On Subsidised Equilibria
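
The informal definition in the abstract above (total regret of the players not exceeding the available money) can be checked directly on a small normal-form game. The sketch below is only an illustration of that reading: the payoff-table representation, the function names, and the Prisoner's Dilemma payoffs are assumptions, not taken from the paper.

```python
def regret(payoffs, strategies, profile, player):
    """Payoff a player loses by not best-responding to the others in `profile`."""
    current = payoffs[profile][player]
    best = max(
        payoffs[profile[:player] + (alt,) + profile[player + 1:]][player]
        for alt in strategies[player]
    )
    return best - current

def total_regret(payoffs, strategies, profile):
    """Sum of the regrets of all players at `profile`."""
    return sum(
        regret(payoffs, strategies, profile, player)
        for player in range(len(profile))
    )

def is_subsidized_equilibrium(payoffs, strategies, profile, budget):
    """True if `budget` suffices to refund every player's regret (assumed reading)."""
    return total_regret(payoffs, strategies, profile) <= budget

# Hypothetical Prisoner's Dilemma: C = cooperate, D = defect.
strategies = [("C", "D"), ("C", "D")]
payoffs = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

print(total_regret(payoffs, strategies, ("C", "C")))
print(is_subsidized_equilibrium(payoffs, strategies, ("C", "C"), budget=4))
```

Under these assumed payoffs, mutual cooperation has a total regret of 4 (each player could gain 2 by defecting), so it is stabilized by a budget of 4 but not by a budget of 3, while mutual defection needs no subsidy at all.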

Assessing the strength of arguments is essential for determining the outcomes of any argument-based system. A wide range of semantics has been proposed in the literature. These take as input a set of arguments—each assigned a basic weight and potentially subject to attacks from others—and compute a single strength value for each argument. Despite the diversity of argument types (or schemes), existing semantics apply uniform evaluation criteria across all arguments. In this paper, we advocate for type-dependent evaluations, acknowledging that the impact of attacks can vary across types. Given that many argument-based systems involve heterogeneous types of arguments, we propose a broad family of hybrid semantics that combine distinct base semantics, each tailored to specific argument types. We investigate their theoretical properties, present concrete instances within this family, and examine their computational complexity.

Hybrid Semantics Accounting for Argument Types

First-order relational languages have been used in MDP planning and reinforcement learning (RL) for two main purposes: specifying MDPs in compact form, and representing and learning policies that are general and not tied to specific instances or state spaces. In this work, we instead consider the use of first-order languages in goal-conditioned RL and generalized planning. The question is how to learn goal-conditioned and general policies when the training instances are large and the goal cannot be reached by random exploration alone. The technique of Hindsight Experience Replay (HER) provides an answer to this question: it relabels unsuccessful trajectories as successful ones by replacing the original goal with one that was actually achieved. If the target policy must generalize across states and goals, trajectories that do not reach the original goal states can enable more data- and time-efficient learning. In this work, we show that further performance gains can be achieved when states and goals are represented by sets of atoms. We consider three versions: goals as full states, goals as subsets of the original goals, and goals as lifted versions of these subgoals. The result is that the latter two successfully learn general policies on large planning instances with sparse rewards by automatically creating a curriculum of easier goals of increasing complexity. The experiments illustrate the computational gains of these versions, their limitations, and opportunities for addressing them.

First-Order Representation Languages for Goal-Conditioned RL
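
The three goal versions mentioned in the abstract above can be sketched as hindsight relabeling functions over states represented as sets of atoms. The abstract does not specify how subsets or lifted goals are chosen, so the choices below are assumptions: the subgoal is taken to be the original goal atoms that hold in the final state, lifting replaces object names with variables in a fixed order, and all names are hypothetical.

```python
def relabel_full_state(trajectory):
    """Version 1: use the full final state as the hindsight goal."""
    return frozenset(trajectory[-1])

def relabel_goal_subset(trajectory, goal):
    """Version 2 (assumed): keep the original goal atoms achieved in the final state."""
    return frozenset(goal) & frozenset(trajectory[-1])

def lift(atoms):
    """Version 3 (assumed): replace object names with variables, consistently."""
    variables = {}
    lifted = set()
    for predicate, *args in sorted(atoms):
        lifted_args = []
        for obj in args:
            if obj not in variables:
                variables[obj] = f"?x{len(variables)}"
            lifted_args.append(variables[obj])
        lifted.add((predicate, *lifted_args))
    return frozenset(lifted)

# Hypothetical Blocksworld trajectory that stacks a on b but never b on c.
start = {("ontable", "a"), ("ontable", "b"), ("ontable", "c"),
         ("clear", "a"), ("clear", "b"), ("clear", "c")}
final = {("on", "a", "b"), ("ontable", "b"), ("ontable", "c"),
         ("clear", "a"), ("clear", "c")}
goal = {("on", "a", "b"), ("on", "b", "c")}

subgoal = relabel_goal_subset([start, final], goal)
print(subgoal)
print(lift(subgoal))
```

Here the failed trajectory is relabeled with the easier subgoal of stacking one block on another, and its lifted form applies to any pair of blocks, which is one way a curriculum of easier goals could arise.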

Diffusion policies excel at robotic manipulation by naturally modeling multimodal action distributions in high-dimensional spaces. Nevertheless, diffusion policies suffer from diffusion representation collapse: semantically similar observations are mapped to indistinguishable features, ultimately impairing their ability to handle the subtle but critical variations required for complex robotic manipulation. To address this problem, we propose D²PPO (Diffusion Policy Policy Optimization with Dispersive Loss). D²PPO introduces dispersive loss regularization that combats representation collapse by treating all hidden representations within each batch as negative pairs. D²PPO compels the network to learn discriminative representations of similar observations, thereby enabling the policy to identify subtle yet crucial differences necessary for precise manipulation. In evaluation, we find that early-layer regularization benefits simple tasks, while late-layer regularization sharply enhances performance on complex manipulation tasks. On RoboMimic benchmarks, D²PPO achieves an average improvement of 22.7% in pre-training and 26.1% after fine-tuning, setting new SOTA results. In real-world experiments on a Franka Emika Panda robot, our method achieves a high success rate compared with SOTA methods, and its advantage is especially evident in complex tasks.

D²PPO: Diffusion Policy Policy Optimization with Dispersive Loss
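
The idea of treating all hidden representations in a batch as negative pairs can be illustrated with a repulsion-only contrastive term. The abstract does not state the formula, so the sketch below assumes one common form (the log of the mean of exp(-squared distance / temperature) over distinct pairs); the function name, the temperature value, and the exclusion of self-pairs are illustrative choices, not the authors' implementation.

```python
import math

def dispersive_loss(batch, temperature=0.5):
    """Repulsion-only regularizer: lower when representations are spread apart.

    `batch` is a list of equal-length feature vectors (at least two).
    """
    n = len(batch)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # assumed: skip self-pairs, whose distance is always zero
            sq_dist = sum((a - b) ** 2 for a, b in zip(batch[i], batch[j]))
            total += math.exp(-sq_dist / temperature)
    return math.log(total / (n * (n - 1)))

# Hypothetical hidden features: a collapsed batch versus a dispersed one.
collapsed = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
dispersed = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

print(dispersive_loss(collapsed))
print(dispersive_loss(dispersed))
```

The collapsed batch attains the maximum value of this term, so minimizing it alongside the policy objective would push similar observations toward distinguishable features.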

We address the problem of energy-optimal pathfinding for electric vehicles (EVs) in large-scale road networks, where energy may be recuperated along paths, introducing negative costs. While traditional routing algorithms assume a known initial energy level, many real-world scenarios require computing optimal paths for all possible initial energy levels, a task known as energy profile search. Existing solutions often rely on complex and computationally demanding profile merging procedures.
In this paper, we propose a novel A*-based energy profile search algorithm that avoids explicit profile merging by applying relaxed dominance rules within a multi-objective search framework. We present four variants of our method and evaluate them on road networks enriched with realistic energy consumption data. Experimental results show that our energy profile A* search performs comparably to conventional energy-optimal A*, which guarantees polynomial-time complexity, while additionally supporting profile queries through a simpler yet efficient solution for large-scale EV routing.

A Fast Heuristic Search Approach for Energy-Optimal Profile Routing for Electric Vehicles

Large Language Models (LLMs) have demonstrated remarkable proficiency in diverse tasks. This success raises a fundamental question in machine composition: Can symbolic music be considered a special form of language that can be jointly modeled with natural language for composition tasks? Recent studies validate that symbolic music can be modeled as a human language, yet composing structured music from partial symbolic inputs through natural language interaction remains underexplored. Even LLMs struggle to generate structurally coherent compositions in such hybrid input-output scenarios, highlighting a fundamental gap that calls for a domain-specific learning paradigm. To this end, we propose Inspiration-to-Structure (IoS), a cognitively inspired framework that enables LLMs to generate structured musical sections from melodic ideas. IoS employs a three-phase process (semantic, structural, and collaborative cognition) and is supported by two key components: (1) a new dataset and construction protocol called Structured Triplet Data (STD), and (2) a training method, Dual-Instance Structural Contrastive Optimization (DiSCO), designed to enhance structural awareness. Experiments show that IoS improves structural coherence by 47.8% and artistic creativity by 21.8% compared with the conventional language modeling paradigm, supervised fine-tuning, and even enables smaller LLMs to surpass larger ones. These results suggest that symbolic music, while language-like, demands specialized modeling beyond standard language modeling paradigms. IoS enables LLMs to transform music theory knowledge into structured composition, empowering users to compose music interactively via language and advancing toward general creative AI.

Is Symbolic Music a Specific Language? Exploring Inspiration-to-Structure Machine Composition via LLMs

Emotion Recognition in Conversation (ERC) is a crucial task for understanding human emotions and enabling natural human-computer interaction. Although Large Language Models (LLMs) have recently shown great potential in this field, their ability to capture the intrinsic connections between explicit and implicit emotions remains limited. We propose a novel ERC training framework, PRC-Emo, which integrates Prompt engineering, demonstration Retrieval, and Curriculum learning, with the goal of exploring whether LLMs can effectively perceive emotions in conversational contexts. Specifically, we design emotion-sensitive prompt templates based on both explicit and implicit emotional cues to better guide the model in understanding the speaker’s psychological states. We construct the first dedicated demonstration retrieval repository for ERC, which includes training samples from widely used datasets, as well as high-quality dialogue examples generated by LLMs and manually verified. Moreover, we introduce a curriculum learning strategy into the LoRA fine-tuning process, incorporating weighted emotional shifts between same-speaker and different-speaker utterances to assign difficulty levels to dialogue samples, which are then organized in an easy-to-hard training sequence. Experimental results on two benchmark datasets—IEMOCAP and MELD—show that our method achieves new state-of-the-art (SOTA) performance, demonstrating the effectiveness and generalizability of our approach in improving LLM-based emotional understanding.

Do LLMs Feel? Teaching Emotion Recognition with Prompts, Retrieval, and Curriculum Learning
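
The easy-to-hard ordering described in the abstract above can be sketched as a difficulty score built from emotional shifts. The abstract does not define the score, so the version below is an assumption: it counts shifts within the same speaker and shifts between consecutive utterances of different speakers, combines them with hypothetical weights, and normalizes by dialogue length.

```python
def difficulty(dialogue, w_same=1.0, w_diff=0.5):
    """Weighted emotional-shift score for a dialogue of (speaker, emotion) pairs."""
    last_emotion = {}   # most recent emotion seen for each speaker
    same_shifts = 0     # a speaker's emotion differs from their previous one
    diff_shifts = 0     # emotion differs from the previous utterance by another speaker
    previous = None
    for speaker, emotion in dialogue:
        if speaker in last_emotion and last_emotion[speaker] != emotion:
            same_shifts += 1
        if previous is not None and previous[0] != speaker and previous[1] != emotion:
            diff_shifts += 1
        last_emotion[speaker] = emotion
        previous = (speaker, emotion)
    return (w_same * same_shifts + w_diff * diff_shifts) / max(len(dialogue), 1)

def curriculum_order(dialogues):
    """Sort dialogues from easy (few shifts) to hard (many shifts)."""
    return sorted(dialogues, key=difficulty)

# Hypothetical dialogues with emotion labels.
volatile = [("A", "happy"), ("B", "angry"), ("A", "sad"), ("B", "happy")]
calm = [("A", "neutral"), ("B", "neutral"), ("A", "neutral")]

for dialogue in curriculum_order([volatile, calm]):
    print(difficulty(dialogue), dialogue)
```

With these assumed weights the calm dialogue scores 0 and is scheduled first, while the volatile one scores higher and would be seen later in fine-tuning.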

This paper explores the challenges of integrating tactile sensing into intelligent systems for multimodal reasoning, particularly in enabling commonsense reasoning about the open-ended physical world. We identify two key challenges: **modality discrepancy**, where existing touch-language models often treat touch as a mere sub-modality of language without further addressing the semantic differences, and **open-ended tactile data scarcity**, where current datasets lack the diversity, open-endedness, and complexity needed for reasoning. To overcome these challenges, we introduce STOLA, a **S**elf-Adaptive **To**uch-**La**nguage framework. STOLA utilizes Mixture of Experts (MoE) to dynamically process, unify, and manage tactile and language modalities, capturing their unique characteristics. Crucially, we also present a comprehensive tactile commonsense reasoning dataset and benchmark featuring free-form questions and responses, 8 physical properties, 4 interactive characteristics, and diverse commonsense knowledge. Experiments show STOLA exhibits competitive performance compared to existing models on the PHYSICLEAR benchmark and self-constructed datasets, proving the effectiveness of the Mixture of Experts architecture in multimodal management and the performance advantages for open-scenario tactile commonsense reasoning tasks.

STOLA: Self-Adaptive Touch-Language Framework for Tactile Commonsense Reasoning in Open-Ended Scenarios

Large language models (LLMs) have demonstrated impressive performance in text generation tasks; however, their embedding spaces often suffer from the isotropy problem, resulting in poor discrimination of domain-specific terminology, particularly in legal and financial contexts. This weakness in term-level representation can severely hinder downstream tasks such as legal judgment prediction or financial risk analysis, where subtle semantic distinctions are critical. To address this problem, we propose TermGPT, a multi-level contrastive fine-tuning framework designed for terminology adaptation. We first construct a sentence graph to capture semantic and structural relations, and generate semantically consistent yet discriminative positive and negative samples based on contextual and topological cues. We then devise a multi-level contrastive learning approach at both the sentence and token levels, enhancing global contextual understanding and fine-grained term discrimination. To support robust evaluation, we construct the first financial terminology dataset derived from official regulatory documents. Experiments show that TermGPT outperforms existing baselines in term discrimination tasks within the finance and legal domains.
