Person re-identification (ReID) aims to retrieve images of a target pedestrian from either a visual query (image-to-image, I2I) or a textual description (text-to-image, T2I). Although both tasks share the same retrieval objective, they face distinct challenges: I2I focuses on learning discriminative identity representations, while T2I emphasizes cross-modal semantic alignment. Existing approaches typically handle these tasks separately or combine them naively, which often leads to task interference and performance degradation. To address this, we propose a unified framework that leverages task-aware prompt learning to jointly optimize both tasks. Specifically, we design a Task-Routed Transformer that introduces dual classification tokens within a shared visual encoder to decouple task-specific representations. On top of this, we develop a Task-Conditioned Prompt Alignment module that constructs hierarchical prompts by integrating identity-level learnable tokens with sample-level pseudo-text tokens. These pseudo-tokens are converted from image or text features via modality-specific decoders, injecting fine-grained instance-level semantics into the prompts. Furthermore, we introduce a Cross-Modal Prompt Regularization strategy to enforce semantic alignment in the prompt token space, encouraging pseudo-prompts to preserve source-modality semantics while enhancing cross-modal transferability. Extensive experiments on multiple benchmark datasets demonstrate that our approach effectively mitigates task interference and achieves state-of-the-art performance on both I2I and T2I person ReID tasks.
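The dual-classification-token routing described above can be illustrated with a minimal schematic sketch. This is not the paper's implementation: the encoder is a pass-through stand-in, the function and token names are hypothetical, and tokens are plain Python lists; the sketch only shows how a task flag selects which classification token's output serves as the task-specific representation within a shared encoder.

```python
# Hypothetical sketch of task-routed dual classification tokens.
# A shared encoder processes [cls_i2i, cls_t2i, patch tokens]; the task
# flag then reads out that task's own classification token, so the two
# tasks' representations are decoupled while the backbone stays shared.

def shared_encoder(tokens):
    # Stand-in for a shared transformer encoder: passes tokens through unchanged.
    return tokens

def task_routed_forward(patch_tokens, task):
    cls_i2i = [0.0]  # learnable token dedicated to image-to-image retrieval
    cls_t2i = [1.0]  # learnable token dedicated to text-to-image retrieval
    sequence = [cls_i2i, cls_t2i] + patch_tokens
    outputs = shared_encoder(sequence)
    # Route: each task reads its own classification token's output.
    return outputs[0] if task == "i2i" else outputs[1]
```

With the pass-through encoder, routing simply returns the token reserved for the requested task; in a real model both tokens would attend to the same patch tokens but be supervised by task-specific objectives.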
