Most existing RGB-Event trackers rely on strictly aligned datasets, overlooking the asynchronous streams and mismatched spatio-temporal resolutions common in real-world scenarios. This limitation impedes effective RGB-Event feature alignment and ultimately degrades tracking performance. To overcome it, we propose AlignTrack, a novel tracking framework built upon a Top-Down Alignment (TDA) strategy inspired by the human visual system. Our TDA framework follows an encode-decode-align paradigm: it first encodes multimodal features to generate target-related priors, which are then progressively decoded to guide a subsequent feature-alignment pass. Within this framework, we introduce two key innovations: (1) a Cross-Prior Attention (CPA) module that generates and integrates cross-modal priors, and (2) a Cross-Modal Semantic Alignment (CSA) loss that maximizes mutual information to enforce semantic consistency between modalities. Extensive experiments show that AlignTrack achieves state-of-the-art performance on four challenging RGB-Event tracking benchmarks, demonstrating robustness in both aligned and unaligned scenarios. Ablation studies further validate the contribution of each proposed component.
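
To make the two components more concrete, below is a minimal PyTorch sketch of how a CPA-style block and an MI-based alignment loss could be realized. It is an illustration under stated assumptions, not the paper's implementation: the learned prior token, the module and function names, the token shapes, and the InfoNCE formulation (a standard lower bound on mutual information) are all assumptions introduced here.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossPriorAttention(nn.Module):
    """Hypothetical CPA-style block: a learned target prior token attends
    over concatenated RGB and event tokens to form a cross-modal prior."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        # Target-related prior token (assumed; the paper's priors may differ).
        self.prior = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb_tokens, event_tokens):
        # rgb_tokens, event_tokens: (B, N, dim) per-modality token sequences.
        ctx = torch.cat([rgb_tokens, event_tokens], dim=1)
        q = self.prior.expand(ctx.size(0), -1, -1)
        # The prior queries both modalities jointly, integrating their cues.
        prior, _ = self.attn(q, ctx, ctx)
        return self.norm(prior)  # (B, 1, dim) cross-modal prior

def csa_infonce_loss(rgb_feat, event_feat, temperature=0.07):
    """One common way to maximize mutual information between pooled RGB and
    event representations: a symmetric InfoNCE contrastive objective."""
    z_r = F.normalize(rgb_feat, dim=-1)    # (B, dim)
    z_e = F.normalize(event_feat, dim=-1)  # (B, dim)
    logits = z_r @ z_e.t() / temperature   # (B, B) cross-modal similarities
    targets = torch.arange(z_r.size(0), device=z_r.device)
    # Matching RGB/event pairs on the diagonal are the positives.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

In this reading, minimizing the InfoNCE loss tightens a lower bound on the mutual information between the two modalities' features, which is one way to enforce the semantic consistency the CSA loss targets.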