Singapore

Existing multi-person video pose estimation methods typically adopt a two-stage pipeline: detecting individuals in each frame, followed by temporal modeling for single-person pose estimation. This design relies on heuristic operations such as tracking, RoI cropping, and non-maximum suppression, limiting both accuracy and efficiency. In this paper, we present a fully end-to-end framework for multi-person 2D pose estimation in videos, effectively eliminating heuristic operations. A key challenge is to associate individuals across frames under complex and overlapping temporal trajectories. To address this, we introduce a novel Pose-Aware Video transformEr Network (PAVE-Net), which features a spatial encoder to model intra-frame relations and a spatiotemporal pose decoder to capture global dependencies across frames. To achieve accurate temporal association, we propose a pose-aware attention mechanism that enables each pose query to selectively aggregate features corresponding to the same individual across consecutive frames. Additionally, we explicitly model spatiotemporal dependencies among pose keypoints to improve accuracy. Notably, our approach is the first end-to-end method for multi-frame 2D human pose estimation. Extensive experiments show that PAVE-Net substantially outperforms prior image-based end-to-end methods, achieving a 6.0 mAP improvement on PoseTrack2017, and delivers accuracy competitive with state-of-the-art two-stage video-based approaches, while offering significant gains in efficiency. Project page: https://github.com/zgspose/PAVENet

AAAI 2026

End-to-End Multi-Person Pose Estimation with Pose-Aware Video Transformer

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.


Class-incremental learning (CIL) has recently gained great attention in the field of time series classification.
Existing methods based on knowledge distillation exhibit an impressive ability to preserve prior knowledge and overcome catastrophic forgetting; however, their effectiveness is challenged by the nature of time series data.
Because temporal data are more susceptible to sensor errors and electronic noise, the distillation process may be degraded by noisy knowledge transfer.
To address this issue, we propose a novel confidence-guided mask distillation (CMD) framework that prevents noise from being inherited during distillation.
The core of CMD is a dynamic masking mechanism guided by prediction confidence, which allocates higher weights to high-confidence time series and substantially suppresses the influence of low-confidence ones.
Additionally, unlike prior work that simply passes a set of feature prototypes to the classifier, we develop prototype-guided contrastive learning (PCL) to alleviate classifier bias toward new classes, using extra contrastive constraints to push the feature distributions of old-class prototypes away from those of new-class features.
Extensive experiments on three time-series datasets demonstrate that our CDMD method significantly outperforms other replay-free CIL approaches, raising average accuracy and reducing the forgetting rate.
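
The confidence-guided masking idea can be illustrated with a minimal sketch. The abstract does not give the exact masking rule, so everything specific here is an assumption made for illustration: confidence is taken as the old model's maximum softmax probability, samples below a hypothetical `threshold` are masked out, and the remaining ones are weighted by their confidence in a KL distillation loss.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def confidence_masked_distillation(teacher_logits, student_logits, threshold=0.6):
    """Confidence-weighted distillation loss over a batch.

    Assumption (not from the source): confidence is the teacher's maximum
    softmax probability; samples below `threshold` receive weight 0.
    Returns (loss, per-sample weights).
    """
    weights, weighted_losses = [], []
    for t, s in zip(teacher_logits, student_logits):
        p_teacher, p_student = softmax(t), softmax(s)
        confidence = max(p_teacher)
        weight = confidence if confidence >= threshold else 0.0
        weights.append(weight)
        weighted_losses.append(weight * kl_divergence(p_teacher, p_student))
    total_weight = sum(weights)
    loss = sum(weighted_losses) / total_weight if total_weight > 0 else 0.0
    return loss, weights


# The second sample has a nearly flat teacher distribution, so it is masked.
teacher = [[4.0, 0.0, 0.0], [0.1, 0.0, 0.0]]
student = [[0.0, 4.0, 0.0], [3.0, 0.0, 0.0]]
loss, weights = confidence_masked_distillation(teacher, student)
```

In this toy batch only the first sample contributes to the loss, however far the student is from the teacher on the second one.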

Time Series Class-Incremental Learning via Confidence-guided Mask Distillation and Prototype-guided Contrastive Learning

Real-world control systems require policies that are not only high-performing but also interpretable and robust. A promising direction toward this goal is model-based control, which learns system dynamics and cost functions from historical data and then uses these models to inform decision-making. Building on this paradigm, we introduce DiffOP, a novel framework for learning optimization-based control policies defined implicitly through optimization control problems.
Without relying on value function approximation, DiffOP jointly learns the cost and dynamics models and directly optimizes the actual control costs using policy gradients.
To enable this, we derive analytical policy gradients by applying implicit differentiation to the underlying optimization problem and integrating it with the standard policy gradient framework.
Under standard regularity conditions, we establish that DiffOP converges to an $\epsilon$-stationary point within $\mathcal{O}(\epsilon^{-1})$ iterations.
We demonstrate the effectiveness of DiffOP through experiments on nonlinear control tasks and power system voltage control with constraints.
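
To make the implicit-differentiation step concrete, here is a scalar sketch under stated assumptions. The inner cost `c(u; theta, x) = 0.5*u**2 + 0.5*theta*(u - x)**2`, the function names, and the quadratic control cost are all hypothetical and chosen only so the result can be checked by hand; they are not DiffOP's models. For a policy defined as `u*(theta) = argmin_u c(u; theta, x)`, the implicit function theorem gives `du*/dtheta = -(d2c/du2)^(-1) * d2c/(du dtheta)`.

```python
def inner_cost_grad(u, theta, x):
    """dc/du for the hypothetical inner cost 0.5*u**2 + 0.5*theta*(u - x)**2."""
    return u + theta * (u - x)


def solve_policy(theta, x, steps=500, lr=0.1):
    """Solve the inner problem by gradient descent (needs lr * (1 + theta) < 2)."""
    u = 0.0
    for _ in range(steps):
        u -= lr * inner_cost_grad(u, theta, x)
    return u


def implicit_policy_gradient(theta, x):
    """du*/dtheta from the implicit function theorem at the inner optimum."""
    u_star = solve_policy(theta, x)
    c_uu = 1.0 + theta      # second derivative of c with respect to u
    c_utheta = u_star - x   # derivative of dc/du with respect to theta
    return -c_utheta / c_uu


def control_cost_gradient(theta, x, target):
    """Chain rule for a hypothetical control cost J = (u* - target)**2."""
    u_star = solve_policy(theta, x)
    return 2.0 * (u_star - target) * implicit_policy_gradient(theta, x)


# Cross-check the implicit gradient against a central finite difference.
theta, x, h = 2.0, 3.0, 1e-4
implicit = implicit_policy_gradient(theta, x)
numeric = (solve_policy(theta + h, x) - solve_policy(theta - h, x)) / (2 * h)
```

For this cost the optimum is `u* = theta*x/(1 + theta)`, so the exact derivative is `x/(1 + theta)**2`, which both `implicit` and `numeric` should approximate.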

DiffOP: Reinforcement Learning of Optimization-Based Control Policies via Implicit Policy Gradients

Composed Image Retrieval (CIR) combines the reference image with text to retrieve the intended target image. Recently, zero-shot CIR has gained significant attention by eliminating the need for labeled triplets required in supervised CIR. However, it inevitably demands additional training corpus, storage, and computational resources, limiting its applicability in real-world scenarios. Inspired by advancements in Test-Time Adaptation (TTA), we propose a Test-Time CIR setting named TT-CIR, which aims to efficiently adapt models to unlabeled test samples while reducing resource consumption. Within the TT-CIR setting, we identify that naively introducing existing TTA methods (e.g., reward-based) into CIR faces two vital challenges: 1) Modification-restricted reward pool, which limits the exploration of semantically relevant candidate rewards; 2) Conservative knowledge feedback, which inhibits the adaptability of rewards to the current data distribution. To address these challenges, we propose a test-time reinforcement learning framework that integrates a Counterfactual-guided Multinomial Sampling (CMS) strategy and a Duplex Rewards Modeling (DRM) module. The CMS explores a candidate reward pool that is both visually similar and semantically relevant to the given query, while the DRM generates stable and adaptive duplex rewards to guide model adaptation. Extensive experiments demonstrate the superiority and adaptability of our method over existing approaches. Code is available in the supplementary materials.

Duplex Rewards Optimization for Test-Time Composed Image Retrieval

Segment Anything Model (SAM) exhibits remarkable zero-shot segmentation capability; however, its prohibitive computational costs make edge deployment challenging. Although post-training quantization (PTQ) offers a promising compression solution, existing methods yield unsatisfactory results when applied to SAM, owing to its specialized model components and promptable workflow: 
(i) The mask decoder's attention exhibits extreme activation outliers, and we find that aggressive clipping (even 100$\times$), without smoothing or isolation, is effective in suppressing outliers while maintaining performance. Unfortunately, traditional distribution-based metrics (e.g., MSE) fail to provide such large-scale clipping. 
(ii) Existing quantization reconstruction methods neglect the semantic interactivity of SAM, leading to misalignment between image features and prompt intention.
To address the above issues, we propose SAQ-SAM in this paper, which boosts PTQ for SAM from the perspective of semantic alignment.
Specifically, we propose Perceptual-Consistency Clipping, which exploits attention focus overlap to promote aggressive clipping while preserving semantic capabilities. 
Furthermore, we propose Prompt-Aware Reconstruction, which incorporates image-prompt interactions by leveraging cross-attention in the mask decoder, thus facilitating alignment in both distribution and semantics.
Moreover, to keep this interaction efficient, we design a layer-skipping strategy for image tokens in the encoder.
Extensive experiments are conducted on various SAM sizes and tasks, including instance segmentation, oriented object detection, and semantic segmentation, and the results show that our method consistently exhibits advantages.
For example, when quantizing SAM-B to 4-bit, SAQ-SAM achieves 11.7% higher mAP than the baseline on the instance segmentation task.
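
The effect of aggressive clipping under low-bit quantization can be seen in a toy example. This is a generic symmetric uniform quantizer, not the paper's Perceptual-Consistency Clipping; the activation values and the function name are made up for illustration. With a single large outlier, a min-max clipping range makes the 4-bit step so coarse that every small value rounds to zero, whereas a range 100 times smaller keeps them distinguishable at the cost of saturating the outlier.

```python
def quantize_symmetric(values, bits, clip):
    """Symmetric uniform fake-quantization: clip, round to the grid, dequantize."""
    qmax = 2 ** (bits - 1) - 1   # e.g. 7 for 4-bit
    scale = clip / qmax          # width of one quantization step
    out = []
    for v in values:
        clipped = max(-clip, min(clip, v))
        out.append(round(clipped / scale) * scale)
    return out


# Hypothetical activations: three small values and one extreme outlier.
activations = [0.1, -0.2, 0.3, 100.0]

# Min-max range: the outlier sets the step size (100 / 7 per level).
minmax_q = quantize_symmetric(activations, bits=4, clip=100.0)

# Clipping range 100x smaller: the step is 1 / 7, and the outlier saturates.
clipped_q = quantize_symmetric(activations, bits=4, clip=1.0)
```

Whether saturating the outlier is acceptable depends on the model; the abstract reports that it is for SAM's mask decoder attention, which this sketch does not test.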

SAQ-SAM: Semantically-Aligned Quantization for Segment Anything Model

Video Camouflaged Object Detection (VCOD) poses significant challenges due to the subtle appearance of camouflaged objects, especially under dynamic motion and occlusion. Existing methods predominantly rely on optical flow or black-box features for motion modeling, which often entail substantial computational costs and suffer from limited interpretability. Inspired by the human strategy of identifying abnormal movements between frames and the principle of event camera image formation, we propose an eventstream-inspired dual-branch framework for VCOD. Specifically, we design an eventstream-like data extraction module to capture pixel-level motion variations, effectively distinguishing object motion from background dynamics. This event-based representation is integrated into SAM2 through a dual-branch memory-augmented framework, consisting of Time Bridge Attention and Visual Bridge Attention, enabling joint modeling of motion and appearance cues. In addition, we introduce a Prompt Embedding Generator to eliminate the need for human-provided interactive prompts, facilitating fully automatic VCOD. Extensive experiments on MoCA-Mask and CAD2016 demonstrate that our approach significantly outperforms state-of-the-art methods, achieving both superior segmentation accuracy and interpretable motion modeling. To the best of our knowledge, this is the first work to incorporate eventstream-inspired representations into the VCOD task. Code and related resources will be released.
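
The event-camera principle mentioned above can be sketched as follows. An event camera emits a signed event at a pixel when its log intensity changes by more than a contrast threshold. The code below applies that rule to two consecutive grayscale frames; the function name, the `threshold` and `eps` values, and the list-of-lists frame format are assumptions for illustration, and the paper's actual extraction module may differ.

```python
import math


def extract_events(prev_frame, curr_frame, threshold=0.2, eps=1.0):
    """Per-pixel event polarity between two grayscale frames.

    Returns +1 where log intensity rose by at least `threshold`,
    -1 where it fell by at least `threshold`, and 0 otherwise.
    `eps` avoids log(0) on black pixels.
    """
    events = []
    for prev_row, curr_row in zip(prev_frame, curr_frame):
        row = []
        for p, c in zip(prev_row, curr_row):
            delta = math.log(c + eps) - math.log(p + eps)
            if delta >= threshold:
                row.append(1)
            elif delta <= -threshold:
                row.append(-1)
            else:
                row.append(0)
        events.append(row)
    return events


# One pixel brightens, one darkens, and two stay unchanged.
prev_frame = [[10, 10], [10, 10]]
curr_frame = [[10, 20], [5, 10]]
events = extract_events(prev_frame, curr_frame)
```

Static background pixels produce no events, which is what lets such a representation isolate motion.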

Towards Explainable Video Camouflaged Object Detection: SAM2 with Eventstream-Inspired Data

In this paper, we propose new randomized algorithms for estimating the two-to-infinity and one-to-two norms in a matrix-free setting, using only matrix-vector multiplications. Our methods are based on appropriate modifications of Hutchinson's diagonal estimator and its Hutch++ version. We provide oracle complexity bounds for both modifications. We further illustrate the practical utility of our algorithms for Jacobian-based regularization in deep neural network training on image classification tasks. We also demonstrate that our methodology can be applied to mitigate the effect of adversarial attacks in the domain of recommender systems.
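
As background for the approach, the sketch below shows the plain Hutchinson diagonal estimator applied to the two-to-infinity norm. This is one plausible baseline, not the paper's modified algorithms. It relies on two standard facts: the two-to-infinity norm of `A` is its largest row 2-norm, and the squared row norms are the diagonal of `A A^T`, which can be probed with products `A (A^T z)` for random sign vectors `z`. The function names and the small matrix are hypothetical.

```python
import math
import random


def hutchinson_diag(matvec, n, num_samples, rng):
    """Estimate diag(B) of an n x n operator B using only products B z."""
    diag = [0.0] * n
    for _ in range(num_samples):
        z = [rng.choice((-1.0, 1.0)) for _ in range(n)]  # Rademacher probe
        bz = matvec(z)
        for i in range(n):
            diag[i] += z[i] * bz[i]
    return [d / num_samples for d in diag]


def two_to_inf_norm_estimate(apply_a, apply_at, num_rows, num_samples=64, seed=0):
    """Estimate max_i ||A[i, :]||_2 from matrix-vector products with A and A^T."""
    rng = random.Random(seed)
    diag = hutchinson_diag(lambda z: apply_a(apply_at(z)), num_rows, num_samples, rng)
    # Noisy estimates can dip below zero, so clamp before the square root.
    return math.sqrt(max(max(diag), 0.0))


# Hypothetical 2 x 3 matrix, accessed only through matvec closures.
A = [[3.0, 4.0, 0.0], [0.0, 0.0, 1.0]]


def apply_a(x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def apply_at(y):
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]


estimate = two_to_inf_norm_estimate(apply_a, apply_at, num_rows=len(A))
```

The rows of this `A` are orthogonal, so `A A^T` is diagonal and the estimate equals the true norm; in general the estimate is noisy and improves with more samples. Since the one-to-two norm of `A` is its largest column 2-norm, it equals the two-to-infinity norm of `A^T`, so the same routine applies with the two closures swapped.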

Matrix-Free Two-to-Infinity and One-to-Two Norms Estimation

Open-vocabulary 3D scene understanding (OV-3D) aims to localize and classify novel objects beyond the closed set of object classes. However, existing approaches and benchmarks primarily focus on the open vocabulary problem within the context of object classes, which is insufficient for holistically evaluating the extent to which a model understands a 3D scene. In this paper, we introduce a more challenging task called Generalized Open-Vocabulary 3D Scene Understanding (GOV-3D) to explore the open vocabulary problem beyond object classes. It encompasses an open and diverse set of generalized knowledge, expressed as linguistic queries of fine-grained and object-specific attributes. To this end, we contribute a new benchmark named OpenScan, which consists of 3D object attributes across eight representative linguistic aspects, including affordance, property, and material. We further evaluate state-of-the-art OV-3D methods on our OpenScan benchmark and discover that these methods struggle to comprehend the abstract vocabularies of the GOV-3D task, a challenge that cannot be addressed simply by scaling up object classes during training. We highlight the limitations of existing methodologies and explore promising directions to overcome the identified shortcomings.

OpenScan: A Benchmark for Generalized Open-Vocabulary 3D Scene Understanding

Currently, pretrained models are rapidly scaling in size, which substantially increases the cost of fine-tuning them for downstream tasks. To address this challenge, parameter-efficient fine-tuning (PEFT) methods have been developed to optimize a minimal set of parameters for adaptation. However, current PEFT approaches predominantly employ an "additive" strategy, introducing learnable modules into inputs or architectures, and neglect the inherent knowledge embedded within pretrained models, which may be redundant or even conflict with downstream tasks. This limitation leads to increased inference latency and suboptimal transfer performance, particularly in scenarios with significant domain gaps. In this paper, we propose a Subtractive Fine-tuning Paradigm (SFP), which converts multiple redundant operations within the original module into a linear transformation to enhance inference speed and model performance. Specifically, we introduce a compact filter block that replaces specific modules exhibiting interference and redundancy in the original structure, thereby reducing model conflicts. We construct the filter block using a pseudo-inverse matrix so that it inherits the knowledge of the module it replaces; we then freeze the rest of the model and fine-tune only the filter block to eliminate interfering and redundant knowledge, thereby enhancing the model’s adaptability to downstream tasks. Experimental results demonstrate that SFP outperforms existing PEFT methods in accuracy while decreasing the overall model parameters by 12%. Compared to full fine-tuning, accuracy increases by 8.47 percentage points (74.04% vs. 65.57% on VTAB).

Less Is More: Rethinking Parameter-Efficient Fine-Tuning from a Subtractive Perspective

The discovery of symbolic solutions—mathematical expressions, logical rules, and algorithmic structures—is fundamental to advancing scientific and engineering progress.
However, traditional methods often struggle with search efficiency and fail to integrate knowledge effectively.
While recent large language model-based (LLM-based) approaches have demonstrated improvements in search efficiency, they lack the ability to continually refine and expand upon discovered solutions and their underlying knowledge, limiting their potential for open-ended innovation.
To address these limitations, we introduce CoEvo, a novel framework that leverages large language models within an evolutionary search methodology to continually generate and refine symbolic solutions. CoEvo integrates a dynamic knowledge library, enabling open-ended innovation of solutions through effective knowledge management. Additionally, CoEvo leverages multiple representations of solutions—including natural language, mathematical expressions, and code—to further enhance search efficiency.
By combining the reasoning capabilities of LLMs with the exploratory power of evolutionary algorithms, CoEvo significantly improves the efficiency and scope of symbolic discovery.
Our experimental results demonstrate that this method not only enhances the efficiency of searching for symbolic solutions but also supports the ongoing discovery process, akin to human scientific endeavors. This study represents a first effort in conceptualizing the search for symbolic solutions as a lifelong, iterative process, marking a significant step towards harnessing LLMs in the perpetual pursuit of scientific and engineering breakthroughs.

CoEvo: Continual Evolution of Symbolic Solutions Using Large Language Models

Multi-view learning aims to effectively integrate data from different sources by exploring the consistency and complementarity across views. Current multi-view methods based on Graph Convolutional Networks (GCNs) primarily focus on local information, making it difficult to capture global dependencies. Furthermore, multi-view data typically lack explicit structural representations, and the topologies constructed via node similarity in existing approaches are prone to noise; simple fusion strategies are often inadequate for suppressing this noise and uncovering meaningful structural information. To tackle these issues, this paper proposes CoGFormer, a cooperative graph transformer with structural consensus learning. CoGFormer maps multi-view data into a unified space and jointly models local and global consensus: a denoising structural consensus graph convolutional network refines the consensus graph to enhance local consistency and robustness, while a structure-guided attention mechanism explicitly injects high-order cross-view structural biases to capture global consistency and improve semantic coherence. Experiments on multiple benchmarks demonstrate that CoGFormer outperforms existing state-of-the-art methods, validating its effectiveness.
