Singapore

This work focuses on multi-shot semi-supervised video object segmentation (MVOS), which aims to segment the target object indicated by an initial mask throughout a video with multiple shots. Existing VOS methods mainly focus on single-shot videos and often fail to handle shot discontinuities, which limits their real-world applicability. Furthermore, the lack of annotated multi-shot data poses a major challenge for MVOS research. To address these issues, we propose a transition-mimicking data augmentation strategy (TMA) that enables cross-shot generalization using single-shot data, and a transition-aware method, Segment Anything Across Shots (SAAS), which detects and comprehends shot transitions during inference. To support evaluation and future study in MVOS, we introduce Cut-VOS, a new MVOS benchmark with dense mask annotations, diverse object categories, and high-frequency transitions. Extensive experiments on YouMVOS and Cut-VOS demonstrate that the proposed SAAS achieves state-of-the-art performance by effectively mimicking, understanding, and segmenting across complex transitions. The code and data samples are released at https://anonymous.4open.science/r/AAAI2026-3280.

AAAI 2026

Segment Anything Across Shots: A Method and Benchmark

cv: segmentation

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.




LVLMs have been shown to perform excellently in image-level tasks such as VQA and captioning. However, in many instance-level tasks, such as visual grounding and object detection, LVLMs still show performance gaps compared to previous expert models. Meanwhile, although pedestrian tracking is a classical task, a number of new topics have emerged that combine object tracking and natural language, such as Referring MOT, Cross-view Referring MOT, and Semantic MOT. These tasks emphasize that models should understand the tracked object at an advanced semantic level, which is exactly where LVLMs excel. In this paper, we propose a new unified pedestrian tracking framework, namely OmniPT, which can track pedestrians, track them based on a reference, and generate semantic understanding of tracked objects interactively. We address two issues: how to model the tracking task as a task that foundation models can perform, and how to make the model output formatted answers. To this end, we implement a four-stage training pipeline: RL, mid-training, SFT, and RL. Starting from the pre-trained weights of the LVLM, we first perform a simple RL phase to enable the model to output bounding boxes in a fixed, supervisable format. Subsequently, we conduct a mid-training phase using a large number of pedestrian-related datasets. Finally, we perform supervised fine-tuning on several pedestrian tracking datasets, and then carry out another RL phase to improve the model's tracking performance and enhance its ability to follow instructions. We conduct experiments on tracking benchmarks, and the experimental results demonstrate that the proposed method performs better than previous methods.

OmniPT: Unleashing the Potential of Large Vision Language Models for Pedestrian Tracking and Understanding

Current roadside perception systems mainly focus on instance-level perception, which falls short in enabling interaction via natural language and reasoning about traffic behaviors in context. To bridge this gap, we introduce RoadSceneVQA, a large-scale and richly annotated visual question answering (VQA) dataset specifically tailored for roadside scenarios. The dataset comprises 34,736 diverse QA pairs collected under varying weather, illumination, and traffic conditions, targeting not only object attributes but also the intent, legality, and interaction patterns of traffic participants. RoadSceneVQA challenges models to perform both explicit recognition and implicit commonsense reasoning, grounded in real-world traffic rules and contextual dependencies. To fully exploit the reasoning potential of Multi-modal Large Language Models (MLLMs), we further propose CogniAnchor Fusion (CAF), a vision-language fusion module inspired by human-like scene anchoring mechanisms. CAF enables precise and efficient cross-modal interaction. Moreover, we propose the Assisted Decoupled Chain-of-Thought (AD-CoT) to enhance reasoning via CoT prompting and multi-task learning. Experimental results on the RoadSceneVQA and CODA-LM benchmarks show that the pipeline consistently improves both reasoning accuracy and computational efficiency, allowing the MLLM to achieve state-of-the-art performance in structural traffic perception and reasoning tasks.

RoadSceneVQA: Benchmarking Visual Question Answering in Roadside Perception Systems for Intelligent Transportation System

Despite the remarkable performance of deep models in medical imaging, they still require source data for training, which limits their potential in light of privacy concerns. Federated learning (FL), a decentralized learning framework that trains a shared model with multiple hospitals (a.k.a. FL clients), provides a feasible solution. However, data heterogeneity and resource costs hinder the deployment of FL models, especially when using vision-language models (VLMs). To address these challenges, we propose a novel contrastive language-image pre-training (CLIP) based FL approach for medical image classification. Specifically, we introduce a masked feature adaptation module (FAM) as a communication module to reduce the communication load while freezing the CLIP encoders to reduce the computational overhead. Furthermore, we propose a masked multi-layer perceptron (MLP) as a private local classifier to adapt to the client tasks. Moreover, we design an adaptive Kullback-Leibler (KL) divergence-based distillation regularization method to enable mutual learning between FAM and MLP. Finally, we incorporate model compression to transmit the FAM parameters while using ensemble predictions for classification. Extensive experiments on four publicly available medical datasets demonstrate that our model provides feasible performance (e.g., 8% higher than the second-best baseline on ISIC2019) with reasonable resource cost (e.g., 120 times faster than FedAVG).

Federated CLIP for Resource-Efficient Heterogeneous Medical Image Classification

Offline reinforcement learning (RL) can learn policies from pre-collected offline datasets without interacting with the environment, but it suffers from the out-of-distribution (OOD) issue. Recent methods use the generative adversarial paradigm to learn policies, but easily fail to handle the conflict between fooling the discriminator and maximizing expected returns. In this paper, we propose a novel offline RL method named Distribution-Matching Generator-based Diffusion Policies (DMGDP). A distribution matching-based policy learning method is first developed, where the diffusion model serves as the policy generator, to handle the conflict between fooling the discriminator and maximizing expected returns. Furthermore, a policy confidence mechanism based on discriminator regularization is designed to prevent the agent from taking OOD actions, with the aim of robust generative adversarial learning. We conduct extensive experiments on the D4RL benchmarks, and the results demonstrate that DMGDP outperforms state-of-the-art methods.

Enhancing Diffusion Policies with Distribution-Matching Generator in Offline Reinforcement Learning

The growing demand for psychological support underscores the lack of high-quality counseling dialogue datasets, particularly in non-English contexts. We propose PGSim, a Path-Guided Simulation framework that mirrors real counseling processes—symptom description, problem identification, cause analysis, strategy planning, and iterative adjustment. PGSim models each user scenario as a fine-grained quadruple {Group, Psychological Problem, Problem Cause, Support Focus} and guides dialogue generation through expert-annotated strategy paths. Real counseling dialogues and expert-edited samples are used to fine-tune two language models: a Dialog Generator for strategy-aligned dialogue creation and a Dialog Modifier for expert-level refinement. After automated and human verification, we construct the Chinese Psychological support Dialogue Dataset (CPsDD), containing 68K dialogues across 13 groups, 16 problems, 13 causes, and 12 support focuses. We further present the Comprehensive Agent Dialogue Support System (CADSS), which integrates profiling, summarization, strategy planning, and empathetic response. Experiments on CPsDD and ESConv demonstrate that CADSS achieves state-of-the-art results on Strategy Prediction and Emotional Support Conversation tasks.

Simulating Human-Like Counseling: A Path- and Scenario-Guided Framework for Psychological Support Dialogue

Vision foundation models (e.g., SAM2, CLIP) show strong generalization in natural image analysis but degrade significantly in specialized domains like medical imaging. This is critical for tasks such as brain tumor segmentation, where errors directly affect surgical planning and patient outcomes. In such contexts, segmentation must be highly reliable and structurally precise, underscoring the need for adaptable methods with low error tolerance. While fine-tuning is the dominant strategy, it is computationally expensive and prone to forgetting. To address this, we propose CausalBridgeNet, a causality-guided correction framework for medical image segmentation. Inspired by predictive coding theories of the Bayesian brain, our method introduces a Predictive Causal Reasoning Unit (PCRU) that estimates structured error maps and delivers targeted feedback to iteratively refine predictions. This forms a closed-loop, error-aware correction mechanism without modifying the foundation model. By keeping the backbone frozen, CausalBridgeNet preserves general visual priors while enhancing task-specific accuracy. On the BraTS 2025 benchmark, it achieves an average Dice score of 84.48 and HD95 of 5.48 across tumor subregions, demonstrating its effectiveness for high-precision medical segmentation.

Make Foundation Models Trustworthy Again: Causal Fine-Adaptation for Medical Image Segmentation
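
The abstract above reports results as a Dice score. As a reminder of what that metric measures, here is a minimal sketch of the standard Dice coefficient on flat binary masks. It is not the authors' evaluation code: the function name `dice_score`, the flat-list mask representation, and the convention for two empty masks are assumptions made for illustration, and the reported 84.48 is presumably on a 0-100 scale whereas this sketch returns a value in [0, 1].

```python
def dice_score(pred, target):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for flat binary masks of 0/1 values."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    # Overlap: pixels labeled 1 in both masks.
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumption: two empty masks are treated as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]
    reference = [1, 0, 0, 0]
    print(dice_score(predicted, reference))
```

In the example, one pixel overlaps and the masks contain three foreground pixels in total, so the score is 2/3.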

DiT models have achieved great success in text-to-video generation, leveraging their scalability in model capacity and data scale. High content and motion fidelity aligned with text prompts, however, often require large model parameters and a substantial number of function evaluations (NFEs). Realistic and visually appealing details are typically reflected in high-resolution outputs, further amplifying computational demands—especially for single-stage DiT models. To address these challenges, we propose a novel two-stage framework, FlashVideo, which strategically allocates model capacity and NFEs across stages to balance generation fidelity and quality. In the first stage, prompt fidelity is prioritized through a low-resolution generation process utilizing large parameters and sufficient NFEs to enhance computational efficiency. The second stage achieves a nearly straight ODE trajectory between low and high resolutions via flow matching, effectively generating fine details with minimal NFEs. To ensure a seamless connection between the two independently trained stages during inference, we carefully design degradation strategies during the second-stage training. Quantitative and visual results demonstrate that FlashVideo achieves state-of-the-art high-resolution video generation with superior computational efficiency. Additionally, the two-stage design enables users to preview the initial output and accordingly adjust the prompt before committing to full-resolution generation, thereby significantly reducing computational costs and wait times as well as enhancing commercial viability. Code and weights are available at https://github.com/FoundationVision/FlashVideo.

FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation

Deep neural networks have recently achieved notable progress in 3D point cloud recognition, yet their vulnerability to adversarial perturbations poses critical security challenges in practical deployments. Conventional defense mechanisms struggle to address the evolving landscape of multifaceted attack patterns. Through systematic analysis of existing defenses, we identify that their unsatisfactory performance primarily originates from an entangled feature space, where adversarial attacks can be performed easily. To this end, we present 3D-ANC, a novel approach that capitalizes on the Neural Collapse (NC) mechanism to orchestrate discriminative feature learning. In particular, NC describes the phenomenon in which last-layer features and classifier weights jointly evolve into a simplex equiangular tight frame (ETF) arrangement, establishing maximally separable class prototypes. However, leveraging this advantage in 3D recognition confronts two substantial challenges: (1) prevalent class imbalance in point cloud datasets, and (2) complex geometric similarities between object categories. To tackle these obstacles, our solution combines an ETF-aligned classification module with an adaptive training framework consisting of representation-balanced learning (RBL) and dynamic feature direction loss (FDL). 3D-ANC seamlessly empowers existing models to develop disentangled feature spaces despite the complexity in 3D data distribution. Comprehensive evaluations show that 3D-ANC significantly improves the robustness of models with various structures on two datasets. For instance, DGCNN's classification accuracy is elevated from 27.2% to 80.9% on ModelNet40, a 53.7% absolute gain that surpasses leading baselines by 34.0%.

3D-ANC: Adaptive Neural Collapse for Robust 3D Point Cloud Recognition
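
The simplex equiangular tight frame (ETF) mentioned in the abstract above has a simple closed form, which the following sketch illustrates. It builds K unit-norm class prototypes as the rows of sqrt(K/(K-1)) * (I - (1/K) * 11^T) and checks that every pair has cosine similarity -1/(K-1). This is the textbook construction only, not the authors' ETF-aligned classification module; the names `simplex_etf` and `cosine` are hypothetical, and the optional rotation into a lower-dimensional feature space is omitted.

```python
import math


def simplex_etf(num_classes):
    """Return K prototypes (rows) forming a simplex ETF in R^K."""
    k = num_classes
    if k < 2:
        raise ValueError("a simplex ETF needs at least two classes")
    scale = math.sqrt(k / (k - 1))
    # Row i is scale * (e_i - (1/K) * ones): a centered, rescaled one-hot vector.
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    prototypes = simplex_etf(4)
    # Every distinct pair is equally and maximally separated: cosine = -1/(K-1).
    print(cosine(prototypes[0], prototypes[1]))
```

For K = 4 the pairwise cosine is -1/3 (up to floating-point error), which is the most negative value that K vectors can share pairwise; this is the sense in which the prototypes are "maximally separable".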

Current brain-computer interfaces primarily decode single motor variables, limiting their ability to support natural, high-bandwidth neural control that requires simultaneous extraction of multiple correlated motor dimensions. We introduce Multi-dimensional Neural Decoding (MND), a task formulation that simultaneously extracts multiple motor variables (direction, position, velocity, acceleration) from single neural population recordings. MND faces two key challenges: cross-task interference when decoding correlated motor dimensions from shared cortical representations, and generalization issues across sessions, subjects, and paradigms. To address these challenges, we propose OrthoSchema, a multi-task framework inspired by cortical orthogonal subspace organization and cognitive schema reuse. OrthoSchema enforces representation orthogonality to eliminate cross-task interference and employs selective feature reuse transfer for few-shot cross-session, cross-subject, and cross-paradigm adaptation. Experiments on macaque motor cortex datasets demonstrate that OrthoSchema significantly improves decoding accuracy in cross-session, cross-subject, and challenging cross-paradigm generalization tasks, with larger performance improvements when fine-tuning samples are limited. Ablation studies confirm that all components contribute synergistically, with OrthoSchema effectively modeling cross-task features and capturing session relationships for robust transfer. Our results provide new insights into scalable and robust neural decoding for real-world BCI applications.

Multi-dimensional Neural Decoding with Orthogonal Representations for Brain-Computer Interfaces

Referring Expression Counting (REC) extends class-level object counting to the fine-grained subclass level, aiming to enumerate objects matching a textual expression that specifies both the class and a distinguishing attribute. A fundamental challenge, however, has been overlooked: annotation points are typically placed on class-representative locations (e.g., heads), forcing models to focus on class-level features while neglecting attribute information from other visual regions (e.g., legs for "walking"). To address this, we propose W2-Net, a novel framework that explicitly decouples the problem into "what to count" and "where to see" via a dual-query mechanism. Specifically, alongside the standard what-to-count (w2c) queries that localize the object, we introduce dedicated where-to-see (w2s) queries. The w2s queries are guided to seek and extract features from attribute-specific visual regions, enabling precise subclass discrimination. Furthermore, we introduce Subclass Separable Matching (SSM), a novel matching strategy that incorporates a repulsive force to enhance inter-subclass separability during label assignment. W2-Net significantly outperforms the state-of-the-art on the REC-8K dataset, reducing counting error by 22.5% (validation) and 18.0% (test), and improving localization F1 by 7% and 8%, respectively. Code will be available.
