Singapore

Vision-Language Models (VLMs) have shown significant potential in surgical scene analysis, yet existing models are limited by frame-level datasets and lack high-quality video data with procedural surgical knowledge. To address these challenges, we make the following contributions: (i) SurgPub-Video, a comprehensive dataset of over 3,000 surgical videos and 25 million annotated frames across 11 specialties, sourced from peer-reviewed clinical journals; (ii) SurgLLaVA-Video, a specialized VLM for surgical video understanding, built upon the TinyLLaVA-Video architecture, that supports both video-level and frame-level inputs; and (iii) a video-level surgical Visual Question Answering (VQA) benchmark covering 11 diverse surgical specialties, such as vascular, cardiology, and thoracic. Extensive experiments on the proposed benchmark and three additional surgical downstream tasks (action recognition, skill assessment, and triplet recognition) show that SurgLLaVA-Video, with only three billion parameters, significantly outperforms both general-purpose and surgical-specific VLMs. The dataset, model, and benchmark will be released to enable further advancements in surgical video understanding.

AAAI 2026

SurgPub-Video: A Comprehensive Surgical Video Framework for Enhanced Surgical Intelligence in Vision-Language Model

surgical video

multi-modal large language model

visual question answering

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.


Masked image generation (MIG) has demonstrated remarkable efficiency and high-fidelity image synthesis by enabling parallel token prediction. Existing methods typically rely solely on the model itself to learn semantic dependencies among visual token sequences. However, directly learning such semantic dependencies from data is challenging because the individual tokens lack clear semantic meanings, and these sequences are usually long. To address this limitation, we propose a novel Knowledge-Augmented Masked Image Generation framework, named KA-MIG, which introduces explicit knowledge of token-level semantic dependencies (i.e., extracted from the training data) as priors to learn richer representations and improve performance. In particular, we explore and identify three types of advantageous token knowledge graphs, two positive and one negative (i.e., the co-occurrence graph, the semantic similarity graph, and the position-token incompatibility graph). Based on these three prior knowledge graphs, we design a graph-aware encoder to learn token and position-aware representations. After that, a lightweight fusion mechanism is introduced to integrate these enriched representations into existing MIG methods. By drawing on such prior knowledge, our method effectively enhances the model's ability to capture semantic dependencies, leading to improved generation quality. Experimental results demonstrate that our method improves upon existing MIG methods for class-conditional image generation on ImageNet.

Improved Masked Image Generation with Knowledge-Augmented Token Representations
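To make the idea of a token co-occurrence graph in the KA-MIG abstract more concrete, here is a minimal sketch of one way such a graph could be counted from tokenized training images. The abstract does not say how the graph is built, so the function name `cooccurrence_graph`, the `min_count` threshold, and the rule "two tokens co-occur if they appear in the same sequence" are all illustrative assumptions rather than the authors' procedure.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_graph(sequences, min_count=2):
    """Count how often two distinct token ids appear in the same sequence.

    Hypothetical helper: returns {(a, b): count} with a < b, keeping only
    pairs that co-occur in at least `min_count` sequences (assumed threshold).
    """
    counts = Counter()
    for seq in sequences:
        unique = sorted(set(seq))  # ignore repeats within one sequence
        counts.update(combinations(unique, 2))
    return {pair: c for pair, c in counts.items() if c >= min_count}


# Toy "visual token" sequences; real ones would come from an image tokenizer.
toy = [[3, 7, 7, 9], [3, 9, 12], [7, 9, 3], [12, 5]]
graph = cooccurrence_graph(toy)
```

The surviving edges could then serve as a positive prior between tokens, while rare or absent pairs are simply left out of the graph.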

In this paper, we focus on Single-Domain Generalized Object Detection (Single-DGOD), aiming to transfer a detector trained on one source domain to multiple unknown domains.
Existing methods for Single-DGOD typically rely on discrete data augmentation or static perturbation methods to expand data diversity, thereby mitigating the lack of access to target domain data. However, in real-world scenarios such as changes in weather or lighting conditions, domain shifts often occur continuously and gradually. 
Discrete augmentations and static perturbations fail to effectively capture the dynamic variation of feature distributions, thereby limiting the model's ability to perceive fine-grained cross-domain differences.
To this end, we propose a new method, i.e., Liquid Temporal Feature Evolution, which simulates the progressive evolution of features from the source domain to simulated latent distributions by incorporating temporal modeling and liquid neural network–driven parameter adjustment. Specifically, we introduce controllable Gaussian noise injection and multi-scale Gaussian blurring to simulate initial feature perturbations, followed by temporal modeling and a liquid parameter adjustment mechanism to generate adaptive modulation parameters, enabling a smooth and continuous adaptation across domains.
By capturing progressive cross-domain feature evolution and dynamically regulating adaptation paths, our method bridges the distribution gap between the source domain and unknown domains, significantly boosting generalization and robustness to unseen shifts.
Significant performance improvements on the Diverse Weather dataset and Real-to-Art benchmark demonstrate the superiority of our method.

Simulating Distribution Dynamics: Liquid Temporal Feature Evolution for Single-Domain Generalized Object Detection
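The initial perturbation step described in the Liquid Temporal Feature Evolution abstract (Gaussian noise injection followed by multi-scale Gaussian blurring) can be illustrated on a one-dimensional feature vector. This is only a sketch under stated assumptions: the abstract gives no noise levels, blur scales, or rule for combining scales, so `noise_std`, `sigmas`, the edge replication, and the averaging across scales are hypothetical choices, and the temporal modeling and liquid parameter adjustment are not shown.

```python
import math
import random


def gaussian_kernel(sigma, radius=None):
    """Normalized 1D Gaussian kernel; radius defaults to about 3 sigma."""
    radius = radius if radius is not None else max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur(signal, sigma):
    """Convolve with a Gaussian kernel, replicating values at the edges."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)  # clamp index to the signal
            acc += w * signal[j]
        out.append(acc)
    return out


def perturb(signal, noise_std=0.1, sigmas=(0.5, 1.0, 2.0), seed=0):
    """Inject Gaussian noise, blur at several scales, and average the scales.

    All parameter values and the averaging rule are assumptions for illustration.
    """
    rng = random.Random(seed)
    noisy = [x + rng.gauss(0.0, noise_std) for x in signal]
    blurred = [blur(noisy, s) for s in sigmas]
    return [sum(vals) / len(sigmas) for vals in zip(*blurred)]


features = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
perturbed = perturb(features)
```

Raising `noise_std` or widening `sigmas` gives a stronger simulated shift, which is one simple way to make the perturbation controllable.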

Multi-view 3D object detection plays a vital role in autonomous driving systems due to its ability to perceive complex scenes accurately. However, real-world driving data often exhibits a long-tailed distribution, causing significant drops in detection accuracy for rare categories in existing methods. To mitigate this issue, we propose CLIPDet3D, a novel vision-language collaborative framework for multi-view 3D object detection. First, to tackle the difficulty of capturing the semantic information of rare categories, a Vision-Language Collaborative Learning strategy is proposed to incorporate class-level semantic priors from CLIP. Second, a Depth Feature Contrastive Distillation module is designed to overcome the large depth estimation error for rare categories by aligning depth features between a teacher and a student network. Furthermore, to alleviate the difficulty in focusing on regions of rare categories, a Dual-Stream Prompt Attention mechanism is devised to inject learnable prompts and compute attention along both horizontal and vertical BEV directions. Evaluations on the nuScenes dataset demonstrate that CLIPDet3D achieves state-of-the-art accuracy while maintaining efficient inference.

CLIPDet3D: Vision-Language Collaborative Distillation for 3D Object Detection

In recent years, the development of burst imaging technology has improved the capture and processing capabilities of visual data, enabling a wide range of applications. However, the redundancy in burst images leads to increased storage and transmission demands, as well as reduced efficiency of downstream tasks. To address this, we propose a new task of Burst Image Quality Assessment (BuIQA), to evaluate the task-driven quality of each frame within a burst sequence, providing reasonable cues for burst image selection. Specifically, we establish the first benchmark dataset for BuIQA, consisting of 7,346 burst sequences with 45,827 images and 191,572 annotated quality scores for multiple downstream scenarios.
Inspired by the data analysis, a unified BuIQA framework is proposed to achieve efficient adaptation for BuIQA under diverse downstream scenarios.
Specifically, a task-driven prompt generation network is developed with heterogeneous knowledge distillation, to learn the priors of the downstream task. Then, the task-aware quality assessment network is introduced to assess the burst image quality based on the task prompt. Extensive experiments across 10 downstream scenarios demonstrate the impressive BuIQA performance of the proposed approach, outperforming the state-of-the-art. Furthermore, applying our approach to select high-quality burst frames yields a 0.33 dB PSNR improvement in the downstream tasks of denoising and super-resolution.

Burst Image Quality Assessment: A New Benchmark and Unified Framework for Multiple Downstream Tasks
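For readers unfamiliar with the scale of the 0.33 dB PSNR gain quoted in the BuIQA abstract, the short sketch below applies the standard definition PSNR = 10 · log10(MAX² / MSE) to show what such a gain means in terms of mean squared error. The 8-bit peak value of 255 and the function names are assumptions for illustration; the abstract does not state the evaluation setup.

```python
import math


def psnr(mse, max_value=255.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE must shrink to raise PSNR by `gain_db` decibels."""
    return 10.0 ** (-gain_db / 10.0)


ratio = mse_ratio_for_gain(0.33)            # about 0.927
baseline = psnr(100.0)                      # PSNR at an arbitrary example MSE
improved = psnr(100.0 * ratio)              # same image with the reduced MSE
```

Under this definition, a 0.33 dB improvement corresponds to roughly 7% lower mean squared error, independent of the peak value chosen.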

While adapting pretrained vision models to downstream dense prediction tasks is widely used, current methods often overlook adaptation efficiency, especially in the context of multi-task learning (MTL). Although parameter-efficient fine-tuning (PEFT) methods can enhance parameter efficiency, broader aspects such as GPU memory and training time efficiency remain underexplored. In this paper, we propose a new paradigm that simultaneously achieves efficiency in Parameters, GPU Memory, and Training Time for Multi-Task Dense Vision Adaptation. Specifically, we propose a dual-branch framework, in which a frozen pretrained backbone serves as the generic main branch, and the proposed Bi-Directional Task Adaptation (BDTA) modules are integrated in parallel to form a task bypass branch that extracts adaptation features required by multiple specific tasks. This adaptation module is lightweight, efficient, and does not require backpropagation through the large pretrained backbone, thus avoiding resource-intensive gradient computations. Moreover, a Mixture of Task Experts mechanism (MoTE) is further proposed to integrate adaptation features across tasks and scales, thereby obtaining more robust representations tailored for dense prediction tasks. On the PASCAL-Context benchmark, our method achieves over 2× relative performance improvement compared to the best prior multi-task PEFT method, while using only ~30% of the parameters, ~50% of the memory, and ~60% of the training time, demonstrating superior overall adaptation efficiency.

Parameter-, Memory-, Time-Efficient Multi-Task Dense Vision Adaptation

Full-body motion tracking plays an essential role in AR/VR applications, bridging physical and virtual interactions. However, it is challenging to reconstruct realistic and diverse full-body poses based on sparse signals obtained by head-mounted displays, which are the main devices in AR/VR scenarios. Existing methods for pose reconstruction often incur high computational costs or rely on separately modeling spatial and temporal dependencies, making it difficult to balance accuracy, temporal coherence, and efficiency. To address this problem, we propose KineST, a novel kinematics-guided state space model, which effectively extracts spatiotemporal dependencies while integrating local and global pose perception. The innovation comes from two core ideas. Firstly, in order to better capture intricate joint relationships, the scanning strategy within the State Space Duality framework is reformulated into kinematics-guided bidirectional scanning, which embeds kinematic priors. Secondly, a mixed spatiotemporal representation learning approach is employed to tightly couple spatial and temporal contexts, balancing accuracy and smoothness. Additionally, a geometric angular velocity loss is introduced to impose physically meaningful constraints on rotational variations for further improving motion stability. Extensive experiments demonstrate that KineST has superior performance in both accuracy and temporal consistency within a lightweight framework.

KineST: A Kinematics-guided Spatiotemporal State Space Model for Human Motion Tracking from Sparse Signals

Missing values in high-dimensional, mixed-type datasets pose significant challenges for data imputation, particularly under Missing Not At Random (MNAR) mechanisms. Existing methods struggle to integrate local and global data characteristics, limiting performance in MNAR and high-dimensional settings. We propose an innovative framework, RefiDiff, that combines local machine learning predictions with a novel Mamba-based denoising network, which efficiently captures long-range dependencies among features and samples with low computational complexity. RefiDiff bridges the predictive and generative paradigms of imputation, leveraging pre-refinement for initial warm-up imputations and post-refinement to polish results, enhancing stability and accuracy. By encoding mixed-type data into unified tokens, RefiDiff enables robust imputation without architectural or hyperparameter tuning. RefiDiff outperforms state-of-the-art (SOTA) methods across missing-value settings, demonstrating strong performance in MNAR settings and superior out-of-sample generalization. Extensive evaluations on nine real-world datasets demonstrate its robustness, scalability, and effectiveness in handling complex missingness patterns.

RefiDiff: Progressive Refinement Diffusion for Efficient Missing Data Imputation

Prompting is fundamental to unlocking the full potential of large language models. To automate and enhance this process, automatic prompt optimization (APO) has been developed, demonstrating effectiveness primarily in text-only input scenarios. However, extending existing APO methods to multimodal tasks—such as video-language generation—introduces two core challenges: *1) visual token inflation*, where long visual-token sequences restrict context capacity and result in insufficient feedback signals; *2) a lack of process-level supervision*, as existing methods focus on outcome-level supervision and overlook intermediate supervision, limiting prompt optimization.
We present **UniAPO**: **Uni**fied Multimodal **A**utomated **P**rompt **O**ptimization, the first framework tailored for multimodal APO. 
UniAPO adopts an EM-inspired optimization process that decouples feedback modeling and prompt refinement, making the optimization more stable and goal-driven. 
To further address the aforementioned challenges, we introduce a short-long term memory mechanism: historical feedback mitigates context limitations, while historical prompts provide directional guidance for effective prompt optimization.
UniAPO achieves consistent gains across text, image, and video benchmarks, establishing a unified framework for efficient and transferable prompt optimization.

UniAPO: Unified Multimodal Automated Prompt Optimization

Critical thinking is essential for building robust AI systems, preventing them from blindly accepting flawed data or biased reasoning.
However, prior work has primarily focused on passive critical thinking, where models simply reject problematic queries without taking constructive steps to address user requests.
In this work, we introduce proactive critical thinking, a paradigm where models actively seek missing or clarifying information from users to resolve their queries better.
To evaluate this capability, we present GSM-MC and GSM-MCE, two novel benchmarks based on GSM8K for assessing mathematical reasoning under incomplete or misleading conditions.
GSM-MC contains 1,368 math problems with a key variable deliberately removed, requiring models to identify and request the missing information.
GSM-MCE further increases the difficulty by introducing irrelevant details to test robustness against distractions.
Experiments on Qwen3 and Llama series models show that, while these models excel in traditional reasoning tasks due to extensive post-training and inference-time scaling, they struggle with proactive critical thinking, especially smaller ones.
However, we demonstrate that reinforcement learning (RL) can significantly improve this ability. Using our enhanced RL algorithm, we achieve substantial gains, boosting Qwen3-1.7B's accuracy from 0.15% to 73.98% on GSM-MC.
Our data and code will be released upon acceptance. We hope this work advances models that collaborate more effectively with users in problem-solving through proactive critical thinking.

Beyond Passive Critical Thinking: Fostering Proactive Questioning to Enhance Human-AI Collaboration

We challenge the assumption that complex instruction-guided segmentation tasks necessitate equally complex and explicit supervision. This paper introduces RISE (Reasoning via Implicit Self-supervised Emergence), a framework that learns intricate compositional reasoning, spanning spatial relations to world knowledge, without a single ground-truth mask. To achieve this, RISE employs reinforcement learning with GRPO guided by a single, strikingly simple reward: the semantic alignment score between the textual instruction and the predicted image region. Our primary discovery is the implicit emergence of a high-quality chain-of-thought process from this minimalist signal. Within a structured format, the model autonomously learns to understand instructions by accessing its latent knowledge, inferring spatial relationships—capabilities inherent in its architecture but unlocked by our simple objective. Remarkably, our emergent reasoning yields highly competitive results: RISE achieves 58.7 gIoU on the ReasonSeg benchmark, on par with methods using geometric rewards. Furthermore, we show extreme data efficiency: a variant trained on only 2,000 ImageNet-label pairs establishes a new state-of-the-art for annotation-free referring segmentation with 73.7 cIoU on RefCOCO, drastically outperforming prior work (46.5).
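The RISE abstract relies on GRPO driven by a single scalar reward. As a rough illustration of the group-relative step that GRPO is known for, the sketch below normalizes the rewards of several sampled responses to the same instruction by the group mean and standard deviation. The reward values are made up, and the use of the population standard deviation and the small `eps` term are assumptions; the sketch does not reproduce RISE's alignment reward or training loop.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    `eps` guards against division by zero when all rewards are equal
    (an assumed detail, not taken from the abstract).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical alignment scores for four responses to one instruction.
rewards = [0.2, 0.4, 0.6, 0.8]
advantages = group_relative_advantages(rewards)
```

Responses scoring above the group average receive positive advantages and are reinforced, while those below average are discouraged, with no learned value function required.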
