Singapore

Despite advancements in Video Large Language Models (Vid-LLMs) improving multimodal understanding, challenges persist in streaming video reasoning due to its reliance on contextual information. Existing paradigms feed all available historical contextual information into Vid-LLMs, resulting in a significant computational burden for visual data processing. Furthermore, the inclusion of irrelevant context distracts models from key details. This paper introduces a challenging task called Context-guided Streaming Video Reasoning (CogStream), which simulates real-world streaming video scenarios, requiring models to identify the most relevant historical contextual information to deduce answers for questions about the current stream. To support CogStream, we present a densely annotated dataset featuring extensive and hierarchical question-answer pairs, generated by a semi-automatic pipeline. Additionally, we present CogReasoner as a baseline model. It tackles this task by leveraging visual stream compression and historical dialogue retrieval. Extensive experiments demonstrate the effectiveness of this method.

AAAI 2026

CogStream: Context-guided Streaming Video Question Answering

streaming video understanding

large multimodal models

long-context reasoning

video question answering

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

Despite the rapid progress of multimodal large language models (MLLMs), their capacity for low-level visual perception in underwater environments remains underexplored. To address this gap, we present UQ-Bench, the first systematically designed benchmark for evaluating the ability of MLLMs to perceive and assess underwater image quality at the low-level visual attribute level. UQ-Bench comprises three components: (1) UW-Perception, a dataset of 3,000 underwater images paired with targeted questions on key degradations such as color cast, blur, contrast, and exposure, covering both global and local perceptual dimensions; (2) UW-Describe, a dataset of 500 images with expert-annotated gold-standard descriptions for assessing the accuracy of model-generated text; and (3) UW-Eval, an evaluation protocol employing human mean opinion scores (MOS) for quantitative quality assessment. To ensure rigorous and reproducible benchmarking, we propose a GPT-assisted evaluation framework that aligns model outputs with expert references and enables fine-grained analysis of distortion perception. Experimental results demonstrate that while MLLMs exhibit preliminary competence in underwater low-level visual tasks, they still fall short in capturing subtle degradations and achieving human-level consistency, highlighting the need for further advances in foundation models for marine vision. Both the benchmark and code will be made publicly available.

UQ-Bench: A Benchmark for Evaluating Multimodal LLMs on Underwater Image Quality Assessment

Model robustness indicates a model's capability to generalize well on unforeseen distributional shifts, including data corruptions and adversarial attacks. Data augmentation is one of the most prevalent and effective ways to enhance the robustness. Despite the great success of the diverse augmentations in different fields, a unified theoretical understanding of their efficacy in improving model robustness is lacking. We theoretically reveal a general condition for label-preserving augmentations to bring robustness to diverse distribution shifts through the lens of flat minima and generalization bound, which de facto turns out to be strongly correlated with robustness against different distribution shifts in practice. Unlike most earlier works, our theoretical framework accommodates all the label-preserving augmentations and is not limited to particular distribution shifts. We substantiate our theories through different simulations on the existing common corruption and adversarial robustness benchmarks based on the CIFAR and ImageNet datasets.

A Flat Minima Perspective on Understanding Augmentations and Model Robustness

Point cloud quality assessment (PCQA) is essential for reliable 3D visual applications. While point-based methods face challenges in characterizing distortions due to point cloud disorder, projection-based approaches offer better efficiency but suffer from geometric distortion insensitivity and texture representation blind spots. This study proposes SAF-Net, a multi-view structure-aware feature fusion network for PCQA. We first identify two key limitations in projection-based methods: insufficient geometric distortion perception and representation blind spots (RBS) in texture images. To address these issues, SAF-Net innovatively integrates object mask maps and local binary pattern (LBP) maps. The mask maps enhance geometric distortion perception by extracting edge sharpness and curvature variations, while LBP maps capture essential structural information to overcome RBS and align with human visual system (HVS) sensitivity. SAF-Net employs a hybrid CNN-ViT architecture to balance local feature extraction and global context modeling, along with a progressive fusion strategy to optimize cross-modal feature interaction. Extensive experiments demonstrate the superior performance of SAF-Net on multiple benchmarks, establishing new state-of-the-art results in PCQA.

Point Cloud Quality Assessment via Multi-View Structure-Aware Feature Fusion
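The local binary pattern (LBP) maps mentioned in the SAF-Net abstract can be illustrated with the textbook 3x3 operator. This is a minimal sketch only: the abstract does not say which LBP variant (radius, number of neighbours, uniform patterns) SAF-Net uses, so the function name `lbp_map`, the neighbour ordering, and the sample patches are all assumptions made for illustration.

```python
from typing import List

# Neighbour offsets (row, col), clockwise from the top-left pixel.
# The ordering is an arbitrary choice for this sketch.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_map(gray: List[List[int]]) -> List[List[int]]:
    """Return the 8-bit LBP code of every interior pixel of a grayscale image."""
    height, width = len(gray), len(gray[0])
    codes = []
    for r in range(1, height - 1):
        row_codes = []
        for c in range(1, width - 1):
            center = gray[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(OFFSETS):
                # Set the bit when the neighbour is at least as bright as the centre.
                if gray[r + dr][c + dc] >= center:
                    code |= 1 << bit
            row_codes.append(code)
        codes.append(row_codes)
    return codes


# Hypothetical 3x3 patches: a flat region, an isolated peak, and a bright top edge.
flat_patch = [[7, 7, 7], [7, 7, 7], [7, 7, 7]]
peak_patch = [[0, 0, 0], [0, 5, 0], [0, 0, 0]]
edge_patch = [[9, 9, 9], [0, 5, 0], [0, 0, 0]]

print(lbp_map(flat_patch), lbp_map(peak_patch), lbp_map(edge_patch))
```

Because each code depends only on the ordering of a pixel relative to its neighbours, the map encodes local structure rather than absolute intensity, which is the usual reason for pairing LBP with texture images.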

We present TrackGS, the first method to integrate global feature tracks with 3D Gaussian Splatting (3DGS) for COLMAP-free novel view synthesis. While 3DGS delivers impressive rendering quality, its reliance on accurate precomputed camera parameters remains a significant limitation. Existing COLMAP-free approaches depend on local constraints that fail in complex scenarios. Our key innovation lies in leveraging feature tracks to establish global geometric constraints, enabling simultaneous optimization of camera parameters and 3D Gaussians. Specifically, we: (1) introduce track-constrained Gaussians that serve as geometric anchors, (2) propose novel 2D and 3D track losses to enforce multi-view consistency, and (3) derive differentiable formulations for camera intrinsics optimization. Extensive experiments on challenging real-world and synthetic datasets demonstrate state-of-the-art performance, with much lower pose error than previous methods while maintaining superior rendering quality. Our approach eliminates the need for COLMAP preprocessing, making 3DGS more accessible for practical applications.

TrackGS: Optimizing COLMAP-Free 3D Gaussian Splatting with Global Track Constraints

Behavior trees (BTs) are becoming a popular control architecture for robots, featuring modularity, reactivity, and robustness. BT planning, an emerging approach, provides a theoretical guarantee to generate reliable BTs for achieving tasks automatically. However, BT planning assumes that a well-designed BT system has already been grounded, including high-level action models and low-level control policies, which often requires expensive expert knowledge and effort in the specific domain. In this paper, we define the BT grounding problem, where an algorithm needs to automatically construct a complete and consistent BT system for a given task set. We present a naive algorithm that is sound and complete for this problem but impractical to implement because of its exponential complexity. Then, we propose the first framework for efficiently solving the BT grounding problem, named Context-Aware Behavior Tree grOunding (CABTO). CABTO mainly utilizes pre-trained Large Models (LMs) to heuristically search the space of action models and control policies based on the contexts of BT planning and environmental feedback. Experiments on 3 robot manipulation task sets, involving a total of 15 tasks across different scenarios and robots, demonstrate CABTO’s effectiveness and efficiency in generating complete and consistent behavior tree systems.

CABTO: Context-Aware Behavior Tree Grounding for Robot Manipulation

Self-training large language models (LLMs) with generated reasoning paths has emerged as a promising approach to improve performance on complex reasoning tasks. However, most existing methods rely on correctness-based supervision, treating samples that reach the correct answer as high-quality despite potentially flawed intermediate steps, leading to noisy training signals. In this work, we propose K-STaR (Knowledge-aware Self-Taught Reasoner), a self-training framework that verifies reasoning paths through knowledge elicitation and integration as a proxy, without requiring any external reward models or dense step-by-step annotations. K-STaR models reasoning as a structured composition of knowledge units and automatically assigns process rewards to intermediate steps via consistency and frequency analysis, ensuring that only knowledge-grounded reasoning paths are retained. Experiments on mathematical and commonsense reasoning tasks show that K-STaR consistently discovers higher-quality reasoning paths and achieves superior self-training performance compared to prior methods. Our results highlight the importance of moving beyond correctness-centric supervision toward knowledge-grounded self-improvement.

K-STaR: Knowledge-Aware Self-Taught Reasoner

In cooperative video games, traditional AI companions are deployed to assist players, who control them using hotkeys or command wheels to issue predefined commands such as "attack", "defend", or "retreat". Despite their simplicity, these methods, which lack target specificity, limit players' ability to give complex tactical instructions and hinder immersive gameplay experiences. To address this, we propose the FPS AI Companion who Understands Language (F.A.C.U.L.), the first real-time AI system that enables players to communicate and collaborate with AI companions using natural language. By integrating natural language processing with a confidence-based framework, F.A.C.U.L. efficiently decomposes complex commands and interprets player intent. It also employs a dynamic entity retrieval method for environmental awareness, aligning human intentions with decision-making. Unlike traditional rule-based systems, our method supports real-time language interactions, enabling players to issue complex commands such as "clear the second floor," "take cover behind that tree," or "retreat to the river". The system provides real-time behavioral responses and vocal feedback, ensuring seamless tactical collaboration. Using the popular FPS game Arena Breakout: Infinite as a case study, we present comparisons demonstrating the efficacy of our approach and discuss the advantages and limitations of AI companions based on real-world user feedback.

F.A.C.U.L.: Language-Based Interaction with AI Companions in Gaming

Accurate forecasting of multivariate time series data remains a formidable challenge, particularly due to the growing complexity of temporal dependencies in real-world scenarios. While neural network-based models have achieved notable success in this domain, complex channel-dependent models often underperform channel-independent models, which ignore the relationships between components but are highly robust owing to their small capacity. In this work, we propose HN-MVTS, a novel architecture that integrates a hypernetwork-based generative prior with an arbitrary neural network forecasting model. The input of this hypernetwork is a learnable embedding matrix of time series components. To restrict the number of new parameters, the hypernetwork learns to generate the weights of the last layer of the target forecasting networks, serving as a data-adaptive regularizer that improves generalization and long-range predictive accuracy. The hypernetwork is used only during training, so it does not increase the inference time compared to the base forecasting model. Extensive experiments on eight benchmark datasets demonstrate that applying HN-MVTS to state-of-the-art models (DLinear, PatchTST, TSMixer, etc.) consistently improves their performance.

HN-MVTS: HyperNetwork-based Multivariate Time Series Forecasting
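The idea in the HN-MVTS abstract of generating last-layer weights from per-component embeddings can be sketched roughly as follows. The abstract does not describe the hypernetwork's architecture, so the single linear map, the dimensions, and the names `linear_hypernetwork` and `forecast` are hypothetical choices for illustration, and no training loop is shown.

```python
import random
from typing import List

Matrix = List[List[float]]


def linear_hypernetwork(embedding: List[float], hyper_weights: Matrix,
                        out_dim: int, in_dim: int) -> Matrix:
    """Map one component embedding to an (out_dim x in_dim) last-layer weight matrix.

    Assumption: the hypernetwork is a single linear layer whose flat output is
    reshaped row by row into the last layer of the forecasting model.
    """
    flat = [sum(w * e for w, e in zip(row, embedding)) for row in hyper_weights]
    return [flat[r * in_dim:(r + 1) * in_dim] for r in range(out_dim)]


def forecast(features: List[float], last_layer: Matrix) -> List[float]:
    """Apply the generated last layer to the base model's hidden features."""
    return [sum(w * f for w, f in zip(row, features)) for row in last_layer]


random.seed(0)
n_components, emb_dim, hidden, horizon = 3, 4, 5, 2

# One embedding per time series component (learnable in the real method;
# random placeholders here).
embeddings = [[random.gauss(0, 1) for _ in range(emb_dim)] for _ in range(n_components)]
hyper_weights = [[random.gauss(0, 0.1) for _ in range(emb_dim)]
                 for _ in range(horizon * hidden)]

# Generate one last-layer matrix per component. After training these can be
# stored, so the hypernetwork is not needed at inference time.
last_layers = [linear_hypernetwork(e, hyper_weights, horizon, hidden) for e in embeddings]
features = [1.0] * hidden  # placeholder hidden features from a base forecaster
predictions = [forecast(features, w) for w in last_layers]
print(predictions)
```

Because every component's last layer comes from the same shared map, components with similar embeddings receive similar weights, which is one way to read the abstract's description of the hypernetwork as a regularizer.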

The proliferation of facial recognition (FR) systems has raised privacy concerns in the digital realm, as malicious uses of FR models pose a significant threat. Traditional countermeasures, such as makeup style transfer, have suffered from low transferability in black-box settings and limited applicability across various demographic groups, including males and individuals with darker skin tones. To address these challenges, we introduce a novel facial privacy protection method, dubbed MAP, a pioneering approach that employs human emotion modifications to disguise original identities as target identities in facial images. Our method uniquely fine-tunes a score network to learn dual objectives, target identity and human expression, which are jointly optimized through gradient projection to ensure convergence at a shared local optimum. Additionally, we enhance the perceptual quality of protected images by applying local smoothness regularization and optimizing the score matching loss within our network. Empirical experiments demonstrate that our innovative approach surpasses previous baselines, including noise-based, makeup-based, and freeform attribute methods, in both qualitative fidelity and quantitative metrics. Furthermore, MAP proves its effectiveness against an online FR API and shows advanced adaptability in uncommon photographic scenarios. Our code is available at: https://anonymous.4open.science/r/MAP-REVIEW.

Machine Pareidolia: Protecting Facial Image with Emotional Editing

Sparse Autoencoder (SAE) has emerged as a powerful tool for mechanistic interpretability of large language models. Recent works apply SAE to protein language models (PLMs), aiming to extract and analyze biologically meaningful features from their latent spaces. However, SAE suffers from semantic entanglement, where individual neurons often mix multiple nonlinear concepts, making it difficult to reliably interpret or manipulate model behaviors. In this paper, we propose a semantically-guided SAE, called ProtSAE. Unlike existing SAE which requires annotation datasets to filter and interpret activations, we guide semantic disentanglement during training using both annotation datasets and domain knowledge to mitigate the effects of entangled attributes. We design interpretability experiments showing that ProtSAE learns more biologically relevant and interpretable hidden features compared to previous methods. Performance analyses further demonstrate that ProtSAE maintains high reconstruction fidelity while achieving better results in interpretable probing. We also show the potential of ProtSAE in steering PLMs for downstream generation tasks.
