Singapore

Streaming video question answering (Streaming Video QA) poses distinct challenges for multimodal large language models (MLLMs), as video frames arrive sequentially and user queries can be issued at arbitrary timepoints. Existing solutions relying on fixed-size memory or naive compression often suffer from context loss or memory overflow, limiting their effectiveness in long-form, real-time scenarios. We present Vista, a novel framework for scene-aware streaming video QA that enables efficient and scalable reasoning over continuous video streams. The innovation of Vista can be summarized in three aspects: (1) Scene-aware segmentation. Vista dynamically clusters incoming frames into temporally and visually coherent scene units. (2) Scene-aware compression. Each scene is compressed into a compact token representation and stored in GPU memory for efficient index-based retrieval, while the full-resolution frames are offloaded to CPU memory. (3) Scene-aware recall. Upon receiving a question, relevant scenes are selectively recalled and reintegrated into the model’s input space, enabling both efficiency and completeness. Vista is model-agnostic and integrates seamlessly with a variety of vision-language backbones, enabling long-context reasoning without compromising latency or memory efficiency. Extensive experiments on StreamingBench demonstrate that Vista achieves state-of-the-art performance, establishing a strong baseline for real-world streaming video understanding.
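The three stages named in the abstract (segmentation, compression, recall) can be illustrated with a toy sketch. This is not Vista's implementation: the cosine-similarity threshold, the mean-pooled scene summary, the top-k recall rule, and every name below (`SceneMemory`, `add_frame`, `recall`) are assumptions chosen only to make the data flow concrete, and plain Python lists stand in for GPU and CPU storage.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


class SceneMemory:
    """Toy scene store (hypothetical, not the paper's data structure)."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold  # assumed similarity cut-off for a scene change
        self.scenes = []            # full frame features per scene (stands in for CPU memory)
        self.summaries = []         # one compact vector per scene (stands in for the GPU index)

    def add_frame(self, feat):
        # Segmentation: extend the current scene while the new frame resembles it.
        if self.scenes and cosine(feat, self.summaries[-1]) >= self.threshold:
            self.scenes[-1].append(feat)
            # Compression: keep only a mean-pooled summary in the index.
            self.summaries[-1] = mean_vector(self.scenes[-1])
        else:
            self.scenes.append([feat])
            self.summaries.append(list(feat))

    def recall(self, query, k=1):
        # Recall: rank scenes by summary similarity and return the top-k with their frames.
        ranked = sorted(
            range(len(self.summaries)),
            key=lambda i: cosine(query, self.summaries[i]),
            reverse=True,
        )
        return [(i, self.scenes[i]) for i in ranked[:k]]
```

In this sketch a query is just another feature vector. In a real system the query embedding and the frame features would come from the vision-language backbone, which the abstract does not specify.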

AAAI 2026

Vista: Scene-Aware Optimization for Streaming Video Question Answering Under Post-Hoc Queries

ml: large multimodal models (lmms).

cv: visual reasoning & symbolic representations

cv: video understanding & activity analysis

cv: multi-modal vision

cv: language and vision

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.


Continual learning methods used to force neural networks to process sequential tasks in isolation, preventing them from leveraging useful inter-task relationships and causing them to repeatedly relearn similar features or overly differentiate them. To address this problem, we propose a fully differentiable, exemplar-free expandable method composed of two complementary memories: one learns common features that can be used across all tasks, and the other combines the shared features to learn discriminative characteristics unique to each sample. Both memories are differentiable so that the network can autonomously learn latent representations for each sample. For each task, the memory adjustment module adaptively prunes critical slots and minimally expands capacity to accommodate new concepts, and orthogonal regularization enforces geometric separation between preserved and newly learned memory components to prevent interference. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet show that the proposed method outperforms 14 state-of-the-art methods for class-incremental learning, achieving final accuracies of 55.13%, 37.24%, and 30.11%, respectively. Additional analysis confirms that, through effective integration and utilization of knowledge, the proposed method can increase average performance across sequential tasks, and it produces feature extraction results closest to the upper bound, thus establishing a new milestone in continual learning.

Expandable and Differentiable Dual Memories with Orthogonal Regularization for Exemplar-free Continual Learning
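The orthogonal regularization mentioned in this abstract can be pictured with a small penalty function. The abstract does not give the exact loss, so the form below (the sum of squared dot products between preserved and new memory slots, which is zero exactly when every pair is orthogonal) is an assumption, and `orthogonal_penalty` is a hypothetical name.

```python
def orthogonal_penalty(old_slots, new_slots):
    """Sum of squared dot products between each preserved slot and each new slot.

    Assumed form of the regularizer: it vanishes when every new slot is
    orthogonal to every preserved slot and grows as they overlap.
    """
    penalty = 0.0
    for old in old_slots:
        for new in new_slots:
            dot = sum(o * n for o, n in zip(old, new))
            penalty += dot * dot
    return penalty
```

Adding such a term to the training loss would push newly learned slots away from the subspace spanned by preserved ones. How the paper weights or normalizes the term is not stated in the abstract.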

Personalized Federated Learning (PFL) customizes models for each client to mitigate challenges from non-IID data, wherein a dominant strategy is model decoupling that partitions models into shared and personalized parts based on architectural priors (e.g., backbone vs. head). However, we reveal a critical flaw in this strategy: it induces "intrinsic drift," a performance degradation often more severe than the well-known client drift, which limits final accuracy. We trace this drift to a steep cliff of high loss emerging from the naive stitching of shared and personalized parts. To address this, we shift from architectural partitioning to a parameter behavior-driven paradigm. We introduce PPFL, an approach that employs a novel soft-fusion strategy guided by parameter-wise behavioral perception. PPFL dynamically infers each parameter's functional role (whether it behaves more like a 'personalist' or a 'generalist' in the current context) by synthesizing its multifaceted behavior observed during local training. Extensive experiments on image, text, and multimodal classification benchmarks show that PPFL outperforms eight state-of-the-art baselines by up to 5.3%. Moreover, it can function as a plug-in module, boosting the accuracy of vanilla FedAvg with a 16.82% absolute gain.

PPFL: A Parameter Behavior-Driven Plug-in Personalization Engine for Federated Learning

Graph-based vertical federated learning (GVFL) enables collaboration by incorporating scattered attributes and adjacency relations from aligned nodes, and allows each party to contribute its personalized input embedding to joint training and inference. The injection of adversarial inputs can steer the inference toward the attacker’s will, forcing benign parties to make negligible contributions and to lose the rewards tied to the importance of their contributions. However, most attacks require server model architectures, queries, or labeled auxiliary graphs from the training domain. These extra requirements are not practical for real-world GVFL applications. In this paper, we propose SGAC, a novel attack framework for crafting adversarial inputs to dominate joint inference without relying on these requirements. SGAC advances prior attacks by requiring only access to auxiliary graphs from non-training domains. SGAC learns generalized label-indicative embeddings and estimates class-transferable probabilities across domains to generate a surrogate model that closely approximates the server model. SGAC then emphasizes salient node attributes and edges in the auxiliary graph, creating a diverse shadow input set that resembles influential test inputs. With surrogate fidelity and input diversity, SGAC crafts transferable adversarial inputs. Evaluation on diverse model architectures confirms the effectiveness of SGAC.

Generic Adversarial Attack Framework Against Graph-based Vertical Federated Learning

Task decomposition has shown promise in complex cooperative multi-agent reinforcement learning (MARL) tasks, where it enables efficient hierarchical learning for long-horizon tasks in dynamic and uncertain environments. However, learning dynamic task decomposition from scratch generally requires a large number of training samples, especially when exploring the large joint action space under partial observability. In this paper, we present the Conditional Diffusion Model for Dynamic Task Decomposition (C$\text{D}^\text{3}$T), a novel two-level hierarchical MARL framework designed to automatically infer subtask and coordination patterns. The high-level policy learns subtask representation to generate a subtask selection strategy based on subtask effects. To capture the effects of subtasks on the environment, C$\text{D}^\text{3}$T predicts the next observation and reward using a conditional diffusion model. At the low level, agents collaboratively learn and share specialized skills within their assigned subtasks. Moreover, the learned subtask representation is also used as additional semantic information in a multi-head attention mixing network to enhance value decomposition and provide an efficient reasoning bridge between individual and joint value functions. Experimental results on various benchmarks demonstrate that C$\text{D}^\text{3}$T achieves better performance than existing baselines.

Conditional Diffusion Model for Multi-Agent Dynamic Task Decomposition

The increasing prominence of short video platforms has positioned them as a primary channel for public awareness of current events, while also facilitating the widespread dissemination of fake news, thus highlighting the critical need for automated detection technologies. In contrast to fake news confined to text and images, short video news encompasses multiple modalities and extensive information, presenting heightened challenges. Most existing research emphasizes the analysis of news content or user comments alone, while overlooking the crucial role of publishers, leading to poor model performance when handling fake news lacking obvious false signals. Therefore, we propose a Publisher Profiling Module to identify new false signals. To enable a more comprehensive detection of misinformation, we design a Multi-View Aggregation (MVA) model, simultaneously evaluating news from three distinct perspectives: sentiment analysis, content understanding, and publisher profiling. Late fusion is applied at the decision level to leverage the complementary strengths of these perspectives, addressing the limitations of single-view methods. Our experiments conducted on the FakeSV and FVC datasets demonstrate the superior performance of the proposed method.

Detecting Fake News in Short Videos Through Multi-View Aggregation
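Decision-level late fusion, as described in this abstract, combines the outputs of the three views (sentiment analysis, content understanding, publisher profiling) only after each view has produced its own prediction. The abstract does not state the fusion rule, so the weighted average and the 0.5 decision threshold below are assumptions, and `late_fusion` is a hypothetical name.

```python
def late_fusion(view_probs, weights=None, threshold=0.5):
    """Fuse per-view fake-news probabilities at the decision level.

    view_probs: one probability per view, each in [0, 1].
    weights:    optional per-view weights (assumed uniform when omitted).
    Returns the fused probability and the resulting binary decision.
    """
    if weights is None:
        weights = [1.0 / len(view_probs)] * len(view_probs)
    fused = sum(w * p for w, p in zip(weights, view_probs))
    return fused, fused >= threshold
```

Because each view is scored independently, a confident signal from one view (for example, the publisher profile) can shift the final decision even when the content itself shows no obvious false signals.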

Pre-trained models have demonstrated exceptional generalization capabilities in time-series forecasting; however, adapting them to evolving data distributions remains a significant challenge. A key hurdle lies in accessing the original training data, as fine-tuning solely on new data often leads to catastrophic forgetting. To address this issue, we propose Replay Tuning (R-Tuning), a novel framework designed for the continual adaptation of pre-trained time-series models.
R-Tuning constructs a unified latent space that captures both prior and current task knowledge through a frequency-aware replay strategy. Specifically, it augments model-generated samples via wavelet-based decomposition across multiple frequency bands, generating trend-preserving and fusion-enhanced variants to improve representation diversity and replay efficiency. To further reduce reliance on synthetic samples, R-Tuning introduces a latent consistency constraint that aligns new representations with the prior task space. This constraint guides joint optimization within a compact and semantically coherent latent space, ensuring robust knowledge retention and adaptation.
Extensive experimental results demonstrate the superiority of R-Tuning, which reduces MAE and MSE by up to 46.9% and 46.8%, respectively, on new tasks, while preserving prior knowledge with gains of up to 5.7% and 6.0% on old tasks. Notably, under few-shot settings, R-Tuning outperforms all state-of-the-art baselines even when synthetic proxy samples account for only 5% of the new task dataset. Implementation details and code are provided in the supplementary material.

R-Tuning: Wavelet-Decomposed Replay and Semantic Alignment for Continual Adaptation of Pretrained Time-Series Models

Workflow automation promises substantial productivity gains in everyday document-related tasks. While prior agentic systems can execute isolated instructions, they struggle with automating multi-step, session-level workflows due to limited control over the operational process. To this end, we introduce AutoDW, a novel execution framework that enables stepwise, rollback-enabled operation orchestration. AutoDW incrementally plans API actions conditioned on user instructions, intent-filtered API candidates, and the evolving states of the document. It further employs robust rollback mechanisms at both the argument and API levels, enabling dynamic correction and fault tolerance. These designs together ensure that the execution trajectory of AutoDW remains aligned with user intent and document context across long-horizon workflows. To assess its effectiveness, we construct a comprehensive benchmark of 250 sessions and 1,708 human-annotated instructions, reflecting realistic document processing scenarios with interdependent instructions. AutoDW achieves 90% and 62% completion rates on instruction- and session-level tasks, respectively, outperforming strong baselines by 40% and 76%. Moreover, AutoDW remains robust to the choice of backbone LLM and across tasks of varying difficulty. Code and data will be open-sourced.

Automating Complex Document Workflows via Stepwise and Rollback-Enabled Operation Orchestration
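The idea of stepwise execution with rollback can be sketched in a few lines. This is a simplification, not AutoDW's mechanism: the abstract describes rollback at both the argument and API levels, whereas the sketch below uses a single snapshot-and-restore step, and `run_with_rollback`, the dictionary document state, and the `is_valid` check are all hypothetical.

```python
import copy


def run_with_rollback(state, steps, is_valid):
    """Apply steps one at a time, undoing any step that raises or fails validation.

    state:    mutable dict standing in for the document state (assumption).
    steps:    list of (name, action) pairs; each action mutates state in place.
    is_valid: predicate on state, called after each step.
    Returns the names of the steps that were kept.
    """
    applied = []
    for name, action in steps:
        snapshot = copy.deepcopy(state)  # remember the state before this step
        try:
            action(state)
            ok = is_valid(state)
        except Exception:
            ok = False
        if ok:
            applied.append(name)
        else:
            # Roll back: restore the pre-step snapshot in place.
            state.clear()
            state.update(snapshot)
    return applied
```

A planner built on this pattern could re-plan the failed step with different arguments or a different API before moving on, which is closer to the two rollback levels the abstract mentions.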

Unsupervised Change Detection (UCD) in Very High Resolution (VHR) Remote Sensing (RS) images remains a difficult challenge due to the inherent spatio-temporal complexity of the data. Inspired by recent advancements in Visual Foundation Models (VFMs) and Contrastive Learning (CL) methodologies, this research aims to develop CL methodologies to translate implicit knowledge in VFMs into change representations, thus eliminating the need for explicit supervision. To this end, we introduce a Semantic-to-Change (S2C) learning framework for UCD in VHR RS images. Unlike existing CL methodologies that typically focus on learning multi-temporal similarities, we introduce a novel triplet learning strategy that explicitly models temporal differences, which are crucial to the CD task. Furthermore, random spatial and spectral perturbations are introduced during training to enhance robustness to temporal noise. In addition, a grid sparsity regularization is defined to suppress insignificant changes, and an IoU-matching algorithm is developed to refine the CD results. Experiments on three benchmark CD datasets demonstrate that the proposed S2C learning framework achieves significant improvements in accuracy, surpassing the current state of the art by over 31%, 9% and 23%, respectively. It also demonstrates robustness and sample efficiency, making it suitable for training and adaptation of various VFMs or backbone neural networks.

S2C: A Noise-Resistant Difference Learning Framework for Unsupervised Change Detection in VHR Remote Sensing Images

The spiking neuron model (SNM) mimics the processing paradigm of synaptic and membrane potentials in the cerebral cortex. However, existing SNMs are limited by two issues. First, they lack spike diversity. Although a spiking neuron perceives temporally varying input currents, SNMs only use identical synaptic weights for regulation. Second, they are insensitive to weak spikes. The potential accumulation in SNMs is solely driven by external inputs, ignoring the internal dynamics of potential. Oligodendrocytes, a recent revelation in neuroscience, enhance neural signaling by forming bidirectional communication. This offers the potential to alleviate the aforementioned issues. In this paper, we first propose the mechanism of the oligodendrocyte-spiking neuron (Oli-N) model. Subsequently, using the Oli-N model, we develop our Oli-inspired spiking neural network (Oli-SNN), which broadens the diversity of spike representations and enhances neurons' firing precision through improved sparse coding to enhance weak spikes. Experiments show that our Oli-SNN achieves state-of-the-art performance in the classification task on both static and neuromorphic datasets.

Oligodendrocyte-Driven Spiking Neural Model

Predicting drug–target interactions (DTIs) is a fundamental task in computational drug discovery, yet it remains challenging under distribution shifts and limited training data. Existing approaches often suffer from poor generalization, weak cross-modal alignment between molecular and protein representations, and vulnerability to noisy supervision. We propose ESP-DTI, a unified framework designed to enhance generalization by integrating large-scale protein language models with curriculum learning and cross-modal contrastive alignment. Specifically, we leverage ESM-2 to encode context-aware protein representations and adopt a CLIP-style contrastive objective to align drug and protein embeddings in a shared latent space. To further improve learning robustness, we introduce a progressive curriculum sampling strategy that dynamically schedules training instances based on model confidence, enabling a gradual shift from easy to hard examples. Experimental results on four benchmark datasets demonstrate that ESP-DTI consistently outperforms state-of-the-art baselines, achieving a +3.1% improvement in average accuracy. Ablation studies confirm the complementary benefits of each component, validating their collective contribution to robust and generalizable DTI prediction. Our work underscores the effectiveness of combining pretrained protein language models with structured training curricula and cross-modal contrastive learning for reliable DTI prediction under real-world, distribution-shifted conditions. The source code is available at https://anonymous.4open.science/r/ESP-DTI-C926
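A CLIP-style contrastive objective, as referenced in this abstract, treats the i-th drug and the i-th protein in a batch as a positive pair and all other pairings as negatives. The sketch below is the generic symmetric cross-entropy form of that objective in plain Python, not ESP-DTI's code: the temperature value, the batch layout, and the name `clip_style_loss` are assumptions.

```python
import math


def _normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def _diag_cross_entropy(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        mx = max(row)  # subtract the max for a numerically stable log-sum-exp
        lse = mx + math.log(sum(math.exp(x - mx) for x in row))
        total += lse - row[i]
    return total / len(logits)


def clip_style_loss(drug_emb, prot_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired drug/protein embeddings."""
    d = [_normalize(v) for v in drug_emb]
    p = [_normalize(v) for v in prot_emb]
    # Pairwise cosine similarities, sharpened by the (assumed) temperature.
    logits = [
        [sum(a * b for a, b in zip(di, pj)) / temperature for pj in p]
        for di in d
    ]
    transposed = [list(col) for col in zip(*logits)]
    # Average the drug-to-protein and protein-to-drug directions.
    return 0.5 * (_diag_cross_entropy(logits) + _diag_cross_entropy(transposed))
```

The loss is small when matched pairs are the most similar entries in their row and column, and large when a drug embedding sits closer to a non-matching protein.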
