Singapore

Long-Form Video Question Answering (LVQA) poses challenges beyond traditional visual question answering (VQA), which is often limited to static images or short video clips. While current vision-language models (VLMs) perform well in those settings, they struggle with answering complex queries in LVQA over long videos involving multi-step temporal reasoning and causality. Vanilla approaches, which simply sample frames uniformly and feed them to a VLM along with the question, incur significant token overhead, forcing severe downsampling of long videos. As a result, the model often misses fine-grained visual structure, subtle event transitions, or key temporal cues, ultimately leading to incorrect answers. To address these limitations, recent works have explored query-adaptive frame sampling, hierarchical keyframe selection, and agent-based iterative querying. However, these methods remain fundamentally heuristic: they lack explicit temporal representations and cannot enforce or verify logical event relationships (e.g., "before X," "after Y"). As a result, there are no formal guarantees that the sampled context actually encodes the compositional or causal logic demanded by the question. To address these foundational gaps, we introduce NeuS-QA, a training-free, plug-and-play neuro-symbolic pipeline for LVQA. NeuS-QA translates a natural language question into a formal temporal logic expression, constructs a video automaton from frame-level semantic propositions, and applies model checking to rigorously identify video segments that satisfy the question's logical requirements. Only these logic-verified segments are submitted to the VLM, thus improving interpretability, reducing hallucinations, and enabling compositional reasoning without modifying or fine-tuning the model. Experiments on the LongVideoBench and CinePile long-form VQA benchmarks show that NeuS-QA significantly improves performance by over 10%, particularly on questions involving event ordering, causality, and multi-step compositional reasoning.
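
To make the idea of logic-verified segment selection concrete, the following is a minimal sketch and not the NeuS-QA implementation. It assumes frame-level propositions have already been detected and are given as one set of labels per frame, and it checks only a single ordering pattern ("`first` holds, and `then` holds at a strictly later frame"). The function name `find_ordered_segment` and the proposition labels are hypothetical; a real pipeline would build an automaton and run a model checker over a full temporal logic formula.

```python
from typing import List, Optional, Set, Tuple


def find_ordered_segment(
    frames: List[Set[str]], first: str, then: str
) -> Optional[Tuple[int, int]]:
    """Return (start, end) frame indices such that `first` holds at `start`
    and `then` holds at a strictly later frame `end`, or None if the
    ordering is never satisfied. A simplified stand-in for model checking."""
    start = None
    for i, props in enumerate(frames):
        if start is None and first in props:
            # Earliest frame where the first event is observed.
            start = i
        elif start is not None and then in props:
            # First later frame where the second event is observed.
            return (start, i)
    return None


# Hypothetical per-frame propositions (assumed output of a detector).
frames = [set(), {"door_opens"}, set(), {"person_enters"}, set()]
segment = find_ordered_segment(frames, "door_opens", "person_enters")
reversed_segment = find_ordered_segment(frames, "person_enters", "door_opens")
```

Here `segment` covers frames 1 through 3, which is the only span that would be passed on to the VLM, while `reversed_segment` is `None` because the reversed ordering never occurs in the frames.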

AAAI 2026

NeuS-QA: Grounding Long-Form Video Understanding in Temporal Logic and Neuro-Symbolic Reasoning

cv: video understanding & activity analysis; cv: visual reasoning & symbolic representations; ml: neuro-symbolic learning


poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.


Recent advances in large language models (LLMs) have shown great potential to accelerate drug discovery. However, the specialized nature of biochemical data often necessitates costly domain-specific fine-tuning, posing critical challenges. First, it hinders the application of more flexible general-purpose LLMs in cutting-edge drug discovery tasks. More importantly, it limits the rapid integration of the vast amounts of scientific data continuously generated through experiments and research. Compounding these challenges is the fact that real-world scientific questions are typically complex and open-ended, requiring reasoning beyond pattern matching or static knowledge retrieval. To address these challenges, we propose CLADD, a retrieval-augmented generation (RAG)-empowered agentic system tailored to drug discovery tasks. Through the collaboration of multiple LLM agents, CLADD dynamically retrieves information from biomedical knowledge bases, contextualizes query molecules, and integrates relevant evidence to generate responses - all without the need for domain-specific fine-tuning. Crucially, we tackle key obstacles in applying RAG workflows to biochemical data, including data heterogeneity, ambiguity, and multi-source integration. We demonstrate the flexibility and effectiveness of this framework across a variety of drug discovery tasks, showing that it outperforms general-purpose and domain-specific LLMs as well as traditional deep learning approaches.

RAG-Enhanced Collaborative LLM Agents for Drug Discovery

Object hallucination remains a critical challenge in Large Vision-Language Models (LVLMs), where models generate content inconsistent with visual inputs. Existing language-decoder-based mitigation approaches often regulate visual or textual attention independently, overlooking their interaction as two key causal factors. To address this, we propose Owl (Bi-mOdal attention reWeighting for Layer-wise hallucination mitigation), a causally-grounded framework that models the hallucination process via a structural causal graph, treating decomposed visual and textual attentions as mediators. We introduce VTACR (Visual-to-Textual Attention Contribution Ratio), a novel metric that quantifies the modality contribution imbalance during decoding. Our analysis reveals that hallucinations frequently occur in low-VTACR scenarios, where textual priors dominate and visual grounding is weakened. To mitigate this, we design a fine-grained attention intervention mechanism that dynamically adjusts token- and layer-wise attention guided by VTACR signals. Finally, we propose a dual-path contrastive decoding strategy: one path emphasizes visually grounded predictions, while the other amplifies hallucinated ones, letting visual truth shine and hallucination collapse. Experimental results on the POPE and CHAIR benchmarks show that Owl achieves significant hallucination reduction, setting a new SOTA in faithfulness while preserving vision-language understanding capability. Our code is available at https://github.com/CikZ2023/OWL

Causally-Grounded Dual-Path Attention Intervention for Object Hallucination Mitigation in LVLMs

Federated graph learning (FGL) is a distributed framework for graph representation learning that prioritizes privacy preservation. The right to be forgotten embodies the ethical principle of prioritizing user autonomy over data usage. In the context of FGL, upholding this right requires removing specific entities and their associated knowledge within local subgraphs (Meta Unlearning) and completely erasing an entire client (Client Unlearning). We are the first to systematically define these two unlearning requests in federated graph unlearning. Several studies have attempted to address this challenge, but key limitations persist: incomplete unlearning support and residual knowledge permeation. To this end, we propose a Prototype-guided Adversarial Graph Eraser for universal federated graph unlearning (PAGE), the first unified federated graph unlearning framework that extends to comprehensive unlearning requests. For meta unlearning, prototype gradients guide the initial local unlearning, while adversarial graphs eliminate residual knowledge across the influenced clients. For client unlearning, PAGE exclusively uses adversarial graph generation to purge a departed client's influence from the remaining participants. PAGE outperforms existing methods on 8 benchmark datasets. It improves prediction accuracy by 5.08% (client unlearning) and 1.50% (meta unlearning), with up to 11.84% gain on large-scale graphs. Furthermore, ablation studies confirm its efficacy as a plug-in for other meta unlearning methods, boosting prediction performance by up to 4.49% and unlearning performance by up to 7.22%.

PAGE: A Unified Approach for Federated Graph Unlearning

Greedy search methods like Greedy Best-First Search (GBFS) and Enforced Hill-Climbing (EHC) often struggle when faced with Uninformed Heuristic Regions (UHRs) like heuristic local minima or plateaus. In this work, we theoretically and empirically compare two popular methods for escaping UHRs: breadth-first search (BrFS) and restarting random walks (RRWs). We first derive the expected runtime of escaping a UHR using BrFS and RRWs, based on properties of the UHR and the random walk procedure, and then use these results to identify when RRWs will be faster in expectation than BrFS. We then evaluate these methods for escaping UHRs by comparing standard EHC, which uses BrFS to escape UHRs, to variants of EHC called EHC-RRW, which use RRWs for that purpose. EHC-RRW is shown to have strong expected runtime guarantees in cases where EHC has previously been shown to be effective. We also run experiments with these approaches on PDDL planning benchmarks to better understand their relative effectiveness for escaping UHRs.
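
The two escape strategies compared above can be sketched as follows. This is an illustrative toy under stated assumptions, not the paper's EHC or EHC-RRW code: the state space is a small explicit successor map, the heuristic is a plain function, and the names `escape_brfs`, `escape_rrw`, `walk_length`, and `max_restarts` are hypothetical. Both routines return the first state found whose heuristic value is strictly lower than that of the start state.

```python
import random
from collections import deque
from typing import Callable, Dict, List, Optional

Graph = Dict[int, List[int]]


def escape_brfs(graph: Graph, h: Callable[[int], int], start: int) -> Optional[int]:
    """Breadth-first search from `start` until a state with lower h is found."""
    target = h(start)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt in seen:
                continue
            if h(nxt) < target:
                return nxt
            seen.add(nxt)
            queue.append(nxt)
    return None


def escape_rrw(
    graph: Graph,
    h: Callable[[int], int],
    start: int,
    walk_length: int,
    max_restarts: int,
    rng: random.Random,
) -> Optional[int]:
    """Run bounded-length random walks, restarting from `start` each time."""
    target = h(start)
    for _restart in range(max_restarts):
        node = start
        for _step in range(walk_length):
            if not graph[node]:
                break  # dead end: give up on this walk and restart
            node = rng.choice(graph[node])
            if h(node) < target:
                return node
    return None


# Toy plateau: states 0..2 share h = 5, and state 3 is the only exit (h = 4).
graph: Graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: []}
heuristic = lambda s: 4 if s == 3 else 5

brfs_exit = escape_brfs(graph, heuristic, 0)
rrw_exit = escape_rrw(graph, heuristic, 0, walk_length=10, max_restarts=1000,
                      rng=random.Random(0))
```

BrFS stores every visited state in the plateau, whereas the random-walk variant keeps only the current state, which is one reason the relative cost of the two depends on the shape of the region.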

Breadth-First Search vs. Restarting Random Walks for Escaping Uninformed Heuristic Regions

We introduce a biologically inspired, multi-layer neural architecture built from Rectified Spectral Units (ReSUs). Each ReSU projects a recent window of its input history onto a canonical direction learned by the canonical correlation analysis (CCA) of previously observed past-future input pairs and then rectifies either the positive or negative component. Because synaptic weights are obtained via past-future CCA on the pre-synaptic activity, ReSU networks offer a potentially local, self-supervised algorithm for the progressive construction of increasingly complex features. To assess both computational power and biological fidelity, we trained a two-layer ReSU network in a self-supervised regime on translating natural scenes. First-layer units, each driven by a single pixel, developed temporal filters matching those of Drosophila post-photoreceptor neurons (L1/L2 and L3), including their empirically measured adaptation to signal-to-noise ratio. Second-layer units, pooling spatially over the first layer, became direction-selective, reminiscent of T4 motion-detecting cells, with learned synaptic weights approximating known patterns in the Drosophila connectome. These results demonstrate that ReSU networks may provide (i) a principled framework for modeling sensory circuits and (ii) a back-prop-free self-supervised paradigm for constructing deep artificial neural networks.

A Network of Biologically Inspired Rectified Spectral Units (ReSUs) Learns Hierarchical Features Without Error Backpropagation

In recent years, electroencephalography (EEG)-based visual decoding research has become a key direction for revealing brain processing mechanisms and realizing brain-computer interfaces. This emerging field has attracted extensive attention in the fields of brain science, cognitive neuroscience, and artificial intelligence. Among various approaches, contrastive learning has demonstrated strong performance in aligning multi-modal data, effectively enabling unified representations across modalities. However, during human visual perception, images are often subject to varying degrees of blurring due to the uneven distribution of retinal photoreceptor cells and the limited speed of lens accommodation. To address the mismatch between EEG and visual representations, we propose a novel visual decoding framework inspired by human perceptual blurring. Specifically, multi-level Gaussian blurring is applied to the visual stimuli to simulate human visual characteristics, followed by a feature selection module to construct robust visual representations. For EEG decoding, we design a lightweight and efficient network employing positively constrained spatial convolutions to identify channels associated with visual processing. The EEG and visual features are then aligned using contrastive learning. We evaluate the proposed framework on the Things-EEG dataset. Experimental results show significant improvements in the zero-shot brain-to-image retrieval task, achieving a top-1 accuracy of 80% and a top-5 accuracy of 96.9%, surpassing previous state-of-the-art methods by margins of 29.1% and 17.2%, respectively. These findings highlight the potential of incorporating perceptual properties into EEG-based visual decoding.

Leveraging Visual Blur Perception Characteristics for EEG Decoding

Existing sparse attention methods primarily target inference-time acceleration by selecting critical tokens under predefined sparsity patterns. However, they often fail to bridge the training–inference gap and lack the capacity for fine-grained token selection across multiple dimensions (such as queries, key-values (KV), and heads), leading to suboptimal performance and acceleration gains. In this paper, we introduce OmniSparse, a training-aware fine-grained sparse attention framework for long-video MLLMs, which is applied in both training and inference with dynamic token budget allocation. Specifically, OmniSparse contains three adaptive and complementary mechanisms: (1) query selection as lazy-active classification, which retains active queries that capture broader semantic similarity while discarding most of the lazy ones, which focus on limited local context and exhibit high functional redundancy with their neighbors; (2) KV selection with head-level dynamic budget allocation, where a shared budget is determined based on the flattest head and applied uniformly across all heads to ensure attention recall after selection; and (3) KV cache slimming to alleviate head-level redundancy, which selectively fetches the visual KV cache according to the head-level decoding query pattern. Experimental results demonstrate that OmniSparse achieves performance comparable to full attention, with a 2.7$\times$ speedup during prefill and a 2.4$\times$ memory reduction for decoding.

OmniSparse: Training-Aware Fine-Grained Sparse Attention for Long-Video MLLMs

Many real-world applications pose challenges in incorporating fairness constraints into the $k$-center clustering problem, where the dataset consists of $m$ demographic groups, each with a specified upper bound on the number of centers to ensure fairness. Focusing on big data scenarios, this paper addresses the problem in a streaming setting, where data points arrive one by one sequentially in a continuous stream. Leveraging a structure called the $\lambda$-independent center set, we propose a one-pass streaming algorithm that first computes a reserved set of points during the streaming process. Then, for the post-streaming process, we propose an approach for selecting centers from the reserved point set by analyzing all three possible cases, transforming the most complicated one into a specially constrained vertex cover problem in an auxiliary graph. Our algorithm achieves a tight approximation ratio of 5 while consuming $O(k\log n)$ memory. It can also be readily adapted to solve the offline fair $k$-center problem, achieving a 3-approximation ratio that matches the current state of the art. Furthermore, we extend our approach to a semi-structured data stream, where data points from each group arrive in batches. In this setting, we present a 3-approximation algorithm for $m = 2$ and a 4-approximation algorithm for general $m$. Lastly, we conduct extensive experiments to evaluate the performance of our approaches, demonstrating that they outperform existing baselines in both clustering cost and runtime efficiency.

Improved Streaming Algorithm for Fair k-Center Clustering

Multi-agent systems of large language models (LLMs) show promise for complex reasoning, but their effectiveness is often limited by fixed collaboration protocols. These frameworks typically focus on macro-level orchestration while overlooking agents’ internal deliberative capabilities. This critical meta-cognitive blindspot treats agents as passive executors unable to adapt their strategy based on internal cognitive states like uncertainty or confidence. We introduce the Meta-Policy Deliberation Framework (MPDF), where agents learn a decentralized policy over a set of high-level meta-cognitive actions: Persist, Refine, and Concede. To overcome the instability of traditional policy gradients in this setting, we develop SoftRankPO, a novel reinforcement learning algorithm. SoftRankPO stabilizes training by shaping advantages based on the rank of rewards mapped through smooth normal quantiles, making the learning process robust to reward variance. Experiments show that MPDF with SoftRankPO achieves a 4-5% absolute gain in average accuracy across five mathematical and general reasoning benchmarks compared to six state-of-the-art heuristic and learning-based multi-agent reasoning algorithms. Our work presents a paradigm for learning adaptive, meta-cognitive policies for multi-agent LLM systems, shifting the focus from designing fixed protocols to learning dynamic, deliberative strategies.
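
The phrase "rank of rewards mapped through smooth normal quantiles" can be illustrated with a small sketch. This is one plausible reading under stated assumptions, not the SoftRankPO algorithm itself: rewards are replaced by their ordinal ranks, each rank is converted to a mid-point quantile `(rank + 0.5) / n`, and that quantile is passed through the inverse CDF of a standard normal. The function name `rank_quantile_advantages` is hypothetical, and ties are broken by position rather than averaged.

```python
from statistics import NormalDist
from typing import List


def rank_quantile_advantages(rewards: List[float]) -> List[float]:
    """Map each reward to the standard-normal quantile of its rank.

    Only the ordering of the rewards matters, so the output is unchanged
    by any rescaling that preserves that ordering.
    """
    n = len(rewards)
    # Indices sorted by reward; sorted() is stable, so ties keep input order.
    order = sorted(range(n), key=lambda i: rewards[i])
    std_normal = NormalDist()  # mean 0, standard deviation 1
    advantages = [0.0] * n
    for rank, idx in enumerate(order):
        # Mid-point quantile lies strictly inside (0, 1), as inv_cdf requires.
        advantages[idx] = std_normal.inv_cdf((rank + 0.5) / n)
    return advantages


# Two reward vectors with the same ordering but very different scales.
small_scale = rank_quantile_advantages([1.0, 3.0, 2.0])
large_scale = rank_quantile_advantages([1.0, 100.0, 2.0])
```

Both calls produce the same shaped advantages, which is the sense in which a rank-based transform is insensitive to reward variance and outliers.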

Learning to Deliberate: Meta-policy Collaboration for Agentic LLMs with Multi-agent Reinforcement Learning

Existing methods for jailbreaking a Large Language Model (LLM) have largely focused on disguising a harmful request as benign, either through a single interaction with the LLM (as in single-turn methods) or through multiple interactions (as in multi-turn methods). In this paper, we propose Contextual History for Adaptive and Simple Exploitation (CHASE), a novel method for LLM jailbreaking that extends the success of existing multi-turn methods by showing that the conversational history of an LLM can additionally be exploited profitably to increase the chances of successful jailbreaking. To our knowledge, CHASE represents the first attempt to address LLM jailbreaking by considering both the linguistic aspect (i.e., how to linguistically disguise a harmful request as benign) and the extra-linguistic aspect (i.e., exploiting the conversational history of an LLM) of the problem.
