Multimodal video recommendation systems face fundamental challenges in determining optimal fusion strategies across diverse content types and user preferences. Existing methods suffer from two critical limitations: (1) their fusion strategies are guided by context-agnostic priors that ignore the semantic structure of content, assuming the same simple distribution (typically $\mathcal{N}(0, I)$) governs optimal fusion for all video types, and (2) their optimization objectives, particularly the Evidence Lower Bound (ELBO), are misaligned with the final recommendation goal, optimizing for feature reconstruction rather than ranking performance. To address these issues, this work proposes VBF++, a novel framework that introduces context-aware structured priors and recommendation-guided adversarial refinement. First, the method designs context-aware priors that learn cluster-specific distributions based on video semantic categories, replacing uninformative priors with structured, content-aware prior distributions. Second, it introduces a Recommendation-Guided Adversarial Refinement (RAR) paradigm that explicitly steers the learning process towards generating recommendation-optimal fusion strategies, resolving the objective misalignment inherent in variational learning. Further enhanced with domain-adaptive meta-learning, the framework achieves consistent improvements of 4.7-8.3\% in Precision@10 over state-of-the-art methods in extensive experiments on three real-world datasets. Analysis reveals that the learned fusion strategies exhibit semantically meaningful patterns, prioritizing visual features for action content, acoustic information for music videos, and textual descriptions for documentary material.
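To make the prior design concrete, the sketch below shows one plausible way a cluster-conditioned Gaussian prior could replace the standard $\mathcal{N}(0, I)$ prior in the KL term of a variational fusion objective. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `ContextAwarePrior`, the PyTorch framing, and all dimensions are hypothetical.

```python
# Minimal sketch (assumed, not the authors' code): a cluster-conditioned
# Gaussian prior replacing N(0, I) in a variational fusion module.
import torch
import torch.nn as nn


class ContextAwarePrior(nn.Module):
    """One learnable diagonal-Gaussian prior per semantic video cluster."""

    def __init__(self, num_clusters: int, latent_dim: int):
        super().__init__()
        # Learnable mean and log-variance for each cluster's prior.
        self.mu = nn.Parameter(torch.zeros(num_clusters, latent_dim))
        self.logvar = nn.Parameter(torch.zeros(num_clusters, latent_dim))

    def kl(self, q_mu, q_logvar, cluster_ids):
        """KL( q(z|x) || p(z|cluster) ) between diagonal Gaussians, per example."""
        p_mu = self.mu[cluster_ids]          # (batch, latent_dim)
        p_logvar = self.logvar[cluster_ids]  # (batch, latent_dim)
        var_ratio = (q_logvar - p_logvar).exp()          # sigma_q^2 / sigma_p^2
        mean_term = (q_mu - p_mu).pow(2) / p_logvar.exp()
        return 0.5 * (var_ratio + mean_term - 1.0 + (p_logvar - q_logvar)).sum(dim=-1)


# Usage: cluster ids from a semantic clustering of video content select the
# prior, so the KL term pulls the fusion latent toward that cluster's
# distribution rather than toward N(0, I).
prior = ContextAwarePrior(num_clusters=8, latent_dim=64)
q_mu, q_logvar = torch.randn(4, 64), torch.zeros(4, 64)
cluster_ids = torch.tensor([0, 3, 3, 7])
kl_per_example = prior.kl(q_mu, q_logvar, cluster_ids)
```

In this reading, the recommendation-guided adversarial refinement would add a separate ranking-aware training signal on top of this variational objective; the sketch covers only the structured-prior component described in the abstract.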