AAAI 2026

January 25, 2026

Singapore, Singapore


Massive multimodal datasets are fundamental to the success of large video-language models. However, existing datasets typically focus on textual descriptions of visual content, treating audio, and music in particular, as weakly related side information. This overlooks the inherent semantic correlation between visual narratives and musical scores, limiting the development of models for fine-grained cross-modal understanding and generation. To address this gap, we introduce VMChill, a large-scale, fine-grained multimodal video dataset. We use trailers as our data source because they are professionally edited to create strong synergy between visual pacing, scene transitions, and background music for narrative and emotional impact. Our dataset comprises over 20 million video clips derived from more than 27.1k hours of high-resolution trailer videos. To annotate this data, we propose a systematic multimodal captioning framework. The framework first employs specialized unimodal models to extract descriptive features from multiple perspectives, including visual content, motion dynamics, and musical attributes (e.g., genre, instruments, mood). A large language model (LLM) then adaptively fuses these diverse descriptions into a single, coherent, and rich multimodal caption. This process yields VMChill-2M, a high-quality subset of 2 million clips with detailed multimodal annotations, and VMChill-Test, a manually refined test set for evaluation. We conduct extensive experiments on downstream tasks, including video understanding and generation, to establish benchmarks and demonstrate the dataset's quality. The results validate that VMChill effectively enhances model performance, highlighting its potential to facilitate future research in fine-grained multimodal learning. We will release the dataset, annotation codebase, and processing pipelines to support community research.
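The captioning framework described above — per-modality descriptions fused by an LLM into one caption — can be sketched roughly as follows. This is a minimal illustration, not the authors' released pipeline: the data class, the `fuse_captions` helper, and the concatenation fallback are all hypothetical, and a real implementation would call the unimodal models and an actual LLM.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class UnimodalCaptions:
    """Hypothetical container for per-modality descriptions of one clip."""
    visual: str   # e.g., from an image/video captioner
    motion: str   # e.g., from a motion/action recognition model
    music: str    # e.g., genre, instruments, mood from a music tagger

def build_fusion_prompt(caps: UnimodalCaptions) -> str:
    """Assemble a prompt asking an LLM to merge the unimodal descriptions."""
    return (
        "Fuse the following descriptions of one video clip into a single, "
        "coherent multimodal caption.\n"
        f"Visual: {caps.visual}\n"
        f"Motion: {caps.motion}\n"
        f"Music: {caps.music}\n"
    )

def fuse_captions(
    caps: UnimodalCaptions,
    llm: Optional[Callable[[str], str]] = None,
) -> str:
    """Return a fused caption; fall back to plain concatenation without an LLM."""
    if llm is None:
        return f"{caps.visual} {caps.motion} The soundtrack is {caps.music}."
    return llm(build_fusion_prompt(caps))
```

In the paper's actual framework the fusion is adaptive (the LLM decides how to weigh and merge the modalities); the concatenation branch here only stands in so the sketch runs without model weights.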

