Singapore

Understanding human actions in videos requires robust
integration of multimodal cues beyond raw pixels. This work
introduces a deep self-supervised action recognition
framework that jointly predicts action concepts and
auxiliary features from RGB video, then hallucinates
missing modalities at test time to improve recognition
without added runtime cost. Two new domain-specific
descriptors, Object Detection Features (ODF) and Saliency
Detection Features (SDF), are proposed to capture spatial
context and motion saliency, and are integrated with other
modalities such as optical flow, skeleton, audio, and
improved dense trajectories. The framework incorporates
aleatoric uncertainty modeling to handle noisy or
unreliable features, along with a robust loss for stable
multimodal fusion. Compatible with popular architectures
including I3D, AssembleNet, Video Transformer Network,
VideoMAE V2, and InternVideo2, the approach achieves
state-of-the-art results on Kinetics-400, Kinetics-600, and
Something-Something V2.
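
As a rough illustration of the aleatoric uncertainty idea mentioned above, the sketch below uses one common heteroscedastic regression loss, in which a predicted log-variance down-weights the error on features the model considers unreliable. The abstract does not state the paper's actual loss, so the function name `aleatoric_l2_loss`, the log-variance parameterization, and the toy values are assumptions made only for illustration.

```python
import math


def aleatoric_l2_loss(pred, target, log_var):
    """Hypothetical heteroscedastic L2 loss, averaged over feature dimensions.

    Each squared error is scaled by exp(-log_var), so dimensions flagged as
    noisy contribute less; the 0.5 * log_var term penalizes claiming high
    variance everywhere.
    """
    assert len(pred) == len(target) == len(log_var)
    total = 0.0
    for p, t, s in zip(pred, target, log_var):
        total += 0.5 * math.exp(-s) * (p - t) ** 2 + 0.5 * s
    return total / len(pred)


# Toy example: the same prediction error, scored with low and high variance.
confident = aleatoric_l2_loss([2.0], [0.0], [0.0])
uncertain = aleatoric_l2_loss([2.0], [0.0], [math.log(4.0)])
print(confident, uncertain)
```

In this toy case the large error is penalized less when the predicted variance is higher, which is the behaviour one would want for a hallucinated feature that is known to be unreliable.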

AAAI 2026

Feature Hallucination for Self-supervised Action Recognition

domain-specific descriptor

feature hallucination

skeleton

fine-grained recognition

uncertainty modeling

audio

saliency detection

optical flow

action recognition

object detection

self-supervision

multimodal

technical paper

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.



Jigsaw puzzle solving remains difficult because models must
reconcile local fragment cues with global structure. Most
prior work leans solely on visual signals (edge or texture
coherence) and rarely exploits natural-language
descriptions, which are especially helpful for puzzles with
eroded gaps. We introduce a vision–language framework that
uses textual context to guide assembly. At its core, the
Vision–Language Hierarchical Semantic Alignment (VLHSA)
module aligns image patches with text via multi-level
matching—from local tokens to global summaries—within a
multimodal design that couples dual visual encoders with
language features for cross-modal reasoning. Across
multiple datasets, the method surpasses the state of the
art, including a 14.2 percentage point gain in piece
accuracy; ablations identify VLHSA as the principal source
of improvement. These results suggest a practical shift for
jigsaw solving: augmenting vision with language to resolve
ambiguous placements.

VLHSA: Vision-Language Hierarchical Semantic Alignment for Jigsaw Puzzle Solving with Eroded Gaps (Student Abstract)

Hand-crafted reward engineering requires domain knowledge
and extensive trial and error, while Preference-based
Reinforcement Learning (PbRL) avoids manual reward design
but often suffers from limited interpretability and
unstable training. To address these issues, we propose a
novel preference alignment framework. Our approach
leverages large language models to generate sub-reward
functions informed by prior knowledge and further aligns
with human preferences by optimizing the weights that
combine these sub-rewards. For policy learning, we introduce
Policy Optimization via Pareto Regularization (POPR), which
regularizes updates along Pareto-optimal directions.
Experiments show that our framework improves reward quality
and policy stability, achieving superior performance to
expert-designed rewards across most tasks.
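
The weight-optimization step described in this abstract can be sketched roughly as follows. The abstract does not specify the preference model, so this sketch assumes a Bradley-Terry likelihood (a common choice in preference-based RL) over linearly combined sub-rewards; `fit_weights`, the data layout, and the toy preference pairs are all hypothetical, and the POPR policy update is not covered.

```python
import math


def fit_weights(pairs, n_sub, lr=0.1, steps=500):
    """Fit sub-reward weights from pairwise preferences (illustrative only).

    pairs: list of (phi_a, phi_b, label), where phi_* holds the summed value
    of each sub-reward over a trajectory and label is 1.0 if trajectory A
    was preferred, else 0.0.
    """
    w = [0.0] * n_sub
    for _ in range(steps):
        grad = [0.0] * n_sub
        for phi_a, phi_b, label in pairs:
            diff = [a - b for a, b in zip(phi_a, phi_b)]
            logit = sum(wi * d for wi, d in zip(w, diff))
            p_a = 1.0 / (1.0 + math.exp(-logit))  # P(A preferred over B)
            for i, d in enumerate(diff):
                grad[i] += (label - p_a) * d  # log-likelihood gradient
        # Gradient ascent on the mean log-likelihood.
        w = [wi + lr * g / len(pairs) for wi, g in zip(w, grad)]
    return w


# Toy data: annotators favour trajectories that score higher on sub-reward 0.
toy_pairs = [
    ([1.0, 0.0], [0.0, 1.0], 1.0),
    ([0.0, 1.0], [1.0, 0.0], 0.0),
    ([2.0, 1.0], [1.0, 1.0], 1.0),
]
weights = fit_weights(toy_pairs, n_sub=2)
print(weights)
```

Because the learned reward is a weighted sum of named sub-rewards, the fitted weights can be read directly, which is one way such a design could help with the interpretability concern raised above.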

Efficient Preference Alignment via Pareto Exploration (Student Abstract)

We present AniTales, a system designed to generate multimodal visual novels from natural language prompts. Our system integrates large language models for story generation, diffusion models for character art, and text-to-speech for voice acting. This paper describes the system's architecture and presents findings from a pilot user study. We evaluated the system with general users (n=10) and domain experts (n=5), focusing on usability, coherence, and visual consistency. General users reported high usability (SUS: 84/100) and strong character-dialogue consistency (4.2/5), along with an average score of 82/100 for their intention to continue using the platform. These initial results suggest AniTales is a promising approach for bridging the gap between text-based AI storytelling and end-to-end multimedia content creation.

AniTales: End-to-End Multimodal Story Generation Through Natural Language Prompting (Student Abstract)

We propose—somewhat tongue-in-cheek, yet with serious
implications—a new test for artificial intelligence: the
ability to watch a 90-minute episode of the long-running
German crime drama Tatort, and to explain every relevant
detail. This involves reconstructing the evolving social
network of characters, identifying their beliefs, desires,
and intentions, and, crucially, determining who committed
the crime. We argue that this task integrates narrative
understanding, common-sense reasoning, social cognition, and
theory of mind, and thus provides a uniquely challenging
benchmark for AI.

The Tatort Test of Intelligence: Towards Narrative Comprehension as a Benchmark for AI

Continual Test-Time Adaptation (CTTA), which aims to adapt the pre-trained model to ever-evolving target domains, emerges as an important task for vision models. As current vision models appear to be heavily biased towards texture, continuously adapting the model from one domain distribution to another can result in serious catastrophic forgetting. Drawing inspiration from the encoding characteristics of neuron activation in neural networks, we propose the Mixture-of-Activation-Sparsity-Experts (MoASE) for the CTTA task. Given the distinct reaction of neurons with low and high activation to domain-specific and agnostic features, MoASE decomposes the neural activation into high-activation and low-activation components in each expert with a Spatial Differentiable Dropout (SDD).
Based on the decomposition, we devise a Domain-Aware Router (DAR) that utilizes domain information to adaptively weight experts that process the post-SDD sparse activations, and the Activation Sparsity Gate (ASG) that adaptively assigns feature selection thresholds of the SDD for different experts for more precise feature decomposition. Finally, we introduce a Homeostatic-Proximal (HP) loss to maintain update consistency between the teacher and student experts to prevent error accumulation. Extensive experiments substantiate that our methodology achieves state-of-the-art performance in both classification and segmentation tasks.

Decomposing the Neurons: Activation Sparsity via Mixture of Experts for Continual Test Time Adaptation

Image fusion integrates complementary information from different modalities to generate high-quality fused images, thereby enhancing downstream tasks such as object detection and semantic segmentation. Unlike task-specific techniques that primarily focus on consolidating inter-modal information, general image fusion needs to address a wide range of tasks while improving performance without increasing complexity. To achieve this, we propose SMC-Mamba, a Self-supervised Multiplex Consensus Mamba framework for general image fusion. Specifically, the Modality-Agnostic Feature Enhancement (MAFE) module preserves fine details through adaptive gating and enhances global representations via spatial-channel and frequency rotational scanning. The Multiplex Consensus Cross-modal Mamba (MCCM) module enables dynamic collaboration among experts, reaching a consensus to efficiently integrate complementary information from multiple modalities. The cross-modal scanning within MCCM further strengthens feature interactions across modalities, facilitating seamless integration of critical information from both sources. Additionally, we introduce a Bi-level Self-supervised Contrastive Learning Loss (BSCL), which preserves high-frequency information without increasing computational overhead while simultaneously boosting performance in downstream tasks. Extensive experiments demonstrate that our approach outperforms state-of-the-art (SOTA) image fusion algorithms in tasks such as infrared-visible, medical, multi-focus, and multi-exposure fusion, as well as downstream visual tasks.

Self-supervised Multiplex Consensus Mamba for General Image Fusion

Spiking Neural Networks (SNNs) are a promising paradigm designed to emulate the brain's energy efficiency by incorporating the timing of spikes. Conversion is an efficient way to obtain high-performance SNNs from Artificial Neural Networks (ANNs). Existing conversion methods often face a trade-off between accuracy and time steps, which is largely caused by the incomplete release of residual membrane potentials. To minimize the conversion error, this paper proposes a harmonious mathematical property-based neuron, called Harmony Multi-Threshold Neurons (H-MT Neuron), which utilizes multiple spikes to minimize residual membrane potentials. The proposed neuron is further enhanced with an optional effective communication mechanism to achieve more accurate conversion. In addition, we propose a threshold optimization method applicable to a broader range of spiking neurons to find the optimal neuron thresholds. Experimental results demonstrate that our method achieves superior accuracy on ImageNet benchmark datasets while significantly reducing the required time steps and energy consumption.

Generalized Threshold Optimization with Harmony Multi-Threshold Neurons for Accurate ANN-to-SNN Conversion

Nighttime flares, caused by complex scattering and reflections from artificial light sources, significantly degrade image quality and hinder downstream visual tasks. Existing deflare networks usually struggle to jointly capture and fuse latent spatial and frequency features. In this paper, we propose a novel Wavelet-guided and Gated-enhanced Spatial-frequency Fusion Network (WGSF-Net) for nighttime flare removal. WGSF-Net is primarily composed of two key modules: Wavelet-guided Fusion Block (WFB) and Local-Global Block (LGB). Specifically, WFB integrates a Multi-level Wavelet Enhancement Block (MWEB) and a Spatial-Frequency Fusion Network (SFFN) to effectively extract hierarchical spatial and frequency features through a coarse-to-fine strategy based on multi-level wavelet decomposition. To better suppress flare artifacts, LGB is designed to jointly capture local and global information: a Gated-Enhanced Attention Block (GEAB) selectively amplifies critical local features using channel-shuffle convolutions and a difference network, and the subsequent SFFN performs global spatial-frequency fusion via partial Fourier convolution and depthwise separable convolution. This design enables LGB to effectively disentangle flare-corrupted regions and restore fine-grained details, making it particularly suited for challenging real-world deflare scenarios. Extensive experiments on both synthetic and real datasets show that WGSF-Net achieves state-of-the-art performance in nighttime flare removal, outperforming existing methods across five evaluation metrics.

Nighttime Flare Removal via Wavelet-Guided and Gated-Enhanced Spatial-Frequency Fusion Network

Sequential recommendation (SR) aims to predict a user's next item preference by modeling historical interaction sequences. Recent advances often integrate frequency-domain modules to compensate for self-attention's low-pass nature by restoring the high-frequency signals critical for personalized recommendations. Nevertheless, existing frequency-aware solutions process each session in isolation and optimize exclusively with time-domain objectives. Consequently, they overlook cross-session spectral dependencies and fail to enforce alignment between predicted and actual spectral signatures, leaving valuable frequency information under-exploited. To this end, we propose FreqRec, a Frequency-Enhanced Dual-Path Network for sequential Recommendation that jointly captures inter-session and intra-session behaviors via learnable Frequency-domain Multi-Layer Perceptrons. Moreover, FreqRec is optimized under a composite objective that combines cross entropy with a frequency-domain consistency loss, explicitly aligning predicted and true spectral signatures. Extensive experiments on three benchmarks show that FreqRec surpasses strong baselines and remains robust under data sparsity and noisy-log conditions.
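
To make the composite objective in this abstract more concrete, the sketch below adds a frequency-domain consistency term to a given cross-entropy value. The abstract does not give the exact form of the loss, so the mean absolute difference between discrete Fourier transforms, the weighting factor `lam`, and the 1-D toy sequences are assumptions; a real model would presumably apply this to embedding sequences with an FFT library rather than the naive transform used here.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a real-valued sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def spectral_consistency_loss(pred, target):
    """Mean magnitude of the difference between the two spectra (assumed form)."""
    spec_pred, spec_target = dft(pred), dft(target)
    return sum(abs(a - b) for a, b in zip(spec_pred, spec_target)) / len(spec_pred)


def composite_loss(cross_entropy, pred, target, lam=0.1):
    """Cross entropy plus a weighted frequency-domain consistency term."""
    return cross_entropy + lam * spectral_consistency_loss(pred, target)


# Toy sequences standing in for predicted and true representations.
predicted = [1.0, 0.0, 0.0, 0.0]
actual = [0.0, 0.0, 0.0, 0.0]
print(composite_loss(0.5, predicted, actual))
```

The extra term is zero when the predicted and true sequences have identical spectra, so it only adds pressure where the frequency content disagrees.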

Exploiting Inter-Session Information with Frequency-enhanced Dual-Path Networks for Sequential Recommendation

Competency Questions (CQs) play a crucial role in validating ontology design. While manually crafting CQs can be highly time-consuming and costly for ontology engineers, recent studies have explored the use of large language models (LLMs) to automate this process. However, prior approaches have largely evaluated generated CQs based on their similarity to existing datasets, which often fail to verify semantic pitfalls such as “Misusing allValuesFrom”. Since such pitfalls cannot be reliably detected through rule-based methods, we propose a novel dataset and model of Validating Semantic Pitfalls in Ontology (VSPO) for CQ generation specifically designed to verify the semantic pitfalls. To simulate missing and misused axioms, we use an LLM to generate natural language definitions of classes and properties and introduce misalignments between the definitions and the ontology by removing axioms or altering logical operators (e.g., substituting union with intersection). We then fine-tune LLaMA-3.1-8B-Instruct to generate CQs that validate these semantic discrepancies between the provided definitions and the corresponding axioms. The resulting CQs can detect a broader range of modeling errors compared to existing public datasets. Our fine-tuned model demonstrates superior performance over baselines, showing 26% higher precision and 28.2% higher recall than GPT-4.1 in generating CQs for pitfall validation. This research enables automatic generation of TBox-validating CQs using LLMs, significantly reducing manual effort while improving semantic alignment between ontologies and expert knowledge. To the best of our knowledge, this is the first study to target semantic pitfall validation in CQ generation using LLMs.
