Jigsaw puzzle solving remains difficult because models must reconcile local fragment cues with global structure. Most prior work leans solely on visual signals (edge or texture coherence) and rarely exploits natural-language descriptions, which are especially helpful for puzzles with eroded gaps. We introduce a vision–language framework that uses textual context to guide assembly. At its core, the Vision–Language Hierarchical Semantic Alignment (VLHSA) module aligns image patches with text via multi-level matching, from local tokens to global summaries, within a multimodal design that couples dual visual encoders with language features for cross-modal reasoning. Across multiple datasets, the method surpasses the state of the art, including a 14.2 percentage point gain in piece accuracy; ablations identify VLHSA as the principal source of improvement. These results suggest a practical shift for jigsaw solving: augmenting vision with language to resolve ambiguous placements.
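To make the hierarchical alignment idea concrete, the sketch below shows one plausible way to score agreement between fragment patch embeddings and description token embeddings at two levels: a local token-to-patch match and a global summary-to-summary match. All function names, the loss formulation, and the tensor shapes are assumptions for illustration; the abstract does not specify the actual VLHSA implementation.

```python
# Hypothetical sketch of hierarchical vision-language alignment in PyTorch.
# Module names, weighting, and dimensions are illustrative assumptions,
# not the paper's actual VLHSA design.
import torch
import torch.nn.functional as F


def local_alignment(patch_emb: torch.Tensor, token_emb: torch.Tensor) -> torch.Tensor:
    """Token-level alignment: each text token is matched to its best image patch,
    and the scores are averaged over tokens (late-interaction style)."""
    patch_emb = F.normalize(patch_emb, dim=-1)   # (B, P, D) fragment patches
    token_emb = F.normalize(token_emb, dim=-1)   # (B, T, D) description tokens
    sim = torch.einsum("bpd,btd->bpt", patch_emb, token_emb)  # (B, P, T)
    return sim.max(dim=1).values.mean(dim=-1)    # (B,): best patch per token, averaged


def global_alignment(patch_emb: torch.Tensor, token_emb: torch.Tensor) -> torch.Tensor:
    """Summary-level alignment: cosine similarity between pooled image and text features."""
    img_global = F.normalize(patch_emb.mean(dim=1), dim=-1)   # (B, D)
    txt_global = F.normalize(token_emb.mean(dim=1), dim=-1)   # (B, D)
    return (img_global * txt_global).sum(dim=-1)              # (B,)


def hierarchical_alignment_score(patch_emb, token_emb, w_local=0.5, w_global=0.5):
    """Combine local (token/patch) and global (summary) alignment into one score."""
    return (w_local * local_alignment(patch_emb, token_emb)
            + w_global * global_alignment(patch_emb, token_emb))


if __name__ == "__main__":
    B, P, T, D = 4, 49, 16, 256           # batch, patches, text tokens, embed dim
    patches = torch.randn(B, P, D)        # visual encoder output for fragments
    tokens = torch.randn(B, T, D)         # text encoder output for the description
    print(hierarchical_alignment_score(patches, tokens))  # higher = stronger match
```

In a full solver, such a score could rank candidate piece placements so that language context disambiguates positions where visual edge cues alone (for example, across eroded gaps) are insufficient.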
