Wave‑like images—from attosecond streaking spectrograms to optical spectra, audio mel‑spectrograms and periodic video frames—encode critical harmonic structures that elude conventional feature extractors. We propose a unified, matrix‑equivalent framework that reinterprets convolution and attention as linear transforms on flattened inputs, revealing filter weights as basis vectors spanning latent feature subspaces. To infuse spectral priors, we apply elementwise $\sin(\cdot)$ mappings to each weight matrix. Embedding these transforms into CNN, ViT and Capsule architectures yields Sin‑Basis Networks with heightened sensitivity to periodic motifs and built‑in invariance to spatial shifts. Experiments on a diverse collection of wave‑like image datasets—including 80,000 synthetic attosecond streaking spectrograms; thousands of Raman, photoluminescence and FTIR spectra; mel‑spectrograms from AudioSet; and cycle‑pattern frames from Kinetics—demonstrate substantial gains in reconstruction accuracy, translational robustness and zero‑shot cross‑domain transfer. Theoretical analysis via matrix isomorphism and Mercer‑kernel truncation quantifies how sinusoidal reparametrization enriches expressivity while preserving stability in data‑scarce regimes. Sin‑Basis Networks thus offer a lightweight, physics‑informed approach to deep learning across wave‑form imaging modalities.
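The two core ideas of the abstract—viewing convolution as a matrix acting on the flattened input, and reparametrizing the raw weights through an elementwise $\sin(\cdot)$—can be sketched in a few lines. The snippet below is a minimal 1-D illustration, not the authors' implementation: the Toeplitz construction and the variable names are assumptions made for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_as_matrix(weights, input_len):
    """Build the (input_len - k + 1) x input_len matrix whose rows are
    shifted copies of the kernel, so that M @ x equals the 'valid'
    cross-correlation of x with the kernel (the matrix-equivalent view)."""
    k = len(weights)
    out_len = input_len - k + 1
    M = np.zeros((out_len, input_len))
    for i in range(k):
        # place weight i along the i-th off-diagonal
        M[np.arange(out_len), np.arange(out_len) + i] = weights[i]
    return M

# Sin-Basis reparametrization (as described in the abstract): the trainable
# raw parameters are passed through an elementwise sin before being used
# as filter weights, imposing a bounded, oscillatory spectral prior.
raw_w = rng.normal(size=5)   # raw trainable parameters (hypothetical)
sin_w = np.sin(raw_w)        # effective weights, each in [-1, 1]

x = rng.normal(size=32)
M = conv_as_matrix(sin_w, len(x))
y_matrix = M @ x                                # linear-transform view
y_sliding = np.correlate(x, sin_w, mode="valid")  # standard sliding-window view

# Both views coincide, which is the matrix-equivalence the framework exploits.
assert np.allclose(y_matrix, y_sliding)
```

In a full network, the same elementwise $\sin(\cdot)$ would be applied to each weight matrix (convolutional filters, attention projections, or capsule transforms) before the forward pass, leaving the architecture otherwise unchanged.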