Singapore

Audio synthesis has broad applications in multimedia. Recent advancements have made it possible to generate relevant audios from inputs describing an audio scene, such as images or texts. However, the immersiveness and expressiveness of the generation are limited. One possible problem is that existing methods solely rely on the global scene and overlook details of local sounding objects (i.e., sound sources). To address this issue, we propose a Sound Source-Aware Audio (SS2A) generator. SS2A is able to locally perceive multimodal sound sources from a scene with visual detection and cross-modality translation. It then contrastively learns a Cross-Modal Sound Source (CMSS) Manifold to semantically disambiguate each source. Finally, we attentively mix their CMSS semantics into a rich audio representation, from which a pretrained audio generator outputs the sound. To model the CMSS manifold, we curate a novel single-sound-source visual-audio dataset VGGS3 from VGGSound. We also design a Sound Source Matching Score to clearly measure localized audio relevance. With the effectiveness of explicit sound source modeling, SS2A achieves state-of-the-art performance in extensive image-to-audio tasks. We also qualitatively demonstrate SSV2A&#39;s ability to achieve intuitive synthesis control by compositing vision, text, and audio conditions. Furthermore, we show that our sound source modeling can achieve competitive video-to-audio performance with a straightforward temporal aggregation mechanism.

AAAI 2026

Gotta Hear Them All: Towards Sound Source Aware Audio Generation

vision to audio generation

audio synthesis

multimodal learning

Audio synthesis has broad applications in multimedia. Recent advancements have made it possible to generate relevant audios from inputs describing an audio scene, such as images or texts. However, the immersiveness and expressiveness of the generation are limited. One possible problem is that existing methods solely rely on the global scene and overlook details of local sounding objects (i.e., sound sources). To address this issue, we propose a Sound Source-Aware Audio (SS2A) generator. SS2A is able to locally perceive multimodal sound sources from a scene with visual detection and cross-modality translation. It then contrastively learns a Cross-Modal Sound Source (CMSS) Manifold to semantically disambiguate each source. Finally, we attentively mix their CMSS semantics into a rich audio representation, from which a pretrained audio generator outputs the sound. To model the CMSS manifold, we curate a novel single-sound-source visual-audio dataset VGGS3 from VGGSound. We also design a Sound Source Matching Score to clearly measure localized audio relevance. With the effectiveness of explicit sound source modeling, SS2A achieves state-of-the-art performance in extensive image-to-audio tasks. We also qualitatively demonstrate SSV2A's ability to achieve intuitive synthesis control by compositing vision, text, and audio conditions. Furthermore, we show that our sound source modeling can achieve competitive video-to-audio performance with a straightforward temporal aggregation mechanism.

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.<br><br>

To access this event page, you need to log in with the **email address you registered with**. <br>Access credentials will be sent to your email from Underline -  subject line "Welcome to AAAI 2026". Please be sure to check your spam email folder if you do not see an email confirmation right away.

Please log in

To access this event page, you are required to register.
Please complete your registration to continue.

We recommend reading [**the registration information**](https://aaai.org/conference/aaai/aaai-26/registration/) first.

**Online Registration Form**: https://aaai.getregistered.net/conference-2026 

Registration Required

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

Memory behavior modeling seeks to predict individual recall performance and understand its underlying cognitive mechanisms. However, the dynamic and heterogeneous nature of memory data poses significant challenges to the generalization ability of models under unseen conditions. To address this challenge, we propose an invariant representation learning framework I-Mem that integrates self-supervised contrastive learning with decorrelation constraints, enabling the adaptive identification and suppression of environment-related factors in sequential behavioral data, thereby mitigating the influence of spurious features and enhancing the modeling of stable cognitive structures. Importantly, the method does not rely on explicit environment partitioning or predefined environment labels, while our theoretical analysis demonstrates that it can effectively resist environmental perturbations and facilitate the extraction of invariant structural representations, thereby ensuring adaptability and generalization. Empirical evaluations on both synthetic and real-world datasets further confirm its superiority over mainstream methods in terms of generalization performance and stable feature identification. Feature attribution analysis reveals that I-Mem extracts invariant features aligned with classical cognitive effects, and reflects short-term behavioral patterns that may indicate latent cognitive mechanisms beyond existing theories, highlighting both interpretability and discovery potential.

Invariant Representation Learning for Memory Behavior Modeling via Adaptive Environment Separation

Neural Radiance Fields (NeRF)-based Visual Simultaneous Localization and Mapping (SLAM) achieve superior scene geometric modeling and robust camera tracking by leveraging neural representations. 
Existing methods typically relied on multi-resolution hash encoding with truncated signed distance fields (TSDF) to achieve high frame rates. However, unavoidable hash collisions can lead to artifacts, and multi-view color inconsistencies in indoor scenes can result in shape-radiance ambiguity, adversely affecting geometric quality and tracking accuracy.
To address these issues, we propose a novel Multi-scale Hybrid Encoding-based Decoupled SLAM (MHED-SLAM). 
First, to mitigate the adverse effects of hash collisions and reduce the number of learnable parameters, we innovatively fuse a coarse-scale hash tri-plane with a fine-scale hash grid within a single latent volume. 
Second, to enable precise geometric reconstruction and camera tracking, we decouple the reconstruction and rendering processes, independently learning a TSDF field for reconstruction and a density field for rendering.
Third, we devise a Symmetric Kullback-Leibler (SKL) strategy based on ray termination distributions to align the probability distributions derived from the TSDF and density fields for their synchronous convergence. 
Extensive experimental evaluations demonstrate that our approach surpasses the state-of-the-art (SOTA) methods by utilizing a faster frame rate of 20 Hz and fewer parameters, while achieving higher tracking and reconstruction accuracy.

MHED-SLAM: Multi-Scale Hybrid Encoding-Based Decoupled SLAM

Graph contrastive learning (GCL) aims to learn representations by bringing semantically similar graphs closer and pushing dissimilar ones farther apart without label supervision. Hard negatives, which refer to graphs that have different labels but similar embeddings to the target graph, play a key role in improving representation discrimination. However, current methods that generate both high-quality positives and hard negatives face two challenges:
(1) Hard negative sample generation often suffers from class imbalance, resulting in unequal attention across classes and reduced discriminative power in the learned representations.
(2) The typical binary positive sample generation approach, which divides the graph into important and unimportant semantic regions, overlooks regions that negatively impact semantics and mislead model predictions. To address these issues, we introduce a novel method named BalanceGCL, which enhance graph contrastive learning with balanced hard negatives and fine-grained semantic-aware positives. BalanceGCL comprises two modules: Balanced Hard Negative graphs generation (BHN) and Fine-grained Semantic-aware Positive graphs generation (FSP). Inspired by the counterfactual mechanism, BHN generates balanced hard negatives that remain structurally similar to the original graph while inducing a controlled semantic shift. To ensure class balance, BHN iteratively constructs one hard negative sample for each class, ensuring an even distribution of negative samples across all alternative categories. FSP leverages the semantic differences between original graphs and balanced hard negatives to identify positively contributing, negatively contributing, and unimportant regions. By enhancing the influence of positive contributors, suppressing negative ones, and perturbing unimportant areas, it generates more reliable and semantically complete positive samples. The proposed method outperforms state-of-the-art GCL techniques across 14 datasets in graph classification and transfer learning tasks, demonstrating its effectiveness in tackling class imbalance and identifying fine-grained semantic-aware regions.

Graph Contrastive Learning with Balanced Hard Negatives and Fine-grained Semantic-aware Positives

Accurate and efficient depth estimation from time-of-flight (ToF) LiDAR is essential for autonomous systems operating in real-world environments. However, traditional histogram-based depth estimation (HBDE) algorithms face fundamental limitations in balancing depth performance and computational cost, and they struggle under signal-induced pile-up distortion. While deep learning has shown promise, existing neural network-based methods rely on large models that are impractical for deployment on edge hardware. To bridge this critical gap, we propose a paradigm shift in histogram-based ToF estimation, reframing depth estimation from signal filtering to lightweight similarity learning. Instead of attempting to correct the distorted signal, our approach learns a specialized metric where the measure of similarity between the distorted histogram and a reference pulse is the temporal shift itself. The resulting 57.61 KB model, over 215.2 $\times$ smaller than state-of-the-art deep learning approaches, achieves real-time performance (106.27 fps) on an FPGA. It delivers superior accuracy across nearly all signal-noise conditions, including 2.21 cm RMSE at severe pile-up scenarios, significantly outperforming conventional methods while remaining practical for on-device deployment—a feat unattainable by prior large-scale deep learning models.

A Paradigm Shift in High-Resolution Depth Estimation Using SPAD-Based LiDAR Histograms: From Signal Filtering to Lightweight Similarity Learning

Infrared small target detection is challenging due to limited target size and low signal-to-noise ratio. Unlike common targets, infrared small targets contain a higher proportion of edge pixels and exhibit blurred boundaries due to diffraction and quantization artifacts, making boundaries uniquely valuable cues for target perception. However, existing methods often emphasize holistic modeling while underutilizing such informative boundary cues. Motivated by this observation, we propose a Dual-Path Edge-Guided Frequency-Aware Network (DEFANet), which enables edge-target collaborative modeling for enhanced feature representation. DEFANet features a dual-path design, consisting of a main branch for holistic target modeling and an edge branch for boundary transition perception. To facilitate interaction and enhance representation in both branches, we introduce two core modules: Frequency-Aware Dual Enhancement Module (FADE) and Edge-Guided Integration Module (EGI). FADE employs a Frequency-Decoupled Attention Enhancement Mechanism to enhance both branches in the frequency domain, strengthening holistic modeling in the main branch and boundary representation in the edge branch. EGI leverages a Dual-Path Group-Wise Guidance Mechanism to integrate enhanced edge features into the main branch, improving boundary perception. Extensive experiments on four public infrared small target datasets, MDvsFA, LAFT, SIRST, and SIATD, demonstrate that DEFANet achieves SOTA performance. Ablation studies further validate the effectiveness of DEFANet and the soundness of its design motivation.

DEFANet: Dual-Path Edge-Target Collaboration with Frequency-Aware Enhancement for Infrared Small Target Detection

Neural solvers have demonstrated remarkable success in combinatorial optimization, often surpassing traditional heuristics in speed, solution quality, and generalization. However, their efficacy deteriorates significantly when confronted with complex constraints that cannot be effectively managed through simple masking mechanisms. To address this limitation, we introduce Universal Constrained Preference Optimization (UCPO), a novel plug-and-play framework that seamlessly integrates preference learning into existing neural solvers via a specially designed loss function, without requiring architectural modifications. UCPO embeds constraint satisfaction directly into a preference-based objective, eliminating the need for meticulous hyperparameter tuning. Leveraging a lightweight warm-start fine-tuning protocol, UCPO enables pre-trained models to consistently produce near-optimal, feasible solutions on challenging constraint-laden tasks, achieving exceptional performance with as little as 1\% of the original training budget.

UCPO: A Universal Constrained Combinatorial Optimization Method via Preference Optimization

Text-to-image (T2I) models have raised increasing safety concerns due to their capacity to generate NSFW and other banned objects. To mitigate these risks, safety filters and concept removal techniques have been introduced to block inappropriate prompts or erase sensitive concepts from the models. However, all the existing defense methods are not well prepared to handle diverse adversarial prompts. In this work, we introduce MacPrompt, a novel black-box and cross-lingual attack that reveals previously overlooked vulnerabilities in T2I safety mechanisms. Unlike existing attacks that rely on synonym substitution or prompt obfuscation, MacPrompt constructs macaronic adversarial prompts by performing cross-lingual character-level recombination of harmful terms, enabling fine-grained control over both semantics and appearance. By leveraging this design, MacPrompt crafts prompts with high semantic similarity to the original harmful inputs (up to 0.96) while bypassing major safety filters (up to 100\%). More critically, it achieves attack success rates as high as 92\% for sex-related content and 90\% for violence, effectively breaking even state-of-the-art concept removal defenses. These results underscore the pressing need to reassess the robustness of existing T2I safety mechanisms against linguistically diverse and fine-grained adversarial strategies.
\textbf{Warning: Content Warning: This paper includes sensitive examples (e.g., adult, violent, or illegal content). Unsafe images are masked but may still be disturbing.}

MacPrompt: Maraconic-Guided Jailbreak Against Text-to-Image Models

Structured data question answering (QA), including table QA, Knowledge Graph (KG) QA, and temporal KG QA, is a pivotal research area. Advances in large language models (LLMs) have driven significant progress in unified structural QA frameworks like TrustUQA. However, these frameworks face challenges when applied to small-scale LLMs since small-scale LLMs are prone to errors in generating structured queries.
To improve the structured data QA ability of small-scale LLMs, we propose a self-correction distillation (SCD) method. In SCD, an error prompt mechanism (EPM) is designed to detect errors and provide customized error messages during inference, and a two-stage distillation strategy is designed to transfer large-scale LLMs' query-generation and error-correction capabilities to small-scale LLM.
Experiments across 5 benchmarks with 3 structured data types demonstrate that our SCD achieves the best performance and superior generalization on small-scale LLM (8B) compared to other distillation methods, and closely approaches the performance of GPT4 on some datasets. Furthermore, large-scale LLMs equipped with EPM surpass the state-of-the-art results on most datasets.

Self-Correction Distillation for Structured Data Question Answering

Large Language Models (LLMs) with Mixture-of-Experts (MoE) architectures are distinguished by their strong performance scaling with increasing parameters across a wide range of tasks, yet they also suffer from substantial computational and storage overheads. Notably, the performance gains of MoE models do not scale proportionally with the growth in expert parameters. While prior works attempt to reduce parameters via expert-level pruning, merging, or decomposition, they still suffer from challenges in both performance and computational efficiency. In this paper, we address these challenges by introducing **micro-expert** as a finer-grained compression unit that spans across matrices. We first establish a more fundamental perspective, viewing MoE layers as mixtures of micro-experts, and present **CAMERA**, a lightweight and training-free framework for identifying micro-expert redundancy. Our analysis uncovers significant variance in micro-expert contributions during decoding. Based on this insight, we further propose **CAMERA-P**, a structured micro-expert pruning framework, and **CAMERA-Q**, a mixed-precision quantization idea designed for micro-experts. Extensive experiments on nine downstream tasks show that **CAMERA-P** consistently outperforms strong baselines under pruning ratios ranging from 20% to 60%. Furthermore, **CAMERA-Q** achieves superior results under aggressive 2-bit quantization, surpassing existing matrix- and channel-level ideas. Notably, our method enables complete micro-expert analysis of Qwen2-57B-A14B in less than 5 minutes on a single NVIDIA A100-40GB GPU.

CAMERA: Multi-Matrix Joint Compression for MoE Models via Micro-Expert Redundancy Analysis

Intraoperative hypotension (IOH) poses significant surgical risks, but accurate prediction remains challenging due to patient-specific variability. While test-time adaptation (TTA) offers a promising approach for personalized prediction, the rarity of IOH events often leads to unreliable test-time training. To address this, we propose CSA-TTA, a novel cross-sample augmented test-time adaptation framework that enhances training by incorporating hypotension events from other individuals. Specifically, we first construct a cross-sample bank by segmenting historical data into hypotensive and non-hypotensive samples. Then, we introduce a coarse-to-fine retrieval strategy for building test-time training data: we initially apply K-Shape clustering to identify representative cluster centers and subsequently retrieve the top-K semantically similar samples based on the current patient signal. Additionally, we integrate both self-supervised masked reconstruction and retrospective sequence forecasting signals during training to enhance model adaptability to rapid and subtle intraoperative dynamics. We evaluate the proposed CSA-TTA on both the VitalDB dataset and a real-world in-hospital dataset by integrating it with state-of-the-art time series forecasting models, including TimesFM and UniTS. CSA-TTA consistently enhances performance across settings—for instance, on VitalDB, it improves Recall and F1 scores by +1.33% and +1.13%, respectively, under fine-tuning, and by +5.07% and +7.46% in zero-shot scenarios—demonstrating strong robustness and generalization.

Content not yet available

Next from AAAI 2026

Invariant Representation Learning for Memory Behavior Modeling via Adaptive Environment Separation

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES