Singapore

Accurately localizing and segmenting relevant objects from optical remote sensing images (ORSIs) is critical for advancing remote sensing applications. Existing methods are typically built upon moderate-scale pre-trained models and employ diverse optimization strategies to achieve promising performance under full-parameter fine-tuning. Deeper and larger-scale foundation models can provide stronger support for performance improvement. However, due to their massive number of parameters, directly adopting full-parameter fine-tuning leads to pronounced training difficulties, such as excessive GPU memory consumption and high computational costs, so existing works have explored large-scale models only to a very limited extent. In this paper, we propose a novel dynamic wavelet expert-guided fine-tuning paradigm with fewer trainable parameters, dubbed WEFT, which efficiently adapts large-scale foundation models to ORSI segmentation tasks by leveraging the guidance of wavelet experts. Specifically, we introduce a task-specific wavelet expert extractor to model wavelet experts from different perspectives and dynamically regulate their outputs, thereby generating trainable features enriched with task-specific information for subsequent fine-tuning. Furthermore, we construct an expert-guided conditional adapter that first enhances the fine-grained perception of frozen features for specific tasks by injecting trainable features, and then iteratively updates the information of both types of features, allowing for efficient fine-tuning. Extensive experiments show that our WEFT not only outperforms 21 state-of-the-art methods on three ORSI datasets, but also achieves the best results in camouflage, natural, and medical scenarios.
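As a rough illustration of the idea of wavelet "experts" whose outputs are dynamically regulated, the toy sketch below splits a 1D signal into low- and high-frequency Haar sub-bands and reweights them with softmax gates before reconstruction. This is not the WEFT implementation: the abstract does not specify the wavelet family, the gating mechanism, or the feature shapes, so the Haar transform, the softmax gate, and all function names here are assumptions chosen for simplicity.

```python
import math

def haar_dwt(signal):
    """One level of the 1D Haar transform: returns (approximation, detail)."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    s = 1 / math.sqrt(2)
    pairs = list(zip(signal[0::2], signal[1::2]))
    approx = [(a + b) * s for a, b in pairs]  # low-frequency sub-band
    detail = [(a - b) * s for a, b in pairs]  # high-frequency sub-band
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of haar_dwt: interleaves the reconstructed sample pairs."""
    s = 1 / math.sqrt(2)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) * s, (a - d) * s])
    return out

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gated_wavelet_features(signal, gate_scores):
    """Hypothetical gate: weight the two sub-band 'experts', then reconstruct.

    gate_scores holds two raw scores (low, high). In a real model they would
    be predicted from the input; here they are passed in by hand.
    """
    approx, detail = haar_dwt(signal)
    w_low, w_high = softmax(gate_scores)
    return haar_idwt([w_low * a for a in approx], [w_high * d for d in detail])
```

With a gate that strongly favors the low-frequency expert, each pair of samples collapses to its mean, so the output is a smoothed version of the input; favoring the high-frequency expert keeps only local differences.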

AAAI 2026

Small but Mighty: Dynamic Wavelet Expert-Guided Fine-Tuning of Large-Scale Models for Optical Remote Sensing Object Segmentation

optical remote sensing image

fine-tuning of large-scale models

object segmentation

computer vision

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.


Infrared unmanned aerial vehicle (UAV) target images often suffer from motion blur degradation caused by rapid sensor movement, significantly reducing contrast between target and background. Generally, detection performance heavily depends on the discriminative feature representation between target and background. Existing methods typically treat deblurring as a preprocessing step focused on visual quality, while neglecting the enhancement of task-relevant features crucial for detection. Improving feature representation for detection under blur conditions remains challenging. In this paper, we propose a novel $\textbf{J}$oint $\textbf{F}$eature-$\textbf{D}$omain $\textbf{D}$eblurring and $\textbf{D}$etection end-to-end framework, dubbed $JFD^{3}$. We design a dual-branch architecture with shared weights, where the clear branch guides the blurred branch to enhance discriminative feature representation. Specifically, we first introduce a lightweight feature restoration network, where features from the clear branch serve as feature-level supervision to guide the blurred branch, thereby enhancing its distinctive capability for detection. We then propose a frequency structure guidance module that refines the structure prior from the restoration network and integrates it into shallow detection layers to enrich target structural information. Finally, a feature consistency self-supervised loss is imposed between the dual-branch detection backbones, driving the blurred branch to approximate the feature representations of the clear one. We also construct a benchmark, named IRBlurUAV, containing 30,000 simulated and 4,118 real infrared UAV target images with diverse motion blur. Extensive experiments on IRBlurUAV demonstrate that $JFD^{3}$ achieves superior detection performance while maintaining real-time efficiency. Code and dataset are released at https://anonymous.4open.science/r/JFD3-76C1.

Blur-Robust Detection via Feature Restoration: An End-to-End Framework for Prior-Guided Infrared UAV Target Detection

Regional Adaptive Hierarchical Transform (RAHT) is an effective point cloud attribute compression (PCAC) method. However, its use in deep learning remains underexplored. This paper proposes an end-to-end RAHT framework for lossy PCAC based on sparse tensors, called DeepRAHT. The RAHT transform is performed within the end-to-end reconstruction process, without requiring a manual RAHT pre-processing step. We also introduce predictive RAHT to reduce bitrates and design a learning-based prediction model to enhance performance. Moreover, we devise a bitrate proxy that applies run-length coding to entropy coding, achieving seamless variable-rate coding and improving robustness. DeepRAHT is a reversible and distortion-controllable framework, which guarantees a lower bound on its performance and offers significant application potential. Experiments demonstrate that DeepRAHT achieves higher performance, runs faster, and is more robust than the baseline solutions.

DeepRAHT: Learning Predictive RAHT for Point Cloud Attribute Compression

Low-light object detection faces significant challenges due to the substantial domain shift between normal-light and low-light conditions. Prior works often enhance low-light images before detection, but this preprocessing can introduce artifacts that degrade detection performance since it focuses on human visual quality rather than task-specific features. Other methods incorporate illumination-aware modules for low-light feature learning, yet their scalability is limited by the scarcity of annotated low-light datasets. To overcome these limitations, we propose a unified Dual-Level Domain Adaptation (DLDA) framework that jointly addresses pixel-level and feature-level domain discrepancies for robust low-light object detection. Specifically, we introduce a luminance-aware contrastive translation module that synthesizes target-style low-light images while preserving structural details, enabling effective pixel-level adaptation. Building on this, we further design a multi-scale conditional adversarial alignment strategy that enforces semantic consistency across feature hierarchies to enhance domain-invariant feature extraction. Extensive experiments on multiple low-light detection benchmarks demonstrate that DLDA achieves state-of-the-art performance, exhibiting strong robustness and generalization.

DLDA: Unified Dual-Level Domain Adaptation for Low-Light Object Detection

Over-smoothing in Graph Neural Networks (GNNs) causes distinct node features to collapse, particularly on heterophilic graphs where adjacent nodes often have dissimilar labels. Although sheaf neural networks partially mitigate this problem, they typically rely on static or heavily parameterized sheaf structures that hinder generalization and scalability. Existing sheaf-based models either predefine restriction maps or introduce excessive complexity, yet fail to provide rigorous stability guarantees. In this paper, we introduce a novel scheme called SGPC (Sheaf GNNs with PAC-Bayes Calibration), a unified architecture that combines cellular-sheaf message passing with several mechanisms, including optimal transport-based lifting, variance-reduced diffusion, and PAC-Bayes spectral regularization for robust semi-supervised node classification. We establish performance bounds theoretically and demonstrate that end-to-end training in linear computational complexity can achieve the resulting bound-aware objective. Experiments on nine homophilic and heterophilic benchmarks show that SGPC outperforms state-of-the-art spectral and sheaf-based GNNs while providing certified confidence intervals on unseen nodes. The code and proofs are available at https://github.com/ChoiYoonHyuk/SGPC.

Sheaf Graph Neural Networks via PAC-Bayes Spectral Optimization

The discovery of novel proteins relies on sensitive protein identification, for which de novo peptide sequencing (DNPS) from mass spectra is a crucial approach. While deep learning has advanced DNPS, existing models inadequately enforce the fundamental mass consistency constraint: a predicted peptide's mass must match the experimentally measured precursor mass. Previous DNPS methods often treat this critical information as a simple input feature or use it in post-processing, leading to numerous implausible predictions that do not adhere to this fundamental physical property. To address this limitation, we introduce DiffuNovo, a novel regressor-guided diffusion model for de novo peptide sequencing that provides explicit peptide-level mass control. Our approach integrates the mass constraint at two critical stages: during training, a novel peptide-level mass loss guides model optimization, while at inference, regressor-based guidance from gradient-based updates in the latent space steers generation so that the predicted peptide adheres to the mass constraint. Comprehensive evaluations on established benchmarks demonstrate that DiffuNovo surpasses state-of-the-art methods in DNPS accuracy. Additionally, as the first DNPS model to employ a diffusion model as its core backbone, DiffuNovo leverages the powerful controllability of the diffusion architecture and achieves a significant reduction in mass error, thereby producing much more physically plausible peptides. These innovations represent a substantial advancement toward robust and broadly applicable DNPS. The source code is available in the supplementary material.

Regressor-guided Diffusion Model for De Novo Peptide Sequencing with Explicit Mass Control
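The mass consistency constraint mentioned in the abstract above can be checked with a few lines of arithmetic: a peptide's neutral monoisotopic mass is the sum of its residue masses plus one water, and the observed neutral mass follows from the precursor m/z and charge. The sketch below is only a plain post-hoc check, not DiffuNovo's mass loss or its regressor guidance; the function names and the 20 ppm default tolerance are assumptions for illustration, and modifications are ignored.

```python
# Standard monoisotopic residue masses in daltons, rounded to 5 decimals.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # mass of H2O added for the peptide termini
PROTON = 1.00728   # mass of one proton, used to convert m/z to neutral mass

def peptide_mass(sequence):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in sequence) + WATER

def precursor_neutral_mass(mz, charge):
    """Neutral mass implied by an observed precursor m/z and charge state."""
    return (mz - PROTON) * charge

def mass_error_ppm(sequence, mz, charge):
    """Signed error between predicted and observed mass, in parts per million."""
    observed = precursor_neutral_mass(mz, charge)
    return (peptide_mass(sequence) - observed) / observed * 1e6

def is_mass_consistent(sequence, mz, charge, tol_ppm=20.0):
    """True if the predicted peptide matches the precursor within tol_ppm.

    The 20 ppm default is a hypothetical tolerance, not a value from the paper.
    """
    return abs(mass_error_ppm(sequence, mz, charge)) <= tol_ppm
```

For example, a prediction that swaps a single E for K shifts the mass by roughly 0.95 Da, which on a peptide of about 800 Da is an error above 1,000 ppm and would fail this check.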

Accurate prediction of breast cancer recurrence after treatment is essential for improving long-term outcomes. However, existing models are limited by three key challenges: (1) they typically rely on single-modal data, missing cross-modal interactions; (2) they analyze static snapshots, failing to capture disease progression over time; and (3) they often perform coarse feature fusion, lacking semantic disentanglement and interpretability. To address these issues, we propose LUMIN (Longitudinal Multi-modal Knowledge Decomposition Network), a novel framework that integrates longitudinal mammograms and electronic health records (EHRs) for recurrence prediction. LUMIN leverages a vision-language contrastive pretraining backbone to align multi-modal representations and introduces two knowledge extraction modules: (1) a Cross-Modal Disentangled Knowledge Extractor (CM-DKE) that separates shared, complementary, and modality-specific information across imaging and text; and (2) a Temporal Evolution Disentangled Knowledge Extractor (TE-DKE) that captures time-invariant, time-varying, and time-specific features to model disease dynamics. Experiments on a large-scale dataset of 3,924 patients and 19,684 exams show that LUMIN significantly outperforms state-of-the-art baselines, demonstrating its effectiveness in capturing both multi-modal semantics and temporal heterogeneity for recurrence prediction.

LUMIN: A Longitudinal Multi-modal Knowledge Decomposition Network for Predicting Breast Cancer Recurrence

Generating high-quality, controllable, and structurally consistent 3D scenes is a fundamental yet challenging task, especially in complex multi-object environments. We present SceneGenesis, a unified framework for 3D scene synthesis that systematically integrates semantic structural priors with mesh-guided video-geometry fusion. The process begins with a semantic structural initialization module, which leverages large language models to convert textual scene prompts into category-aware object descriptions. These are transformed into structured meshes by combining procedural approximations for large-scale objects and pretrained mesh generators for fine-grained assets, enabling precise layout control and scene scalability. To synthesize rich and style-controllable appearances, we render depth and semantic maps from the initialized scene and condition a pretrained video diffusion model to generate multi-view video sequences with geometry-awareness, where a consistency-guided latent fusion strategy further enhances temporal consistency across long sequences. Crucially, we introduce a mesh-guided video-geometry fusion module that reconstructs coherent 3D Gaussian scenes by aligning mesh priors with video outputs. This module incorporates mesh-conditioned fragment initialization, progressive geometric refinement, and structure-aware optimization, significantly enhancing global geometric fidelity and visual realism.
Extensive experiments demonstrate that SceneGenesis enables flexible style variation and object-level editing while achieving superior controllability, scalability, and 3D structural quality, offering an effective solution for 3D scene synthesis.

SceneGenesis: 3D Scene Synthesis via Semantic Structural Priors and Mesh-Guided Video-Geometry Fusion

Recent Large Audio-Language Models (LALMs) exhibit impressive capabilities in understanding audio content for conversational QA tasks. However, these models struggle to accurately understand timestamps for temporal localization (e.g., Temporal Audio Grounding) and are restricted to short audio perception, leading to constrained capabilities on fine-grained tasks. We identify three key aspects that limit their temporal localization and long audio understanding: (i) timestamp representation, (ii) architecture, and (iii) data. To address this, we introduce TimeAudio, a novel method that empowers LALMs to connect their understanding of audio content with precise temporal perception. Specifically, we incorporate unique temporal markers to improve time-sensitive reasoning and apply an absolute time-aware encoding that explicitly grounds the acoustic features with absolute time information. Moreover, to realize end-to-end long audio understanding, we introduce a segment-level token merging module to substantially reduce audio token redundancy and enhance the efficiency of information extraction. Due to the lack of suitable datasets and evaluation metrics, we consolidate existing audio datasets into a new dataset focused on temporal tasks and establish a series of metrics to evaluate the fine-grained performance. Evaluations show strong performance across a variety of fine-grained tasks, such as dense captioning, temporal grounding, and timeline speech summarization, which demonstrates TimeAudio's robust temporal localization and reasoning capabilities.

Listening Between the Frames: Bridging Temporal Gaps in Large Audio-Language Models

Emotional and cognitive factors are essential for understanding mental health disorders. However, existing methods often treat multi-modal data as classification tasks, limiting interpretability especially for emotion and cognition. Although large language models (LLMs) offer opportunities for mental health analysis, they mainly rely on textual semantics and overlook fine-grained emotional and cognitive cues in multi-modal inputs. While some studies incorporate emotional features via transfer learning, their connection to mental health conditions remains implicit. To address these issues, we propose ECMC, a novel task that aims at generating natural language descriptions of emotional and cognitive states from multi-modal data, and producing emotion–cognition profiles that improve both the accuracy and interpretability of mental health assessments. We adopt an encoder–decoder architecture, where modality-specific encoders extract features, which are fused by a dual-stream BridgeNet based on Q-former. Contrastive learning enhances the extraction of emotional and cognitive features. A LLaMA decoder then aligns these features with annotated captions to produce detailed descriptions.
Extensive objective and subjective evaluations demonstrate that: 1) ECMC outperforms existing multi-modal LLMs and mental health models in generating emotion–cognition captions; 2) the generated emotion–cognition profiles significantly improve assistive diagnosis and interpretability in mental health analysis.

Voices, Faces, and Feelings: Multi-modal Emotion-Cognition Captioning for Mental Health Understanding

Recent work on human animation usually incorporates large-scale video models, thereby achieving more vivid performance. However, the practical use of such methods is hindered by the slow inference speed and high computational demands. Moreover, traditional work typically employs separate models for each animation task, increasing costs in multi-task scenarios and worsening the dilemma. To address these limitations, we introduce EchoMimicV3, an efficient framework that unifies multi-task and multi-modal human animation. At the core of EchoMimicV3 lies a threefold design: a Soup-of-Tasks paradigm, a Soup-of-Modals paradigm, and a novel training and inference strategy. The Soup-of-Tasks leverages multi-task mask inputs and a counter-intuitive task allocation strategy to achieve multi-task gains without multi-model pains. Meanwhile, the Soup-of-Modals introduces a Coupled-Decoupled Multi-Modal Cross Attention module to inject multi-modal conditions, complemented by a Timestep Phase-aware Multi-Modal Allocation mechanism to dynamically modulate multi-modal mixtures. Besides, we propose Negative Direct Preference Optimization and Phase-aware Negative Classifier-Free Guidance, which ensure stable training and inference. Extensive experiments and analyses demonstrate that EchoMimicV3, with a minimal model size of 1.3 billion parameters, achieves competitive performance in both quantitative and qualitative evaluations. We are committed to open-sourcing our code for community use.
