Singapore

Recently, the Segment Anything Model (SAM) has attracted widespread attention and is often treated as a vision foundation model for universal segmentation. Some researchers have attempted to apply this foundation model directly to the RGB-D video salient object detection (RGB-D VSOD) task, which raises three challenges: the dependence on manual prompts, the high memory consumption of sequential adapters, and the computational burden of memory attention. To address these limitations, we propose a novel method, Segment Anything Model with Depth-guided Adaptive Queries (SAM-DAQ), which adapts SAM2 to pop out salient objects from videos by seamlessly integrating depth and temporal cues within a unified framework. First, we deploy a parallel adapter-based multi-modal image encoder (PAMIE), which incorporates several depth-guided parallel adapters (DPAs) in a skip-connection manner. Notably, we fine-tune the frozen SAM encoder under prompt-free conditions, where the DPAs use depth cues to facilitate the fusion of multi-modal features. Second, we deploy a query-driven temporal memory (QTM) module, which unifies the memory bank and prompt embeddings into a learnable pipeline. Concretely, by leveraging frame-level and video-level queries simultaneously, the QTM module can both selectively extract temporally consistent features and iteratively update the temporal representations of the queries. Extensive experiments on three RGB-D VSOD datasets show that the proposed SAM-DAQ consistently outperforms state-of-the-art methods on all evaluation metrics.
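The idea of a parallel adapter attached in a skip-connection manner can be illustrated with a toy sketch. This is not the authors' implementation: `frozen_block`, `depth_adapter`, `adapted_block`, the additive RGB/depth fusion, and the bottleneck shapes are all hypothetical stand-ins chosen for illustration. The only point shown is that the adapter branch runs beside a frozen block and its output is added to the block's output.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def frozen_block(x_rgb):
    """Stand-in for one frozen encoder block (hypothetical fixed transform)."""
    return [2.0 * v for v in x_rgb]


def depth_adapter(x_rgb, x_depth, w_down, w_up):
    """Hypothetical depth-guided adapter: fuse, down-project, ReLU, up-project."""
    fused = [r + d for r, d in zip(x_rgb, x_depth)]  # assumed fusion rule
    hidden = [max(0.0, h) for h in matvec(w_down, fused)]
    return matvec(w_up, hidden)


def adapted_block(x_rgb, x_depth, w_down, w_up):
    """Frozen path plus the parallel adapter path (skip-connection style)."""
    base = frozen_block(x_rgb)
    delta = depth_adapter(x_rgb, x_depth, w_down, w_up)
    return [b + d for b, d in zip(base, delta)]
```

In this sketch only `w_down` and `w_up` would be trainable. With a zero up-projection the adapted block reproduces the frozen block exactly, so the adapter starts as a no-op.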

AAAI 2026

SAM-DAQ: Segment Anything Model with Depth-guided Adaptive Queries for RGB-D Video Salient Object Detection

cv: large vision models

cv: multi-modal vision

cv: segmentation

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.

Action recognition in unmanned aerial vehicles (UAVs) poses unique challenges due to significant view variations along the vertical spatial axis. Unlike traditional ground-based settings, UAVs capture actions at a wide range of altitudes, resulting in considerable appearance discrepancies. We introduce a multi-view formulation tailored to varying UAV altitudes and empirically observe a partial order among views, where recognition accuracy consistently decreases as altitude increases. This observation motivates a novel approach that explicitly models the hierarchical structure of UAV views to improve recognition performance across altitudes. To this end, we propose the Partial Order Guided Multi-View Network (POG-MVNet), designed to address drastic view variations by effectively leveraging view-dependent information across different altitude levels. The framework comprises three key components: a View Partition (VP) module, which uses the head-to-body ratio to group views by altitude; an Order-aware Feature Decoupling (OFD) module, which disentangles action-relevant and view-specific features under partial order guidance; and an Action Partial Order Guide (APOG), which uses the partial order to transfer informative knowledge from easier views to more challenging ones. We conduct experiments on Drone-Action, MOD20, and UAV, demonstrating that POG-MVNet significantly outperforms competing methods. For example, POG-MVNet achieves a 4.7% improvement on Drone-Action and a 3.5% improvement on UAV compared to state-of-the-art methods ASAT and FAR. Code will be released soon.

Beyond the Horizon: Decoupling Multi-View UAV Action Recognition via Partial Order Transfer

Open-vocabulary object detection (OVOD) holds promise for remote sensing, yet the natural-to-aerial image domain gap hinders generalization. Dominant backgrounds, sparse labels with limited semantics, and semi-supervised training difficulties pose significant challenges. We introduce SOAR (Semi-supervised Open-vocabulary Aerial Object Recognition via Dual-aware Enhanced Prior Denoising), which generates pseudo-labels for semi-supervised training by learning implicit foreground priors and performing efficient denoising. Specifically, we dynamically extract background features and implicitly model foreground priors, treating them as noisy ground truth. These are then denoised through a refiner to obtain pseudo-labels. We further introduce a dual-aware query enhancement (DAQE) module that integrates language and foreground prior information to enhance the effectiveness of query selection and feature augmentation. Additionally, we address the sparsity of label information through expansion and aggregation techniques, further improving model performance. Finally, experimental evaluations reveal that, in the open-vocabulary object detection task on the DIOR dataset, our method achieves a mean Average Precision (mAP) of 68.5% and Harmonic Mean (HM) of 55.9%, outperforming the previous state-of-the-art model’s mAP of 61.6% and HM of 53.6%. Our approach offers a new solution to the open-vocabulary challenge in aerial object detection. The source code will be available.

SOAR: Semi-Supervised Open-Vocabulary Aerial Object Detection via Dual-Aware Enhanced Prior Denoising

Zero-Shot Composed Image Retrieval (ZS-CIR) involves diverse tasks with a broad range of visual content manipulation intent across domains, scenes, objects, and attributes. A key challenge in ZS-CIR is training models on datasets with limited intention-relevant data to accurately interpret human intent, as implicitly expressed through textual modifications, to retrieve the desired images. In this paper, we introduce an intention-based image-text dataset generated through reasoning by a Multimodal Large Language Model (MLLM) to enhance ZS-CIR model training for interpreting human manipulation intents. Leveraging this dataset, we propose De-MINDS, a novel framework that distills the MLLM’s reasoning capabilities to capture human intentions, thereby improving ZS-CIR models’ comprehension of manipulation text. Specifically, a simple mapping network translates image information into language space, forming a query with the manipulation text. De-MINDS then extracts intention-relevant information from the query, converting it into pseudo-word tokens for accurate ZS-CIR. De-MINDS demonstrates robust generalization and significant performance improvements across four ZS-CIR tasks, outperforming existing methods by 2.15% to 4.05% and establishing new state-of-the-art results with comparable inference times. Our code is available at https://anonymous.4open.science/r/De-MINDS/.

Manipulation Intention Understanding for Zero-Shot Composed Image Retrieval

With the booming development of multimodal data (e.g., image, text) on internet platforms, multimodal sequential recommendation methods continue to emerge. Most existing methods incorporate item modal features as auxiliary information, typically concatenating them to learn unified user representations. However, these methods directly use modal features for representation learning, neglecting the impact of inherent modality noise. We argue that internal-modality noise and cross-modality noise hinder the acquisition of more accurate user representations.
To address this problem, we propose SGP4SR: Separated-modality Guided user Preference learning for multimodal Sequential Recommendation. Globally, user preference modeling is carried out from a separated-modality perspective to alleviate cross-modality noise. Locally, for each individual modality, we use item relationship graphs and user interest centers, aggregated with ID embeddings, to replace direct modal features, thereby mitigating internal-modality noise. Finally, user representations from both separated-modality and multimodal perspectives participate in prediction independently.
In experiments conducted on four real-world datasets, our method outperforms state-of-the-art approaches, achieving an average performance improvement of up to 8.84% over the best baseline. The comprehensive experiments further validate the superior noise tolerance and robustness of our method. The source code will be available in the supplementary materials.

SGP4SR: Separated-Modality Guided User Preference Learning for Multimodal Sequential Recommendation

Head avatar generation makes it possible to construct high-fidelity 3D virtual personas from a single portrait, but it also raises the risk of unauthorized generation of personal avatars. Recent 2D portrait protection methods actively prevent malicious image generation by perturbing identity features. However, they have two key limitations when applied directly to prevent 3D head avatar generation: 1) they neglect the inherent 3D geometric structure of the portrait, and thus fail to disrupt the modeling of 3D shapes or poses; 2) they focus only on identity offset and cannot interfere with the overall appearance, resulting in excessive preservation of facial characteristics. To overcome these limitations, we propose a 3D defense framework termed Anti-Avatar, tailored to protect against unauthorized 3D head avatar generation from a single portrait. Specifically, Anti-Avatar consists of two key designs: Geometric Disruption and Perceptual Confusion.
The former disrupts the precise reconstruction of 3D structure by interfering with the estimation of geometric parameters, thus affecting the structural accuracy of the 3D avatar.
Collaboratively, the latter confuses image features by dispersing attention distribution, thereby hindering the effective perception of portrait appearance.
Benefiting from the above dual-space divergence in geometry and perception, the avatars generated by our protected portraits exhibit substantial discrepancies from the originals.
Extensive experiments show that our Anti-Avatar outperforms 2D methods in protection performance and effectively resists reconstruction and manipulation by state-of-the-art 3D head avatar generation methods.

Anti-Avatar: Protect Against Unauthorized 3D Head Avatar Generation via Dual-Space Divergence

Recent GS-based rendering has made significant progress for LiDAR, surpassing Neural Radiance Fields (NeRF) in both quality and speed. However, these methods exhibit artifacts in extrapolated novel view synthesis due to the incomplete reconstruction from single traversal scans. To address this limitation, we present LiDAR-GS++, a LiDAR Gaussian Splatting reconstruction method enhanced by diffusion priors for real-time and high-fidelity re-simulation on public urban roads. Specifically, we introduce a controllable LiDAR generation model conditioned on coarsely extrapolated rendering to produce extra geometry-consistent scans and employ an effective distillation mechanism for expansive LiDAR Gaussian reconstruction.
By extending reconstruction to under-fitted regions, our approach ensures global geometric consistency for extrapolative novel views while preserving detailed scene surfaces captured by sensors. Experiments on multiple public datasets demonstrate that LiDAR-GS++ achieves state-of-the-art performance for both interpolated and extrapolated viewpoints, surpassing existing GS and NeRF-based methods.

LiDAR-GS++: Improving LiDAR Gaussian Reconstruction via Diffusion Priors

Fully fine-tuning large pre-trained models for each downstream task is impractical due to prohibitive memory, computation, and storage costs. Although parameter-efficient fine-tuning (PEFT) methods address this issue, leading methods like LoRA still exhibit linear scaling of trainable parameters with hidden size. Recent studies have explored PEFT in the frequency domain to reduce computational costs by employing fast Fourier transform and discrete cosine transform with sparse frequency selection. These methods rely on global frequency representations that lack spatial locality and disperse energy across the domain. As a result, sparse coefficient selection struggles to preserve fine-grained structural information and often introduces artifacts such as ringing near boundaries. To address these limitations, we propose DWTSG, a novel PEFT framework based on discrete wavelet transform (DWT) and subband guidance. DWTSG decomposes pre-trained weights into four wavelet subbands that jointly encode global context and local details. It fine-tunes only the most informative coefficients in each subband through an energy-based selection strategy that prioritizes coefficients based on their individual importance and interactions. Finally, inverse DWT reconstructs the updated weights, enabling efficient and precise adaptation. Extensive experiments on natural language understanding, commonsense reasoning, and image classification demonstrate that DWTSG outperforms existing PEFT methods, achieving superior performance and higher parameter efficiency.
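The decompose, select, and reconstruct steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the DWTSG implementation: it assumes a single-level Haar wavelet (the abstract does not name the wavelet), ranks coefficients by squared magnitude only (the interaction term of the paper's selection strategy is not modeled), and the names `haar_dwt2`, `haar_idwt2`, `top_energy_positions`, and `apply_sparse_update` are hypothetical.

```python
def haar_dwt2(w):
    """Single-level 2D Haar transform of a matrix with even dimensions."""
    h, n = len(w) // 2, len(w[0]) // 2
    ll, lh, hl, hh = ([[0.0] * n for _ in range(h)] for _ in range(4))
    for i in range(h):
        for j in range(n):
            a, b = w[2 * i][2 * j], w[2 * i][2 * j + 1]
            c, d = w[2 * i + 1][2 * j], w[2 * i + 1][2 * j + 1]
            ll[i][j] = (a + b + c + d) / 2  # block average (coarse content)
            lh[i][j] = (a - b + c - d) / 2  # difference between columns
            hl[i][j] = (a + b - c - d) / 2  # difference between rows
            hh[i][j] = (a - b - c + d) / 2  # diagonal difference
    return {"LL": ll, "LH": lh, "HL": hl, "HH": hh}


def haar_idwt2(bands):
    """Inverse of haar_dwt2: rebuild the full matrix from the four subbands."""
    ll, lh, hl, hh = (bands[name] for name in ("LL", "LH", "HL", "HH"))
    h, n = len(ll), len(ll[0])
    w = [[0.0] * (2 * n) for _ in range(2 * h)]
    for i in range(h):
        for j in range(n):
            s, x, y, z = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            w[2 * i][2 * j] = (s + x + y + z) / 2
            w[2 * i][2 * j + 1] = (s - x + y - z) / 2
            w[2 * i + 1][2 * j] = (s + x - y - z) / 2
            w[2 * i + 1][2 * j + 1] = (s - x - y + z) / 2
    return w


def top_energy_positions(band, k):
    """Positions of the k coefficients with the largest squared magnitude."""
    cells = [(band[i][j] ** 2, i, j)
             for i in range(len(band)) for j in range(len(band[0]))]
    cells.sort(key=lambda cell: -cell[0])
    return [(i, j) for _, i, j in cells[:k]]


def apply_sparse_update(w, deltas, k):
    """Add k trainable offsets per subband, then reconstruct the weights."""
    bands = haar_dwt2(w)
    for name, band in bands.items():
        positions = top_energy_positions(band, k)
        for (i, j), delta in zip(positions, deltas[name]):
            band[i][j] += delta
    return haar_idwt2(bands)
```

In this sketch the trainable state is just `deltas`, that is `4 * k` numbers per weight matrix regardless of the matrix size. Because each Haar coefficient touches only one 2x2 block, an update to a single coefficient stays local in the reconstructed weights.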

DWTSG: Parameter-Efficient Fine-Tuning of Large Pre-trained Models via Discrete Wavelet Transform and Subband Guidance

Graph-based Retrieval-Augmented Generation (GraphRAG) mitigates hallucinations in Large Language Models (LLMs) by grounding them in structured knowledge. However, current GraphRAG methods are constrained by a prevailing build-then-reason paradigm, which relies on a static, pre-constructed Knowledge Graph (KG). This paradigm faces two critical challenges. First, the KG's inherent incompleteness often breaks reasoning paths. Second, the graph’s low signal-to-noise ratio introduces distractor facts, presenting query-relevant but misleading knowledge that derails the reasoning process.
To address these challenges, we argue for a reason-and-construct paradigm and propose Relink, a framework that dynamically builds a query-specific evidence graph. To tackle incompleteness, Relink instantiates required facts from a latent relation pool derived from the original text corpus, repairing broken paths on the fly. To handle misleading or distractor facts, Relink employs a unified, query-aware evaluation strategy that jointly considers candidates from both the KG and latent relations, selecting those most useful for answering the query rather than relying on their pre-existence. This empowers Relink to actively discard distractor facts and construct the most faithful and precise evidence path for each query.
Extensive experiments on five ODQA benchmarks show that Relink achieves significant average improvements of 5.4% in EM and 5.2% in F1 over leading GraphRAG baselines, demonstrating the superiority of our proposed framework.
The code is available at https://github.com/DMiC-Lab-HFUT/Relink.
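The unified, query-aware evaluation described above can be illustrated with a toy ranking sketch. This is not Relink's code: the function names `score` and `select_evidence` are hypothetical, and plain token overlap stands in for whatever learned scoring the framework actually uses. The sketch only shows candidates from the KG and from a latent relation pool being ranked in one list by usefulness to the query, regardless of where they came from.

```python
def score(fact, query_terms):
    """Toy usefulness score: number of tokens shared with the query."""
    head, relation, tail = fact
    tokens = set(" ".join((head, relation, tail)).lower().split())
    return len(tokens & query_terms)


def select_evidence(query, kg_facts, latent_facts, top_n=2):
    """Rank KG and latent candidates jointly and keep the top_n facts."""
    query_terms = set(query.lower().split())
    candidates = [(fact, "kg") for fact in kg_facts]
    candidates += [(fact, "latent") for fact in latent_facts]
    ranked = sorted(candidates,
                    key=lambda cand: score(cand[0], query_terms),
                    reverse=True)
    return ranked[:top_n]


kg = [("marie curie", "won", "nobel prize"),
      ("pierre curie", "born in", "paris")]
latent = [("marie curie", "born in", "warsaw")]
best = select_evidence("where was marie curie born", kg, latent, top_n=1)
```

Here the fact missing from the toy KG is supplied by the latent pool and outranks the two pre-existing KG facts, one of which (`pierre curie born in paris`) is the kind of query-relevant distractor the abstract describes.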

Relink: Constructing Query-Driven Evidence Graph On-the-Fly for GraphRAG

Vision-based 3D Semantic Scene Completion (SSC) has received growing attention due to its potential in autonomous driving. While most existing approaches follow an ego-centric paradigm by aggregating and diffusing features over the entire scene, they often overlook fine-grained object-level details, leading to semantic and geometric ambiguities, especially in complex environments. To address this limitation, we propose Ocean, an object-centric prediction framework that decomposes the scene into individual object instances to enable more accurate semantic occupancy prediction. 
Specifically, we first employ a lightweight segmentation model, MobileSAM, to extract instance masks from the input image. Then, we introduce a 3D Semantic Group Attention module that leverages linear attention to aggregate object-centric features in 3D space. To handle segmentation errors and missing instances, we further design a Global Similarity-Guided Attention module that leverages segmentation features for global interaction. Finally, we propose an Instance-aware Local Diffusion module that improves instance features through a generative process and subsequently refines the scene representation in the BEV space.
Extensive experiments on the SemanticKITTI and SSCBench-KITTI360 benchmarks demonstrate that Ocean achieves state-of-the-art performance, with mIoU scores of 17.40 and 20.28, respectively.

Towards 3D Object-Centric Feature Learning for Semantic Scene Completion

Low-Rank Adaptation (LoRA) enables efficient fine-tuning of large language models but suffers from catastrophic forgetting when learned updates interfere with the dominant singular directions that encode essential pre-trained knowledge. We propose Orthogonal Projection LoRA (OPLoRA), a theoretically grounded approach that prevents this interference through double-sided orthogonal projections. By decomposing frozen weights via SVD, OPLoRA constrains LoRA updates to lie entirely within the orthogonal complement of the top-$k$ singular subspace using projections $P_L = I - U_k U_k^\top$ and $P_R = I - V_k V_k^\top$. We prove that this construction exactly preserves the top-$k$ singular triples, providing mathematical guarantees for knowledge retention. To quantify subspace interference, we introduce $\rho_k$, a metric measuring update alignment with dominant directions. Extensive experiments across commonsense reasoning, mathematics, and code generation demonstrate that OPLoRA significantly reduces forgetting while maintaining competitive task-specific performance on LLaMA-2 7B and Qwen2.5 7B, establishing orthogonal projection as an effective mechanism for knowledge preservation in parameter-efficient fine-tuning.
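A short derivation sketch shows why double-sided projection leaves the leading singular triples untouched. It assumes the adapted weight takes the form $W' = W + P_L \, \Delta W \, P_R$, where $\Delta W$ is the low-rank LoRA update; the abstract does not spell out this form, so treat it as an assumption made for illustration.

```latex
\begin{aligned}
W &= U \Sigma V^\top, \qquad
P_L = I - U_k U_k^\top, \qquad
P_R = I - V_k V_k^\top, \\[4pt]
\text{for } i \le k:\quad
P_R v_i &= v_i - V_k V_k^\top v_i = v_i - v_i = 0, \qquad
u_i^\top P_L = u_i^\top - u_i^\top U_k U_k^\top = 0, \\[4pt]
W' v_i &= W v_i + P_L \, \Delta W \, (P_R v_i) = \sigma_i u_i, \\[4pt]
u_i^\top W' &= u_i^\top W + (u_i^\top P_L) \, \Delta W \, P_R = \sigma_i v_i^\top .
\end{aligned}
```

Under this assumption, each $(\sigma_i, u_i, v_i)$ with $i \le k$ is still a singular triple of $W'$ for any choice of $\Delta W$, because the columns of $U_k$ and $V_k$ are orthonormal and the projected update acts only on their orthogonal complements.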
