Zero-Shot Composed Image Retrieval (ZS-CIR) spans diverse tasks with a broad range of visual-content manipulation intents across domains, scenes, objects, and attributes. A key challenge in ZS-CIR is that training datasets contain little intention-relevant data, making it difficult for models to accurately interpret the human intent implicitly expressed in textual modifications and retrieve the desired images. In this paper, we introduce an intention-based image-text dataset generated through reasoning by a Multimodal Large Language Model (MLLM) to enhance ZS-CIR model training for interpreting human manipulation intents. Leveraging this dataset, we propose De-MINDS, a novel framework that distills the MLLM's reasoning capabilities to capture human intentions, thereby improving ZS-CIR models' comprehension of manipulation text. Specifically, a simple mapping network translates image information into the language space, forming a query together with the manipulation text. De-MINDS then extracts intention-relevant information from the query and converts it into pseudo-word tokens for accurate ZS-CIR. De-MINDS demonstrates robust generalization and significant performance improvements across four ZS-CIR tasks, outperforming existing methods by 2.15% to 4.05% and establishing new state-of-the-art results with comparable inference times. Our code is available at https://anonymous.4open.science/r/De-MINDS/.
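The query-construction step described above (a mapping network projecting image features into the language space, then combining them with the manipulation text) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the two-layer MLP, the embedding dimensions, the single pseudo-word token, and all function names are assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: image-feature size, text-token size,
# and number of pseudo-word tokens produced per image.
D_IMG, D_TOK, N_PSEUDO = 512, 768, 1

# Randomly initialized two-layer MLP weights standing in for a
# trained mapping network.
W1 = rng.standard_normal((D_IMG, 1024)) * 0.02
b1 = np.zeros(1024)
W2 = rng.standard_normal((1024, N_PSEUDO * D_TOK)) * 0.02
b2 = np.zeros(N_PSEUDO * D_TOK)

def map_image_to_pseudo_tokens(image_emb):
    """Project a frozen image embedding into pseudo-word token(s)
    living in the text encoder's embedding space."""
    h = np.maximum(image_emb @ W1 + b1, 0.0)        # ReLU hidden layer
    return (h @ W2 + b2).reshape(N_PSEUDO, D_TOK)

def build_query(pseudo_tokens, text_token_embs):
    """Prepend the pseudo-word token(s) to the manipulation-text
    token embeddings, realizing a query akin to
    'a photo of [S*] that <manipulation text>'."""
    return np.concatenate([pseudo_tokens, text_token_embs], axis=0)

image_emb = rng.standard_normal(D_IMG)              # stand-in image feature
text_embs = rng.standard_normal((7, D_TOK))         # stand-in text-token embeddings

query = build_query(map_image_to_pseudo_tokens(image_emb), text_embs)
print(query.shape)  # (8, 768): 1 pseudo-word token + 7 text tokens
```

In the full framework, the resulting query sequence would be fed to a text encoder, and intention-relevant information would be distilled from it under supervision from the MLLM-generated dataset; only the query-assembly mechanics are shown here.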