Singapore

Multi-modal salient object detection (MSOD), which integrates complementary modalities such as depth or thermal data, primarily faces two challenges: accurately preserving salient object details and effectively aligning cross-modal features. Recent advances in using Stable Diffusion to generate images with fine edge details have inspired researchers to reformulate MSOD as a conditional mask generation process guided by salient features, which has achieved excellent visual results. However, these approaches often overlook the high computational cost and large-scale architecture of Stable Diffusion, both of which render it unsuitable for real-world MSOD applications.
Therefore, we propose SimpleDiffusion, the first lightweight and efficient conditional diffusion model for MSOD that does not rely on Stable Diffusion. Specifically, we propose an Adaptive Cross-Modal Fusion Conditional Network and a Latent Denoising Network to reduce the complexity of diffusion models. Furthermore, we design a Multi-modal Feature Rectification and Fusion Module to enhance the representational capacity of cross-modal salient features. Customized training and sampling strategies are also developed to improve inference efficiency and reduce erroneous object segmentations. Experiments on multiple MSOD datasets demonstrate that SimpleDiffusion reduces model size by over tenfold and improves inference speed by more than fivefold compared to other diffusion-based methods, while maintaining comparable or superior performance. Code and models are available at: https://anonymous.4open.science/r/simple-diffusion.
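For readers new to the diffusion framing, the sketch below shows the standard DDPM-style forward noising of a saliency mask, the process a conditional denoiser would learn to invert. It is not SimpleDiffusion's implementation: the linear schedule, the step count, and the names `linear_alpha_bars` and `noise_mask` are assumptions chosen for illustration, and the abstract specifies none of them.

```python
import math
import random

def linear_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def noise_mask(mask, t, alpha_bars, rng):
    """Sample x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps for a flat mask."""
    a = alpha_bars[t]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for x in mask]

alpha_bars = linear_alpha_bars()
rng = random.Random(0)                  # fixed seed so the sketch is repeatable
mask = [1.0, 1.0, -1.0, -1.0]           # toy mask scaled to [-1, 1]
slightly_noisy = noise_mask(mask, 0, alpha_bars, rng)       # almost the clean mask
nearly_pure_noise = noise_mask(mask, 999, alpha_bars, rng)  # signal mostly destroyed
```

At the first step the mask is barely perturbed, while at the last step almost none of the original signal remains; a denoising network conditioned on fused multi-modal features would be trained to run this process in reverse.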

AAAI 2026

SimpleDiffusion: A Lightweight and Efficient Conditional Diffusion Model for Multi-Modal Salient Object Detection

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.


Point cloud processing has become a cornerstone technology in many 3D vision tasks. However, arbitrary rotations introduce variations in point cloud poses, posing a long-standing challenge for effective representation learning. The core of this issue is the disruption of the point cloud's intrinsic directional characteristics caused by rotational perturbations. Recent methods attempt to implicitly model rotational equivariance and invariance, preserving directional information and propagating it into deep semantic spaces. Yet, they often fall short of fully exploiting the multiscale directional nature of point clouds to enhance feature representations. To address this, we propose the Direction-Perceptive Vector Network (DiPVNet). At its core is an atomic dot-product operator that simultaneously encodes directional selectivity and rotation invariance—endowing the network with both rotational symmetry modeling and adaptive directional perception. At the local level, we introduce a Learnable Local Dot Product (L2DP) Operator, which enables interactions between a center point and its neighbors to adaptively capture the non-uniform local structures of point clouds. At the global level, we leverage generalized harmonic analysis to prove that the dot product between point clouds and spherical sampling vectors is equivalent to a direction-aware spherical Fourier transform (DASFT). This leads to the construction of a global directional response spectrum for modeling holistic directional structures. We rigorously prove the rotation invariance of both operators. Extensive experiments on challenging scenarios involving noise and large-angle rotations demonstrate that DiPVNet achieves state-of-the-art performance on point cloud classification and segmentation tasks. The code will be released publicly.

Hierarchical Direction Perception via Atomic Dot-Product Operators for Rotation-Invariant Point Clouds Learning
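The rotation invariance that the DiPVNet abstract attributes to dot-product operators can be checked numerically on a toy neighborhood. The sketch below is not the L2DP operator itself: it only computes plain dot products between a center point and its center-to-neighbor offsets, and the helper names (`rotate_z`, `local_dot_products`) are hypothetical. A rotation about the z-axis stands in for an arbitrary rotation.

```python
import math

def rotate_z(p, theta):
    """Rotate a 3D point about the z-axis by theta radians."""
    x, y, z = p
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y, z)

def dot(a, b):
    """Euclidean dot product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

def local_dot_products(center, neighbors):
    """Dot products between the center and each center-to-neighbor offset."""
    return [dot(center, tuple(n - c for n, c in zip(nb, center))) for nb in neighbors]

center = (1.0, 2.0, 0.5)
neighbors = [(1.5, 2.0, 0.0), (0.0, 1.0, 1.0), (2.0, 2.5, 0.5)]
theta = 1.234

before = local_dot_products(center, neighbors)
after = local_dot_products(rotate_z(center, theta),
                           [rotate_z(nb, theta) for nb in neighbors])
# Rotations preserve inner products, so the two lists agree up to rounding.
max_diff = max(abs(a - b) for a, b in zip(before, after))
```

Because a rotation is linear and orthogonal, rotating every point leaves each dot product unchanged, which is the property the learnable operators are said to build on.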

Representation learning is fundamental to modern machine learning, powering applications such as text retrieval and multimodal understanding. However, learning robust and generalizable representations remains challenging. While prior work has demonstrated that active noise injection, a form of data augmentation, can enhance encoding performance, most existing methods rely on heuristic or static noise, overlooking the dynamic nature of feature distributions during training. In this work, we systematically study the role of noise in representation learning from both gradient-based and feature distribution perspectives, using InfoNCE loss as a representative example. Focusing on multimodal representation learning, we propose FANoise, a novel feature-adaptive noise injection strategy. By leveraging the dynamics of contrastive learning, FANoise effectively mitigates the negative impacts of noise while preserving its benefits. Under this theoretically grounded framework, comprehensive experiments demonstrate that FANoise consistently improves overall performance on multimodal tasks across various base VLM models.

FANoise: Singular Value-Adaptive Noise Modulation for Robust Multimodal Representation Learning
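As background for the FANoise abstract, the following is a minimal sketch of the InfoNCE loss together with the kind of static Gaussian noise injection that the abstract contrasts against. It does not reproduce FANoise's feature-adaptive modulation, which the abstract does not detail; the function names, the temperature of 0.07, and the fixed `sigma` are assumptions.

```python
import math
import random

def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def info_nce(queries, keys, temperature=0.07):
    """Mean InfoNCE loss where keys[i] is the positive for queries[i]."""
    q = [normalize(v) for v in queries]
    k = [normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(q)

def add_static_noise(features, sigma, rng):
    """Add i.i.d. Gaussian noise with a fixed (non-adaptive) standard deviation."""
    return [[x + rng.gauss(0.0, sigma) for x in v] for v in features]

rng = random.Random(0)
image_feats = [[1.0, 0.0], [0.0, 1.0]]
text_feats = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(image_feats, text_feats)
shuffled_loss = info_nce(image_feats, text_feats[::-1])
noisy_loss = info_nce(add_static_noise(image_feats, 0.1, rng), text_feats)
```

Correctly paired features give a loss near zero, while mismatched pairs are penalized heavily; an adaptive scheme would choose the noise scale from the current feature distribution rather than fixing `sigma` in advance.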

Current large language models (LLMs) exhibit significant deficiencies in episodic memory tasks, including encoding, storing, and retrieving specific information from temporally dependent events over long periods of time. Recent approaches to handling memory tasks in LLMs, such as in-context learning, retrieval-augmented generation (RAG), and fine-tuning, may resolve long-term retention issues but remain inadequate for tasks requiring chronological awareness of the stored information. We introduce Agentic Retrieval with Temporal-Episodic Memory (ARTEM), a hybrid LLM-based agent architecture integrating LLMs with a self-organizing neural network named Spatial-Temporal Episodic Memory (STEM), designed to handle episodic memory tasks. Beyond generating outputs or direct responses, our approach employs LLMs to extract events from the inputs, representing temporal, spatial, entitative, and semantic information that may facilitate future retrieval. The extracted events are then encoded as vectors and stored in a fast and stable manner in the episodic memory through instance-based incremental learning in STEM. STEM supports precise episode retrieval and helps reduce the computational overhead of generating appropriate responses with LLMs. Evaluation on standardized episodic memory benchmarks across four tasks (partial cue retrieval, epistemic uncertainty detection, recent event identification, and chronological recall) demonstrates superior performance of ARTEM compared to in-context learning, RAG, and fine-tuning in various popular LLMs. Code and appendices are available at: https://github.com/cassthm/ARTEM

ARTEM: Enhancing Large Language Model Agents with Spatial-Temporal Episodic Memory

Developing medical AI relies on large datasets and is easily hampered by data scarcity. Generative data augmentation (GDA) using AI generative models offers a solution by synthesizing realistic medical images. However, the bias in GDA is often underestimated in medical domains, raising concerns that AI-generated images may introduce detrimental features and harm downstream tasks. This paper identifies the frequency misalignment between real and synthesized images as one of the key factors underlying unreliable GDA and proposes the Frequency Recalibration (FreRec) method to reduce the frequency distributional discrepancy and thus improve GDA. FreRec involves (1) Statistical High-frequency Replacement (SHR) to roughly align high-frequency components and (2) Reconstructive High-frequency Mapping (RHM) to enhance image quality and reconstruct high-frequency details. Extensive experiments were conducted on various medical datasets, including brain MRIs, chest X-rays, and fundus images. The results show that FreRec significantly improves downstream medical image classification performance compared to uncalibrated AI-synthesized samples. FreRec is a standalone post-processing step that is compatible with any generative model and can integrate seamlessly with common medical GDA pipelines.

Rethinking Bias in Generative Data Augmentation for Medical AI: A Frequency Recalibration Method
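The idea of aligning high-frequency components between real and synthesized data, as described for FreRec, can be illustrated on a 1D signal. This is a deliberately simplified sketch and not the paper's SHR step: it swaps coefficients from a single real signal rather than using dataset statistics, it works in 1D rather than on images, and the names `dft`, `idft`, `replace_high_freq`, and the `cutoff` parameter are all assumptions.

```python
import cmath
import math

def dft(x):
    """Naive O(n^2) discrete Fourier transform of a real sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Inverse DFT, returning the real part of each sample."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def replace_high_freq(synth, real, cutoff):
    """Keep frequencies <= cutoff from synth and take higher ones from real."""
    n = len(synth)
    S, R = dft(synth), dft(real)
    mixed = []
    for k in range(n):
        freq = min(k, n - k)  # bins k and n - k carry the same frequency
        mixed.append(S[k] if freq <= cutoff else R[k])
    return idft(mixed)

n = 16
real = [math.sin(2 * math.pi * t / n) + 0.5 * math.sin(2 * math.pi * 6 * t / n)
        for t in range(n)]
# The synthetic signal shares the low-frequency content but has the wrong high band.
synth = [math.sin(2 * math.pi * t / n) + 0.3 * math.cos(2 * math.pi * 7 * t / n)
         for t in range(n)]
recalibrated = replace_high_freq(synth, real, cutoff=2)
```

In this toy case the recalibrated signal matches the real one, because the two share their low band and the mismatched high band has been replaced. Treating bins `k` and `n - k` together keeps the output real-valued.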

Temporal graphs are essential for modeling complex real-world systems, such as social interactions, financial transactions, and recommendation systems, but high computational cost and model complexity pose practical challenges for deploying dynamic graph neural networks (DGNNs). Although various pruning and sampling techniques have proven effective in accelerating static GNNs, these approaches fall short in dynamic settings due to temporal dependencies in evolving graph structures. To address these challenges, we propose TrimDG, a general framework that accelerates DGNNs by eliminating both static and runtime redundancy. For static redundancy, we design a novel node influence metric, Temporal Personalized PageRank (TPP), to prune less informative nodes, and apply temporal binning to remove redundant events. For runtime redundancy during training, we introduce an adaptive sampling strategy guided by graph bottlenecks and reduce sampling frequency with a temporal batch selector and a sampling cache. Theoretical analysis supports our design, and experiments on real-world datasets show that TrimDG reduces runtime by an average of 83.80% across diverse DGNN backbones while maintaining strong predictive performance, demonstrating both its efficiency and generalizability.

Trimming the Fat: Redundancy-Aware Acceleration Framework for DGNNs
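One plausible reading of the temporal binning mentioned in the TrimDG abstract is to keep a single interaction per node pair within each time window. The sketch below follows that reading only; the abstract does not define the rule, so the `(source, target, timestamp)` event format, the keep-first policy, and the function name `temporal_binning` are assumptions.

```python
def temporal_binning(events, bin_width):
    """Keep the earliest event for each (source, target) pair in each time bin."""
    seen = set()
    kept = []
    for u, v, t in sorted(events, key=lambda e: e[2]):
        key = (u, v, int(t // bin_width))  # same pair and same bin is redundant
        if key not in seen:
            seen.add(key)
            kept.append((u, v, t))
    return kept

events = [(0, 1, 0.5), (0, 1, 0.9), (0, 1, 1.2), (1, 2, 0.7)]
trimmed = temporal_binning(events, bin_width=1.0)
# trimmed == [(0, 1, 0.5), (1, 2, 0.7), (0, 1, 1.2)]
```

The repeated `(0, 1)` interaction at `t = 0.9` falls into the same bin as the one at `t = 0.5` and is dropped, while the one at `t = 1.2` lands in the next bin and survives. A wider bin trims more events at the cost of temporal resolution.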

Continual instruction tuning (CIT) has emerged as a promising strategy for adapting large language models (LLMs) to new tasks while preserving historical knowledge. Most existing CIT methods have focused on offline CIT (offCIT), which assumes clearly defined task boundaries and allows multiple passes over the data. However, such assumptions rarely hold in real-world scenarios, where data arrive in a streaming fashion and task boundaries are unknown. This setting introduces critical challenges: the absence of task identifiers (task IDs), a significant imbalance in task-specific information, and the inaccessibility of previously seen data. In this work, we propose Online Editing with Decoupled Implicit Task (OnEDIT), an online CIT (onCIT) approach to tackle these challenges. OnEDIT leverages a fixed-size adapter for the implicit task, balancing current and past knowledge through editing operations at every time step without relying on task IDs or backpropagation. Extensive experiments on CIT benchmarks demonstrate that OnEDIT consistently maintains robust and stable performance, whereas existing state-of-the-art baselines often suffer from performance degradation in online settings. This suggests that OnEDIT achieves superior generalization across diverse task orders and model scales, while maintaining high efficiency and low memory overhead.

OnEDIT: Online Editing with Decoupled Implicit Task for Large Language Models

Although geometric reconstruction of general objects from images has made remarkable progress in recent years, slender structures remain largely underexplored, despite their critical importance in engineering, biomedical, and agricultural applications. 
To bridge this gap, we propose a dedicated 2DGS-based geometric reconstruction framework tailored for slender structures, achieving accurate and faithful geometry recovery.
Our method first addresses the challenge that most slender objects are texture-less, which hinders reliable feature matching and pose estimation in traditional SfM pipelines.
By leveraging the curve-like nature of slender structures, we perform a curve-guided SfM process that provides robust camera poses and accurate 3D curve initialization for Gaussian primitives.
To ensure SfM reliability, we introduce a high-precision mask extraction strategy that integrates geometric priors with a segmentation network, effectively handling self-occlusion and thin geometry.
Furthermore, to enhance fine geometric recovery, we incorporate a differentiable Poisson reconstruction module to extract an initial mesh during training, which is then refined via image-space iterative optimization using differentiable mesh rasterization.

Slender3D: Curve-Guided Multi-View Reconstruction of Slender Structures

Cross-Domain Few-Shot Object Detection (CD-FSOD) is an extremely challenging task due to the inherent data scarcity and substantial domain shift between the source and target domains. Existing methods often suffer from overfitting and noisy feature representations, which hinder the construction of discriminative class prototypes in the target domain. In this paper, we propose a novel framework with sparse instance learning (SI-ViTO) for CD-FSOD, which leverages instance sparsity to achieve better detection with fewer representations. SI-ViTO adopts a dual-stage sparsity module that applies instance feature sparsity to both the few-shot support images and the query images. This dual sparsity enables the model to preserve salient foreground semantics while filtering out redundant or noisy information. Furthermore, a new prototype calibration strategy dynamically refines the class prototypes with query instances to accelerate prototype adaptation. Extensive experimental results on CD-FSOD benchmarks show that SI-ViTO outperforms state-of-the-art methods, demonstrating that fewer, more discriminative representations yield better cross-domain few-shot object detection performance than more abundant ones.

Less Is Better: Sparse Instance Learning for Cross-Domain Few-Shot Object Detection

Graph Neural Networks (GNNs) are expressive architectures for learning from complex graph-structured data. However, their practical use is often limited by the high computational cost of neighborhood aggregation. Recent efforts have focused on knowledge distillation from GNNs to inference-efficient Multi-Layer Perceptrons (MLPs). However, most existing works treat this distillation as an embedding alignment problem, overlooking the need to replicate the topology-aware smoothing behavior that arises from message passing in GNNs. Moreover, existing methods are primarily performance driven, ignoring critical real-world requirements such as fairness. In this work, we make two key observations: (1) state-of-the-art distillation methods fail to capture the heterogeneous smoothness patterns of GNNs, limiting structural awareness in MLPs, and (2) they introduce significant individual and group fairness violations. We introduce FAITH, the first fair and structurally aware GNN-to-MLP distillation framework with graph-free inference. To improve structural awareness in MLPs, we propose a neighborhood-guided energy alignment objective that transfers not only node-level energy, but also the distribution of energies across local neighborhoods. To improve individual fairness, FAITH introduces a novel $\ell_{2,1}$-norm objective that preserves structured similarity in the learned representations. Additionally, we incorporate a counterfactual invariance objective that explicitly encourages the model to learn representations that are statistically independent of the sensitive attribute. We provide a comprehensive theoretical analysis of FAITH, interpreting it through a novel instantiation of the Information Bottleneck principle.
Extensive experiments on 11 benchmark datasets show that FAITH achieves stronger structural awareness and delivers a better trade-off between utility and fairness than existing methods.

Leap of FAITH from GNN-to-MLP: Fairness Aware Inference via DisTillation of GrapH Knowledge
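For reference, the $\ell_{2,1}$ norm named in the FAITH abstract can be computed as below. The sketch uses the row-wise convention common in machine learning (sum over rows of each row's Euclidean norm); some texts define it over columns instead. How FAITH applies the norm inside its fairness objective is not stated in the abstract, so only the norm itself is shown, and `l21_norm` is a hypothetical helper name.

```python
import math

def l21_norm(matrix):
    """Sum of the Euclidean norms of the rows of a matrix (row-wise convention)."""
    return sum(math.sqrt(sum(x * x for x in row)) for row in matrix)

residual = [
    [3.0, 4.0],    # row norm 5
    [0.0, 0.0],    # row norm 0, an entirely zero row costs nothing
    [5.0, 12.0],   # row norm 13
]
value = l21_norm(residual)  # 5 + 0 + 13 = 18
```

Because whole rows are penalized through their unsquared norms, minimizing this quantity tends to drive entire rows to zero rather than shrinking all entries uniformly.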

The sparsity of user–item interactions remains a fundamental obstacle in collaborative filtering, limiting the ability of Graph Neural Network (GNN)-based recommender systems to capture high-order user relationships without incurring over-smoothing and computational overhead. Existing social recommendation approaches mitigate this by incorporating social networks, yet most rely on explicit ties and fail to construct informative links in their absence. Meanwhile, contrastive learning (CL) has shown promise in improving representation quality, but current view generation strategies, augmentation-based for robustness and nonaugmentation-based for semantic fidelity, are seldom combined, leaving their complementary potential underexplored. We propose Social Generating with Multiview-guided Tuning (SGMT), a unified framework that addresses both challenges. First, an interest-aware social generation mechanism constructs synthetic user–user links from shared interaction patterns, theoretically shown to compress collaborative paths and uncover latent high-order relations. Second, we present two complementary CL modules, Noise-augmented View and Semantic-explored View, which we theoretically prove to preferentially enhance uniformity and alignment, respectively, two fundamental objectives in CL. Experiments on three real-world datasets show that SGMT outperforms state-of-the-art baselines, validating both the theoretical analysis and the practical efficacy of our model.
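The alignment and uniformity objectives mentioned above have standard definitions in the contrastive learning literature, sketched here on toy 2D embeddings. This is not SGMT's code: the two view modules are not reproduced, and the function names and the temperature `t = 2.0` are assumptions following common practice.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def alignment(view_a, view_b):
    """Mean squared distance between embeddings of positive pairs (lower is better)."""
    return sum(sq_dist(a, b) for a, b in zip(view_a, view_b)) / len(view_a)

def uniformity(embeddings, t=2.0):
    """Log of the mean Gaussian potential over distinct pairs (lower is better)."""
    pairs = [(i, j) for i in range(len(embeddings))
             for j in range(i + 1, len(embeddings))]
    mean_potential = sum(math.exp(-t * sq_dist(embeddings[i], embeddings[j]))
                         for i, j in pairs) / len(pairs)
    return math.log(mean_potential)

spread = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]  # evenly spaced on the circle
collapsed = [(1.0, 0.0)] * 4                                  # all embeddings identical

spread_uniformity = uniformity(spread)
collapsed_uniformity = uniformity(collapsed)
perfect_alignment = alignment(spread, spread)
```

Collapsed embeddings align trivially but score the worst possible uniformity, which is why the two objectives are usually optimized together.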
