United States

We propose Diff-Shadow, a global-guided diffusion model for high-quality shadow removal. Previous transformer-based approaches can utilize global information to relate shadow and non-shadow regions but are limited in their synthesis ability and recover images with obvious boundaries. In contrast, diffusion-based methods can generate better content but ignore global information, resulting in inconsistent illumination. In this work, we combine the advantages of diffusion models and global guidance to realize shadow-free restoration. Specifically, we propose a parallel UNets architecture: 1) the local branch performs patch-based noise estimation in the diffusion process, and 2) the global branch recovers low-resolution shadow-free images. A Reweight Cross Attention (RCA) module is designed to integrate global contextual information of non-shadow regions into the local branch. We further design a Global-guided Sampling Strategy (GSS) that mitigates patch boundary issues and ensures consistent illumination across shaded and unshaded regions in the recovered image. Comprehensive experiments on three standard public datasets (ISTD, ISTD+, and SRD) demonstrate the effectiveness of Diff-Shadow. Compared to state-of-the-art methods, our method achieves a significant improvement in PSNR, increasing from 32.33 dB to 33.69 dB on the ISTD dataset. Code will be released.
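To put the reported gain in perspective: PSNR is defined as 10·log10(MAX²/MSE), so a difference in dB corresponds to a ratio of mean squared errors. The minimal sketch below converts the quoted 32.33 dB to 33.69 dB improvement into the implied MSE reduction. The function names are illustrative and not taken from the paper, and it assumes both figures were computed with the same peak value.

```python
import math


def psnr(mse: float, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_val ** 2 / mse)


def mse_ratio_from_psnr_gain(psnr_before: float, psnr_after: float) -> float:
    """Factor by which MSE shrinks when PSNR rises from psnr_before to psnr_after."""
    return 10.0 ** ((psnr_after - psnr_before) / 10.0)


# Figures quoted in the abstract for the ISTD dataset.
ratio = mse_ratio_from_psnr_gain(32.33, 33.69)
print(f"MSE reduced by a factor of about {ratio:.2f}")
```

Under that assumption, a 1.36 dB gain corresponds to roughly a 1.37x lower mean squared error.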

AAAI 2025

Diff-Shadow: Global-guided Diffusion Model for Shadow Removal

synthesis

computational photography

video

image

poster

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

### [Invited Speakers](https://aaai.org/conference/aaai/aaai-25/aaai-25-invited-speakers/)

Register [here](https://aaai.org/conference/aaai/aaai-25/registration/)

Deep neural networks (DNNs) typically employ an end-to-end (E2E) training paradigm which presents several challenges, including high GPU memory consumption, inefficiency, and difficulties in model parallelization during training. Recent research has sought to address these issues, with one promising approach being local learning. This method involves partitioning the backbone network into gradient-isolated modules and manually designing auxiliary networks to train these local modules. Existing methods often neglect the interaction of information between local modules, leading to myopic issues and a performance gap compared to E2E training. To address these limitations, we propose the Multilaminar Leap Augmented Auxiliary Network (MLAAN). Specifically, MLAAN comprises Multilaminar Local Modules (MLM) and Leap Augmented Modules (LAM). MLM captures both local and global features through independent and cascaded auxiliary networks, alleviating performance issues caused by insufficient global features. However, overly simplistic auxiliary networks can impede MLM's ability to capture global information. To address this, we further design LAM, an enhanced auxiliary network that uses the Exponential Moving Average (EMA) method to facilitate information exchange between local modules, thereby mitigating the shortsightedness resulting from inadequate interaction. The synergy between MLM and LAM has demonstrated excellent performance. Our experiments on the CIFAR-10, STL-10, SVHN, and ImageNet datasets show that MLAAN can be seamlessly integrated into existing local learning frameworks, significantly enhancing their performance and even surpassing end-to-end (E2E) training methods, while also reducing GPU memory consumption.

MLAAN: Scaling Supervised Local Learning with Multilaminar Leap Augmented Auxiliary Network
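The MLAAN abstract states that LAM uses an Exponential Moving Average (EMA) to exchange information between local modules, but gives no further detail. As a generic illustration of EMA parameter tracking only (not the authors' implementation; `ema_update` and the `decay` value are assumptions made for this sketch):

```python
from typing import List


def ema_update(ema_params: List[float], new_params: List[float],
               decay: float = 0.99) -> List[float]:
    """Blend the running average with the newest parameters.

    Each entry becomes decay * old_average + (1 - decay) * new_value.
    """
    return [decay * e + (1.0 - decay) * p
            for e, p in zip(ema_params, new_params)]


# Toy example: the "current" parameters stay fixed at [1.0, 2.0],
# and the EMA copy drifts toward them over three updates.
ema = [0.0, 0.0]
for _ in range(3):
    ema = ema_update(ema, [1.0, 2.0], decay=0.5)
print(ema)  # [0.875, 1.75]
```

A larger `decay` makes the averaged copy change more slowly, which is the usual reason EMA is used to stabilize a secondary set of weights.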

Personalized image generation enables customized content creation based on text-to-image diffusion models. However, existing personalization methods focus on fine-tuning generative models to generate specific single individuals or concepts, such as an image of a specific Corgi, and are unable to generate data for multiple individuals or concepts with common characteristics, such as images of multiple different Corgis. In this work, we focus on personalizing a diffusion model to generate varied data, usually containing multiple subjects, which has a more diverse and complex data distribution. Our basic assumption is that the varied data distribution is composed of the common features shared among all samples, as well as the reasonable variations within it. Accordingly, we can decompose the learning of complex data distributions into two simpler sub-tasks, employing a divide-and-conquer approach. To this end, we propose Dis2Booth, a framework that can learn complex image Distribution by Disentangling data distribution in an unsupervised manner. Specifically, Dis2Booth contains two modules, Anchor LoRA and Delta LoRA, that are tasked with learning the common features and variational features, constrained by Contextual Loss and Delta Loss, without supervision. In addition, an Asynchronous Optimization Strategy is proposed to ensure the collaborative training of the two modules. Extensive experiments suggest that Dis2Booth is able to learn data distributions with higher diversity and complexity while maintaining the same level of flexibility as LoRA.

Dis²Booth: Learning Image Distribution with Disentangled Features for Text-to-Image Diffusion Models

In time-lapse microscopy, inherent noise significantly limits imaging sensitivity and increases measurement uncertainty. Due to the scarcity of clean data, zero-shot approaches have emerged as highly data-efficient solutions for microscopy denoising. However, existing methods typically process video frames independently, resulting in long training times and issues such as temporal noise and over-smoothing. In this paper, we introduce MDSR-Zero, a zero-shot online learning method designed for plug-and-play noise suppression and super-resolution of microscopy videos. Our approach leverages an efficient online training strategy that reuses denoising models from previous frames. By treating the video as a continuous stream, our model significantly reduces training time and ensures temporally consistent denoising. Additionally, we propose a novel loss function tailored for denoising in the context of super-resolution, which enhances the detail in the denoised results. Extensive experiments on both synthetic and real-world noise demonstrate that our method achieves state-of-the-art performance among zero-shot denoising approaches and is competitive with self-supervised methods. Notably, our method can reduce training time by up to 10x compared to the previous SOTA method.

Efficient Online Training for Zero-Shot Time-Lapse Microscopy Denoising and Super-Resolution

With the benefit of the explicit object-oriented reasoning capabilities of scene graphs, scene graph-to-image generation has made remarkable advancements in comprehending object coherence and interactive relations. Recent state-of-the-art methods typically predict the scene layout as an intermediate representation of a scene graph before synthesizing the image. Nevertheless, transforming a scene graph into an exact layout may restrict its representation capabilities, leading to discrepancies in interactive relationships (such as standing on, wearing, or covering) between the generated image and the input scene graph. In this paper, we propose a Scene Graph-Grounded Image Generation (SGG-IG) method to mitigate the above issues. Specifically, to enhance the scene graph representation, we design a masked auto-encoder module and a relation embedding learning module to integrate structural knowledge and contextual information of the scene graph in a masked self-supervised manner. Subsequently, to bridge the scene graph with visual content, we introduce a spatial constraint and an image-scene alignment constraint to capture the fine-grained visual correlation between the scene graph symbol representation and the corresponding image representation, thereby generating semantically consistent and high-quality images. Extensive experiments demonstrate the effectiveness of the method both quantitatively and qualitatively.

Scene Graph-Grounded Image Generation

In safety-critical domains such as medical diagnostics and autonomous driving, single-image evidence is sometimes insufficient to reflect the inherent ambiguity of vision problems. Therefore, multiple plausible hypotheses that match the image semantics may be needed to reflect the actual distribution of targets and support downstream tasks. However, balancing and improving the diversity and consistency of segmentation predictions under high-dimensional output spaces and potentially multimodal distributions is still challenging. This paper presents Hierarchical Self-Regulation Diffusion (HSRDiff), a unified framework that models the joint probability distribution over entire labels. Our model self-regulates the balance between the two modes of predicting the label and the noise in a novel "differentiation to unification" pipeline and dynamically fits the optimal path to model the aleatoric uncertainty rooted in observations. In addition, we preserve the high-fidelity reconstruction of delicate structures in images by leveraging hierarchical multi-scale condition priors. We validate HSRDiff in three different semantic scenarios. The experimental results show that HSRDiff outperforms the compared methods by a considerable margin. Our code is included in the supplementary material.

HSRDiff: A Hierarchical Self-Regulation Diffusion Model for Stochastic Semantic Segmentation

We consider the challenge of black-box optimization within hybrid discrete-continuous and variable-length spaces, a problem that arises in various applications, such as decision tree learning and symbolic regression. We propose DisCo-DSO (Discrete-Continuous Deep Symbolic Optimization), a novel approach that uses a generative model to learn a joint distribution over discrete and continuous design variables to sample new hybrid designs. In contrast to standard decoupled approaches, in which the discrete and continuous variables are optimized separately, our joint optimization approach uses fewer objective function evaluations, is robust against non-differentiable objectives, and learns from prior samples to guide the search, leading to significant improvement in performance and sample efficiency. Our experiments on a diverse set of optimization tasks demonstrate that the advantages of DisCo-DSO become increasingly evident as problem complexity grows. In particular, we illustrate DisCo-DSO's superiority over the state-of-the-art methods for interpretable reinforcement learning with decision trees.

DisCo-DSO: Coupling Discrete and Continuous Optimization for Efficient Generative Design in Hybrid Spaces

We propose SMMF (Square-Matricized Momentum Factorization), a memory-efficient optimizer that reduces the memory requirement of widely used adaptive learning rate optimizers, such as Adam, by up to 96%. SMMF enables flexible and efficient factorization of first and second momentum tensors of arbitrary rank (shape) during optimization, based on the proposed square-matricization and one-time single matrix factorization. As a result, it is applicable to momentum tensors of any rank (shape), i.e., biases, matrices, and any rank-d tensors, which are prevalent in various deep model architectures such as CNNs (high rank) and Transformers (low rank). This contrasts with existing memory-efficient optimizers, which apply only to a particular (rank-2) momentum tensor, e.g., linear layers. We conduct a regret bound analysis of SMMF, which shows that it converges similarly to non-memory-efficient adaptive learning rate optimizers, such as AdamNC, providing a theoretical basis for its competitive optimization capability. In our experiments, SMMF takes up to 96% less memory compared to state-of-the-art memory-efficient optimizers, e.g., Adafactor, CAME, and SM3, while achieving comparable model performance on various CNN and Transformer tasks. The code implementation is available at an anonymous GitHub repository and in the supplementary file.

SMMF: Square-Matricized Momentum Factorization for Memory-Efficient Optimization
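The SMMF abstract does not spell out how square-matricization works. One plausible reading, sketched below purely as an assumption, is to reshape a tensor of any shape into the most nearly square matrix with the same number of elements, so that a rank-1 factorization (one row vector plus one column vector, in the style of Adafactor) stores as few values as possible. The helper name `square_matricize_shape` and the example kernel shape are hypothetical.

```python
import math
from typing import Tuple


def square_matricize_shape(shape: Tuple[int, ...]) -> Tuple[int, int]:
    """Return (rows, cols) with rows * cols == numel and rows <= cols,
    choosing rows as the largest divisor of numel not exceeding sqrt(numel)."""
    numel = math.prod(shape)
    if numel <= 0:
        raise ValueError("shape must describe a non-empty tensor")
    rows = math.isqrt(numel)
    while numel % rows != 0:
        rows -= 1
    return rows, numel // rows


# Hypothetical 4-D convolution kernel: 64 x 32 x 3 x 3 = 18432 elements.
conv_shape = (64, 32, 3, 3)
rows, cols = square_matricize_shape(conv_shape)
full_size = math.prod(conv_shape)
factored_size = rows + cols  # one row vector plus one column vector (assumed)
print(rows, cols, full_size, factored_size)  # 128 144 18432 272
```

For a fixed number of elements, `rows + cols` is smallest when the two dimensions are as close as possible, which is why a near-square reshape is attractive before factorizing.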

The extraordinary ability of generative models has emerged as a new trend in image editing and realistic image generation, posing a serious threat to the trustworthiness of multimedia data and driving research on image manipulation detection and localization (IMDL). However, the lack of a large-scale data foundation makes the IMDL task unattainable. In this paper, a local manipulation pipeline is designed, incorporating the powerful SAM, ChatGPT, and generative models. On this basis, we propose the GIM dataset, which has the following advantages: 1) Large scale: GIM includes over one million pairs of AI-manipulated images and real images. 2) Rich image content: GIM encompasses a broad range of image classes. 3) Diverse generative manipulation: the images are manipulated with state-of-the-art generators across various manipulation tasks. These advantages allow for a more comprehensive evaluation of IMDL methods, extending their applicability to diverse images. We introduce two benchmark settings to evaluate existing IMDL methods. In addition, we propose a novel IMDL framework, termed GIMFormer, which consists of a ShadowTracer, a Frequency-Spatial block (FSB), and a Multi-Window Anomalous Modeling (MWAM) module. Extensive experiments on GIM demonstrate that GIMFormer surpasses the previous state-of-the-art approach on two different benchmarks.

GIM: A Million-scale Benchmark for Generative Image Manipulation Detection and Localization

Graph transfer learning endeavors to develop a Graph Neural Network (GNN) model in a fully-labeled source domain, with the intention of deploying it on a target domain that has limited labeled data for inference. We reveal that prevalent graph transfer learning methods are susceptible to the homophily shift problem. This issue arises from the divergence in homophily structures between the source and target graphs, leading to a notable deterioration in the performance of GNN models. In this paper, we introduce a novel Contextual Structural Graph Neural Network (CS-GNN) method, leveraging a tailored attention mechanism to apprehend a variety of local structural cues, facilitating structural knowledge transfer across domains. It features an ego-network module to distill local structural diversity and a moment-based approach to gauge structural patterns without needing ground-truth labels. CS-GNN crafts a feature smoothness matrix from node attributes, guiding a customized attention mechanism for feature aggregation. A group-wise fairness loss is employed to balance learning across various structural patterns, enhancing the model's ability to transfer knowledge across domains.  Comprehensive experiments conducted on six benchmark datasets substantiate the superiority of CS-GNN over the state-of-the-art methods, demonstrating significant improvements in accuracy and robustness against homophily shifts. The source code for CS-GNN is publicly available at https://anonymous.4open.science/r/CS-GNN-ECF6/.

Contextual Structure Knowledge Transfer for Graph Neural Networks
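To make the notion of homophily shift concrete, the sketch below computes the common edge homophily ratio (the fraction of edges whose endpoints share a label) for two toy graphs. This is only a label-based diagnostic for illustration; the abstract says CS-GNN itself gauges structural patterns without ground-truth labels. The function name and toy graphs are assumptions.

```python
from typing import Dict, Hashable, Iterable, Tuple

Edge = Tuple[Hashable, Hashable]


def edge_homophily(edges: Iterable[Edge], labels: Dict[Hashable, Hashable]) -> float:
    """Fraction of edges that connect two nodes with the same label."""
    edge_list = list(edges)
    if not edge_list:
        raise ValueError("graph has no edges")
    same = sum(1 for u, v in edge_list if labels[u] == labels[v])
    return same / len(edge_list)


# Toy graphs over the same four labelled nodes.
labels = {0: "a", 1: "a", 2: "b", 3: "b"}
source_edges = [(0, 1), (2, 3), (0, 2)]  # mostly same-label edges
target_edges = [(0, 2), (1, 3), (0, 1)]  # mostly cross-label edges

h_source = edge_homophily(source_edges, labels)
h_target = edge_homophily(target_edges, labels)
print(f"source homophily: {h_source:.2f}, target homophily: {h_target:.2f}")
```

A large gap between the two ratios is the kind of divergence in homophily structure that the abstract refers to as a homophily shift.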

This study tackles the challenge of strictly guaranteeing the "dissipativity" of a dynamical system represented by neural networks learned from given time-series data.
Dissipativity is a crucial indicator for dynamical systems that generalizes stability and input-output stability, and it is known to hold across various systems, including robotics, biological systems, and molecular dynamics.
By analytically proving the general solution to the nonlinear Kalman–Yakubovich–Popov (KYP) lemma, which is the necessary and sufficient condition for dissipativity, we propose a differentiable projection that transforms any dynamics represented by neural networks into dissipative ones and a learning method for the transformed dynamics.
Utilizing the generality of dissipativity, our method strictly guarantees stability, input-output stability, and energy conservation of trained dynamical systems.
Finally, we demonstrate the robustness of our method against out-of-domain inputs through applications to robotic arms and fluid dynamics.
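For reference, the standard definition of dissipativity can be written as follows. The symbols are introduced here for illustration and are not taken from the abstract: a state $x$, input $u$, output $y$, a non-negative storage function $V$, and a supply rate $w$.

```latex
V\bigl(x(t_1)\bigr) - V\bigl(x(t_0)\bigr)
  \;\le\; \int_{t_0}^{t_1} w\bigl(u(t),\, y(t)\bigr)\,\mathrm{d}t
  \qquad \text{for all } t_1 \ge t_0 .
```

Different supply rates recover familiar properties: $w \equiv 0$ makes $V$ non-increasing along trajectories (a Lyapunov-type stability condition), $w = \gamma^2 \lVert u \rVert^2 - \lVert y \rVert^2$ bounds the $L_2$ gain by $\gamma$ (input-output stability), and equality in place of the inequality describes a lossless, energy-conserving system.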
