
Diffusion models for garment-centric human generation from text or image prompts have attracted growing attention for their strong application potential.
However, existing methods often face a dilemma: lightweight approaches, such as adapters, can produce inconsistent textures, while finetuning-based methods incur high training costs and struggle to preserve the generalization capabilities of pretrained diffusion models, limiting their performance across diverse scenarios. To address these challenges, we propose DreamFit, which incorporates a lightweight Anything-Dressing Encoder specifically tailored for garment-centric human generation.
DreamFit has three key advantages: (1) Lightweight training: with the proposed adaptive attention and LoRA modules, DreamFit reduces the model complexity to 83.4M trainable parameters. (2) Anything-Dressing: our model generalizes surprisingly well to a wide range of (non-)garments, creative styles, and prompt instructions, consistently delivering high-quality results across diverse scenarios. (3) Plug-and-play: DreamFit is engineered for smooth integration with any community control plugins for diffusion models, ensuring easy compatibility and minimizing adoption barriers.
To further enhance generation quality, DreamFit leverages pretrained large multi-modal models (LMMs) to enrich the prompt with fine-grained descriptions of the garment, thereby reducing the prompt gap between training and inference. We conduct comprehensive experiments on both $768 \times 512$ high-resolution benchmarks and in-the-wild images. DreamFit surpasses all existing methods, demonstrating state-of-the-art capability in garment-centric human generation.
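
To give a rough sense of why LoRA modules keep the trainable parameter count small, the following minimal sketch counts the parameters that a low-rank update adds to a single linear layer. The layer sizes and rank are hypothetical values chosen for illustration only; they are not DreamFit's actual configuration, and the function names are not from the paper.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by a low-rank update W + B @ A on one linear layer.

    A has shape (rank, d_in) and B has shape (d_out, rank). The frozen
    base weight W (d_out x d_in) contributes no trainable parameters.
    """
    return rank * d_in + d_out * rank


def full_finetune_params(d_in: int, d_out: int) -> int:
    """Parameters trained when the whole weight matrix is updated."""
    return d_in * d_out


# Hypothetical layer sizes and rank, for illustration only.
d_in, d_out, rank = 1024, 1024, 16
lora = lora_trainable_params(d_in, d_out, rank)
full = full_finetune_params(d_in, d_out)
print(f"LoRA: {lora} trainable parameters, full finetuning: {full}")
print(f"ratio: {lora / full:.5f}")
```

Under these assumed sizes, the low-rank update trains only a few percent of the parameters that full finetuning of the same layer would.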

AAAI 2025

DreamFit: Garment-Centric Human Generation via a Lightweight Anything-Dressing Encoder

poster

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

### [Invited Speakers](https://aaai.org/conference/aaai/aaai-25/aaai-25-invited-speakers/)

Register [here](https://aaai.org/conference/aaai/aaai-25/registration/)

Generating Chain-of-Thought (CoT) before deriving the answer can effectively improve the reasoning capabilities of large language models (LLMs) and significantly improve the accuracy of the generated answer. However, in most cases, the generated CoT is much longer than the desired final answer, which results in additional decoding costs. Furthermore, existing research has found that shortening the reasoning steps in CoT, even while preserving the key information, diminishes LLMs' abilities. These phenomena make it difficult to use LLMs and CoT in many real-world applications that only require the final answer and are sensitive to latency, such as search and recommendation. To reduce the cost of model decoding and shorten the generated CoT, this paper presents Conditioned Compressed Chain-of-Thought (C3oT), a CoT compression framework that involves a compressor to compress an original longer CoT into a shorter CoT while maintaining key information and interpretability, a conditioned training method to train LLMs with both longer CoT and shorter CoT simultaneously to learn the corresponding relationships between them, and a conditioned inference method to gain the reasoning ability learned from longer CoT by generating shorter CoT. We conduct experiments over four datasets from arithmetic and commonsense scenarios, showing that the proposed method can compress the length of generated CoT by more than 50% without compromising its effectiveness.
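
The conditioned training and inference steps described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the condition markers `[LONG]` and `[SHORT]`, the `Example` type, and the function names are all hypothetical, and the compressed CoT is assumed to have been produced by a compressor beforehand.

```python
from dataclasses import dataclass
from typing import List

LONG_TAG = "[LONG]"    # hypothetical condition marker for the original CoT
SHORT_TAG = "[SHORT]"  # hypothetical condition marker for the compressed CoT


@dataclass
class Example:
    prompt: str
    target: str


def build_conditioned_pairs(
    question: str, long_cot: str, short_cot: str, answer: str
) -> List[Example]:
    """Create two training examples for one question.

    Both examples share the question and answer; only the condition
    marker and the CoT length differ, so a model trained on both can
    relate the longer reasoning to its compressed counterpart.
    """
    return [
        Example(f"{LONG_TAG} {question}", f"{long_cot} Answer: {answer}"),
        Example(f"{SHORT_TAG} {question}", f"{short_cot} Answer: {answer}"),
    ]


def inference_prompt(question: str) -> str:
    """At inference time, request only the short-CoT condition."""
    return f"{SHORT_TAG} {question}"


pairs = build_conditioned_pairs(
    question="Tom has 3 apples and buys 2 more. How many apples does he have?",
    long_cot="Tom starts with 3 apples. He buys 2 more apples. 3 + 2 = 5.",
    short_cot="3 + 2 = 5.",
    answer="5",
)
for example in pairs:
    print(example.prompt, "->", example.target)
```

The design choice illustrated here is that the same model sees both CoT lengths during training, while decoding cost at inference is determined only by the short condition.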

C3oT: Generating Shorter Chain-of-Thought without Compromising Effectiveness

Modern methods for autonomous driving perception widely adopt multi-modal fusion to enhance 3D scene understanding. However, existing methods suffer from inferior semantic extraction in image encoders that treat all pixels equally, ignoring contextual differences. The generated multi-modal representations also typically lack comprehensive semantic and spatial geometry information, which is crucial for the 3D panoptic segmentation task. In this paper, we propose a novel Semantic-Geometry Fusion Transformer (SGFormer) that extracts adaptive semantic contexts, aggregates geometric information, and captures the semantic-geometry fusion. First, in the Image Branch, we tailor semantic contexts for each pixel with context-guided attention and spatial context alignment to refine semantic details. Second, we transform image and voxel features into point-pixel geometry representations, simultaneously learning semantic category priors as embeddings to better represent scene geometry and semantics. Finally, to aggregate semantic information with related geometry, we design a semantic-geometry fusion that combines the transformer, effectively capturing semantic-geometry relationships into multi-modal panoptic representations. Notably, SGFormer achieves state-of-the-art (SOTA) results on nuScenes and SemanticPOSS, and competitive performance on SemanticKITTI. Moreover, SGFormer exhibits superior robustness compared to leading methods, marking an improvement of 2% to 10%.

SGFormer: Semantic-Geometry Fusion Transformer for Multi-modal 3D Panoptic Segmentation

Recently, learning-based Underwater Image Enhancement (UIE) methods have demonstrated promising performance. However, existing learning-based methods still face two challenges for high-fidelity and high-efficiency UIE. 1) They rarely consider the inconsistent degradation levels in different spatial regions and spectral bands simultaneously. 2) They treat all regions equally, ignoring that regions with high-frequency details are more difficult to reconstruct. To address these challenges, we propose a novel UIE method based on spatial-spectral dual-domain adaptive learning, termed SS-UIE. Specifically, we first introduce a spatial-wise Multi-scale Cycle Selective Scan (MCSS) module and a Spectral-Wise Self-Attention (SWSA) module, both with linear complexity, and combine them in parallel to form a basic Spatial-Spectral block (SS-block). Benefiting from the global receptive field of MCSS and SWSA, the SS-block can effectively model the degradation levels of different spatial regions and spectral bands, thereby enabling degradation level-based dual-domain adaptive UIE. By stacking multiple SS-blocks, we build our SS-UIE network. Additionally, a Frequency-Wise Loss (FWL) is introduced to narrow the frequency-wise discrepancy and reinforce the model's attention on regions with high-frequency details. Extensive experiments validate that the SS-UIE technique significantly outperforms state-of-the-art UIE methods while requiring lower computational and memory costs. The code will be publicly available.

Adaptive Dual-domain Learning for Underwater Image Enhancement

Modern Hopfield networks (MHNs) have recently gained significant attention in the field of artificial intelligence because they can store and retrieve large sets of patterns with an exponentially large memory capacity. Recently, it has been proven that MHNs can be understood as a dynamical system defined with Lagrangian functions of memory and feature neurons, where memories associated with in-distribution (ID) training samples are represented as attractors in the feature space. However, one of the primary challenges remains in managing out-of-distribution (OOD) samples because MHNs are formulated under the assumption that all data samples are ID samples. To address this challenge, we propose the rectified Lagrangian (RecLag), a new Lagrangian function for memory neurons that explicitly incorporates an attractor for OOD samples in the dynamical system of MHNs. RecLag is designed to create a trivial point attractor for any interaction matrix, enabling OOD detection by identifying samples that fall into this attractor as OOD. Furthermore, for training MHNs with RecLag, we devise a method based on probabilistic interaction, by which data samples with low probability density values fall into the created attractor. In experiments, we demonstrate the effectiveness of RecLag-based MHNs compared to energy-based OOD detection methods, including those using state-of-the-art Hopfield energies, across nine image datasets.
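
For context, the energy-based baselines mentioned above can be sketched as follows. This is not RecLag; it is a minimal illustration of scoring a sample by one common form of the modern Hopfield energy (a log-sum-exp over similarities to stored patterns plus a quadratic term, with constants dropped) and thresholding that score. The stored patterns, threshold, and function names are hypothetical.

```python
import math
from typing import Sequence


def hopfield_energy(
    query: Sequence[float],
    patterns: Sequence[Sequence[float]],
    beta: float = 1.0,
) -> float:
    """Energy -(1/beta) * log(sum_i exp(beta * <x_i, q>)) + 0.5 * <q, q>.

    Lower energy means the query lies closer to the stored patterns.
    """
    scores = [beta * sum(q * p for q, p in zip(query, pat)) for pat in patterns]
    top = max(scores)  # subtract the max for numerical stability
    lse = (top + math.log(sum(math.exp(s - top) for s in scores))) / beta
    return -lse + 0.5 * sum(q * q for q in query)


def is_ood(
    query: Sequence[float],
    patterns: Sequence[Sequence[float]],
    threshold: float,
    beta: float = 1.0,
) -> bool:
    """Flag a sample as OOD when its energy exceeds a chosen threshold."""
    return hopfield_energy(query, patterns, beta) > threshold


# Hypothetical stored patterns and queries, for illustration only.
stored = [[1.0, 0.0], [0.0, 1.0]]
near_query = [1.0, 0.0]    # coincides with a stored pattern
far_query = [-1.0, -1.0]   # points away from both stored patterns
print(hopfield_energy(near_query, stored), hopfield_energy(far_query, stored))
```

In this baseline family the threshold has to be chosen separately, whereas the abstract describes RecLag as building an explicit attractor for OOD samples into the dynamics.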

Rectified Lagrangian for Out-of-Distribution Detection in Modern Hopfield Networks

Generalizable Deepfake Detection aims to develop universal detectors capable of identifying diverse types of forgery images using limited training samples. Recent research has uncovered that the features of large pre-trained models, such as CLIP, can be effectively utilized for deepfake detection via linear classifiers, even on unseen sources. However, two critical issues remain unresolved: 1) understanding why CLIP features are effective for deepfake detection through a linear classifier; and 2) determining how to simply and effectively improve the detection performance of CLIP. In this study, to elucidate the underlying mechanism of CLIP's detection capabilities, we decode the detection features of CLIP into text and perform word frequency analysis. Our findings indicate that CLIP performs deepfake detection by identifying similar concepts (Fig. 1a). Building on this insight, we introduce Category Common Prompt CLIP, called C2P-CLIP, a novel and effective approach designed to augment detection performance. This method employs the category common prompt on the text encoder to inject category concepts into the vision encoder, thereby enhancing its discrimination ability between real and fake items (Fig. 1b). Our C2P-CLIP method achieves a 12.41% improvement in detection performance compared to the original CLIP, without adding additional parameters during testing. Comprehensive experiments conducted on two widely-used datasets, encompassing 20 generation models, validate the efficacy of our proposed method, demonstrating state-of-the-art performance. The code will be released.

C2P-CLIP: Injecting Category Common Prompt in CLIP to Enhance Generalization in Deepfake Detection

Large-scale text-guided image diffusion models have shown astonishing results in text-to-image (T2I) generation. However, applying these models to synthesize textures for 3D geometries remains challenging due to the domain gap between 2D images and textures on a 3D surface. Early works that used a projecting-and-inpainting approach managed to preserve generation diversity but often resulted in noticeable artifacts and style inconsistencies. While recent methods have attempted to address these inconsistencies, they often introduce other issues, such as blurring, over-saturation, or over-smoothing. To overcome these challenges, we propose a novel text-to-texture synthesis framework that leverages pretrained diffusion models. We first introduce a local attention reweighing mechanism in the self-attention layers to guide the model in concentrating on spatially correlated patches across different views, thereby enhancing local details while preserving cross-view consistency. Additionally, we propose a novel latent space merge pipeline, which further ensures consistency across different viewpoints without sacrificing too much diversity. Our method significantly outperforms existing state-of-the-art techniques in texture consistency and visual quality, while delivering results much faster than distillation-based methods. Importantly, our framework does not require additional training or fine-tuning, making it highly adaptable to a wide range of models available on public platforms.

Stable, Consistent and High-Quality Text-Driven Texture Generation

In this work, we address the challenging task of Generalized Referring Expression Comprehension (GREC). Compared to the classic Referring Expression Comprehension (REC) that focuses on single-target expressions, GREC extends the scope to a more practical setting by further encompassing no-target and multi-target expressions. Existing REC methods face challenges in handling the complex cases encountered in GREC, primarily due to their fixed target output and limitations in multi-modal representations. To address these issues, we propose a Hierarchical Alignment-enhanced Adaptive Grounding Network (HieA2G) for GREC, which can flexibly deal with various types of referring expressions. First, a Hierarchical Multi-modal Semantic Alignment (HMSA) module is proposed to incorporate three levels of alignment: word-object, phrase-object, and text-image alignment. It enables hierarchical cross-modal interactions across multiple levels to achieve comprehensive and robust multi-modal understanding, greatly enhancing grounding ability for complex cases. Then, to address the varying number of target objects in GREC, we introduce an Adaptive Grounding Counter (AGC) to dynamically determine the number of output targets. Additionally, an auxiliary contrastive loss is employed in AGC to enhance object-counting ability by pulling together multi-modal features with the same count and pushing apart those with different counts. Extensive experimental results show that HieA2G achieves new state-of-the-art performance on the challenging GREC task. Moreover, it exhibits superior performance across four other tasks, namely REC, Phrase Grounding, Referring Expression Segmentation (RES), and Generalized Referring Expression Segmentation (GRES), which further demonstrates the superiority and generalizability of the proposed HieA2G.

Hierarchical Alignment-enhanced Adaptive Grounding Network for Generalized Referring Expression Comprehension

The questionable responses caused by knowledge hallucination may make LLMs unstable in decision-making. However, it has never been investigated whether LLM hallucination can be used to generate negative reasoning that assists fake news detection. In this paper, we propose a novel supervised self-reinforced reasoning rectification approach, SR^3, that not only yields common reasonable reasoning for news but also forces LLMs to generate wrong understandings of news via LLM reflection for semantic consistency learning. Building on this, we construct a negative reasoning-based news learning model called NRFE, which leverages positive or negative news-reasoning pairs to learn the semantic consistency between them. To avoid the impact of label-implicated reasoning, we deploy a student model, NRFE-D, that only takes news content as input to inspect the performance of our method by distilling the knowledge from NRFE. Experimental results on three popular fake news datasets demonstrate the superiority of our method compared with three kinds of baselines: prompting-based LLMs, fine-tuning-based PLMs, and other representative fake news detection methods.

Is LLMs Hallucination Usable? LLM-based Negative Reasoning for Fake News Detection

We present RS-Diffusion, the first diffusion-model-based method for single-frame Rolling Shutter (RS) correction. RS artifacts compromise the visual quality of frames due to the row-wise exposure of CMOS sensors. Most previous methods have focused on multi-frame approaches, using temporal information from consecutive frames for motion rectification. However, few approaches address the more challenging but important single-frame RS correction. In this work, we present an "image-to-motion" framework via diffusion techniques, with a designed patch-attention module. In addition, we present the RS-Real dataset, comprising captured RS frames alongside their corresponding Global Shutter (GS) ground-truth pairs. The GS frames are corrected from the RS ones, guided by the corresponding Inertial Measurement Unit (IMU) gyroscope data acquired during capture. Experiments show that RS-Diffusion surpasses previous single-frame RS methods, and we believe that our work establishes a solid foundation for advancing the field of rolling shutter correction, demonstrates the potential of single-frame methods, and provides a valuable dataset for further research. The code and dataset will be released.

Single Image Rolling Shutter Removal with Diffusion Models

Pedestrian Attribute Recognition (PAR) is one of the indispensable tasks in human-centered research. However, existing datasets neglect different domains (e.g., environments, times, populations, and data sources), only conducting simple random splits, and performance on these datasets has already approached saturation. In the past five years, no large-scale dataset has been publicly released. To address this issue, this paper proposes a new large-scale, cross-domain pedestrian attribute recognition dataset to fill the data gap, termed MSP60K. It consists of 60,122 images and 57 attribute annotations across eight scenarios. Synthetic degradation is also applied to further narrow the gap between the dataset and challenging real-world scenarios. To establish a more rigorous benchmark, we evaluate 17 representative PAR models under both random and cross-domain split protocols on our dataset. Additionally, we propose an innovative Large Language Model (LLM) augmented PAR framework, named LLM-PAR. This framework processes pedestrian images through a Vision Transformer (ViT) backbone to extract features and introduces a multi-embedding query Transformer to learn partial-aware features for attribute classification. Significantly, we enhance this framework with an LLM for ensemble learning and visual feature augmentation. Comprehensive experiments across multiple PAR benchmark datasets validate the efficacy of our proposed framework. Both the dataset and source code will be released upon acceptance.
