We present OOTDiffusion, a novel network architecture for realistic and controllable image-based virtual try-on (VTON). We leverage the power of pretrained latent diffusion models, designing an outfitting UNet to learn the garment detail features. Without a redundant warping process, the garment features are precisely aligned with the target human body via the proposed outfitting fusion in the self-attention layers of the denoising UNet. To further enhance controllability, we introduce outfitting dropout to the training process, which enables us to adjust the strength of the garment features through classifier-free guidance. Our comprehensive experiments on the VITON-HD and Dress Code datasets demonstrate that OOTDiffusion efficiently generates high-quality try-on results for arbitrary human and garment images and outperforms other VTON methods in both realism and controllability, indicating an impressive breakthrough in virtual try-on. Our source code is publicly available (for the review process, please refer to our supplementary material).
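As a rough illustration of how dropout during training can pair with classifier-free guidance at inference, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the use of an all-zero vector as the "null" garment condition, and the list-of-floats representation are assumptions made only for this example.

```python
import random
from typing import List, Sequence


def outfitting_dropout(garment_features: Sequence[float],
                       drop_prob: float,
                       rng: random.Random) -> List[float]:
    """Training-time sketch: with probability drop_prob, replace the garment
    condition with a null condition (assumed here to be all zeros)."""
    if rng.random() < drop_prob:
        return [0.0] * len(garment_features)
    return list(garment_features)


def guided_noise(eps_uncond: Sequence[float],
                 eps_cond: Sequence[float],
                 guidance_scale: float) -> List[float]:
    """Inference-time classifier-free guidance:
    eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)."""
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("noise predictions must have the same length")
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


if __name__ == "__main__":
    rng = random.Random(0)
    print(outfitting_dropout([0.2, 0.7], drop_prob=0.1, rng=rng))
    print(guided_noise([0.0, 1.0], [1.0, 3.0], guidance_scale=2.0))
```

With a guidance scale of 1 the result equals the conditional prediction, and larger values push the output further toward the garment-conditioned direction, which is the sense in which the scale adjusts the strength of the garment features.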

AAAI 2025

OOTDiffusion: Outfitting Fusion based Latent Diffusion for Controllable Virtual Try-on

poster

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

Recently, patch deformation-based methods have demonstrated significant strength in multi-view stereo by adaptively expanding the receptive field of deformed patches to reconstruct textureless areas.
However, such methods mainly concentrate on searching for pixels without matching ambiguity (i.e., reliable pixels) when constructing deformed patches, while neglecting the deformation instability caused by unexpected edge-skipping, resulting in potential matching distortions.
To address this, we propose MSP-MVS, a method that introduces a multi-granularity segmentation prior for edge-confined patch deformation.
Specifically, to prevent unexpected edge-skipping, we first aggregate and further refine multi-granularity depth edges obtained from Semantic-SAM as a prior to guide patch deformation within depth-continuous (i.e., homogeneous) areas.
Moreover, to address the attention imbalance caused by edge-confined patch deformation, we implement adaptive equidistribution and disassemble-clustering of correlative reliable pixels (i.e., anchors), thereby promoting attention-consistent patch deformation.
Finally, to prevent deformed patches from falling into local-minimum matching costs caused by the fixed sampling pattern, we introduce disparity-sampling synergistic 3D optimization to help identify global-minimum matching costs.
Evaluations on the ETH3D and Tanks & Temples benchmarks show that our method achieves state-of-the-art performance with remarkable generalization.

MSP-MVS: Multi-Granularity Segmentation Prior Guided Multi-View Stereo

We consider a general and realistic scenario involving non-stationary time series, consisting of several offline intervals with different distributions within a fixed offline time horizon, and an online interval that continuously receives new samples. For non-stationary time series, the data distribution in the current online interval may have appeared in previous offline intervals. We theoretically explore the feasibility of applying knowledge from offline intervals to the current online interval. To this end, we propose the Mixture of Online and Offline Experts (MOOE). MOOE learns static offline experts from offline intervals and maintains a dynamic online expert for the current online interval. It then adaptively combines the offline and online experts using a meta expert to make predictions for the samples received in the online interval. Specifically, we focus on theoretical analysis, deriving parameter convergence, regret bounds, and generalization error bounds to prove the effectiveness of the algorithm.

Mixture of Online and Offline Experts for Non-stationary Time Series
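To make the idea of a meta expert that adaptively combines several experts more concrete, the sketch below uses the classic exponentially weighted average forecaster. The abstract does not state which combination rule MOOE uses, so the update rule, the squared loss, the class name `ExponentialWeightsMeta`, and the `learning_rate` parameter are all assumptions for illustration only.

```python
import math
from typing import List, Sequence


class ExponentialWeightsMeta:
    """Hypothetical meta expert: keeps one weight per expert and shifts
    weight toward experts with lower recent loss (exponential weights)."""

    def __init__(self, n_experts: int, learning_rate: float = 1.0) -> None:
        if n_experts <= 0:
            raise ValueError("n_experts must be positive")
        self.learning_rate = learning_rate
        self.weights: List[float] = [1.0 / n_experts] * n_experts

    def predict(self, expert_predictions: Sequence[float]) -> float:
        # Weighted average of the offline and online experts' predictions.
        return sum(w * p for w, p in zip(self.weights, expert_predictions))

    def update(self, expert_predictions: Sequence[float], target: float) -> None:
        # Squared loss per expert, then a multiplicative weight update.
        losses = [(p - target) ** 2 for p in expert_predictions]
        scaled = [w * math.exp(-self.learning_rate * loss)
                  for w, loss in zip(self.weights, losses)]
        total = sum(scaled)
        self.weights = [s / total for s in scaled]


if __name__ == "__main__":
    meta = ExponentialWeightsMeta(n_experts=2)
    predictions = [0.0, 1.0]  # e.g. one offline expert and the online expert
    print(meta.predict(predictions))
    meta.update(predictions, target=1.0)
    print(meta.weights)
```

In this toy run the second expert matches the target, so its weight grows after the update and the combined prediction moves toward it.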

Painterly image harmonization aims at seamlessly blending disparate visual elements within a single image. However, previous approaches often struggle due to limitations in training data or reliance on additional prompts, leading to inharmonious and content-disrupted output. To surmount these hurdles, we design a Training-and-prompt-Free General Painterly Harmonization method (TF-GPH). TF-GPH incorporates a novel “Similarity Disentangle Mask”, which disentangles the foreground content and background image by redirecting their attention to corresponding reference images, enhancing the attention mechanism for multi-image inputs. Additionally, we propose a “Similarity Reweighting” mechanism to balance harmonization between stylization and content preservation. This mechanism minimizes content disruption by prioritizing the content-similar features within the given background style reference. Finally, we address the deficiencies in existing benchmarks by proposing novel range-based evaluation metrics and a new benchmark to better reflect real-world applications. Extensive experiments demonstrate the efficacy of our method across benchmarks.

Training-and-Prompt-Free General Painterly Harmonization via Zero-Shot Disentanglement on Style and Content References

Sleep staging is important for monitoring sleep quality and diagnosing sleep-related disorders. Recently, numerous deep learning-based models have been proposed for automatic sleep staging using polysomnography recordings. Most of them are trained and tested on the same labeled datasets, which results in poor generalization to unseen target domains. Moreover, they regard the subjects in the target domains as a whole and overlook individual discrepancies, which limits the model's generalization ability to new patients (i.e., unseen subjects) and its plug-and-play applicability in clinics. To address this, we propose a novel Source-Free Unsupervised Individual Domain Adaptation (SF-UIDA) framework for sleep staging, leveraging sequential cross-view contrasting and pseudo-label based fine-tuning. It is a two-step subject-specific adaptation scheme that enables the source model to effectively adapt to a newly appearing unlabeled individual without access to the source data. It meets the practical needs of real-world scenarios, where personalized customization can be applied to new subjects in a plug-and-play manner. Our framework is applied to three classic sleep staging models and evaluated on three public sleep datasets, achieving state-of-the-art performance.

Personalized Sleep Staging Leveraging Source-free Unsupervised Domain Adaptation

Human-object interaction (HOI) detection aims to detect the spatial positions of human-object pairs and recognize their interactions. Existing single-branch, two-branch, and three-branch methods struggle to strike an appropriate trade-off among efficiency, multi-task decoupling, and collaborative learning, and they also fail to identify rare and complex interaction categories effectively. In this work, we propose a novel Efficient Mamba-based Disentangled Progressive Learning framework (HOIMamba) for HOI detection that absorbs the advantages of the three existing approaches and adaptively aggregates multi-level interaction semantics guided by cross-task bidirectional information contexts. Specifically, HOIMamba builds an efficient and effective decoder through cascaded Low-Rank Adaptations (LoRAs), with high efficiency, thorough decoupling of tasks, and good multi-task collaborative learning. Furthermore, to alleviate the difficulty of recognizing interactions in hard HOI samples, a novel Mamba-based comprehensive progressive learning strategy with Cross-enhance Mamba (CEM) blocks and Detection Context Propagation (DCP) blocks is designed to gradually excavate interaction-related discriminative cues from four levels. CEM blocks automatically aggregate context to generate diverse task-shared semantics and simultaneously realize the cross-task interaction between human and object branches, guiding the interaction branch to extract more expressive HOI representations. DCP blocks further transfer the comprehensive interaction context to the human and object branches to achieve rich and effective information exchange, helping the model discover more HOI instances. Extensive experimental results on two standard benchmarks demonstrate the effectiveness of our HOIMamba.

HOIMamba: Efficient Mamba-based Disentangled Progressive Learning for HOI Detection

When dealing with multi-view data, the heterogeneity of data attributes across different views often leads to label ambiguity. To effectively address this challenge, this paper designs a Multi-View Partial-Label Learning (MVPLL) framework, where each training instance is described by multiple view features and associated with a set of candidate labels, among which only one is correct. The key to dealing with this problem lies in how to effectively fuse multi-view information and accurately disambiguate the candidate labels. In this paper, we propose a novel approach named CFDM, which explores the consistency and complementarity of multi-view data through multi-view contrastive fusion and reduces label ambiguity through multi-class contrastive prototype disambiguation. Specifically, we first extract view-specific representations using multiple view-specific autoencoders, and then integrate multi-view information through both inter-view and intra-view contrastive fusion to enhance the distinctiveness of these representations. Afterwards, we utilize these distinctive representations to establish and update prototype vectors for each class within each view. Based on these, we apply contrastive prototype disambiguation to learn global class prototypes and accordingly reduce label ambiguity. In our model, multi-view contrastive fusion and multi-class contrastive prototype disambiguation mutually enhance each other within a coherent framework, leading to better classification performance. Experimental results on multiple datasets demonstrate that our proposed method is superior to other state-of-the-art methods.

CFDM: Contrastive Fusion and Disambiguation for Multi-View Partial-Label Learning

This paper considers the problem of *Multi-Hop Video Question Answering (MH-VidQA)* in long-form egocentric videos. This task requires not only answering visual questions, but also localizing multiple relevant time intervals within the video as visual evidence. We develop an automated pipeline to mine multi-hop question-answering pairs with associated temporal evidence, enabling the construction of a large-scale dataset for instruction-tuning. To monitor the progress of this new task, we further curate a high-quality benchmark, **MultiHop-EgoQA**, through meticulous manual verification and refinement. Our experiments reveal that existing multi-modal systems exhibit inadequate multi-hop grounding and reasoning abilities, resulting in unsatisfactory performance. We then propose a novel architecture, termed **GeLM**, to leverage the world knowledge reasoning capabilities of multi-modal large language models (LLMs), while incorporating a grounding module to retrieve temporal evidence in the video with flexible grounding tokens. Once trained on our constructed visual instruction data, **GeLM** demonstrates enhanced multi-hop grounding and reasoning capabilities, establishing a new baseline for this challenging task. Furthermore, when trained on third-view videos, the same architecture also achieves state-of-the-art performance on the existing single-hop VidQA benchmark, ActivityNet-RTL, showing the architecture's effectiveness.

Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos

In this work, we focus on semi-supervised learning for video action detection. Video action detection requires spatiotemporal localization in addition to classification, and a limited amount of labels makes the model prone to unreliable predictions. We present Stable Mean Teacher, a simple end-to-end student-teacher framework that benefits from improved and temporally consistent pseudo labels. It relies on a novel ErrOr Recovery (EoR) module, which learns from the student’s mistakes on labeled samples and transfers this knowledge to the teacher to improve pseudo labels for unlabeled samples. Moreover, existing spatiotemporal losses do not take temporal coherency into account and are prone to temporal inconsistencies. To overcome this, we present Difference of Pixels (DoP), a simple and novel constraint focused on temporal consistency, which leads to coherent temporal detections. We evaluate our approach on four different spatiotemporal detection benchmarks: UCF101-24, JHMDB21, AVA, and YouTube-VOS. Our approach outperforms the supervised baselines for action detection by an average margin of 23.5% on UCF101-24, 16% on JHMDB21, and 3.3% on AVA. Using merely 10% and 20% of the data, it provides competitive performance compared to the supervised baseline trained on 100% of the annotations on UCF101-24 and JHMDB21, respectively. We further evaluate its effectiveness on AVA for scaling to large-scale datasets and on YouTube-VOS for video object segmentation, demonstrating its generalization capability to other tasks in the video domain. We will make the code and models publicly available.

Stable Mean Teacher for Semi-supervised Video Action Detection
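For readers unfamiliar with student-teacher training, the sketch below shows the exponential moving average (EMA) update that Mean Teacher style frameworks commonly use to derive teacher weights from student weights. The abstract does not spell out this step, so the update rule, the flat list-of-floats parameter representation, and the `decay` value are assumptions for illustration; the EoR module and the DoP constraint are not modeled here.

```python
from typing import List, Sequence


def ema_update(teacher_params: Sequence[float],
               student_params: Sequence[float],
               decay: float) -> List[float]:
    """Return new teacher parameters as an exponential moving average:
    teacher = decay * teacher + (1 - decay) * student."""
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    if len(teacher_params) != len(student_params):
        raise ValueError("parameter lists must have the same length")
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_params, student_params)]


if __name__ == "__main__":
    teacher = [1.0, 0.0]
    student = [0.0, 1.0]
    # The teacher moves slowly toward the student after each training step.
    for step in range(3):
        teacher = ema_update(teacher, student, decay=0.9)
        print(step, teacher)
```

A decay close to 1 makes the teacher change slowly, which is the usual reason its pseudo labels are more stable than the student's raw predictions.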

Functional Magnetic Resonance Imaging (fMRI) data is a widely used kind of four-dimensional biomedical data that requires effective compression. However, fMRI compression poses unique challenges due to its intricate temporal dynamics, low signal-to-noise ratio, and complicated underlying redundancies. This paper reports a novel compression paradigm specifically tailored for fMRI data based on Implicit Neural Representation (INR). The proposed approach focuses on removing the various redundancies among the time series by employing several methods, including (i) conducting spatial correlation modeling for intra-region dynamics, (ii) decomposing reusable neuronal activation patterns, and (iii) using proper initialization together with nonlinear fusion to describe the inter-region similarity. This scheme appropriately incorporates the unique features of fMRI data, and experimental results on publicly available datasets demonstrate the effectiveness of the proposed method, surpassing state-of-the-art algorithms in both conventional image quality evaluation metrics and fMRI downstream tasks. This work paves the way for sharing massive fMRI data at low bandwidth and high fidelity. The source code will be released upon acceptance of the paper.

A Compact Implicit Neural Representation for Efficient Storage of Massive 4D Functional Magnetic Resonance Imaging

Scene Graph Generation (SGG) aims to detect all objects in a scene and identify their pairwise relationships. Because of the substantial human labor costs, existing scene graph annotations are often sparse and biased, which results in confused training on low-frequency predicates. In this work, we design a Semi-Supervised Clustering framework for Scene Graph Generation (SSC-SGG) that uses the sparse labeled data to guide the generation of effective pseudo-labels from unlabeled object pairs, thus enriching the labeled sample space, especially for low-frequency interaction samples. We approach the problem from the perspective of clustering, reducing the confirmation bias that arises in self-training. Specifically, we first enhance the model's robustness to feature extraction via prototype-based clustering, aggregating different relationship augmented features onto the same prototype. Secondly, we design a dynamic pseudo-label assignment algorithm based on a mini-batch, which adjusts the detection sensitivity to samples of different frequencies according to the historical assignment. Finally, we conduct joint training on the pseudo-labels and the labeled data. We conduct experiments on various SGG models and achieve substantial overall performance improvements, demonstrating the effectiveness of SSC-SGG.
