
Despite recent advancements in text-to-image generation, most existing methods struggle to create images with multiple objects and complex spatial relationships in the 3D world. To tackle this limitation, we introduce a generic AI system, namely MUSES, for 3D-controllable image generation from user queries. Specifically, MUSES addresses this challenging task with a progressive workflow of three key components: (1) a Layout Manager for 2D-to-3D layout lifting, (2) a Model Engineer for 3D object acquisition and calibration, and (3) an Image Artist for 3D-to-2D image rendering. By mimicking the collaboration of human professionals, this multi-modal agent pipeline facilitates the effective and automatic creation of images with 3D-controllable objects, through an explainable integration of top-down planning and bottom-up generation. Additionally, we find that existing benchmarks lack detailed descriptions of complex 3D spatial relationships among multiple objects. To fill this gap, we further construct a new benchmark, T2I-3DisBench (3D image scene), which describes diverse 3D image scenes with 50 detailed prompts. Extensive experiments show the state-of-the-art performance of MUSES on both T2I-CompBench and T2I-3DisBench, outperforming recent strong competitors such as DALL-E 3 and Stable Diffusion 3. These results demonstrate a significant step forward for MUSES in bridging natural language, 2D image generation, and the 3D world.

AAAI 2025

Muses: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration

synthesis

computational photography

video

image

poster

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

### [Invited Speakers](https://aaai.org/conference/aaai/aaai-25/aaai-25-invited-speakers/)

Register [here](https://aaai.org/conference/aaai/aaai-25/registration/)

Learning-based image dehazing methods have achieved remarkable results on synthetic haze datasets. However, due to the lack of real paired haze/clean data and robust features, most existing dehazing methods may leave residual haze and produce unrealistic colors under real-world haze conditions. In this paper, we propose a novel Structure Guided Dehazing Network (SGDN) that operates in both RGB and YCbCr color spaces for real-world dehazing. Our SGDN comprises two meticulously designed key modules: the Bi-Color Cooperative Bridge (BCB) and the Guidance Enhancement Module (GEM). Specifically, the BCB consists of the Phase Integration Module (PIM) and the Interactive Attention Module (IAM), which exploit the rich texture features of the YCbCr space to guide the RGB space to recover clearer features in both frequency and spatial domains. To ensure the tonal consistency of the images, the GEM further enhances the color perception of RGB features by aggregating the channel information of YCbCr features. As a result, our method surpasses existing state-of-the-art methods across multiple real-world smoke/haze datasets. Finally, for effective supervised learning, we create a new trainable dataset called the Real-World Well-Aligned Haze Dataset (RWAHD), which includes a variety of scenes from different geographical regions and climate conditions.

Guided Real Image Dehazing Using YCbCr Color Space
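
The abstract above relies on moving between the RGB and YCbCr color spaces. As a minimal sketch of that conversion only (not of SGDN itself), the per-pixel mapping can be written as follows. The coefficients are an assumption: they are the full-range ITU-R BT.601 values used by JFIF, and the paper may use a different YCbCr variant. The function name `rgb_to_ycbcr` is hypothetical.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to full-range YCbCr.

    Assumption: ITU-R BT.601 coefficients as used by JFIF; this is an
    illustration of the color space, not code from the SGDN paper.
    """
    # Luma: a weighted sum of the three channels (weights sum to 1).
    y = 0.299 * r + 0.587 * g + 0.114 * b
    # Chroma: blue-difference and red-difference, offset to center at 128.
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


# Any neutral gray pixel has no color difference, so both chroma
# channels sit at the 128 midpoint and only Y carries information.
y, cb, cr = rgb_to_ycbcr(200, 200, 200)
print(round(y, 3), round(cb, 3), round(cr, 3))
```

Separating luma (Y) from chroma (Cb, Cr) in this way is what lets a method treat structure and color as distinct signals.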

Few-shot learning (FSL) aims to recognize new concepts using a limited number of visual samples. Existing approaches attempt to incorporate semantic information into the limited visual data for category understanding. However, these methods often enrich class-level feature representations with abstract category names, failing to capture the nuanced features essential for effective generalization. To address this issue, we propose a novel framework for FSL, which incorporates both the abstract class semantics and the concrete class entities extracted from Large Language Models (LLMs), to enhance the representation of the class prototypes. Specifically, our framework comprises a Semantic-guided Visual Pattern Extraction (SVPE) module and a Prototype-Calibration (PC) module, where the SVPE module extracts semantic-aware visual patterns across diverse scales, while the PC module integrates these patterns to refine the visual prototype, enhancing its representativeness. Extensive experiments on four few-shot classification benchmarks and the BSCD-FSL cross-domain benchmarks show clear improvements over the current state-of-the-art methods. Notably, for the challenging one-shot setting, our approach, using the ResNet-12 backbone, achieves an average improvement of 1.95% over the second-best competitor.

Envisioning Class Entity Reasoning by Large Language Models for Few-shot Learning

Writing comprehensive and accurate descriptions of technical drawings in patent documents is crucial for effective knowledge sharing and for enabling replication and protection of intellectual property. However, the automation of this task has largely been overlooked by the research community. To this end, we introduce PatentDesc-355K, a novel large-scale dataset containing ∼355K patent images along with their brief and detailed textual descriptions extracted from 60K+ US patent documents. Further, we propose PatentLMM, a novel multimodal large language model specifically tailored for generating high-quality descriptions of patent figures. Our proposed PatentLMM comprises two key components: (i) PatentMME, a specialized multi-modal vision encoder that captures the unique structural elements of patent figures, and (ii) PatentLLaMA, a domain-adapted version of LLaMA fine-tuned on a large collection of patents. Extensive experiments demonstrate that training a vision encoder specifically designed for patent figures significantly boosts performance and yields more coherent descriptions than fine-tuning similar-sized off-the-shelf multi-modal models. PatentDesc-355K and PatentLMM pave the way for automating patent figure understanding, enabling efficient knowledge sharing and faster drafting of patent documents. We make the code and data publicly available.

PatentLMM: Large Multimodal Model for Generating Descriptions for Patent Figures

Variational autoencoders perform well in community detection on static networks, but they are difficult to extend directly to continuous dynamic networks. The main reason is that traditional methods rely mainly on adjacency structures to complete the inference and generation processes. However, continuous dynamic networks cannot be described by such structures, because the inherent temporal and causal information of the network would be lost. To address this issue, we propose a novel variational autoencoder, CT-VAE, for community detection in continuous dynamic networks, along with its scalable variant, CT-CAVAE. By conceptualizing node interactions as event streams, adopting the Hawkes process to capture temporal dynamics and causality, and incorporating both into the inference process, CT-VAE effectively extends the traditional inference approach to continuous dynamic networks. Additionally, in the generation phase, CT-VAE combines pseudo-labeling and compact constraint strategies to facilitate the reconstruction of non-adjacent structures. For the scalable variant, CT-CAVAE, end-to-end community detection is achieved by incorporating a Gaussian mixture distribution. Extensive experimental results demonstrate that the proposed CT-VAE and CT-CAVAE achieve more favorable performance than state-of-the-art baselines.

Community-Aware Variational Autoencoder for Continuous Dynamic Networks
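
The abstract above models node interactions as event streams with a Hawkes process. As a rough illustration of what that means, the sketch below evaluates the conditional intensity of a univariate Hawkes process, where each past event temporarily raises the rate of future events. The exponential decay kernel and the values of `mu`, `alpha`, and `beta` are assumptions chosen for illustration; the abstract does not state the kernel or parameterization used by CT-VAE.

```python
import math


def hawkes_intensity(t, event_times, mu=0.1, alpha=0.5, beta=1.0):
    """Conditional intensity lambda(t) = mu + sum_i alpha * exp(-beta * (t - t_i)).

    mu    : base rate of events with no history (hypothetical value)
    alpha : jump in intensity caused by each past event (hypothetical value)
    beta  : decay rate of that excitation (hypothetical value)
    Only events strictly before t contribute, which encodes causality.
    """
    excitation = sum(
        alpha * math.exp(-beta * (t - t_i)) for t_i in event_times if t_i < t
    )
    return mu + excitation


# Interaction timestamps between a pair of nodes (made-up data).
events = [0.0, 0.4, 0.5]
# The intensity is higher shortly after a burst of events than long after it.
print(hawkes_intensity(0.6, events) > hawkes_intensity(5.0, events))
```

An adjacency matrix would record only that these two nodes interacted; the intensity keeps the timing and ordering of the interactions.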

Recent face forgery detection methods based on disentangled representation learning utilize paired images for cross-reconstruction, aiming to extract forgery-relevant attributes and forgery-irrelevant content. However, the following issues may still compromise detector performance: 1) using information-dense images as the decoupling targets increases the decoupling difficulty; 2) the extracted attribute features are reconstruction-irrelevant rather than forgery-relevant, and single-scale forgery representation decoupling cannot capture sufficient discriminative information; 3) the generalization performance of decoupled attribute features is poor, as the detector focuses on learning specific artifact types in the training set. To address these issues, we propose a novel disentangled representation learning framework for deepfake detection. First, we extract features by partitioning the dense information within the image, focusing independently on texture, color, or edges. These features, rather than the images themselves, are then used as the decoupling targets, which mitigates the decoupling difficulty. Second, we extend the reconstruction loss from image level to feature level, thus extending the forgery representation decoupling from single-scale to multi-scale. Third, we propose a critical forgetting mechanism that forces the detector to forget the most salient features during training, which correspond to specific forgery artifact types in the training set. Extensive experimental results validate the efficacy of the proposed method.

Critical Forgetting-Based Multi-Scale Disentanglement for Deepfake Detection

This paper develops a Versatile and Honest vision language Model (VHM) for remote sensing image analysis. VHM is built on a large-scale remote sensing image-text dataset with rich-content captions (VersaD), and an honest instruction dataset comprising both factual and deceptive questions (HnstD). Unlike prevailing remote sensing image-text datasets, in which image captions focus on a few prominent objects and their relationships, VersaD captions provide detailed information about image properties, object attributes, and the overall scene. This comprehensive captioning enables VHM to thoroughly understand remote sensing images and perform diverse remote sensing tasks. Moreover, different from existing remote sensing instruction datasets that only include factual questions, HnstD contains additional deceptive questions stemming from the non-existence of objects. This feature prevents VHM from producing affirmative answers to nonsense queries, thereby ensuring its honesty. In our experiments, VHM significantly outperforms various vision language models on common tasks of scene classification, visual question answering, and visual grounding. Additionally, VHM achieves competent performance on several unexplored tasks, such as building vectorizing, multi-label classification and honest question answering.

VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis

With the growing demand for solutions to real-world video challenges, interest in dense video captioning (DVC) has been on the rise. DVC involves the automatic captioning and localization of events in untrimmed videos. Several studies highlight the challenges of DVC and introduce improved methods utilizing prior knowledge such as pre-training and external memory. In this research, we propose a model that leverages the prior knowledge of human-oriented hierarchical dense memory, inspired by human memory hierarchy and cognition. To mimic human-like memory recall, we construct a hierarchical memory and a hierarchical memory reading module. We build an efficient hierarchical dense memory by clustering memory events and summarizing them with large language models. Comparative experiments demonstrate that this hierarchical memory recall process improves DVC, achieving state-of-the-art performance on the YouCook2 and ViTT datasets.

HiCM²: Hierarchical Compact Memory Modeling for Dense Video Captioning

Recently, deep learning-based methods have revolutionized remote sensing image segmentation. However, these methods usually rely on a predefined set of semantic classes and thus need additional image annotation and model training when adapting to new classes. More importantly, they are unable to segment arbitrary semantic classes. In this work, we introduce Open-Vocabulary Remote Sensing Image Semantic Segmentation (OVRSISS), which aims to segment arbitrary semantic classes in remote sensing images. To address the lack of OVRSISS datasets, we develop LandDiscover50K, a comprehensive dataset of 51,846 images covering 40 diverse classes. In addition, we propose a novel framework named GSNet that integrates domain priors from special remote sensing models and the versatile capabilities of general vision-language models. Technically, GSNet consists of a Dual Stream Image Encoder (DSIE), a Query-Guided Feature Fusion (QGFF) module, and a Residual Information Preservation Decoder (RIPD). DSIE first captures comprehensive features from both special models and general models in dual streams. Then, with the guidance of variable vocabularies, QGFF integrates specialist and generalist features, enabling them to complement each other. Finally, RIPD aggregates multi-source features for more accurate mask predictions. Experiments show that our method outperforms other methods by a large margin, and that our proposed LandDiscover50K improves the performance of OVRSISS methods. The proposed dataset and method will be made publicly available.

Towards Open-Vocabulary Remote Sensing Image Semantic Segmentation

Face retouching aims to remove facial imperfections from images and videos while preserving face attributes. Existing methods are designed to perform non-interactive end-to-end retouching, whereas the ability to interact with users is in high demand in downstream applications. In this paper, we propose RetouchGPT, a novel framework that leverages Large Language Models (LLMs) to guide the interactive retouching process. To this end, we design an instruction-driven imperfection prediction module to accurately identify imperfections by integrating textual and visual features. To learn imperfection prompts, we further incorporate an LLM-based embedding module to fuse multi-modal conditioning information. The prompt-based feature modification is performed in each transformer block, such that the imperfection features are progressively suppressed and replaced with the features of normal skin. Extensive experiments verify the effectiveness of our design elements and demonstrate that RetouchGPT is a useful tool for interactive face retouching and achieves superior performance over state-of-the-art methods.

RetouchGPT: LLM-based Interactive High-Fidelity Face Retouching via Imperfection Prompting

Hyperdimensional computing (HDC) is an approach for solving cognitive information processing and a variety of learning tasks using data represented as high-dimensional vectors. The technique has a rigorous mathematical backing and is easy to implement in energy-efficient and highly parallel hardware like FPGAs and "processing-in-memory" architectures. The success of HDC-based machine learning approaches is heavily dependent on the mapping of raw data to high-dimensional space. In this work, we propose NysHD, a new method for constructing this mapping that is based on the Nyström method from the literature on kernel approximation. Our approach provides a simple recipe to turn any user-defined positive-semidefinite similarity function into an equivalent mapping in HDC. There is a vast literature on the design of such functions for learning problems. Our approach provides a mechanism to import them into the HDC setting, expanding the types of problems that can be tackled using HDC. Empirical evaluation against existing HDC encoding methods shows that NysHD can achieve, on average, 11% and 17% better classification accuracy on graph and string datasets, respectively.
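
To make the Nyström idea in the abstract above concrete, the following is a minimal sketch of a generic Nyström feature map: given a similarity function and a few landmark points, it produces explicit vectors whose dot products approximate the similarity. This is the textbook construction, not NysHD itself; the abstract does not describe how NysHD turns such features into hypervectors. The RBF kernel, the landmark points, and all function names here are assumptions for illustration, and a Cholesky factor is used as the matrix square root.

```python
import math


def rbf(x, y, gamma=0.5):
    """Example positive-semidefinite similarity: Gaussian (RBF) kernel."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def cholesky(a):
    """Lower-triangular L with L @ L.T == a, for symmetric positive-definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def nystrom_features(x, landmarks, low, kernel=rbf):
    """Solve L @ phi = k_x, so that phi(x) . phi(y) = k_x^T K^{-1} k_y."""
    k_x = [kernel(x, z) for z in landmarks]
    phi = []
    for i, row in enumerate(low):
        s = sum(row[k] * phi[k] for k in range(i))
        phi.append((k_x[i] - s) / row[i])
    return phi


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Hypothetical landmark points; in practice they are sampled from the data.
landmarks = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
gram = [[rbf(a, b) for b in landmarks] for a in landmarks]
low = cholesky(gram)

# Compare the approximate similarity with the exact kernel value.
p, q = (0.2, 0.1), (0.9, 0.3)
approx = dot(nystrom_features(p, landmarks, low), nystrom_features(q, landmarks, low))
print(approx, rbf(p, q))
```

On the landmark points themselves the approximation reproduces the kernel exactly; elsewhere its quality depends on how well the landmarks cover the data.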
