United States

Human mesh recovery (HMR) is crucial in many computer vision applications, from health to entertainment. HMR from monocular images has predominantly been addressed by deterministic methods that output a single prediction for a given $2D$ image. However, HMR from a single image is an ill-posed problem due to depth ambiguity and occlusions. Probabilistic methods have attempted to address this by generating and fusing multiple plausible $3D$ reconstructions, but their performance has often lagged behind deterministic approaches. In this paper, we introduce $\textbf{GenHMR}$, a novel generative framework that reformulates monocular HMR as an image-conditioned generative task, explicitly modeling and mitigating uncertainties in the $2D \rightarrow 3D$ mapping process. GenHMR comprises two key components: (1) $\textbf{a pose tokenizer}$ to convert $3D$ human poses into a sequence of discrete tokens in a latent space, and (2) $\textbf{an image-conditional masked transformer}$ to learn the probabilistic distributions of the pose tokens, conditioned on the input image prompt along with the randomly masked token sequence. During $\textit{inference}$, the model samples from the learned conditional distribution to iteratively decode high-confidence pose tokens, thereby reducing $3D$ reconstruction uncertainties. To further refine the reconstruction, a $2D$ pose-guided refinement technique is proposed to directly fine-tune the decoded pose tokens in the latent space, which forces the projected $3D$ body mesh to align with the $2D$ pose clues. Experiments on benchmark datasets demonstrate that GenHMR significantly outperforms state-of-the-art methods. The project website can be found at \url{https://anonymous-ai-model.github.io/GenHMR/}
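The iterative decoding of high-confidence tokens described in the abstract can be pictured with a small sketch. Everything here is an assumption for illustration: `toy_logits` is a hypothetical stand-in that returns random logits instead of an image-conditioned transformer, the linear unmasking schedule is a guess, and the token count and vocabulary size are arbitrary. It shows only the general loop of sampling, scoring confidence, and keeping the most confident tokens, not GenHMR's actual procedure.

```python
import numpy as np

MASK = -1  # hypothetical sentinel for a still-masked token position


def toy_logits(tokens, vocab_size, rng):
    """Hypothetical stand-in for the masked transformer: random logits."""
    return rng.normal(size=(len(tokens), vocab_size))


def iterative_decode(num_tokens=16, vocab_size=32, steps=4, seed=0):
    """Fill a fully masked sequence over several steps, most confident first."""
    rng = np.random.default_rng(seed)
    tokens = np.full(num_tokens, MASK, dtype=np.int64)
    for step in range(1, steps + 1):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        # Softmax over the vocabulary for every position.
        logits = toy_logits(tokens, vocab_size, rng)
        logits -= logits.max(axis=1, keepdims=True)
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        # Sample one candidate token per position and record its probability.
        sampled = np.array([rng.choice(vocab_size, p=p) for p in probs])
        confidence = probs[np.arange(num_tokens), sampled]
        # Assumed linear schedule: by the last step every position is filled.
        target_filled = int(np.ceil(num_tokens * step / steps))
        num_keep = max(target_filled - (num_tokens - masked.size), 0)
        # Commit only the most confident candidates among masked positions.
        best = masked[np.argsort(-confidence[masked])[:num_keep]]
        tokens[best] = sampled[best]
    return tokens


decoded = iterative_decode()
```

Positions that are not committed in a given step stay masked and are re-predicted in the next step, which is what lets later predictions condition on the tokens already accepted.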

AAAI 2025

GenHMR: Generative Human Mesh Recovery

motion tracking

poster

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

### [Invited Speakers](https://aaai.org/conference/aaai/aaai-25/aaai-25-invited-speakers/)

Register [here](https://aaai.org/conference/aaai/aaai-25/registration/)

Learning-based methods have become increasingly popular in 3D indoor scene synthesis (ISS), showing superior performance over traditional optimization-based approaches. These learning-based methods typically model distributions on simple yet explicit scene representations using generative models. However, due to the oversimplified explicit representations that overlook detailed information and the lack of guidance from multimodal relationships within the scene, most learning-based methods have struggled with generating realistic and diverse indoor scenes. In this paper, we introduce a new method, Scene Implicit Neural Field (S-INF), for indoor scene synthesis, aiming to learn meaningful representations of multimodal relationships, in order to enhance the diversity and realism of indoor scenes. S-INF directly extracts more latent advantageous features from the entire scene in a multi-scale manner, effectively capturing multimodal relationships. Furthermore, by learning specialized scene layout relationships and projecting them into S-INF, we achieve realistic generation of scene layout. Additionally, S-INF captures dense and detailed object relationships through differentiable rendering, ensuring stylistic consistency across objects. Through extensive experiments on the benchmark 3D-FRONT dataset, we demonstrate that our method consistently achieves state-of-the-art performance under different settings for the indoor scene synthesis task.

S-INF: Towards Realistic Indoor Scene Synthesis via Scene Implicit Neural Field

As a fundamental method in economics and finance, the factor model has been extensively utilized in quantitative investment. In recent years, there has been a paradigm shift from traditional linear models with expert-designed factors to more flexible nonlinear machine learning-based models with data-driven factors, aiming to enhance the effectiveness of these factor models. However, due to the low signal-to-noise ratio in market data, mining effective factors in data-driven models remains challenging. In this work, we propose a hypergraph-based factor model with temporal residual contrastive learning (FactorGCL) that employs a hypergraph structure to better capture high-order nonlinear relationships among stock returns and factors. To mine hidden factors that supplement human-designed prior factors for predicting stock returns, we design a cascading residual hypergraph architecture, in which the hidden factors are extracted from the residual information after removing the influence of prior factors. Additionally, we propose a temporal residual contrastive learning method to guide the extraction of effective and comprehensive hidden factors by contrasting stock-specific residual information over different time periods. Our extensive experiments on real stock market data demonstrate that FactorGCL not only outperforms existing state-of-the-art methods but also mines effective hidden factors for predicting stock returns.

FactorGCL: A Hypergraph-Based Factor Model with Temporal Residual Contrastive Learning for Stock Returns Prediction

Personalized image generation has made significant strides in adapting content to novel concepts. However, a persistent challenge remains: balancing the accurate reconstruction of unseen concepts with the need for editability according to the prompt, especially when dealing with the complex nuances of facial features. In this study, we delve into the temporal dynamics of the text-to-image conditioning process, emphasizing the crucial role of stage partitioning in introducing new concepts. We present PersonaMagic, a stage-regulated generative technique designed for high-fidelity face customization. Using a simple MLP network, our method learns a series of embeddings within a specific timestep interval to capture face concepts. Additionally, we develop a Tandem Equilibrium mechanism that adjusts self-attention responses in the text encoder, balancing text description and identity preservation, leading to improvements in both areas. Extensive experiments confirm the superiority of PersonaMagic over state-of-the-art methods in both qualitative and quantitative evaluations. Moreover, its robustness and flexibility are validated in non-facial domains, making it a valuable plug-in for enhancing the performance of pretrained personalization models.

PersonaMagic: Stage-Regulated High-Fidelity Face Customization with Tandem Equilibrium

The integration of RGB and depth modalities significantly enhances the accuracy of segmenting complex indoor scenes, with depth data from RGB-D cameras playing a crucial role in this improvement. However, collecting an RGB-D dataset is more expensive than an RGB dataset due to the need for specialized depth sensors. Aligning depth and RGB images also poses challenges due to sensor positioning and issues like missing data and noise. In contrast, Pseudo Depth (PD) from high-precision depth estimation algorithms can eliminate the dependence on RGB-D sensors and alignment processes, as well as provide effective depth information and show significant potential in semantic segmentation. Therefore, to explore the practicality of utilizing pseudo depth instead of real depth for semantic segmentation, we design an RGB-PD segmentation pipeline to integrate RGB and pseudo depth and propose a Pseudo Depth Aggregation Module (PDAM) for fully exploiting the informative clues provided by the diverse pseudo depth maps. The PDAM aggregates multiple pseudo depth maps into a single modality, making it easily adaptable to other RGB-D segmentation methods. In addition, the pre-trained diffusion model serves as a strong feature extractor for RGB segmentation tasks, but multi-modal diffusion-based segmentation methods remain unexplored. Therefore, we present a \textbf{Pseudo Depth Diffusion Model (PDDM)} that adopts a large-scale text-image diffusion model as a feature extractor and a simple yet effective fusion strategy to integrate pseudo depth. To verify the applicability of pseudo depth and our PDDM, we perform extensive experiments on the NYUv2 and SUNRGB-D datasets. The experimental results demonstrate that pseudo depth can effectively enhance segmentation performance, and our PDDM achieves state-of-the-art performance, outperforming other methods by +6.98 mIoU on NYUv2 and +2.11 mIoU on SUNRGB-D.

PDDM: Pseudo Depth Diffusion Model for RGB-PD Semantic Segmentation Based in Complex Indoor Scenes

Source-free unsupervised domain adaptation aims to eliminate domain shifts when data from the source domain and annotation from the target domain are not available.
Multi-object detection in medical image analysis is constrained by patient privacy and extremely high annotation costs. Hence, source-free UDA is considered a more practical approach for eliminating the domain gap. However, research on this topic remains scarce.
In this paper, we design AATS, an Anatomy-aware Alignment Teacher-Student learning method that uses topological consistency within a mean-teacher framework for source-free UDA in multi-object medical detection. It comprises Unsupervised Structure Refinement (USR) and Graph-aware Morphology Alignment (GMA).
To match the student and teacher on low-level visual features, we propose USR, which uses an unsupervised clustering algorithm to group organs in ultrasound images.
Based on USR, we obtain a graph of organ relations on the teacher branch.
In the student branch, we acquire visual features to construct a graph space and optimize the model with graph propagation.
Finally, GMA aligns the teacher and student using both topology and morphology information derived from prior medical knowledge.
Four groups of adaptation experiments were conducted on publicly available medical datasets, and the outcomes demonstrate that our approach not only achieves state-of-the-art performance but also provides substantial advantages over existing methods.
The codes and datasets will be publicly available.

Leveraging Anatomical Consistency for Multi-Object Detection in Ultrasound Images via Source-free Unsupervised Domain Adaptation

In medical image analysis, detecting multiple structures is crucial for evaluation and diagnosis but is often limited by the lack of high-quality annotations. Semi-supervised object detection is a potent way to enhance model performance and generalization by leveraging a large pool of unlabeled data alongside a minimal set of labeled data. A striking observation is that both unlabeled and labeled medical images contain a priori anatomical knowledge from human screening. In this work, we introduce a novel semi-supervised approach named Semi-akmm for mining and matching anatomical knowledge in ultrasound images. We develop an Adaptive Prior Knowledge Transfer (APKT) module to mine and explore the distribution and knowledge of potential proposal boxes through a proposal proportion constraint. Furthermore, within a teacher-student learning framework, we put forward an Anatomical Structure Matching (ASM) module to facilitate co-learning of consistent topological prior knowledge between the student and teacher models. To our knowledge, this is the first efficient semi-supervised medical multi-structure detection model. Our experiments across five publicly available ultrasound datasets demonstrate that Semi-akmm outperforms existing methods. The code will be publicly available.

Anatomical Knowledge Mining and Matching for Semi-supervised Medical Multi-structure Detection

Visual text generation, which aims to generate photo-realistic images with coherent and well-formed scene text being rendered, has attracted widespread attention. 
Although recent works have achieved promising performance, their limited flexibility and controllability hinder practical applications.
We observe that different from natural objects, visual text in real scenes often has an arbitrarily shaped structure with different granularities (i.e., character, word, or line).
In this paper, we consider the modality gap between image and text, and propose a new separation and composition pipeline for flexible and controllable visual text generation from only text prompts.
At the core of our framework is a novel Hierarchical and Directional Layout representation, i.e., HDLayout, which can model the sequential and multi-granularity nature of the visual text.
Under this formulation, we are able to generate arbitrarily shaped visual text automatically. 
Extensive experiments demonstrate that our method outperforms several strong baselines in a variety of scenarios both qualitatively and quantitatively, yielding state-of-the-art performances on arbitrarily shaped visual text generation.

HDLayout: Hierarchical and Directional Layout Planning for Arbitrary Shaped Visual Text Generation

Visual Reinforcement Learning (RL) facilitates learning directly from raw images; however, the domain gap between training and testing environments frequently leads to a decline in performance within unseen environments. In this paper, we propose Fourier Guided Adaptive Adversarial Augmentation (FGA3), a novel augmentation method that maintains semantic consistency. We focus on style augmentation in the frequency domain by keeping the phase and altering the amplitude to preserve the state of the original data. For adaptive adversarial perturbation, we reformulate the worst-case problem to RL by employing adversarial example training, which leverages value loss and cosine similarity within a semantic space. Moreover, our findings illustrate that cosine similarity is effective in quantifying feature distances within a semantic space. Extensive experiments on DMControl-GB and Procgen have shown that FGA3 is compatible with a wide range of visual RL algorithms, both off-policy and on-policy, and significantly improves the robustness of the agent in unseen environments.
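The idea of keeping the phase and altering the amplitude can be illustrated with a short NumPy sketch. This is not FGA3 itself: the adversarial, adaptive perturbation is replaced here by a plain amplitude blend with a second image, and the function name `mix_amplitude` and the blend weight `lam` are assumptions made for the example.

```python
import numpy as np


def mix_amplitude(image, style_image, lam=0.5):
    """Blend Fourier amplitudes of two images while keeping the first one's phase.

    Stand-in for an amplitude perturbation; `lam` = 0 returns the input image.
    """
    spec = np.fft.fft2(image)
    style_spec = np.fft.fft2(style_image)
    amplitude = np.abs(spec)
    phase = np.angle(spec)  # phase carries the structure we want to preserve
    mixed_amplitude = (1.0 - lam) * amplitude + lam * np.abs(style_spec)
    mixed_spec = mixed_amplitude * np.exp(1j * phase)
    # Both amplitudes are symmetric for real inputs, so the result is real
    # up to floating-point error; np.real drops that residue.
    return np.real(np.fft.ifft2(mixed_spec))


rng = np.random.default_rng(0)
image = rng.random((8, 8))        # toy single-channel "observation"
style_image = rng.random((8, 8))  # toy source of a different amplitude
augmented = mix_amplitude(image, style_image, lam=0.7)
```

Because only the amplitude changes, the augmented image has the same Fourier phase as the original, which is the property the abstract relies on to preserve the state information.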

Fourier Guided Adaptive Adversarial Augmentation for Generalization in Visual Reinforcement Learning

Despite the advanced long-sequence modeling of Mamba, which has expanded its applications in image restoration, there remains a lack of exploration combining its strengths with the specific characteristics of JPEG image restoration, where high-frequency components are lost after the Discrete Cosine Transform (DCT). To address this, we introduce DCTMamba, a new framework designed to apply Mamba more effectively to JPEG image restoration. Specifically, our method integrates the DCT into Mamba to establish sequential scanning from lower to higher frequencies, enabling the network to first reconstruct coarse structures and then progressively refine the image with more intricate details. Furthermore, recognizing the variable frequency distributions that arise from DCT transformations across different image sizes, we have developed Scale-Adaptive Normalization to manage these variations. Comprehensive experiments confirm that DCTMamba significantly outperforms existing solutions, achieving high fidelity in both coarse structures and fine details.
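A low-to-high frequency scan over DCT coefficients can be sketched as follows. The abstract does not specify the scan, so the diagonal (zigzag-like) ordering by `u + v` is an assumption borrowed from how JPEG orders coefficients, and the helper names `dct_matrix`, `low_to_high_order`, and `dct_sequence` are hypothetical.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n matrix."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)  # the DC row uses a different scale
    return m


def low_to_high_order(n):
    """Coefficient positions sorted by diagonal index u + v (assumed scan)."""
    positions = ((u, v) for u in range(n) for v in range(n))
    return sorted(positions, key=lambda p: (p[0] + p[1], p[0]))


def dct_sequence(block):
    """2D DCT of a square block, flattened from low to high frequency."""
    n = block.shape[0]
    d = dct_matrix(n)
    coeffs = d @ block @ d.T
    order = low_to_high_order(n)
    return np.array([coeffs[u, v] for u, v in order]), order


# A constant block has all of its energy in the first (DC) element.
sequence, order = dct_sequence(np.full((8, 8), 2.0))
```

Feeding such a sequence to a sequential model means coarse structure (low frequencies) is seen before fine detail (high frequencies), which matches the coarse-to-fine behaviour described above.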

DCTMamba: Advancing JPEG Image Restoration through Long-Sequence Modeling and Adaptive Frequency Strategy

Large Language Models (LLMs) have revolutionized natural language processing tasks. However, their practical application is constrained by substantial memory and computational demands. Post-training quantization (PTQ) is considered an effective method to accelerate LLM inference. Despite its growing popularity in LLM model compression, PTQ deployment faces two major challenges. First, low-bit quantization leads to performance degradation. Second, because GPUs offer only a limited set of integer computing unit types, quantized matrix operations with different precisions cannot be effectively accelerated. To address these issues, we introduce a novel arbitrary-bit quantization algorithm and inference framework, ABQ-LLM. It achieves superior performance across various quantization settings and enables efficient arbitrary-precision quantized inference on the GPU. ABQ-LLM introduces several key innovations: (1) a distribution correction method for transformer blocks to mitigate distribution differences caused by full quantization of weights and activations, improving performance at low bit-widths; (2) a bit balance strategy to counteract performance degradation from asymmetric distribution issues at very low bit-widths (e.g., 2-bit); and (3) an innovative quantization acceleration framework that reconstructs quantized matrix multiplication for arbitrary precision combinations from BTC (Binary TensorCore) equivalents, removing the limitations of INT4/INT8 computing units. ABQ-LLM can convert the bit-width gain of each component into actual acceleration gain, maximizing performance under mixed precision (e.g., W6A6, W2A8). With the W2*A8 quantization configuration on the LLaMA-7B model, it achieves a WikiText2 perplexity of 7.59 (2.17$\downarrow$ vs. 9.76 for AffineQuant). Compared to SmoothQuant, it realizes a 1.6$\times$ acceleration improvement and a 2.7$\times$ memory compression gain.
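The arithmetic identity behind building arbitrary-precision integer matrix multiplication from binary operations can be checked with a few lines of NumPy. This is only a sketch under simplifying assumptions: operands are unsigned integers with no scales or zero points, and the 1-bit products are ordinary matrix products rather than GPU Binary TensorCore instructions. The function names are hypothetical and this is not ABQ-LLM's kernel.

```python
import numpy as np


def bit_planes(x, bits):
    """Split unsigned integers into binary planes, least significant first."""
    return [(x >> b) & 1 for b in range(bits)]


def bitplane_matmul(w, a, w_bits, a_bits):
    """Integer matmul rebuilt from 1-bit matmuls, each weighted by 2**(i + j)."""
    out = np.zeros((w.shape[0], a.shape[1]), dtype=np.int64)
    for i, w_plane in enumerate(bit_planes(w, w_bits)):
        for j, a_plane in enumerate(bit_planes(a, a_bits)):
            # Each term is a product of two 0/1 matrices, shifted into place.
            out += (w_plane @ a_plane) << (i + j)
    return out


rng = np.random.default_rng(0)
w = rng.integers(0, 2**2, size=(4, 16), dtype=np.int64)  # 2-bit weights
a = rng.integers(0, 2**8, size=(16, 3), dtype=np.int64)  # 8-bit activations
result = bitplane_matmul(w, a, w_bits=2, a_bits=8)
```

A W2A8 product needs 2 x 8 = 16 binary matmuls in this decomposition, so the work scales with the product of the two bit-widths. That is why, in principle, every bit removed from a component can translate into less computation.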
