United States

Spatial contexts, such as the backgrounds and surroundings, are considered critical in Human-Object Interaction (HOI) recognition, especially when the instance-centric foreground is blurred or occluded. Recent HOI detectors are usually built upon detection transformer pipelines. While such an object-detection-oriented paradigm shows promise in localizing objects, its exploration of spatial context is often insufficient for accurately recognizing human actions. To enhance the capabilities of object detectors for HOI detection, we present a dual-branch framework named ContextHOI, which efficiently captures both object detection features and spatial contexts. In the context branch, we train the model to extract informative spatial context without requiring additional hand-crafted background labels. Furthermore, we introduce context-aware spatial and semantic supervision to the context branch to filter out irrelevant noise and capture informative contexts. ContextHOI achieves state-of-the-art performance on the HICO-DET and V-COCO benchmarks. For further validation, we construct a novel benchmark, HICO-$ambiguous$, which is a subset of HICO-DET that contains images with occluded or impaired instance cues. Extensive experiments across all benchmarks, complemented by visualizations, underscore the enhancements provided by ContextHOI, especially in recognizing interactions involving occluded or blurred instances.

AAAI 2025

ContextHOI: Spatial Context Learning for Human-Object Interaction Detection

scene analysis understanding

poster

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

### [Invited Speakers](https://aaai.org/conference/aaai/aaai-25/aaai-25-invited-speakers/)

Register [here](https://aaai.org/conference/aaai/aaai-25/registration/)

Generating high-quality whole-body human-object interaction motion sequences is becoming increasingly important in various fields such as animation, VR/AR, and robotics. The main challenge of this task lies in determining the level of involvement of each hand given the complex shapes of objects of different sizes and their different motion trajectories, while ensuring strong grasping realism and guaranteeing the coordination of movement in all body parts. In contrast to existing work, which either generates human interaction motion sequences without detailed hand grasping poses or only models a static grasping pose, we propose a simple yet effective framework that jointly models the relationship between the body, hands, and the given object motion sequences within a single diffusion model. To guide our network in perceiving the object's spatial position and learning more natural grasping poses, we introduce novel contact-aware losses and incorporate a data-driven, carefully designed guidance. Experimental results demonstrate that our approach outperforms the state-of-the-art method and generates plausible whole-body motion sequences. We will release our code upon acceptance.

DiffGrasp: Whole-Body Grasping Synthesis Guided by Object Motion Using a Diffusion Model

Diffusion models have demonstrated remarkable synthesis quality and diversity in generating co-speech gestures. However, the computationally intensive sampling steps associated with diffusion models hinder their practicality in real-world applications. Hence, we present DIDiffGes, a Decoupled Semi-Implicit Diffusion model-based framework that can synthesize high-quality, expressive gestures from speech using only a few sampling steps. Our approach leverages Generative Adversarial Networks (GANs) to enable large-step sampling for diffusion models. We decouple gesture data into body and hands distributions and further decompose them into marginal and conditional distributions. GANs model the marginal distribution implicitly, while an L2 reconstruction loss learns the conditional distributions explicitly. This strategy enhances GAN training stability and ensures expressiveness of generated full-body gestures. Our framework also learns to denoise root noise conditioned on local body representation, guaranteeing stability and realism. DIDiffGes can generate gestures from speech with just 10 sampling steps, without compromising quality and expressiveness, reducing the number of sampling steps by a factor of 100 compared to existing methods. Our user study reveals that our method outperforms state-of-the-art approaches in human likeness, appropriateness, and style correctness.

DIDiffGes: Decoupled Semi-Implicit Diffusion Models for Real-time Gesture Generation from Speech

Nighttime Semantic Segmentation (NSS) is essential to many cutting-edge vision applications. However, existing technologies rely heavily on massive labeled data, whose annotation is time-consuming and laborious. In this paper, we pioneer a new task that explores the potential of training strategy and framework design with limited annotation to achieve high-performance nighttime semantic segmentation. Insufficient information at very low labeling budgets can easily lead to under-optimization or overfitting of the model. Our solution comprises two main components: i) a novel region-based active sampling strategy called Contextual-Aware Region Query (CARQ), which identifies highly informative target nighttime regions for labeling; and ii) an innovative Fragmentation Synergy Active Domain Adaptation framework (FS-ADA), which progressively broadcasts the limited annotation to the unlabeled regions, achieving high performance with a minimal annotation budget. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art UDA-NSS and ADA-SS methods across four day-to-nighttime benchmarks, while generalizing well to foggy, rainy, and snowy scenes. In particular, with only 1% of the target nighttime data annotated, our method is on par with the mainstream fully-supervised methods.

The Parables of the Mustard Seed and the Yeast: Extremely Low-Budget, High-Performance Nighttime Semantic Segmentation

Subject-driven text-to-image (T2I) customization has drawn significant interest in academia and industry. This task enables pre-trained models to generate novel images based on unique subjects. Existing studies adopt a self-reconstructive perspective, focusing on capturing all details of a single image, which misconstrues the specific image's irrelevant attributes (e.g., view, pose, and background) as the subject's intrinsic attributes. This misconstruction leads to both overfitting and underfitting of irrelevant and intrinsic attributes of the subject, i.e., these attributes are over-represented or under-represented simultaneously, causing a trade-off between similarity and controllability. In this study, we argue that an ideal subject representation can be achieved by a cross-differential perspective, i.e., decoupling subject intrinsic attributes from irrelevant attributes via contrastive learning, which allows the model to focus more on intrinsic attributes through intra-consistency (features of the same subject are spatially closer) and inter-distinctiveness (features of different subjects have distinguished differences). Specifically, we propose CustomContrast, a novel framework that includes a Multilevel Contrastive Learning (MCL) paradigm and a Multimodal Feature Injection (MFI) Encoder. The MCL paradigm is used to extract intrinsic features of subjects from high-level semantics to low-level appearance through crossmodal semantic contrastive learning and multiscale appearance contrastive learning. To facilitate contrastive learning, we introduce the MFI encoder to capture cross-modal representations. Extensive experiments show the effectiveness of CustomContrast in subject similarity and text controllability.

CustomContrast: A Multilevel Contrastive Perspective for Subject-Driven Text-to-Image Customization
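The intra-consistency and inter-distinctiveness idea in the abstract above can be illustrated with a generic InfoNCE-style contrastive loss. This is a minimal sketch, not the CustomContrast MCL objective: the function name `info_nce`, the temperature value, and the toy 2-D features are all assumptions chosen for illustration.

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.07):
    """Generic InfoNCE loss for one anchor (illustrative, not the paper's loss).

    The loss is small when the anchor is more similar to the positive
    (same subject) than to every negative (different subjects).
    """
    def unit(v):
        v = np.asarray(v, dtype=float)
        return v / (np.linalg.norm(v, axis=-1, keepdims=True) + 1e-12)

    a, p, n = unit(anchor), unit(positive), unit(negatives)
    # Cosine similarities: index 0 is the positive pair, the rest are negatives.
    logits = np.concatenate([[a @ p], n @ a]) / temperature
    logits -= logits.max()  # numerical stability
    return float(-logits[0] + np.log(np.exp(logits).sum()))

# Toy features: the anchor and its positive point in nearly the same direction.
anchor = [1.0, 0.0]
loss_aligned = info_nce(anchor, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
# Swap roles so the "positive" is far away and a negative is close.
loss_misaligned = info_nce(anchor, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
```

In this toy setup `loss_aligned` is lower than `loss_misaligned`, which is the behaviour a contrastive objective rewards: same-subject features close together, different-subject features apart.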

Geo-entity resolution involves linking records that refer to the same entities across different spatial datasets, which underpins location-based services. Given the varying quality and reliability of geospatial data, this task is known to be challenging, as directly comparing the semantics-centric representations of two entities is no longer reliable. To robustify geo-entity resolution in this context, the main research question is how to effectively extend the current semantics-centric representations of a geo-entity with geographical context from its spatial neighbors. Existing methods consider names from neighbor entities, but they struggle to fully utilize the unaligned neighbor attributes. In this paper, we investigate the representation of geographical context for robust geo-entity resolution and propose two adaptations that efficiently leverage unaligned geo-entity attributes across spatial neighbors: (1) a plugin module, namely the Unaligned Message-Passing (UMP) layer, that propagates unaligned neighbor features to integrate geo-context into the token embeddings output by a language model; and (2) a contextualized pretraining (CP) framework that allows the former to leverage unlabelled geo-entity data. Experiments show that our method surpasses the baseline methods, achieving higher F1 scores on 8 real-world geo-datasets in terms of robustness, with an improvement of up to 7.9%. The ablation study further justifies our proposal.

Unaligned Message-Passing and Contextualized-Pretraining for Robust Geo-Entity Resolution

Existing learning-based stereo image codecs adopt sophisticated transformations with simple entropy models derived from single-image codecs to encode latent representations. However, those entropy models struggle to effectively capture the spatial-disparity characteristics inherent in stereo images, which leads to suboptimal rate-distortion results. In this paper, we propose a stereo image compression framework, named CAMSIC. CAMSIC independently transforms each image to a latent representation and employs a powerful decoder-free Transformer entropy model to capture both spatial and disparity dependencies by introducing a novel content-aware masked image modeling (MIM) technique. Our content-aware MIM facilitates efficient bidirectional interaction between prior information and estimated tokens, which naturally obviates the need for an extra Transformer decoder. Experiments show that our stereo image codec achieves state-of-the-art rate-distortion performance on two stereo image datasets, Cityscapes and InStereo2K, with fast encoding and decoding speed.

CAMSIC: Content-aware Masked Image Modeling Transformer for Stereo Image Compression

While existing semi-supervised object detection (SSOD) methods perform well in general scenes, they encounter challenges in handling oriented objects in aerial images. 
We experimentally find three gaps between general and oriented object detection in semi-supervised learning: 
1) Sampling inconsistency: the common center sampling is not suitable for oriented objects with larger aspect ratios when selecting positive labels from labeled data. 
2) Assignment inconsistency: balancing the precision and localization quality of oriented pseudo-boxes poses greater challenges, which introduces more noise when selecting positive labels from unlabeled data. 
3) Confidence inconsistency: there exists more mismatch between the predicted classification and localization qualities when considering oriented objects, affecting the selection of pseudo-labels. 
Therefore, we propose a Multi-clue Consistency Learning (MCL) framework to bridge gaps between general and oriented objects in semi-supervised detection. 
Specifically, considering various shapes of rotated objects, the Gaussian Center Assignment is specially designed to select the pixel-level positive labels from labeled data. 
We then introduce the Scale-aware Label Assignment to select pixel-level pseudo-labels instead of unreliable pseudo-boxes, which is a divide-and-rule strategy suited for objects with various scales. 
The Consistent Confidence Soft Label is adopted to further boost the detector by maintaining the alignment of the predicted results. 
Comprehensive experiments on DOTA-v1.5 and DOTA-v1.0 benchmarks demonstrate that our proposed MCL can achieve state-of-the-art performance in the semi-supervised oriented object detection task.

Multi-clue Consistency Learning to Bridge Gaps Between General and Oriented Object in Semi-supervised Detection
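To make the sampling-inconsistency point concrete, the sketch below scores candidate pixels with a Gaussian aligned to an oriented box, so that points along the long axis of an elongated object score higher than equally distant points across it. This is only an illustration under stated assumptions: the abstract does not give the exact form of the Gaussian Center Assignment, so the covariance choice (half-width and half-height as standard deviations) and the function name `gaussian_scores` are hypothetical.

```python
import numpy as np

def gaussian_scores(points, cx, cy, w, h, theta):
    """Score (x, y) points with an unnormalized Gaussian aligned to an oriented box.

    Assumption for illustration: standard deviations are w/2 and h/2 along the
    box axes, rotated by theta (radians). Not the paper's exact definition.
    """
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    cov = rot @ np.diag([(w / 2) ** 2, (h / 2) ** 2]) @ rot.T
    diff = np.asarray(points, dtype=float) - np.array([cx, cy])
    # Squared Mahalanobis distance of every point to the box centre.
    maha = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(cov), diff)
    return np.exp(-0.5 * maha)

# A long, thin box (20 x 2) centred at the origin with no rotation.
candidates = [(8.0, 0.0), (0.0, 8.0)]  # both 8 px from the centre
scores = gaussian_scores(candidates, cx=0.0, cy=0.0, w=20.0, h=2.0, theta=0.0)
# Rotating the box by 90 degrees swaps which candidate lies on the long axis.
scores_rotated = gaussian_scores(candidates, 0.0, 0.0, 20.0, 2.0, np.pi / 2)
```

A circular center-sampling radius would treat both candidates identically, whereas the oriented Gaussian favours the one that actually lies on the object.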

Multi-source domain adaptation (MSDA), which utilizes multiple source domains to align the distribution of a single target domain, is a popular and challenging setting in domain adaptation (DA). However, existing MSDA approaches struggle to obtain sufficient knowledge of the target domain, which serves as the transfer object. Furthermore, the target distributions are blended in the real world, i.e., the model cannot obtain the domain labels of target domains. To tackle these problems, we consider a more realistic DA setting, Multi-Source Blended-Target Domain Adaptation (MBDA), and propose an Invertible Projection and Conditional Alignment (IPCA) method. Specifically, to reduce the impact of the distribution discrepancy, we construct an invertible projection for the source and blended-target domains. Then, we apply a projection consistency regularization to our model, which makes the model more robust on the domain-specific parts. In addition, because the labels of the blended-target domain are unseen, we introduce conditional discrepancy to obtain the domain-level discriminative information and guide the classifier to serve as the discriminator, which is suitable for MBDA settings. Extensive experimental results on the ImageCLEF-DA, Office-Home, and DomainNet datasets validate the effectiveness of our method.

Invertible Projection and Conditional Alignment for Multi-source Blended-target Domain Adaptation

Visual prompt tuning-based continual learning (CL) methods have shown promising performance in exemplar-free scenarios, where their key component can be viewed as a prompt generator. Existing approaches generally rely on freezing old prompts, slow updating, and task discrimination for prompt generators to preserve stability and minimize forgetting. In contrast, we introduce a novel approach that trains a consistent prompt generator to ensure stability during CL. Consistency means that for any instance from an old task, its corresponding instance-aware prompt generated by the prompt generator remains consistent even as the generator continually updates in a new task. This ensures that the representation of a specific instance remains stable across tasks and thereby prevents forgetting. We employ a mixture of experts (MoE) as the prompt generator, which contains a router and multiple experts. By deriving conditions sufficient to achieve consistency for the MoE prompt generator, we demonstrate that, during training in a new task, if the router and experts update in the directions orthogonal to the subspaces spanned by old input features and gating vectors, respectively, consistency can be theoretically guaranteed. To implement this orthogonality, we project parameter gradients to those orthogonal directions using the orthogonal projection matrices computed via the null space method. Extensive experiments on four class-incremental learning benchmarks validate the effectiveness and superiority of our approach. Our code is available in the supplementary material.

Training Consistent Mixture-of-Experts-Based Prompt Generator for Continual Learning
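The orthogonality condition described in the abstract above can be illustrated for a single linear layer. The sketch is not the authors' implementation: the function name `null_space_projector`, the tolerance `tol`, the learning rate, and the toy dimensions are assumptions. It builds a projector onto the orthogonal complement of the subspace spanned by old input features and applies it to a gradient before the update.

```python
import numpy as np

def null_space_projector(old_features, tol=1e-8):
    """Return a d x d matrix P with P @ x = 0 for every row x of old_features (n x d)."""
    _, s, vt = np.linalg.svd(old_features, full_matrices=True)
    # Right singular vectors beyond the numerical rank span the null space.
    rank = int(np.sum(s > tol * s.max())) if s.size else 0
    null_basis = vt[rank:].T            # shape: d x (d - rank)
    return null_basis @ null_basis.T    # symmetric, idempotent projector

rng = np.random.default_rng(0)
d, n_old, n_out = 8, 3, 4
old_x = rng.normal(size=(n_old, d))     # features from a previous task
W = rng.normal(size=(n_out, d))         # weights of a linear layer y = W x
grad = rng.normal(size=(n_out, d))      # gradient computed on the new task

P = null_space_projector(old_x)
W_new = W - 0.1 * grad @ P              # update restricted to the null space

# Outputs on old-task features are unchanged up to floating-point error.
max_drift = float(np.max(np.abs(old_x @ W_new.T - old_x @ W.T)))
```

Because every old feature is annihilated by `P`, the projected update cannot change the layer's response to old inputs, which is the stability property the consistency condition asks for in this simplified setting.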

This paper presents a comprehensive study on the role of Classifier-Free Guidance (CFG) in text-conditioned diffusion models from the perspective of inference efficiency. In particular, we relax the default choice of applying CFG in all diffusion steps and instead propose to search for more efficient guidance policies. We formulate the discovery of such policies in the framework of differentiable neural architecture search. Our findings suggest that, as denoising progresses, the updates produced by CFG become increasingly aligned with simple conditional steps, which renders CFG's additional neural network evaluation redundant, especially in the second half of the denoising process. Building upon this insight, we propose "Adaptive Guidance" (AG), an efficient variant of CFG that adaptively omits network evaluations when the denoising process displays convergence. Our experiments demonstrate that AG preserves CFG's image quality while reducing computation by 25%. Thus, AG constitutes a plug-and-play alternative to Guidance Distillation, achieving 50% of the speed-ups of the latter, while being training-free and retaining the capacity to handle negative prompts. We conclude by uncovering further redundancies of CFG in the first half of the diffusion process, showing that entire neural network evaluations can be replaced by simple affine transformations of past score estimates.
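As a rough illustration of the idea, the sketch below combines conditional and unconditional predictions with standard classifier-free guidance and stops calling the unconditional branch once the two predictions are closely aligned. Several parts are assumptions rather than details from the abstract: the cosine-similarity convergence test, the `threshold` value, the permanent switch to conditional-only steps, and `toy_denoiser`, which is a hypothetical stand-in for a real noise-prediction network.

```python
import numpy as np

def cfg_combine(eps_cond, eps_uncond, scale):
    """Standard classifier-free guidance: extrapolate from the unconditional prediction."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

def cosine(a, b):
    return float(np.dot(a.ravel(), b.ravel()) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def toy_denoiser(x, t, cond):
    """Hypothetical stand-in for a noise-prediction network (not a real model)."""
    offset = t if cond else 0.0  # conditioning effect shrinks as t goes from 1 to 0
    return 0.5 * x + offset

def sample(x, steps=10, scale=7.5, threshold=0.99, step_size=0.1):
    """Run a toy denoising loop; return the final sample and the number of network calls."""
    calls, guided = 0, True
    for i in range(steps):
        t = 1.0 - i / steps
        eps_c = toy_denoiser(x, t, cond=True)
        calls += 1
        if guided:
            eps_u = toy_denoiser(x, t, cond=False)
            calls += 1
            eps = cfg_combine(eps_c, eps_u, scale)
            # Assumed convergence test: once both predictions align, stop guiding.
            guided = cosine(eps_c, eps_u) <= threshold
        else:
            eps = eps_c  # plain conditional step, a single network call
        x = x - step_size * eps
    return x, calls

x0 = np.ones(4)
_, calls_full = sample(x0, threshold=2.0)       # never converges: CFG at every step
_, calls_adaptive = sample(x0, threshold=-1.0)  # converges after the first step
```

With CFG at every step the loop makes two network calls per step; once guidance is dropped, each remaining step costs one call, which is where the compute saving comes from.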
