United States

Existing Large Vision-Language Models (LVLMs) excel at matching concepts across multi-modal inputs but struggle with compositional concepts and high-level relationships between entities. This paper introduces Progressive multi-granular Vision-Language alignments (PromViL), a novel framework to enhance LVLMs' ability to perform grounded compositional visual reasoning tasks. Our approach constructs a hierarchical structure of multi-modal alignments, ranging from simple to complex concepts. By progressively aligning textual descriptions with corresponding visual regions, our model learns to leverage contextual information from lower levels to inform higher-level reasoning. To facilitate this learning process, we introduce a data generation process that creates a novel dataset derived from Visual Genome, providing a wide range of nested compositional vision-language pairs. Experimental results demonstrate that our PromViL framework significantly outperforms baselines on various visual grounding and compositional question answering tasks.

AAAI 2025

Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models

snlp

multi-modal nlp

language grounding

poster

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.

### [Invited Speakers](https://aaai.org/conference/aaai/aaai-25/aaai-25-invited-speakers/)

Register [here](https://aaai.org/conference/aaai/aaai-25/registration/)



Lane detection plays a crucial role in autonomous driving systems, enabling vehicles to navigate safely and efficiently in complex environments. Despite significant advancements in recent years, accurate lane detection remains a challenging task, particularly in scenarios with occlusions, ambiguous lane markings, and diverse lighting conditions. In this paper, we propose the Global Enhancement and Optimization Network (GEONet) for lane detection, which is designed to refine both feature extraction and global feature transmission. Traditional approaches typically depend on deep convolutional layer stacks for global feature extraction, a process that often compromises inference speed and the precision of global feature representation. In contrast, GEONet introduces a novel and more effective methodology. We present the Global Feature Extraction Module (GFEM), which is specifically engineered to capture comprehensive global features with higher accuracy. Additionally, we introduce the Top-Tier Supplementary Module (TTSM), which enhances these features through a bottom-up approach, improving overall lane detection accuracy. To further bolster our framework, we incorporate Whitening Batch Normalization (WBN) and Whitening Contrastive Learning (WCL), which enhance feature robustness and ensure better generalization. In addition to our novel network design, we propose two new loss functions to enhance lane detection accuracy. The Generalized Rectangular Intersection over Union (GRIoU) Loss extends the predicted points into rectangles, optimizing overlap and smoothness of lane predictions. The Angle Loss accounts for angular differences between predicted and ground truth lanes, improving alignment and continuity. Experimental results demonstrate that our proposed method significantly outperforms current state-of-the-art lane detection techniques. Our code is available at: https://anonymous.4open.science/r/Anonymous-GitHub-GEONet/.
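The abstract does not give the formula for the Angle Loss, so the following is only a minimal sketch of one plausible reading: each lane is a polyline of `(x, y)` points, and the loss is the mean absolute difference between the directions of matching segments. The function names `segment_angles` and `angle_loss`, and the choice of a plain mean, are assumptions made for illustration and are not taken from GEONet.

```python
import math


def segment_angles(points):
    """Direction angle (radians) of each segment between consecutive points."""
    return [
        math.atan2(y2 - y1, x2 - x1)
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    ]


def angle_loss(pred, target):
    """Mean absolute angular difference between matching segments.

    Hypothetical formulation: both lanes must be sampled at the same
    number of points; each difference is wrapped into [0, pi].
    """
    pred_angles = segment_angles(pred)
    target_angles = segment_angles(target)
    if len(pred_angles) != len(target_angles) or not pred_angles:
        raise ValueError("lanes need the same number of points (at least two)")
    diffs = []
    for a, b in zip(pred_angles, target_angles):
        d = abs(a - b) % (2 * math.pi)
        diffs.append(min(d, 2 * math.pi - d))
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    diagonal = [(0, 0), (1, 1), (2, 2)]
    horizontal = [(0, 0), (1, 0), (2, 0)]
    print(angle_loss(diagonal, horizontal))
```

A loss of this shape penalizes a prediction whose points lie close to the ground truth but whose local direction wobbles, which is the kind of alignment and continuity error the abstract says the Angle Loss targets.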

GEONet: Global Enhancement and Optimization Network for Lane Detection

Fatigue is a critical factor contributing to accidents in industries such as safety monitoring and engineering construction. Fatigue exhibits dynamic complexity and non-stationary characteristics, so there are many intermediate states of short-term variation between alert and fatigue. Capturing and learning the signs of these intermediate states is essential for accurate fatigue assessment. However, current fatigue detection methods primarily rely on coarse-grained labels, typically spanning minutes to hours, and commonly treat alert and fatigue as two distinctly separate distributions. They overlook the expression of intermediate states and oversimplify the rich distribution information of fatigue types and levels, thereby limiting detection effectiveness. To address these limitations, this paper explores a refined representation of fatigue in terms of three dimensions: time, type, and level, and proposes a Multi-Dimensional Fine-Grained Modeling for Fatigue Detection (MDFG). MDFG introduces the SmallLoss to extract trustworthy samples, utilizes clustering to identify diverse subtypes under alert and fatigued states, and establishes base class sets in each state. Subsequently, a complete base class set containing intermediate state bases is constructed using the base class synthesis method, which achieves the expression of intermediate fatigue states from absence to presence. Finally, fatigue levels are quantified based on the matching between samples and the complete base class set. Moreover, to cope with the complex variability of fatigue states, MDFG employs meta-learning for training. MDFG achieves an average accuracy improvement of 10.0% and 12.1% on two real datasets compared to methods that do not consider fine-grained information. Extensive experiments demonstrate that MDFG exhibits superior robustness and stability among current fatigue detection methods.
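The pipeline described above (select trustworthy samples, synthesize intermediate bases, match samples to bases) can be sketched roughly as follows. The abstract gives no formulas, so every detail here is an assumption: small-loss selection is taken to mean keeping the lowest-loss fraction of samples, base class synthesis is replaced by plain linear interpolation between one alert base and one fatigue base, and matching is nearest-base lookup. The names `select_small_loss`, `synthesize_bases`, and `fatigue_level` are hypothetical, and the clustering and meta-learning steps are omitted.

```python
def select_small_loss(losses, keep_ratio):
    """Indices of the keep_ratio fraction of samples with the smallest loss."""
    k = max(1, int(len(losses) * keep_ratio))
    ranked = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(ranked[:k])


def synthesize_bases(alert, fatigue, n_intermediate):
    """Ordered bases from alert to fatigue.

    Assumption: intermediate bases are linear interpolations between the
    two end states; the paper's synthesis method may differ.
    """
    steps = n_intermediate + 1
    bases = []
    for s in range(steps + 1):
        t = s / steps
        bases.append([(1 - t) * a + t * f for a, f in zip(alert, fatigue)])
    return bases


def fatigue_level(sample, bases):
    """Level in [0, 1] given by the position of the nearest base."""
    dists = [sum((x - b) ** 2 for x, b in zip(sample, base)) for base in bases]
    nearest = dists.index(min(dists))
    return nearest / (len(bases) - 1)


if __name__ == "__main__":
    bases = synthesize_bases([0.0, 0.0], [1.0, 1.0], n_intermediate=1)
    print(select_small_loss([0.9, 0.1, 0.5, 0.2], keep_ratio=0.5))
    print(fatigue_level([0.6, 0.4], bases))
```

The point of the sketch is only the data flow: a graded level falls out of where a sample sits among ordered bases, rather than from a binary alert/fatigue classifier.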

MDFG: Multi-Dimensional Fine-Grained Modeling for Fatigue Detection

Molecular representation learning is vital for various downstream applications, including the analysis and prediction of molecular properties and side effects. While Graph Neural Networks (GNNs) have been a popular framework for modeling molecular data, they often struggle to capture the full complexity of molecular representations. In this paper, we introduce a novel method called Gode, which accounts for the dual-level structure inherent in molecules. Molecules possess an intrinsic graph structure and simultaneously function as nodes within a broader molecular knowledge graph. Gode integrates individual molecular graph representations with multi-domain biochemical data from knowledge graphs. By pre-training two GNNs on different graph structures and employing contrastive learning, Gode effectively fuses molecular structures with their corresponding knowledge graph substructures. This fusion yields a more robust and informative representation, enhancing molecular property predictions by leveraging both chemical and biological information. When fine-tuned across 11 chemical property tasks, our model significantly outperforms existing benchmarks, achieving an average ROC-AUC improvement of 12.7% for classification tasks and an average RMSE/MAE improvement of 34.4% for regression tasks. Notably, Gode surpasses the current leading model in property prediction, with advancements of 2.2% in classification and 7.2% in regression tasks.
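To make the contrastive step concrete, here is a minimal sketch of a generic InfoNCE-style loss that pulls each molecule-graph embedding toward the embedding of its own knowledge-graph substructure and away from the others in the batch. The abstract does not specify which contrastive objective Gode uses, so the loss form, the cosine similarity, the `temperature` value, and the function names are all assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(mol_embs, kg_embs, temperature=0.1):
    """Average InfoNCE loss; mol_embs[i] and kg_embs[i] are the positive pair."""
    losses = []
    for i, mol in enumerate(mol_embs):
        logits = [cosine(mol, kg) / temperature for kg in kg_embs]
        # Numerically stable log-sum-exp over all candidates in the batch.
        top = max(logits)
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


if __name__ == "__main__":
    mols = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    print(info_nce(mols, aligned), info_nce(mols, swapped))
```

In a real setup the two embedding lists would come from the two pre-trained GNNs; the loss is small when each molecule is most similar to its own knowledge-graph counterpart and large when the pairing is wrong.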

Bi-level Contrastive Learning for Knowledge-Enhanced Molecule Representations

Most existing semi-supervised community detection algorithms leverage known communities to learn community structures, subsequently identifying communities that align with these learned structures. However, because community structures differ, the structures learned by these methods may be inappropriate for the community containing the given node of interest. As a result, the identified community may exclude the given node or be of poor quality. Inspired by the success of reinforcement learning, we propose a Semi-supervised Local community detection method based on Reinforcement Learning, named SLRL, which only explores parts of the network surrounding the given node. It first extracts the local structure around a given node with an extractor, followed by selecting communities that are similar to this local structure to distill useful communities. These selected communities are employed to train the expander, which expands the community containing a given node. Experimental results demonstrate that SLRL outperforms state-of-the-art algorithms on five real-world datasets.

SLRL: Semi-Supervised Local Community Detection Based on Reinforcement Learning

Location-Based Social Networks (LBSNs) offer a rich dataset of user activity at Points-of-Interest (POIs), making next POI recommendation a key task. Traditional algorithms face challenges due to broad search scopes, which reduce recommendation accuracy. Users tend to visit nearby POIs and show temporal concentration in their activities, reflecting personalized spatio-temporal clustering. However, individual user data may be insufficient to capture these clustering effects for personalized recommendations. In this paper, we propose an integrated Personalized Spatio-Temporal Clustering Model (iPCM) for next POI recommendation. The model learns this personalized spatio-temporal clustering effect by using global historical trajectory data in conjunction with user feature embeddings. It integrates the features of personalized spatio-temporal clustering with the user's trajectory, and produces the user's POI recommendation through Transformer encoding and MLP decoding. To enhance the accuracy of predictions, we add a probability adjustment module. The experimental results on multiple datasets show that, with the help of personalized spatio-temporal clustering, the proposed iPCM is superior to existing methods on various evaluation metrics.

Integrating Personalized Spatio-Temporal Clustering for Next POI Recommendation

3D single object tracking (3D SOT) in LiDAR point clouds is essential for autonomous driving. Most existing 3D SOT methods focus on clear weather, where point clouds are more defined. However, adverse weather conditions lead to sparser and noisier point clouds, significantly degrading tracking performance and posing safety risks. In this study, we introduce UAWTrack, a universal 3D SOT model designed to perform effectively across diverse real-world weather conditions. UAWTrack comprises three key modules: 1) Voxel Feature Extraction, which mitigates the perturbations in point clouds caused by adverse weather; 2) Motion-centric Spatial-temporal Aggregation and Motion-guided Feature Fusion, capturing motion clues and sampling dense BEV motion features to address the issue of sparsity; and 3) Weather-Specific Tracker, which efficiently handles tracking in various weather conditions. To address the lack of benchmarks for 3D SOT in adverse weather, we simulate physically valid adverse weather conditions on the KITTI and NuScenes datasets, creating two benchmarks: KITTI-A and NuScenes-A. Extensive experiments demonstrate that UAWTrack achieves state-of-the-art performance under all weather conditions. The code will be released.

UAWTrack: Universal 3D Single Object Tracking in Adverse Weather

Visual object tracking is crucial for unmanned aerial vehicles (UAVs). Despite substantial progress, most existing UAV trackers are designed for well-conditioned daytime data, while in challenging weather conditions, e.g. foggy or nighttime environments, the tremendous domain gap leads to significant performance degradation. To address this issue, we propose a novel robust UAV tracker termed LVPT-Track, which conducts high-quality label-aligned visual prompt tuning to adapt to various challenging weather conditions. Specifically, we first synthesize sequential foggy and nighttime video frames to assist model training. A domain-adaptive teacher-student network is utilized to distill the hierarchical visual semantics of the target objects in cross-domain scenarios. Then we propose a target-aware pseudo-label voting (PLV) strategy to alleviate the target-level misalignment between the two domains. Furthermore, we propose a dynamic aggregated prompt (DAP) module to handle the significant appearance variation of the target object in challenging scenarios. Extensive experiments demonstrate that our tracker achieves superior performance over existing state-of-the-art UAV trackers.

LVPTrack: High Performance Domain Adaptive UAV Tracking with Label Aligned Visual Prompt Tuning

Few-shot classification (FSC) is a fundamental yet challenging task in computer vision that involves recognizing novel classes from limited data. While previous methods have focused on enhancing visual features or incorporating additional modalities, Large Vision Language Models (LVLMs) offer a promising alternative due to their rich knowledge and strong visual perception. However, LVLMs risk learning specific response formats rather than effectively extracting useful information from support data in FSC tasks. In this paper, we investigate LVLMs' performance in FSC and identify key issues such as insufficient learning and the presence of severe positional biases. To tackle the above challenges, we adopt a meta-learning strategy to teach models to "learn to learn". By constructing a rich set of meta-tasks for instruction fine-tuning, LVLMs enhance their ability to extract information from few-shot support data for classification. Additionally, we further boost LVLMs' few-shot learning capabilities through label augmentation and candidate selection in the fine-tuning and inference stages, respectively. Label augmentation is implemented via a character perturbation strategy to ensure the model focuses on support information. Candidate selection leverages attribute descriptions to filter out unreliable candidates and simplify the task. Extensive experiments demonstrate that our approach achieves superior performance on both general and fine-grained datasets. Furthermore, our candidate selection strategy has proved beneficial for training-free LVLMs. Our code will be made available upon acceptance of the paper.
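The character perturbation idea can be illustrated with a small sketch. The abstract does not describe the exact perturbation, so this version assumes the simplest form: each alphabetic character of a class name is replaced by a random lowercase letter with some probability, and one perturbed name per class is reused across the support and query parts of a meta-task. The function names, the `rate` parameter, and the seeding scheme are hypothetical.

```python
import random
import string


def perturb_label(label, rate, rng):
    """Replace each alphabetic character with a random letter with probability rate."""
    out = []
    for ch in label:
        if ch.isalpha() and rng.random() < rate:
            out.append(rng.choice(string.ascii_lowercase))
        else:
            out.append(ch)
    return "".join(out)


def augment_task_labels(labels, rate=0.3, seed=0):
    """Map each distinct class name in a meta-task to one perturbed name.

    The same mapping must be applied to support and query labels so the
    task stays solvable from the support examples alone.
    """
    rng = random.Random(seed)
    mapping = {}
    used = set()
    for label in labels:
        if label in mapping:
            continue
        candidate = perturb_label(label, rate, rng)
        while candidate in used:  # avoid two classes sharing one name
            candidate = perturb_label(label, rate, rng)
        mapping[label] = candidate
        used.add(candidate)
    return mapping


if __name__ == "__main__":
    print(augment_task_labels(["golden retriever", "tabby cat"], rate=0.3, seed=0))
```

Because the perturbed names no longer match anything the model memorized during pre-training, the only way to answer correctly is to read the class names off the support examples, which is the stated goal of the augmentation.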

Making Large Vision Language Models to Be Good Few-Shot Learners

Unsupervised domain adaptation (UDA) refers to a domain adaptation framework in which a learning model is trained on labeled samples from the source domain and unlabeled ones from the target domain. The dominant existing methods in the field, which rely on the classical covariate shift assumption to learn domain-invariant feature representations, have yielded suboptimal performance under label distribution shift. In this paper, we propose a novel Conditional Adversarial SUpport ALignment (CASUAL) method that minimizes the conditional symmetric support divergence between the feature representation distributions of the source and target domains, aiming at a more discriminative representation for the classification task. We also introduce a novel theoretical target risk bound, which justifies the merits of aligning the supports of conditional feature distributions compared to the existing marginal support alignment approach in the UDA setting. We then provide a complete training process in which the optimization objectives are based precisely on the proposed target risk bound. Our empirical results demonstrate that CASUAL outperforms other state-of-the-art methods on different UDA benchmark tasks under different label shift conditions.

CASUAL: Conditional Support Alignment for Domain Adaptation with Label Shift

The introduction of the Feature Pyramid Network (FPN) has significantly improved object detection performance. However, substantial challenges persist when detecting tiny objects. The features of tiny objects occupy a very small proportion of the feature maps. Although FPN integrates multi-scale features, it does not directly enhance or enrich the features of tiny objects. Furthermore, FPN lacks spatial perception ability. To address these issues, we propose a novel High Frequency and Spatial Perception Feature Pyramid Network (HS-FPN) containing two innovative modules. First, we design a high frequency perception module (HFP) that generates high frequency responses through high pass filters. These high frequency responses are used as mask weights from both spatial and channel perspectives to enrich and highlight the features of tiny objects in the original feature maps. Second, we develop a spatial dependency perception module (SDP) to capture the spatial dependencies FPN lacks. Our experiments demonstrate that detectors based on HS-FPN exhibit competitive advantages over state-of-the-art models on the AI-TOD dataset for tiny object detection.
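As a rough illustration of using high-pass responses as mask weights, the sketch below filters a single-channel feature map with a 3x3 Laplacian kernel, squashes the absolute response with a sigmoid, and multiplies the result back onto the features. This covers only a spatial mask; the channel-wise weighting mentioned in the abstract is omitted. The Laplacian kernel, zero padding, and sigmoid are stand-in assumptions, since the abstract does not say which high pass filters HS-FPN uses or how the mask is normalized.

```python
import math

# A common 3x3 high-pass (Laplacian) kernel, used here as a stand-in.
LAPLACIAN = [[0, -1, 0], [-1, 4, -1], [0, -1, 0]]


def high_pass(fmap):
    """Convolve a 2D feature map with the Laplacian kernel (zero padding)."""
    h, w = len(fmap), len(fmap[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    y, x = i + di, j + dj
                    if 0 <= y < h and 0 <= x < w:
                        acc += LAPLACIAN[di + 1][dj + 1] * fmap[y][x]
            out[i][j] = acc
    return out


def spatial_mask(fmap):
    """Sigmoid of the absolute high-pass response, so weights start at 0.5."""
    return [[1.0 / (1.0 + math.exp(-abs(v))) for v in row] for row in high_pass(fmap)]


def enhance(fmap):
    """Reweight features so locations with strong high-frequency content stand out."""
    mask = spatial_mask(fmap)
    return [[f * m for f, m in zip(frow, mrow)] for frow, mrow in zip(fmap, mask)]


if __name__ == "__main__":
    fmap = [[0.0] * 5 for _ in range(5)]
    fmap[2][2] = 1.0  # a one-pixel "tiny object"
    for row in spatial_mask(fmap):
        print([round(v, 3) for v in row])
```

A single bright pixel produces the strongest response at its own location and weaker responses at its neighbors, so the mask concentrates weight exactly where a tiny object would sit in the feature map.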
