Existing Vision-Language Pretraining (VLP) methods have achieved remarkable improvements across a variety of vision-language tasks, confirming their effectiveness in capturing coarse-grained semantic correlations.
However, their capability for fine-grained understanding, which is critical for many nuanced vision-language applications, remains limited.
Prevailing VLP models often overlook the intricate distinctions in how different modalities express features, and typically depend on the similarity of holistic features for cross-modal interactions.
Moreover, these models directly align and integrate features from different modalities, focusing on coarse-grained general representations and thus failing to capture the nuanced differences needed for tasks that demand more detailed perception.
In response to these limitations, we introduce Negative Augmented Samples (NAS), a refined vision-language pretraining model that incorporates negative augmented samples to address the challenge of fine-grained understanding.
NAS utilizes a Visual Dictionary (VD) as a semantic bridge between the visual and linguistic domains.
Additionally, it employs a Negative Visual Augmentation (NVA) method based on the VD to generate challenging negative image samples.
These samples deviate from positive samples exclusively at the token level, requiring the model to discern the subtle disparities between positive and negative samples with greater precision.
Comprehensive experiments validate the efficacy of the NAS components and underscore its potential to enhance fine-grained vision-language comprehension.

AAAI 2025

Enhancing Fine-grained Vision-Language Pretraining with Negative Augmented Samples

poster

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

### [Invited Speakers](https://aaai.org/conference/aaai/aaai-25/aaai-25-invited-speakers/)

Register [here](https://aaai.org/conference/aaai/aaai-25/registration/)




Recent studies show that Large Language Models (LLMs) often fall short in tasks demanding creative, lateral thinking because they lack a clear awareness of their own reasoning processes. To cope with this issue, we propose a novel metacognitive prompting method (MP) that mimics human metacognition. By integrating metacognitive principles, MP endows LLMs with lateral thinking ability, thereby enhancing their ability to strategize, monitor, and reflect on their responses when dealing with creative tasks. Experimental results with five base LLMs across three lateral thinking datasets demonstrate that all LLMs armed with MP consistently outperform the representative baseline methods. For example, with the GPT-3.5-turbo model, MP outperforms CoT prompting on Sentence Puzzle (+5.00%), Word Puzzle (+10.07%), BiRdQA (+6.48%), and RiddleSense (+2.65%). In particular, deploying MP with GPT-4 achieves significant performance improvements that even surpass human performance on the BRAINTEASER benchmark, demonstrating the transformative potential of MP in enhancing the creative problem-solving abilities of LLMs.

MP: Endowing Large Language Models with Lateral Thinking

Learning representations from large amounts of 2D image data has shown promising performance, yet very few works apply these representations to point cloud registration. In this paper, we explore how to leverage 2D information to assist point cloud registration, and propose IAPReg, an Image-Assisted Partial 3D point cloud Registration framework that uses multi-view images generated from the input point cloud. The goal is to enrich 3D information with 2D knowledge and to use that knowledge to assist registration. Specifically, we create multi-view depth maps by projecting the input point cloud from several specific views, and then extract 2D and 3D features using well-established models. To fuse the information learned from the 2D and 3D modalities, an inter-modality multi-view learning module is proposed to enhance geometric information and complement semantic information. Weighted SVD is a common method for reducing the impact of inaccurate correspondences on registration. However, determining the correspondence weights is not trivial. Therefore, we design a 2D-weighted SVD method, in which 2D knowledge is employed to provide the correspondence weights. Extensive experiments show that our method outperforms the state-of-the-art method without additional 2D training data. Our code will be released soon.

Partial Point Cloud Registration with Multi-view 2D Image Learning
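
The IAPReg abstract above refers to weighted SVD as a common way to reduce the impact of inaccurate correspondences. As a rough illustration of that generic technique only, the sketch below solves the weighted rigid alignment problem in closed form with NumPy. The function name `weighted_svd_registration` and the way weights are supplied are assumptions made for this example; how IAPReg derives its weights from 2D knowledge is not described in the abstract and is not reproduced here.

```python
import numpy as np


def weighted_svd_registration(src, dst, weights):
    """Estimate R, t minimizing sum_i w_i * ||R @ src[i] + t - dst[i]||^2.

    src, dst: (N, 3) arrays of corresponding points (hypothetical inputs).
    weights:  (N,) non-negative correspondence weights, not all zero.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so the weights sum to one

    # Weighted centroids of both point sets.
    src_centroid = (w[:, None] * src).sum(axis=0)
    dst_centroid = (w[:, None] * dst).sum(axis=0)

    # Weighted cross-covariance of the centered points (3 x 3).
    src_c = src - src_centroid
    dst_c = dst - dst_centroid
    H = (w[:, None] * src_c).T @ dst_c

    # SVD of the cross-covariance; the sign correction avoids a reflection.
    U, _, Vt = np.linalg.svd(H)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_centroid - R @ src_centroid
    return R, t
```

In this sketch a correspondence with weight zero has no influence on the estimate, which is the mechanism by which down-weighting unreliable matches protects the result.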

Auto-regressive models have made significant progress in the realm of text-to-image (T2I) synthesis, yet devising an appropriate model architecture and training strategy to achieve a satisfactory level of performance remains an important avenue of exploration. In this work, we introduce MARS, a novel framework for T2I generation that incorporates a specially designed Semantic Vision-Language Integration Expert (SemVIE). This component integrates pre-trained LLMs by processing linguistic and visual information independently, freezing the textual component while fine-tuning the visual component. This methodology preserves the NLP capabilities of LLMs while imbuing them with exceptional visual understanding. Building upon the pre-trained Qwen-7B, MARS stands out for its bilingual generative capabilities, handling both English and Chinese prompts, and for its capacity for joint image and text generation. The flexibility of this framework lends itself to migration towards any-to-any task adaptability. Furthermore, MARS employs a multi-stage training strategy that first establishes robust image-text alignment through complementary bidirectional tasks and subsequently concentrates on refining the T2I generation process, significantly augmenting text-image synchrony and the granularity of image details. Notably, MARS requires only 9% of the GPU days needed by SD1.5, yet it achieves remarkable results across a variety of benchmarks, illustrating its training efficiency and its potential for swift deployment in various applications.

MARS: Mixture of Auto-Regressive Models for Fine-grained Text-to-image Synthesis

Few-Shot Segmentation (FSS) aims to segment target regions in unlabeled query images based on labeled support images. Recent work leveraged the Segment Anything Model (SAM) for FSS by constructing prompts that encode query pixels similar to prototypes encapsulating support foreground pixels. While effective, it overlooked two problems: the similarity between prototypes and pixels is unreliable, and the conventional pseudo-mask used to enhance query foreground-specific information is of low quality. To address these issues, we propose Foreground-Covering Prototype Generation and Matching, which constructs both support and query prototypes with an attention-based pseudo-mask and matches prototypes to generate more reliable prompts. Specifically, our approach utilizes two types of complementary features to construct prototypes: SAM Image Encoder features for pixel aggregation and ResNet features for class consistency. For query prototype generation, we begin by roughly guiding foreground regions within SAM features using the conventional pseudo-mask, then employ iterative cross-attention to aggregate foreground features into learnable tokens. Here, we find that the cross-attention weights can effectively replace the conventional pseudo-mask. Therefore, we use the attention-based pseudo-mask to guide ResNet features to focus on the foreground, then infuse the guided ResNet features into the learnable tokens to generate class-consistent query prototypes. The support prototypes are generated symmetrically to the query ones, with the pseudo-mask replaced by the ground-truth mask. Finally, we compare the query prototypes with the support ones to generate reliable prompts, which subsequently produce object masks through the SAM Mask Decoder. Our state-of-the-art performance on various datasets validates the effectiveness of the proposed method for FSS.

Foreground-Covering Prototype Generation and Matching for SAM-Aided Few-Shot Segmentation

Black-box prompt tuning has become a prevalent parameter-efficient paradigm that leverages the capabilities of large language models (LLMs) for customized applications in specific downstream tasks.
In practical scenarios, downstream tasks frequently involve data distributions that are heavily imbalanced. Such imbalances tend to impair prompt performance, causing severe performance collapse in minority classes.
Conducting effective imbalanced black-box prompt tuning to mitigate the adverse effects of imbalanced data distribution on prompt performance remains a significant challenge.
In this paper, we propose black-box prompt tuning with first and zeroth order gradient (BPT-FZG) to handle imbalanced data.
Specifically, BPT-FZG introduces AUC maximization as the objective for prompt tuning and equivalently formulates it as a nonconvex-concave saddle point problem to avoid the construction of sample pairs from opposite classes.
BPT-FZG optimizes the latent representation of the continuous prompt in a low-dimensional subspace with the AUC loss and leverages the first and zeroth order gradients alternately to update the parameters.
Furthermore, we establish a theoretical convergence guarantee for BPT-FZG under common assumptions, showing that our method can find a stationary point of the objective function.
Our experiments on RoBERTa-large, GPT2-XL, and Llama3 show that BPT-FZG achieves significant improvement on both constructed and real-world imbalanced datasets, emphasizing the effectiveness of our method.

Leveraging First and Zeroth-Order Gradient to Address Imbalanced Black-box Prompt Tuning via Minimax Optimization
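
The BPT-FZG abstract above relies on zeroth order gradients, which estimate a gradient from loss values alone when the model cannot be differentiated. The sketch below shows one standard estimator of this kind, the forward-difference estimator with random Gaussian directions. It is a generic illustration under stated assumptions: the name `zeroth_order_gradient`, the smoothing radius `mu`, and the number of directions are choices made for this example, and the AUC saddle point objective and the alternating first/zeroth order scheme of BPT-FZG are not reproduced.

```python
import random


def zeroth_order_gradient(loss_fn, z, mu=1e-3, num_directions=20, seed=0):
    """Estimate the gradient of loss_fn at z using only function evaluations.

    loss_fn: callable taking a list of floats and returning a float
             (a stand-in for a black-box model queried with a prompt vector).
    z:       current low-dimensional parameter vector (list of floats).
    mu:      smoothing radius for the finite difference (assumed value).
    """
    rng = random.Random(seed)
    dim = len(z)
    base = loss_fn(z)
    grad = [0.0] * dim
    for _ in range(num_directions):
        # Sample a random Gaussian direction and query the perturbed point.
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        z_pert = [zi + mu * ui for zi, ui in zip(z, u)]
        # Directional finite difference, projected back onto the direction.
        scale = (loss_fn(z_pert) - base) / mu
        for i in range(dim):
            grad[i] += scale * u[i] / num_directions
    return grad
```

A plain descent loop would then update `z` with `z[i] -= lr * grad[i]`, drawing a fresh seed at each step so the directions differ between iterations.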

Rapidly developing Large Vision Language Models (LVLMs) still suffer from hallucination, where the generated responses do not align with the given contexts, which significantly restricts the use of LVLMs. Most previous work detects and mitigates hallucination at a coarse-grained level or requires expensive annotation (e.g., labeling by human experts or proprietary models). To address these issues, we propose detecting and mitigating hallucinations in LVLMs via fine-grained AI feedback. The basic idea is to generate a small sentence-level hallucination annotation dataset with proprietary models and use it to train a detection model that performs sentence-level hallucination detection. Then, we propose a detect-then-rewrite pipeline to automatically construct a preference dataset for hallucination mitigation training. Furthermore, we propose differentiating the severity of hallucinations and introduce Hallucination Severity-Aware Direct Preference Optimization (HSA-DPO), which prioritizes the mitigation of critical hallucinations in LVLMs by incorporating hallucination severity into preference learning. Extensive experiments on hallucination detection and mitigation benchmarks demonstrate that our method sets a new state-of-the-art in hallucination detection on MHaluBench, surpassing GPT-4V and Gemini, and reduces the hallucination rate by 36.1% on AMBER and 76.3% on Object HalBench compared to the base model.

Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback

Current methods for time series forecasting struggle in the online scenario, since it is difficult to preserve long-term dependencies while adapting to short-term changes when data arrive sequentially. Although some recent methods address this problem by controlling the updates of latent states, they cannot disentangle the long/short-term states, and therefore cannot adapt effectively to nonstationarity. To tackle this challenge, we propose a general framework to disentangle long/short-term states for online time series forecasting. Our idea is inspired by the observation that short-term changes can be caused by unknown interventions, such as abrupt policies in the stock market. Based on this insight, we formalize a data generation process with unknown interventions on short-term states. Under mild assumptions, we further leverage the independence of short-term states induced by unknown interventions to establish an identification theory that achieves the disentanglement of long/short-term states. Built on this theory, we develop a Long Short-Term Disentanglement model (LSTD) to extract the long/short-term states with long/short-term encoders, respectively. Furthermore, the LSTD model incorporates a smooth constraint to preserve the long-term dependencies and an interrupted dependency constraint to enforce the forgetting of short-term dependencies, together boosting the disentanglement of long/short-term states. Experimental results on several benchmark datasets show that our LSTD model outperforms existing methods for online time series forecasting, validating its efficacy in real-world applications.

Disentangling Long-Short Term State Under Unknown Interventions for Online Time Series Forecasting

Generative adversarial networks (GANs) have emerged as a powerful tool for generating high-fidelity data. However, the main bottleneck of existing approaches is the lack of supervision on the generator training, which often results in undamped oscillation and unsatisfactory performance. To address this issue, we propose an algorithm called Monte Carlo GAN (MCGAN). This approach uses an innovative generative loss function, termed the regression loss, which reformulates the generator training as a regression task: the generator is trained by minimizing the mean squared error between the discriminator's output on real data and the expected discriminator output on fake data. We demonstrate the desirable analytic properties of the regression loss, including discriminability and optimality, and show that our method requires a weaker condition on the discriminator for effective generator training. These properties justify the strength of this approach in improving training stability while retaining the optimality of GAN by leveraging the strong supervision of the regression loss. Extensive experiments on diverse datasets, including image data (CIFAR-10/100, FFHQ256 and ImageNet), time series data (VAR and stock data) and video data, are conducted to demonstrate the flexibility and effectiveness of our proposed MCGAN. Numerical results show that the proposed MCGAN is versatile in enhancing a variety of backbone GAN models and achieves consistent and significant improvement in terms of quality, accuracy, training stability, and learned latent space.

MCGAN: Enhancing GAN Training with Regression-Based Generator Loss
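
The MCGAN abstract above describes a regression loss: the mean squared error between the discriminator's output on real data and the expected discriminator output on fake data. The sketch below is one plausible reading of that sentence on plain lists of discriminator scores, with the expectation replaced by a Monte Carlo average. The function name `regression_generator_loss` and the choice to share a single Monte Carlo mean across all real samples are assumptions made for this example; the exact pairing of real and fake samples used by MCGAN is not given in the abstract.

```python
def regression_generator_loss(d_real_outputs, d_fake_outputs):
    """Mean squared error between D(real) and a Monte Carlo mean of D(fake).

    d_real_outputs: discriminator scores on real samples (list of floats).
    d_fake_outputs: discriminator scores on generated samples G(z) for
                    several latent draws z (list of floats).
    """
    # Monte Carlo estimate of the expected discriminator output on fakes.
    expected_fake = sum(d_fake_outputs) / len(d_fake_outputs)
    # Regress the expected fake score onto each real score.
    squared_errors = [(d_real - expected_fake) ** 2 for d_real in d_real_outputs]
    return sum(squared_errors) / len(squared_errors)
```

Under this reading, the loss is zero only when the averaged fake score matches every real score, which gives the generator a direct regression target instead of a purely adversarial signal.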

Lane detection plays a crucial role in autonomous driving systems, enabling vehicles to navigate safely and efficiently in complex environments. Despite significant advancements in recent years, accurate lane detection remains a challenging task, particularly in scenarios with occlusions, ambiguous lane markings, and diverse lighting conditions. In this paper, we propose the Global Enhancement and Optimization Network (GEONet) for lane detection, which is designed to refine both feature extraction and global feature transmission. Traditional approaches typically depend on deep convolutional layer stacks for global feature extraction, a process that often compromises inference speed and the precision of global feature representation. In contrast, GEONet introduces a novel and more effective methodology. We present the Global Feature Extraction Module (GFEM), which is specifically engineered to capture comprehensive global features with higher accuracy. Additionally, we introduce the Top-Tier Supplementary Module (TTSM), which enhances these features through a bottom-up approach, improving overall lane detection accuracy. To further bolster our framework, we incorporate Whitening Batch Normalization (WBN) and Whitening Contrastive Learning (WCL), which enhance feature robustness and ensure better generalization. In addition to our novel network design, we propose two new loss functions to enhance lane detection accuracy. The Generalized Rectangular Intersection over Union (GRIoU) Loss extends the predicted points into rectangles, optimizing the overlap and smoothness of lane predictions. The Angle Loss accounts for angular differences between predicted and ground truth lanes, improving alignment and continuity. Experimental results demonstrate that our proposed method significantly outperforms current state-of-the-art lane detection techniques. Our code is available at: https://anonymous.4open.science/r/Anonymous-GitHub-GEONet/.

GEONet: Global Enhancement and Optimization Network for Lane Detection

Fatigue is a critical factor contributing to accidents in industries such as safety monitoring and engineering construction. Fatigue exhibits dynamic complexity and non-stationary characteristics, so there are many intermediate states of short-term variation between alertness and fatigue. Capturing and learning the signs of these intermediate states is essential for accurate fatigue assessment. However, current fatigue detection methods primarily rely on coarse-grained labels, typically spanning minutes to hours, and commonly treat alertness and fatigue as two distinctly separate distributions. They overlook the expression of intermediate states and oversimplify the rich distribution information of fatigue types and levels, thereby limiting detection effectiveness. To address these issues, this paper explores a refined representation of fatigue along three dimensions: time, type, and level, and proposes Multi-Dimensional Fine-Grained Modeling for Fatigue Detection (MDFG). MDFG introduces SmallLoss to extract trustworthy samples, utilizes clustering to identify diverse subtypes under alert and fatigued states, and establishes base class sets in each state. Subsequently, a complete base class set containing intermediate state bases is constructed using the base class synthesis method, which makes it possible to express intermediate fatigue states that were previously absent. Finally, fatigue levels are quantified based on the matching between samples and the complete base class set. Moreover, to cope with the complex variability of fatigue states, MDFG employs meta-learning for training. MDFG achieves average accuracy improvements of 10.0% and 12.1% on two real datasets compared to methods that do not consider fine-grained information. Extensive experiments demonstrate that MDFG exhibits superior robustness and stability among current fatigue detection methods.
