United States

In recent years, significant progress has been achieved in No-Reference Point Cloud Quality Assessment (NR-PCQA) research. However, existing methods mostly seek a direct mapping function from visual data to the Mean Opinion Score (MOS), which is contradictory to the mechanism of practical subjective evaluation. In response to this challenge, we propose a novel language-driven PCQA method named CLIP-PCQA. On the one hand, considering that human beings prefer to describe visual quality using discrete quality descriptions (e.g., ``excellent&quot; and ``poor&quot;) rather than specific scores, we adopt a retrieval-based mapping strategy to simulate the process of subjective assessment. More specifically, based on the philosophy of CLIP, we calculate the cosine similarity between the visual feature and multiple textual features corresponding to different quality descriptions, in which process an effective contrastive loss and learnable prompts are introduced to enhance the feature extraction. Meanwhile, given the personal limitations and bias in subjective experiments, we further covert the feature similarities into probabilities and consider the Opinion Score Distribution (OSD) rather than a single MOS as the final target. Experiment results show that our CLIP-PCQA outperforms other State-Of-The-Art (SOTA) approaches.

AAAI 2025

CLIP-PCQA: Exploring Subjective-Aligned Vision-Language Modeling for Point Cloud Quality Assessment

low level physics based vision

In recent years, significant progress has been achieved in No-Reference Point Cloud Quality Assessment (NR-PCQA) research. However, existing methods mostly seek a direct mapping function from visual data to the Mean Opinion Score (MOS), which is contradictory to the mechanism of practical subjective evaluation. In response to this challenge, we propose a novel language-driven PCQA method named CLIP-PCQA. On the one hand, considering that human beings prefer to describe visual quality using discrete quality descriptions (e.g., ``excellent" and ``poor") rather than specific scores, we adopt a retrieval-based mapping strategy to simulate the process of subjective assessment. More specifically, based on the philosophy of CLIP, we calculate the cosine similarity between the visual feature and multiple textual features corresponding to different quality descriptions, in which process an effective contrastive loss and learnable prompts are introduced to enhance the feature extraction. Meanwhile, given the personal limitations and bias in subjective experiments, we further covert the feature similarities into probabilities and consider the Opinion Score Distribution (OSD) rather than a single MOS as the final target. Experiment results show that our CLIP-PCQA outperforms other State-Of-The-Art (SOTA) approaches.

poster

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

### [Invited Speakers](https://aaai.org/conference/aaai/aaai-25/aaai-25-invited-speakers/)

Register [here](https://aaai.org/conference/aaai/aaai-25/registration/)

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.



Automatic Speech Recognition (ASR) systems are pivotal in transcribing speech into text, yet the errors they introduce can significantly degrade the performance of downstream tasks like summarization. This issue is particularly pronounced in clinical dialogue summarization, a low-resource domain where supervised data for fine-tuning is scarce, necessitating the use of ASR models as black-box solutions. Employing conventional data augmentation for enhancing the noise robustness of summarization models is not feasible either due to the unavailability of sufficient medical dialogue audio recordings and corresponding ASR transcripts. To address this challenge, we propose \frameworkname{}, an approach for generating synthetic samples for data augmentation using Large Language Models (LLMs). Specifically, we leverage the in-context learning capabilities of LLMs and instruct them to generate ASR-like errors based on a few available medical dialogue examples with audio recordings. Experimental results show that LLMs can effectively model ASR noise, and incorporating this noisy data into the training process significantly improves the robustness and accuracy of medical dialogue summarization systems. This approach addresses the challenges of noisy ASR outputs in critical applications, offering a robust solution to enhance the reliability of clinical dialogue summarization.

MEDSAGE: Enhancing Robustness of Medical Dialogue Summarization to ASR Errors with LLM-generated Synthetic Dialogues

The capability of In-Context Learning (ICL) is crucial for large language models to generalize across a wide range of tasks. By utilizing prompts, these models can accurately predict outcomes for previously unseen tasks without necessitating retraining. However, this generalization ability does not extend to the length of the inputs; the effectiveness of ICL likely diminishes with excessively long inputs, resulting in errors in the generated text. To investigate this issue, we propose a study using a dataset of In-Context functions to understand the operational mechanisms of Transformer models in ICL and length generalization. We generated data using regression and Boolean functions and employed meta-learning techniques to endow the model with ICL capabilities. Our experimental results indicate that position encodings can significantly mitigate length generalization issues, with the most effective encoding extending the maximum input length to over eight times that of the original training length. However, further analysis revealed that while position encoding enhances length generalization, it compromises the model's inherent capabilities, such as its ability to generalize across different data types. Overall, our research illustrates that position encodings have a pronounced positive effect on length generalization, though it necessitates a careful trade-off with data generalization performance.

How Do Position Encodings Affect Length Generalization? Case Studies On In-Context Function Learning

Due to the density inconsistency and distribution difference between cross-source point clouds, previous methods fail in cross-source point cloud registration. We propose a density-robust feature extraction and matching scheme to achieve robust and accurate cross-source registration. To address the density inconsistency between cross-source data, we introduce a density-robust encoder for extracting density-robust features. To tackle the issue of challenging feature matching and few correct correspondences, we adopt a loose-to-strict matching pipeline with a ``loose generation, strict selection'' idea. Under it, we employ a one-to-many strategy to loosely generate initial correspondences. Subsequently, high-quality correspondences are strictly selected to achieve robust registration through sparse matching and dense matching. On the challenging Kinect-LiDAR scene in the cross-source 3DCSR dataset, our method improves the feature matching recall by 63.5 percentage points (pp) and registration recall by 57.6 pp. It also achieves the best performance on 3DMatch, while maintaining robustness under diverse downsampling densities. The code will be released soon.

Cross-PCR: A Robust Cross-Source Point Cloud Registration Framework

Recent studies have highlighted the fragility of safety alignment in large language models (LLMs), revealing that fine-tuning on task-specific datasets mixing even a few harmful instructions can significantly compromise their safety. Existing methods to counteract fine-tuning attacks typically require substantial computational resources. Even with parameter-efficient techniques like LoRA, gradient updates remain essential. To address these challenges, we propose $N$euron-$L$evel $S$afety $R$ealignment ($NLSR$), a training-free framework that restores the safety of LLMs based on the similarity difference of safety-critical neurons before and after fine-tuning. The core of our framework is first to construct a safety reference model from an initially aligned model to amplify safety-related features in neurons. We then utilize this reference model to identify safety-critical neurons, which we prepare as patches. Finally, we selectively restore only those neurons that exhibit significant similarity differences by transplanting these prepared patches, thereby minimally altering the fine-tuned model. Extensive experiments demonstrate significant safety enhancements in fine-tuned models across a range of downstream tasks, while greatly maintaining task-level accuracy. Our findings suggest regions of some safety-critical neurons show noticeable differences after fine-tuning, which can be effectively corrected by transplanting neurons from the reference model without requiring additional training.

NLSR: Neuron-Level Safety Realignment of Large Language Models Against Harmful Fine-Tuning

The self-attention mechanism has been adopted in various popular message passing neural networks (MPNNs), enabling the model to adaptively control the amount of information that flows along the edges of the underlying graph. Such attention-based MPNNs (Att-GNNs) have also been used as a baseline for multiple studies on explainable AI (XAI) since attention has steadily been seen as natural model interpretations, while being a viewpoint that has already been popularized in other domains (e.g., natural language processing and computer vision). However, existing studies often use naive calculations to derive attribution scores from attention, undermining the potential of attention as interpretations for Att-GNNs. In our study, we aim to fill the gap between the widespread usage of Att-GNNs and their potential explainability via attention. To this end, we propose GAtt, edge attribution calculation method for self-attention MPNNs based on the computation tree, a rooted tree that reflects the computation process of the underlying model. Despite its simplicity, we empirically demonstrate the effectiveness of GAtt in three aspects of model explanation: faithfulness, explanation accuracy, and case studies by using both synthetic and real-world benchmark datasets. In all cases, the results demonstrate that GAtt greatly improves edge attribution scores, especially compared to the previous naive approach. We intend to make our source code publicly available after acceptance.

Faithful and Accurate Self-Attention Attribution for Message Passing Neural Networks via the Computation Tree Viewpoint

Tabular data poses unique challenges due to its heterogeneous nature, combining both continuous and categorical variables. Existing approaches often struggle to effectively capture the underlying structure and relationships within such data. We propose GFTab (Geodesic Flow Kernels for Semi-Supervised Learning on Mixed-Variable Tabular Dataset), a semi-supervised framework specifically designed for tabular datasets. GFTab incorporates three key innovations: 1) Variable-specific corruption methods tailored to the distinct properties of continuous and categorical variables, 2) A Geodesic flow kernel based similarity measure to capture geometric changes between corrupted inputs, and 3) Tree-based embedding to leverage hierarchical relationships from available labeled data. To rigorously evaluate GFTab, we curate a comprehensive set of 21 tabular datasets spanning various domains, sizes, and variable compositions. Our experimental results show that GFTab outperforms existing ML/DL models across many of these datasets, particularly in settings with limited labeled data.

Geodesic Flow Kernels for Semi-Supervised Learning on Mixed-Variable Tabular Dataset

Unmanned Aerial Vehicle Object Detection (UAVOD) presents unique challenges due to varying altitudes, dynamic backgrounds, and the small size of objects. Traditional detection methods often struggle with these challenges, as they typically rely on visual feature only and fail to extract the semantic relations between the objects. To address these limitations, we propose a novel approach named Self-Prompting Analogical Reasoning (SPAR). Our method utilizes the vision-language model (CLIP) to generate context-aware prompts based on image feature, providing rich semantic information that guides analogical reasoning. SPAR includes two main modules: self-prompting and analogical reasoning. Self-prompting module based on learnable description and CLIP-text encoder generates context-aware prompt by combining specific image feature; then an objectness prompt score map is produced by computing the similarity between pixel-level features and context-aware prompt. With this score map, multi-scale image features are enhanced and pixel-level features are chosen for graph construction. While for analogical reasoning module, graph nodes consists of category-level prompt nodes and pixel-level image feature nodes. Analogical inference is based graph convolution. Under the guidance of category-level nodes, different-scale object features  have been enhanced, which helps achieve more accurate detection of challenging objects. Extensive experiments illustrate  that SPAR outperforms traditional methods, offering a more robust and accurate solution for UAVOD.

Self-Prompting Analogical Reasoning for UAV Object Detection

The infrequent occurrence of overfitting in deep neural networks is perplexing: seemingly at odds with theoretical predictions, expanding models typically only enhances performance in practical applications. However, what if overfitting does occur, albeit confined to specific sub-regions of the data space? Here, we introduce a novel score that captures the forgetting rate of deep models on validation data. We posit that this score quantifies local overfitting: a decline in performance confined to certain regions of the data space. We then show empirically that local overfitting occurs regardless of the presence of traditional overfitting. Using the framework of deep over-parametrized linear models, we offer a certain theoretical characterization of forgotten knowledge, and show that it correlates with knowledge forgotten by real deep models. Finally, we devise a new ensemble method that aims to recover forgotten knowledge, relying solely on the training history of a single network. This method enhances the performance of any trained model without incurring additional training costs. Extensive empirical evaluations demonstrate the efficacy of our method across multiple datasets, contemporary neural network architectures, and training protocols.

On Local Overfitting and Forgetting in Deep Neural Networks

Language steganography in social networks primarily focuses on embedding secret information into social media text efficiently to achieve covert communication. The misuse of such techniques could pose significant potential threats to public cyberspace, such as the spread of malicious code, commands, or viruses. Existing social text steganalysis techniques mainly focus on the analysis of individual social media texts. However, the information content in a single text is very limited, leading to poor detection performance in practical applications. To address this challenge, this paper proposes a social text steganalysis method that combines large-scale language models with common-sense knowledge graphs (STLC-KG). This method first uses knowledge graphs to expand the knowledge contained in the text under investigation, enriching its linguistic expression, and then utilizes large-scale language models to extract the linguistic features of the social text. The results of tests conducted on three mainstream social media platforms demonstrate that the proposed method significantly improves the performance of social text steganalysis.

STLC-KG:A Social Text Steganalysis Method Combining Large-Scale Language Models and Common-Sense Knowledge Graphs

Local intrinsic dimension (LID) estimation methods have received a lot of attention in recent years thanks to the progress in deep neural networks and generative modeling. In opposition to old non-parametric methods, new methods use generative models to approximate diffused dataset density and scale the methods to high-dimensional datasets like images. In this paper, we investigate the recent state-of-the-art parametric LID estimation methods from the perspective of the Wiener process. We explore how these methods behave when their assumptions are not met. We give an extended mathematical description of those methods and their error as a function of the probability density of the data.

Premium content

Next from AAAI 2025

MEDSAGE: Enhancing Robustness of Medical Dialogue Summarization to ASR Errors with LLM-generated Synthetic Dialogues

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES