United States

The auditing of Large Language Models (LLMs) is a crucial and challenging task. In this study, we focus on auditing black-box LLMs with access only to the provided service, not to the model parameters. We treat this type of auditing as a black-box optimization problem where the goal is to automatically uncover input-output pairs of the target LLMs that exhibit illegal, immoral, or unsafe behaviors. For example, we may seek a non-toxic input that the target LLM responds to with a toxic output, or an input that induces a hallucinated response from the target LLM that mentions politically sensitive individuals. This black-box optimization is challenging due to the scarcity of feasible points, the discrete nature of the prompt space, and the large search space. To address these challenges, we propose Curiosity-Driven Auditing for Large Language Models (CALM), which uses intrinsically motivated reinforcement learning to finetune an LLM as the auditor agent to uncover potentially harmful and biased input-output pairs. CALM successfully identifies derogatory completions involving celebrities and uncovers inputs that elicit specific names under the black-box setting. This work offers a promising direction for auditing black-box LLMs. Content Warning: Please note that this paper includes examples that may be offensive.
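The idea of pairing an audit objective with an intrinsic curiosity signal can be illustrated with a small sketch. This is not CALM's implementation: it swaps in a simple count-based novelty bonus, and the names `NoveltyBonus`, `shaped_reward`, and the weight `beta` are hypothetical.

```python
import math
from collections import Counter


class NoveltyBonus:
    """Count-based curiosity bonus (an illustrative stand-in, not CALM's method)."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def __call__(self, prompt: str) -> float:
        # The bonus shrinks as the same prompt is proposed again: 1 / sqrt(visits).
        self.counts[prompt] += 1
        return 1.0 / math.sqrt(self.counts[prompt])


def shaped_reward(extrinsic: float, prompt: str, bonus: NoveltyBonus,
                  beta: float = 0.5) -> float:
    """Add the weighted novelty bonus to the audit objective's (extrinsic) reward."""
    return extrinsic + beta * bonus(prompt)


bonus = NoveltyBonus()
first = shaped_reward(0.0, "prompt A", bonus)   # first visit to "prompt A"
second = shaped_reward(0.0, "prompt A", bonus)  # repeat visit, smaller bonus
fresh = shaped_reward(0.0, "prompt B", bonus)   # unseen prompt, full bonus again
```

Because feasible points are scarce, the extrinsic reward is zero for most prompts; a bonus of this kind keeps the auditor exploring new prompts instead of repeating the same ones.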

AAAI 2025

CALM: Curiosity-Driven Auditing for Large Language Models

poster

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

### [Invited Speakers](https://aaai.org/conference/aaai/aaai-25/aaai-25-invited-speakers/)



Large language models (LLMs) have significantly advanced the field of automated code generation. However, a notable research gap exists in the evaluation of social biases that may be present in the code produced by LLMs. To address this gap, we propose Solar, a novel fairness framework to assess and mitigate the social biases of LLM-generated code. Specifically, Solar automatically generates test cases that quantitatively uncover social biases in LLM-generated code. To quantify the severity of social biases in generated code, we develop a dataset that covers a diverse set of social problems. We applied Solar and the crafted dataset to four state-of-the-art LLMs for code generation. Our evaluation reveals severe bias in the LLM-generated code from all the subject LLMs. Furthermore, we explore several strategies for bias mitigation, including Chain-of-Thought (CoT) prompting, combining positive role-playing with CoT prompting, and iterative prompting. Our experiments show that iterative prompting can effectively reduce social bias in LLM-generated code by up to 90%. Solar is readily extensible to new social problems.
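One way to picture a quantitative bias test for generated code is a counterfactual check: change only a sensitive attribute and see whether the output changes. The sketch below is an assumption about what such a test could look like, not Solar's actual test generator; `counterfactual_bias_rate` and the toy `generated_loan_decision` function are hypothetical.

```python
from typing import Any, Callable, Dict, List


def counterfactual_bias_rate(
    func: Callable[[Dict[str, Any]], Any],
    base_inputs: List[Dict[str, Any]],
    attribute: str,
    values: List[Any],
) -> float:
    """Fraction of base inputs whose output changes when only `attribute` varies."""
    if not base_inputs:
        return 0.0
    biased = 0
    for base in base_inputs:
        # Run the code under test once per attribute value, all else held fixed.
        outputs = {repr(func({**base, attribute: v})) for v in values}
        if len(outputs) > 1:
            biased += 1
    return biased / len(base_inputs)


# Hypothetical stand-in for LLM-generated code that misuses a sensitive attribute.
def generated_loan_decision(applicant: Dict[str, Any]) -> bool:
    if applicant["gender"] == "female":
        return applicant["income"] > 60000
    return applicant["income"] > 50000


applicants = [{"income": 55000}, {"income": 70000}, {"income": 40000}]
rate = counterfactual_bias_rate(
    generated_loan_decision, applicants, "gender", ["female", "male"]
)
```

Here only the applicant with an income of 55000 receives different decisions across the two attribute values, so `rate` is one third.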

Bias Unveiled: Investigating Social Bias in LLM-Generated Code

Multi-objective preference alignment of large language models (LLMs) is critical for developing personalizable, helpful, and harmless AI systems. However, optimizing model outputs in the presence of diverse objectives, while also allowing the relative weights of these objectives to vary at inference time, presents a significant challenge. Existing approaches are either computationally expensive to train or do not exhibit sufficient steerability at inference time. This paper introduces the Multi-Objective Online DPO (MO-ODPO) algorithm, designed to robustly and efficiently align model behaviors with multiple, potentially conflicting human preferences. Our approach incorporates a prompt conditioning mechanism, allowing for easy adjustment of user preferences at test time while only training a single policy, ensuring adaptive and personalized model performance. Experimental results on two popular benchmarks demonstrate the efficacy of MO-ODPO in achieving Pareto-optimal performance as well as good inference-time steerability over current state-of-the-art approaches.
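A rough sketch of what prompt conditioning on objective weights might look like follows. The prefix format, the flat Dirichlet sampling, and the helper names `sample_weights`, `condition_prompt`, and `scalarize` are all assumptions made for illustration; the abstract does not specify these details.

```python
import random
from typing import Dict, List


def sample_weights(objectives: List[str], rng: random.Random) -> Dict[str, float]:
    """Draw weights that sum to 1 (flat Dirichlet via normalized Gamma draws)."""
    draws = [rng.gammavariate(1.0, 1.0) for _ in objectives]
    total = sum(draws)
    return {name: d / total for name, d in zip(objectives, draws)}


def condition_prompt(prompt: str, weights: Dict[str, float]) -> str:
    """Prepend the objective weights so a single policy can be steered by them."""
    prefix = ", ".join(f"{name}: {w:.2f}" for name, w in weights.items())
    return f"[{prefix}] {prompt}"


def scalarize(rewards: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-objective rewards, used to rank candidate responses."""
    return sum(weights[name] * rewards[name] for name in weights)


rng = random.Random(0)
sampled = sample_weights(["helpfulness", "harmlessness"], rng)

fixed = {"helpfulness": 0.7, "harmlessness": 0.3}
conditioned = condition_prompt("Hi", fixed)
score = scalarize({"helpfulness": 1.0, "harmlessness": 0.0}, fixed)
```

At test time, a user would change only the weights in the prefix; the policy itself stays the same.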

Robust Multi-Objective Preference Alignment with Online DPO

We present a Reinforcement Learning Platform for Adversarial Black-box untargeted and targeted attacks that allows users to select from various distortion filters to create adversarial examples. The platform uses a Reinforcement Learning agent to add minimum distortion to input images while still causing misclassification by the target model. The agent uses a novel dual-action method to explore the input image at each step, identifying sensitive regions for adding distortions while removing noise that has less impact on the target model. This dual action leads to faster and more efficient convergence of the attack. The platform can also be used to measure the robustness of image classification models against specific distortion types. In addition, retraining the model with adversarial samples significantly improved robustness when evaluated on benchmark datasets. The proposed platform outperforms state-of-the-art methods in terms of the average number of queries required to cause misclassification. This advances trustworthiness with a positive social impact.

Reinforcement Learning Platform for Adversarial Black-box Attacks with Custom Distortion Filters

Existing work on the alignment problem has focused mainly on (1) qualitative descriptions of the alignment problem; (2) attempting to align AI actions with human interests by focusing on value specification and learning; and/or (3) focusing on a single agent or on humanity as a monolith. Recent sociotechnical approaches highlight the need to understand complex misalignment among multiple human and AI agents. We address this gap by adapting a computational social science model of human contention to the alignment problem. Our model quantifies misalignment in large, diverse agent groups with potentially conflicting goals across various problem areas. Misalignment scores in our framework depend on the observed agent population, the domain in question, and conflict between agents' weighted preferences. Through simulations, we demonstrate how our model captures intuitive aspects of misalignment across different scenarios. We then apply our model to two case studies, including an autonomous vehicle setting, showcasing its practical utility. Our approach offers enhanced explanatory power for complex sociotechnical environments and could inform the design of more aligned AI systems in real-world applications.

Quantifying Misalignment Between Agents: Towards a Sociotechnical Understanding of Alignment

The advent of large language models (LLMs) has sparked significant interest in using natural language for preference learning. However, existing methods often suffer from high computational burdens, taxing human supervision, and lack of interpretability. To address these issues, we introduce MAPLE, a framework for large language model-guided Bayesian active preference learning. MAPLE leverages LLMs to model the distribution over preference functions, conditioning it on both natural language feedback and conventional preference learning feedback, such as pairwise trajectory rankings. MAPLE also employs active learning to systematically reduce uncertainty in this distribution and incorporates a language-conditioned active query selection mechanism to identify informative and easy-to-answer queries, thus reducing human burden. We evaluate MAPLE's sample efficiency and preference inference quality across two benchmarks, including a real-world vehicle route planning benchmark using OpenStreetMap data. Our results demonstrate that MAPLE accelerates the learning process and effectively improves humans' ability to answer queries.

MAPLE: A Framework for Active Preference Learning Guided by Large Language Models

Significant efforts have been made to analyze the political stance or bias in news articles, especially as political polarization intensifies over the years. Recent advancements in machine learning have enabled researchers to develop various bias prediction models, which typically learn features not only from the text of the news articles but also from external knowledge. However, when training these models, the political bias label assigned to a news article is often based solely on the news source that published it. This approach can be problematic, as a news outlet with a particular political stance might publish an article that reflects a different political perspective.

To address this issue, we first identify distinct text patterns that pertain to a particular news source (or publisher) and that carry little meaning for predicting the political bias of its news articles. Then, we conduct comprehensive experiments to investigate (i) whether prior models trained to predict the bias can also predict the source and (ii) whether prior models change their predictions if a distinct pattern of a source with a different political stance is inserted into a news article. Our experimental results reveal that all the prior models are prone to predicting the source even though they are trained to predict the bias. Finally, we suggest a new deep learning model for political bias prediction that avoids learning the source-indicative patterns.

Political Bias Prediction Models Focus on Source Cues, Not Semantics

The success of the reward model in distinguishing between responses with subtle safety differences depends critically on a high-quality preference dataset, which should capture the fine-grained nuances of harmful and harmless responses. This motivates the need to develop datasets with preference margins, which accurately quantify how harmless one response is compared to another. In this paper, we take the first step to propose an effective and cost-efficient framework to promote margin-enhanced preference dataset development. Our framework, Legend, Leverages rEpresentation enGineering to annotate preferENce Datasets. It constructs the specific direction within the LLM's embedding space that represents safety. Legend then uses the semantic distances of paired responses along this direction to annotate margins automatically. We experimentally demonstrate Legend's effectiveness in both reward modeling and harmless alignment for LLMs. Legend also stands out for its efficiency, requiring only inference time rather than additional training. This efficiency allows for easier implementation and scalability, making Legend particularly valuable for practical applications in aligning LLMs with safe conversations.
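The margin annotation described above can be sketched as a projection onto a safety direction. The sketch assumes the direction is the normalized difference of mean embeddings between harmless and harmful examples, a common representation-engineering heuristic that the abstract does not confirm; the function names and the toy 2-D embeddings are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def mean_vector(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def safety_direction(harmless: List[Vector], harmful: List[Vector]) -> List[float]:
    """Unit vector from the harmful mean to the harmless mean (an assumption)."""
    diff = [a - b for a, b in zip(mean_vector(harmless), mean_vector(harmful))]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


def margin(chosen: Vector, rejected: Vector, direction: Vector) -> float:
    """Signed distance between two responses measured along the safety direction."""
    return sum((c - r) * d for c, r, d in zip(chosen, rejected, direction))


# Toy 2-D embeddings standing in for LLM hidden states.
harmless = [[1.0, 0.0], [3.0, 0.0]]
harmful = [[-1.0, 0.0], [-3.0, 0.0]]
direction = safety_direction(harmless, harmful)
m = margin([2.0, 5.0], [0.5, -1.0], direction)
```

In this toy case the direction is the first axis, so the margin is the difference in the first coordinate (1.5) and the second coordinate is ignored. No training is involved, only forward passes to obtain embeddings.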

LEGEND: Leveraging Representation Engineering to Annotate Safety Margin for Preference Datasets

Human preference alignment is critical in building powerful and reliable large language models (LLMs). However, current methods either ignore the multi-dimensionality of human preferences (e.g., helpfulness and harmlessness) or struggle with the complexity of managing multiple reward models. To address these issues, we propose Sequential Preference Optimization (SPO), a method that sequentially fine-tunes LLMs to align with multiple dimensions of human preferences. SPO avoids explicit reward modeling, directly optimizing the models to align with nuanced human preferences. We theoretically derive a closed-form optimal SPO policy and loss function. We conduct a gradient analysis to show how SPO fine-tunes the LLMs while maintaining alignment on previously optimized dimensions. Empirical results on LLMs of different sizes and multiple evaluation datasets demonstrate that SPO successfully aligns LLMs across multiple dimensions of human preferences and significantly outperforms the baselines.

Sequential Preference Optimization: Multi-Dimensional Preference Alignment with Implicit Reward Modeling

Accurate standard plane acquisition in fetal ultrasound (US) videos is crucial for fetal growth assessment, anomaly detection, and adherence to clinical guidelines. However, manually selecting standard frames is time-consuming and prone to intra- and inter-sonographer variability. Existing methods primarily rely on image-based approaches that capture standard frames and then classify the input frames across different anatomies. This ignores the dynamic nature of video acquisition and its interpretation. To address these challenges, we introduce Multi-Tier Class-Aware Token Transformer (MCAT), a visual query-based video clip localization (VQ-VCL) method to assist sonographers by enabling them to capture a quick ultrasound sweep. When the sonographer then provides a visual query of the anatomy they wish to analyze, MCAT returns the video clip containing the standard frames for that anatomy, facilitating thorough screening for potential anomalies. We evaluate MCAT on two ultrasound video datasets and a natural image VQ-VCL dataset based on Ego4D. Our model outperforms state-of-the-art methods by 10% and 13% mtIoU on the ultrasound datasets and by 5.35% mtIoU on the Ego4D dataset, using 96% fewer tokens. MCAT's efficiency and accuracy have significant potential implications for public health, especially in low- and middle-income countries (LMICs), where it may enhance prenatal care by streamlining standard plane acquisition, simplifying ultrasound-based screening and diagnosis, and allowing sonographers to examine more patients. The code will be available at xxx.github.com and in supplementary material.
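For readers unfamiliar with the reported metric, the sketch below computes temporal IoU between a predicted clip and a ground-truth clip and averages it over queries. It assumes that mtIoU denotes mean temporal IoU and that clips are inclusive frame ranges; both are assumptions, since the abstract does not define the metric, and the function names are hypothetical.

```python
from typing import List, Tuple

Clip = Tuple[int, int]  # inclusive (start_frame, end_frame); an assumed convention


def temporal_iou(pred: Clip, truth: Clip) -> float:
    """Overlap of two frame ranges divided by the length of their union."""
    overlap = min(pred[1], truth[1]) - max(pred[0], truth[0]) + 1
    overlap = max(overlap, 0)  # disjoint clips have no overlap
    union = (pred[1] - pred[0] + 1) + (truth[1] - truth[0] + 1) - overlap
    return overlap / union


def mean_temporal_iou(preds: List[Clip], truths: List[Clip]) -> float:
    """Average temporal IoU over paired predictions and ground-truth clips."""
    scores = [temporal_iou(p, t) for p, t in zip(preds, truths)]
    return sum(scores) / len(scores)


partial = temporal_iou((10, 19), (15, 24))   # 5 shared frames out of 15 in the union
disjoint = temporal_iou((0, 4), (10, 14))    # no shared frames
average = mean_temporal_iou([(10, 19), (0, 4)], [(15, 24), (10, 14)])
```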

MCAT: Visual Query-Based Localization of Standard Anatomical Clips in Fetal Ultrasound Videos Using Multi-Tier Class-Aware Token Transformer

Local governments around the world are making consequential decisions on behalf of their constituents, and these constituents are responding with requests, advice, and assessments of their officials at public meetings. So many small meetings cannot be covered by traditional newsrooms at scale. We propose PublicSpeak, a probabilistic framework which can utilize meeting structure, domain knowledge, and linguistic information to discover public remarks in local government meetings. We then use our approach to inspect the issues raised by constituents in 7 cities across the United States. We evaluate our approach on a novel dataset of local government meetings and find that PublicSpeak improves over state-of-the-art by 10% on average, with gains of up to 40%.
