United States

Language models aligned for safety often exhibit fragile and imbalanced mechanisms, increasing the chances of producing unsafe content. In addition, editing techniques to incorporate new knowledge can further compromise safety. To tackle these issues, we propose SafeInfer, a context-adaptive, decoding-time safety alignment strategy for generating safe responses to user queries.
SafeInfer involves two phases: the 'safety amplification' phase, which uses safe demonstration examples to adjust the model’s hidden states and increase the likelihood of safer outputs, and the 'safety-guided decoding' phase, which influences token selection based on safety-optimized distributions to ensure the generated content adheres to ethical guidelines. Further, we introduce HarmEval, a novel benchmark for comprehensive safety evaluations, designed to address potential misuse scenarios in line with the policies of leading AI technology companies. We shall release the source code and dataset in the public domain upon acceptance.

AAAI 2025

SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models


poster

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

### [Invited Speakers](https://aaai.org/conference/aaai/aaai-25/aaai-25-invited-speakers/)

Register [here](https://aaai.org/conference/aaai/aaai-25/registration/)




Multi-objective preference alignment of large language models (LLMs) is critical for developing personalizable, helpful, and harmless AI systems. However, optimizing model outputs in the presence of diverse objectives, while also allowing the relative weights of these objectives to be varied at inference time, presents a significant challenge. Existing approaches are either computationally expensive to train or do not exhibit sufficient steerability at inference time. This paper introduces the Multi-Objective Online DPO (MO-ODPO) algorithm, designed to robustly and efficiently align model behaviors with multiple, potentially conflicting human preferences. Our approach incorporates a prompt conditioning mechanism, allowing for easy adjustment of user preferences at test time while only training a single policy, ensuring adaptive and personalized model performance. Experimental results on two popular benchmarks demonstrate the efficacy of MO-ODPO in achieving Pareto-optimal performance as well as good inference-time steerability compared with current state-of-the-art approaches.
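As a rough illustration of how prompt conditioning and a DPO-style objective could fit together, consider the sketch below. The weight-tag prompt format, the linear scalarization of per-objective rewards, and every function name are assumptions made for illustration; the abstract does not specify them. Only `dpo_loss` follows the standard, published DPO formula for a single preference pair.

```python
import math


def condition_prompt(prompt, weights):
    """Prefix the prompt with objective weights (hypothetical tag format)."""
    tags = " ".join(f"<{name}:{w:.2f}>" for name, w in sorted(weights.items()))
    return f"{tags} {prompt}"


def scalarize(rewards, weights):
    """Weighted sum of per-objective rewards (assumed linear scalarization)."""
    return sum(weights[name] * rewards[name] for name in weights)


def rank_pair(rewards_a, rewards_b, weights):
    """Return (chosen_index, rejected_index) for two sampled responses."""
    if scalarize(rewards_a, weights) >= scalarize(rewards_b, weights):
        return (0, 1)
    return (1, 0)


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss, -log(sigmoid(margin)), for one preference pair."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

In this sketch, changing the weights at test time only changes the prompt prefix, which is one way a single policy could be steered without retraining.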

Robust Multi-Objective Preference Alignment with Online DPO

We present a Reinforcement Learning Platform for Adversarial Black-box untargeted and targeted attacks that allows users to select from various distortion filters to create adversarial examples. The platform uses a Reinforcement Learning agent to add minimum distortion to input images while still causing misclassification by the target model. The agent uses a novel dual-action method to explore the input image at each step, identifying sensitive regions for adding distortions while removing noise that has less impact on the target model. This dual action leads to faster and more efficient convergence of the attack. The platform can also be used to measure the robustness of image classification models against specific distortion types. In addition, retraining the model with adversarial samples significantly improved robustness when evaluated on benchmark datasets. The proposed platform outperforms state-of-the-art methods in terms of the average number of queries required to cause misclassification. This advances trustworthiness with a positive social impact.

Reinforcement Learning Platform for Adversarial Black-box Attacks with Custom Distortion Filters

Existing work on the alignment problem has focused mainly on (1) qualitative descriptions of the alignment problem; (2) attempting to align AI actions with human interests by focusing on value specification and learning; and/or (3) focusing on a single agent or on humanity as a monolith. Recent sociotechnical approaches highlight the need to understand complex misalignment among multiple human and AI agents. We address this gap by adapting a computational social science model of human contention to the alignment problem. Our model quantifies misalignment in large, diverse agent groups with potentially conflicting goals across various problem areas. Misalignment scores in our framework depend on the observed agent population, the domain in question, and conflict between agents' weighted preferences. Through simulations, we demonstrate how our model captures intuitive aspects of misalignment across different scenarios. We then apply our model to two case studies, including an autonomous vehicle setting, showcasing its practical utility. Our approach offers enhanced explanatory power for complex sociotechnical environments and could inform the design of more aligned AI systems in real-world applications.

Quantifying Misalignment Between Agents: Towards a Sociotechnical Understanding of Alignment

The advent of large language models (LLMs) has sparked significant interest in using natural language for preference learning. However, existing methods often suffer from high computational burdens, taxing human supervision, and lack of interpretability. To address these issues, we introduce MAPLE, a framework for large language model-guided Bayesian active preference learning. MAPLE leverages LLMs to model the distribution over preference functions, conditioning it on both natural language feedback and conventional preference learning feedback, such as pairwise trajectory rankings. MAPLE also employs active learning to systematically reduce uncertainty in this distribution and incorporates a language-conditioned active query selection mechanism to identify informative and easy-to-answer queries, thus reducing human burden. We evaluate MAPLE's sample efficiency and preference inference quality across two benchmarks, including a real-world vehicle route planning benchmark using OpenStreetMap data. Our results demonstrate that MAPLE accelerates the learning process and effectively improves humans' ability to answer queries.

MAPLE: A Framework for Active Preference Learning Guided by Large Language Models

Significant efforts have been made to analyze the political stance or bias in news articles, especially as political polarization intensifies over the years. Recent advancements in machine learning have enabled researchers to develop various bias prediction models, which typically learn features not only from the text of the news articles but also from external knowledge. However, when training these models, the political bias label assigned to a news article is often based solely on the news source that published it. This approach can be problematic, as a news outlet with a particular political stance might publish an article that reflects a different political perspective.

To address this issue, we first identify distinct text patterns that are specific to a particular news source (or publisher) and carry little information about the political bias of its articles. Then, we conduct comprehensive experiments to investigate (i) whether the prior models trained to predict the bias can also predict the source and (ii) whether the prior models change their predictions when a distinct pattern from a source with a different political stance is inserted into a news article.
Our experimental results reveal that all the prior models are prone to predicting the source even though they are trained to predict the bias.
Finally, we propose a new deep learning model for political bias prediction that avoids learning the source-indicative patterns.

Political Bias Prediction Models Focus on Source Cues, Not Semantics

The success of a reward model in distinguishing between responses with subtle safety differences depends critically on a high-quality preference dataset, which should capture the fine-grained nuances of harmful and harmless responses. This motivates the need for datasets with preference margins, which accurately quantify how much more harmless one response is than another. In this paper, we take a first step by proposing an effective and cost-efficient framework for developing margin-enhanced preference datasets. Our framework, Legend, Leverages rEpresentation enGineering to annotate preferENce Datasets. It constructs a specific direction within the LLM's embedding space that represents safety. Legend then uses the semantic distances of paired responses along this safety direction to annotate margins automatically. We experimentally demonstrate its effectiveness in both reward modeling and harmless alignment for LLMs. Legend also stands out for its efficiency, requiring only inference time rather than additional training. This efficiency allows for easier implementation and scalability, making Legend particularly valuable for practical applications in aligning LLMs with safe conversations.
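The margin annotation described above can be pictured with a small sketch. It assumes, purely for illustration, that the safety direction is the normalized difference between the mean embeddings of harmless and harmful examples, and that the margin is the difference of the two responses' projections onto that direction. The abstract does not state how Legend actually builds the direction, and all function names here are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def safety_direction(harmless_embs, harmful_embs):
    """Unit vector pointing from the mean harmful embedding to the mean
    harmless embedding (an assumed construction, not the paper's)."""
    diff = [a - b for a, b in zip(mean_vector(harmless_embs), mean_vector(harmful_embs))]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("harmless and harmful means coincide; no direction")
    return [x / norm for x in diff]


def project(embedding, direction):
    """Scalar projection of an embedding onto a unit direction."""
    return sum(e * d for e, d in zip(embedding, direction))


def annotate_margin(chosen_emb, rejected_emb, direction):
    """Margin = how much further the chosen response lies along the
    safety direction than the rejected one."""
    return project(chosen_emb, direction) - project(rejected_emb, direction)
```

Because the sketch only needs embeddings from forward passes, it is consistent with the abstract's point that annotation requires inference but no additional training.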

LEGEND: Leveraging Representation Engineering to Annotate Safety Margin for Preference Datasets

Human preference alignment is critical in building powerful and reliable large language models (LLMs). However, current methods either ignore the multi-dimensionality of human preferences (e.g., helpfulness and harmlessness) or struggle with the complexity of managing multiple reward models. To address these issues, we propose Sequential Preference Optimization (SPO), a method that sequentially fine-tunes LLMs to align with multiple dimensions of human preferences. SPO avoids explicit reward modeling, directly optimizing the models to align with nuanced human preferences. We theoretically derive a closed-form optimal SPO policy and loss function. Gradient analysis is conducted to show how SPO manages to fine-tune the LLMs while maintaining alignment on previously optimized dimensions. Empirical results on LLMs of different sizes and multiple evaluation datasets demonstrate that SPO successfully aligns LLMs across multiple dimensions of human preferences and significantly outperforms the baselines.

Sequential Preference Optimization: Multi-Dimensional Preference Alignment with Implicit Reward Modeling

Accurate standard plane acquisition in fetal ultrasound (US) videos is crucial for fetal growth assessment, anomaly detection, and adherence to clinical guidelines. However, manually selecting standard frames is time-consuming and prone to intra- and inter-sonographer variability. Existing methods primarily rely on image-based approaches that capture standard frames and then classify the input frames across different anatomies. This ignores the dynamic nature of video acquisition and its interpretation. To address these challenges, we introduce Multi-Tier Class-Aware Token Transformer (MCAT), a visual query-based video clip localization (VQ-VCL) method to assist sonographers by enabling them to capture a quick ultrasound sweep. By then providing a visual query of the anatomy they wish to analyze, MCAT returns the video clip containing the standard frames for that anatomy, facilitating thorough screening for potential anomalies. We evaluate MCAT on two ultrasound video datasets and a natural image VQ-VCL dataset based on Ego4D. Our model outperforms state-of-the-art methods by 10% and 13% mtIoU on the ultrasound datasets and by 5.35% mtIoU on the Ego4D dataset, using 96% fewer tokens. MCAT’s efficiency and accuracy have significant potential implications for public health, especially in low- and middle-income countries (LMICs), where it may enhance prenatal care by streamlining standard plane acquisition, simplifying ultrasound-based screening and diagnosis, and allowing sonographers to examine more patients. The code will be available at xxx.github.com and in supplementary material.
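For readers unfamiliar with the reported metric, the sketch below assumes that mtIoU denotes the mean temporal intersection-over-union between predicted and ground-truth clips, each given as a hypothetical `(start, end)` pair in frames or seconds. The abstract does not define the metric, so the paper's exact definition may differ.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two clips given as (start, end) with end >= start."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def mean_temporal_iou(preds, gts):
    """Average temporal IoU over paired predictions and ground truths."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        raise ValueError("at least one clip pair is required")
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)
```

A predicted clip that overlaps half of the ground-truth clip and extends equally far beyond it scores one third, which shows how the metric penalizes both missed and extra frames.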

MCAT: Visual Query-Based Localization of Standard Anatomical Clips in Fetal Ultrasound Videos Using Multi-Tier Class-Aware Token Transformer

Local governments around the world are making consequential decisions on behalf of their constituents, and these constituents are responding with requests, advice, and assessments of their officials at public meetings. So many small meetings cannot be covered by traditional newsrooms at scale. We propose PublicSpeak, a probabilistic framework which can utilize meeting structure, domain knowledge, and linguistic information to discover public remarks in local government meetings. We then use our approach to inspect the issues raised by constituents in 7 cities across the United States. We evaluate our approach on a novel dataset of local government meetings and find that PublicSpeak improves over the state of the art by 10% on average, with gains of up to 40%.

PUBLICSPEAK: Hearing the Public with a Probabilistic Framework

Large language models (LLMs) offer a valuable technology for various applications in healthcare. However, their tendency to hallucinate and the existing reliance on proprietary systems pose challenges in environments involving critical decision-making and strict data privacy regulations, such as healthcare, where trust in such systems is paramount. By combining the strengths and discounting the weaknesses of humans and AI, the field of Human-AI Collaboration (HAIC) presents one front for tackling these challenges and hence improving trust. This paper presents a novel HAIC *guided deferral* system that can simultaneously parse medical reports and defer uncertain predictions with intelligent guidance to humans. We develop a methodology for building efficient, effective, and open-source LLMs for this purpose, suitable for real-world deployment in healthcare. We conduct a pilot study which showcases the effectiveness of our proposed system in practice. Additionally, we highlight drawbacks of standard calibration metrics in imbalanced data scenarios commonly found in healthcare, and suggest a simple yet effective solution: the Imbalanced Expected Calibration Error ($\mathrm{ECE_{Imb}}$). We release our code for practitioners wishing to replicate our system.
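For context on the standard calibration metrics mentioned above, here is a minimal sketch of the usual Expected Calibration Error with equal-width confidence bins. It is the well-known baseline metric, not the proposed $\mathrm{ECE_{Imb}}$, whose definition the abstract does not give. The function name and the binning convention (left-closed bins, with confidence 1.0 placed in the last bin) are choices made for this example.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE: sample-weighted average gap between accuracy and
    mean confidence across equal-width confidence bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        raise ValueError("at least one prediction is required")

    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, is_correct))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece
```

Because each bin is weighted by its share of all samples, bins dominated by the majority class drive the score, which is one way to see why such a metric can be uninformative on imbalanced data.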
