
Aligning the behavior of large language models (LLMs) with human intentions and values remains a critical challenge. Reinforcement learning from human feedback (RLHF) aligns LLMs by training a reward model (RM) on human preferences and fine-tuning the LLMs to maximize RM feedback. Despite its effectiveness and popularity, RLHF is prone to biased local optimization: the RM fails to provide feedback that accurately reflects human preferences, so the LLMs explore unexpected generalizations and fail to achieve the alignment objectives. To mitigate this issue, we propose a novel \textit{sequence-to-sequence (seq2seq) reward modeling} method. Its key insight is that learning from language feedback rather than scalar feedback improves RLHF without additional annotations. We change the reward modeling target from binary maximum likelihood estimation (MLE) to sequence MLE. This method enables richer, more fine-grained language feedback without additional annotations, models, or training stages. Our experiments demonstrate its effectiveness, specifically in reducing the refusal-to-response paradigm in single-turn safety dialogues and the long-response bias in text summarization tasks. Further analysis shows that seq2seq RM improves RLHF performance across 2B and 7B LLMs on 3 NLP tasks, achieving an average win rate of 76.9\%. We also show that seq2seq RM can still improve the performance of RLHF under out-of-distribution prompts.
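
As a rough illustration of the change in training target described above, the two objectives can be contrasted as follows. The notation is an assumption made here for illustration: $r_\phi$ is a scalar reward head, $p_\phi$ is a token-level model, and $y_w$, $y_l$ are the preferred and dispreferred responses to a prompt $x$. The abstract does not state the exact conditioning used by the seq2seq RM, so the second objective is only one plausible reading.

$$
\mathcal{L}_{\text{binary}}(\phi) = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\right]
$$

$$
\mathcal{L}_{\text{seq}}(\phi) = -\mathbb{E}_{(x, y_w, y_l)}\left[\sum_{t=1}^{|y_w|} \log p_\phi\big(y_{w,t} \mid x, y_l, y_{w,<t}\big)\right]
$$

The first is the standard pairwise (Bradley-Terry) objective, which yields a single scalar per response. The second is a token-level likelihood, which is what would allow feedback at the granularity of individual tokens.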

AAAI 2025

Sequence to Sequence Reward Modeling: Improving RLHF by Language Feedback

technical paper

We are pleased to announce the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25), which will be held in Philadelphia, Pennsylvania at the Pennsylvania Convention Center from February 25 to March 4, 2025.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.





We design sensitivity oracles for error-prone networks.
For a network problem $\Pi$, the data structure
preprocesses a network $G=(V,E)$ and a sensitivity parameter $f$ such that,
for any set $F \subseteq V\cup E$ of up to $f$ link or node failures,
it is able to quickly report a solution for $\Pi$ in $G{-}F$.
We study the following problems $\Pi$.
* $L$-Hop Shortest Path: Given $s,t \in V$, is there a shortest $s$-$t$-path in $G{-}F$ with at most $L$ links? 
* $k$-Path: Does $G{-}F$ contain a simple path with $k$ links?
* $k$-Clique: Does $G{-}F$ contain a clique of $k$ nodes?

Our main technical contribution is a new construction of $(L,f)$-replacement path coverings ($(L,f)$-RPC) in the parameter realm where $f = o(\log L)$.
An $(L,f)$-RPC is a family $\mathcal{G}$ of
subnetworks of $G$ which, for every $F \subseteq E$ with $|F| \le f$,
contains a subfamily $\mathcal{G}_F \subseteq \mathcal{G}$ such that (i) every subnetwork in $\mathcal{G}_F$ contains no link of $F$ and (ii) for each $s,t \in V$, if $G{-}F$ contains a shortest $s$-$t$ path with at most $L$ links, then some subnetwork in $\mathcal{G}_F$ retains at least one such path. Our $(L, f)$-RPC has almost the same size as the one by Weimann and Yuster, but it improves the query time to access $\mathcal{G}_F$ from $\widetilde{O}(f^2L^f)$ to $\widetilde{O}(f^{\frac{5}{2}} L^{o(1)})$.
It also improves both the size and the query time of the $(L,f)$-RPC by Karthik and Parter (2021) by nearly a factor of $L$.
We then derive oracles for $L$-Hop Shortest Path, $k$-Path, and $k$-Clique from this construction. Notably, our solution for $k$-Path improves the query time of the one by Bilò et al. (2022) for $f=o(\log k)$.
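
To make the $L$-Hop Shortest Path query concrete, here is a minimal sketch of the naive baseline that a sensitivity oracle is meant to beat: it does no preprocessing and simply reruns a breadth-first search on $G{-}F$ for every query. This is not the replacement path covering construction from the abstract. It assumes an unweighted, undirected network, and the function name and the adjacency-list representation are hypothetical choices made for this example.

```python
from collections import deque


def l_hop_shortest_path(adj, s, t, failed_nodes, failed_edges, L):
    """Naive query: is the s-t distance in G - F at most L links?

    adj          -- dict mapping each node to an iterable of neighbours
    failed_nodes -- set of failed nodes
    failed_edges -- set of frozenset({u, v}) pairs for failed links
    Assumes an unweighted, undirected network (illustrative assumption).
    """
    if s in failed_nodes or t in failed_nodes:
        return False
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            # Nodes are only ever enqueued at distance <= L.
            return True
        if dist[u] == L:
            continue  # expanding further would exceed the hop budget
        for v in adj.get(u, ()):
            if v in failed_nodes or v in dist:
                continue
            if frozenset((u, v)) in failed_edges:
                continue
            dist[v] = dist[u] + 1
            queue.append(v)
    return False


# A 4-cycle 0-1-2-3-0; failing the link {0, 3} forces the 3-hop detour.
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(l_hop_shortest_path(cycle, 0, 3, set(), {frozenset((0, 3))}, 1))
print(l_hop_shortest_path(cycle, 0, 3, set(), {frozenset((0, 3))}, 3))
```

Each query here costs time linear in the size of the network, which is the cost that preprocessing into an oracle aims to avoid.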

Efficient Fault-Tolerant Search by Fast Indexing of Subnetworks

Multiple cameras can provide comprehensive multi-view video coverage of a person. Fusing this multi-view data is crucial for tasks like behavioral analysis, although it traditionally requires camera calibration—a process that is often complex. Moreover, previous studies have overlooked the challenges posed by self-occlusion under multiple views and the continuity of human body shape estimation.
In this study, we introduce a method to reconstruct the 3D human body from multiple uncalibrated camera views. Initially, we utilize a pre-trained human body encoder to process each camera view individually, enabling the reconstruction of human body models and parameters for each view along with predicted camera positions.
Rather than merely averaging the models across views, we develop a neural network trained to assign weights to individual views for all human body joints, based on the estimated distribution of joint distances from each camera.
Additionally, we focus on the mesh surface of the human body for dynamic fusion, allowing for the seamless integration of facial expressions and body shape into a unified human body model.
Our method has shown excellent performance in reconstructing the human body on two public datasets, advancing beyond previous work from the SMPL model to the SMPL-X model. This extension incorporates more complex hand poses and facial expressions, enhancing the detail and accuracy of the reconstructions. Crucially, it supports the flexible ad-hoc deployment of any number of cameras, offering significant potential for various applications.
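
The idea of weighting views per joint instead of averaging them can be sketched as follows. In the abstract the weights come from a trained neural network; the sketch below swaps in a hand-written softmax over negative joint-to-camera distances purely as a stand-in, so that closer views count more. The function name, the `temperature` parameter, and the data layout are all assumptions made for this example.

```python
import math


def fuse_joints(joints_per_view, distances_per_view, temperature=1.0):
    """Fuse per-view 3D joint estimates with per-joint view weights.

    joints_per_view    -- joints_per_view[v][j] is an (x, y, z) tuple
    distances_per_view -- distances_per_view[v][j] is the estimated
                          distance of joint j from camera v
    The softmax weighting is a stand-in for the learned weighting network.
    """
    num_views = len(joints_per_view)
    num_joints = len(joints_per_view[0])
    fused = []
    for j in range(num_joints):
        # Closer cameras receive larger logits, hence larger weights.
        logits = [-distances_per_view[v][j] / temperature for v in range(num_views)]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        weights = [e / total for e in exps]
        fused.append(tuple(
            sum(weights[v] * joints_per_view[v][j][axis] for v in range(num_views))
            for axis in range(3)
        ))
    return fused


# Two views of one joint; equal distances reduce to a plain average.
views = [[(0.0, 0.0, 0.0)], [(2.0, 2.0, 2.0)]]
print(fuse_joints(views, [[1.0], [1.0]]))
```

With equal distances the result is the plain mean of the views; as one camera gets much closer to a joint, the fused estimate moves toward that camera's prediction.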

MUC: Mixture of Uncalibrated Cameras for Robust 3D Human Body Reconstruction

Training generally capable agents in complex environments is a challenging task that involves identifying the "right" environments at the training stage. Recent research has highlighted the potential of the Unsupervised Environment Design framework, which generates environment instances/levels adaptively at the frontier of the agent’s capabilities using regret measures. While regret-based approaches have shown great promise in generating feasible environments, they can produce environments that are too difficult for an RL agent to learn from. This is because regret represents the best-case (upper bound) learning potential of an environment, not its actual learning potential. To address this limitation, we propose an alternative mechanism that employs marginal benefit, focusing on the improvement (in terms of generalized performance) that the agent policy gains from a given environment. The advantage of this new mechanism is that it is agent-focused (rather than environment-focused) and generates the "right" environments depending on the agent's policy. Additionally, to improve the generalizability of the agent, we introduce a representative state diversity metric that aims to generate varied experiences for the agent. Finally, we provide detailed experimental results and ablation analysis to showcase the effectiveness of our new methods. We obtain state-of-the-art (SOTA) results among RL-based environment generation methods.

Marginal Benefit Driven RL Teacher for Unsupervised Environment Design

We adopt a parametric approach to analyze the worst-case degradation in social welfare when the allocation of indivisible goods is constrained to be fair. Specifically, we are concerned with cardinality-constrained allocations, which require that each agent has at most k items in their allocated bundle. We propose the notion of the price of cardinality, which captures the worst-case multiplicative loss of utilitarian or egalitarian social welfare resulting from imposing the cardinality constraint. We then characterize tight or almost-tight bounds on the price of cardinality as exact functions of the instance parameters, demonstrating how the social welfare improves as k increases. In particular, one of our main results refines and generalizes the existing asymptotic bound on the price of balancedness studied by Bei et al. (2021). We further extend our analysis to the problem where the items are partitioned into disjoint categories, and each category has its own cardinality constraint. Through a parametric study of the price of cardinality, we provide a framework that aids decision makers in choosing an ideal level of cardinality-based fairness, using their knowledge of the potential loss of utilitarian and egalitarian social welfare.
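
The "worst-case multiplicative loss" described above can be written in the usual price-of-fairness form. The symbols are assumptions introduced here for illustration and may differ from the paper's notation: $I$ ranges over instances, $\mathcal{A}(I)$ is the set of all allocations, $A_i$ is agent $i$'s bundle, and $v_i$ is agent $i$'s valuation.

$$
\mathrm{PoC}_k = \sup_{I} \frac{\max_{A \in \mathcal{A}(I)} \mathrm{SW}(A)}{\max_{A \in \mathcal{A}_k(I)} \mathrm{SW}(A)}, \qquad \mathcal{A}_k(I) = \{A \in \mathcal{A}(I) : |A_i| \le k \text{ for every agent } i\}
$$

$$
\mathrm{SW}_{\text{util}}(A) = \sum_{i} v_i(A_i), \qquad \mathrm{SW}_{\text{egal}}(A) = \min_{i} v_i(A_i)
$$

Under this reading, a larger $k$ enlarges $\mathcal{A}_k(I)$, so the denominator can only grow and the ratio can only shrink, which matches the statement that welfare improves as k increases.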

The (Exact) Price of Cardinality for Indivisible Goods: A Parametric Perspective

Text-based knowledge graph completion methods take advantage of pre-trained language models (PLMs) to enhance the intrinsic semantic connections of raw triplets with detailed text descriptions. Typical methods in this branch map an input query (the textual descriptions associated with an entity and a relation) and its candidate entities into feature vectors, and then maximize the probability of valid triples. These methods are achieving promising performance and attracting increasing attention owing to the rapid development of large language models. By the nature of language models, the more related and specific the context information in the input query, the more discriminative the resulting embedding will be. In this paper, through observation and validation, we identify a neglected fact: the relation-aware neighbors of the head entities in queries can act as effective contexts for more precise link prediction. Driven by this finding, we propose a relation-aware anchor enhanced knowledge graph completion method (RAA-KGC). Specifically, to provide a reference for what the target entity might look like, we first generate anchor entities within the relation-aware neighborhood of the head entity. Then, by pulling the query embedding towards the neighborhoods of the anchors, we tune it to be more discriminative for target entity matching. Our extensive experiments not only validate the efficacy of RAA-KGC but also reveal that integrating our relation-aware anchor enhancement strategy notably improves the performance of current leading methods without substantial modifications. The source code will be released after the acceptance of the paper.

Knowledge Graph Completion with Relation-Aware Anchor Enhancement

Social norms are standards of behaviour common in a society. However, when agents make decisions without considering how others are impacted, norms can emerge that lead to the subjugation of certain agents. We present RAWL-E, a method to create ethical norm-learning agents. RAWL-E agents operationalise maximin, a fairness principle from Rawlsian ethics, in their decision-making processes to promote ethical norms by balancing societal well-being with individual goals. We evaluate RAWL-E agents in simulated harvesting scenarios. We find that norms emerging in RAWL-E agent societies enhance social welfare, fairness, and robustness, and yield higher minimum experience compared to those that emerge in agent societies that do not implement Rawlsian ethics.
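
The maximin principle mentioned above can be illustrated in a few lines: among the available actions, prefer the one whose worst-off agent fares best. This is only a sketch of the principle itself, not of how RAWL-E integrates it into learning, which the abstract does not detail. The function name, action labels, and well-being numbers are invented for the example.

```python
def maximin_choice(outcomes):
    """Pick the action that maximises the minimum well-being.

    outcomes -- dict mapping an action label to a list with the resulting
                well-being of every agent in the society (hypothetical data)
    """
    return max(outcomes, key=lambda action: min(outcomes[action]))


# "hoard" has the higher total (10 vs 9) but leaves one agent with nothing.
harvest_outcomes = {
    "hoard": [10, 0],
    "share": [5, 4],
}
print(maximin_choice(harvest_outcomes))  # prints "share"
```

A purely utilitarian agent would pick "hoard" here because it maximises the sum, whereas maximin picks "share" because it raises the minimum.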

Operationalising Rawlsian Ethics for Fairness in Norm Learning Agents

Early exiting is an effective paradigm for improving the inference efficiency of pre-trained language models (PLMs) by dynamically adjusting the number of executed layers for each sample. However, in most existing works, easy and hard samples are treated equally by each classifier during training, which neglects the test-time early exiting behavior, leading to inconsistency between training and testing. Although some methods have tackled this issue under a fixed speed-up ratio, the challenge of flexibly adjusting the speed-up ratio while maintaining consistency between training and testing is still under-explored. To bridge the gap, we propose a novel Consistency-Oriented Signal-based Early Exiting (COSEE) framework, which leverages a calibrated sample weighting mechanism to enable each classifier to emphasize the samples that are more likely to exit at that classifier under various acceleration scenarios. Extensive experiments on the GLUE benchmark demonstrate the effectiveness of our COSEE across multiple exiting signals and backbones, yielding a better trade-off between performance and efficiency.
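
For readers unfamiliar with early exiting, the test-time behaviour the abstract refers to can be sketched as below: each layer has its own classifier, and inference stops at the first layer whose prediction is confident enough. Entropy is used here as one common exiting signal. The sketch covers inference only and does not implement COSEE's calibrated sample weighting; the function names and the threshold values are assumptions made for this example.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def early_exit(layer_probs, threshold):
    """Return (exit_layer, predicted_label) for one sample.

    layer_probs -- layer_probs[i] is the class distribution predicted by
                   the classifier attached to layer i + 1
    threshold   -- exit as soon as the entropy drops below this value
    """
    for depth, probs in enumerate(layer_probs, start=1):
        if entropy(probs) < threshold:
            label = max(range(len(probs)), key=probs.__getitem__)
            return depth, label
    # No classifier was confident enough: fall back to the final layer.
    last = layer_probs[-1]
    return len(layer_probs), max(range(len(last)), key=last.__getitem__)


# Confidence grows with depth; a looser threshold exits earlier.
sample = [[0.5, 0.5], [0.9, 0.1], [0.99, 0.01]]
print(early_exit(sample, threshold=0.4))
print(early_exit(sample, threshold=0.01))
```

Raising the threshold makes more samples exit at shallow layers, which is how the speed-up ratio is adjusted at test time. The training-testing inconsistency arises because the classifiers are usually trained on all samples, not only on those that would exit at their layer.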

COSEE: Consistency-Oriented Signal-Based Early Exiting via Calibrated Sample Weighting Mechanism

Common methods for aligning already-capable models with desired behavior rely on the ability of humans to provide supervision.
However, future superhuman models will surpass the capability of humans.
Therefore, humans will only be able to weakly supervise superhuman models.
This expected deficiency of human evaluation would weaken the safety of future AI systems.
Scalable oversight and weak-to-strong generalization are two complementary approaches to tackle this issue.
In this paper, we attempt to combine the strengths of these two approaches to further improve alignment.
Specifically, we investigate ways of improving human supervision with a strong pretrained model and then supervise the strong model with enhanced weak human supervision.
To make iterative empirical progress, we consider an analogy: can we use a strong model to improve weak model supervision and then use it to supervise the strong model?
We empirically test it by finetuning a small weak model on ground truth labels with the additional help from a large strong model, and then finetuning the strong model on labels generated by the weak model.
We find that debate can assist a weak model in extracting trustworthy information from an untrustworthy strong model, which provides leverage as context on samples when training a weak model.
We also show that an ensemble of weak models helps exploit long arguments generated by strong model debaters and obtain a more robust supervision estimate.
Extensive experiments on the OpenAI weak-to-strong NLP benchmarks show that the combination approach leads to better alignment, which indicates that debate has the potential to help weak-to-strong generalization.

Debate Helps Weak-to-Strong Generalization

Many critical business and societal decisions in areas such as supply chain and healthcare involve numerous potential actions, complex constraints, and goals that can be modeled as objective functions. Mathematical optimization, a core area in Operations Research (OR), provides robust, mathematically grounded methodologies to address such decisions and has shown tremendous benefits in many applications. However, its application requires the creation of accurate and efficient optimization models, necessitating rare expertise and considerable time, creating a barrier to widespread adoption in decision-making. Thus, it is a long-standing goal to make these capabilities widely accessible.

The advent of Large Language Models (LLMs) has made advanced Artificial Intelligence (AI) capabilities widely accessible through natural language. LLMs can accelerate expert work in creating formal models like computer programs, and emerging research indicates they can also speed up the development of optimization models by OR experts. We, therefore, propose integrating and advancing LLM and optimization modeling to empower organizational decision-makers to model and solve such complex problems without requiring deep expertise in optimization.

In this work, we present our vision for democratizing optimization modeling for organizational decision-making by such a combination of LLMs and optimization modeling. We identify a set of fundamental requirements for the vision's implementation and describe the state of the art through a literature survey and some experimentation. We show that a) LLMs already provide substantial novel capabilities relevant to realizing this vision, but that b) major research challenges remain to be addressed. We also propose possible research directions to overcome these gaps. We would like this work to serve as a call to action to bring together the LLM and OR optimization modeling communities to pursue this vision, thereby enabling much more widespread improved decision-making and increasing by orders of magnitude the benefits AI and OR can bring to enterprises and society.

Enhancing Decision Making Through the Integration of Large Language Models and Operations Research Optimization

AI has immense potential for positive social impact, including in domains ranging from conservation to health. However, it can be challenging to account for human collaborations and real-world uncertainties when deploying such systems, which can lead to critical errors. Therefore, my research focuses on developing new methods in multi-agent systems and machine learning, including methods for participatory design of AI, human-AI collaboration, and uncertainty quantification, to develop safe, impactful AI systems, particularly in the domains of water conservation and reproductive health.
