
Anca Dragan
Associate Professor @ University of California, Berkeley
pragmatics
instruction following
interactive human-in-the-loop machine learning
learning human values and preferences
imitation learning & inverse reinforcement learning
reinforcement learning
6 presentations · 18 views
SHORT BIO
I am an Associate Professor in the EECS Department at UC Berkeley, currently on leave to head AI Safety and Alignment at Google DeepMind.
The goal of my research at UC Berkeley has been to enable AI agents (from robots to cars to LLMs to recommender systems) to work with, around, and in support of people. I run the InterACT Lab, where we focus on algorithms for human-AI and human-robot interaction. One of the core problems we have worked on since the lab's inception is AI alignment: getting AI agents to do what people actually want. This has meant learning reward functions interactively, from diverse forms of human feedback, across different modalities, while maintaining uncertainty. We have also contributed algorithms for human-AI collaboration and coordination, such as agents fluently working together with human-driven avatars in games, assistance and adaptation in brain-machine interfaces, and autonomous cars sharing the road with human drivers.
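To make the reward-learning idea concrete, here is a minimal toy sketch (my own illustration, not code from the InterACT Lab or any specific paper): Bayesian reward inference from pairwise preference feedback under a Bradley-Terry choice model, where the agent maintains a posterior over a reward weight rather than a point estimate. All names and numbers below are hypothetical.

```python
import numpy as np

def p_prefers_a(theta, feat_a, feat_b):
    """Bradley-Terry model: P(human prefers trajectory A over B | reward weight theta)."""
    return 1.0 / (1.0 + np.exp(-theta * (feat_a - feat_b)))

# Grid of candidate reward weights with a uniform prior: the agent starts
# out uncertain about what the person actually wants.
thetas = np.linspace(-2.0, 2.0, 201)
posterior = np.full_like(thetas, 1.0 / len(thetas))

# Hypothetical interactive queries: (feature value of A, feature value of B,
# whether the human chose A). In practice these would be trajectory features.
feedback = [(1.0, 0.2, True), (0.5, 0.9, False), (1.2, 0.1, True)]

for feat_a, feat_b, chose_a in feedback:
    p_a = p_prefers_a(thetas, feat_a, feat_b)
    posterior *= p_a if chose_a else (1.0 - p_a)  # Bayes update from one answer
    posterior /= posterior.sum()                  # renormalize

# Report the posterior mean and the remaining uncertainty.
mean = float((thetas * posterior).sum())
std = float(np.sqrt(((thetas - mean) ** 2 * posterior).sum()))
print(f"inferred reward weight: {mean:.2f} (posterior std {std:.2f})")
```

Keeping the full posterior, rather than a single estimate, is what lets an agent express how uncertain it still is about the person's preferences, for example to decide which query to ask next.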
At Google DeepMind, I currently lead a collection of teams responsible both for the safety of the current Gemini models and for preparing for Gemini capabilities to keep advancing, ensuring that safety advances hand in hand. This means ensuring Gemini models are, and will remain, aligned with human goals and values: avoiding present-day harms and catastrophic risks, enabling models to better and more robustly understand human preferences, enabling informed oversight, increasing robustness to adversarial attacks, and accounting for the plurality of human values and viewpoints.
Previously, I helped found the Berkeley AI Research (BAIR) Lab and served on its steering committee. I have been (and still am) a co-PI of the Center for Human-Compatible AI. I have consulted for Waymo for the past six years, helping with the roadmap for deploying an increasingly learning-based safety-critical system. I have been honored with a Sloan Fellowship, the MIT TR35, the Okawa Award, an NSF CAREER Award, and the PECASE Award. I take most pride in my former students, who have gone on to faculty positions at MIT, Stanford, CMU, and Princeton, and to industry positions at DeepMind, Waymo, and Meta.
Presentations

My Journey in AI Safety and Alignment
Anca Dragan

Learning Optimal Advantage from Preferences and Mistaking It for Reward
Brad Knox and 6 other authors

The Effect of Modeling Human Rationality Level on Learning Rewards from Multiple Feedback Types
Gaurav Ghosal and 3 other authors

Inferring Rewards from Language in Context
Jessy Lin and 3 other authors

Evaluating the Robustness of Collaborative Agents
Micah Carroll and 6 other authors

Evaluating the Robustness of Collaborative Agents
Paul Knott and 5 other authors