Two key capabilities of language models (LMs) are encoding prior knowledge about entities, which enables them to answer queries like "What's the official language of Austria?", and adapting to new information provided in context, e.g., "Pretend the official language of Austria is Tagalog." In this work, we present the family of targeted persuasion scores (TPS), designed to measure how persuasive a context is to an LM. Compared to evaluating persuasiveness from a model's decoded answer to a query alone, TPS offers a more fine-grained view of model behavior. Built on the Wasserstein distance, TPS captures how much a context can shift a model from its original answer distribution toward a target answer distribution and, furthermore, can flexibly incorporate relationships between possible answers to yield more meaningful measurements. Empirically, we demonstrate that analyzing models with TPS can reveal subtle behaviors that remain hidden when only a model's decoded answer is observed, e.g., how contradictory in-context information influences a model. Through TPS, we offer a way to measure more carefully the effect that a context has on a language model.
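To make the idea concrete, here is a minimal sketch of one plausible member of the TPS family, not the paper's exact definition: the relative reduction in Wasserstein distance to a target answer distribution once the context is provided, where a user-supplied ground cost matrix encodes relationships between candidate answers. The function names, the score's normalization, and the cost matrix below are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog


def wasserstein(p, q, cost):
    """Exact 1-Wasserstein (earth mover's) distance between categorical
    distributions p and q over the same answer set, under a ground cost
    matrix cost[i, j] between answers i and j."""
    n = len(p)
    # Decision variables: transport plan T, flattened row-major, T[i, j] >= 0.
    c = cost.ravel()
    # Row-marginal constraints: sum_j T[i, j] = p[i].
    a_rows = np.kron(np.eye(n), np.ones(n))
    # Column-marginal constraints: sum_i T[i, j] = q[j].
    a_cols = np.kron(np.ones(n), np.eye(n))
    res = linprog(c, A_eq=np.vstack([a_rows, a_cols]),
                  b_eq=np.concatenate([p, q]),
                  bounds=(0, None), method="highs")
    return res.fun


def targeted_persuasion_score(p_prior, p_context, p_target, cost):
    """Hypothetical TPS variant: how far the context moved the model
    toward the target distribution, relative to its starting distance.
    1 means the context shifted the model all the way to the target,
    0 means no movement, and negative values mean the context pushed
    the model away from the target."""
    d_before = wasserstein(p_prior, p_target, cost)
    d_after = wasserstein(p_context, p_target, cost)
    return (d_before - d_after) / d_before
```

A toy usage, mirroring the Austria/Tagalog example from the abstract (the 0/1 cost matrix, which reduces the Wasserstein distance to total variation, is again only for illustration):

```python
# Three candidate answers: "German", "Tagalog", "French".
cost = 1.0 - np.eye(3)
p_prior = np.array([0.90, 0.05, 0.05])   # before the context: "German"
p_target = np.array([0.0, 1.0, 0.0])     # counterfactual target: "Tagalog"
p_context = np.array([0.30, 0.60, 0.10]) # after reading the context
print(targeted_persuasion_score(p_prior, p_context, p_target, cost))
```

A decoded-answer evaluation would record this as a simple flip from "German" to "Tagalog" (or no flip at all), whereas the distributional score quantifies the partial shift the context actually induced.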