VIDEO DOI: https://doi.org/10.48448/zfs3-1p37

poster

ACL 2024

August 13, 2024

Bangkok, Thailand

CharacterEval: A Chinese Benchmark for Role-Playing Conversational Agent Evaluation

Keywords: role-playing, large language model, evaluation

The recent advent of large language models (LLMs) has revolutionized generative agents. Among them, Role-Playing Conversational Agents (RPCAs) have attracted considerable attention due to their ability to engage users emotionally. However, the absence of a comprehensive benchmark impedes progress in this field. To bridge this gap, we introduce CharacterEval, a Chinese benchmark for comprehensive RPCA assessment, complemented by a tailored high-quality dataset. The dataset comprises 1,785 multi-turn role-playing dialogues, encompassing 11,376 examples and featuring 77 characters derived from Chinese novels and scripts. It was carefully constructed, beginning with initial dialogue extraction via GPT-4, followed by rigorous human-led quality control, and enhanced with in-depth character profiles sourced from Baidu Baike. CharacterEval employs a multifaceted evaluation approach, encompassing thirteen targeted metrics across four dimensions. To facilitate convenient evaluation of these subjective metrics in CharacterEval, we further developed CharacterRM, a role-playing reward model based on human annotations, which achieves higher correlation with human judgment than GPT-4. Comprehensive experiments on CharacterEval demonstrate that Chinese LLMs exhibit more promising capabilities than GPT-4 in Chinese role-playing conversation. (The source code, data source, and reward model will be publicly accessible after acceptance.)
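The abstract's claim that CharacterRM correlates better with human judgment than GPT-4 refers to the standard practice of comparing an automatic judge's scores against human annotations via rank and linear correlation. The sketch below is not the authors' code; the scores are hypothetical and the judge names are used only for illustration.

```python
# Minimal sketch: comparing two automatic judges against human ratings
# by Pearson/Spearman correlation, as done when contrasting a reward
# model (e.g., CharacterRM) with GPT-4-as-judge. All scores are made up.
from scipy.stats import pearsonr, spearmanr

# Hypothetical 1-5 ratings for the same set of role-playing responses.
human_scores = [4, 3, 5, 2, 4, 3, 5, 1]   # human annotators
reward_model = [4, 3, 4, 2, 5, 3, 5, 2]   # reward-model judge
gpt4_judge   = [5, 4, 4, 3, 3, 4, 5, 3]   # GPT-4 as judge

for name, scores in [("RewardModel", reward_model), ("GPT-4", gpt4_judge)]:
    r, _ = pearsonr(human_scores, scores)      # linear agreement
    rho, _ = spearmanr(human_scores, scores)   # rank agreement
    print(f"{name}: Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

The judge whose scores yield higher correlations is the one that better tracks human judgment on the benchmark's subjective metrics.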
