VIDEO DOI: https://doi.org/10.48448/fkw6-j889

poster

ACL 2024

August 12, 2024

Bangkok, Thailand

imapScore: Medical Fact Evaluation Made Easy

keywords:

medical fact verification

automatic NLG evaluation

medical QA

Automatic evaluation of natural language generation (NLG) tasks has attracted extensive research interest, since it can rapidly assess the performance of large language models (LLMs). However, automatic NLG evaluation struggles with medical QA because it fails to focus on the correctness of the medical facts throughout the generated text. To address this, this paper introduces a new data structure, imap, designed to capture the key information in questions and answers, enabling evaluators to focus on essential details. An imap comprises three components: Query, Constraint, and Inform, each of which is a set of term-value pairs that represent medical facts in a structured manner. We then introduce imapScore, which compares the corresponding medical term-value pairs in the imaps to score generated texts. We use GPT-4 to extract imaps from questions, human-annotated answers, and generated responses. To mitigate the diversity of medical terminology and ensure fair comparison of term-value pairs, we use a medical knowledge graph to assist GPT-4 in determining matches. To compare imapScore with existing NLG metrics, we establish a new benchmark dataset. Experimental results show that imapScore consistently outperforms state-of-the-art metrics, with an average improvement of 79.8% in correlation with human scores. Furthermore, incorporating imap into n-gram, embedding, and LLM metrics boosts the base versions, increasing correlation with human scores by averages of 89.9%, 81.7%, and 32.6%, respectively.
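To make the idea concrete, here is a minimal Python sketch of an imap-style comparison. The field names, the F1-style pair matching, and the equal weighting of the three components are all assumptions made for illustration; the actual imapScore relies on GPT-4 to extract the term-value pairs and on a medical knowledge graph to match divergent terminology, neither of which is reproduced here.

```python
# Illustrative sketch only: the abstract names the three imap components
# (Query, Constraint, Inform) as collections of term-value pairs, but the
# exact schema and scoring formula below are assumptions.
from dataclasses import dataclass, field

@dataclass
class Imap:
    """Medical facts as term-value pairs, split into the three components."""
    query: dict[str, str] = field(default_factory=dict)       # what is asked
    constraint: dict[str, str] = field(default_factory=dict)  # conditions, e.g. {"age": "65"}
    inform: dict[str, str] = field(default_factory=dict)      # facts stated in the answer

def component_f1(ref: dict[str, str], gen: dict[str, str]) -> float:
    """F1 over exactly matching term-value pairs. A real system would first
    normalize terminology (e.g. via a medical knowledge graph) before comparing."""
    if not ref and not gen:
        return 1.0
    matches = sum(1 for term, value in gen.items() if ref.get(term) == value)
    precision = matches / len(gen) if gen else 0.0
    recall = matches / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def imap_score(ref: Imap, gen: Imap) -> float:
    """Average the per-component scores (equal weighting is an assumption)."""
    parts = [
        component_f1(ref.query, gen.query),
        component_f1(ref.constraint, gen.constraint),
        component_f1(ref.inform, gen.inform),
    ]
    return sum(parts) / len(parts)

# Toy usage: the generated answer omits one constraint, so it is penalized.
ref = Imap(query={"topic": "hypertension treatment"},
           constraint={"age": "65", "comorbidity": "diabetes"},
           inform={"first-line drug": "ACE inhibitor"})
gen = Imap(query={"topic": "hypertension treatment"},
           constraint={"age": "65"},
           inform={"first-line drug": "ACE inhibitor"})
print(f"imapScore (sketch): {imap_score(ref, gen):.3f}")
```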
