poster
Characterizing Large Language Models as Rationalizers of Knowledge-intensive Tasks
keywords:
trust and reliability
explainable nlp
llms
Large language models (LLMs) are proficient at generating fluent text with minimal task-specific supervision. However, their ability to generate rationales for knowledge-intensive tasks (KITs) remains under-explored. Generating rationales for KIT solutions, such as commonsense multiple-choice QA, requires external knowledge to support predictions and refute alternative options. In this work, we consider the task of generating retrieval-augmented rationalization of KIT model predictions via external knowledge guidance within a few-shot setting. Surprisingly, crowd-workers preferred LLM-generated rationales over existing crowd-sourced rationales, generated in a similar knowledge-guided setting, on aspects such as factuality, sufficiency, and convincingness. However, fine-grained evaluation of such rationales highlights the need for further improvements in conciseness, novelty, and domain invariance. Additionally, through an expert-sourced study evaluating the reliability of the rationales, we demonstrate that humans' trust in LLM-generated rationales erodes when the rationales are communicated faithfully, i.e., without taking model prediction accuracy into account. We find that even instrumenting simple guardrails can be effective for reliable rationalization.