

Workshop paper
Analysis of LLM’s “Spurious” Correct Answers Using Evidence Information of Multi-hop QA Datasets
keywords:
explainability
large language models
knowledge base
multi-hop question answering
Recent LLMs show impressive accuracy on one of the hallmark tasks of language understanding, Question Answering (QA). However, it is not clear whether the correct answers provided by LLMs are actually grounded in correct knowledge related to the question. In this paper, we use multi-hop QA datasets to evaluate the accuracy of the knowledge LLMs use to answer questions, and show that as much as 31% of the LLMs' correct answers are in fact spurious, i.e., the answer is correct while the knowledge the LLM used to ground it is wrong. We present an analysis of these spurious correct answers by GPT-4 using three datasets in two languages, and suggest future pathways for correcting the grounding information using existing external knowledge bases.
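
To make the notion of a "spurious" correct answer concrete, the following minimal Python sketch (not the authors' code; all function and variable names are hypothetical) flags a prediction as spurious when the predicted answer matches the gold answer but the evidence the model cites fails to cover the gold supporting facts, in the style of multi-hop datasets such as HotpotQA that annotate both answers and supporting evidence.

    # Minimal sketch: classify a prediction as a "spurious" correct answer,
    # i.e., the answer is right but the cited grounding evidence is wrong.
    # All names are hypothetical; this is an illustration, not the paper's code.

    def normalize(text: str) -> str:
        """Lowercase and collapse whitespace for a crude string match."""
        return " ".join(text.lower().split())

    def is_spurious_correct(pred_answer: str,
                            gold_answer: str,
                            pred_evidence: set[str],
                            gold_evidence: set[str]) -> bool:
        """True if the answer is correct but the gold supporting facts
        are not all present in the model's cited evidence."""
        answer_correct = normalize(pred_answer) == normalize(gold_answer)
        evidence_correct = {normalize(e) for e in gold_evidence} <= {
            normalize(e) for e in pred_evidence
        }
        return answer_correct and not evidence_correct

    # Example: the model names the right answer but cites the wrong hop.
    print(is_spurious_correct(
        pred_answer="Paris",
        gold_answer="Paris",
        pred_evidence={"The Eiffel Tower is a landmark."},
        gold_evidence={"The Louvre is in Paris.",
                       "Paris is the capital of France."},
    ))  # True: correct answer, wrong grounding

In practice, exact string matching over evidence sentences is a simplification; sentence-ID matching or a softer overlap metric against the annotated supporting facts would serve the same purpose.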