Wikipedia is the largest open knowledge corpus, widely used worldwide and serving as a key resource for training large language models (LLMs) and retrieval-augmented generation (RAG) systems. Ensuring its accuracy is therefore critical. But how accurate is Wikipedia? In this paper, we focus on inconsistencies, a specific type of factual inaccuracy. We introduce the task of corpus-level inconsistency detection and present WikiCollide, a human-annotated dataset for this task. We also propose CLAIRE, an agent-based system that combines an LLM with information retrieval to identify inconsistencies effectively, outperforming strong LLM baselines by 2.1% AUROC on our dataset. Based on our findings, we estimate that at least 79.9 million facts (approximately 3.3%) in the English Wikipedia contradict at least one other fact within the corpus (99% confidence interval: 37.6 million to 121.9 million). We further show that these inconsistencies propagate into widely used NLP datasets, affecting gold labels in at least 7.3% of examples in the fact-verification dataset FEVEROUS and 4.0% in the question-answering dataset AmbigQA. In a user study with experienced Wikipedia editors, 87.5% of participants reported increased confidence in identifying inconsistencies when using CLAIRE, and they discovered on average 64.7% more inconsistencies in the same amount of time. Our results demonstrate that LLM-based tools can effectively assist humans in detecting inconsistencies in large-scale corpora.
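The corpus-level estimate above (a point estimate with a 99% confidence interval, scaled to all of English Wikipedia) can be illustrated with a small sketch. The sample counts, the corpus size, and the use of a normal-approximation (Wald) interval here are all assumptions for illustration; the abstract does not state the paper's actual sample size or interval method.

```python
import math

def proportion_ci(k, n, z=2.576):
    """Normal-approximation (Wald) confidence interval for a proportion.

    k: number of inconsistent facts found in the sample
    n: number of sampled facts
    z: critical value; 2.576 corresponds to a 99% confidence level
    """
    p = k / n
    half = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)

def extrapolate(k, n, corpus_size, z=2.576):
    """Scale the sampled inconsistency rate and its CI to the full corpus."""
    lo_p, hi_p = proportion_ci(k, n, z)
    p = k / n
    return p * corpus_size, lo_p * corpus_size, hi_p * corpus_size

# Hypothetical numbers: 33 inconsistent facts in a sample of 1,000,
# extrapolated to a hypothetical corpus of 2.4 billion facts.
est, lo, hi = extrapolate(33, 1000, 2_400_000_000)
```

With a real annotated sample, a tighter interval method (e.g. Wilson or bootstrap) would typically be preferred over the Wald approximation, especially for small inconsistency rates.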