Austria

Language model (LM) re-rankers are used to refine retrieval results for retrieval-augmented generation (RAG). They are more expensive than lexical matching methods like BM25 but assumed to better process semantic information and the relations between the query and the retrieved answers. To understand whether LM re-rankers always live up to this assumption, we evaluate 6 different LM re-rankers on the NQ, LitQA2 and DRUID datasets. Our results show that LM re-rankers struggle to outperform a simple BM25 baseline on DRUID. Leveraging a novel separation metric based on BM25 scores, we explain and identify re-ranker errors stemming from lexical dissimilarities. We also investigate different methods to improve LM re-ranker performance and find these methods mainly useful for NQ. Taken together, our work identifies and explains weaknesses of LM re-rankers and points to the need for more adversarial and realistic datasets for their evaluation.

ACL 2025

Language Model Re-rankers are Fooled by Lexical Similarities

workshop paper
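
The BM25 baseline mentioned in the abstract above can be illustrated with a minimal sketch. This is not the paper's code and does not reproduce its separation metric; the function name `bm25_scores`, the whitespace tokenizer, the parameter defaults `k1=1.5` and `b=0.75`, and the non-negative IDF variant are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25.

    Hypothetical helper: whitespace tokenization and the k1/b defaults
    are illustrative assumptions, not values taken from the paper.
    """
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # no exact lexical match, no contribution
            # Non-negative IDF variant (assumption).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "quantum physics lecture notes",
]
scores = bm25_scores("cat mat", docs)
```

In this toy example the second document scores zero even though it mentions "cats", because BM25 only counts exact token matches. That is the kind of lexical dissimilarity a semantic re-ranker is assumed to handle better.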

### Welcome to The 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025)

Message from the General Chair: 
*It is my great pleasure and honor to welcome you to the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), held in beautiful Vienna, Austria, from July 27 to August 1, 2025. ACL 2025 continues our field’s tradition of excellence in scholarship, innovation, and inclusivity, and I am deeply grateful to the many volunteers who have worked tirelessly to bring this event to life.* 
[Read more](https://drive.google.com/file/d/1GI_hvOpjswAuYdUTromfeDiPpCcqidwg/view?usp=sharing)


Social media has emerged as a valuable source for early pandemic detection, as repeated mentions of symptoms by users may signal the onset of an outbreak. However, to be reliable, such a system requires validation through fact-checking and verification against official health records. Without this step, systems risk spreading misinformation to the public. The effectiveness of these systems also depends on their ability to process data in multiple languages, given the multilingual nature of social media data. Yet, many NLP datasets and disease surveillance systems remain heavily English-centric, leading to significant performance gaps for low-resource languages. This issue is especially critical in Southeast Asia, where symptom expression may vary culturally and linguistically. Therefore, this study evaluates the symptom detection capabilities of LLMs in social media posts across multiple languages, models, and symptoms to enhance health-related fact-checking. Our results reveal significant language-based discrepancies, with European languages outperforming under-resourced Southeast Asian languages. Furthermore, we identify symptom-specific challenges, particularly in detecting respiratory illnesses such as influenza, which LLMs tend to overpredict. The overestimation or misclassification of symptom mentions can lead to false alarms or public misinformation when deployed in real-world settings. This underscores the importance of symptom detection as a critical first step in medical fact-checking within early outbreak detection systems.



Multilingual Symptom Detection on Social Media: Enhancing Health-related Fact-checking with LLMs


We propose a method of improving the performance of question answering based on the interpretation of criminal law regulations in the Korean language by using large language models. In this study, we develop a system that accumulates legislative texts and case precedents related to criminal procedures published on the Internet. The system searches for relevant legal provisions and precedents related to the query under the RAG (Retrieval-Augmented Generation) framework. It generates accurate responses to questions by conducting reasoning through large language models based on these relevant laws and precedents. As an application example, the system can be used to support decision making in investigations and legal interpretation scenarios within the field of Korean criminal law.

RAG based Question Answering of Korean Laws and Precedents

Structured fact verification benchmarks like AVeriTeC decompose claims into QA pairs to support fine-grained reasoning. However, current systems generate QA pairs independently for each evidence sentence, leading to redundancy, drift, and noise. We introduce a modular LLM-based QA consolidation module that jointly filters, clusters, and rewrites QA pairs at the claim level. Experiments show that this method improves evidence quality and veracity prediction accuracy. Our analysis also highlights the impact of model scale and alignment on downstream performance.


GQC: LLM-Based Grouped QA Consolidation for Open-Domain Fact Verification at AVeriTeC

While Large Language Models (LLMs) with retrieval augmented generation (RAG) capabilities are increasingly used to verify information, their reliability for political fact-checking remains questionable. We present a comprehensive evaluation of eight state-of-the-art LLMs on their ability to fact-check political statements related to the 2024 U.S. presidential election. Using a dataset of 1,374 statements derived from 530 original claims fact-checked by major organizations in the run-up to the elections, we test both original and reformulated versions of statements. Our findings reveal that even the best-performing models achieve only modest accuracy (macro F1 score of 0.51), with RAG providing minimal improvements over models without search capabilities. Models particularly struggle with nuanced "misleading" statements and demonstrate poor robustness to reformulations. These results indicate that users relying on LLMs for political fact-checking are likely to receive inconsistent and sometimes incorrect assessments, even for statements that have been previously fact-checked by professional organizations and are readily available online.

RAG Is Not Enough: Evidence of Fact-Checking Limitations During the 2024 U.S. Presidential Elections
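
The macro F1 figure quoted in the abstract above averages per-class F1 scores with equal weight, so a class the model never predicts correctly pulls the score down regardless of how rare it is. The following is a minimal sketch of that computation. The function name `macro_f1` and the three toy labels are illustrative assumptions; the abstract names only the "misleading" category, and the data below is made up.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    f1_values = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_values.append(2 * tp / denom if denom else 0.0)
    return sum(f1_values) / len(f1_values)


# Toy data (hypothetical): the single "misleading" statement is misclassified.
gold = ["true", "true", "false", "misleading"]
pred = ["true", "false", "false", "true"]
score = macro_f1(gold, pred)
```

Here half of the predictions are correct, but the missed "misleading" class contributes an F1 of zero, so the macro average falls below the plain accuracy.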

Automated fact-checking (AFC) of factual claims requires efficiency and accuracy. Existing evaluation frameworks like Ev2R achieve strong semantic grounding but incur substantial computational cost, while simpler metrics based on overlap or one-to-one matching often misalign with human judgments. In this paper, we introduce SemQA, a lightweight and accurate evidence-scoring metric that combines transformer-based question scoring with bidirectional NLI entailment on answers. We evaluate SemQA by conducting human evaluations, analyzing correlations with existing metrics, and examining representative examples.

SemQA: Evaluating Evidence with Question Embeddings and Answer Entailment for Fact Verification

Given the limited computational and financial resources of news agencies, real-life usage of fact-checking systems requires fast response times. For this reason, our submission to the FEVER-8 claim verification shared task focuses on optimizing the efficiency of such pipelines built around subtasks such as evidence retrieval and veracity prediction. We propose the Semantic Filtering for Efficient Fact Checking (SFEFC) strategy, which is inspired by the FEVER-8 baseline and designed with the goal of reducing the number of LLM calls and other computationally expensive subroutines. Furthermore, we explore the reuse of cosine similarities initially calculated within a dense retrieval step to retrieve the top 10 most relevant evidence sentence sets. We use these sets for semantic filtering methods based on similarity scores and create filters for particularly hard classification labels "Not Enough Information" and "Conflicting Evidence/Cherrypicking" by identifying thresholds for potentially relevant information and the semantic variance within these sets. Compared to the parallelized FEVER-8 baseline, which takes 33.88 seconds on average to process a claim according to the FEVER-8 shared task leaderboard, our non-parallelized system remains competitive in regard to AVeriTeC retrieval scores while reducing the runtime to 7.01 seconds, achieving the fastest average runtime per claim.

Exploring Semantic Filtering Heuristics For Efficient Claim Verification
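
The abstract above describes filters built from similarity thresholds and from the semantic variance within the retrieved evidence sets. One way such a filter could look is sketched below. This is not the authors' implementation: the function name `semantic_filter`, both threshold values, and the direction of each rule (low similarity mapped to "Not Enough Information", high variance mapped to "Conflicting Evidence/Cherrypicking") are assumptions made for illustration only.

```python
from statistics import pvariance


def semantic_filter(similarities, min_relevance=0.35, max_variance=0.02):
    """Return an early label, or None to fall through to the full pipeline.

    `similarities` holds the cosine similarities between a claim and its
    retrieved evidence sets, reused from the dense retrieval step.
    Both thresholds are hypothetical placeholders.
    """
    # No evidence set is similar enough to the claim (assumed rule).
    if not similarities or max(similarities) < min_relevance:
        return "Not Enough Information"
    # Evidence similarities are widely spread (assumed rule).
    if pvariance(similarities) > max_variance:
        return "Conflicting Evidence/Cherrypicking"
    # Otherwise defer to the more expensive classification step.
    return None


label = semantic_filter([0.10, 0.20, 0.15])
```

A filter of this shape only reads scores that retrieval has already computed, so claims it labels early skip the later LLM call entirely, which matches the efficiency goal stated in the abstract.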

With the growing volume of misinformation online, automated fact-checking systems are becoming increasingly important. This paper presents SANCTUARY, an efficient pipeline for evidence-based verification of real-world claims. Our approach consists of three stages: Hypothetical Question & Passage Generation, a two-step Retrieval-Augmented Generation (RAG) hybrid evidence retrieval, and structured reasoning and prediction, which leverages two lightweight Large Language Models (LLMs). On the challenging AVeriTeC benchmark, our system achieves 25.27 points on the new AVeriTeC score (Ev2R recall), outperforming the previous state-of-the-art baseline by 5 absolute points (1.25× relative improvement). SANCTUARY demonstrates that careful retrieval, reasoning strategies and well-integrated language models can substantially advance automated fact-checking performance.

SANCTUARY: An Efficient Evidence-based Automated Fact Checking System

In this paper, we present our fact-checking pipeline, which scored first in the FEVER 8 shared task. Our fact-checking system is a simple two-step RAG pipeline based on our last year's submission. We show how the pipeline can be redeployed on-premise, achieving state-of-the-art fact-checking performance (in the sense of Ev2R test score), even under the constraint of a single Nvidia A10 GPU, 23GB of graphical memory and 60s running time per claim.

AIC CTU@FEVER 8: On-premise fact checking through long context RAG

With rapid advancements in large language models (LLMs) across artificial intelligence, machine learning, and data science, there is a growing need for evaluation frameworks that go beyond traditional performance metrics. Conventional methods focus mainly on accuracy and computational metrics, often neglecting user experience and community interaction, key elements in open-source environments. This paper introduces a multi-dimensional, user-centered evaluation framework, integrating metrics like User Engagement Index (UEI), Community Response Rate (CRR), and a Time Weight Factor (TWF) to assess LLMs' real-world impact. Additionally, we propose an adaptive weighting mechanism using Bayesian optimization to dynamically adjust metric weights for more accurate model evaluation. Experimental results confirm that our framework effectively identifies models with strong user engagement and community support, offering a balanced, data-driven approach to open-source LLM evaluation. This framework serves as a valuable tool for developers and researchers in selecting and improving open-source models.

Multi-Dimensional Evaluation of Open-Source Language Models: Based on Machine Learning and Bayesian Optimization

Spatial representations are fundamental to human cognition, as understanding spatial relationships between objects is essential in daily life. Language serves as an indispensable tool for communicating spatial information, creating a close connection between spatial representations and spatial language. Large language models (LLMs), theoretically, possess spatial cognition due to their proficiency in natural language processing. This study examines the spatial representations of LLMs by employing traditional spatial tasks used in human experiments and comparing the models' performance to that of humans. The results indicate that LLMs resemble humans in selecting spatial prepositions to describe spatial relationships and exhibit a preference for vertically oriented spatial terms. However, the human tendency to better represent locations along specific axes is absent in the performance of LLMs. This finding suggests that, although spatial language is closely linked to spatial representations, the two are not entirely equivalent.
