Austria

In this paper, we present an overview of ClinIQLink, a shared task co-located with the 24th BioNLP workshop at ACL 2025 and designed to stress-test large language models (LLMs) on medically oriented question answering aimed at the level of a General Practitioner. The challenge supplies 4,978 expert-verified, medical source-grounded question–answer pairs that cover seven formats: true/false, multiple choice, unordered list, short answer, short-inverse, multi-hop, and multi-hop-inverse. Participating systems, bundled in Docker or Apptainer images, are executed on the CodaBench platform or the University of Maryland’s Zaratan cluster. Task 1 automatically scores closed-ended items by exact match and open-ended items with a three-tier embedding metric. A subsequent physician panel (Task 2) audits the top model responses.
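The exact-match scoring of closed-ended items mentioned above can be illustrated with a minimal sketch. The normalization steps (lowercasing, stripping punctuation, collapsing whitespace) and the function names are assumptions made for illustration; the abstract does not specify how the shared task normalizes answers.

```python
import string


def normalize(answer: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace.

    This normalization is an assumption, not the task's documented procedure.
    """
    table = str.maketrans("", "", string.punctuation)
    return " ".join(answer.lower().translate(table).split())


def exact_match(prediction: str, gold: str) -> bool:
    """Return True when the normalized prediction equals the normalized gold answer."""
    return normalize(prediction) == normalize(gold)


def closed_ended_accuracy(predictions, golds) -> float:
    """Fraction of items whose prediction exactly matches the gold answer."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    hits = sum(exact_match(p, g) for p, g in zip(predictions, golds))
    return hits / len(golds)
```

For example, `closed_ended_accuracy(["B", "false"], ["b", "true"])` counts the first item as a match and the second as a miss.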

ACL 2025

Overview of the ClinIQLink 2025 Shared Task on Medical Question-Answering

workshop paper

### Welcome to The 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025)

Message from the General Chair: 
*It is my great pleasure and honor to welcome you to the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), held in beautiful Vienna, Austria, from July 27 to August 1, 2025. ACL 2025 continues our field’s tradition of excellence in scholarship, innovation, and inclusivity, and I am deeply grateful to the many volunteers who have worked tirelessly to bring this event to life.* 
[Read more](https://drive.google.com/file/d/1GI_hvOpjswAuYdUTromfeDiPpCcqidwg/view?usp=sharing)

We revisit the BARTABSA framework for aspect-based sentiment analysis with modern decoder LLMs to assess the importance of explicit structure modeling today. Our updated implementation, BARTABSA++, features architectural enhancements that boost performance and training stability.
Systematic testing with various encoder-decoder architectures shows that BARTABSA++ with BART-Large achieves state-of-the-art results, even surpassing a finetuned GPT-4o model.
Our analysis indicates the encoder's representational quality is vital, while the decoder's role is minimal, explaining the limited benefits of scaling decoder-only LLMs for this task. These findings underscore the complementary roles of explicit structured modeling and large language models, indicating structured approaches remain competitive for tasks requiring precise relational information extraction.

BARTABSA++: Revisiting BARTABSA with Decoder LLMs

In this paper, we explore the use of a text-only, autoregressive language modeling approach for the extraction of referring expressions from visually grounded dialogue. More specifically, the aim is to investigate the extent to which the linguistic context alone can inform the detection of mentions that have a (visually perceivable) referent in the visual context of the conversation. To this end, we adapt a pretrained large language model (LLM) to perform a relatively coarse-grained annotation of mention spans in unfolding conversations by demarcating mention span boundaries in text via next-token prediction. Our findings indicate that even when using a moderately sized LLM, relatively small datasets, and parameter-efficient fine-tuning, a text-only approach can be effective, highlighting the relative importance of the linguistic context for this task. Nevertheless, we argue that the task represents an inherently multimodal problem and discuss limitations fundamental to unimodal approaches.

Detecting Referring Expressions in Visually Grounded Dialogue with Autoregressive Language Models

This paper evaluates the ability of Large Language Models (LLMs) to leverage contextual information in the form of structured linguistic representations. Specifically, we examine the impact of encoding both short and long contexts using Abstract Meaning Representation (AMR) structures across a diverse set of language tasks. We perform our analysis using 8-bit quantized and instruction-tuned versions of Llama 3.1 (8B), Phi-3, and Mistral 7B. Our results indicate that, for tasks involving short contexts, augmenting the prompt with the AMR of the original language context often degrades the performance of the underlying LLM. However, for tasks that involve long contexts, such as dialogue summarization in the SAMSum dataset, this enhancement improves LLM performance, for example, by increasing the zero-shot cosine similarity score of Llama 3.1 from 66% to 76%. This improvement is more evident in the newer and larger LLMs, but does not extend to the older or smaller ones. In addition, we observe that LLMs can effectively reconstruct the original text from a linearized AMR, achieving a cosine similarity of 81% in the best-case scenario.
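As a reminder of what the reported cosine similarity scores measure, here is a minimal sketch of cosine similarity between two embedding vectors. The abstract does not say which embedding model produces the vectors, so the inputs here are plain lists of numbers and the function name is illustrative.

```python
import math


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two equal-length vectors.

    Returns 0.0 when either vector has zero norm (a convention chosen here).
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

In an evaluation like the one described, `u` and `v` would presumably be embeddings of a generated text and a reference text, with the score reported as a percentage.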

Can LLMs Interpret and Leverage Structured Linguistic Representations? A Case Study with AMRs

Large Language Models (LLMs) have brought significant breakthroughs across all areas of Natural Language Processing (NLP), including Information Extraction (IE). However, knowledge gaps remain regarding their effectiveness in extracting entity-relation triplets, i.e. Joint Relation Extraction (JRE). JRE has been a key operation in creating knowledge bases that can be used to enhance Retrieval Augmented Generation (RAG) systems. Prior work highlights low-quality triplets generated by LLMs. Thus, this work investigates the impact of incorporating linguistic structures, such as constituency and dependency trees and semantic role labeling, to enhance the quality of the extracted triplets. The findings suggest that incorporating specific structural information enhances the uniqueness and topical relevance of the triplets, particularly in scenarios where multiple relationships are present.

From Syntax to Semantics: Evaluating the Impact of Linguistic Structures on LLM-Based Information Extraction

Event-keyed summarization (EKS) requires summarizing a specific event described in a document given the document text and an event representation extracted from it. In this work, we extend EKS to the cross-document setting (CDEKS), in which summaries must synthesize information from accounts of the same event as given by multiple sources. We introduce **SEAMuS** (**S**ummaries of **E**vents **A**cross **Mu**ltiple **S**ources), a high-quality dataset for CDEKS based on an expert reannotation of the FAMuS dataset for cross-document argument extraction. We present a suite of baselines on SEAMuS—covering both smaller, fine-tuned models, as well as zero- and few-shot prompted LLMs—along with detailed ablations and a human evaluation study, showing SEAMuS to be a valuable benchmark for this new task.

Cross-Document Event-Keyed Summarization

Information Extraction (IE) plays a pivotal role in transforming unstructured data into structured formats, such as Knowledge Graphs. One of the main tasks within IE is Relation Extraction (RE), which identifies relations between entities in text data. This process enriches the semantic understanding of documents, enabling more precise information retrieval and query answering. Recent works leveraging pre-trained language models have demonstrated significant performance improvements in RE. In the current era of Large Language Models (LLMs), fine-tuning these LLMs can mitigate the limitations of zero-shot RE methods, particularly in overcoming the domain adaptation challenges inherent in RE. This work explores not only the effectiveness of fine-tuned LLMs but also their integration into a Retrieval-Augmented Generation (RAG)-based RE approach to address domain adaptation challenges when general-purpose LLMs serve as generators within the RAG framework. Empirical evaluations on the TACRED, TACRED-Revisited (TACREV), and Re-TACRED datasets reveal substantial performance improvements with fine-tuned LLMs such as Llama2-7B, Mistral-7B, and Flan-T5 Large, which surpass previous methods on these datasets.

Fine-Tuning Large Language Models for Relation Extraction within a Retrieval-Augmented Generation Framework

This paper compares two approaches for table extraction from images: deep learning computer vision and Multimodal Large Language Models (MLLMs). Computer vision models for table extraction, such as the Table Transformer model (TATR), have enhanced the extraction of complex table structural layouts by leveraging deep learning for precise structural recognition combined with traditional Optical Character Recognition (OCR). Conversely, MLLMs, which process both text and image inputs, present a novel approach by potentially bypassing the limitations of TATR plus OCR methods altogether. Models such as GPT-4o, Phi-3 Vision, and Granite Vision 3.2 demonstrate the potential of MLLMs to analyze and interpret table images directly, offering enhanced accuracy and robust extraction capabilities. We evaluate these methodologies with Grid Table Similarity (GriTS), a state-of-the-art metric that provides nuanced insights into structural and text content effectiveness. Utilizing the PubTables-1M dataset, a comprehensive and widely used benchmark in the field, this study highlights the strengths and limitations of each approach, setting the stage for future innovations in table extraction technologies. Deep learning computer vision techniques still have a slight edge when extracting table structural layout, but in terms of text cell content, MLLMs are far better.

Benchmarking Table Extraction: Multimodal LLMs vs Traditional OCR

While LLMs have grown popular in sequence labeling, linear-chain conditional random fields (CRFs) remain a popular alternative with the ability to directly model interactions between labels. However, the Markov assumption limits them to interactions between adjacent labels. Weighted finite-state transducers (FSTs), in contrast, can model distant label–label interactions, but exact label inference is intractable in general. In this work, we present regular-pattern-sensitive CRFs (RPCRFs), a method of enriching standard linear-chain CRFs with the ability to learn long-distance label interactions through user-specified patterns. This approach allows users to write regular-expression label patterns concisely specifying which types of interactions the model should take into account, allowing the model to learn from data whether and in which contexts these patterns occur. The result can be interpreted alternatively as a CRF augmented with additional, non-local potentials, or as a finite-state transducer whose structure is defined by a set of easily-interpretable patterns. Critically, exact training and inference are tractable for many pattern sets. We detail how an RPCRF can be automatically constructed from a set of user-specified patterns, and demonstrate the model’s effectiveness on a sequence of three synthetic sequence modeling datasets.
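To make the limitation of adjacent-label interactions concrete, the following toy sketch compares two label sequences that contain exactly the same adjacent label pairs, yet differ on a regular-expression pattern over the whole sequence. It is not the RPCRF construction from the paper: the single-character label alphabet, the pattern `AO*B`, and the helper names are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter


def transition_counts(labels):
    """Multiset of adjacent (previous, next) label pairs in a sequence."""
    return Counter(zip(labels, labels[1:]))


def matches_pattern(labels, pattern):
    """True if the regex matches somewhere in the joined label string.

    Assumes every label is a single character, so joining is unambiguous.
    """
    return re.search(pattern, "".join(labels)) is not None


# Hypothetical pattern: an 'A' followed by a 'B', with only 'O' labels between.
PATTERN = r"AO*B"

seq_a = list("OAOOBO")  # A occurs before B
seq_b = list("OBOOAO")  # B occurs before A

same_transitions = transition_counts(seq_a) == transition_counts(seq_b)
pattern_in_a = matches_pattern(seq_a, PATTERN)
pattern_in_b = matches_pattern(seq_b, PATTERN)
```

Because both sequences have identical transition counts, position-independent transition scores between adjacent labels cannot tell them apart, while the pattern matches only `seq_a`.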

Regular-pattern-sensitive CRFs for Distant Label Interactions

Probing techniques for large language models (LLMs) have primarily focused on English, overlooking the vast majority of the world's other languages. In this paper, we extend these probing methods to a multilingual context, investigating how LLMs encode linguistic structures across diverse languages. We conduct experiments on several open-source LLMs, analyzing probing accuracy, trends across layers, and similarities between probing vectors for multiple languages. Our key findings reveal: (1) a consistent performance gap between high-resource and low-resource languages, with high-resource languages achieving significantly higher probing accuracy; (2) divergent layer-wise accuracy trends, where high-resource languages show substantial improvement in deeper layers similar to English; and (3) higher representational similarities among high-resource languages, with low-resource languages demonstrating lower similarities both among themselves and with high-resource languages. These results provide insights into how linguistic structures are represented differently across languages in LLMs and emphasize the need for improved structure modeling for low-resource languages.

Exploring Multilingual Probing in Large Language Models: A Cross-Language Analysis

Text-to-SQL is a promising but challenging task that aims to convert natural language (NL) queries into corresponding structured query language (SQL) statements. The main challenge of this task is how to efficiently convert between unstructured and structured data. In recent years, the emergence of large language models (LLMs) has further promoted the development of this field. However, current LLM-based text-to-SQL methods rely on specific few-shot example construction, resulting in poor performance across domains. To solve this problem, we propose a text-to-SQL method built on a self-contrastive loop-of-thought structure. This method designs the LLM inference process as a loop structure based on the comparison of positive and negative examples. The model optimizes the generated results through continuous verification and error correction, greatly improving accuracy and reducing dependence on few-shot example construction. The experimental results on the SPIDER and BIRD datasets show that this method can generate SQL with higher precision without relying on few-shot example construction.
