We introduce an effective and scalable data selection technique to accelerate the pretraining of large language models (LLMs). Given the variation in quality and informativeness of web-scale corpora, we present the Learn-Focus-Review (LFR) paradigm, a dynamic training approach that adapts to the model's learning progress. Inspired by human learning techniques like spaced repetition, LFR tracks the model's learning performance across data instances and prioritizes revisiting challenging and diverse regions of the dataset that are more prone to being forgotten, enabling better retention and more efficient learning. Through experiments spanning over 2200 GPU hours, we show that LFR significantly enhances data efficiency in pretraining while improving downstream performance across commonsense reasoning, question answering, problem-solving, language modeling, and translation tasks. LFR consistently achieves lower perplexity and higher accuracy than models trained on the full dataset while using just 5%–19% of their training tokens. Notably, LFR matches the performance of industry-standard Pythia models with up to 2× the parameter count while requiring only 3.2% of the training tokens. Unlike prior work on data selection, LFR models are Chinchilla-optimal, demonstrating the effectiveness of our training methodology.

ACL 2025

Accelerating Large Language Model Pretraining via LFR Pedagogy: Learn, Focus, and Review

workshop paper
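
The LFR abstract describes tracking per-instance learning performance and prioritizing the regions the model finds hardest. The snippet below is only a minimal sketch of that selection idea, not the authors' implementation: the function name `select_focus_blocks`, the use of a recorded loss per data block, and the `keep_fraction` parameter are all assumptions made for illustration.

```python
import math


def select_focus_blocks(block_losses, keep_fraction):
    """Return indices of the hardest data blocks, hardest first.

    block_losses: one recorded loss (or perplexity) per data block.
    keep_fraction: hypothetical knob for how much data to revisit (0 < f <= 1).
    """
    if not block_losses:
        raise ValueError("block_losses must not be empty")
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")

    # Keep at least one block, rounding the budget up.
    n_keep = max(1, math.ceil(len(block_losses) * keep_fraction))

    # Rank block indices by recorded loss, highest (hardest) first.
    ranked = sorted(
        range(len(block_losses)),
        key=lambda i: block_losses[i],
        reverse=True,
    )
    return ranked[:n_keep]


# Toy losses for four blocks; these numbers are made up for the example.
focus = select_focus_blocks([2.1, 0.4, 3.0, 1.2], keep_fraction=0.5)
```

In a real training loop the recorded losses would be refreshed as the model changes, so that the set of blocks chosen for revisiting adapts to what is currently being forgotten.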

### Welcome to The 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025)

Message from the General Chair: 
*It is my great pleasure and honor to welcome you to the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), held in beautiful Vienna, Austria, from July 27 to August 1, 2025. ACL 2025 continues our field’s tradition of excellence in scholarship, innovation, and inclusivity, and I am deeply grateful to the many volunteers who have worked tirelessly to bring this event to life.* 
[Read more](https://drive.google.com/file/d/1GI_hvOpjswAuYdUTromfeDiPpCcqidwg/view?usp=sharing)

Despite being ubiquitous in natural language, collocations (e.g., kick+habit) incur a unique processing cost, compared to compositional phrases (kick+door) and idioms (kick+bucket). We confirm this processing cost with behavioural data as well as MINERVA2, a memory model, suggesting that collocations constitute a distinct linguistic category. While the model fails to fully capture the observed human processing patterns, we find that below a specific item frequency threshold, the model’s retrieval failures align with human reaction times across conditions. This suggests an alternative processing mechanism that activates when memory retrieval fails, consistent with an analogical account of language processing.

What does memory retrieval leave on the table? Exploring Semi-compositionality in Language Processing with MINERVA2 and sBERT
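
For readers unfamiliar with MINERVA2, the sketch below shows the standard retrieval step of that memory model as it is usually formulated (similarity, cubed activation, echo intensity and echo content). It is a toy illustration in plain Python; the feature vectors are invented, and how the paper encodes phrases or sets its frequency threshold is not stated in the abstract.

```python
def similarity(probe, trace):
    """Normalized dot product over features that are nonzero in probe or trace.

    Features take values -1, 0, or +1, where 0 means the feature is absent.
    """
    relevant = sum(1 for p, t in zip(probe, trace) if p != 0 or t != 0)
    if relevant == 0:
        return 0.0
    return sum(p * t for p, t in zip(probe, trace)) / relevant


def echo(probe, memory):
    """Return (echo intensity, echo content) for a probe against all traces."""
    # Cubing keeps the sign and lets close matches dominate weak ones.
    activations = [similarity(probe, trace) ** 3 for trace in memory]
    intensity = sum(activations)
    content = [
        sum(a * trace[j] for a, trace in zip(activations, memory))
        for j in range(len(probe))
    ]
    return intensity, content


# Two stored traces and a probe identical to the first one (toy data).
memory = [[1, -1, 1, 1], [1, 1, -1, 0]]
intensity, content = echo([1, -1, 1, 1], memory)
```

A low echo intensity corresponds to a weak match in memory, which is the kind of retrieval failure the abstract relates to human reaction times.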

Students' academic performance is influenced by various demographic factors, with socioeconomic class being a prominently researched and debated factor. Computer Science research traditionally prioritizes computationally definable problems, yet challenges such as the scarcity of high-quality labeled data and ethical concerns surrounding the mining of personal information can pose barriers to exploring topics like the impact of socioeconomic status (SES) on students' education. Overcoming these barriers may involve automating the collection and annotation of high-quality language data from diverse social groups through human collaboration. Therefore, our focus is on gathering unstructured narratives from Internet forums written by students with low SES, using machine learning models and human insights. We developed a hybrid data collection model that semi-automatically retrieved narratives from the Reddit website and created a dataset five times larger than the seed dataset. Additionally, we compared the performance of traditional ML models with recent large language models (LLMs) in classifying narratives written by low-SES students, and analyzed the collected data to extract valuable insights into the socioeconomic challenges these students encounter and the solutions they pursue.

Bridging the Socioeconomic Gap in Education: A Hybrid AI and Human Annotation Approach

The syntactic probing literature has been largely limited to shallow structures like dependency trees, which are unable to capture the subtle differences in sub-surface syntactic structures that yield semantic nuances. These structures are captured by theories of syntax like generative syntax, but have not been researched in the LLM literature due to the difficulties in probing these complex structures with many silent, covert nodes. Our work presents a method for overcoming this limitation by deploying Hewitt and Manning's (2019) dependency-trained probe on sentence constructions whose structural representation is identical in a dependency parse, but differs in theoretical syntax. If a pretrained language model has captured the theoretical syntax structure, then the probe's predicted distances should vary in syntactically-predicted ways. Using this methodology and a novel dataset, we find evidence that LLMs have captured syntactic structures far richer than previously realized, indicating LLMs are able to capture the nuanced meanings that result from sub-surface differences in structural form.

Evidence of Generative Syntax in LLMs
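
The probe referenced in this abstract predicts syntactic distance between two words as the squared norm of a linear map applied to the difference of their hidden states. The sketch below illustrates that distance computation only; the matrix `B` and the vectors are toy values, and a real probe would learn `B` from parsed data.

```python
def probe_squared_distance(B, h_i, h_j):
    """Squared distance ||B (h_i - h_j)||^2 under a linear probe B.

    B is a list of rows; h_i and h_j are hidden-state vectors of equal length.
    """
    diff = [a - b for a, b in zip(h_i, h_j)]
    projected = [sum(w * d for w, d in zip(row, diff)) for row in B]
    return sum(x * x for x in projected)


# Toy 2x2 probe and two toy hidden states (illustrative values only).
B = [[1.0, 0.0], [0.0, 2.0]]
distance = probe_squared_distance(B, [1.0, 1.0], [0.0, 0.0])
```

Comparing such predicted distances across sentence pairs that share a dependency parse but differ in theoretical structure is the kind of contrast the abstract describes.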

In recent computational psycholinguistics, Merkx and Frank (2021) showed that surprisals from Transformers demonstrate a closer fit to measures of human reading effort than those from Recurrent Neural Networks (RNNs), suggesting that Transformers may capture the cue-based retrieval-like operations in human sentence processing. On the other hand, explicit incorporation of syntactic structures has been shown to improve LMs' predictive power for human cognitive load. For example, Hale et al. (2018) demonstrated that Recurrent Neural Network Grammars (RNNGs), which integrate RNNs with explicit syntactic structures, account for aspects of human brain activity that vanilla RNNs cannot. In this paper, we test the psychometric predictive power of Composition Attention Grammars (CAGs), which integrate Transformers with explicit syntactic structures, to investigate whether they can provide a better fit to human gaze durations than vanilla Transformers and RNNGs by capturing cue-based retrieval-like operations on syntactic structures, which could potentially be involved in human sentence processing. The results of our controlled experiment demonstrate that surprisals from CAGs outperformed those from Transformers and RNNGs, suggesting that syntactic attention in CAGs may serve as a mechanistic implementation of human retrieval from syntactically-constructed memory representations.

Investigating Psychometric Predictive Power of Syntactic Attention
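
Surprisal, the quantity compared against gaze durations here, is the negative log probability a model assigns to a word given its context. The sketch below computes it in bits from probabilities that are assumed to be already available; the probability values are invented, and the choice of base 2 is an assumption, since the abstract does not state the logarithm base used.

```python
import math


def surprisal_bits(probability):
    """Surprisal in bits: -log2 of the model's probability for a word."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


def sentence_surprisals(word_probabilities):
    """Per-word surprisals for a sequence of conditional probabilities."""
    return [surprisal_bits(p) for p in word_probabilities]


# Hypothetical conditional probabilities for a three-word sentence.
surprisals = sentence_surprisals([0.5, 0.25, 0.125])
```

Less predictable words receive higher surprisal, and psychometric predictive power measures how well these values account for human reading measures.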

While natural language is processed incrementally, it is unclear whether the syntactic structure prediction process is universal across languages or language-specific. This study investigates this question by revisiting parsing strategies of syntactic language models that incrementally predict both the next token and the associated syntactic structure. Unlike previous studies that have focused on a few strategies, we examine a wide range of strategies by introducing different parameterizations of speculation, which quantifies the degree to which a model predicts syntactic structure before encountering the corresponding tokens. The experiments with 10 typologically diverse languages reveal that the optimal strategy differs depending on the language and the beam size.

Is Incremental Structure Prediction Process Universal across Languages?: Revisiting Parsing Strategy through Speculation

In this paper, we evaluate the capacity of current language technologies to understand Basque and Spanish language varieties. We use NLI as a pivot task and introduce a novel, manually-curated parallel dataset in Basque and Spanish and their corresponding variants. Empirical analysis of comprehensive crosslingual and in-context learning experiments with, respectively, encoder-only and decoder-based Large Language Models (LLMs) reveals a performance drop when processing linguistic variations, with more pronounced effects observed in Basque. Error analysis indicates that lexical overlap plays no role, suggesting that linguistic variation represents the primary reason for the lower results. All data and code are publicly available under the Attribution-NonCommercial 4.0 International license.

Lost in Variation? Evaluating NLI Performance in Basque and Spanish Geographical Variants

In sentences such as John began the book, the complement noun, lexically denoting an entity, is interpreted as an event. This phenomenon is known in linguistics as complement coercion: the event associated with the verb is not overtly expressed but can be recovered from the meanings of other constituents, context and world knowledge. We investigate whether language models (LMs) can exploit sentence structure and compositional meaning to recover plausible events in complement coercion. For the first time, we tested different LMs in Norwegian, a low-resource language with high syntactic variation in coercion constructions across aspectual verbs. Results reveal that LMs struggle with retrieving plausible events and with ranking them above less plausible ones. Moreover, we found that LMs do not exploit the compositional properties of coercion sentences in their predictions.

Compositionality and Event Retrieval in Complement Coercion: A Study of Language Models in a Low-resource Setting

Knowing which words language learners struggle with is crucial for developing personalised education technologies. In this paper, we advocate for the novel task of "dictionary look-up prediction" as a means for evaluating the complexity of words in reading tasks. We release the Dictionary Look-Up development dataset (DLU-dev) and the Dialogue Dictionary Look-Up dataset (D-DLU), which is based on chatbot dialogues. We demonstrate that dictionary look-up is a challenging task for LLMs (results are presented for LLaMA, Gemma and Longformer models). We explore finetuning with the ROC* loss function as a more appropriate loss for this task than the commonly used Binary Cross Entropy (BCE). We investigate the transfer between DLU and the related tasks of Complex Word Identification (CWI) and Semantic Error Prediction (SEP), establishing new state-of-the-art results for SEP.

DLU: Dictionary Look-Up Data and Prediction

Recent work has investigated whether extant neural language models (LMs) have an inbuilt inductive bias towards the acquisition of attested typologically-frequent grammatical patterns as opposed to infrequent, unattested, or impossible patterns using artificial languages (White and Cotterell, 2021; Kuribayashi et al., 2024). The use of artificial languages facilitates isolation of specific grammatical properties from other factors such as lexical or real-world knowledge, but also risks oversimplification of the problem.

In this paper, we examine the use of Generalized Categorial Grammars (GCGs) (Wood, 2014) as a general framework to create artificial languages with a wider range of attested word order patterns, including those where the subject intervenes between verb and object (VSO, OSV) and unbounded dependencies in object relative clauses. In our experiments, we exemplify our approach by extending White and Cotterell (2021) and report some significant differences from existing results.

GCG-Based Artificial Languages for Evaluating Inductive Biases of Neural Language Models

This study investigates the generalization abilities of discriminative transformers in Natural Language Inference (NLI) tasks, focusing on their tendency to rely on superficial features and dataset biases rather than genuine linguistic understanding. We argue that performance gaps between training and analysis datasets do not necessarily indicate a lack of knowledge but rather a misalignment between the decision boundaries of the classifier head and the representations learned by the encoder. By analyzing the representation space of NLI models on these datasets, we show that, despite poor accuracy based on final predictions, samples from opposing classes often remain linearly separable in the encoder's representation space. This suggests that the encoders possess sufficient knowledge to perform the NLI task effectively, despite the classifier head's challenges.
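
One simple way to test whether two classes are linearly separable in a representation space is to run a perceptron and see whether it reaches an epoch with no mistakes. The sketch below does this on toy 2-D points; it is an illustration of the concept, not the analysis used in the study, and the function name and `max_epochs` budget are assumptions.

```python
def perceptron_separates(points, labels, max_epochs=100):
    """Return True if a perceptron finds a separating hyperplane in time.

    points: list of feature vectors; labels: +1 or -1 for each point.
    A False result only means no separator was found within max_epochs.
    """
    weights = [0.0] * len(points[0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            score = sum(w * v for w, v in zip(weights, x)) + bias
            if y * score <= 0:
                # Misclassified (or on the boundary): nudge toward the label.
                weights = [w + y * v for w, v in zip(weights, x)]
                bias += y
                mistakes += 1
        if mistakes == 0:
            return True
    return False


# Toy data: one separable layout and one XOR-style layout.
separable = perceptron_separates(
    [[0, 0], [0, 1], [2, 2], [3, 2]], [-1, -1, 1, 1]
)
xor_like = perceptron_separates(
    [[0, 0], [1, 1], [0, 1], [1, 0]], [1, 1, -1, -1]
)
```

Applied to encoder representations, a check of this kind asks whether a linear decision boundary exists at all, independently of where the trained classifier head happened to place its own boundary.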
