Austria

Reliance on spurious correlations (shortcuts) has been shown to underlie many of the successes of language models. Previous work focused on identifying the input elements that impact prediction. We investigate how shortcuts are actually processed within the model&#39;s decision-making mechanism.
We use actor names in movie reviews as controllable shortcuts with known impact on the outcome. We use mechanstic interpretability methods and identify specific attention heads that focus on shortcuts. These heads gear the model towards a label before processing the complete input, effectively making premature decisions that bypass contextual analysis. Based on these findings, we introduce Head-based Token Attribution (HTA), which traces intermediate decisions back to input tokens. We show that HTA is effective in detecting shortcuts in LLMs and enables targeted mitigation by selectively deactivating shortcut-related attention heads.

ACL 2025

Short-circuiting Shortcuts: Mechanistic Investigation of Shortcuts in Text Classification

Reliance on spurious correlations (shortcuts) has been shown to underlie many of the successes of language models. Previous work focused on identifying the input elements that impact prediction. We investigate how shortcuts are actually processed within the model's decision-making mechanism.
We use actor names in movie reviews as controllable shortcuts with known impact on the outcome. We use mechanstic interpretability methods and identify specific attention heads that focus on shortcuts. These heads gear the model towards a label before processing the complete input, effectively making premature decisions that bypass contextual analysis. Based on these findings, we introduce Head-based Token Attribution (HTA), which traces intermediate decisions back to input tokens. We show that HTA is effective in detecting shortcuts in LLMs and enables targeted mitigation by selectively deactivating shortcut-related attention heads.

workshop paper

### Welcome to The 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025)

Message from the General Chair: 
*It is my great pleasure and honor to welcome you to the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), held in beautiful Vienna, Austria, from July 27 to August 1, 2025. ACL2025continues our field’s tradition of excellence in scholarship, innovation, and inclusivity, and I am deeply grateful to the many volunteers who have worked tirelessly to bring this event to life.* 
[Read more](https://drive.google.com/file/d/1GI_hvOpjswAuYdUTromfeDiPpCcqidwg/view?usp=sharing)

To access this event page, you need to log in with the **email address you registered with**. Access credentials will be sent to your email from Underline - subject line "Welcome to ACL 2025". Please be sure to check your spam email folder if you do not see an email confirmation right away.

Please log in

To access this event page, you need to log in with the **email address you registered with**. 

Welcome to The 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025)

We analyze the syntactic sensitivity of Text-to-Speech (TTS) systems using methods inspired by psycholinguistic research. Specifically, we focus on the generation of intonational phrase boundaries, which can often be predicted by identifying syntactic boundaries within a sentence. We find that TTS systems struggle to accurately generate intonational phrase boundaries in sentences where syntactic boundaries are ambiguous (e.g., garden path sentences or sentences with attachment ambiguity). In these cases, systems need superficial cues such as commas to place boundaries at the correct positions. In contrast, for sentences with simpler syntactic structures, we find that systems do incorporate syntactic cues beyond surface markers. Finally, we finetune models on sentences without commas at the syntactic boundary positions, encouraging them to focus on more subtle linguistic cues. Our findings indicate that this leads to more distinct intonation patterns that better reflect the underlying structure.

A Linguistically Motivated Analysis of Intonational Phrasing in Text-to-Speech Systems: Revealing Gaps in Syntactic Sensitivity

A common assumption in Computational Linguistics is that text representations learnt by multimodal models are richer and more human-like than those by language-only models, as they are grounded in images or audio—similar to how human language is grounded in real-world experiences. However, empirical studies checking whether this is true are largely lacking. We address this gap by comparing word representations from contrastive multimodal models vs. language only ones in the extent to which they capture experiential information—as defined by an existing norm-based 'experiential model'—and align with human fMRI responses. Our results indicate that, surprisingly, language-only models are superior to multimodal ones in both respects. Additionally, they learn more unique brain-relevant semantic information beyond that shared with the experiential model. Overall, our study highlights the need to develop computational models that better integrate the complementary semantic information provided by multimodal data sources.

Experiential Semantic Information and Brain Alignment: Are Multimodal Models Better than Language Models?

Recent work has argued that large language models (LLMs) are not "abstract reasoners", citing their poor zero-shot performance on a variety of challenging tasks as evidence. We revisit these experiments in order to add nuance to the claim. First, we show that while LLMs indeed perform poorly in a zero-shot setting, even tuning a small subset of parameters for input encoding can enable near-perfect performance. However, we also show that this finetuning does not necessarily transfer across datasets. We take this collection of empirical results as an invitation to (re-)open the discussion of what it means to be an "abstract reasoner", and why it matters whether LLMs fit the bill.

What is an "Abstract Reasoner"? Revisiting Experiments and Arguments about Large Language Models

Large language models (LLMs) have been deployed in a wide spectrum of domains and applications due to their strong language understanding capabilities obtained through pretraining. However, their performance on specific domain is usually suboptimal due to limited exposure to domain-specific tasks. Adapting LLM to the cinematic domain post unique challenges as it consists of complicated stories with limited textual information accessible from the subtitle or script alone. 

In this paper, we decompose the movie understanding capability into a suite of narrative understanding tasks based on narrative theory. We construct a dataset for these tasks based on resources in the movie domain, and use it to examine the effect of different domain adaptation strategies. Both the dataset and the models are made publicly available.
Our experiment results show the effectiveness of our approach in improving the narrative understanding of LLMs and highlight the trade-offs between domain-specific and general instruction capabilities.

Narrative understanding for Cinema domain large language model

Large Language Models (LLMs) have made significant contributions to cognitive science research. One area of application is narrative understanding. Sap et al. (2022) introduced $\textit{sequentiality}$, an LLM-derived measure that assesses the coherence of a story based on word probability distributions. They reported that recalled stories flowed less sequentially than imagined stories. However, the robustness and generalizability of this narrative flow measure remain unverified. To assess generalizability, we apply $\textit{sequentiality}$ derived from three different LLMs to a new dataset of matched autobiographical and biographical paragraphs. Contrary to previous results, we fail to find a significant difference in narrative flow between autobiographies and biographies. Further investigation reveals biases in the original data collection process, where topic selection systematically influences sequentiality scores. Adjusting for these biases substantially reduces the originally reported effect size. A validation exercise using LLM-generated stories with "good'' and "poor'' flow further highlights the flaws in the original formulation of sequentiality. Our findings suggest that LLM-based narrative flow quantification is susceptible to methodological artifacts. Finally, we provide some suggestions for modifying the $\textit{sequentiality}$ formula to accurately capture narrative flow.

From Stories to Statistics: Methodological Biases in LLM-Based Narrative Flow Quantification

Verbal fluency is a well-established experimental paradigm, used to examine various aspects of human knowledge retrieval, linguistic processing, and cognitive performance as well as, more recently, human creative abilities. In this work, we investigate the predictive capacities of recent large language models, known for their ability to store knowledge and retrieve it with high accuracy and efficiency from their latent space. We focus on switching and clustering patterns and seek evidence to substantiate them as two distinct and separable processes in creative semantic search. We prompt different transformer-based language models with verbal fluency items and ask whether metrics derived from the language models' prediction probabilities or internal attention distributions offer reliable predictors of switching/clustering behaviors in verbal fluency. We find that token probabilities, but especially attention-based metrics have strong statistical power when separating between cases of switching and clustering, in line with prior research on human cognition.

Components of Creativity: Language Model-based Predictors for Clustering and Switching in Verbal Fluency

Despite being ubiquitous in natural language, collocations (e.g., kick+habit) incur a unique processing cost, compared to compositional phrases (kick+door) and idioms (kick+bucket). We confirm this processing cost with behavioural data as well as MINERVA2, a memory model, suggesting that collocations constitute a distinct linguistic category. While the model fails to fully capture the observed human processing patterns, we find that below a specific item frequency threshold, the model’s retrieval failures align with human reaction times across conditions. This suggests an alternative processing mechanism that activates when memory retrieval fails, consistent with an analogical account of language processing.

What does memory retrieval leave on the table? Exploring Semi-compositionality in Language Processing with MINERVA2 and sBERT

Students' academic performance is influenced by various demographic factors, with socioeconomic class being a prominently researched and debated factor. Computer Science research traditionally prioritizes computationally definable problems, yet challenges such as the scarcity of high-quality labeled data and ethical concerns surrounding the mining of personal information can pose barriers to exploring topics like the impact of SES on students' education. Overcoming these barriers may involve automating the collection and annotation of high-quality language data from diverse social groups through human collaboration. Therefore, our focus is on gathering unstructured narratives from Internet forums written by students with low socioeconomic status (SES) using machine learning models and human insights. We developed a hybrid data collection model that semi-automatically retrieved narratives from the Reddit website and created a dataset five times larger than the seed dataset. Additionally, we compared the performance of traditional ML models with recent large language models (LLMs) in classifying narratives written by low-SES students, and analyzed the collected data to extract valuable insights into the socioeconomic challenges these students encounter and the solutions they pursue.

Bridging the Socioeconomic Gap in Education: A Hybrid AI and Human Annotation Approach

The syntactic probing literature has been largely limited to shallow structures like dependency trees, which are unable to capture the subtle differences in sub-surface syntactic structures that yield semantic nuances. These structures are captured by theories of syntax like generative syntax, but have not been researched in the LLM literature due to the difficulties in probing these complex structures with many silent, covert nodes. Our work presents a method for overcoming this limitation by deploying Hewitt and Manning's (2019) dependency-trained probe on sentence constructions whose structural representation is identical in a dependency parse, but differs in theoretical syntax. If a pretrained language model has captured the theoretical syntax structure, then the probe's predicted distances should vary in syntactically-predicted ways. Using this methodology and a novel dataset, we find evidence that LLMs have captured syntactic structures far richer than previously realized, indicating LLMs are able to capture the nuanced meanings that result from sub-surface differences in structural form.

Evidence of Generative Syntax in LLMs

In recent computational psycholinguistics, Merkx and Frank (2021) showed that surprisals from Transformers demonstrate a closer fit to measures of human reading effort than those from Recurrent Neural Networks (RNNs), suggesting that Transformers may capture the cue-based retrieval-like operations in human sentence processing. On the other hand, explicit incorporation of syntactic structures has been shown to improve LMs' predictive power for human cognitive load---for example, Hale et al. (2018) demonstrated that Recurrent Neural Network Grammars (RNNGs), which integrate RNNs with explicit syntactic structures, account for aspects of human brain activity that vanilla RNNs cannot. In this paper, we test the psychometric predictive power of Composition Attention Grammars (CAGs), the integration of Transformers with explicit syntactic structures, to investigate whether they can provide better fit to human gaze durations than vanilla Transformers and RNNGs by capturing cue-based retrieval-like operations on syntactic structures, which could potentially be involved in human sentence processing. The results of our controlled experiment demonstrate that surprisals from CAGs outperformed those from Transformers and RNNGs, suggesting that syntactic attention in CAGs may serve as a mechanistic implementation of human retrieval from syntactically-constructed memory representations.

Premium content

Downloads

Next from ACL 2025

A Linguistically Motivated Analysis of Intonational Phrasing in Text-to-Speech Systems: Revealing Gaps in Syntactic Sensitivity

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES