
We present our team's submissions to the Unconstrained and LLM tracks of the Computational Models of Reference, Anaphora and Coreference (CRAC2025) shared task, where we finished in fifth and first place, respectively. The two systems obtained similar scores (average CoNLL-F1 of 61.57 and 62.96 on the test set) but at very different computational cost: the classical pair-wise resolution system submitted to the Unconstrained track reached comparable performance with less than 10% of the computational cost. Reflecting on this, we point out problems we ran into when using generative AI to perform coreference resolution, and we explain how the text generation framework stands in the way of a reliable text-global coreference representation. Nonetheless, we recognize that our LLM system could be improved in many ways; we discuss these improvements at the end of this article.

EMNLP 2025

GLaRef@CRAC2025: Should we transform coreference resolution into a text generation task?

workshop paper

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)






Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

This paper presents our submission to the CRAC 2025 Shared Task on Multilingual Coreference Resolution in the LLM track. We propose a prompt-based few-shot coreference resolution system where the final inference is performed by Grok-3 using in-context learning. The core of our methodology is a difficulty-aware sample selection pipeline that leverages Gemini Flash 2.0 to compute semantic difficulty metrics, including mention dissimilarity and pronoun ambiguity. By identifying and selecting the most challenging training samples for each language, we construct highly informative prompts to guide Grok-3 in predicting coreference chains and reconstructing zero anaphora. Our approach secured 3rd place in the CRAC 2025 shared task.

Few-Shot Coreference Resolution with Semantic Difficulty Metrics and In-Context Learning
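The difficulty-aware selection step described in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes the two difficulty metrics have already been computed for each training sample (the abstract says they are obtained with Gemini Flash 2.0) and are stored as numbers where higher means harder. The `Sample` class, the equal weighting of the two metrics, and the function names are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """Hypothetical training sample with precomputed difficulty metrics."""
    doc_id: str
    language: str
    mention_dissimilarity: float  # assumption: higher means harder
    pronoun_ambiguity: float      # assumption: higher means harder


def difficulty(sample: Sample, weight: float = 0.5) -> float:
    """Combine the two metrics into one score (the weighting is an assumption)."""
    return (weight * sample.mention_dissimilarity
            + (1.0 - weight) * sample.pronoun_ambiguity)


def select_hardest(samples, k=2, weight=0.5):
    """Return the k highest-scoring samples for each language."""
    by_language = defaultdict(list)
    for sample in samples:
        by_language[sample.language].append(sample)
    return {
        language: sorted(group,
                         key=lambda s: difficulty(s, weight),
                         reverse=True)[:k]
        for language, group in by_language.items()
    }


if __name__ == "__main__":
    pool = [
        Sample("a", "en", 0.2, 0.2),
        Sample("b", "en", 0.9, 0.7),
        Sample("c", "en", 0.5, 0.5),
        Sample("d", "cs", 0.1, 0.3),
    ]
    for language, chosen in select_hardest(pool, k=2).items():
        print(language, [s.doc_id for s in chosen])
```

The selected samples would then serve as the few-shot examples in the prompt for the corresponding language; how the prompt itself is assembled is not specified in the abstract and is left out here.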

This paper describes our approach to the CRAC 2025 Shared Task on Multilingual Coreference Resolution. We compete in the LLM track, where the systems are limited to generative text-to-text approaches. Our system is based on Llama 3.1-8B, fine-tuned to tag the document with coreference annotations. We have made one significant modification to the text format provided by the organizers: The model relies on the syntactic head for mention span representation. Additionally, we use joint pre-training, and we train the model to generate empty nodes. We provide an in-depth analysis of the performance of our models, which reveals several implementation problems. Although our system ended up in last place, we achieved the best performance on 10 datasets out of 22 within the track. By fixing the discovered problems in the post-evaluation phase, we improved our results substantially, outperforming all the systems in the LLM track and even some unconstrained track systems.

Fine-Tuned Llama for Multilingual Text-to-Text Coreference Resolution

In this work, we present our system, which ranked second in the CRAC 2025 Shared Task on Multilingual Coreference Resolution (LLM Track). For multilingual coreference resolution, our system mainly uses long-context large language models (LLMs) in a few-shot in-context learning setting. Among the various approaches we explored, few-shot prompting proved to be the most effective, particularly due to the complexity of the task and the availability of high-quality data with referential relationships provided as part of the competition. We employed Gemini 2.5 Pro, one of the best available closed-source long-context LLMs at the time of submission. Our system achieved a CoNLL F1 score of 61.74 on the mini-testset, demonstrating that performance improves significantly with the number of few-shot examples provided, thanks to the model's extended context window. While this approach comes with trade-offs in terms of inference cost and response latency, it highlights the potential of long-context LLMs for tackling multilingual coreference without task-specific fine-tuning. Although direct comparisons with traditional supervised systems are not straightforward, our findings provide valuable insights and open avenues for future work, particularly in expanding support for low-resource languages.

Few-Shot Multilingual Coreference Resolution Using Long-Context Large Language Models

We present CorPipe 25, the winning entry to the CRAC 2025 Shared Task on Multilingual Coreference Resolution. This fourth iteration of the shared task introduces a new LLM track alongside the original unconstrained track, features reduced development and test sets to lower computational requirements, and includes additional datasets. CorPipe 25 represents a complete reimplementation of our previous systems, migrating from TensorFlow to PyTorch. Our system outperforms all other submissions in both the LLM and unconstrained tracks by a substantial margin of 8 percentage points. The source code and trained models are publicly available at https://github.com/ufal/crac2025-corpipe.

CorPipe at CRAC 2025: Evaluating Multilingual Encoders for Multilingual Coreference Resolution

Gundel et al.'s Givenness Hierarchy remains one of the most influential frameworks of Information Status to this date. It has been employed in different technical contexts to account for context-sensitive and hearer-tailored language in human-machine interaction and natural language processing, and it has also been a topic of linguistic inquiry. At the same time, the data basis upon which this theory has been developed remains relatively thin. Although its applicability to a broad array of languages has been repeatedly confirmed, the empirical evidence presented for certain phenomena, in particular for demonstrative determiners and demonstrative pronouns, did not always reach conventional levels of statistical significance. In this paper, we provide an empirical, corpus-based re-assessment of two seminal papers for the Givenness Hierarchy, Gundel et al. (1990) and Gundel et al. (1993), where we aim to replicate their findings on the basis of corpora with coreference annotation for their original sample of languages, i.e., Arabic, Chinese, English, Japanese, Korean, Russian and Spanish. We describe the operationalization of Gundel et al.'s 'cognitive statuses', their approximation by means of anaphoric relations, and the preprocessing of diverse and heterogeneous corpora, and we evaluate Gundel et al.'s claims. Our contribution is three-fold: we evaluate the Givenness Hierarchy against quantitative data at a scale that allows statistical significance to be assessed; we discuss challenges and problems encountered in the process, in the preprocessing and in the interpretation of the diverse corpora; and we provide two generalizations: a procedure for bootstrapping Givenness Hierarchies for other languages, and possible cross-linguistically applicable tendencies in the systems of referring expressions.

Revisiting the Givenness Hierarchy. A Corpus-Based Evaluation

We tackle the task of mention detection for pair-programming dialogue, a setting which adds several challenges to the task due to the characteristics of natural dialogue, the dynamic environment of the dialogue task, and the domain-specific vocabulary and structures. We compare recent variants of the Llama and GPT families and explore different prompt and context engineering approaches. While aspects like hesitations and references to read-out code and variable names made the task challenging, GPT 4.1 approximated human performance when we provided few-shot examples similar to the inference text and corrected formatting errors.

Mention detection with LLMs in pair-programming dialogue

While coreference resolution is attracting more interest than ever from computational literature researchers, representative datasets of fully annotated long documents remain surprisingly scarce. In this paper, we introduce a new annotated corpus of three full-length French novels, totaling over 285,000 tokens. Unlike previous datasets focused on shorter texts, our corpus addresses the challenges posed by long, complex literary works, enabling evaluation of coreference models in the context of long reference chains. We present a modular coreference resolution pipeline that allows for fine-grained error analysis. We show that our approach is competitive and scales effectively to long documents. Finally, we demonstrate its usefulness to infer the gender of fictional characters, showcasing its relevance for both literary analysis and downstream NLP tasks.

The Elephant in the Coreference Room: Resolving Coreference in Full-Length French Fiction Works

Training models that can perform well on various NLP tasks requires large amounts of data, which becomes even more apparent with more nuanced tasks such as anaphora and coreference resolution. This paper presents the automatic creation of an Arabic CorefUD dataset through the automatic conversion of the existing gold-annotated OntoNotes.

Towards Adding Arabic to CorefUD

In recent years, research on sign languages has attracted increasing attention in the NLP community and requires more effort from a linguistic perspective. In this paper, we explore coreference resolution in German Sign Language (GSL) primarily through gloss-based analysis. Specifically, in GSL glosses, we conduct a linguistic analysis of coreference, add coreference annotations based on one video, and evaluate the ability of two large language models to resolve coreference. We gain valuable insights into coreference resolution in GSL, which pave the way for future research.

Exploring Coreference Resolution in Glosses of German Sign Language

This study introduces a new ASR-transcribed coreference corpus for French and explores the transferability of coreference resolution models from human-transcribed to ASR-transcribed data. Given the challenges posed by differences in text characteristics and errors introduced by ASR systems, we evaluate model performance using newly constructed parallel human-ASR silver training and gold validation datasets. Our findings show a decline in performance on ASR data for models trained on manual transcriptions. However, combining silver ASR data with gold manual data enhances model robustness. Through detailed error analysis, we observe that models emphasizing recall are more resilient to ASR-induced errors compared to those focusing on precision. The resulting ASR corpus, along with all related materials, is freely available under the CC BY-NC-SA 4.0 license at: https://github.com/ina-foss/french-asr-coreference.
