China

Understanding and interpreting culturally specific language remains a significant challenge for multilingual natural language processing (NLP) systems, particularly for less-resourced languages. To address this problem, this paper introduces PRONE, a novel dataset of 2,830 Nepali proverbs, and evaluates the performance of various language models (LMs) in two tasks: (i) identifying the correct meaning of a proverb from multiple choices, and (ii) categorizing proverbs into predefined thematic categories. The models, including both open-source and proprietary, were tested in zero-shot and few-shot settings with prompts in English and Nepali. While models like GPT-4o demonstrated promising results and achieved the highest performance among LMs, they still fall short of human-level accuracy in understanding and categorizing culturally nuanced content, highlighting the need for more inclusive NLP.

EMNLP 2025

Probing the Limits of Multilingual Language Understanding: Low-Resource Language Proverbs as LLM Benchmark for AI Wisdom

proverbs

low resource language

workshop paper

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)





<br>

Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Large Language Models (LLMs) have demonstrated remarkable performance across various NLP tasks, yet they continue to face challenges in discourse relation recognition (DRR). Current state-of-the-art methods for DRR primarily rely on smaller pre-trained language models (PLMs). In this study, we conduct a comprehensive analysis of different approaches using both PLMs and LLMs, evaluating their effectiveness for DRR at multiple granularities and under different data availability settings. Our findings indicate that no single approach consistently outperforms the others, and we offer a general comparison framework to guide the selection of the most appropriate model based on specific DRR requirements and data conditions.

Discourse Relation Recognition with Language Models Under Different Data Availability

Consider the example "The bird sang the nursery rhyme beautifully. It made everyone in the room smile". The pronoun 'it' here refers either to the bird or to the event of singing. This example is inherently ambiguous. It cannot be meaningfully disambiguated as an event or entity reference, as both readings result in the same text meaning. This study introduces a new dataset EMBITEXT to preserve ambiguity in the language by navigating through the ambiguity surrounding the pronominal reference to the entity or event. Oftentimes, ambiguity does not necessarily need to be resolved but is modelled carefully. Furthermore, this study explores the capacity of LLMs (Llama, Mistral, Gemini, Claude AI) to embrace ambiguity in generating text that exhibit referential ambiguity via an In-Context learning approach. To evaluate of the dataset, RoBERTa was finetuned on this data to model ambiguity while simultaneously distinguishing between entity or event references. Results demonstrate EmbiText's capacity to advance the ongoing NLP research by modelling linguistic ambiguity in computational environments instead of fully disambiguating it, thereby retaining diverse interpretations where resolution may alter meaning.

EmbiText: Embracing Ambiguity by Annotation, Recognition and Generation of Pronominal Reference with Event-Entity Ambiguity

Understanding the strategies that make expert-led explanations effective is a core challenge in didactics and a key goal for explainable AI. To study this computationally, we introduce ReWIRED, a large corpus of explanatory dialogues annotated by education experts with fine-grained, span-level teaching acts across five levels of explainee knowledge. We use this resource to assess the capabilities of modern language models, finding that while few-shot LLMs struggle to label these acts, fine-tuning is a highly effective methodology. Moving beyond structural annotation, we propose and validate a suite of didactic quality metrics. We demonstrate that a prompt-based evaluation using an LLM as a ``judge'' is required to capture how the functional quality of an explanation aligns with the learner's expertise -- a nuance missed by simpler static metrics. Together, our dataset, modeling insights, and evaluation framework provide a comprehensive methodology to bridge pedagogical principles with computational discourse analysis.

Human and LLM-based Assessment of Teaching Acts in Expert-led Explanatory Dialogues

We introduce UniRST, the first unified RST-style discourse parser capable of handling 18 treebanks in 11 languages without modifying their relation inventories. To overcome inventory incompatibilities, we propose and evaluate two training strategies: Multi-Head, which assigns separate relation classification layer per inventory, and Masked-Union, which enables shared parameter training through selective label masking. We first benchmark mono-treebank parsing with a simple yet effective augmentation technique for low-resource settings. We then train a unified model and show that (1) the parameter efficient Masked-Union approach is also the strongest, and (2) UniRST outperforms 16 of 18 mono‑treebank baselines, demonstrating the advantages of a single-model, multilingual end-to-end discourse parsing across diverse resources.

Bridging Discourse Treebanks with a Unified Rhetorical Structure Parser

We present a dataset of text revisions involving the deletion or replacement of discourse connectives. Manual annotation of a replacement subset reveals that only 19% of edits were judged either necessary or should be left unchanged, with the rest appearing optional. Surprisal metrics from GPT-2 token probabilities and prompt-based predictions from GPT-4.1 correlate with these judgments, particularly in such clear cases.

Information-Theoretic and Prompt-Based Evaluation of Discourse Connective Edits in Instructional Text Revisions

Previous work has shown that simple mask-filling can provide useful information about the discourse informativeness of syntactic structures. Dong et al. (2024) first adopted this approach to investigating preposing constructions. The problem with single token mask fillers was that they were, by and large, ambiguous. We address the issue by adapting the approach of Kalinsky et al. (2023) to support the prediction of multi-token connectives in masked positions. Our first experiment demonstrates that this multi-token mask-filling approach substantially outperforms the previously considered single-token approach in recognizing implicit discourse relations. Our second experiment corroborates previous findings, providing additional empirical support for the role of preposed syntactic constituents in signaling discourse coherence. Overall, our study extends existing mask-filling methods to a new discourse-level task and reinforces the linguistic hypothesis concerning the discourse informativeness of preposed structures.

Multi-token Mask-filling and Implicit Discourse Relations

This paper presents the results obtained by the MELODI team for the three tasks proposed within the DISRPT 2025 shared task on discourse: segmentation, connective identification, and relation classification. 
The competition involves corpora in various languages, in several underlying frameworks, and datasets are given with or without sentence segmentation. 
This year, for the ranked, closed track, the campaign adds as a constraint to train only one model for each task, with an upper bound on the size of the model (no more than 4B parameters).
An additional open track authorizes any size of, possibly non public, models that will not be reproduced by the organizers and thus not ranked.
We compared several fine-tuning approaches either based on encoder-only transformer-based models, or auto-regressive generative ones. 
To be able to train one model on the variety of corpora, we explored various ways of combining data -- by framework, language or language groups, with different sequential orderings --, and the addition of features to guide the model. 
For the closed track, our final submitted system is based on XLM-RoBERTa large for relation identification, and on InfoXLM for segmentation and connective identification. 
Our experiments demonstrate that building a single, multilingual model does not necessarily degrade the performance compared to language-specific systems, with at best 64.06% for relation identification, 90.19% for segmentation and 81.15% for connective identification (on average on the development sets), results that are similar or higher that the ones obtained in previous campaigns.
We also found that a generative approach could give even higher results on relation identification, with at best 64.65% on the dev sets.

DisCuT and DiscReT: MELODI at DISRPT 2025 Multilingual discourse segmentation, connective tagging and relation classification

We present our submission to Task 3 (Discourse Relation Classification) of the DISRPT 2025 shared task. Task 3 introduces a unified set of 17 discourse relation labels across 39 corpora in 16 languages and six discourse frameworks, posing significant multilingual and cross‑formalism challenges. We first benchmark the task by fine‑tuning multilingual BERT‑based models (mBERT, XLM‑RoBERTa‑Base, and XLM‑RoBERTa‑Large) with two argument‑ordering strategies and progressive unfreezing ratios to establish strong baselines. We then evaluate prompt‑based large language models (namely Claude Opus 4.0) in zero‑shot and few‑shot settings to understand how LLMs respond to the newly proposed unified labels. Finally, we introduce HiDAC, a Hierarchical Dual‑Adapter Contrastive learning model. Results show that while larger transformer models achieve higher accuracy, the improvements are modest, and that unfreezing the top 75% of encoder layers yields performance comparable to full fine‑tuning while training far fewer parameters. Prompt‑based models lag significantly behind fine‑tuned transformers, and HiDAC achieves the highest overall accuracy (67.5%) while remaining more parameter‑efficient than full fine‑tuning.

CLaC at DISRPT 2025: Hierarchical Adapters for Cross-Framework & Multi-lingual Discourse Relation Classification

This paper presents DeDisCo, Georgetown University's entry in the DISRPT 2025 shared task on discourse relation classification. We test two approaches, using an mt5-based encoder and a decoder based approach using the openly available Qwen model. We also experiment on training with augmented dataset for low-resource languages using matched data translated automatically from English, as well as using some additional linguistic features inspired by entries in previous editions of the Shared Task. Our system achieves a macro-accuracy score of 71.28, and we provide some interpretation and error analysis for our results.

DeDisCo at the DISRPT 2025 Shared Task: A System for Discourse Relation Classification

This paper describes the submission of the HITS team to the DISRPT 2025 shared task. The shared task includes three sub-tasks: (1) discourse unit segmentation across formalisms, (2) cross-lingual discourse connective identification, and (3) cross-formalism discourse relation classification. This paper presents our strategies for the DISRPT 2025 Shared Task. In Task 1, our approach involves fine-tuning through multilingual joint training on linguistically motivated language groups. We incorporated two key techniques to improve model performance: a weighted loss function to address the task's significant class imbalance and Fast Gradient Method (FGM) adversarial training to boost the model's robustness.

In task 2, our approach involves building an ensemble of three encoder models whose embeddings are smartly fused together with a multi-head attention layer. We also add Part-Of-Speech tags and dependency relations present in the training file as linguistic features. A CRF layer is added after the classification layer to account for dependencies between adjacent labels. To account for label imbalance, we use focal loss and label smoothing. This ensures our model is robust and flexible enough to handle different languages.

In task 3, we use two-stage fine-tuning framework designed to transfer the nuanced reasoning capabilities of a very large "teacher" model to a compact "student" model so that the smaller model can learn complex discourse relationships. The fine-tuning process follows a curriculum learning framework. In such a framework the model learns to perform increasingly harder tasks. In our case, the model first learns to look at the discourse units and then predict the label followed by looking at Chain-Of-Thought reasoning for harder examples. This way it can learn to internalise such reasoning and increase prediction accuracy on the harder samples.

Downloads

Next from EMNLP 2025

Discourse Relation Recognition with Language Models Under Different Data Availability

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES

.css-70qvj9{display:-webkit-box;display:-webkit-flex;display:-ms-flexbox;display:flex;-webkit-align-items:center;-webkit-box-align:center;-ms-flex-align:center;align-items:center;}Downloads

Next from EMNLP 2025

Discourse Relation Recognition with Language Models Under Different Data Availability

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES

Downloads