This paper describes our submissions to the TSAR 2025 Shared Task on Readability-Controlled Text Simplification. We present a comparative study of three architectures: a minimal rule-based baseline, an expert-enhanced system, and a multi-stage generative pipeline using a T5 model in a zero-shot setting. Because per-instance official scores were not available at the time of analysis, we perform a principled sensitivity analysis via a simulated paired bootstrap to assess the robustness of our comparative claims. Under a wide range of reasonable assumptions, the simpler, more constrained systems achieve substantially better automatic scores for semantic fidelity and for the composite AUTORANK metric. We include a diagnostic failure analysis grounded in actual system outputs, discuss limitations of embedding-based guardrails, and provide concise reproducibility notes in the Appendix. Full code, experimental configurations, and outputs will be released upon acceptance to ensure complete reproducibility.
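The paired bootstrap mentioned above can be sketched as follows. This is a generic illustration, not the authors' implementation: the per-instance score arrays, the resample count, and the win-rate summary are all hypothetical assumptions, since the paper simulates scores precisely because per-instance official scores were unavailable.

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=2000, seed=0):
    """Paired bootstrap over instances: resample indices with replacement
    and count how often system A's mean score exceeds system B's.
    Returns the fraction of resamples won by A (a robustness estimate)."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Same resampled indices for both systems (the "paired" part).
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a > mean_b:
            wins += 1
    return wins / n_resamples

# Hypothetical per-instance semantic-fidelity scores for two systems.
a = [0.82, 0.79, 0.91, 0.85, 0.88, 0.76, 0.90, 0.84]
b = [0.78, 0.80, 0.85, 0.81, 0.83, 0.74, 0.86, 0.80]
print(paired_bootstrap(a, b))  # win rate of A over B across resamples
```

In a sensitivity analysis, this procedure would be repeated under different simulated score distributions to check whether the ranking between systems holds across assumptions.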
