China

This paper presents an approach to automated text simplification for CEFR A2 and B1 levels using large language models and prompt engineering.
We evaluate seven models across three prompting strategies: short, descriptive, and descriptive with examples. A two-round evaluation system using LLM-as-a-Judge and traditional metrics for text simplification determines optimal model-prompt combinations for final submissions. Results demonstrate that descriptive prompts consistently outperform other strategies across all models, achieving 46-65% of first-place rankings. Qwen3 shows superior performance for A2-level simplification, while B1-level results are more balanced across models. The LLM-as-a-Judge evaluation method shows strong alignment with traditional metrics while providing enhanced explainability.

EMNLP 2025

EasyJon at TSAR 2025 Shared Task: Evaluation of Automated Text Simplification with LLM-as-a-Judge

llm-as-a-judge

automatic text simplification

plain language

automated evaluation

large language models

workshop paper

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)






Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

We present an agent-based system for the TSAR 2025 Shared Task on Readability-Controlled Text Simplification, which requires simplifying English paragraphs from B2+ levels to target A2 or B1 levels while preserving meaning. Our approach employs specialized agents for keyword extraction, text generation, and evaluation, coordinated through an iterative refinement loop. The system integrates a CEFR vocabulary classifier, pretrained evaluation models, and few-shot learning from trial data. Through iterative feedback between the evaluator and writer agents, our system automatically refines outputs until they meet both readability and semantic preservation constraints. This architecture achieved Xth position among participating teams, showing the effectiveness of combining specialized LLMs with automated quality control strategies for text simplification.


Uniandes at TSAR 2025 Shared Task: Multi-Agent CEFR Text Simplification with Automated Quality Assessment and Iterative Refinement

This paper describes our submissions to the TSAR 2025 Shared Task on Readability-Controlled Text Simplification. We present a comparative study of three architectures: a minimal rule-based baseline, an expert-enhanced system, and a multi-stage generative pipeline using a T5 model in a zero-shot setting. Because per-instance official scores were not available at the time of analysis, we perform a principled sensitivity analysis via simulated paired bootstrap to assess the robustness of our comparative claims. Under a wide range of reasonable assumptions, the simpler, more constrained systems show substantially better automatic scores for semantic fidelity and the composite AUTORANK metric. We include diagnostic failure analysis grounded in actual system outputs, discuss limitations of embedding-based guardrails, and provide concise reproducibility notes in the Appendix. Full code, experimental configurations, and outputs will be released upon acceptance to ensure complete reproducibility.


HOPE at TSAR 2025 Shared Task: Balancing Control and Complexity in Readability-Controlled Text Simplification

This paper describes the system submitted by our team, OUNLP, to the TSAR-2025 shared task on readability-controlled text simplification. Analyzing a naive prompting-based method for text simplification, we found that simplification performance is strongly related to the gap between the source CEFR (Arase et al., 2022) level and the target CEFR level. Inspired by this finding, we propose two multi-round simplification methods, both generated with AI: rule-based simplification (MRS-Rule) and joint rule-based LLM simplification (MRS-Joint). Our system ranked 7th out of 20 teams. Later improvements demonstrate that AI-generated code, verified with robust evaluation metrics, is a promising way to produce reliable, readability-controlled text simplifications (https://github.com/Rickie2k6/Sentence_Simplification).

OUNLP at TSAR 2025 Shared Task: AI-Generated Multi-Round Sentence Simplifier

Text simplification is an active research topic with applications in multiple domains. In a simplification pipeline, assessment of text difficulty plays a crucial role as a quality control mechanism: it acts as a "critic," and guides models to generate text at the difficulty level that is required by the user. In this paper, we present our Difficulty-aware Text Simplification System. We evaluate our pipeline using the TSAR shared task dataset and discuss challenges in constructing corpora for training models for assessment of text difficulty.

Know-AI at TSAR 2025 Shared Task: Difficulty-aware Text Simplification System

Vision-Language Models (VLMs) often appear culturally competent but rely on superficial pattern matching rather than genuine cultural understanding. We introduce a diagnostic framework to probe VLM reasoning on fire-themed cultural imagery through both classification and explanation analysis. Testing multiple models on Western festivals, non-Western traditions, and emergency scenes reveals systematic biases: models correctly identify prominent Western festivals but struggle with underrepresented cultural events, frequently offering vague labels or dangerously misclassifying emergencies as celebrations. These failures expose the risks of symbolic shortcuts and highlight the need for cultural evaluation beyond accuracy metrics to ensure interpretable and fair multimodal systems.

Seeing Symbols, Missing Cultures: Probing Vision-Language Models' Reasoning on Fire Imagery and Cultural Meaning

Abstract Meaning Representation (AMR) is a graph-based semantic representation that has been incorporated into numerous downstream tasks, in particular due to substantial efforts developing text-to-AMR parsing and AMR-to-text generation models. However, there still exists a large gap between fluent, natural sentences and texts generated from AMR-to-text generation models. Prompt-based Large Language Models (LLMs), on the other hand, have demonstrated an outstanding ability to produce fluent text in a variety of languages and domains. In this paper, we investigate the extent to which LLMs can improve the AMR-to-text generated output fluency post-hoc via prompt engineering. We conduct automatic and human evaluations of the results, and ultimately have mixed findings: LLM-generated paraphrases generally do not exhibit improvement in automatic evaluation, but outperform baseline texts according to our human evaluation. Thus, we provide a detailed error analysis of our results to investigate the complex nature of generating highly fluent text from semantic representations.

GPT4AMR: Does LLM-based Paraphrasing Improve AMR-to-text Generation Fluency?

Multilingual Large Language Models (LLMs) are increasingly used worldwide, making it essential to ensure they are free from gender bias to prevent representational harm. While prior studies have examined such biases in high-resource languages, low-resource languages remain understudied. In this paper, we propose a template-based probing methodology, validated against real-world data, to uncover gender stereotypes in LLMs. As part of this framework, we introduce the Domain-Specific Gender Skew Index (DS-GSI), a metric that quantifies deviations from gender parity. We evaluate four prominent models, GPT-4o mini, DeepSeek R1, Gemini 2.0 Flash, and Qwen QwQ 32B, across four semantic domains, focusing on Persian, a low-resource language with distinct linguistic features. Our results show that all models exhibit gender stereotypes, with greater disparities in Persian than in English across all domains. Among these, sports reflect the most rigid gender biases. This study underscores the need for inclusive NLP practices and provides a framework for assessing bias in other low-resource languages.

Probing Gender Bias in Multilingual LLMs: A Case Study of Stereotypes in Persian

An automatic court hearing transcription system is being developed for the Federal Supreme Court of Ethiopia to address the challenges faced in manual transcription. By utilizing Automatic Speech Recognition (ASR) technology, the system aims to transcribe Amharic-language court recordings accurately and efficiently. This solution not only improves the court system but also safeguards the health of transcribers and enhances the overall speed and quality of legal proceedings in Ethiopia. In this study, a self-supervised, Transformer-based Wav2Vec 2.0 approach is used to build an ASR system. With a dataset comprising over 500 hours of unlabeled data, the system has achieved a Word Error Rate (WER) of 14.36%, showing its effectiveness in transcribing court proceedings with high accuracy.

Wav2Vec-Based Self-Supervised Learning for Court Hearing Transcription

Most resources for evaluating social biases in Large Language Models are developed without co-design from the communities affected by these biases, and rarely involve participatory approaches. We introduce HESEIA, a dataset of 46,499 sentences created in a teacher professional development course. The course involved 370 high-school teachers and 5,370 students from 189 Latin-American schools. Unlike existing benchmarks, HESEIA captures intersectional biases across multiple demographic axes and school subjects. It reflects local contexts through the lived experience and pedagogical expertise of educators. Teachers used minimal pairs to create sentences that express stereotypes relevant to their school subjects and communities. We show the dataset's diversity in terms of both the types of biases represented and the knowledge areas included. We demonstrate that the dataset contains more stereotypes unrecognized by current LLMs than previous datasets, potentially making bias mitigation by self-debiasing harder. HESEIA is available to support bias assessments grounded in educational communities.

An intersectional bias evaluation dataset grounded in educational contexts

In this study, we investigate how author affiliation shapes academic discourse, proposing it as an effective proxy for author perspective in understanding what topics are studied, how nations are framed, and whose realities are prioritised. Using Palestine as a case study, we apply BERTopic and Structural Topic Modelling (STM) to 29,536 English-language academic articles collected from the OpenAlex database. We find that domestic authors focus on practical, local issues like healthcare, education, and the environment, while foreign authors emphasise legal, historical, and geopolitical discussions. These differences, in our interpretation, reflect lived proximity to war and crisis. We also note that while BERTopic captures greater lexical nuance, STM enables covariate-aware comparisons, offering deeper insight into how affiliation correlates with thematic emphasis. We propose extending this framework to other underrepresented countries, including a future study focused on Gaza post-October 7.
