Automatic readability assessment plays a key role in ensuring effective communication between humans and language models. Despite significant progress, the field is hindered by inconsistent definitions of readability and by measurements that rely on surface-level text properties. In this work, we investigate the factors shaping human perceptions of readability through the analysis of 1.2k judgments, finding that, beyond surface-level cues, information content and topic strongly shape text comprehensibility. Furthermore, we evaluate 15 popular readability metrics across 5 datasets, contrasting them with 5 more nuanced, model-based metrics. Our results show that four model-based metrics consistently place among the top four in rank correlations with human judgments, while the best-performing traditional metric achieves an average rank of 7.8. These findings highlight a mismatch between current readability metrics and human perceptions, pointing to model-based approaches as a more promising direction.
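
To illustrate the kind of evaluation the abstract describes, the sketch below compares a traditional surface-level readability metric (Flesch reading ease, via the `textstat` package) against human judgments using Spearman rank correlation. The texts and human ratings here are hypothetical placeholders, and the specific metrics, model-based scorers, and datasets used in the paper are not reproduced; this is only a minimal sketch of how such rank correlations can be computed.

```python
# Minimal sketch: correlating a surface-level readability metric with human
# judgments via Spearman rank correlation. Texts and scores are hypothetical
# placeholders, not data from the paper.
import textstat
from scipy.stats import spearmanr

texts = [
    "The cat sat on the mat.",
    "Quantum entanglement links the states of distant particles.",
    "Preheat the oven to 180 degrees and bake for twenty minutes.",
]
human_scores = [4.8, 2.1, 4.2]  # e.g., mean readability ratings on a 1-5 scale

# Surface-level metric: Flesch reading ease (higher = easier to read).
metric_scores = [textstat.flesch_reading_ease(t) for t in texts]

# Rank correlation between the metric and human judgments; in a study like
# the one described, each metric would be ranked by this correlation and the
# ranks averaged across datasets.
rho, p_value = spearmanr(metric_scores, human_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```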
