
Hate speech detection is a socially sensitive yet inherently subjective task, where individual judgments can vary widely based on personal traits. While recent work has explored how socio-demographic factors shape annotation behavior, the role of personality in Large Language Models (LLMs) remains underexplored. In this paper, we present the first comprehensive study of persona prompting in hate speech classification, focusing on MBTI-based personas. We begin with a human annotation survey demonstrating that MBTI traits significantly affect labeling behavior. Extending this to LLMs, we prompt four open-source LLMs with MBTI personas and evaluate their responses across three hate speech datasets. Our analysis reveals substantial persona-induced shifts, including inconsistencies with ground truth, disagreement across personas, and logit-level biases. These findings highlight the importance of carefully defining persona prompts in LLM-based annotation tasks, with implications for model fairness and alignment with human values.

EMNLP 2025

Hateful Person or Hateful Model? Investigating the Role of Personas in Hate Speech Detection by Large Language Models

workshop paper

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first
workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."<br>

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)






Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Recent progress in Natural Language Processing (NLP) has driven the creation of Large Language Models (LLMs) capable of tackling a vast range of tasks. A critical property of these models is their ability to handle large documents and process long token sequences, which has fostered the need for a robust evaluation methodology for long-text scenarios. To meet this requirement in the context of the Russian language, we present our benchmark consisting of 18 datasets designed to assess LLM performance in tasks such as information retrieval, knowledge extraction, machine reading, question answering, and reasoning. These datasets are categorized into four levels of complexity, enabling model evaluation across context lengths up to 128k tokens. To facilitate further research, we provide open-source datasets, a codebase, and a public leaderboard associated with the benchmark.

Long Context Benchmark for the Russian Language

This study aims to enhance the automatic identification and classification of metadiscourse markers in English texts, evaluating various large language models for this purpose. Metadiscourse is a commonly used rhetorical strategy in both written and spoken language to guide addressees through discourse. Due to its linguistic complexity and dependence on context, automated metadiscourse classification is challenging. Hypothesizing that LLMs may handle complicated tasks more effectively than supervised machine learning approaches, we tune and evaluate seven encoder language models on the task using a dataset totalling 575,541 tokens and annotated with 24 labels. The results show a clear improvement over supervised machine learning approaches as well as an untuned Llama3.3-70B-Instruct baseline, with XLNet-large achieving an accuracy and F1-score of 0.91 and 0.93, respectively. However, four less frequent categories record F-scores below 0.5, highlighting the need for more balanced data representation.

Enhancing the Automatic Classification of Metadiscourse in Low-Proficiency Learners' Spoken and Written English Texts Using XLNet

The ability to track entities is fundamental for language understanding, yet the internal mechanisms governing this capability in Small Language Models (SLMs) are poorly understood. Previous studies often rely on indirect probing or complex interpretability methods, leaving a gap for lightweight diagnostics that connect model behavior to performance. To bridge this gap, we introduce a framework to analyze entity tracking by measuring the attention flow between entity and non-entity tokens within SLMs. We apply this to analyze models both before and after Parameter-Efficient Fine-Tuning (PEFT). Our analysis reveals two key findings. First, SLMs' attentional strategies vary significantly with text type, but entities consistently receive a high degree of focus. Second, we show that PEFT -- specifically QLoRA -- dramatically improves classification performance on entity-centric tasks by increasing the model's attentional focus on entity-related tokens. Our work provides direct evidence for how PEFT can refine a model's internal mechanisms and establishes attention analysis as a valuable, lightweight diagnostic tool for interpreting and improving SLMs.

Entity Tracking in Small Language Models: An Attention-Based Study of Parameter-Efficient Fine-Tuning
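
The idea of measuring attention flow toward entity tokens can be illustrated with a minimal sketch. It is not the authors' framework: the function name `entity_attention_share`, the toy attention matrix, and the choice to summarize flow as the fraction of total attention mass landing on entity positions are all assumptions made for illustration.

```python
from typing import Sequence


def entity_attention_share(
    attention: Sequence[Sequence[float]],
    entity_positions: Sequence[int],
) -> float:
    """Return the fraction of total attention mass that lands on entity tokens.

    attention[i][j] is the weight query token i places on key token j.
    entity_positions lists the key positions that belong to entities.
    (Hypothetical summary statistic, assumed for this sketch.)
    """
    entity_cols = set(entity_positions)
    total = 0.0
    on_entities = 0.0
    for row in attention:
        for col, weight in enumerate(row):
            total += weight
            if col in entity_cols:
                on_entities += weight
    if total == 0.0:
        raise ValueError("attention matrix carries no mass")
    return on_entities / total


# Toy 3-token example: each row sums to 1, token 0 is the only entity.
attention = [
    [0.7, 0.1, 0.2],
    [0.5, 0.3, 0.2],
    [0.6, 0.2, 0.2],
]
share = entity_attention_share(attention, entity_positions=[0])
print(f"share of attention on entity tokens: {share:.2f}")
```

Comparing this share with the uniform baseline (number of entity tokens divided by sequence length, here 1/3) gives a rough sense of whether entities attract more attention than chance; in practice the matrix would come from a specific layer and head of the model under study.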

This paper investigates stance detection on Nigerian 2023 election tweets by comparing transformer-based and classical machine learning models. A balanced dataset of 2,100 annotated tweets was constructed, and BERT-base-uncased was fine-tuned to classify stances into Favor, Neutral, and Against. The model achieved 98.1% accuracy on an 80/20 split and an F1-score of 96.9% under 5-fold cross-validation. Baseline models such as Naïve Bayes, Logistic Regression, Random Forest, and SVM were also evaluated, with SVM achieving 97.6% F1. While classical methods remain competitive on curated datasets, BERT proved more robust in handling noisy, sarcastic, and ambiguous text, making it better suited for real-world applications in low-resource African NLP contexts.

Stance Detection on Nigerian 2023 Election Tweets Using BERT: A Low-Resource Transformer-Based Approach

Code-switching (CSW) in speech is motivated by conversational factors across levels of linguistic analysis. While we know much about why speakers code-switch, there remains great scope for exploring how CSW occurs in speech, particularly within the discourse-level linguistic context. We build on prior work by asking: how are patterns of CSW influenced by different conversational contexts spanning Academic, Cultural, Personal, and Professional discourse topics? To answer this, we annotate a Mandarin-English spontaneous speech corpus, and analyze its discourse topics alongside various aspects of CSW production. We show that discourse topics interact significantly with utterance-level CSW, resulting in distinctive patterns of CSW presence, richness, language direction, and syntax that are uniquely associated with different contexts. Our work is the first to take such a context-sensitive approach to studying CSW, contributing to a broader understanding of the discourse topics that motivate speakers to code-switch in diverse ways.

Code-switching in Context: Investigating the Role of Discourse Topic in Bilingual Speech Production

Discourse adverbials are key features of discourse coherence, but their function is often ambiguous. In this work, we investigate how the discourse function of "otherwise" varies in different contexts. We revise the function set in Rohde et al. (2018b) to account for a new meaning we have encountered. In turn, we create the "otherwise" corpus, a dataset of naturally occurring passages annotated for discourse functions, and identify, through a corpus study, lexical signals that make a function available. We define continuation acceptability, a metric based on surprisal, to probe language models for what they take the function of "otherwise" to be in a given context. Our experiments show that one can improve function inference by focusing solely on tokens up to and including the head verb of the continuation (i.e., the "otherwise" clause) that have the most varied surprisal across function-disambiguating discourse markers. Lastly, we observe that some of these tokens confirm lexical signals we found in our earlier corpus study, which provides some promising evidence to motivate future pragmatic studies in language models.
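
As a rough illustration of selecting the tokens whose surprisal varies most across discourse markers, the sketch below computes surprisal as the negative base-2 log probability and ranks token positions by the variance of that value across markers. The abstract does not define continuation acceptability itself, so this is not that metric; the function names, the marker labels `marker_a` and `marker_b`, and the toy probabilities are assumptions.

```python
import math
from statistics import pvariance
from typing import Dict, List


def surprisal(prob: float) -> float:
    """Surprisal in bits of a token with probability prob: -log2(prob)."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)


def most_varied_tokens(token_probs: Dict[str, List[float]], top_k: int) -> List[int]:
    """Rank continuation token positions by surprisal variance across markers.

    token_probs maps each discourse marker to the per-token probabilities a
    language model assigns to the same continuation when that marker is used.
    Returns the top_k positions, most varied first.
    """
    rows = list(token_probs.values())
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all markers must score the same continuation")
    variances = [
        pvariance([surprisal(row[i]) for row in rows]) for i in range(length)
    ]
    ranked = sorted(range(length), key=lambda i: variances[i], reverse=True)
    return ranked[:top_k]


# Toy probabilities for a three-token continuation under two hypothetical markers.
probs = {
    "marker_a": [0.5, 0.25, 0.5],
    "marker_b": [0.5, 0.5, 0.125],
}
selected = most_varied_tokens(probs, top_k=2)
print(selected)
```

In a real setting the probabilities would be read from a language model's output distribution, and the ranking would be restricted to tokens up to and including the head verb of the continuation, as the abstract describes.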

"Otherwise" in Context: Exploring Discourse Functions with Language Models

This study proposes augmenting dialog data with think-aloud utterances (TAUs) for modeling individual personalities in text chat with LLMs. A TAU is a verbalization of a speaker's thought before articulating the utterance. We expect that "persona LLMs" trained with TAU-augmented data can better mimic the speaker's personality traits. We tested whether the trained persona LLMs acquire the speakers' personality with respect to the Big Five, a framework characterizing human personality traits along five dimensions. The results showed that LLMs trained with TAU-augmented data align more closely with the speakers' Agreeableness and Neuroticism than those trained with the original dialog data. We also found that the quality of TAU augmentation affects the persona LLMs' performance.

Augmenting Dialog with Think-Aloud Utterances for Modeling Individual Personality Traits by LLM

Novice and expert users have different systematic preferences in task-oriented dialogues. However, whether catering to these preferences actually improves user experience and task performance remains understudied. To investigate the effects of expertise-based personalization, we first built a version of an enterprise AI assistant with passive personalization. We then conducted a user study where participants completed timed exams, aided by the two versions of the AI assistant. Preliminary results indicate that passive personalization helps reduce task load and improve assistant perception, but reveal task-specific limitations that can be addressed through providing more user agency. These findings underscore the importance of combining active and passive personalization to optimize user experience and effectiveness in enterprise task-oriented environments.

Is Passive Expertise-Based Personalization Enough? A Case Study in AI-Assisted Test-Taking

A prominent issue in aligning language models (LMs) to personalized preferences is underspecification: the lack of information from users about their preferences. A popular way of injecting such specification is to add a prefix (e.g., prior relevant conversations) to the current user's conversation to steer the preference distribution. Most methods passively model personal preferences with prior example preference pairs. We ask whether models benefit from actively inferring preference descriptions, and address this question by creating a synthetic personalized alignment dataset based on famous people with known public preferences. We then test how effective finetuned models with 1-8B parameters are at inferring and aligning to personal preferences. Results show that higher-quality active prefixes lead to better generalization, more contextually faithful models, and fewer systematic biases across different protected attributes. All our results suggest that active alignment can lead to a more controllable and efficient path for personalized alignment.

Is Active Persona Inference Necessary for Aligning Small Models to Personal Preferences?

As Large Language Models (LLMs) are increasingly used to simulate human opinions and reasoning, a key challenge lies in evaluating whether their reasoning accurately reflects the beliefs of individuals from diverse demographic backgrounds. Building on prior work in alignment assessment, we hope to provide a more extensive analysis to evaluate alignment methods across different model types. In this work, we analyze multiple strategies to induce alignment in instruct and reasoning models. Using the OpinionQA dataset, we construct 500 demographic personas and associated axioms, and assess LLM performance across multiple prompting setups. Our findings reveal trade-offs between model output correctness and reasoning faithfulness, highlighting key asymmetries in how LLMs simulate belief-grounded reasoning and the current gaps that exist in alignment in LLMs.
