
In this paper, we report our participation in the PalmX cultural evaluation shared task. Our system, CultranAI, focused on data augmentation and LoRA fine-tuning of large language models (LLMs) for Arabic cultural knowledge representation. We benchmarked several LLMs to identify the best-performing model for the task. In addition to using the PalmX dataset, we augmented it by incorporating the Palm dataset and curated a new dataset of over 22K culturally grounded multiple-choice questions (MCQs). Our experiments showed that the Fanar-1-9B-Instruct model achieved the highest performance. We fine-tuned this model on the combined augmented dataset of 22K+ MCQs. On the blind test set, our submitted system ranked 5th with an accuracy of 70.50%, while on the PalmX development set it achieved an accuracy of 84.1%.

EMNLP 2025

CultranAI at PalmX 2025: Data Augmentation for Cultural Knowledge Representation

culturally informed

native

arabic llms

contextual understanding

foundation models

llms

language diversity

minority languages

augmentation

large language models

multilingual

workshop paper

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)

Celebrate 30 Years of EMNLP!
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

Cultural and religious cues are essential for understanding; their absence in LLMs leads to skewed outputs and unfair outcomes.
Limited evaluation: existing benchmarks rarely probe culture-aware or Islam-centric competencies.

AYA at PalmX 2025: Modeling Cultural and Islamic Knowledge in LLMs

This paper presents our system for Subtask 1: Islamic Inheritance Reasoning in the QIAS 2025 Shared Task, which evaluates large language models (LLMs) on ʿilm al-mawārīth (the Islamic science of inheritance) using a benchmark of Arabic multiple-choice questions (MCQs) derived from expert-reviewed fatwas. We explore static and dynamic few-shot prompting, retrieval-augmented generation (RAG) using a large fatwa corpus, and a progressive n-gram overlap retrieval method. The n-gram method is employed both to select the top five most similar MCQs for dynamic prompting and to retrieve the most relevant fatwa answers as additional context during inference. We evaluate both proprietary and open-source LLMs individually and in ensemble configurations. Results show that dynamic prompting and RAG consistently improve accuracy across models, with our best-performing model, Gemini, achieving 62.26% accuracy on the test set.

SHA at the QIAS Shared Task: LLMs for Arabic Islamic Inheritance Reasoning
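The abstract above describes using n-gram overlap to pick the five most similar MCQs for dynamic few-shot prompting. The following is a minimal sketch of that general idea, not the authors' progressive method: the function names, the whitespace tokenizer, and the weighting of longer n-grams are all assumptions made for illustration.

```python
from collections import Counter


def word_ngrams(text, n):
    """Count word n-grams in a whitespace-tokenized string (assumed tokenizer)."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap_score(query, candidate, max_n=3):
    """Sum shared n-gram counts for n = 1..max_n.

    Longer n-grams are weighted by n (an assumption of this sketch),
    so matching phrases count more than isolated matching words.
    """
    score = 0
    for n in range(1, max_n + 1):
        shared = word_ngrams(query, n) & word_ngrams(candidate, n)
        score += n * sum(shared.values())
    return score


def top_k_similar(query, pool, k=5):
    """Return the k pool items with the highest overlap with the query."""
    ranked = sorted(pool, key=lambda item: overlap_score(query, item), reverse=True)
    return ranked[:k]
```

In a dynamic prompting setup, `pool` would hold previously seen questions and the returned items would be placed in the prompt as few-shot examples; the same scoring could rank fatwa passages for retrieval.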

Word sense disambiguation is the task of selecting a word's applicable word sense in a given context. However, ambiguous texts may lack the information necessary to disambiguate words completely, resulting in multiple word senses with varying degrees of plausibility. We design a dataset around this premise: our samples consist of short stories of 4 to 5 sentences, where the sentence with the word to be disambiguated is itself ambiguous and surrounding sentences only contain indirect clues towards the more plausible word sense. We collect annotations from humans who rate the plausibility of a given word sense on a scale from 1 to 5. In total, our dataset contains 19,701 human word sense annotations on 1,899 stories. We investigate the performance of large language models on our data and find that many poorly correlate with human judgments. We also find that fine-tuning on our data can increase performance.

AmbiStory: A Challenging Dataset of Lexically Ambiguous Short Stories

With the introduction of large language models, NLP has undergone a paradigm shift where these models now serve as the backbone of most developed systems. However, while highly effective, they remain opaque and difficult to interpret, which limits their adoption in critical applications that require transparency and trust. Two major approaches aim to address this: rationale extraction, which highlights input spans that justify predictions, and concept bottleneck models, which make decisions through human-interpretable concepts. Yet each has limitations. Crucially, current models lack a unified framework that connects where a model looks (rationales) with why it makes a decision (concepts). We introduce CLARITY, a model that first selects key input spans, maps them to interpretable concepts, and then predicts using only those concepts. This design supports faithful, multi-level explanations and allows users to intervene at both the rationale and concept levels. CLARITY achieves competitive accuracy while offering improved transparency and controllability.

Connecting Concept Layers and Rationales to Enhance Language Model Interpretability

Cross-lingual Extractive Question Answering (EQA) extends standard EQA by requiring models to find answers in passages written in languages different from the questions. The Generalized Cross-Lingual Transfer (G-XLT) task evaluates models' zero-shot ability to transfer question answering capabilities across languages using only English training data. While previous research has primarily focused on scenarios where answers are always present, real-world applications often encounter situations where no answer exists within the given context. This paper introduces an enhanced G-XLT task definition that explicitly handles unanswerable questions, bridging a critical gap in current research. To address this challenge, we present two new datasets: miXQuAD and MLQA-IDK, which address both answerable and unanswerable questions and respectively cover 12 and 7 language pairs. Our study evaluates state-of-the-art large language models using fine-tuning, parameter-efficient techniques, and in-context learning approaches, revealing interesting trade-offs between a smaller fine-tuned model's performance on answerable questions versus a larger in-context learning model's capability on unanswerable questions. We also examine language similarity patterns based on model performance, finding alignments with known language families.

Cross-Lingual Extractive Question Answering with Unanswerable Questions

Readability-controlled text modification aims to rewrite an input text so that it reaches a target level of difficulty. This task is closely related to automatic readability assessment (ARA) since, depending on the difficulty level of the input text, it may need to be simplified or complexified. Most previous research in LLM-based text modification has focused on zero-shot prompting, without further input from ARA or guidance on text spans that most likely require revision. This paper shows that ARA models for texts and sentences, as well as predictions of text spans that should be edited, can enhance performance in readability-controlled text modification.

Enhancing Readability-Controlled Text Modification with Readability Assessment and Target Span Prediction

A fundamental aspect of the semantics of natural language is that novel meanings can be formed from the composition of previously known parts. Vision-language models (VLMs) have made significant progress in recent years; however, there is evidence that they are unable to perform this kind of composition. For example, given an image of a red cube and a blue cylinder, a VLM such as CLIP is likely to incorrectly label the image as a red cylinder or a blue cube, indicating it represents the image as a 'bag-of-words' and fails to capture compositional semantics. Diffusion models have recently gained significant attention for their impressive generative abilities, and zero-shot classifiers based on diffusion models have been shown to perform competitively with CLIP in certain compositional tasks. We explore whether the generative Diffusion Classifier has improved compositional generalisation abilities compared to discriminative models. We assess three models (Diffusion Classifier, CLIP, and ViLT) on their ability to bind objects with attributes and relations in both zero-shot learning (ZSL) and generalised zero-shot learning (GZSL) settings. Our results show that the Diffusion Classifier and ViLT perform well at concept binding tasks, but that all models struggle significantly with the relational GZSL task, underscoring the broader challenges VLMs face with relational reasoning. Analysis of CLIP embeddings suggests that the difficulty may stem from overly similar representations of relational concepts such as left and right. Code and dataset are available at [link redacted for anonymity].

Evaluating Compositional Generalisation in VLMs and Diffusion Models

This paper presents a systematic evaluation of nearest neighbors in a range of semantic spaces across textual and visual modalities.
Focusing on the abstractness-concreteness continuum, we define an overlap measure to compare concepts differing in their linguistic vs. perceptual nature, and indeed find that alignment is primarily determined by modality and concreteness: Models from the same modality show stronger alignment than cross-modal models, and spaces of concrete concepts show stronger alignment than those of abstract ones.

Evaluating Textual and Visual Semantic Neighborhoods of Abstract and Concrete Concepts
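The abstract above mentions an overlap measure over nearest neighbors in different semantic spaces but does not define it. One plausible reading, sketched here purely as an assumption, is the fraction of a concept's k nearest neighbors (by cosine similarity) that two spaces share; the function names and the dictionary-of-vectors representation are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(space, word, k):
    """Return the k words in `space` most similar to `word` (excluding itself)."""
    others = sorted(w for w in space if w != word)  # sorted for deterministic ties
    others.sort(key=lambda w: cosine(space[word], space[w]), reverse=True)
    return set(others[:k])


def neighbor_overlap(space_a, space_b, word, k):
    """Share of k-nearest neighbors of `word` common to both spaces.

    Assumed definition: |NN_a(word) & NN_b(word)| / k, computed over the
    vocabulary that both spaces have in common.
    """
    shared = set(space_a) & set(space_b)
    a = {w: space_a[w] for w in shared}
    b = {w: space_b[w] for w in shared}
    return len(nearest_neighbors(a, word, k) & nearest_neighbors(b, word, k)) / k
```

Averaging `neighbor_overlap` over a set of concrete concepts and over a set of abstract concepts would give two numbers that can be compared across pairs of models, in the spirit of the comparison the abstract reports.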

In this work, we investigate the relationship between the quality of explanations produced by different models and the amount of implicit knowledge they are able to provide beyond the input. We approximate explanation quality via accuracy on a downstream task with a standardized pipeline (GEISER) and study its correlation with three different association measures, each capturing different aspects of implicitness, defined as a combination of relevance and novelty. We conduct experiments with three SOTA LLMs on four tasks involving implicit knowledge, with explanations either confirming or contradicting the correct label. Our results demonstrate that providing quality explanations consistently improves the accuracy of LLM predictions, even when the models are not explicitly trained to take explanations as input, and underline the correlation between implicit content delivered by the explanation and its effectiveness.

Explanations explained. Influence of Free-text Explanations on LLMs and the Role of Implicit Knowledge

Frames capture aspects of an issue that are emphasized in a debate by interlocutors and can help us understand how political language conveys different perspectives and ultimately shapes people's opinions. The Media Frame Corpus (MFC) is the most commonly used framework with categories and detailed guidelines for operationalizing frames. It is, however, focused on a few salient U.S. news issues, making it unclear how well these frames can capture news issues in other cultural contexts. To explore this, we introduce FrameNews-PT, a dataset of Brazilian Portuguese news articles covering political and economic news, and annotate it within the MFC framework.
Through several annotation rounds, we evaluate the extent to which MFC frames generalize to the Brazilian debate issues. We further evaluate how fine-tuned and zero-shot models perform on out-of-domain data.
Results show that the 15 MFC frames remain broadly applicable with minor revisions of the guidelines. However, some MFC frames are rarely used, and novel news issues are analyzed using general 'fallback' frames. We conclude that cross-cultural frame use requires careful consideration.
