
To automatically evaluate and select the best model and to improve code quality for automatic incident remediation in IT Automation, it is crucial to verify whether the code generated for a remediation action is syntactically and semantically correct and whether it executes as intended. There are three approaches: 1) conventional methods use surface-form similarity metrics (token match, exact match, etc.), which have numerous limitations; 2) execution-based evaluation focuses on code functionality, based on pass/fail judgments for given test cases; and 3) LLM-as-a-Judge employs LLMs for automated evaluation, judging whether an answer is correct for a given problem based on pre-defined metrics. We introduced two new LLM-as-a-Judge metrics that use bidirectional functionality matching and logic representation for reference-less automatic validation and refinement of Bash code. We used execution-based evaluation as ground truth to evaluate our metrics. Results show high accuracy and agreement with execution-based evaluation (significantly better than string similarity metrics and up to 8% over the LLM metric baseline). Finally, we built Reflection code agents that use the judgments and feedback from our evaluation metrics, achieving significant improvement (up to a 24% increase in accuracy) in automatic code refinement.

EMNLP 2025

LLM-as-a-Judge for Reference-less Automatic Code Validation and Refinement for Natural Language to Bash in IT Automation
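As a rough illustration of why the surface-form metrics mentioned in the abstract can mislead, the Python sketch below scores two pairs of Bash commands with exact match and a token-overlap F1. The function names (`exact_match`, `token_f1`), the choice of token F1 as the "token match" metric, and the example commands and file name are all assumptions made for illustration; they are not taken from the paper.

```python
from collections import Counter


def exact_match(candidate: str, reference: str) -> bool:
    """True when both commands have the same whitespace-separated tokens in order."""
    return candidate.split() == reference.split()


def token_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 between two whitespace-tokenized commands."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it
    # appears in both commands.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical pair 1: both commands print the number of lines in app.log.
equivalent = ("cat app.log | wc -l", "wc -l < app.log")

# Hypothetical pair 2: nearly identical text, different behavior
# (last five lines versus first five lines).
different = ("tail -n 5 app.log", "head -n 5 app.log")

for name, (candidate, reference) in [("equivalent", equivalent),
                                     ("different", different)]:
    print(name, exact_match(candidate, reference),
          round(token_f1(candidate, reference), 3))
```

In this toy setup the behaviorally different pair receives a higher token F1 than the functionally equivalent pair, and exact match rejects both, which is the kind of limitation that motivates execution-based and LLM-based judgments.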

poster

## Welcome!
"I am excited to welcome you to this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first workshop, which had 14 accepted papers. As the field looks ahead, Suzhou is the fitting location for celebrating this milestone: rooted in a long literary tradition, yet modern and forward-looking, and home to a large share of the NLP community."

*Message from the General Chair, Dirk Hovy*

[**Link to Conference Handbook**](https://drive.google.com/file/d/1johU5QqVVYO4RfH7QcIORr7qrVBdzdwC/view?usp=sharing)






Celebrate 30 Years of EMNLP! 
EMNLP 2025 will be held in Suzhou, China from November 5th to November 9th, 2025.

We present a robust framework for deploying domain-specific language agents that can query industrial sensor data using natural language. Grounded in the Reasoning and Acting (ReAct) paradigm, our system introduces three key innovations: (1) integration of the `Self-Ask` method for compositional, multi-hop reasoning; (2) a multi-agent architecture with `Review`, `Reflect` and `Distillation` components to improve reliability and fault tolerance; and (3) a long-context prompting strategy leveraging curated in-context examples, which we call *Tiny Trajectory Store*, eliminating the need for fine-tuning. We apply our method to Industry 4.0 scenarios, where agents query SCADA systems (e.g., SkySpark) using questions such as, “How much power did B002 AHU 2-1-1 use on 6/14/16 at the POKMAIN site?” To enable systematic evaluation, we introduce **IoTBench**, a benchmark of 400+ tasks across five industrial sites. Our experiments show that ReAct-style agents enhanced with long-context reasoning (`ReActXen`) significantly outperform standard prompting baselines across multiple LLMs including smaller models. This work repositions NLP agents as practical interfaces for industrial automation, bridging natural language understanding and sensor-driven environments.

ReAct Meets Industrial IoT: Language Agents for Data Access

Trends on microblogs often transcend linguistic boundaries, evolving into global phenomena with significant societal and economic impact. We address the novel task of predicting "cross-lingual trends" on microblogs, that is, trends that originate in one language and subsequently become popular in others. While crucial for global monitoring and marketing, this area has been under-explored due to the challenge of cross-lingual trend identification. We introduce a methodology to overcome this by automatically constructing a dataset using Wikipedia's inter-language links to reconcile trend names. We propose a prediction model that leverages a rich feature set, including not only temporal frequency but also microblog content and external knowledge signals from Wikipedia (e.g., article content, pageviews). Our approach achieves higher accuracy than existing trend prediction methods and LLM-based approaches, enabling effective early detection of cross-lingual trends.

Predicting Cross-lingual Trends in Microblogs

The increasing complexity of modern driving systems demands efficient collection and analysis of specific driving scenarios that are crucial for system development and validation. Current approaches either rely on massive data collection followed by manual filtering, or rigid threshold-based recording systems that often miss important edge cases. In this paper, we present Distributed Adaptive Scene Recognition (DASR), a novel multi-agent cloud-edge framework for language-guided scene detection in connected vehicles. Our system leverages the complementary strengths of cloud-based large language models and edge-deployed vision language models to intelligently identify and preserve relevant driving scenarios while optimizing limited on-vehicle buffer storage. The cloud-based LLM serves as an intelligent coordinator that analyzes developer prompts to determine which specialized tools and sensor data streams should be incorporated, while the edge-deployed VLM efficiently processes video streams in real time to make relevant decisions. Extensive experiments across multiple driving datasets demonstrate that our framework achieves superior performance compared to larger baseline models, with exceptional performance on complex driving tasks requiring sophisticated reasoning. DASR also shows strong generalization capabilities on out-of-distribution datasets and significantly reduces storage requirements (28.73%) compared to baseline methods.

DASR: Distributed Adaptive Scene Recognition - A Multi-Agent Cloud-Edge Framework for Language-Guided Scene Detection

For an e-commerce domain, the customer address is the single most important piece of customer data for ensuring accurate and reliable deliveries. In this two-part study, we first outline the construction of a language model to assist customers with address standardization and in the latter part, we detail a novel Pareto-ensemble multi-task prediction algorithm that derives critical insights from customer addresses to minimize operational losses arising from a given geographical area. Finally, we demonstrate the potential benefits of the proposed address intelligence system for a large e-commerce domain through large scale experiments on a commercial system.

An Address Intelligence Framework for E-commerce Deliveries

Adapting language models to learn continuously from data streams while retaining previous knowledge is a key challenge in artificial intelligence (AI), particularly in lifelong language learning. Existing distillation methods are based on offline techniques, limiting their ability to update in real time and adapt to dynamic environments. To address this, we propose online dynamic mutual distillation, a novel framework that enables continuous mutual learning from task streams without relying on domain-specific teachers. To our knowledge, this is the first application of mutual learning in lifelong language learning, providing dynamic knowledge transfer without domain-specific teachers. Moreover, our extensive experiments demonstrate that the proposed method reduces catastrophic forgetting while improving task performance on various benchmark datasets, making it suitable for real-world, dynamic natural language processing (NLP) applications such as adaptive chatbots and personalized language systems. We will make our code publicly available upon acceptance.

L4: Mutual Learning Helps Lifelong Language Learning

The rapid advancements in Large Language Models (LLMs) have enabled their adoption in real-world industrial scenarios for various natural language processing tasks. However, the high inference cost of large-scale LLMs makes their deployment impractical, necessitating the use of smaller models. Despite their efficiency, smaller LLMs lack robust zero-shot instruction-following capabilities across diverse domains, limiting their adaptability to dynamic user requirements. Traditional fine-tuning approaches exacerbate this issue by inducing catastrophic forgetting, reducing the model’s generalization ability for unseen tasks. In this paper, we propose Domain Adaptive Continual Instruction Pre-Training via Reading Comprehension (DACIP-RC), a continual pre-training technique that enhances smaller LLMs' domain adaptability for business conversational tasks. Unlike conventional pre-training approaches that rely on next-token prediction, DACIP-RC generates diverse task instructions and responses via reading comprehension on conversation transcripts, enabling better instruction generalization. Our empirical evaluations demonstrate that DACIP-RC significantly improves zero-shot generalization across a wide range of business conversational tasks, including meeting summarization, action item generation, and call purpose identification. To the best of our knowledge, this is the first work to apply instruction pre-training on business conversational data, providing insights into how industries can leverage proprietary datasets for domain adaptation.

DACIP-RC: Domain Adaptive Continual Instruction Pre-Training via Reading Comprehension on Business Conversations

Large Vision-Language Models (LVLMs) with only 7B parameters have shown promise as automated judges in chart comprehension tasks. However, tiny models (<2B parameters) still perform poorly as judges, limiting their real-world use in resource-constrained settings. To address this, we propose two approaches to ensure cost-efficient evaluation: (i) multi-criteria prompting, which combines separate evaluation criteria into a single query, and (ii) domain-adaptive transfer learning, in which we fine-tune a 2B-parameter VLM on synthetic judgments in a chart dataset to create **ChartJudge**. Experiments show that multi-criteria prompting exposes robustness gaps, leading to a large drop in performance for 7B models, including specialized LVLM judges like LLaVA-Critic. In addition, we find that our tiny LVLM (ChartJudge) can effectively transfer knowledge from one dataset to another, making it a more specialized model. Our fine-grained analysis across chart types and query complexities offers actionable insights into trade-offs between model size, prompt design, and transferability, enabling scalable, low-cost evaluation for chart reasoning tasks. Our code and the data will be made publicly available.

Deploying Tiny LVLM Judges for Real-World Evaluation of Chart Models: Lessons Learned and Best Practices

Cyberbullying (CB) involves complex relational dynamics that are often oversimplified as a binary classification task. Existing youth-focused CB datasets rely on scripted role-play, lacking conversational realism and ethical youth involvement, with little or no evaluation of their social plausibility. To address this, we introduce a **youth-in-the-loop** dataset, "**BullyBench**", developed by adolescents (ages 15–16) through an ethical co-research framework. We introduce a structured **intrinsic** quality evaluation with **experts-in-the-loop** (social scientists, psychologists, and content moderators) for assessing realism, relevance, and coherence in youth CB data. Additionally, we perform **extrinsic** baseline evaluation of this dataset by benchmarking encoder- and decoder-only language models for multi-class CB role classification for future research. A three-stage annotation process by young adults refines the dataset into a gold-standard test benchmark, a high-quality resource grounded in minors’ lived experiences of CB detection. Code and data are available for review (https://github.com/youthcodesign/emnl-industry-track-submission). Please note, labels for BullyBench will be made available after review.

BullyBench: Youth & Experts-in-the-loop Framework for Intrinsic and Extrinsic Cyberbullying NLP Benchmarking

Rapid developments in large language models have revolutionized many NLP tasks on English data. Unfortunately, the models and their evaluations for low-resource languages are being overlooked, especially for languages in South Asia. Although there are more than 650 languages in South Asia, many of them either have very limited computational resources or are missing from existing language models. Thus, a concrete question to be answered is: *Can we assess the current stage and challenges to inform our NLP community and facilitate model developments for South Asian languages?* In this survey, we have comprehensively examined current efforts and challenges of NLP models for South Asian languages by retrieving studies since 2020 with a focus on transformer-based models, such as BERT, T5, and GPT. We present advances and gaps across three essential aspects (data, models, and tasks), such as available data sources, fine-tuning strategies, and domain applications. Our findings highlight substantial issues, such as missing data in critical domains (e.g., health), code-mixing, and a lack of standardized evaluation. Our survey will raise awareness within the NLP community for more targeted data curation, unify benchmarks tailored to cultural and linguistic nuances of South Asia, and encourage an equitable representation of South Asian languages.

Bhaasha, Bhāṣā, Zaban: A Survey for Low-Resourced Languages in South Asia – Current Stage and Challenges

Recent advances in large language models (LLMs) have opened the door to culture-aware language tasks. We introduce the novel problem of adapting wine reviews across Chinese and English, which goes beyond literal translation by incorporating regional taste preferences and culture-specific flavor descriptors. In a case study on cross-cultural wine review adaptation, we compile the **first** parallel corpus of professional reviews, containing 8k Chinese and 16k Anglophone reviews. We benchmark both neural-machine-translation baselines and state-of-the-art LLMs with automatic metrics and human evaluation. For the latter, we propose three culture-oriented criteria (Cultural Proximity, Cultural Neutrality, and Cultural Genuineness) to assess how naturally a translated review resonates with target-culture readers. Our analysis shows that current models struggle to capture cultural nuances, especially in translating wine descriptions across different cultures. This highlights the challenges and limitations of translation models in handling cultural content.
