India

Recent advances in large language models (LLMs) have significantly impacted data science workflows, giving rise to specialized data science agents designed to automate analytical tasks. Despite rapid adoption, systematic benchmarks evaluating the efficacy and limitations of these agents remain scarce. In this paper, we introduce a comprehensive benchmark specifically crafted to reflect real-world user interactions with data science agents, derived from observed usage of our commercial applications. We evaluate three LLMs (Claude-4.0-Sonnet, Gemini-2.5-Flash, and OpenAI-o4-Mini) across three approaches: zero-shot with context engineering, multi-step with context engineering, and SmolAgent. Our benchmark assesses performance across a diverse set of eight data science task categories, and additionally explores the sensitivity of models to common prompting issues, such as data leakage and slightly ambiguous instructions. We further investigate the influence of temperature parameters on overall and task-specific outcomes for each model and approach. Our findings reveal distinct performance disparities among the evaluated models and methodologies, highlighting critical factors that affect practical deployment. The benchmark dataset and evaluation framework introduced herein aim to provide a foundation for future research on more robust and effective data science agents.

IJCNLP-AACL 2025

DSBC : Data Science task Benchmarking with Context engineering

context engineering

agents

benchmarks

data science

poster

### Welcome to IJCNLP-AACL 2025! 
 It is a great honor to host this joint conference in Mumbai, India, from December 20 to 24, 2025. The joint conferences of IJCNLP and AACL are organized with alternating leadership in the Asia-Pacific region. The event is run by the Asian Federation of Natural Language Processing (AFNLP) in odd years, and by AACL in even years, while it is organized solely by ACL when the annual ACL meeting is held in the region. This year, the conference is primarily organized by AFNLP. 
*Kentaro Inui
MBZUAI, UAE
General Chair, IJCNLP-AACL 2025* 
Read full message and download the Conference Handbook [**here**](https://drive.google.com/file/d/1UTwxkAqSqI-GAoJC3wE1zZt5VP1Y8GX0/view?usp=sharing).

The 14th IJCNLP & 4th AACL will be held in Mumbai, India from December 20th to December 24th, 2025.

Although Farsi is widely spoken, no comprehensive benchmark exists for assessing commonsense reasoning in language models. We therefore present FarSense, a 6‑task benchmark for Farsi covering True/False judgment, Multiple-Choice questions, Explanation, Cause‑Effect inference, Counterfactual reasoning, and Knowledge Completion. Starting from Farsi‑Wikipedia, we filtered noise and retained ~4,210 passages, rewrote them into realistic daily scenarios, and derived the above tasks from each scenario. Scenario and task generation quality was first judged via native‑speaker annotations on outputs from five major LLMs: GPT‑4o, Gemini-2.5-Flash, Mistral-Large, Qwen‑Plus, and DeepSeek‑Chat. Gemini-2.5-Flash demonstrated the highest performance, leading to its use in generating a large-scale dataset, subsequently finalized through meticulous two-step human validation. Using FarSense, we measured the commonsense ability of the same five flagship LLMs and also fine‑tuned six compact models (1B–24B parameters) before re‑evaluating them. To ensure broad applicability, task wording was designed to minimize dialectal, cultural, or religious bias. Experiments show that targeted fine‑tuning yields substantial gains, confirming FarSense as a reliable, openly licensed resource for advancing reproducible commonsense understanding research in Farsi NLP. We publicly release all code and data at https://github.com/KamyarZeinalipour/FarSense.

FarSense: A Comprehensive Commonsense Benchmark and Evaluation Framework for the Farsi Language

The rapid growth of social networks has led to a surge in social bots, which often disseminate low-quality content and may manipulate public opinion, posing threats to online security. Although recent GNN-based bot detection methods perform strongly, they still face two major challenges. First, deep GNNs are prone to over-smoothing: neighbor aggregation blends bot and human node representations, obscuring bot-specific features. Second, social graphs are dominated by human–human and human–bot connections, while direct bot–bot links are scarce, making it difficult for effective bot representations to propagate within GNNs. To address these issues, we propose a Topology-Aware Gated Graph Neural Network to detect social bots. The framework employs topology-aware data augmentation to synthesize realistic bot nodes that preserve the original graph structure, mitigating class imbalance; it also introduces a hierarchical gating mechanism that restructures node embeddings into a tree format, selectively filtering noise and enhancing discriminative features. Experiments on three standard benchmark datasets show that the framework consistently surpasses leading baselines in highly imbalanced settings, delivering superior accuracy and robustness.

Topology-Aware Gated Graph Neural Network for Social Bot Detection

Using a sample of 25,000 Bing Copilot conversations, we study how the agent responds to users of varying levels of domain expertise and the resulting impact on user experience along multiple dimensions. Our findings show that across a variety of topical domains, the agent largely responds at proficient or expert levels of expertise (77% of conversations), which correlates with positive user experience regardless of the user's level of expertise. Misalignment, in which the agent responds at a level of expertise below that of the user, has a negative impact on overall user experience, with the impact more pronounced for more complex tasks. We also show that users engage more, as measured by the number of words in the conversation, when the agent responds at a level of expertise commensurate with that of the user. Our findings underscore the importance of alignment between users and AI when designing human-centered AI systems, to ensure satisfactory and productive interactions.

Speaking the Right Language: The Impact of Expertise (Mis)Alignment in User-AI Interactions

Dialogue State Tracking (DST) is a key part of task-oriented dialogue systems, identifying important information in conversations. However, its accuracy drops significantly in spoken dialogue environments due to named entity errors from Automatic Speech Recognition (ASR) systems. We introduce a simple yet effective data augmentation method that targets those entities to improve the robustness of DST models. Our method controls the placement of errors using keyword-highlighted prompts while introducing phonetically similar errors. As a result, it generates sufficient error patterns on keywords, leading to improved accuracy in noisy and low-accuracy ASR environments.

Speak & Spell: LLM-Driven Controllable Phonetic Error Augmentation for Robust Dialogue State Tracking

Social media has become a crucial platform for information dissemination and opinion expression. The massive and continuous generation of user content has given rise to various natural language processing tasks, such as sentiment analysis and topic classification. However, existing mainstream approaches typically focus on modeling individual tasks in isolation, lacking systematic exploration of collaborative modeling across multiple tasks. This neglects the inherent correlations among social media tasks, thereby limiting the model’s ability to fully comprehend and exploit the rich, multi-dimensional semantic information embedded in text. To address this challenge, we propose Task-adaptive Contrastive Learning with Cooperative Mixture of Experts (TaCL-CoMoE), a unified framework for social media multi-task learning. Specifically, we improve the gating mechanism by replacing the traditional softmax routing with sigmoid activation, enabling cooperative selection among multiple experts and mitigating the "expert monopoly" phenomenon. In addition, we introduce a task-adaptive contrastive learning strategy to further enhance the model’s ability to capture and distinguish semantic structures across different tasks. Experimental results on multiple public social media datasets demonstrate that TaCL-CoMoE consistently achieves state-of-the-art (SOTA) performance. The code is available at https://github.com/wxr2847/TaCL-CoMoE.

TaCL-CoMoE: Task-adaptive Contrastive Learning with Cooperative Mixture of Experts for Multi-task Social Media Analysis
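The gating change described in the TaCL-CoMoE abstract above (sigmoid activation in place of softmax routing) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the scalar expert outputs, and the example logits are all assumptions made for illustration. It only shows why softmax weights compete (they must sum to 1) while sigmoid weights are scored independently, so several experts can be strongly active at once.

```python
import math


def softmax_gate(logits):
    """Softmax routing: expert weights compete and always sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid_gate(logits):
    """Sigmoid routing: each expert gets an independent weight in (0, 1)."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]


def mix_experts(expert_outputs, gates):
    """Gate-weighted sum of per-expert scalar outputs (hypothetical combiner)."""
    return sum(g * y for g, y in zip(gates, expert_outputs))


# Hypothetical router logits for three experts; two are clearly relevant.
logits = [4.0, 3.5, -2.0]
soft = softmax_gate(logits)
sig = sigmoid_gate(logits)

# Raising the first logit shrinks the second expert's softmax weight,
# but leaves its sigmoid weight unchanged.
soft_shifted = softmax_gate([10.0, 3.5, -2.0])
sig_shifted = sigmoid_gate([10.0, 3.5, -2.0])
```

Under softmax, one dominant logit pushes every other expert's weight toward zero, which is one way to read the "expert monopoly" effect; under sigmoid, a relevant expert keeps its weight regardless of how the others score.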

Radiology report summarization (RRS) is critical for clinical workflows, requiring concise Impressions distilled from detailed Findings. This paper proposes a novel prompting strategy that enhances RRS by introducing a layperson summary as an intermediate step. This summary helps normalize key observations and simplify complex terminology using communication techniques inspired by doctor–patient interactions. Combined with few-shot in-context learning, this approach improves the model’s ability to map generalized descriptions to specific clinical findings. We evaluate our method on three benchmark datasets, MIMIC-CXR, CheXpert, and MIMIC-III, and compare it against state-of-the-art open-source language models in the 7B/8B parameter range, such as Llama-3.1-8B-Instruct. Results show consistent improvements in summarization quality, with gains of up to 5% on some metrics for prompting, and more than 20% for some models with instruction tuning.

Simplified Rewriting Improves Expert Summarization

User satisfaction in dialogue systems is inherently subjective. When the same response strategy is applied across users, minority users may assign different satisfaction ratings than majority users due to variations in individual intents and preferences. However, existing alignment methods typically train one-size-fits-all models that aim for broad consensus, often overlooking minority perspectives and user-specific adaptation. We propose a unified framework that models both individual- and group-level preferences for user satisfaction estimation. First, we introduce Chain-of-Personalized-Reasoning (CoPeR) to capture individual preferences through interpretable reasoning chains. Second, we propose an expectation-maximization-based Majority-Minority Preference-Aware Clustering (M²PC) algorithm that discovers distinct user groups in an unsupervised manner to learn group-level preferences. Finally, we integrate these components into a preference-adaptive reinforcement learning framework (PAda-PPO) that jointly optimizes alignment with both individual and group preferences. Experiments on the Emotional Support Conversation dataset demonstrate consistent improvements in user satisfaction estimation, particularly for underrepresented user groups.

Minority-Aware Satisfaction Estimation in Dialogue Systems via Preference-Adaptive Reinforcement Learning

Knowledge graph-based dialogue generation (KG-DG) is a challenging task requiring models to effectively incorporate external knowledge into conversational responses. While large language models (LLMs) have achieved impressive results across various NLP tasks, their ability to utilize external knowledge in KG-DG remains under-explored. We observe that LLMs often rely on internal knowledge, leading to detachment from provided knowledge graphs, even when they are given a flawlessly retrieved knowledge graph. First, we introduce LLM-KAT, an evaluation procedure for measuring knowledge attachment in generated responses. Second, we propose a simple yet effective entity anonymization technique to encourage LLMs to better leverage external knowledge. Experiments on the OpenDialKG dataset demonstrate that our approach improves LLMs' attachment to external knowledge.

Improving LLM’s Attachment to External Knowledge In Dialogue Generation Tasks Through Entity Anonymization
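The abstract above does not spell out how entity anonymization is performed, so the following is only one plausible reading, sketched under stated assumptions: entity names in the knowledge-graph triples and the dialogue are swapped for neutral placeholders before prompting, and swapped back in the generated response. The `ENTITY_n` placeholder scheme, the helper names, and the example triples are all hypothetical.

```python
def build_mapping(triples):
    """Assign a placeholder to every distinct head/tail entity (assumed scheme)."""
    mapping = {}
    for head, _relation, tail in triples:
        for entity in (head, tail):
            if entity not in mapping:
                mapping[entity] = f"ENTITY_{len(mapping) + 1}"
    return mapping


def anonymize(text, mapping):
    """Replace entity names with placeholders, longest names first."""
    for entity in sorted(mapping, key=len, reverse=True):
        text = text.replace(entity, mapping[entity])
    return text


def deanonymize(text, mapping):
    """Restore entity names; longest placeholders first so ENTITY_10 is not read as ENTITY_1."""
    pairs = sorted(mapping.items(), key=lambda kv: len(kv[1]), reverse=True)
    for entity, placeholder in pairs:
        text = text.replace(placeholder, entity)
    return text


# Hypothetical knowledge-graph triples and user turn.
triples = [
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Christopher Nolan", "directed", "Dunkirk"),
]
mapping = build_mapping(triples)
masked_turn = anonymize("Who directed Inception?", mapping)
restored = deanonymize("ENTITY_2 directed ENTITY_1 and ENTITY_3.", mapping)
```

The intuition behind a scheme like this is that a model cannot fall back on memorized facts about `ENTITY_2`, so any correct statement about it has to come from the supplied graph.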

With the rapid development of spatial audio technologies, applications in AR, VR, and other scenarios have garnered extensive attention. Unlike traditional mono sound, spatial audio offers a more realistic and immersive auditory experience. Despite notable progress in the field, there remains a lack of comprehensive surveys that systematically organize and analyze these methods and their underlying technologies. To address this gap, we provide a comprehensive overview of spatial audio and systematically review recent literature in the area. We chronologically outline existing work related to spatial audio and categorize these studies based on input-output representations, as well as generation and understanding tasks, thereby summarizing various research aspects of spatial audio. In addition, we review related datasets, evaluation metrics, and benchmarks, offering insights from both training and evaluation perspectives. Related materials are available at https://github.com/dieKarotte/ASAudio.

ASAudio: A Survey of Advanced Spatial Audio Research

Multimodal foundation models (MFMs) have demonstrated strong capabilities in e-commerce by effectively leveraging multimodal data to enhance product understanding and user experience. However, the development of e-commerce MFMs is hindered by two challenges: (1) the scarcity of large-scale, high-quality multimodal benchmark datasets; and (2) the lack of effective multimodal information integration methods in e-commerce. To address these challenges, we introduce MMECInstruct, the first large-scale, high-quality multimodal instruction dataset designed specifically for e-commerce MFMs. MMECInstruct comprises 75,000 samples covering 7 real-world e-commerce tasks, supporting both in-domain (IND) and out-of-domain (OOD) evaluations. Leveraging MMECInstruct, we develop CASLIE, a lightweight framework that enhances multimodal information understanding and integration for e-commerce. Our comprehensive evaluation demonstrates that MMECInstruct endows CASLIE with advanced capability and strong generalizability in e-commerce applications. MMECInstruct and CASLIE models are publicly accessible through https://github.com/ninglab/CASLIE.
