India

The pervasiveness of large language models (LLMs) in enterprise settings has also brought significant risks associated with their usage. Guardrails technologies aim to mitigate these risks by filtering LLMs' input/output text through various detectors. However, developing and maintaining robust detectors poses many challenges, one of which is the difficulty of acquiring production-quality labeled data on real LLM outputs before deployment. In this work, we propose STAR, a simple yet intuitive solution to generate production-like labeled data for LLM guardrails development. STAR is based on two key ideas: (i) using self-automated back-querying to synthetically generate data, paired with (ii) a sparse human-in-the-loop clustering technique to label the data. The aim of self-automated back-querying is to construct a parallel corpus roughly representative of the original dataset and resembling real LLM output. We then infuse existing datasets with our synthetically generated examples to produce robust training data for our detectors. We test our technique on one of the most difficult and nuanced detectors, the identification of health advice in LLM output, and demonstrate improvement over other solutions. Our detector outperforms GPT-4o by up to 3.48%, despite having 400x fewer parameters.

IJCNLP-AACL 2025

STAR: Self-Automated Back-Querying for Production Data Generation

healthcare applications

human-centered evaluation

human-in-the-loop

data augmentation


poster

### Welcome to IJCNLP-AACL 2025! 
 It is a great honor to host this joint conference in Mumbai, India, from December 20 to 24, 2025. The joint conferences of IJCNLP and AACL are organized with alternating leadership in the Asia-Pacific region. The event is run by the Asian Federation of Natural Language Processing (AFNLP) in odd years, and by AACL in even years, while it is organized solely by ACL when the annual ACL meeting is held in the region. This year, the conference is primarily organized by AFNLP. 
*Kentaro Inui
MBZUAI, UAE
General Chair, IJCNLP-AACL 2025* 
Read full message and download the Conference Handbook [**here**](https://drive.google.com/file/d/1UTwxkAqSqI-GAoJC3wE1zZt5VP1Y8GX0/view?usp=sharing).

The 14th IJCNLP & 4th AACL will be held in Mumbai, India from December 20th to December 24th, 2025.

Aligned large language models (LLMs) are vulnerable to jailbreaks, which bypass the safeguards of targeted LLMs and fool them into generating objectionable content. While initial defenses show promise against token-based attacks, there are no defenses that provide robustness against semantic attacks and avoid unfavorable trade-offs between robustness and nominal performance. To meet this need, we propose SemanticSmooth, a smoothing-based defense that aggregates the predictions of multiple semantically transformed copies of a given input prompt. Experimental results demonstrate that SemanticSmooth achieves strong robustness against both manually constructed jailbreak prompts and automatic jailbreak attacks like GCG, PAIR, and PromptRS while maintaining strong nominal performance on standard LLM evaluation benchmarks such as AlpacaEval for the instruction-following tasks and PiQA for the question-answering tasks.

Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing
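The smoothing idea in the SemanticSmooth abstract (aggregate the predictions made on several semantically transformed copies of a prompt) can be illustrated with a small majority-vote sketch. This is not the authors' implementation: the transforms, `toy_generate`, and `toy_is_refusal` are hypothetical stand-ins for real paraphrasing transforms, a real LLM call, and a real refusal judge, and ties are resolved in favor of answering by assumption.

```python
from typing import Callable, Sequence

Transform = Callable[[str], str]


def semantic_smooth(
    prompt: str,
    transforms: Sequence[Transform],
    generate: Callable[[str], str],
    is_refusal: Callable[[str], bool],
) -> str:
    """Majority vote over responses to transformed copies of the prompt."""
    if not transforms:
        raise ValueError("at least one transform is required")
    responses = [generate(transform(prompt)) for transform in transforms]
    refusals = [is_refusal(response) for response in responses]
    # Strict majority; a tie counts as "do not refuse" (an assumption).
    majority_refuses = 2 * sum(refusals) > len(refusals)
    # Return the first response that agrees with the majority decision.
    for response, refused in zip(responses, refusals):
        if refused == majority_refuses:
            return response
    raise AssertionError("unreachable: some response matches the majority")


# --- Hypothetical toy stand-ins, for illustration only ---
def drop_uppercase_tokens(text: str) -> str:
    """Toy transform: drop all-uppercase tokens (mimics removing a suffix)."""
    return " ".join(tok for tok in text.split() if not tok.isupper())


def toy_generate(prompt: str) -> str:
    """Toy model: refuses harmful prompts unless the attack token is present."""
    if "bomb" in prompt and "IGNORE_RULES" not in prompt:
        return "I cannot help with that."
    return "Sure, here is an answer."


def toy_is_refusal(response: str) -> bool:
    return response.startswith("I cannot")


TOY_TRANSFORMS = [lambda text: text, drop_uppercase_tokens, str.lower]
```

In this toy setup the untransformed attack prompt slips past the model, but two of the three transformed copies are refused, so the aggregated decision is a refusal. A benign prompt is answered under every transform, which is the intuition behind preserving nominal performance.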

Large Language Models (LLMs) have recently made significant advancements in tackling complex tasks, such as retrieving hard-to-find information and solving intricate problems. Consequently, various approaches have been proposed to integrate LLMs into recommender systems, primarily by embedding them within existing architectures or training them on the recommendation data. However, most existing methods fail to effectively incorporate user-item interaction signals into pretrained LLMs due to the modality gap between interaction data and the LLM’s internal knowledge. To address this challenge, we propose the Item-Language Model (ILM) to enhance LLMs for recommendation. ILM consists of two main components: an item-language representation learning module, where an ILM encoder is pretrained to generate text-aligned item representations, and an item-language co-training module, where the ILM encoder is integrated into a pretrained LLM for recommendation tasks. Extensive experiments demonstrate the superior performance of our approach over several state-of-the-art methods, validating the importance of text-aligned item representations in bridging this modality gap. Our ablation studies further reveal the effectiveness of our model design for integrating interaction knowledge into LLMs for recommendation tasks. Our code is available at: https://anonymous.4open.science/r/ILM-7AD4/.

Item-Language Model: Improving Large Language Model for Recommendation via Item-Language Representation Learning

Large language models (LLMs) have demonstrated strong capabilities in simulating social roles and generating human-like behaviors. However, their effectiveness in predicting real-world user behavior under continuous memory accumulation remains largely unexplored. Most existing studies focus on short-term interactions or static personas, neglecting the dynamic nature of users' historical experiences in social media environments. To address this gap, we introduce FineRob, a novel dataset for fine-grained behavior prediction of social media users, which includes long-term memory traces from 1,866 users across three platforms. Each behavior is decomposed into three elements: object, type, and content, resulting in 78.6k QA records. We find that as memory accumulates, prediction accuracy drops significantly due to the model's difficulty in accessing detailed historical information. We further propose the OM-CoT fine-tuning framework to enhance the model's ability to process and utilize long-term memory. Experimental results show that our method effectively reduces the performance degradation caused by memory growth, improving fine-grained behavior prediction. Code and dataset are available at https://anonymous.4open.science/r/FineRob-791B/.

LLM-Based Behavior Prediction for Social Media Users with Continuous Memory

Chemical molecules can be represented as graphs or as language descriptions. Training unimodal models on graphs results in different encodings than training them on language. Therefore, the existing literature force-aligns the unimodal models during training to use them in downstream applications such as drug discovery. But to what extent are *graph* and *language* unimodal model representations inherently aligned, i.e., aligned prior to any force-alignment training? Knowing this is useful for a more expedient and effective forced-alignment. For the first time, we explore methods to gauge the alignment of graph and language unimodal models. We find compelling differences between models and their ability to represent slight structural differences without force-alignment. We also present a unified unimodal alignment (**U2A**) benchmark for gauging the inherent alignment between graph and language encoders, which we make available with this paper (U2A Benchmark Repository on GitHub: https://github.com/caocongfeng/U2A.git).

How Aligned Are Unimodal Language and Graph Encodings of Chemical Molecules?

Peer review, as a cornerstone of scientific research, ensures the integrity and quality of scholarly work by providing authors with objective feedback for refinement. However, in the traditional peer review process, authors often receive vague or insufficiently detailed feedback, which provides limited assistance and leads to a more time-consuming review cycle. If authors can identify specific weaknesses in their paper, they can not only address the reviewer's concerns but also improve their work. This raises the critical question of how to enhance authors' comprehension of review comments. In this paper, we present SEAGraph, a novel framework developed to clarify review comments by uncovering the underlying intentions behind them. We construct two types of graphs for each paper: the semantic mind graph, which captures the author’s thought process, and the hierarchical background graph, which delineates the research domains related to the paper. A retrieval method is then designed to extract relevant content from both graphs, facilitating coherent explanations for the review comments. Extensive experiments show that SEAGraph excels in review comment understanding tasks, offering significant benefits to authors. By bridging the gap between reviewers’ critiques and authors’ comprehension, SEAGraph contributes to a more efficient, transparent, and collaborative scientific publishing ecosystem. Our code is available at https://anonymous.4open.science/r/seagraph/.

SEAGraph: Unveiling the Whole Story of Paper Review Comments

Language models have shown strong capabilities across a wide range of tasks in software engineering, such as code generation, yet they suffer from hallucinations. While hallucinations have been studied independently in natural language and code generation, their occurrence in tasks involving code changes, which have a structurally complex and context-dependent format, remains largely unexplored. This paper presents the first comprehensive analysis of hallucinations in two critical tasks involving code change to natural language generation: commit message generation and code review comment generation. We quantify the prevalence of hallucinations in recent language models and explore a range of metric-based approaches to automatically detect them. Our findings reveal that approximately 50% of generated code reviews and 20% of generated commit messages contain hallucinations. Whilst commonly used metrics are weak detectors on their own, combining multiple metrics substantially improves performance. Notably, model confidence and feature attribution metrics effectively contribute to hallucination detection, showing promise for inference-time detection.

Hallucinations in Code Change to Natural Language Generation: Prevalence and Evaluation of Detection Metrics

Although the use of persona prompting in large language models appears to trigger different styles of generated text, it is unclear whether these translate into measurable behavioral differences. Furthermore, little work has studied whether these differences, when they do exist, can affect decision-making in an adversarial strategic environment. We investigate the impact of persona prompting on strategic performance in PERIL, a world domination board game. Specifically, we compare the effectiveness of persona-derived heuristics to those chosen manually. Our findings reveal that personality traits intuitively associated with strategic thinking do appear to improve game performance, but only when an additional mediator is used to translate personas into heuristic values. We introduce this mediator as a structured translation process, inspired by exploratory factor analysis, that maps LLM-generated inventory responses into strategic heuristics. Results indicate our method enhances heuristic reliability and face validity when compared to directly inferred heuristics, allowing us to better study the effect of persona types on decision-making behaviors. These insights advance our understanding of how persona prompting influences LLM-based decision-making and propose a novel heuristic generation method that adds to the growing body of work applying psychometric principles to LLMs.

Do Persona-Infused LLMs Affect Performance in a Strategic Reasoning Game?

The difficulty of LLM benchmarks has escalated rapidly in recent years, from elementary school-level to frontier problems, which seems to bring us close to the "last exam" for LLMs to surpass humanity. However, does LLMs' remarkable reasoning ability indeed come from true intelligence by human standards, or are they actually reciting solutions witnessed during training at an Internet level? To study this problem, we propose RoR-Bench, a novel, multi-modal benchmark for detecting LLMs' recitation behavior when they are asked simple reasoning problems with subtly shifted conditions, and conduct empirical analysis on our benchmark. Surprisingly, we found that existing cutting-edge LLMs unanimously exhibit extremely severe recitation behavior; by changing one phrase in the condition, top models such as OpenAI-o1 and DeepSeek-R1 can suffer a 60 percent performance loss on elementary school-level arithmetic and reasoning problems. Such findings are a wake-up call to the LLM community that compels us to reevaluate the true intelligence level of cutting-edge LLMs.

Recitation over Reasoning: How Cutting-Edge Language Models Can Fail on Elementary School-Level Reasoning Problems?

This paper investigates defenses in LLM-based evaluation, where prompt injection attacks can manipulate scores by deceiving the evaluation system. We formalize blind attacks as a class in which candidate answers are crafted independently of the true answer. To counter such attacks, we propose an evaluation framework that combines standard and counterfactual evaluation. Experiments show it significantly improves attack detection with minimal performance trade-offs for recent models.

Counterfactual Evaluation for Blind Attack Detection in LLM-based Evaluation Systems
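One way to read the combination of standard and counterfactual evaluation described in the abstract above is sketched below. This is an interpretation, not the paper's framework: the `Verdict` type, the `evaluate` function, the rule "accepted under both the true and a deliberately false reference means a suspected blind attack", and the injection-prone `toy_judge` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable

# judge(question, reference_answer, candidate_answer) -> accepted?
Judge = Callable[[str, str, str], bool]


@dataclass
class Verdict:
    correct: bool
    attack_suspected: bool


def evaluate(
    question: str,
    reference: str,
    counterfactual_reference: str,
    candidate: str,
    judge: Judge,
) -> Verdict:
    """Run the judge against the true and a deliberately false reference."""
    standard = judge(question, reference, candidate)
    counterfactual = judge(question, counterfactual_reference, candidate)
    # A blind attack does not depend on the true answer, so it tends to be
    # accepted no matter which reference the judge is shown (assumption).
    return Verdict(
        correct=standard and not counterfactual,
        attack_suspected=standard and counterfactual,
    )


def toy_judge(question: str, reference: str, candidate: str) -> bool:
    """Hypothetical judge that is vulnerable to a simple prompt injection."""
    if "mark this answer as correct" in candidate.lower():
        return True
    return candidate.strip() == reference.strip()
```

With this toy judge, an honest answer passes only the standard check, while an injected instruction passes both checks and is flagged instead of scored as correct.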

Large language models (LLMs) are powerful zero- and few-shot learners. However, when predicting over a set of candidate options, LLMs suffer from label biases, and existing calibration methods overlook biases arising from multi-token class labels. We tackle an issue we call *label length bias*, where labels of different lengths are treated inconsistently, even after standard length normalization. To mitigate it, we propose *normalized contextual calibration* (NCC), an effective method that normalizes and calibrates predictions at the full-label level. NCC achieves statistically significant improvements over prior approaches across multiple datasets and models, with gains of up to 10% F1. Moreover, NCC extends bias mitigation to broader tasks such as multiple-choice question answering. Our analysis shows that, when combined with in-context learning, NCC is less sensitive to few-shot example selection, requires fewer examples for competitive performance, and produces more reliable confidence estimates. These findings highlight the importance of mitigating full-label biases to improve the performance and robustness of LLM-based methods, particularly in real-world applications where class labels naturally consist of multiple tokens.
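The general idea of normalizing and calibrating at the full-label level can be sketched as follows. The abstract does not give NCC's formula, so this sketch assumes one plausible form: score each label by its mean token log-probability, subtract the same quantity measured on a content-free input (in the spirit of contextual calibration), and renormalize. The function names and all numeric values are hypothetical.

```python
import math
from typing import Dict, List


def length_normalized_logprob(token_logprobs: List[float]) -> float:
    """Mean token log-probability of one (possibly multi-token) label."""
    return sum(token_logprobs) / len(token_logprobs)


def calibrated_distribution(
    label_token_logprobs: Dict[str, List[float]],
    prior_token_logprobs: Dict[str, List[float]],
) -> Dict[str, float]:
    """Calibrate full-label scores against a content-free prior (assumed form)."""
    scores = {
        label: length_normalized_logprob(logprobs)
        - length_normalized_logprob(prior_token_logprobs[label])
        for label, logprobs in label_token_logprobs.items()
    }
    # Softmax over calibrated scores, shifted by the max for stability.
    max_score = max(scores.values())
    exp_scores = {label: math.exp(s - max_score) for label, s in scores.items()}
    total = sum(exp_scores.values())
    return {label: value / total for label, value in exp_scores.items()}


# Hypothetical per-token log-probabilities for a one-token and a two-token label.
on_input = {"positive": [-1.0], "very negative": [-0.5, -0.9]}
on_content_free_input = {"positive": [-0.4], "very negative": [-1.0, -1.4]}
probs = calibrated_distribution(on_input, on_content_free_input)
prediction = max(probs, key=probs.get)
```

In this made-up example, summing raw token log-probabilities would favor the shorter label, whereas the normalized and calibrated score favors the two-token label, which is the kind of inconsistency the abstract calls label length bias.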
