India

Understanding the latent space geometry of large language models (LLMs) is key to interpreting their behavior and improving alignment. Yet it remains unclear to what extent LLMs linearly organize representations related to semantic understanding. To explore this, we conduct a large-scale empirical study of hidden representations in 11 autoregressive models across six scientific topics. We find that high-level semantic information consistently resides in low-dimensional subspaces that form linearly separable representations across domains. This separability becomes more pronounced in deeper layers and under prompts that elicit structured reasoning or alignment behavior—even when surface content remains unchanged. These findings motivate geometry-aware tools that operate directly in latent space to detect and mitigate harmful and adversarial content. As a proof of concept, we train an MLP probe on final-layer hidden states as a lightweight latent-space guardrail. This approach substantially improves refusal rates on malicious queries and prompt injections that bypass both the model&#39;s built-in safety alignment and external token-level filters.

IJCNLP-AACL 2025

Large Language Models Encode Semantics and Alignment in Linearly Separable Representations

large language models

probing

interpretability

Understanding the latent space geometry of large language models (LLMs) is key to interpreting their behavior and improving alignment. Yet it remains unclear to what extent LLMs linearly organize representations related to semantic understanding. To explore this, we conduct a large-scale empirical study of hidden representations in 11 autoregressive models across six scientific topics. We find that high-level semantic information consistently resides in low-dimensional subspaces that form linearly separable representations across domains. This separability becomes more pronounced in deeper layers and under prompts that elicit structured reasoning or alignment behavior—even when surface content remains unchanged. These findings motivate geometry-aware tools that operate directly in latent space to detect and mitigate harmful and adversarial content. As a proof of concept, we train an MLP probe on final-layer hidden states as a lightweight latent-space guardrail. This approach substantially improves refusal rates on malicious queries and prompt injections that bypass both the model's built-in safety alignment and external token-level filters.

poster

### Welcome to IJCNLP-AACL 2025! 
 It is a great honor to host this joint conference in Mumbai, India, from December 20 to 24, 2025. The joint conferences of IJCNLP and AACL are organized with alternating leadership in the Asia-Pacific region. The event is run by the Asian Federation of Natural Language Processing (AFNLP) in odd years, and by AACL in even years, while it is organized solely by ACL when the annual ACL meeting is held in the region. This year, the conference is primarily organized by AFNLP. 
*Kentaro Inui
MBZUAI, UAE
General Chair, IJCNLP-AACL 2025* 
Read full message and download the Conference Handbook [**here**](https://drive.google.com/file/d/1UTwxkAqSqI-GAoJC3wE1zZt5VP1Y8GX0/view?usp=sharing).

The 14th IJCNLP & 4th AACL will be held in Mumbai, India from December 20th to December 24th, 2025.

Clinical notes contain valuable, context-rich information, but their unstructured format introduces several challenges, including unintended biases (e.g., gender or racial bias), and poor generalization across clinical settings (e.g., models trained on one EHR system may perform poorly on another due to format differences) and poor interpretability. To address these issues, we present ClinStructor, a pipeline that leverages large language models (LLMs) to convert clinical free-text into structured, task-specific question–answer pairs prior to predictive modeling. Our method substantially enhances transparency and controllability and only leads to a modest reduction in predictive performance (a 2–3% drop in AUC), compared to direct fine-tuning, on the ICU mortality prediction task. ClinStructor lays a strong foundation for building reliable, interpretable, and generalizable machine learning models in clinical environments.

ClinStructor: AI-Powered Structuring of Unstructured Clinical Texts

Clustering customer chat data is vital for cloud providers handling multi-service queries. Traditional methods struggle with overlapping concerns and create broad, static clusters that degrade over time. Re-clustering disrupts continuity, making issue tracking difficult. We propose an adaptive system that segments multi-turn chats into service-specific concerns and incrementally refines clusters as new issues arise. Cluster quality is tracked via Davies–Bouldin Index (DBI) and Silhouette Scores, with LLM-based splitting applied only to degraded clusters. Our method improves Silhouette Scores by over 100% and reduces DBI by 65.6% compared to baselines, enabling scalable, real-time analytics without full re-clustering.

LLM-Guided Lifecycle-Aware Clustering of Multi-Turn Customer Support Conversations

The impact of random seeds in fine-tuning large language models (LLMs) has been largely overlooked despite its potential influence on model performance. In this study, we systematically evaluate the effects of random seeds on LLMs using the GLUE and SuperGLUE benchmarks. We analyze the macro impact through traditional metrics like accuracy and F1, calculating their mean and variance to quantify performance fluctuations. To capture the micro effects, we introduce a novel metric, consistency, measuring the stability of individual predictions across runs. Our experiments reveal significant variance at both macro and micro levels, underscoring the need for careful consideration of random seeds in fine-tuning and evaluation.

Assessing the Macro and Micro Effects of Random Seeds on Fine-Tuning Large Language Models

We present an automated pipeline for estimating Verb Frame Frequencies (VFFs), the frequency with which a verb appears in particular syntactic frames. VFFs provide a powerful window into syntax in both human and machine language systems, but existing tools for calculating them are limited in scale, accuracy, or accessibility. We use large language models (LLMs) to generate a corpus of sentences containing 476 English verbs. Next, by instructing an LLM to behave like an expert linguist, we had it analyze the syntactic structure of the sentences in this corpus. This pipeline outperforms two widely used syntactic parsers across multiple evaluation datasets. Furthermore, it requires far fewer resources than manual parsing (the gold-standard), thereby enabling rapid, scalable VFF estimation. Using the LLM parser, we produce a new VFF database with broader verb coverage, finer-grained syntactic distinctions, and explicit estimates of the relative frequencies of structural alternates commonly studied in psycholinguistics. The pipeline is easily customizable and extensible to new verbs, syntactic frames, and even other languages. We present this work as a proof of concept for automated frame frequency estimation, and release all code and data to support future research.

A Scalable Pipeline for Estimating Verb Frame Frequencies Using Large Language Models

Large language models (LLMs) often benefit from verbalized reasoning at inference time, but it remains unclear which aspects of task difficulty these extra reasoning tokens address. To investigate this question, we construct a controlled setting where task complexity can be precisely manipulated to study its effect on reasoning length. Deterministic finite automata (DFAs) offer a formalism through which we can characterize task complexity through measurable properties such as run length (number of reasoning steps required) and state-space size (decision complexity). We first show that across different tasks and models of different sizes and training paradigms, there exists an optimal amount of reasoning tokens such that the probability of producing a correct solution is maximized. We then investigate which properties of complexity govern this critical length: we find that task instances with longer corresponding underlying DFA runs (i.e. demand greater latent state-tracking requirements) correlate with longer reasoning lengths, but, surprisingly, that DFA size (i.e. state-space complexity) does not. We then demonstrate an implication of these findings: being able to predict the optimal number of reasoning tokens for new problems and filtering out non-optimal length answers results in consistent accuracy improvements.

Critical Thinking: Which Kinds of Complexity Govern Optimal Reasoning Length?

While multi-agent LLM systems show strong capabilities in various domains, they are highly vulnerable to adversarial and low-performing agents. To resolve this issue, in this paper, we introduce a general and adversary-resistant multi-agent LLM framework based on credibility scoring. We model the collaborative query-answering process as an iterative game, where the agents communicate and contribute to a final system output. Our system associates a credibility score that is used when aggregating the team outputs. The credibility scores are learned gradually based on the past contributions of each agent in query answering. Our experiments across multiple tasks and settings demonstrate our system’s effectiveness in mitigating adversarial influence and enhancing the resilience of multi-agent cooperation, even in the adversary-majority settings.

An Adversary-Resistant Multi-Agent LLM System via Credibility Scoring

The recent development of LLMs is remarkable, but they still struggle to handle cross-lingual homographs effectively. This research focuses on the cross-lingual homographs between Japanese and Chinese---the spellings of the words are the same, but their meanings differ entirely between the two languages. We introduce a new benchmark dataset named Doppelganger-JC to evaluate the ability of LLMs to handle them correctly. We provide three kinds of tasks for evaluation: word meaning tasks, word meaning in context tasks, and translation tasks. Through the evaluation, we found that LLMs' performance in understanding and using homographs is significantly inferior to that of humans. We pointed out the significant issue of homograph shortcut, which means that the model tends to preferentially interpret the cross-lingual homographs in its easy-to-understand language. We investigate the potential cause of this homograph shortcut from a linguistic perspective and pose that it is difficult for LLMs to recognize a word as a cross-lingual homograph, especially when it shares the same part-of-speech (POS) in both languages. The data and code is publicly available here: \url{https://github.com/0017-alt/Doppelganger-JC.git}.

Doppelganger-JC: Benchmarking the LLMs’ Understanding of Cross-Lingual Homographs between Japanese and Chinese

We investigate factors contributing to LLM agents’ success in competitive multi-agent environments, using auctions as a testbed where agents bid to maximize profit. The agents are equipped with bidding domain knowledge, distinct personas that reflect item preferences, and a memory of auction history. Our work extends the classic auction scenario by creating a realistic environment where multiple agents bid on houses, weighing aspects such as size, location, and budget to secure the most desirable homes at the lowest prices. Particularly, we investigate three key questions: (a) How does a persona influence an agent’s behavior in a competitive setting? (b) Can an agent effectively profile its competitors’ behavior during auctions? (c) How can persona profiling be leveraged to create an advantage using strategies such as theory of mind? Through a series of experiments, we analyze the behaviors of LLM agents and shed light on new findings. Our testbed, called HARBOR, offers a valuable platform for deepening the understanding of multi-agent workflows in competitive environments.

HARBOR: Exploring Persona Dynamics in Multi-Agent Competition

Large Language Models (LLMs) have recently shown remarkable advancement in various NLP tasks. As such, a popular trend has emerged lately where NLP researchers extract word/sentence/document embeddings from these large decoder-only models and use them for various inference tasks with promising results. However, it is still unclear whether the performance improvement of LLM-induced embeddings is merely because of scale or whether underlying embeddings they produce significantly differ from classical encoding models like Word2Vec, GloVe, Sentence-BERT (SBERT) or Universal Sentence Encoder (USE). This is the central question we investigate in the paper by systematically comparing classical decontextualized and contextualized word embeddings with the same for LLM-induced embeddings. Our results show that LLMs cluster semantically related words more tightly and perform better on analogy tasks in decontextualized settings. However, in contextualized settings, classical models like SimCSE often outperform LLMs in sentence-level similarity assessment tasks, highlighting their continued relevance for fine-grained semantics.

Revisiting Word Embeddings in the LLM Era

Pretrained transformer-encoder models like DeBERTaV3 and ModernBERT introduce architectural advancements aimed at improving efficiency and performance. Although the authors of ModernBERT report improved performance over DeBERTaV3 on several benchmarks, the lack of disclosed training data and the absence of comparisons using a shared dataset make it difficult to determine whether these gains are due to architectural improvements or differences in training data. In this work, we conduct a controlled study by pretraining ModernBERT on the same dataset as CamemBERTaV2, a DeBERTaV3 French model, isolating the effect of model design. Our results show that the previous model generation remains superior in sample efficiency and overall benchmark performance, with ModernBERT’s primary advantage being its support for long context, faster training, and inference speed. However, the new proposed model still provides meaningful architectural improvements compared to earlier models such as BERT and RoBERTa. Additionally, we observe that high-quality pre-training data accelerates convergence but does not significantly improve final performance, suggesting potential benchmark saturation. These findings show the importance of disentangling pretraining data from architectural innovations when evaluating transformer models.

Downloads

Next from IJCNLP-AACL 2025

ClinStructor: AI-Powered Structuring of Unstructured Clinical Texts

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES