Thailand

While large language models (LLMs) have achieved impressive performance across diverse tasks, recent studies showcase that causal LLMs suffer from the “reversal curse”. It is a typical example that the model knows “A’s father is B”, but is unable to reason “B’s child is A”. This limitation poses a challenge to the advancement of artificial general intelligence (AGI), as it suggests a gap in the models’ ability to comprehend and apply bidirectional reasoning. In this paper, we first conduct substantial evaluation and identify that the root cause of the reversal curse lies in the different word order between the training and inference stage, namely, the poor ability of causal language models to predict antecedent words within the training data. Accordingly, permutation on the training data is considered as a potential solution, since this can make the model predict antecedent words or tokens. However, previous permutation methods may disrupt complete phrases or entities, thereby posing challenges for the model to comprehend and learn from training data. To address this issue, we propose Semantic-aware Permutation Training (SPT), which addresses this issue by segmenting the training sentences into semantic units (i.e., entities or phrases) with an assistant language model and permuting these units before feeding into the model. Extensive experiments demonstrate that SPT effectively mitigates the reversal curse since the performance on reversed questions approximates that on the forward ones, and significantly advances the performance of existing works.

ACL 2024

Mitigating Reversal Curse in Large Language Models via Semantic-aware Permutation Training

reversal curse

large language models

poster

### Welcome!
The 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024) will take place in Bangkok, Thailand from August 11th to 16th, 2024. Our Virtual Poster Sessions will take place online Thursday, August 22, 2024.

You are required to register for this event. **Please register [here](https://2024.aclweb.org/registration). **

If you have already registered, please check your inbox for an email from Underline granting you access to ACL 2024 content.

Please register!

The 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024) will take place in Bangkok, Thailand from August 11th to 16th, 2024. More information will be announced soon.

This paper investigates the inherent knowledge in language models from the perspective of epistemological holism. The purpose of this paper is to explore whether LLMs exhibit characteristics consistent with epistemological holism. These characteristics suggest that core knowledge, such as commonsense, general, and specific knowledge, each plays a specific role, serving as the foundation of our knowledge system and being difficult to revise. To assess these traits related to holism, we created a scientific reasoning dataset and examined the epistemology of language models through three tasks: Abduction, Revision, and Argument Generation. In the abduction task, the language models explained situations while avoiding revising the core knowledge. However, in other tasks, the language models were revealed not to distinguish between core and peripheral knowledge, showing an incomplete alignment with holistic knowledge principles.

Epistemology of Language Models: Do Language Models Have Holistic Knowledge?

Large language models (LLMs) excel in abstractive summarization tasks, delivering fluent and pertinent summaries. Recent advancements have extended their capabilities to handle long-input contexts, exceeding 100k tokens. However, in question answering, language models exhibit uneven utilization of their input context. They tend to favor the initial and final segments, resulting in a U-shaped performance pattern concerning where the answer is located within the input. This bias raises concerns, particularly in summarization where crucial content may be dispersed throughout the source document(s). Besides, in summarization, mapping facts from the source to the summary is not trivial as salient content is usually re-phrased. In this paper, we conduct the first comprehensive study on context utilization and position bias in summarization. Our analysis encompasses 6 LLMs, 10 datasets, and 5 evaluation metrics. We introduce a new evaluation benchmark called MiddleSum on the which we benchmark two alternative inference methods to alleviate position bias: hierarchical summarization and incremental summarization. Our code and data can be found here: https://github.com/ntunlp/MiddleSum.

On Context Utilization in Summarization with Large Language Models

Reference summaries for abstractive speech summarization require human annotation, which can be performed by listening to an audio recording or by reading textual transcripts of the recording. In this paper, we examine whether summaries based on annotators listening to the recordings differ from those based on annotators reading transcripts. Using existing intrinsic evaluation based on human evaluation, automatic metrics, LLM-based evaluation, and a retrieval-based reference-free method, we find that summaries are indeed different based on the source modality, and that speech-based summaries are more factually consistent and information-selective than transcript-based summaries. Transcript-based summaries are impacted by recognition errors in the source, and expert-written summaries are more informative and reliable. We make all the collected data and analysis code public to facilitate the reproduction of our work and advance research in this area.

Speech vs. Transcript: Does It Matter for Human Annotators in Speech Summarization?

Minimum Bayes Risk (MBR) decoding is a text generation technique that has been shown to improve the quality of machine translations, but is expensive, even if a sampling-based approximation is used. Besides requiring a large number of sampled sequences, it requires the pairwise calculation of a utility metric, which has quadratic complexity. In this paper, we propose to approximate pairwise metric scores with scores calculated against aggregated reference representations. This changes the complexity of utility estimation from $O(n^2)$ to $O(n)$, while empirically preserving most of the quality gains of MBR decoding. We release our source code.

Linear-time Minimum Bayes Risk Decoding with Reference Aggregation

Story detection in online communities is a challenging task as stories are scattered across communities and interwoven with non-storytelling spans within a single text. We address this challenge by building and releasing the StorySeeker toolkit, including a richly annotated dataset of 502 Reddit posts and comments, a detailed codebook adapted to the social media context, and models to predict storytelling at the document and span levels. Our dataset is sampled from hundreds of popular English-language Reddit communities ranging across 33 topic categories, and it contains fine-grained expert annotations, including binary story labels, story spans, and event spans. We evaluate a range of detection methods using our data, and we identify the distinctive textual features of online storytelling, focusing on storytelling spans, which we introduce as a new task. We illuminate distributional characteristics of storytelling on a large community-centric social media platform, and we also conduct a case study on r/ChangeMyView, where storytelling is used as one of many persuasive strategies, illustrating that our data and models can be used for both inter- and intra-community research. Finally, we discuss implications of our tools and analyses for narratology and the study of online communities.

Where Do People Tell Stories Online? Story Detection Across Online Communities

Dialogue summarization is an important task that requires to generate highlights for a conversation from different aspects (e.g., content of various speakers). While several studies successfully employ large language models (LLMs) and achieve satisfying results, they are limited by using one model at a time or treat it as a black box, which makes it hard to discriminatively learn essential content in a dialogue from different aspects, therefore may lead to anticipation bias and potential loss of information in the produced summaries. In this paper, we propose an LLM-based approach with role-oriented routing and fusion generation to utilize mixture of experts (MoE) for dialogue summarization. Specifically, the role-oriented routing is an LLM-based module that selects appropriate experts to process different information; fusion generation is another LLM-based module to locate salient information and produce finalized dialogue summaries. The proposed approach offers an alternative solution to employing multiple LLMs for dialogue summarization by leveraging their capabilities of in-context processing and generation in an effective manner. We run experiments on widely used benchmark datasets for this task, where the results demonstrate the superiority of our approach in producing informative and accurate dialogue summarization.

Dialogue Summarization with Mixture of Experts based on Large Language Models

Low-resource languages (LRLs) face challenges in supervised neural machine translation (NMT) due to limited parallel data, prompting research in unsupervised NMT.
Unsupervised NMT (UNMT), without requiring ground truth, provides solutions for LRL translations using synthetic pseudo-parallel data and parallel data from auxiliary language pairs. However, they usually encounter translation errors, including errors from synthetic data and from auxiliary language pairs with linguistic biases.
We argue that large language models (LLMs) mitigate UNMT's translation errors by dynamically organizing auxiliary languages in prompts to improve LRL translations. In this paper, we propose $\textbf{P}$r$\textbf{O}$bability-driven $\textbf{M}$eta-graph $\textbf{P}$rompter (POMP), an approach employing a dynamic graph to organize multiple auxiliary languages, to prompt LLMs in LRL translations. 
POMP proposes a language-specific meta-graph that dynamically samples multiple translation paths to organize auxiliary languages in constructing prompts. Following the path, POMP prompts LLMs to translate with a mixture of auxiliary languages. We achieve the meta-graph's evolution by back-propagating evaluation scores to update probabilities on the graph.
Our experimental improvements show POMP's effectiveness on LRLs' translation.

POMP: Probability-driven Meta-graph Prompter for LLMs in Low-resource Unsupervised Neural Machine Translation

Large Language Models (LLMs) have shown strong generalization abilities to excel in various tasks, including emotion support conversations. However, deploying such LLMs like GPT-3 (175B parameters) is resource-intensive and challenging at scale. In this study, we utilize LLMs as "Counseling Teacher" to enhance smaller models' emotion support response abilities, significantly reducing the necessity of scaling up model size. To this end, we first introduce an iterative expansion framework, aiming to prompt the large teacher model to curate an expansive emotion support dialogue dataset. This curated dataset, termed ExTES, encompasses a broad spectrum of scenarios and is crafted with meticulous strategies to ensure its quality and comprehensiveness. Based on this, we then devise a Diverse Response Inpainting (DRI) mechanism to harness the teacher model to produce multiple diverse responses by filling in the masked conversation context. This richness and variety serve as instructive examples, providing a robust foundation for fine-tuning smaller student models. Experiments across varied scenarios reveal that the teacher-student scheme with DRI notably improves the response abilities of smaller models, even outperforming the teacher model in some cases. The dataset and codes are available in https://github.com/pandazzh2020/ExTES.

Self-chats from Large Language Models Make Small Emotional Support Chatbot Better

This work studies two types of ambiguity in natural language: prepositional phrase (PP) attachment ambiguity, and garden path constructions. Due to the different nature of these ambiguities -- one being structural, the other incremental in nature -- we pretrain and evaluate the Tree Transformer of Wang et al. (2019), an unsupervised Transformer model that induces tree representations internally. 
To assess PP attachment ambiguity we inspect the model's induced parse trees against a newly prepared dataset derived from the PP attachment corpus (Ratnaparkhi et al., 1994). Measuring garden path effects is done by considering surprisal rates of the underlying language model on a number of dedicated test suites, following Futrell et al. (2019). For comparison we evaluate a pretrained supervised BiLSTM-based model trained on constituency parsing as sequence labelling (Gómez-Rodríguez and Vilares, 2018). Results show that the unsupervised Tree Transformer does exhibit garden path effects, but its parsing ability is far inferior to the supervised BiLSTM, and it is not as sensitive to lexical cues as other large LSTM models, suggesting that supervised parsers based on a pre-Transformer architecture may be the better choice in the presence of ambiguity.

Tree Transformer’s Disambiguation Ability of Prepositional Phrase Attachment and Garden Path Effects

Journalists regularly make decisions on whether or not to report stories, based on "news values". In this work, we wish to explicitly model these decisions to explore _when_ and _why_ certain stories get press attention. This is challenging because very few labelled links between source documents and news articles exist and language use between corpora is very different. We address this problem by implementing a novel _probabilistic relational modeling_ framework, which we show is a low-annotation linking methodology that outperforms other, more state-of-the-art retrieval-based baselines. Next, we define a new task: __newsworthiness prediction__, to predict if a policy item will get covered. We focus on news coverage of local public policy in the San Francisco Bay Area by the _San Francisco Chronicle_. We gather 15k policies discussed across 10 years of public policy meetings, and transcribe over 3,200 hours of public discussion. In general, we find limited impact of public discussion on newsworthiness prediction accuracy, suggesting that some of the most important stories barely get discussed in public.
Finally, we show that newsworthiness predictions can be a useful assistive tool for journalists seeking to keep abreast of local government. We perform human evaluation with expert journalists and show our systems identify policies they consider newsworthy with 68\% F1 and our coverage recommendations are helpful with an 84\% win-rate against baseline. We release all code and data to our work here: https://github.com/alex2awesome/newsworthiness-public.

Premium content

Mitigating Reversal Curse in Large Language Models via Semantic-aware Permutation Training

Downloads

Next from ACL 2024

Epistemology of Language Models: Do Language Models Have Holistic Knowledge?

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES