VIDEO DOI: https://doi.org/10.48448/2age-hg47

Poster

ACL 2024

August 14, 2024

Bangkok, Thailand

Layer-Condensed KV Cache for Efficient Inference of Large Language Models

Keywords: kv cache, efficient inference, transformers

Huge memory consumption has been a major bottleneck for deploying high-throughput large language models in real-world applications. In addition to the large number of parameters, the key-value (KV) cache for the attention mechanism in the transformer architecture consumes a significant amount of memory, especially when the number of layers is large for deep language models. In this paper, we propose a novel method that only computes and caches the KVs of a small number of layers, thus significantly saving memory consumption and improving inference throughput. Our experiments on large language models show that our method achieves up to 26× higher throughput than standard transformers and competitive performance in language modeling and downstream tasks. In addition, our method is orthogonal to existing transformer memory-saving techniques, so it is straightforward to integrate them with our model, achieving further improvement in inference efficiency. Our code is available at https://github.com/whyNLP/LCKV.
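To give a rough feel for why caching the KVs of only a small number of layers saves memory, the PyTorch sketch below lets every decoder layer draw its keys and values from a single shared representation, so only one layer's worth of KV tensors would need to be kept during decoding instead of one per layer. This is a simplified toy under that assumption, not the authors' implementation (see the linked repository for that); the names `SharedKVSelfAttention`, `CondensedKVDecoder`, and `kv_source` are hypothetical.

```python
# Toy sketch: all decoder layers attend to K/V projected from one shared
# hidden state, so a cache would only store a single layer's K/V tensors.
# Illustrative simplification, not the paper's actual method; causal masking
# and per-token incremental caching are omitted for brevity.
import math
import torch
import torch.nn as nn


class SharedKVSelfAttention(nn.Module):
    """Attention whose keys/values come from a shared hidden state."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.o_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor, kv_source: torch.Tensor) -> torch.Tensor:
        # x:         (batch, seq, d_model)  -- this layer's hidden states (queries)
        # kv_source: (batch, seq, d_model)  -- shared states providing keys/values
        q = self.q_proj(x)
        k = self.k_proj(kv_source)
        v = self.v_proj(kv_source)
        attn = torch.softmax(q @ k.transpose(-1, -2) / math.sqrt(q.size(-1)), dim=-1)
        return self.o_proj(attn @ v)


class CondensedKVDecoder(nn.Module):
    """Stack of layers that all share one K/V source, so the KV cache
    footprint is independent of the number of layers."""

    def __init__(self, d_model: int, n_layers: int):
        super().__init__()
        self.layers = nn.ModuleList(
            [SharedKVSelfAttention(d_model) for _ in range(n_layers)]
        )

    def forward(self, x: torch.Tensor, kv_source: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = x + layer(x, kv_source)  # residual connection
        return x


if __name__ == "__main__":
    batch, seq, d_model, n_layers = 2, 16, 64, 12
    x = torch.randn(batch, seq, d_model)
    # In a real decoder the shared KV source would be produced by the model
    # itself (e.g., a top layer's states); random states just show the shapes.
    kv_source = torch.randn(batch, seq, d_model)
    out = CondensedKVDecoder(d_model, n_layers)(x, kv_source)
    print(out.shape)  # torch.Size([2, 16, 64])
```

With 12 layers, a standard per-layer cache would hold 12 pairs of K/V tensors per token, whereas this shared-source variant holds one, which is the kind of reduction the abstract refers to.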

Downloads

Slides
Transcript English (automatic)

