On the Effect of (Near) Duplicate Subwords in Language Modelling

Poster presentation at ACL 2024, August 12, 2024, Bangkok, Thailand

VIDEO DOI: https://doi.org/10.48448/b5gr-8j11

Keywords: duplicate, language modelling, tokenization, vocabulary

Tokenisation is a core part of language models (LMs). It involves splitting a character sequence into subwords, which are assigned arbitrary indices before being served to the LM. However, this process, while typically lossless, may lead to less efficient LM training, because it removes character-level information and thereby makes it more difficult to generalise across similar subwords, such as "now" and "Now". We refer to such subwords as near duplicates. In this paper, we study the impact of near duplicate subwords on LM training efficiency. First, we design an experiment that gives us an upper bound on how much we should expect a model to improve if we could perfectly generalise across near duplicates. We do this by duplicating each token in our LM's vocabulary, creating perfectly equivalent classes of subwords. Experimentally, we find that LMs need roughly 17% more data when trained in a fully duplicated setting. Second, we investigate the impact of naturally occurring near duplicates on LMs. Here, we see that deduplicating them considerably hurts LM performance, but that this loss in performance can be easily mitigated.
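To make the full-duplication setup described above concrete, the following is a minimal Python sketch of how one might duplicate a vocabulary and randomly route each token occurrence to one member of its two-element equivalence class. The function names and the twin-token marker are illustrative assumptions, not taken from the paper's released code.

```python
import random


def duplicate_vocab(vocab):
    """Give every subword a twin token, yielding perfectly equivalent
    two-element classes (e.g. 'now' and 'now<DUP>')."""
    dup_vocab = dict(vocab)  # original subword -> id
    offset = len(vocab)
    for subword, idx in vocab.items():
        dup_vocab[subword + "<DUP>"] = idx + offset  # hypothetical twin marker
    return dup_vocab


def duplicate_encoding(token_ids, vocab_size, p=0.5, seed=0):
    """Re-encode a token-id sequence so that each occurrence is randomly
    routed to either the original id or its duplicate (id + vocab_size).
    The LM must then learn that the two ids are interchangeable."""
    rng = random.Random(seed)
    return [t + vocab_size if rng.random() < p else t for t in token_ids]


# Toy usage: a 4-subword vocabulary and a short encoded sequence.
vocab = {"now": 0, "Now": 1, "the": 2, "cat": 3}
ids = [2, 3, 0, 1]
print(duplicate_vocab(vocab))
print(duplicate_encoding(ids, vocab_size=len(vocab)))
```

Comparing the same LM trained on the duplicated encoding versus the original one measures the cost of failing to generalise across perfectly equivalent subwords; in the paper's setting this amounts to roughly 17% more training data.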
