VIDEO DOI: https://doi.org/10.48448/kvys-5n51

workshop paper

ACL 2024

August 15, 2024

Bangkok, Thailand

Long Unit Word Tokenization and Bunsetsu Segmentation of Historical Japanese

keywords: LUW, bunsetsu, historical Japanese, chunking, tokenization

In Japanese, the natural minimal phrase of a sentence is the "bunsetsu": for native speakers it, rather than the word, serves as the natural boundary within a sentence, so grammatical analysis in Japanese linguistics commonly operates on bunsetsu units. At the same time, because Japanese has no delimiters between words, there are two major word definitions: Short Unit Words (SUWs) and Long Unit Words (LUWs). Although an SUW dictionary is available, an LUW dictionary is not. This study therefore provides a deep learning-based (or LLM-based) bunsetsu and Long Unit Word analyzer for the Heian period (AD 794-1185) and evaluates its performance. We model the parser as a transformer-based joint sequence labeling model that combines the bunsetsu BI tag, the LUW BI tag, and the LUW part-of-speech (POS) tag for each SUW token. We train our models on corpora from each period, including contemporary and historical Japanese. F1 scores range from 0.976 to 0.996 for both bunsetsu and LUW reconstruction, indicating that our models achieve performance comparable to models for a contemporary Japanese corpus. Statistical analysis and a diachronic case study suggest that bunsetsu estimation may be influenced by the grammaticalization of morphemes.
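The joint labeling scheme described above can be sketched in a few lines of Python. This is a toy illustration under our own assumptions, not the paper's implementation: we assume each SUW token's joint label concatenates the bunsetsu BI tag, the LUW BI tag, and the LUW POS tag with a `|` separator, and that spans are recovered by grouping tokens from a `B` tag up to the next `B`.

```python
# Hypothetical sketch of the joint-label scheme from the abstract.
# Each SUW token carries "bunsetsu-BI|LUW-BI|LUW-POS"; the exact label
# encoding here is our assumption for illustration, not the paper's format.

def split_joint_label(label):
    """Split a joint label like 'B|B|NOUN' into its three components."""
    bunsetsu_tag, luw_tag, luw_pos = label.split("|")
    return bunsetsu_tag, luw_tag, luw_pos

def decode_spans(tags):
    """Group token indices into spans from a BI tag sequence:
    a 'B' starts a new span, an 'I' continues the current one."""
    spans = []
    for i, tag in enumerate(tags):
        if tag == "B" or not spans:
            spans.append([i])
        else:
            spans[-1].append(i)
    return [tuple(s) for s in spans]

# Toy example: 4 SUW tokens forming 2 bunsetsu and 3 LUWs.
joint = ["B|B|NOUN", "I|B|PARTICLE", "B|B|VERB", "I|I|VERB"]

bunsetsu_tags = [split_joint_label(lbl)[0] for lbl in joint]
luw_tags = [split_joint_label(lbl)[1] for lbl in joint]

print(decode_spans(bunsetsu_tags))  # [(0, 1), (2, 3)]
print(decode_spans(luw_tags))       # [(0,), (1,), (2, 3)]
```

Decoding both span layers from one label sequence per token is what makes the model "joint": one classifier head predicts a product label, and bunsetsu and LUW segmentations are read off its two BI components independently.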

