EMNLP 2025

November 07, 2025

Suzhou, China


Automatic language identification is frequently framed as a multi-class classification problem. However, when creating digital corpora for less commonly written languages, it may be more appropriate to treat it as a data mining problem. For these languages, one knows ahead of time that the vast majority of documents are of little interest. By minimizing the resources spent on classifying such documents, we can create corpora for rare languages much faster and with better coverage than with established pipelines. To demonstrate the effectiveness of the language mining perspective, we introduce a new pipeline and corpora for several French-based Creoles.
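
The idea of spending as little effort as possible on uninteresting documents can be illustrated with a two-stage cascade. The sketch below is only an illustration under stated assumptions, not the pipeline introduced in the paper: the marker-word list, the `min_hits` threshold, and the function names `cheap_prefilter` and `mine_documents` are all hypothetical, and the costly classifier is passed in as an arbitrary callable.

```python
from typing import Callable, Iterable, Iterator

# Hypothetical marker words, chosen for illustration only; not taken from the paper.
MARKER_WORDS = frozenset({"mwen", "nou", "yo", "pou", "ak"})


def cheap_prefilter(text: str, min_hits: int = 2) -> bool:
    """Return True if the text contains at least `min_hits` distinct marker words."""
    tokens = set(text.lower().split())
    return len(tokens & MARKER_WORDS) >= min_hits


def mine_documents(
    docs: Iterable[str],
    expensive_classifier: Callable[[str], bool],
) -> Iterator[str]:
    """Yield documents accepted by both the cheap filter and the costly classifier."""
    for doc in docs:
        if not cheap_prefilter(doc):
            continue  # rejected early; the costly classifier never runs on this document
        if expensive_classifier(doc):
            yield doc
```

Because most documents are expected to fail the cheap check, the costly classifier runs only on a small remainder. A real system would need a prefilter with very high recall, since anything rejected at the first stage is lost for good.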


