EMNLP 2025

November 05, 2025

Suzhou, China


Understanding the relationship between training data and model behavior during pretraining is crucial, but existing workflows make this process cumbersome, fragmented, and often inaccessible to researchers. We present TokenSmith, an open-source library for interactive editing, inspection, and analysis of datasets used in Megatron-style pretraining frameworks such as GPT-NeoX, Megatron, and NVIDIA NeMo. TokenSmith supports a wide range of operations, including searching, viewing, exporting, inspecting, and sampling data, all accessible through a simple user interface and a modular backend. It also enables structured editing of pretraining data without requiring changes to training code, which simplifies dataset debugging, validation, and experimentation. TokenSmith is designed as a plug-and-play addition to existing large language model pretraining workflows, making production-grade dataset tooling more widely accessible. TokenSmith is hosted on GitHub (https://github.com/aflah02/tokensmith), with accompanying documentation and tutorials (https://aflah02.github.io/tokensmith/). A demonstration video is also available on YouTube (https://www.youtube.com/watch?v=cDO8VE9fZvU).
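
To make the operations named in the abstract concrete, the following is a minimal sketch of searching, viewing, sampling, and editing a tokenized corpus. It is not TokenSmith's API: the class `ToyTokenDataset` and all of its methods are hypothetical names invented for illustration. The layout (one flat token buffer plus per-document offsets) is only assumed to loosely resemble the data-file and index-file split used by Megatron-style datasets.

```python
import random
from array import array


class ToyTokenDataset:
    """Hypothetical stand-in for a tokenized corpus (not TokenSmith's API)."""

    def __init__(self, docs):
        # Flat buffer of token ids, plus the start offset of every document.
        self.tokens = array("I")
        self.offsets = [0]
        for doc in docs:
            self.tokens.extend(doc)
            self.offsets.append(len(self.tokens))

    def __len__(self):
        # Number of documents.
        return len(self.offsets) - 1

    def view(self, i):
        # Return the token ids of document i as a plain list.
        return list(self.tokens[self.offsets[i]:self.offsets[i + 1]])

    def search(self, pattern):
        # Return (doc_index, position) for each occurrence of `pattern`
        # that lies entirely inside a single document.
        pattern = list(pattern)
        n = len(pattern)
        hits = []
        for i in range(len(self)):
            doc = self.view(i)
            for pos in range(len(doc) - n + 1):
                if doc[pos:pos + n] == pattern:
                    hits.append((i, pos))
        return hits

    def sample(self, k, seed=0):
        # Reproducibly pick k distinct document indices.
        rng = random.Random(seed)
        return sorted(rng.sample(range(len(self)), k))

    def replace_token(self, i, pos, new_token):
        # Edit the buffer in place; code that reads the buffer is unchanged.
        start, end = self.offsets[i], self.offsets[i + 1]
        if not 0 <= pos < end - start:
            raise IndexError("position outside document")
        self.tokens[start + pos] = new_token


ds = ToyTokenDataset([[5, 6, 7], [8, 6, 7, 9], [1, 2]])
print(ds.search([6, 7]))   # occurrences before the edit
ds.replace_token(1, 1, 42)
print(ds.view(1))          # document 1 after the edit
print(ds.search([6, 7]))   # occurrences after the edit
```

In this sketch the edit changes the stored tokens rather than the reader, which is the sense in which a consumer of the data would need no code changes. A real pretraining dataset would be memory-mapped from disk rather than held in a Python array.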


