
Language models (LMs) have demonstrated remarkable proficiency in generating linguistically coherent text, sparking discussions about their relevance to understanding human language learnability. However, a significant gap exists between the training data for these models and the linguistic input a child receives. LMs are typically trained on data that is orders of magnitude larger than, and fundamentally different from, child-directed speech (Warstadt & Bowman, 2022; Warstadt et al., 2023; Frank, 2023a). To address this discrepancy, our research focuses on training LMs on subsets of a single child’s linguistic input. Previously, Wang, Vong, Kim, and Lake (2023) found that LMs trained in this setting can form syntactic and semantic word clusters and develop sensitivity to certain linguistic phenomena, but they considered only LSTMs and simpler neural networks trained on just one dataset of speech directed to a single child. We build on this work by conducting systematic tests on five datasets, comprising single-child data, aggregated child data, and a web corpus, using six model architectures, including Transformers, to investigate whether earlier findings about what is learnable from single-child input hold across model architectures and datasets. We find that models trained on three single-child datasets show consistent results, underscoring the robustness of forming meaningful syntactic and semantic representations from a subset of the linguistic input specific to an individual child.
Authors:
Yulu Qin: New York University; Wentao Wang: New York University; Brenden Lake: New York University
