eScholarship
Open Access Publications from the University of California

UC San Diego

UC San Diego Electronic Theses and Dissertations

A Linguistic Approach to Crosslingual and Multilingual NLP

Abstract

Language models work well in English, but in almost every other language they work much worse. In this dissertation, I use theories and methods from linguistics and psycholinguistics to advance our understanding of how language models work for different languages and how they behave in multilingual settings. Because languages differ greatly in how they encode information, many researchers have asked to what extent those crosslinguistic differences affect language model performance. I investigate the role of training data size and tokenizers in those differences, and I find that crosslinguistic differences which have been described in terms of typological features can instead be attributed to differences in effective dataset size.

In multilingual settings, language models may use some of the same representations to encode information for multiple languages. This allows for efficient use of the models' parameters, while also helping the models generalize across languages. I use a psycholinguistic experimental paradigm, crosslinguistic structural priming, to probe these shared representations and to characterize how and when models learn them. These results also contribute to our understanding of how bilingual people use shared representations to store information about multiple languages.
