eScholarship
Open Access Publications from the University of California

Automatic Estimation of Lexical Concreteness in 77 Languages

Abstract

We estimate lexical Concreteness for millions of words across 77 languages. Using a simple regression framework, we combine vector-based models of lexical semantics with experimental norms of Concreteness in English and Dutch. By applying techniques to align vector-based semantics across distinct languages, we compute and release Concreteness estimates at scale in numerous languages for which experimental norms are not currently available. This paper lays out the technique and its efficacy. Although this is a difficult dataset to evaluate immediately, Concreteness estimates computed from English correlate with Dutch experimental norms at ρ = .75 in the vocabulary at large, increasing to ρ = .8 among Nouns. Our predictions also recapitulate attested relationships with word frequency. The approach we describe can be readily applied to numerous lexical measures beyond Concreteness.
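
The pipeline summarized in the abstract (fit a regression from word vectors to norms in one language, align a second language's vectors into the same space, then predict and evaluate by rank correlation) can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the paper's implementation: the abstract does not name the regression or alignment method, so ridge regression and orthogonal Procrustes alignment are assumptions here, and all names (`fit_ridge`, `procrustes_align`, `spearman`, the toy arrays) are hypothetical.

```python
import numpy as np

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept (assumed model)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    A = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    w = np.linalg.solve(A, Xc.T @ (y - y_mean))
    b = y_mean - x_mean @ w
    return w, b

def procrustes_align(T, S):
    """Orthogonal R minimizing ||T @ R - S||_F (assumed alignment method)."""
    U, _, Vt = np.linalg.svd(T.T @ S)
    return U @ Vt

def spearman(a, b):
    """Spearman rank correlation; no tie handling, adequate for continuous toy data."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

rng = np.random.default_rng(0)
dim = 20
true_w = rng.normal(size=dim)  # synthetic "concreteness direction"

# Source language: word vectors plus (synthetic) experimental norms.
X_src = rng.normal(size=(500, dim))
y_src = X_src @ true_w + rng.normal(scale=0.5, size=500)

# Target language: same latent semantics, observed in a rotated vector space.
Q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))  # random orthogonal matrix
Z_tgt = rng.normal(size=(200, dim))
X_tgt = Z_tgt @ Q
y_tgt = Z_tgt @ true_w + rng.normal(scale=0.5, size=200)  # held-out norms

# Seed dictionary: source words with known target-language translations.
S_seed = X_src[:100]
T_seed = S_seed @ Q

R = procrustes_align(T_seed, S_seed)  # map target space into source space
w, b = fit_ridge(X_src, y_src)        # learn norms from source vectors
est = (X_tgt @ R) @ w + b             # estimates for target-language words
rho = spearman(est, y_tgt)            # evaluate against held-out norms
print(f"Spearman rho on held-out target norms: {rho:.2f}")
```

With real data, `X_src` and `X_tgt` would come from pretrained word vectors and `y_src` from published norms; the synthetic setup only shows how the three steps fit together.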
