Crowd-Sourcing Human Ratings of Linguistic Production
Abstract
This study examines the reliability and validity of two types of crowd-sourced judgments for collecting lexical diversity scores. Scaled and pairwise-comparison approaches were used to collect data from non-expert Amazon Mechanical Turk workers. The reliability of the lexical diversity ratings from the crowd-sourced raters was assessed alongside that of trained raters using a variety of reliability statistics. The validity of the ratings was examined by 1) comparing crowd-sourced and trained ratings, 2) comparing crowd-sourced and trained ratings to ratings of language proficiency, and 3) using an objective measure of lexical diversity to predict the crowd-sourced and trained ratings. The results indicate that the scaled crowd-sourced ratings showed strong reliability in terms of text and rater strata and produced fewer misfitting texts than the trained raters. The scaled crowd-sourced ratings were also strongly predicted by lexical diversity features derived from the texts themselves.
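The abstract does not name the objective lexical diversity measure used in the study, so the following Python sketch is purely illustrative: it computes one common index, the moving-average type-token ratio (MATTR), as an example of the kind of text-derived feature that could be used to predict human ratings.

    # Illustrative only: MATTR is one common lexical diversity index;
    # the specific measure used in the study is not stated in the abstract.
    def mattr(tokens, window=50):
        """Moving-average type-token ratio over fixed-size windows."""
        if len(tokens) < window:
            # Fall back to a plain type-token ratio for short texts.
            return len(set(tokens)) / len(tokens)
        ratios = [
            len(set(tokens[i:i + window])) / window
            for i in range(len(tokens) - window + 1)
        ]
        return sum(ratios) / len(ratios)

    sample = "the quick brown fox jumps over the lazy dog and the quick cat"
    print(mattr(sample.lower().split(), window=5))

Window-based indices such as MATTR are often preferred over the raw type-token ratio because they are less sensitive to text length.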