This dissertation explored whether unnecessary linguistic complexity (LC) in mathematics and biology assessment items changes the direction and significance of differential item functioning (DIF) between subgroups of emergent bilinguals (EBs) and English proficient students (EPs). Because of inconsistencies in how LC in items has been measured, Study One adapted a rubric for counting instances of specific grammatical features in items and introduced a method for evaluating lexical features in items. Four raters were asked to count instances of five grammatical features in assessment items and determine whether each feature contained construct-relevant vocabulary. The items were drawn from four content assessments administered to Massachusetts high school students: two biology assessments and two mathematics assessments. These counts of grammatical and lexical features were modeled in factor analyses to evaluate the multidimensionality of LC and the fit of multidimensional LC models. Although raters did not consistently count construct-irrelevant grammatical features, multidimensional models of LC fit acceptably well. Factor scores for lexical complexity, relative clauses, and complex noun phrases, obtained from the measurement models created in Study One, were used in Study Two. In Study Two, Rasch hierarchical generalized linear models (HGLMs) were created to evaluate DIF between different subgroups of EBs and EPs on a biology assessment and a mathematics assessment, because LC included as an item covariate may predict item responses differently by comparison group. Seven comparison groups were evaluated across two assessments (mathematics and biology): EPs versus EBs, EPs versus short-term EBs, EPs versus long-term EBs, short-term EBs versus long-term EBs, EPs versus Spanish-speaking EBs, EPs versus non-Spanish-speaking EBs, and non-Spanish-speaking EBs versus Spanish-speaking EBs (reference group versus focal group, respectively).
For each comparison group, at least five models were created: a comparison model that included all participants in the comparison group and accounted only for the main effect of focal group status, a “base model” that evaluated DIF for the comparison group with no LC item covariates, a model including lexical complexity as an item covariate (“LEX predictor”), a model including complex noun phrases as an item covariate (“NP predictor”), and a model including relative clauses as an item covariate (“RC predictor”). If LC predictor models improved model fit, models with multiple LC predictors were also created.
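As a rough illustration of the model sequence described above, one possible parameterization of a single LC predictor model is sketched below. The notation is assumed here for illustration and is not taken from the dissertation: \theta_j is the ability of student j, b_i is the difficulty of item i (treated as random so that the covariate effect is estimable), G_j indicates focal group membership, x_i is the LC factor score of item i, and \gamma_i is the item-specific DIF effect. Identification constraints are omitted.

```latex
% Illustrative notation only; not the dissertation's own specification.
\[
\mathrm{logit}\, P(Y_{ij} = 1)
  = \theta_j - b_i + \delta\, G_j + \lambda\, x_i + \omega\, x_i G_j + \gamma_i G_j
\]
```

Under this sketch, the comparison model would correspond to \lambda = \omega = 0 with every \gamma_i = 0, the base model to \lambda = \omega = 0 with \gamma_i free, and an LC predictor model to all terms free, so that \lambda is the main effect of the item covariate and \omega is its interaction with focal group status.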
For the EP versus EB comparison groups on the mathematics assessment, only the NP predictor model improved model fit, while the LEX, NP, and RC predictor models all improved model fit for the EB versus EB comparison groups; a model with all LC predictors also improved model fit for the EB versus EB comparison groups. For the biology assessment, the LEX, NP, and RC predictor models improved model fit for all comparison groups, as did a model with all LC predictors. The main effects of the item covariates (LC factor scores) and their interactions with focal group status were evaluated, as was the number of items within a comparison group that changed in DIF significance or direction when an LC predictor was included. All LC predictors had consistent main effects across comparison groups. For the mathematics assessment, items with higher complex noun phrases factor scores were consistently more difficult for all comparison groups (NP predictor model), and items with higher lexical complexity (LEX predictor model, all predictors model) or relative clauses factor scores (RC predictor model, all predictors model) were consistently more difficult for all EB versus EB comparison groups. For the biology assessment and all comparison groups, items with higher lexical complexity (LEX predictor model, all predictors model) or complex noun phrases factor scores (NP predictor model, all predictors model) were consistently more difficult, and items with lower relative clauses factor scores (RC predictor model, all predictors model) were consistently more difficult, with one exception. In the all predictors models for the EB versus EB comparison groups, only relative clauses had a significant main effect.
There were some changes in the interactions between LC predictors and focal group status. For the mathematics assessment and EP versus EB comparison groups, complex noun phrases interactions favored EPs. For the mathematics assessment and EB versus EB comparison groups, the interactions in the single LC predictor models generally favored short-term EBs over long-term EBs and non-Spanish-speaking EBs over Spanish-speaking EBs, but when all LC predictors were included, no interactions between LC predictors and focal group status were significant. For the biology assessment and EP versus EB comparison groups, lexical complexity and complex noun phrases factor score interactions generally favored EPs, and relative clauses factor score interactions favored EBs and EB subgroups. For the biology assessment and EB versus EB comparison groups, no interactions between focal group status and LC predictors were significant in either the single LC predictor models or the all predictors models.
Changes in DIF significance and direction were compared between the base model and LC predictor models for all comparison groups. For the mathematics assessment and EP versus EB comparison groups, after conditioning on complex noun phrases, items with complex noun phrases generally exhibited significant DIF favoring EBs, regardless of whether the complex noun phrases factor scores were high (one standard deviation above the mean) or low (due to floor effects, the lowest complex noun phrases factor score). For the biology assessment, all items exhibited significant DIF favoring EBs after accounting for lexical complexity, most items exhibited non-significant DIF after accounting for complex noun phrases or relative clauses, and items were mixed between exhibiting non-significant DIF and significant DIF favoring EBs after accounting for all LC predictors. While items with high relative clauses factor scores exhibited non-significant DIF, some items with low relative clauses factor scores exhibited significant DIF favoring EPs after accounting for relative clauses. Items with two or more high factor scores exhibited non-significant DIF, but items with two or more low factor scores exhibited significant DIF favoring EBs after accounting for all LC predictors. These results were fairly consistent across the different EP versus EB comparison groups, although different items were flagged for DIF in the initial models that did not account for LC predictors. Items were less difficult for EBs than for EPs after accounting for LC features, which suggests that the abilities of EBs are underestimated due to LC in items, even when the items have low LC. Considering subgroup differences in these explanatory item response models (EIRMs), the key takeaway is that while different items are flagged as exhibiting significant DIF for different EP versus EB comparison groups when examining DIF with no LC predictors, there are few subgroup differences in which items change DIF significance or direction after accounting for LC predictors.