With the advancement of data collection through internet platforms, biomedical tests, and genetic sequencing, individualized and heterogeneous datasets have become prevalent in scientific discovery. However, traditional homogeneous or population-level statistical and machine learning tools are inadequate for analyzing such complex datasets. In this thesis, we propose novel statistical and machine learning models that incorporate heterogeneous information, and we apply these methods to crowdsourcing, individualized treatment rules, and effective data integration.
In the first topic of this thesis, we propose a new statistical model to incorporate heterogeneity in crowdsourcing problems. In the past two decades, crowdsourcing has emerged as a solution for collecting large-scale labels and has been impactful in many data collection efforts, such as ImageNet. However, aggregating crowdsourced labels can be challenging due to the heterogeneity of task difficulty and worker ability. We propose a two-stage model to predict the true labels of multicategory classification tasks in crowdsourcing. In the first stage, we fit the observed labels with a latent factor model with subgrouping structures for both tasks and workers to accommodate their heterogeneity. Group-specific rotations are introduced to align worker subgroups with different task categories, enabling the model to handle multicategory crowdsourcing tasks. In the second stage, we propose a concordance-based approach to identify high-quality worker subgroups, whose labels are relied upon to assign labels to tasks. In theory, we establish the estimation consistency of the latent factors and the prediction consistency of the proposed method. Simulation studies and a real data application show that the proposed method outperforms existing competing methods.
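As a schematic illustration of the first-stage model, one may think of a latent factor formulation in which worker $j$'s label for task $i$ is driven by an inner product of task and worker factors, with a group-specific rotation aligning worker subgroups with the task categories; the symbols below ($u_i$, $v_j$, $g(j)$, $R^{(k)}_{g(j)}$) are illustrative notation rather than the exact specification used in the thesis:
\[
\Pr\big(Y_{ij} = k\big) \;\propto\; \exp\!\big\{\langle R^{(k)}_{g(j)} u_i,\; v_j \rangle\big\}, \qquad k = 1, \dots, K,
\]
where $u_i$ and $v_j$ denote the latent factors of task $i$ and worker $j$, $g(j)$ indexes the subgroup of worker $j$, and $R^{(k)}_{g(j)}$ is the group-specific rotation associated with category $k$. The second stage then aggregates only the labels from worker subgroups whose estimated concordance with the tasks is high.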
In the second topic, we propose a novel statistical model to learn individualized treatment rules with combination treatments. The individualized treatment rule (ITR), which recommends an optimal treatment based on individual characteristics, has drawn considerable interest in areas such as precision medicine, personalized education, and personalized marketing. Existing ITR estimation methods mainly recommend a single treatment chosen from two or several candidates. However, a combination of multiple treatments can be more effective in many applications. We propose a novel Double Encoder Model (DEM) to estimate the individualized treatment rule for combination treatments. The proposed model is nonparametric: it flexibly incorporates complex treatment effects and interaction effects among treatments, and it improves estimation efficiency through parameter sharing. In addition, we adapt the estimated ITR to budget constraints through a multi-choice knapsack formulation. In theory, we provide value reduction bounds with and without budget constraints, as well as an improved convergence rate with respect to the number of treatments under the DEM. Our simulation studies show that the proposed method outperforms existing ITR estimation methods in various settings. We also demonstrate the superior performance of the proposed method on patient-derived xenograft (PDX) data, where it recommends optimal combination treatments to reduce tumor size in colorectal cancer.
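As a schematic sketch, with illustrative notation that is not taken verbatim from the thesis, the double encoder scores a combination $a \in \{0,1\}^K$ of $K$ candidate treatments through two encoders, one for covariates and one for treatment combinations, and the budget-constrained recommendation can be cast as a multi-choice knapsack problem:
\[
f(x, a) \;=\; \alpha(x)^{\top} \beta(a), \qquad
\max_{z}\; \sum_{i=1}^{n} \sum_{a} z_{ia}\, \hat f(x_i, a)
\;\;\text{s.t.}\;\; \sum_{i=1}^{n} \sum_{a} z_{ia}\, c_{a} \le B,\;\; \sum_{a} z_{ia} = 1,\;\; z_{ia} \in \{0,1\},
\]
where $\alpha(\cdot)$ and $\beta(\cdot)$ denote the covariate and treatment encoders, $c_a$ is the cost of combination $a$, and $B$ is the total budget. Intuitively, because the encoder parameters are shared across all $2^K$ combinations, the number of parameters grows far more slowly than fitting a separate effect for each combination, which is the source of the efficiency gain noted above.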
In the third topic, we introduce a novel ITR estimation method for combination treatments that incorporates interaction effects among treatments. Specifically, we propose the generalized $\psi$-loss as a non-convex surrogate within the residual weighted learning framework, offering desirable statistical and computational properties. Statistically, the minimizer of the proposed surrogate loss is Fisher-consistent with the optimal decision rule and accommodates interaction effects of any intensity, a significant improvement over existing methods. Computationally, the proposed method employs the difference-of-convex algorithm for efficient optimization. Through simulation studies and real-world data applications, we demonstrate the superior performance of the proposed method in recommending combination treatments.
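Schematically, and noting that the exact form of the generalized $\psi$-loss is defined in the thesis, the display below only illustrates the residual weighted learning structure and the difference-of-convex decomposition with placeholder symbols ($\hat r_i$, $\pi$, $\psi_1$, $\psi_2$):
\[
\hat f \;=\; \arg\min_{f}\; \frac{1}{n} \sum_{i=1}^{n} \frac{|\hat r_i|}{\pi(a_i \mid x_i)}\, \psi\big(\operatorname{sign}(\hat r_i)\, f(x_i, a_i)\big),
\qquad \psi \;=\; \psi_1 - \psi_2,
\]
where $\hat r_i$ is the estimated outcome residual, $\pi(a_i \mid x_i)$ is the propensity of the observed treatment, and $\psi_1$, $\psi_2$ are convex. Writing the non-convex surrogate as a difference of convex functions allows each iteration of the difference-of-convex algorithm to linearize $\psi_2$ and solve a convex subproblem.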