Imputation is a Hyperparameter: Imputation Deep Learning Model Selection and Evaluation on Large Clinical Datasets
- Zamanzadeh, Davina
- Advisor(s): Sarrafzadeh, Majid
Abstract
Many real-world datasets suffer from missing data, which can introduce uncertainty into ensuing analyses. To address missing data, researchers have been developing, analyzing, and comparing statistical and machine learning techniques for missing data estimation, or imputation. In this context, we built an original framework, Autopopulus, and performed novel analyses that explored predictive pipelines using flexible autoencoder-led imputation. Our work examines autoencoder-led imputation with a deeper regard for the taxonomy of missingness scenarios and for the mixed feature data of large real-world clinical datasets. In this dissertation we directly quantify the extent to which different methods of imputation affect downstream tasks, and thereby provide a rationale for choosing a solution for a particular dataset and task. We illuminate important decision points in assembling a data processing pipeline that handles missing data, and our framework allows researchers to apply and compare solutions directly, in a unified way, for any large dataset. We find that imputation traits differ under a more granular classification of missingness scenarios, and that trends in imputation performance superiority do not align with trends in predictive performance superiority. Based on our exploration, we believe that the characterization of missingness in the literature must be expanded and that imputing accurately is not always necessary for predicting accurately. We are only beginning to see how wide the gap is in our understanding and classification of missingness, and we hope that this new information will lead to progress in comprehending both the unknown and the unknowable.