Tibbe, Tristan Dale

A Comprehensive Comparison of Missing Data Procedures for Tree-Based Machine Learning Methods

2024

Advisor(s): Montoya, Amanda K

Abstract

As machine learning techniques increase in popularity among psychology researchers, decision trees, along with offshoots such as random forests and qualitative interaction trees (QUINT), have received special attention due to their interpretability and ease of use. Although these tree-based methods are versatile in the types and number of relationships they can model, missingness still needs to be addressed before they can produce predictions. The current literature comparing missing data methods for tree-based models is fragmented, with only subsets of conditions or methods examined in each study. Furthermore, the application of missing data methods to the specialized tree-based algorithm QUINT has gone largely uninvestigated. Thus, there is a great need to clarify which missing data methods are best to use in which situations, especially with niche tree-based models like QUINT. Since tree-based methods are quick to set up and easy to interpret, it is particularly important that users who may not be experienced in machine learning or statistics receive guidance on how to manage missingness in their data before they apply such methods. To consolidate the research that has already been done on missing data methods with tree-based models, this dissertation provides a summary of the existing literature on the topic, introducing the methods and factors that are important to consider when choosing how to deal with missingness in datasets for tree models. In addition, to understand which missing data methods are currently applied to tree-based models in psychological studies and under which conditions, recent substantive research articles were reviewed and information about their datasets and methodologies was recorded.
Finally, to extend the knowledge accumulated in the literature, a wide range of popular and modern missing data methods for tree models were applied to the QUINT algorithm and compared in a simulation study. The study varied factors such as the amount of missingness, the type of missingness, and where the missingness appeared in the data, covering a variety of scenarios that may arise in the real world. The results reveal that, in terms of both prediction accuracy and variable selection, imputation methods (specifically regression and hot deck imputation) and missingness incorporated in attributes (MIA) are able to address missing data and produce QUINT models of consistently high quality. The complete case method, on the other hand, should be avoided due to its highly variable and inconsistent performance, which leads to models that differ greatly from what would have been produced had the data been complete.


UCLA
