Date of Award
2021
Embargo Period
8-1-2024
Document Type
Dissertation - MUSC Only
Degree Name
Doctor of Philosophy (PhD)
Department
Public Health Sciences
College
College of Graduate Studies
First Advisor
Bethany Wolf
Second Advisor
Paul Nietert
Third Advisor
Diane Kamen
Fourth Advisor
John Pearce
Abstract
Advances in high-throughput technologies and the increasing availability of large-scale patient electronic health record (EHR) data provide unique opportunities to develop prediction algorithms for personalized medicine. These opportunities depend on integrating the most relevant subset of features, which enhances predictive ability by reducing the random noise introduced by unimportant features that increase a model's risk of overfitting and its computational cost. Identifying relevant features can be challenging when prediction models are difficult to optimize because of large numbers of potentially collinear features. To address this challenge, both traditional regression methods and machine learning methods can model a binary outcome and provide quantitative or semi-quantitative measures of feature importance. In this study, we evaluated how available feature importance algorithms for different prediction models perform in the presence of multicollinearity, in order to provide guidance for selecting among these approaches given the strength of correlation between features and the dimensionality of the data. We conducted an extensive simulation study to examine the impact of multicollinearity and dimensionality on the ability of different feature importance algorithms to correctly identify features as important or arbitrary. Our results indicate that, for both linear and non-linear relationships between features and the outcome, LASSO and elastic net provide the most consistent rankings in low-dimensional data, where the number of observations far exceeds the number of features. However, as dimensionality increases such that the number of features grows relative to the number of observations, feature importance algorithms in machine learning approaches such as support vector machines (SVMs), artificial neural networks (ANNs), and random forests (RFs) become preferable.
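To make the comparison concrete, the sketch below (not taken from the dissertation; it assumes NumPy and scikit-learn, and names such as n_obs, n_feat, and rho are illustrative) simulates correlated features, generates a binary outcome from a small set of truly important features, and contrasts the rankings produced by LASSO-penalized logistic regression and a random forest.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_obs, n_feat, rho = 500, 10, 0.8          # low-dimensional setting, strong correlation

# Correlated features via an exchangeable (compound-symmetric) covariance matrix
cov = np.full((n_feat, n_feat), rho) + (1 - rho) * np.eye(n_feat)
X = rng.multivariate_normal(np.zeros(n_feat), cov, size=n_obs)

# Binary outcome driven by the first three features only
beta = np.zeros(n_feat)
beta[:3] = [1.0, -1.0, 0.5]
p = 1.0 / (1.0 + np.exp(-(X @ beta)))
y = rng.binomial(1, p)

# LASSO-penalized logistic regression: importance = |coefficient|
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)
lasso_rank = np.argsort(-np.abs(lasso.coef_.ravel()))

# Random forest: importance = mean decrease in impurity
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
rf_rank = np.argsort(-rf.feature_importances_)

print("True important features: 0, 1, 2")
print("LASSO ranking:        ", lasso_rank)
print("Random forest ranking:", rf_rank)

In this low-dimensional setting the abstract's conclusion would predict that the LASSO ranking recovers the truly important features more consistently; repeating the simulation with n_feat approaching or exceeding n_obs illustrates the regime where the machine learning importances become preferable.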
Recommended Citation
Keller, Everette, "Variable Importance Performance when Multicollinearity Is Present" (2021). MUSC Theses and Dissertations. 584.
https://medica-musc.researchcommons.org/theses/584
Rights
All rights reserved. Copyright is held by the author.