Date of Award
2021
Embargo Period
8-1-2024
Document Type
Dissertation - MUSC Only
Degree Name
Doctor of Philosophy (PhD)
Department
Public Health Sciences
College
College of Graduate Studies
First Advisor
Bethany Wolf
Second Advisor
Paul Nietert
Third Advisor
Diane Kamen
Fourth Advisor
John Pearce
Abstract
Advances in high-throughput technologies and the increasing availability of large-scale patient electronic health record (EHR) data provide unique opportunities to develop prediction algorithms for personalized medicine. These opportunities depend on integrating the most relevant subset of features, which enhances predictive ability by reducing the random noise introduced by unimportant features that increase a model's risk of overfitting and its computational cost. Identifying relevant features can be challenging when prediction models are difficult to optimize because of large numbers of potentially collinear features. To address this challenge, both traditional regression methods and machine learning methods can model a binary outcome and provide quantitative or semi-quantitative measures of feature importance. In this study, we evaluated how available feature importance algorithms for different prediction models perform in the presence of multicollinearity, in order to provide guidance for selecting among these approaches given the strength of correlation between features and the dimensionality of the data. We conducted an extensive simulation study to examine the impact of multicollinearity and dimensionality on the ability of different feature importance algorithms to correctly identify features as important or arbitrary. Our results indicate that, for both linear and non-linear relationships between features and the outcome, LASSO and elastic net provide the most consistent rankings in low-dimensional data, where the number of observations far exceeds the number of features. However, as dimensionality increases such that the number of features grows relative to the number of observations, feature importance algorithms in machine learning approaches such as support vector machines (SVMs), artificial neural networks (ANNs), and random forests (RFs) become preferable.
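To make the comparison concrete, the sketch below (not taken from the dissertation; it assumes NumPy and scikit-learn, and names such as n_obs, n_feat, and rho are illustrative) simulates correlated features, generates a binary outcome from a small set of truly important features, and contrasts the rankings produced by LASSO-penalized logistic regression and a random forest.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_obs, n_feat, rho = 500, 10, 0.8          # low-dimensional setting, strong correlation

# Correlated features via an exchangeable (compound-symmetric) covariance matrix
cov = np.full((n_feat, n_feat), rho) + (1 - rho) * np.eye(n_feat)
X = rng.multivariate_normal(np.zeros(n_feat), cov, size=n_obs)

# Binary outcome driven by the first three features only
beta = np.zeros(n_feat)
beta[:3] = [1.0, -1.0, 0.5]
p = 1.0 / (1.0 + np.exp(-(X @ beta)))
y = rng.binomial(1, p)

# LASSO-penalized logistic regression: importance = |coefficient|
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)
lasso_rank = np.argsort(-np.abs(lasso.coef_.ravel()))

# Random forest: importance = mean decrease in impurity
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
rf_rank = np.argsort(-rf.feature_importances_)

print("True important features: 0, 1, 2")
print("LASSO ranking:        ", lasso_rank)
print("Random forest ranking:", rf_rank)

In this low-dimensional setting the abstract's conclusion would predict that the LASSO ranking recovers the truly important features more consistently; repeating the simulation with n_feat approaching or exceeding n_obs illustrates the regime where the machine learning importances become preferable.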
Recommended Citation
Keller, Everette, "Variable Importance Performance when Multicollinearity Is Present" (2021). MUSC Theses and Dissertations. 584.
https://medica-musc.researchcommons.org/theses/584
Rights
All rights reserved. Copyright is held by the author.