Date of Award

2016

Document Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Public Health Sciences

College

College of Graduate Studies

First Advisor

Bethany J. Wolf

Second Advisor

Viswanathan Ramakrishnan

Third Advisor

Paul J. Nietert

Fourth Advisor

Diane Kamen

Fifth Advisor

Paula Ramos

Abstract

Many diseases have complex etiologies arising from interactions among genetic and environmental factors [1]. If an increased risk of disease is due to interactions between factors rather than to a single factor alone, the risk factors associated with the disease outcome can be difficult to identify using traditional statistical methods. For example, in a traditional logistic regression approach, interactions should be selected a priori, and sufficient data must be available to fit a model that includes the interactions and their associated main effects. Moreover, if all possible interactions are to be evaluated, the number of terms in the logistic regression model grows exponentially with the number of predictors. In contrast, decision tree methods do not require interactions to be identified a priori, and they can handle large numbers of variables. Classification and Regression Trees (CART) is a popular decision tree method, but it is biased toward the inclusion of continuous variables in the model [2]. It also cannot exactly capture certain combinations of variables. Logic regression, an alternative decision tree method designed to find interactions among binary variables using Boolean logic, can identify exact interactions that are difficult to identify using CART. This dissertation extends logic regression methodology to allow for the inclusion of continuous variables. To do this, we first investigate which methods for dichotomizing a continuous variable to discriminate a binary outcome are most effective at identifying the true threshold of the continuous variable, given that one exists. Dichotomization methods are regularly used for patient risk stratification and in some statistical applications, for example to simplify the interpretation of results; thus, it is important to know which dichotomization methods successfully identify a true threshold [3–9].
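The dichotomization problem described above can be illustrated with a small sketch. The abstract does not name the dichotomization methods that were compared, so the criterion used here, Youden's J statistic (sensitivity + specificity - 1), is only an assumed stand-in for one common choice. The function name `youden_threshold` and all simulation settings (true threshold 0.6, outcome probabilities 0.8 and 0.2, sample size) are hypothetical and chosen purely for illustration.

```python
import random


def youden_threshold(x, y):
    """Return (cut, J) maximizing Youden's J for the rule "x > cut".

    x: continuous values; y: binary outcomes coded 0/1.
    Candidate cuts are the observed values of x.
    """
    n_pos = sum(y)
    n_neg = len(y) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both outcome classes must be present")
    pairs = sorted(zip(x, y))
    # With a cut below every value, all observations are classified positive.
    tp, fp = n_pos, n_neg
    best_j, best_cut = float("-inf"), None
    i = 0
    while i < len(pairs):
        cut = pairs[i][0]
        # Observations with x == cut fall on the "x <= cut" (negative) side.
        while i < len(pairs) and pairs[i][0] == cut:
            if pairs[i][1] == 1:
                tp -= 1
            else:
                fp -= 1
            i += 1
        # sensitivity + specificity - 1 = tp/n_pos - fp/n_neg
        j = tp / n_pos - fp / n_neg
        if j > best_j:
            best_j, best_cut = j, cut
    return best_cut, best_j


def simulate(n=2000, true_t=0.6, p_above=0.8, p_below=0.2, seed=1):
    """Draw x ~ Uniform(0, 1); disease risk jumps at the (hypothetical) true_t."""
    rng = random.Random(seed)
    x = [rng.random() for _ in range(n)]
    y = [int(rng.random() < (p_above if v > true_t else p_below)) for v in x]
    return x, y


if __name__ == "__main__":
    x, y = simulate()
    cut, j = youden_threshold(x, y)
    print(f"estimated threshold = {cut:.3f}, J = {j:.3f}")
```

Repeating `simulate` over many seeds and averaging `(cut - true_t)` and `(cut - true_t) ** 2` would give the kind of bias and MSE summaries the aims refer to, under these assumed settings.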
If the interaction of two or more variables, rather than their main effects, leads to an increased risk of disease, then dichotomizing the variables in the interaction term individually may obscure the association with the disease outcome, Y, making it more difficult to find the true thresholds. Thus, we also develop a method for jointly dichotomizing two or more variables to discriminate a dichotomous outcome in the case where the interaction between the variables is associated with the outcome, Y. We also use the dichotomization methods shown theoretically to recover a true threshold to develop an algorithm called C.Logic that allows for the inclusion of continuous variables in the logic regression framework.
Specific Aim 1: Evaluate the ability of different methods of dichotomizing a continuous variable to discriminate a binary outcome to recover the true threshold, given that one exists.
a. Theoretically show which methods of dichotomization should recover a true threshold, T.
b. Compare the methods theoretically shown to identify a true threshold when sampling from a population. The performance of each method will be measured by examining the Mean Squared Error (MSE) and bias of the estimated threshold.
Specific Aim 2: Develop an algorithm for jointly thresholding two or more continuous variables to discriminate a binary outcome.
a. Theoretically support the argument for using joint thresholding of interacting variables as opposed to single thresholding.
b. Compare joint thresholding to single thresholding when sampling from a population. The performance of each method will be measured by examining MSE and bias.
Specific Aim 3: Develop an algorithm, C.Logic, that allows for the inclusion of continuous variables and their interactions within the logic regression framework.
a. Compare the performance of C.Logic to CART for identifying continuous variables and their interactions that are associated with a binary outcome. Performance will be measured by the number of times an interaction is identified exactly in the model.
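As a rough illustration of the joint thresholding idea in Specific Aim 2, the sketch below searches a grid of candidate cut-point pairs for two continuous variables and keeps the pair whose combined rule "x1 > c1 AND x2 > c2" best discriminates a binary outcome. This is not the dissertation's algorithm, which the abstract does not describe; the brute-force grid search, the use of Youden's J as the criterion, the function names, and the simulated thresholds (0.4 and 0.6) are all assumptions made for illustration.

```python
import random


def youden_j(pred, y):
    """Youden's J (sensitivity + specificity - 1) for boolean predictions."""
    n_pos = sum(y)
    n_neg = len(y) - n_pos
    tp = fp = 0
    for p, outcome in zip(pred, y):
        if p:
            if outcome == 1:
                tp += 1
            else:
                fp += 1
    return tp / n_pos - fp / n_neg


def joint_threshold(x1, x2, y, grid1, grid2):
    """Exhaustively search cut pairs for the rule (x1 > c1) and (x2 > c2).

    Returns (best_j, c1, c2). A hypothetical brute-force sketch only.
    """
    best = (float("-inf"), None, None)
    for c1 in grid1:
        for c2 in grid2:
            pred = [a > c1 and b > c2 for a, b in zip(x1, x2)]
            j = youden_j(pred, y)
            if j > best[0]:
                best = (j, c1, c2)
    return best


def simulate_interaction(n=2000, t1=0.4, t2=0.6, seed=7):
    """Risk is elevated only when BOTH variables exceed their thresholds."""
    rng = random.Random(seed)
    x1 = [rng.random() for _ in range(n)]
    x2 = [rng.random() for _ in range(n)]
    y = [
        int(rng.random() < (0.9 if (a > t1 and b > t2) else 0.1))
        for a, b in zip(x1, x2)
    ]
    return x1, x2, y


if __name__ == "__main__":
    x1, x2, y = simulate_interaction()
    grid = [k / 20 for k in range(1, 20)]  # candidate cuts 0.05, 0.10, ..., 0.95
    j, c1, c2 = joint_threshold(x1, x2, y, grid, grid)
    print(f"joint cuts: c1 = {c1:.2f}, c2 = {c2:.2f}, J = {j:.3f}")
```

The exhaustive search costs one pass over the data per candidate pair, so it is practical only for coarse grids and a small number of variables; it is meant to make the idea concrete, not to suggest how the joint search was actually implemented.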

Rights

All rights reserved. Copyright is held by the author.
