Date of Award
2021
Embargo Period
8-1-2024
Document Type
Dissertation
Degree Name
Doctor of Philosophy (PhD)
Department
Public Health Sciences
College
College of Graduate Studies
First Advisor
Dongjun Chung
Second Advisor
Brian Neelon
Third Advisor
Stephen P. Ethier
Fourth Advisor
Kristin Wallace
Fifth Advisor
Feifei Xiao
Abstract
In recent years, comprehensive cancer genomics platforms, such as The Cancer Genome Atlas (TCGA), have provided access to an enormous amount of high-throughput genomic data for each patient, including gene expression, DNA copy number alteration, DNA methylation, and somatic mutation. Most existing analysis approaches focus only on gene-level analysis and suffer from limited interpretability and low reproducibility of findings. Additionally, with the increasing availability of modern compositional data, including immune cellular fraction data and high-dimensional zero-inflated microbiome data, variable selection techniques for compositional data have become of great interest because they allow inference of key immune cell types (immunology data) and key microbial species (microbiome data) associated with the development and progression of various diseases. In the first dissertation aim, we address these challenges by developing a Bayesian sparse latent factor model for pathway-guided integrative genomic data analysis. Specifically, we constructed a unified framework to simultaneously identify cancer patient subgroups (clustering) and key molecular markers (variable selection) based on the joint analysis of continuous, binary, and count data. In addition, we applied Polya-Gamma mixtures of normals to the binary and count data to enable exact and fully automatic posterior sampling. Moreover, pathway information was used to improve accuracy and robustness in the identification of cancer patient subgroups and key molecular features. In the second dissertation aim, we developed the R package "InGRiD", a comprehensive software package for pathway-guided integrative genomic data analysis. We also implemented the statistical model developed in Aim 1 and provide it as part of this software. The third dissertation aim addresses variable selection in compositional data analysis, with applications to immunology data and microbiome data.
Specifically, we identified key immune cell types by applying a stepwise pairwise log-ratio procedure to the immune cellular fraction data, and selected key species in the microbiome data using a zero-inflated Wilcoxon rank-sum test. These approaches account for key characteristics specific to these data types, such as compositionality (i.e., the sum-to-one constraint), zero inflation, and high dimensionality, among others. The proposed methods were developed and evaluated on: 1) large-scale, high-dimensional, and multi-modal datasets from the TCGA database, including gene expression, DNA copy number alteration, and somatic mutation data (Aim 1); 2) cellular fraction data derived from the Colorectal Adenocarcinoma TCGA Pan-Cancer study (Aim 3); and 3) high-dimensional zero-inflated microbiome data from studies of colorectal cancer (Aim 3).
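As a rough illustration of the pairwise log-ratio idea mentioned in the abstract, the following Python sketch computes all pairwise log-ratios of a composition and picks the single pair that best separates two groups. It is a minimal sketch under stated assumptions, not the dissertation's procedure: the function names, the toy data, and the use of an absolute Welch t-statistic as the selection score are all hypothetical, only one selection step is shown rather than a full stepwise search, and all parts are assumed strictly positive (zero handling is not covered).

```python
import math
from itertools import combinations
from statistics import mean, stdev


def pairwise_log_ratios(composition):
    """Return {(i, j): log(x_i / x_j)} for all i < j in one composition.

    Assumes every part is strictly positive; zeros would need a
    replacement strategy (e.g. a pseudocount), which is not shown here.
    """
    return {
        (i, j): math.log(composition[i] / composition[j])
        for i, j in combinations(range(len(composition)), 2)
    }


def best_log_ratio(group_a, group_b):
    """Pick the pair of parts whose log-ratio best separates two groups.

    Separation is scored with the absolute Welch t-statistic, which is an
    illustrative choice and not necessarily the dissertation's criterion.
    """
    n_parts = len(group_a[0])
    best_pair, best_score = None, -1.0
    for i, j in combinations(range(n_parts), 2):
        a = [math.log(x[i] / x[j]) for x in group_a]
        b = [math.log(x[i] / x[j]) for x in group_b]
        se = math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
        score = abs(mean(a) - mean(b)) / se if se > 0 else 0.0
        if score > best_score:
            best_pair, best_score = (i, j), score
    return best_pair, best_score


# Hypothetical toy compositions (three parts, each row sums to one).
group_a = [(0.60, 0.10, 0.30), (0.55, 0.15, 0.30), (0.62, 0.08, 0.30)]
group_b = [(0.10, 0.60, 0.30), (0.15, 0.55, 0.30), (0.08, 0.62, 0.30)]

pair, score = best_log_ratio(group_a, group_b)
print(pair, round(score, 2))
```

Working with log-ratios rather than raw fractions respects the sum-to-one constraint, because a ratio of two parts is unchanged when the whole composition is rescaled. A stepwise variant would repeat the selection, adding one log-ratio at a time according to some criterion.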
Recommended Citation
Sun, Zequn, "Statistical Methods for Integrative Analysis, Subgroup Identification, and Variable Selection Using Cancer Genomic Data" (2021). MUSC Theses and Dissertations. 643.
https://medica-musc.researchcommons.org/theses/643
Rights
All rights reserved. Copyright is held by the author.