International Biometric Conference 2020
Statistical integration of multi-omics data on Multiple System Atrophy
Said El Bouhaddani1, Hae-Won Uh1, Jeanine J. Houwing-Duistermaat1,2
1Biostatistics and Research Support, University Medical Center Utrecht (UMCU), Utrecht, Netherlands. 2Department of Statistics, University of Leeds, Leeds, United Kingdom.
Multiple System Atrophy (MSA) is a rare neurodegenerative disorder. Almost 80% of patients are disabled within 5 years of disease onset. The key pathogenic event in the development of MSA is an abnormal accumulation of harmful proteins. The molecular causes and consequences of this aggregation need to be elucidated, for example using multiple omics datasets. We have access to DNA-methylome, miRNome, transcriptome and proteome data, measured in cell lines that show harmful protein aggregation and in negative controls. Standard sequential analysis of these data shows no overlap among the significant genes. A combined analysis can detect relevant features shared by all datasets, improving the understanding of MSA.
Our aim is to develop a data integration method to identify consistent molecular biomarkers that can classify cells with protein aggregation across all datasets. Apart from the high dimensionality (p>N), platform-specific heterogeneity between the omics data needs to be considered. Several algorithmic approaches to integrating multiple datasets have been proposed, for example multi-group PLS (mg-PLS) and MINT. They decompose the datasets into joint and residual parts. The joint components capture consistent effects of the molecular measurements on the outcome across all datasets. The optimal components are obtained by iteratively maximising the covariance between the molecular measurements and a dummy matrix based on the binary outcome.
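The covariance-maximisation step described above can be illustrated with a minimal Python sketch. It is not the mg-PLS or MINT implementation: it computes only the first joint weight vector, uses a singular value decomposition in place of an iterative algorithm, and assumes that all datasets are mapped to the same p features (for example, gene level) and centred within each dataset. The function names and the simulated data are hypothetical.

```python
import numpy as np

def dummy_matrix(y):
    """Two-column indicator (dummy) matrix for a binary 0/1 outcome."""
    y = np.asarray(y)
    return np.column_stack([y == 0, y == 1]).astype(float)

def first_joint_weights(X_groups, y_groups):
    """First joint weight vector shared by all datasets.

    Sums the within-dataset cross-products X_k^T Y_k (after centring
    each dataset separately) and returns the leading left singular
    vector, which maximises the summed covariance between X_k w and
    the dummy outcome over unit-norm w.
    """
    p = X_groups[0].shape[1]
    M = np.zeros((p, 2))
    for X, y in zip(X_groups, y_groups):
        Xc = X - X.mean(axis=0)          # removes dataset-specific offsets
        Y = dummy_matrix(y)
        Yc = Y - Y.mean(axis=0)
        M += Xc.T @ Yc
    U, s, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, 0], s[0]

# Simulated example (assumption): 3 datasets, 20 samples each, p = 50 > N,
# with an outcome effect on the first three features only.
rng = np.random.default_rng(0)
p, n = 50, 20
beta = np.zeros(p)
beta[:3] = 3.0
X_groups, y_groups = [], []
for k in range(3):
    y = np.tile([0, 1], n // 2)
    X = rng.normal(size=(n, p)) + 2.0 * k + np.outer(y, beta)
    X_groups.append(X)
    y_groups.append(y)

w, top_sv = first_joint_weights(X_groups, y_groups)
top_features = set(np.argsort(-np.abs(w))[:3].tolist())
```

In this sketch the features with the largest absolute weights are the candidates for consistent effects across datasets. Further components would require deflation, which is omitted here.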
The drawbacks of mg-PLS and MINT are the lack of platform-specific parts in the decomposition, the absence of a proper model for the binary outcome, and a risk of overfitting when the data are high dimensional. Therefore, we propose a novel Probabilistic multi-group OPLS (mg-POPLS) model that decomposes multiple datasets into joint, platform-specific and residual parts. Systematic differences between the omics data are incorporated in the model by including specific parts. The outcome is modelled via these components by using a latent probit model. The components and coefficients are estimated by maximum likelihood using an EM algorithm.
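One possible way to write down a decomposition of this kind is sketched below. The notation is assumed for illustration and is not taken from the abstract: X_k is the N_k × p data matrix of dataset k, T_k and U_k are latent joint and platform-specific components, W and P_k are the corresponding loadings, and E_k is residual noise. Which components enter the probit part is also an assumption; here only the joint components do.

```latex
% Illustrative sketch only; all symbols are assumptions, not the authors' notation.
\begin{align}
  X_k &= T_k W^\top + U_k P_k^\top + E_k, \qquad k = 1, \dots, K, \\
  y^{*}_{ik} &= t_{ik}^\top \beta + \varepsilon_{ik}, \qquad \varepsilon_{ik} \sim \mathcal{N}(0, 1), \\
  y_{ik} &= \mathbb{1}\{\, y^{*}_{ik} > 0 \,\}, \\
  \Pr(y_{ik} = 1 \mid t_{ik}) &= \Phi\!\left(t_{ik}^\top \beta\right).
\end{align}
```

Under such a formulation the latent components and the latent outcome y* would be treated as missing data in the EM algorithm, with the loadings and the coefficients β updated in the M-step.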
An extensive simulation study will be conducted to investigate the performance of mg-POPLS compared to mg-PLS. We apply the mg-POPLS method to the omics data measured in the cell lines to detect the most relevant features for separating MSA cases from controls.