1. Erfani M, Baalousha M, Goharian E. Unveiling elemental fingerprints: A comparative study of clustering methods for multi-element nanoparticle data. Sci Total Environ 2023;905:167176. PMID: 37730026. DOI: 10.1016/j.scitotenv.2023.167176.
Abstract
Single particle inductively coupled plasma time-of-flight mass spectrometry (SP-ICP-TOF-MS) generates large datasets describing the multi-elemental composition of nanoparticles. However, extracting useful information from such datasets is challenging. Hierarchical clustering (HC) has been successfully applied to extract elemental fingerprints from multi-element nanoparticle data obtained by SP-ICP-TOF-MS, but many other clustering approaches that could be applied to SP-ICP-TOF-MS data have not yet been evaluated. This study fills this knowledge gap by comparing the performance of three clustering approaches for analyzing SP-ICP-TOF-MS data: HC, spectral clustering, and t-distributed Stochastic Neighbor Embedding coupled with Density-Based Spatial Clustering of Applications with Noise (tSNE-DBSCAN). The performance of these clustering techniques was evaluated by comparing the size of the extracted clusters and the similarity of the elemental composition of nanoparticles within each cluster. HC often failed to achieve an optimal clustering solution for SP-ICP-TOF-MS data because it is sensitive to the presence of outliers. Spectral clustering and tSNE-DBSCAN extracted clusters that were not identified by HC: spectral clustering, a method developed from graph theory, reveals both the global and local structure in the data, while tSNE reduces and maps the data into a lower-dimensional space, enabling clustering algorithms such as DBSCAN to identify subclusters with subtle differences in their elemental composition. However, tSNE-DBSCAN can also lead to unsatisfactory clustering solutions, because tuning the perplexity hyperparameter of tSNE is a difficult and time-consuming task and the relative distances between datapoints are not maintained.
Although the three clustering approaches successfully extract useful information from SP-ICP-TOF-MS data, spectral clustering outperforms HC and tSNE-DBSCAN by generating clusters of a large number of nanoparticles with similar elemental compositions.
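The three approaches compared above can be sketched with off-the-shelf implementations. This is an illustrative toy example, not the authors' pipeline: the synthetic "particle × element" matrix, the cluster counts, the perplexity, and the DBSCAN settings are all assumed values.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering, SpectralClustering, DBSCAN
from sklearn.manifold import TSNE

# Synthetic stand-in for SP-ICP-TOF-MS particle-by-element intensities:
# 300 "particles" x 5 "elements", with three underlying particle types.
X, _ = make_blobs(n_samples=300, n_features=5, centers=3, random_state=0)

# Hierarchical clustering (Ward linkage) and spectral clustering.
hc_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)
sc_labels = SpectralClustering(n_clusters=3, affinity="nearest_neighbors",
                               random_state=0).fit_predict(X)

# tSNE-DBSCAN: embed to 2-D, then density-based clustering; the perplexity
# and eps values are illustrative guesses, not tuned settings.
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
db_labels = DBSCAN(eps=3.0, min_samples=5).fit_predict(emb)

# DBSCAN labels noise points as -1; count the remaining clusters.
n_db_clusters = len(set(db_labels.tolist())) - (1 if -1 in db_labels else 0)
```

Within-cluster similarity of elemental composition (the evaluation criterion used above) could then be compared across the three label sets.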
Affiliation(s)
- Mahdi Erfani
- Department of Civil and Environmental Engineering, University of South Carolina, SC 29208, USA
- Mohammed Baalousha
- Center for Environmental Nanoscience and Risk, Department of Environmental Health Sciences, Arnold School of Public Health, University of South Carolina, Columbia, SC 29201, USA
- Erfan Goharian
- Department of Civil and Environmental Engineering, University of South Carolina, SC 29208, USA
2. Jain R, Xu W. Artificial Intelligence based wrapper for high dimensional feature selection. BMC Bioinformatics 2023;24:392. PMID: 37853338. PMCID: PMC10585895. DOI: 10.1186/s12859-023-05502-x.
Abstract
BACKGROUND Feature selection is important in high dimensional data analysis. The wrapper approach is one way to perform feature selection, but it is computationally intensive because it builds and evaluates models for multiple subsets of features. Existing wrapper algorithms primarily focus on shortening the path to an optimal feature set; however, they underutilize the information in the feature subset models already built, which limits feature selection and predictive performance. METHOD AND RESULTS This study proposes a novel Artificial Intelligence based Wrapper (AIWrap) algorithm that integrates Artificial Intelligence (AI) with the existing wrapper algorithm. The algorithm develops a Performance Prediction Model using AI which predicts the model performance of any feature set, allowing the wrapper algorithm to evaluate a feature subset without building its model. This can make the wrapper approach more practical for high-dimensional data. We evaluate the performance of this algorithm using simulation studies and real research studies. AIWrap shows feature selection and model prediction performance that is better than or on par with standard penalized feature selection algorithms and wrapper algorithms. CONCLUSION The AIWrap approach provides an alternative to existing feature selection algorithms. The current study focuses on AIWrap applied to continuous cross-sectional data; however, it could be applied to other datasets such as longitudinal, categorical and time-to-event biological data.
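The core idea of a surrogate-guided wrapper can be sketched as follows. This is a simplified illustration of the concept, not the published AIWrap algorithm: the number of sampled subsets, the surrogate model, and the scoring are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = make_regression(n_samples=120, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

# Step 1: evaluate a small sample of random feature subsets the usual
# (expensive) way, recording each subset's cross-validated R^2.
masks, scores = [], []
for _ in range(40):
    m = rng.random(20) < 0.5
    if not m.any():
        continue
    s = cross_val_score(LinearRegression(), X[:, m], y, cv=3).mean()
    masks.append(m.astype(float))
    scores.append(s)

# Step 2: fit a "performance prediction model" mapping mask -> score.
ppm = RandomForestRegressor(n_estimators=100, random_state=0)
ppm.fit(np.array(masks), np.array(scores))

# Step 3: score many candidate subsets cheaply through the surrogate
# instead of fitting a model for each one.
cand = (rng.random((500, 20)) < 0.5).astype(float)
pred = ppm.predict(cand)
best_mask = cand[int(np.argmax(pred))].astype(bool)
```

Only the surrogate's top-ranked subsets would then need to be verified with real model fits.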
Affiliation(s)
- Rahi Jain
- Biostatistics Department, Princess Margaret Cancer Research Centre, Toronto, ON, Canada
- Wei Xu
- Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
3. Abdelwahed NM, El-Tawel GS, Makhlouf MA. Effective hybrid feature selection using different bootstrap enhances cancers classification performance. BioData Min 2022;15:24. PMID: 36175944. PMCID: PMC9523996. DOI: 10.1186/s13040-022-00304-y.
Abstract
BACKGROUND Machine learning can be used to predict the onset of different human cancers. High-dimensional data pose enormous, complicated problems, among them an excessive number of genes, over-fitting, long fitting times, and reduced classification accuracy. Recursive Feature Elimination (RFE) is a wrapper method for selecting the subset of features that yields the best accuracy. Despite the high performance of RFE, computation time and over-fitting are two disadvantages of this algorithm. Random forest selection (RFS) has proven effective at selecting informative features and mitigating the over-fitting problem. METHOD This paper proposes a method, positions first bootstrap step random forest selection recursive feature elimination (PFBS-RFS-RFE), to enhance cancer classification performance. It uses a bootstrap in several positions: the outer first bootstrap step (OFBS), the inner first bootstrap step (IFBS), and the outer/inner first bootstrap step (O/IFBS). In the first position, OFBS applies resampling with replacement (bootstrap) before the selection step; RFS is applied with bootstrap = false, i.e., the whole dataset is used to build each tree, and the important features are combined with RFE to select the most relevant subset of features. In the second position, IFBS applies resampling with replacement while RFS is running, and the important features are again combined with RFE. In the third position, O/IFBS combines the first and second positions. RFE uses logistic regression (LR) as its estimator. The proposed methods are paired with four classifiers to solve the feature selection problem and improve the performance of RFE, and five datasets of different sizes are used to assess the performance of PFBS-RFS-RFE.
RESULTS The results showed that O/IFBS-RFS-RFE achieved the best performance compared with previous work, improving the accuracy, variance and ROC area to 99.994%, 0.0000004 and 1.000 for the RNA gene dataset and to 100.000%, 0.0 and 1.000 for the dermatology erythemato-squamous diseases dataset, respectively. CONCLUSION High-dimensional datasets and the RFE algorithm pose many difficulties for cancer classification. PFBS-RFS-RFE is proposed to address these difficulties using different bootstrap positions: the important features extracted by RFS are used with RFE to obtain an effective feature subset.
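A minimal sketch of the bootstrap-then-RFS-then-RFE idea, using scikit-learn; the synthetic dataset, the top-20 importance cut-off, and the target subset size are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

X, y = make_classification(n_samples=200, n_features=50, n_informative=8,
                           random_state=0)

# Outer first bootstrap step: resample with replacement before selection.
Xb, yb = resample(X, y, replace=True, random_state=0)

# Random-forest selection: rank features by impurity importance and keep
# the top 20 as the candidate pool (the cut-off is an illustrative choice).
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(Xb, yb)
pool = np.argsort(rf.feature_importances_)[::-1][:20]

# RFE with a logistic-regression estimator, run only on the RF-selected
# pool to reduce cost and over-fitting.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=8)
rfe.fit(Xb[:, pool], yb)
selected = pool[rfe.support_]
```

The final `selected` indices would then feed the downstream classifiers.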
Affiliation(s)
- Noura Mohammed Abdelwahed
- Department of Information Systems, Faculty of Computers and Informatics, Suez Canal University, Ismailia, Egypt
- Gh S El-Tawel
- Department of Computer Science, Faculty of Computers and Informatics, Suez Canal University, Ismailia, Egypt
- M A Makhlouf
- Department of Information Systems, Faculty of Computers and Informatics, Suez Canal University, Ismailia, Egypt
4. Cousido-Rocha M, de Uña-Álvarez J. Equalden.HD: An R package for testing the equality of a high dimensional set of densities. Comput Methods Programs Biomed 2022;217:106694. PMID: 35278813. DOI: 10.1016/j.cmpb.2022.106694.
Abstract
BACKGROUND AND OBJECTIVE Nowadays the "low sample size, large dimension" scenario is often encountered in genetics and the omic sciences, where microarray data are typically formed by a large number of possibly dependent small samples. Standard methods for the k-sample problem in such a setting are of limited applicability due to the lack of theoretical validation for large k, lengthy computation times, missing software implementations, or the inability to deal with statistical dependence among the samples. This paper presents the R package Equalden.HD to overcome these limitations. METHODS The package implements several tests of the null hypothesis that a large number of samples follow a common density. These methods are particularly well suited to the "low sample size, large dimension" setting, and the implemented procedures allow for dependent samples. For each method, Equalden.HD reports, among other things, the standardized value of the test statistic and the corresponding p-value. The package also includes two high-dimensional genetic data sets, Hedenfalk and Rat, which are used in this paper for illustration. RESULTS The usage of Equalden.HD is illustrated through the analysis of the Hedenfalk and Rat genetic data. Statistical dependence among the samples was found for both data sets, and the application of an appropriate k-sample test within Equalden.HD rejected the null hypothesis of inter-sample homogeneity. The methods were also used to test for within-group homogeneity in cluster analysis, which is usually performed when the k samples are found to be significantly different, and Equalden.HD helped to identify the individuals responsible for the lack of homogeneity. The limitations of the standard Kruskal-Wallis test for identifying homogeneous clusters are highlighted.
CONCLUSIONS The methods implemented by Equalden.HD are the only omnibus nonparametric k-sample tests that have been validated as k grows. Furthermore, the package provides suitable corrections for possibly dependent samples, another distinctive feature. The package thus opens new doors for the statistical analysis of omic data. Limitations of standard methods (e.g. Anderson-Darling and Kruskal-Wallis) and of existing software in the large-k setting are emphasized.
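Equalden.HD itself is an R package; as a point of comparison, the standard omnibus k-sample tests whose large-k limitations the paper highlights can be run with SciPy. The simulated "low sample size, large k" design below is an assumption for illustration only.

```python
import numpy as np
from scipy.stats import kruskal, anderson_ksamp

rng = np.random.default_rng(0)

# "Low sample size, large dimension": k = 200 samples of size n = 4 each,
# half drawn from N(0,1) and half from N(1,1), so homogeneity should fail.
samples = [rng.normal(0.0, 1.0, 4) for _ in range(100)] + \
          [rng.normal(1.0, 1.0, 4) for _ in range(100)]

# Standard omnibus k-sample tests; the paper stresses that their validity
# is not guaranteed as k grows, which is what motivates Equalden.HD.
kw_stat, kw_p = kruskal(*samples)
ad_res = anderson_ksamp(samples)
```

Both tests reject here, but their theoretical guarantees for k of this order are exactly what the package's methods are designed to supply.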
Affiliation(s)
- Marta Cousido-Rocha
- SiDOR Research Group & CINBIO, Universidade de Vigo, Campus Lagoas-Marcosende, Vigo 36310, Spain
- Jacobo de Uña-Álvarez
- SiDOR Research Group & CINBIO, Universidade de Vigo, Campus Lagoas-Marcosende, Vigo 36310, Spain
5. Sutton M, Sugier PE, Truong T, Liquet B. Leveraging pleiotropic association using sparse group variable selection in genomics data. BMC Med Res Methodol 2022;22:9. PMID: 34996381. PMCID: PMC8742466. DOI: 10.1186/s12874-021-01491-8.
Abstract
Background Genome-wide association studies (GWAS) have identified genetic variants associated with multiple complex diseases. We can leverage this phenomenon, known as pleiotropy, to integrate multiple data sources in a joint analysis, and integrating additional information such as gene pathway knowledge can often improve statistical efficiency and biological interpretation. In this article, we propose statistical methods which incorporate both gene pathway and pleiotropy knowledge to increase statistical power and identify important risk variants affecting multiple traits. Methods We propose novel feature selection methods for group variable selection in the multi-task regression problem. We develop penalised likelihood methods exploiting different penalties to induce structured sparsity at the gene (or pathway) and SNP level across all studies, and implement an alternating direction method of multipliers (ADMM) algorithm for our penalised regression methods. The performance of our approaches is compared to a subset-based meta-analysis approach on simulated data sets, and a bootstrap sampling strategy is provided to explore the stability of the penalised methods. Results Our methods are applied to identify potential pleiotropy in a joint analysis of thyroid and breast cancers. The methods detected eleven potential pleiotropic SNPs and six pathways. A simulation study found that our method detected more true signals than a popular competing method while retaining a similar false discovery rate. Conclusion We developed feature selection methods for jointly analysing multiple logistic regression tasks where prior grouping knowledge is available. Our methods performed well in simulation studies and when applied to a real data analysis of multiple cancers.
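Structured-sparsity penalties of this kind are typically minimized with proximal updates inside ADMM. Below is a sketch of the closed-form proximal step for a sparse-group penalty (elementwise soft-thresholding followed by groupwise shrinkage); it is illustrative, not the authors' implementation, and the penalty weights are arbitrary.

```python
import numpy as np

def sparse_group_prox(beta, groups, lam1, lam2):
    """Proximal operator of lam1*||b||_1 + lam2*sum_g ||b_g||_2.

    First soft-threshold each coefficient (the Lasso part), then shrink
    each group's survivors toward zero, zeroing whole groups whose norm
    falls below lam2 (the group part). This is the core update inside an
    ADMM solver for sparse-group penalties.
    """
    b = np.sign(beta) * np.maximum(np.abs(beta) - lam1, 0.0)
    out = np.zeros_like(b)
    for g in np.unique(groups):
        idx = groups == g
        norm = np.linalg.norm(b[idx])
        if norm > lam2:
            out[idx] = (1.0 - lam2 / norm) * b[idx]
    return out

# Two groups of three SNP-level coefficients; small entries are zeroed
# individually, and entire groups can be zeroed together.
beta = np.array([2.0, -0.1, 0.05, 1.5, -2.5, 0.02])
groups = np.array([0, 0, 0, 1, 1, 1])
shrunk = sparse_group_prox(beta, groups, lam1=0.1, lam2=0.5)
```

Within ADMM, this operator is applied once per iteration to the split variable, which is how sparsity is induced simultaneously at the SNP and gene/pathway levels.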
Affiliation(s)
- Matthew Sutton
- Queensland University of Technology Centre for Data Science, Brisbane, Australia
- Pierre-Emmanuel Sugier
- Laboratoire de Mathématiques et de leurs Applications de Pau, E2S UPPA, CNRS, Pau, France
- University Paris-Saclay, UVSQ, Inserm, Gustave Roussy, CESP, Team "Exposome and Heredity", Villejuif, France
- Therese Truong
- University Paris-Saclay, UVSQ, Inserm, Gustave Roussy, CESP, Team "Exposome and Heredity", Villejuif, France
- Benoit Liquet
- Laboratoire de Mathématiques et de leurs Applications de Pau, E2S UPPA, CNRS, Pau, France
- Department of Mathematics and Statistics, Macquarie University, Sydney, Australia
6. Zhang H, Chen J, Li Z, Liu L. Testing for mediation effect with application to human microbiome data. Stat Biosci 2021;13:313-328. PMID: 34093887. PMCID: PMC8177450. DOI: 10.1007/s12561-019-09253-3.
Abstract
Mediation analysis is commonly used to study the effect of an exposure on an outcome through a mediator. In this paper, we explore the mediation mechanism of the microbiome, whose special features make the analysis challenging. We consider the isometric log-ratio (ilr) transformation of the relative abundances as the mediator variable. We then present a de-biased Lasso estimate for the mediator of interest and derive its standard error estimator, which can be used to develop a test for the mediation effect of interest. Extensive simulation studies are conducted to assess the performance of our method. We apply the proposed approach to test the mediation effect of the human gut microbiome between dietary fiber intake and body mass index.
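The ilr transformation that turns compositional relative abundances into unconstrained covariates can be sketched directly. The pivot-coordinate basis below is one standard choice and an assumption; the abstract does not specify which basis the authors use.

```python
import numpy as np

def ilr(relabund):
    """Isometric log-ratio transform of compositions (rows sum to 1).

    Uses pivot (balance) coordinates: a D-part composition maps to D-1
    unconstrained real coordinates, so the transformed abundances can
    enter a (de-biased) Lasso as ordinary covariates.
    """
    x = np.asarray(relabund, dtype=float)
    D = x.shape[1]
    logx = np.log(x)
    out = np.empty((x.shape[0], D - 1))
    for j in range(1, D):
        # balance between the geometric mean of the first j parts
        # and part j+1
        gm = logx[:, :j].mean(axis=1)
        out[:, j - 1] = np.sqrt(j / (j + 1.0)) * (gm - logx[:, j])
    return out

rng = np.random.default_rng(0)
counts = rng.dirichlet(np.ones(5), size=10)   # toy relative abundances
Z = ilr(counts)
```

A perfectly even composition maps to the zero vector, reflecting that ilr coordinates measure log-ratio imbalance rather than absolute abundance.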
Affiliation(s)
- Haixiang Zhang
- Center for Applied Mathematics, Tianjin University, Tianjin 300072, China
- Jun Chen
- Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, MN 55905, USA
- Zhigang Li
- Department of Biostatistics, University of Florida, Gainesville, FL 32610, USA
- Lei Liu
- Division of Biostatistics, Washington University in St. Louis, St. Louis, MO 63110, USA
7.
Abstract
BACKGROUND The increasing number of genome-wide association studies (GWAS) has revealed several loci that are associated with multiple distinct phenotypes, suggesting the existence of pleiotropic effects. Highlighting these cross-phenotype genetic associations could help to identify and understand common biological mechanisms underlying some diseases. Common approaches test the association between genetic variants and multiple traits at the SNP level. In this paper, we propose a novel gene- and pathway-level approach for the case where several independent GWAS on independent traits are available. The method is based on a generalization of sparse group Partial Least Squares (sgPLS) that takes into account groups of variables, with a Lasso penalization that links all independent data sets. This method, called joint-sgPLS, convincingly detects signal at both the variable level and the group level. RESULTS Our method has the advantage of proposing a globally readable model while respecting the architecture of the data. It can outperform traditional methods and provides wider insight in terms of a priori information. We compared the performance of the proposed method to benchmark methods on simulated data and give an example of application to real data, with the aim of highlighting common susceptibility variants for breast and thyroid cancers. CONCLUSION Joint-sgPLS shows interesting properties for detecting a signal. As an extension of PLS, the method is suited to data with a large number of variables, and the chosen Lasso penalization copes with architectures of variable groups and observation sets. Furthermore, although the method has been applied to a genetic study, its formulation is adapted to any data with a high number of variables and a known a priori group structure in other application fields.
Affiliation(s)
- Camilo Broc
- LIST, CEA, Laboratory for Data Sciences and Decision (Digiteo), Gif-sur-Yvette, France
- CNRS, Laboratoire de Mathématiques et de leurs Applications de Pau, E2S UPPA, Pau, France
- Therese Truong
- UVSQ, Inserm, CESP, Université Paris-Saclay, 94807 Villejuif, France
- Institut Gustave Roussy, 94805 Villejuif, France
- Benoit Liquet
- CNRS, Laboratoire de Mathématiques et de leurs Applications de Pau, E2S UPPA, Pau, France
- Department of Mathematics and Statistics, Macquarie University, Sydney, Australia
8. Knoll M, Furkel J, Debus J, Abdollahi A. modelBuildR: an R package for model building and feature selection with erroneous classifications. PeerJ 2021;9:e10849. PMID: 33614290. PMCID: PMC7879945. DOI: 10.7717/peerj.10849.
Abstract
BACKGROUND Model building is a crucial part of omics-based biomedical research for transferring classifications and obtaining insights into underlying mechanisms. Feature selection is often based on minimizing the error between model predictions and the given classification (maximizing accuracy). Human ratings/classifications, however, may be error prone, with discordance rates between experts of 5-15%. We therefore evaluate whether a feature pre-filtering step might improve the identification of features associated with the true underlying groups. METHODS Data were simulated for up to 100 samples and up to 10,000 features, 10% of which were associated with a ground truth comprising 2-10 normally distributed populations. Binary and semi-quantitative ratings with varying error probabilities were used as the classification. For feature preselection, standard cross-validation (V2) was compared to a novel heuristic (V1) applying univariate testing, multiplicity adjustment and cross-validation with the dependent (classification) and independent (feature) variables switched. Preselected features were used to train logistic regression/linear models (backward selection, AIC), and predictions were compared against the ground truth (ROC, multiclass ROC). As a use case, multiple feature selection/classification methods were benchmarked against the novel heuristic on the task of identifying prognostically different G-CIMP negative glioblastoma tumors in the TCGA-GBM 450k methylation array cohort, starting from a rough, erroneous separation based on a fuzzy UMAP embedding. RESULTS V1 yielded higher median AUC ranks for two true groups (ground truth), with smaller differences for true graduated differences (3-10 groups). Fewer models were successfully fit with V1. Median AUCs for binary classification and two true groups were 0.91 (range: 0.54-1.00) for V1 (Benjamini-Hochberg) and 0.70 (0.28-1.00) for V2; 13% (n = 616) of V2 models showed AUCs ≤ 0.50 for 25 samples and 100 features.
For larger numbers of features and samples, median AUCs were 0.75 (range 0.59-1.00) for V1 and 0.54 (range 0.32-0.75) for V2. In the TCGA-GBM data, modelBuildR achieved the best prognostic separation of patients, with the highest median overall survival difference (7.51 months), followed by a difference of 6.04 months for a random forest based method. CONCLUSIONS The proposed heuristic is beneficial for retrieving features associated with two true groups classified with errors. We provide the R package modelBuildR to simplify (comparative) evaluation and application of the proposed heuristic (http://github.com/mknoll/modelBuildR).
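The preselection idea behind V1, testing each feature against the (noisy) classification and adjusting for multiplicity, can be sketched as follows; the simulated effect size, label error rate, and cut-offs are illustrative assumptions, not modelBuildR's defaults.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n, p = 100, 1000
truth = rng.integers(0, 2, n)                 # true two-group structure
X = rng.normal(0, 1, (n, p))
X[:, :100] += truth[:, None] * 1.0            # first 100 features informative

# Error-prone rating: flip 10% of the true labels.
rating = truth.copy()
flip = rng.random(n) < 0.10
rating[flip] = 1 - rating[flip]

# Univariate pre-filtering with the variables "switched": each feature is
# tested against the (noisy) classification before any model building.
pvals = np.array([ttest_ind(X[rating == 0, j], X[rating == 1, j]).pvalue
                  for j in range(p)])

def benjamini_hochberg(p):
    """Benjamini-Hochberg adjusted p-values (step-up, monotone)."""
    order = np.argsort(p)
    ranked = p[order] * len(p) / (np.arange(len(p)) + 1)
    adj = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty_like(adj)
    out[order] = np.clip(adj, 0, 1)
    return out

keep = benjamini_hochberg(pvals) < 0.05       # preselected feature mask
```

Only the `keep` features would then enter the downstream model-building step (backward selection, AIC).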
Affiliation(s)
- Maximilian Knoll
- Department of Radiation Oncology, Heidelberg University Hospital, Heidelberg, Germany
- National Center for Tumor Diseases (NCT), UKHD and German Cancer Research Center (DKFZ), Heidelberg, Germany
- German Cancer Consortium (DKTK), Core Center Heidelberg, DKFZ, Heidelberg, Germany
- Jennifer Furkel
- Department of Radiation Oncology, Heidelberg University Hospital, Heidelberg, Germany
- National Center for Tumor Diseases (NCT), UKHD and German Cancer Research Center (DKFZ), Heidelberg, Germany
- German Cancer Consortium (DKTK), Core Center Heidelberg, DKFZ, Heidelberg, Germany
- Juergen Debus
- Department of Radiation Oncology, Heidelberg University Hospital, Heidelberg, Germany
- National Center for Tumor Diseases (NCT), UKHD and German Cancer Research Center (DKFZ), Heidelberg, Germany
- German Cancer Consortium (DKTK), Core Center Heidelberg, DKFZ, Heidelberg, Germany
- Amir Abdollahi
- Department of Radiation Oncology, Heidelberg University Hospital, Heidelberg, Germany
- National Center for Tumor Diseases (NCT), UKHD and German Cancer Research Center (DKFZ), Heidelberg, Germany
- German Cancer Consortium (DKTK), Core Center Heidelberg, DKFZ, Heidelberg, Germany
9. Gu X, Tadesse MG, Foulkes AS, Ma Y, Balasubramanian R. Bayesian variable selection for high dimensional predictors and self-reported outcomes. BMC Med Inform Decis Mak 2020;20:212. PMID: 32894123. PMCID: PMC7487595. DOI: 10.1186/s12911-020-01223-w.
Abstract
Background The onset of silent diseases such as type 2 diabetes is often registered through self-report in large prospective cohorts. Self-reported outcomes are cost-effective; however, they are subject to error. Diagnosis of silent events may also occur through the use of imperfect laboratory-based diagnostic tests. In this paper, we describe an approach for variable selection in high dimensional datasets for settings in which the outcome is observed with error. Methods We adapt the spike-and-slab Bayesian variable selection approach to the context of error-prone, self-reported outcomes. The performance of the proposed approach is studied through simulation studies. An illustrative application uses data from the Women's Health Initiative SNP Health Association Resource, which includes extensive genotypic (>900,000 SNPs) and phenotypic data on 9,873 African American and Hispanic American women. Results Simulation studies show improved sensitivity of our proposed method when compared to a naive approach that ignores error in the self-reported outcomes. Application of the proposed method resulted in the discovery of several single nucleotide polymorphisms (SNPs) associated with risk of type 2 diabetes in the 9,873 African American and Hispanic participants of the Women's Health Initiative. There was little overlap among the top-ranking SNPs between the racial groups, adding support to previous observations that disease-associated genetic loci often do not generalize across race/ethnicity populations. The adapted Bayesian variable selection algorithm is implemented in R, and the source code for the simulations is available in the Supplement. Conclusions Variable selection accuracy is reduced when the outcome is ascertained by error-prone self-reports.
For this setting, our proposed algorithm has improved variable selection performance when compared to approaches that neglect to account for the error-prone nature of self-reports.
Affiliation(s)
- Xiangdong Gu
- Department of Biostatistics and Epidemiology, University of Massachusetts, Amherst, MA, USA
- Mahlet G Tadesse
- Department of Mathematics and Statistics, Georgetown University, Washington, DC, USA
- Andrea S Foulkes
- Biostatistics Center, Division of Clinical Research, Massachusetts General Hospital Research Institute, Boston, MA, USA
- Yunsheng Ma
- Department of Medicine, University of Massachusetts Medical School, Worcester, MA, USA
- Raji Balasubramanian
- Department of Biostatistics and Epidemiology, University of Massachusetts, Amherst, MA, USA
10.
Abstract
BACKGROUND The problem of assessing associations between multiple omics data types, including genomics and metabolomics data, to identify biomarkers potentially predictive of complex diseases has garnered considerable research interest. A popular epidemiological approach is to consider the association of each predictor with each response using a univariate linear regression model, and to select predictors that meet an a priori specified significance level. Although this approach is simple and intuitive, it tends to require larger sample sizes, which is costly. It also assumes the variables within each data type are independent, ignoring correlations that exist between variables both within and across data types. RESULTS We consider a multivariate linear regression model that relates multiple predictors to multiple responses and identifies multiple relevant predictors that are simultaneously associated with the responses. We assume the coefficient matrix of the responses on the predictors is both row-sparse and of low rank, and propose a group Dantzig-type formulation to estimate the coefficient matrix. CONCLUSION Extensive simulations demonstrate the competitive performance of our proposed method compared to existing methods in terms of estimation, prediction, and variable selection. We use the proposed method to integrate genomics and metabolomics data to identify genetic variants potentially predictive of atherosclerosis cardiovascular disease (ASCVD) beyond well-established risk factors. Our analysis identifies genetic variants that improve prediction of ASCVD beyond well-established factors and suggests a potential utility of the identified variants in explaining possible associations between certain metabolites and ASCVD.
Affiliation(s)
- Haileab Hilafu
- Department of Business Analytics and Statistics, University of Tennessee, Knoxville, TN 37996, USA
- Sandra E. Safo
- Division of Biostatistics, University of Minnesota, Minneapolis, MN 55455, USA
- Lillian Haine
- Division of Biostatistics, University of Minnesota, Minneapolis, MN 55455, USA
11. Sieg M, Richter G, Schaefer AS, Kruppa J. Detection of suspicious interactions of spiking covariates in methylation data. BMC Bioinformatics 2020;21:36. PMID: 32000657. PMCID: PMC6993406. DOI: 10.1186/s12859-020-3364-6.
Abstract
BACKGROUND In methylation analyses such as epigenome-wide association studies, a large number of biomarkers are tested for an association between the measured continuous outcome and different covariates. For a continuous covariate such as smoking pack years (SPY), a measure of lifetime exposure to tobacco toxins, a spike at zero can occur: all non-smokers generate a peak at zero, while smokers are distributed over the other SPY values. A spike can also occur on the right side of the covariate distribution if a "heavy smoker" category is defined. Here, we focus on methylation data with a spike at the left or the right of the distribution of a continuous covariate. After the methylation data are generated, analysis usually proceeds through preprocessing, quality control, and determination of differentially methylated sites, often in pipeline fashion: the data are processed by a string of methods available in one software package. These pipelines can distinguish between categorical covariates (e.g., for group comparisons) and continuous covariates (e.g., for linear regression). The differential methylation analysis is often done internally by a linear regression without checking its inherent assumptions; a spike in the continuous covariate is ignored and can cause biased results. RESULTS We reanalysed five data sets, four freely available from ArrayExpress, including methylation data and smoking habits reported as smoking pack years. We developed an algorithm that checks for suspicious interactions between the values at the spike position and the non-spike positions of the covariate. Our algorithm helps to decide whether a suspicious interaction is present and further investigation should be carried out. This is especially important because the identified differentially methylated sites will be used in post-hoc analyses such as pathway analyses.
CONCLUSIONS Our algorithm helps to validate the linear regression assumptions in a methylation analysis pipeline; these assumptions should also be considered for machine learning approaches. In addition, it is able to detect outliers in the continuous covariate. Using our algorithm as a preprocessing step should therefore yield more statistically robust results in methylation analysis.
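The "spike at zero" misspecification is easy to demonstrate: adding an indicator for the spike to the linear model cannot increase, and here clearly reduces, the residual sum of squares. All simulated quantities below are assumed for illustration and are not taken from the reanalysed data sets.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300

# Smoking pack years with a spike at zero: ~60% never-smokers.
spy = np.where(rng.random(n) < 0.6, 0.0,
               rng.gamma(shape=2.0, scale=10.0, size=n))
smoker = (spy > 0).astype(float)

# Simulated methylation value: never-smokers sit off the smokers' trend
# line, so a single linear SPY term is misspecified.
m = 0.5 - 0.15 * smoker + 0.004 * spy + rng.normal(0, 0.05, n)

def rss(design, y):
    """Residual sum of squares of an ordinary least-squares fit."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)

rss_linear = rss(np.column_stack([np.ones(n), spy]), m)
rss_spike = rss(np.column_stack([np.ones(n), spy, smoker]), m)

# The spike indicator absorbs the never-smoker offset:
improved = rss_spike < rss_linear
```

A formal comparison (e.g. an F-test of the nested models) would quantify whether the spike term is needed for a given methylation site.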
Affiliation(s)
- Miriam Sieg
- Charité - University Medicine, corporate member of Freie Universität Berlin, Humboldt-Universität zu Berlin, and Berlin Institute of Health, Institute of Biometry and Clinical Epidemiology, Charitéplatz 1, Berlin, 10117 Germany
- Berlin Institute of Health (BIH), Anna-Louisa-Karsch-Straße 2, Berlin, 10178 Germany
- Gesa Richter
- Berlin Institute of Health (BIH), Anna-Louisa-Karsch-Straße 2, Berlin, 10178 Germany
- Department of Periodontology and Synoptic Dentistry, Institute of Dental, Oral and Maxillary Medicine, Charité - University Medicine, Charitéplatz 1, Berlin, 10117 Germany
- Arne S. Schaefer
- Berlin Institute of Health (BIH), Anna-Louisa-Karsch-Straße 2, Berlin, 10178 Germany
- Department of Periodontology and Synoptic Dentistry, Institute of Dental, Oral and Maxillary Medicine, Charité - University Medicine, Charitéplatz 1, Berlin, 10117 Germany
- Jochen Kruppa
- Charité - University Medicine, corporate member of Freie Universität Berlin, Humboldt-Universität zu Berlin, and Berlin Institute of Health, Institute of Biometry and Clinical Epidemiology, Charitéplatz 1, Berlin, 10117 Germany
- Berlin Institute of Health (BIH), Anna-Louisa-Karsch-Straße 2, Berlin, 10178 Germany

12
Kokla M, Virtanen J, Kolehmainen M, Paananen J, Hanhineva K. Random forest-based imputation outperforms other methods for imputing LC-MS metabolomics data: a comparative study. BMC Bioinformatics 2019; 20:492. [PMID: 31601178] [PMCID: PMC6788053] [DOI: 10.1186/s12859-019-3110-0]
Abstract
BACKGROUND LC-MS technology makes it possible to measure the relative abundance of numerous molecular features of a sample in a single analysis. However, non-targeted metabolite profiling approaches in particular generate vast arrays of data that are prone to aberrations such as missing values. Whatever the reason for the missing values, a coherent and complete data matrix is a prerequisite for accurate and reliable statistical analysis. There is therefore a need for proper imputation strategies that account for the missingness and reduce bias in the statistical analysis. RESULTS Here we present our results from evaluating nine imputation methods at four different percentages of missing values of different origins. The performance of each imputation method was assessed by the Normalized Root Mean Squared Error (NRMSE). We demonstrate that random forest (RF) had the lowest NRMSE when estimating values that were Missing at Random (MAR) or Missing Completely at Random (MCAR). For values Missing Not at Random (MNAR), the left-truncated data were best imputed with minimum-value imputation. We also tested the imputation methods on datasets containing missing data of mixed origin, and RF was the most accurate method in all cases. The results were obtained by repeating the evaluation process 100 times on metabolomics datasets into which missing values were introduced to represent absent data of different origins. CONCLUSION The type and rate of missingness affect the performance and suitability of imputation methods. The RF-based imputation method performs best in most of the tested scenarios, including combinations of different types and rates of missingness. We therefore recommend random forest-based imputation for missing metabolomics data, especially when the types of missingness are not known in advance.
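The study's evaluation design can be sketched in miniature. The sketch below is our own illustration: for brevity it compares only mean and minimum-value imputation (standing in for the full set of nine methods, including random forest, which would plug into the same harness) under a left-censoring pattern, scored with the NRMSE as in the paper.

```python
# Our own miniature of the paper's evaluation design: mask known entries,
# impute them, and score the imputations with the normalized RMSE (NRMSE).

def nrmse(true_vals, imputed_vals):
    """Root mean squared error, normalized by the variance of the truth."""
    n = len(true_vals)
    mu = sum(true_vals) / n
    mse = sum((t - i) ** 2 for t, i in zip(true_vals, imputed_vals)) / n
    var = sum((t - mu) ** 2 for t in true_vals) / n
    return (mse / var) ** 0.5

def impute(column, strategy):
    """Fill None entries with the column mean or the column minimum."""
    obs = [v for v in column if v is not None]
    fill = sum(obs) / len(obs) if strategy == "mean" else min(obs)
    return [fill if v is None else v for v in column]

# Left-censored (MNAR) example: the two smallest values fall below the LOD.
truth = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
masked = [None, None] + truth[2:]
missing_truth = truth[:2]
for strategy in ("mean", "min"):
    filled = impute(masked, strategy)
    print(strategy, nrmse(missing_truth, filled[:2]))
```

On this left-censored toy column, minimum-value imputation produces the lower NRMSE, matching the paper's finding for MNAR data; under MCAR masking the ordering typically reverses.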
Affiliation(s)
- Marietta Kokla
- Institute of Public Health and Clinical Nutrition, University of Eastern Finland, Kuopio Campus, P.O. Box 1627, FI-70211 Kuopio, Finland
- Jyrki Virtanen
- Institute of Public Health and Clinical Nutrition, University of Eastern Finland, Kuopio Campus, P.O. Box 1627, FI-70211 Kuopio, Finland
- Marjukka Kolehmainen
- Institute of Public Health and Clinical Nutrition, University of Eastern Finland, Kuopio Campus, P.O. Box 1627, FI-70211 Kuopio, Finland
- VTT Technical Research Centre of Finland Ltd, P.O. Box 1000, FI-02044 VTT Espoo, Finland
- Jussi Paananen
- Institute of Biomedicine, University of Eastern Finland, Kuopio Campus, P.O. Box 1627, FI-70211 Kuopio, Finland
- Kati Hanhineva
- Institute of Public Health and Clinical Nutrition, University of Eastern Finland, Kuopio Campus, P.O. Box 1627, FI-70211 Kuopio, Finland

13
Abstract
Combining individual p-values to aggregate multiple small effects has been of long-standing interest in statistics, dating back to the classic Fisher's combination test. In modern large-scale data analysis, correlation and sparsity are common features and efficient computation is a necessary requirement for dealing with massive data. To overcome these challenges, we propose a new test that takes advantage of the Cauchy distribution. Our test statistic has a simple form and is defined as a weighted sum of Cauchy transformations of individual p-values. We prove a non-asymptotic result that the tail of the null distribution of our proposed test statistic can be well approximated by a Cauchy distribution under arbitrary dependency structures. Based on this theoretical result, the p-value calculation of our proposed test is not only accurate, but also as simple as the classic z-test or t-test, making our test well suited for analyzing massive data. We further show that the power of the proposed test is asymptotically optimal in a strong sparsity setting. Extensive simulations demonstrate that the proposed test has both strong power against sparse alternatives and a good accuracy with respect to p-value calculations, especially for very small p-values. The proposed test has also been applied to a genome-wide association study of Crohn's disease and compared with several existing tests.
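The statistic has a simple closed form, which makes a sketch straightforward; the equal weights and the example p-values below are our own choices:

```python
import math

# Sketch of the Cauchy combination statistic described above: a weighted sum
# of Cauchy-transformed p-values, whose null tail is approximately standard
# Cauchy under arbitrary dependence, giving an analytic combined p-value.

def cauchy_combination(pvals, weights=None):
    if weights is None:
        weights = [1.0 / len(pvals)] * len(pvals)  # equal weights (our choice)
    t = sum(w * math.tan((0.5 - p) * math.pi)
            for w, p in zip(weights, pvals))
    # survival function of the standard Cauchy distribution
    return 0.5 - math.atan(t) / math.pi

# One very small p-value dominates, as in sparse-signal settings:
print(cauchy_combination([1e-6, 0.5, 0.5]))
```

Because the heavy Cauchy tail lets a single tiny p-value dominate the sum, the combined p-value stays small under sparse alternatives, which is the power property the abstract describes.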
Affiliation(s)
- Yaowu Liu
- Department of Biostatistics, Harvard School of Public Health
- Jun Xie
- Department of Statistics, Purdue University

14
Lu J, Shi P, Li H. Generalized linear models with linear constraints for microbiome compositional data. Biometrics 2018; 75:235-244. [PMID: 30039859] [DOI: 10.1111/biom.12956]
Abstract
Motivated by regression analysis for microbiome compositional data, this article considers generalized linear regression analysis with compositional covariates, where a group of linear constraints on regression coefficients are imposed to account for the compositional nature of the data and to achieve subcompositional coherence. A penalized likelihood estimation procedure using a generalized accelerated proximal gradient method is developed to efficiently estimate the regression coefficients. A de-biased procedure is developed to obtain asymptotically unbiased and normally distributed estimates, which leads to valid confidence intervals of the regression coefficients. Simulation results show the correctness of the coverage probability of the confidence intervals and smaller variances of the estimates when the appropriate linear constraints are imposed. The methods are illustrated by a microbiome study in order to identify bacterial species that are associated with inflammatory bowel disease (IBD) and to predict IBD using the fecal microbiome.
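The zero-sum constraint that yields subcompositional coherence can be sketched with a toy optimizer. This is our own projected gradient descent on a plain least-squares objective, not the paper's penalized accelerated proximal gradient procedure; the toy design matrix is also ours.

```python
# Our own toy illustration (plain projected gradient descent, not the paper's
# penalized accelerated proximal gradient method): least squares with the
# zero-sum constraint sum(beta) = 0 used for compositional covariates.

def constrained_lstsq(X, y, lr=0.01, steps=5000):
    n, p = len(X), len(X[0])
    beta = [0.0] * p  # starts on the constraint set: sum(beta) == 0
    for _ in range(steps):
        grad = [0.0] * p
        for xi, yi in zip(X, y):
            r = sum(b * x for b, x in zip(beta, xi)) - yi
            for j in range(p):
                grad[j] += 2.0 * r * xi[j] / n
        # projecting out the gradient's mean keeps sum(beta) at zero
        g_mean = sum(grad) / p
        beta = [b - lr * (g - g_mean) for b, g in zip(beta, grad)]
    return beta

# Toy design whose constrained least-squares solution is beta = (1, -1, 0):
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
y = [1.0, -1.0, 0.0, 0.0]
beta = constrained_lstsq(X, y)
```

Updating only within the zero-sum subspace is what makes the fitted coefficients invariant to the scale of the underlying composition, the coherence property the paper exploits.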
Affiliation(s)
- Jiarui Lu
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania 19104, U.S.A
- Pixu Shi
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania 19104, U.S.A
- Hongzhe Li
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania 19104, U.S.A

15
Martínez-Ávila JC, García Bartolomé A, García I, Dapía I, Tong HY, Díaz L, Guerra P, Frías J, Carcás Sansuan AJ, Borobia AM. Pharmacometabolomics applied to zonisamide pharmacokinetic parameter prediction. Metabolomics 2018; 14:70. [PMID: 30830352] [DOI: 10.1007/s11306-018-1365-5]
Abstract
INTRODUCTION Zonisamide is a new-generation anticonvulsant antiepileptic drug metabolized primarily in the liver, with subsequent elimination via the renal route. OBJECTIVES Our objective was to evaluate the utility of pharmacometabolomics for detecting zonisamide metabolites that could be related to its disposition and, therefore, to its efficacy and toxicity. METHODS This study was nested within a bioequivalence clinical trial with 28 healthy volunteers. Each participant received a single dose of zonisamide on two separate occasions (period 1 and period 2), with a washout period between them. Blood samples were obtained from all volunteers at baseline in each period, before any medication was administered, for metabolomics analysis. RESULTS After a Lasso regression was applied, age, height, branched-chain amino acids, steroids, triacylglycerols, diacyl glycerophosphoethanolamine, glycerophospholipids susceptible to methylation, phosphatidylcholines with 20:4 FA (arachidonic acid), cholesterol esters, and lysophosphatidylcholine were selected in both periods. CONCLUSION To our knowledge, this is the only research study to date that has attempted to link basal metabolomic status with the pharmacokinetic parameters of zonisamide.
Affiliation(s)
- J C Martínez-Ávila
- Clinical Pharmacology Department, La Paz University Hospital, School of Medicine, IdiPAZ, Autonomous University of Madrid, Madrid, Spain
- A García Bartolomé
- Clinical Pharmacology Department, La Paz University Hospital, School of Medicine, IdiPAZ, Autonomous University of Madrid, Madrid, Spain
- I García
- Clinical Pharmacology Department, La Paz University Hospital, School of Medicine, IdiPAZ, Autonomous University of Madrid, Madrid, Spain
- I Dapía
- Medical and Molecular Genetics Institute (INGEMM), La Paz University Hospital, Rare Diseases Networking Biomedical Research Center (CIBERER), ISCIII, Madrid, Spain
- Hoi Y Tong
- Clinical Pharmacology Department, La Paz University Hospital, School of Medicine, IdiPAZ, Autonomous University of Madrid, Madrid, Spain
- L Díaz
- Clinical Pharmacology Department, La Paz University Hospital, School of Medicine, IdiPAZ, Autonomous University of Madrid, Madrid, Spain
- P Guerra
- Clinical Pharmacology Department, La Paz University Hospital, School of Medicine, IdiPAZ, Autonomous University of Madrid, Madrid, Spain
- J Frías
- Clinical Pharmacology Department, La Paz University Hospital, School of Medicine, IdiPAZ, Autonomous University of Madrid, Madrid, Spain
- A J Carcás Sansuan
- Clinical Pharmacology Department, La Paz University Hospital, School of Medicine, IdiPAZ, Autonomous University of Madrid, Madrid, Spain
- A M Borobia
- Clinical Pharmacology Department, La Paz University Hospital, School of Medicine, IdiPAZ, Autonomous University of Madrid, Madrid, Spain

16
Abstract
Feature screening plays an important role in ultrahigh dimensional data analysis. This paper is concerned with conditional feature screening when one is interested in detecting the association between the response and ultrahigh dimensional predictors (e.g., genetic markers) given a low-dimensional exposure variable (such as clinical variables or environmental variables). To this end, we first propose a new index to measure conditional independence, and further develop a conditional screening procedure based on the newly proposed index. We systematically study the theoretical property of the proposed procedure and establish the sure screening and ranking consistency properties under some very mild conditions. The newly proposed screening procedure enjoys some appealing properties. (a) It is model-free in that its implementation does not require a specification on the model structure; (b) it is robust to heavy-tailed distributions or outliers in both directions of response and predictors; and (c) it can deal with both feature screening and the conditional screening in a unified way. We study the finite sample performance of the proposed procedure by Monte Carlo simulations and further illustrate the proposed method through two real data examples.
Affiliation(s)
- Luheng Wang
- School of Mathematics, Beijing Normal University, Beijing 100875, P.R. China
- Jingyuan Liu
- Department of Statistics, School of Economics, Wang Yanan Institute for Studies in Economics and Fujian Key Laboratory of Statistical Science, Xiamen University, Xiamen 361005, China
- Yong Li
- School of Statistics, Beijing Normal University, Beijing 100875, P.R. China
- Runze Li
- Department of Statistics and The Methodology Center, Pennsylvania State University, University Park, PA 16802-2111, USA

17
Shah JS, Rai SN, DeFilippis AP, Hill BG, Bhatnagar A, Brock GN. Distribution based nearest neighbor imputation for truncated high dimensional data with applications to pre-clinical and clinical metabolomics studies. BMC Bioinformatics 2017; 18:114. [PMID: 28219348] [PMCID: PMC5319174] [DOI: 10.1186/s12859-017-1547-6]
Abstract
BACKGROUND High throughput metabolomics makes it possible to measure the relative abundances of numerous metabolites in biological samples, which is useful to many areas of biomedical research. However, missing values (MVs) in metabolomics datasets are common and can arise due to both technical and biological reasons. Typically, such MVs are substituted by a minimum value, which may lead to different results in downstream analyses. RESULTS Here we present a modified version of the K-nearest neighbor (KNN) approach which accounts for truncation at the minimum value, i.e., KNN truncation (KNN-TN). We compare imputation results based on KNN-TN with results from other KNN approaches such as KNN based on correlation (KNN-CR) and KNN based on Euclidean distance (KNN-EU). Our approach assumes that the data follow a truncated normal distribution with the truncation point at the detection limit (LOD). The effectiveness of each approach was analyzed by the root mean square error (RMSE) measure as well as the metabolite list concordance index (MLCI) for influence on downstream statistical testing. Through extensive simulation studies and application to three real data sets, we show that KNN-TN has lower RMSE values compared to the other two KNN procedures as well as simpler imputation methods based on substituting missing values with the metabolite mean, zero values, or the LOD. MLCI values between KNN-TN and KNN-EU were roughly equivalent, and superior to the other four methods in most cases. CONCLUSION Our findings demonstrate that KNN-TN generally has improved performance in imputing the missing values of the different datasets compared to KNN-CR and KNN-EU when there is missingness due to missing at random combined with an LOD. The results shown in this study are in the field of metabolomics, but the method could be applied to any high-throughput technology with missingness due to an LOD.
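A minimal KNN imputation sketch is below. This is our own illustration closest in spirit to the KNN-EU baseline; the paper's KNN-TN additionally estimates each metabolite's mean and standard deviation by truncated-normal maximum likelihood (truncation at the LOD) and standardizes the data with those estimates before the neighbor search.

```python
# Our own minimal KNN-EU-style sketch, not the paper's KNN-TN code. KNN-TN
# would first standardize each metabolite using truncated-normal ML estimates
# of its mean and SD (truncation point at the LOD) before the neighbor search.

def euclidean(a, b):
    """Distance over coordinates observed in both samples (None = missing)."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return float("inf")
    return (sum((x - y) ** 2 for x, y in pairs) / len(pairs)) ** 0.5

def knn_impute(rows, k=2):
    out = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, v in enumerate(row):
            if v is not None:
                continue
            # neighbors: other samples in which this feature was observed
            cands = sorted((euclidean(row, other), other[j])
                           for t, other in enumerate(rows)
                           if t != i and other[j] is not None)
            nearest = [val for _, val in cands[:k]]
            out[i][j] = sum(nearest) / len(nearest)
    return out
```

Averaging the k nearest observed values is the shared core of KNN-CR, KNN-EU, and KNN-TN; the three variants differ only in how samples are standardized and how distance is defined.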
Affiliation(s)
- Jasmit S Shah
- Department of Bioinformatics and Biostatistics, University of Louisville, Louisville, KY, 40202, USA
- Department of Medicine, Division of Cardiovascular Medicine, Diabetes and Obesity Center, University of Louisville, Louisville, KY, 40202, USA
- Shesh N Rai
- Department of Bioinformatics and Biostatistics, University of Louisville, Louisville, KY, 40202, USA
- Andrew P DeFilippis
- Department of Medicine, Division of Cardiovascular Medicine, Diabetes and Obesity Center, University of Louisville, Louisville, KY, 40202, USA
- Bradford G Hill
- Department of Medicine, Division of Cardiovascular Medicine, Diabetes and Obesity Center, University of Louisville, Louisville, KY, 40202, USA
- Aruni Bhatnagar
- Department of Medicine, Division of Cardiovascular Medicine, Diabetes and Obesity Center, University of Louisville, Louisville, KY, 40202, USA
- Guy N Brock
- Department of Bioinformatics and Biostatistics, University of Louisville, Louisville, KY, 40202, USA
- Present affiliation: Department of Biomedical Informatics, The Ohio State University, Columbus, OH, 43210, USA

18
Abstract
In linear regression models with high dimensional data, the classical z-test (or t-test) for testing the significance of each single regression coefficient is no longer applicable. This is mainly because the number of covariates exceeds the sample size. In this paper, we propose a simple and novel alternative by introducing the Correlated Predictors Screening (CPS) method to control for predictors that are highly correlated with the target covariate. Accordingly, the classical ordinary least squares approach can be employed to estimate the regression coefficient associated with the target covariate. In addition, we demonstrate that the resulting estimator is consistent and asymptotically normal even if the random errors are heteroscedastic. This enables us to apply the z-test to assess the significance of each covariate. Based on the p-value obtained from testing the significance of each covariate, we further conduct multiple hypothesis testing by controlling the false discovery rate at the nominal level. Then, we show that the multiple hypothesis testing achieves consistent model selection. Simulation studies and empirical examples are presented to illustrate the finite sample performance and the usefulness of the proposed method, respectively.
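The final multiple-testing step, controlling the false discovery rate across the per-covariate p-values, can be sketched with the standard Benjamini-Hochberg step-up rule; the CPS screening and the OLS estimation themselves are omitted here, and the example p-values are ours.

```python
# Sketch of the FDR-control step described above, using the standard
# Benjamini-Hochberg step-up rule (the CPS screening itself is omitted).

def benjamini_hochberg(pvals, q=0.05):
    """Return a rejection flag per hypothesis, controlling FDR at level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            k = rank  # largest rank passing the step-up criterion
    rejected = set(order[:k])
    return [i in rejected for i in range(m)]

print(benjamini_hochberg([0.001, 0.02, 0.8, 0.9]))
```

Because the step-up rule keeps every hypothesis ranked at or below the largest passing rank, a moderate p-value can still be rejected when smaller p-values accompany it, which is what drives the consistent model selection claimed in the abstract.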
Affiliation(s)
- Wei Lan
- Statistics School and Center of Statistical Research, Southwestern University of Finance and Economics, Chengdu, PR China
- Ping-Shou Zhong
- Department of Statistics and Probability, Michigan State University, East Lansing, MI 48823, United States
- Runze Li
- Department of Statistics and the Methodology Center, The Pennsylvania State University, University Park, PA 16802-2111, United States
- Hansheng Wang
- Department of Business Statistics and Econometrics, Guanghua School of Management, Peking University, Beijing, 100871, PR China
- Chih-Ling Tsai
- Graduate School of Management, University of California, Davis, CA 95616-8609, United States

19
Abstract
BACKGROUND Clustering is a widely used collection of unsupervised learning techniques for identifying natural classes within a data set. It is often used in bioinformatics to infer population substructure. Genomic data are often categorical and high dimensional, e.g., long sequences of nucleotides. This makes inference challenging: The distance metric is often not well-defined on categorical data; running time for computations using high dimensional data can be considerable; and the Curse of Dimensionality often impedes the interpretation of the results. Up to the present, however, the literature and software addressing clustering for categorical data have not yet led to a standard approach. RESULTS We present software for an ensemble method that performs well in comparison with other methods regardless of the dimensionality of the data. In an ensemble method a variety of instantiations of a statistical object are found and then combined into a consensus value. It has been known for decades that ensembling generally outperforms the components that comprise it in many settings. Here, we apply this ensembling principle to clustering. We begin by generating many hierarchical clusterings with different clustering sizes. When the dimension of the data is high, we also randomly select subspaces of variable size to generate clusterings. Then, we combine these clusterings into a single membership matrix and use this to obtain a new, ensembled dissimilarity matrix using Hamming distance. CONCLUSIONS Ensemble clustering, as implemented in R and called EnsCat, gives more clearly separated clusters than other clustering techniques for categorical data. The latest version with manual and examples is available at https://github.com/jlp2duke/EnsCat .
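The ensembling step is easy to sketch: stack several cluster-label vectors into a membership matrix and convert them into a Hamming dissimilarity. This is our own minimal rendering of the idea; EnsCat's actual R implementation (generation of the base hierarchical clusterings, random subspaces) is in the linked repository.

```python
# Minimal sketch of the ensembling step described above (our rendering, not
# EnsCat's code): several base clusterings are combined into one Hamming
# dissimilarity matrix, which a final hierarchical clustering can consume.

def hamming_dissimilarity(clusterings):
    """clusterings: list of label vectors, one per base clustering."""
    m, n = len(clusterings), len(clusterings[0])
    return [[sum(c[i] != c[j] for c in clusterings) / m
             for j in range(n)] for i in range(n)]

# Three base clusterings of four objects; objects 0 and 1 usually co-cluster.
labels = [[0, 0, 1, 1], [0, 0, 1, 2], [0, 1, 1, 1]]
d = hamming_dissimilarity(labels)
```

Each entry is simply the fraction of base clusterings that separate a pair of objects, so the matrix is well-defined for categorical data where no natural metric exists.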
Affiliation(s)
- Bertrand S. Clarke
- Department of Statistics, University of Nebraska-Lincoln, Lincoln, NE, USA
- Saeid Amiri
- Department of Natural and Applied Sciences, University of Wisconsin Madison, Iowa City, IA, USA
- Jennifer L. Clarke
- Department of Statistics, University of Nebraska-Lincoln, Lincoln, NE, USA
- Department of Food Science and Technology, University of Nebraska-Lincoln, Lincoln, NE, USA

20
Zhao LP, Bolouri H. Object-oriented regression for building predictive models with high dimensional omics data from translational studies. J Biomed Inform 2016; 60:431-45. [PMID: 26972839] [PMCID: PMC5097461] [DOI: 10.1016/j.jbi.2016.03.001]
Abstract
Maturing omics technologies enable researchers to generate high-dimensional omics data (HDOD) routinely in translational clinical studies. In the field of oncology, The Cancer Genome Atlas (TCGA) provided funding support to researchers to generate different types of omics data on a common set of biospecimens with accompanying clinical data and has made the data available for the research community to mine. One important application, and the focus of this manuscript, is to build predictive models for prognostic outcomes based on HDOD. To complement prevailing regression-based approaches, we propose to use an object-oriented regression (OOR) methodology to identify exemplars specified by HDOD patterns and to assess their associations with prognostic outcome. Through computing a patient's similarities to these exemplars, the OOR-based predictive model produces a risk estimate using a patient's HDOD. The primary advantages of OOR are twofold: reducing the penalty of high dimensionality and retaining the interpretability to clinical practitioners. To illustrate its utility, we apply OOR to gene expression data from non-small cell lung cancer patients in TCGA and build a predictive model for prognostic survivorship among stage I patients, i.e., we stratify these patients by their prognostic survival risks beyond histological classifications. Identification of these high-risk patients helps oncologists to develop effective treatment protocols and post-treatment disease management plans. Using the TCGA data, the total sample is divided into training and validation data sets. After building a predictive model in the training set, we compute risk scores from the predictive model, and validate associations of risk scores with prognostic outcome in the validation data (P-value=0.015).
Affiliation(s)
- Lue Ping Zhao
- Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, WA, United States; Department of Biostatistics and Epidemiology, University of Washington School of Public Health, Seattle, WA, United States
- Hamid Bolouri
- Division of Human Biology, Fred Hutchinson Cancer Research Center, Seattle, WA, United States

21
Abstract
In this manuscript we consider the problem of jointly estimating multiple graphical models in high dimensions. We assume that the data are collected from n subjects, each of which consists of T possibly dependent observations. The graphical models of subjects vary, but are assumed to change smoothly corresponding to a measure of closeness between subjects. We propose a kernel based method for jointly estimating all graphical models. Theoretically, under a double asymptotic framework, where both (T, n) and the dimension d can increase, we provide the explicit rate of convergence in parameter estimation. It characterizes the strength one can borrow across different individuals and the impact of data dependence on parameter estimation. Empirically, experiments on both synthetic and real resting state functional magnetic resonance imaging (rs-fMRI) data illustrate the effectiveness of the proposed method.
Affiliation(s)
- Fang Han
- Johns Hopkins University, Baltimore, USA
- Han Liu
- Princeton University, Princeton, USA

22
Engel J, Blanchet L, Bloemen B, van den Heuvel LP, Engelke UHF, Wevers RA, Buydens LMC. Regularized MANOVA (rMANOVA) in untargeted metabolomics. Anal Chim Acta 2015; 899:1-12. [PMID: 26547490] [DOI: 10.1016/j.aca.2015.06.042]
Abstract
Many advanced metabolomics experiments currently yield data in which a large number of response variables are measured while one or several factors are changed. Often the number of response variables vastly exceeds the sample size and well-established techniques such as multivariate analysis of variance (MANOVA) cannot be used to analyze the data. ANOVA simultaneous component analysis (ASCA) is an alternative to MANOVA for analysis of metabolomics data from an experimental design. In this paper, we show that ASCA assumes that none of the metabolites are correlated and that they all have the same variance. Because of these assumptions, ASCA may relate the wrong variables to a factor. This reduces the power of the method and hampers interpretation. We propose an improved model that is essentially a weighted average of the ASCA and MANOVA models. The optimal weight is determined in a data-driven fashion. Compared to ASCA, this method assumes that variables can correlate, leading to a more realistic view of the data. Compared to MANOVA, the model is also applicable when the number of samples is (much) smaller than the number of variables. These advantages are demonstrated by means of simulated and real data examples. The source code of the method is available from the first author upon request, and at the following github repository: https://github.com/JasperE/regularized-MANOVA.
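The core idea, a weighted compromise between the full MANOVA covariance and ASCA's equal-variance, zero-correlation assumption, can be sketched as a simple shrinkage of the covariance matrix. The fixed weight `w` below is our assumption for illustration; the paper chooses the weight in a data-driven fashion.

```python
# Sketch of the compromise described above (our fixed weight w; the paper
# selects the weight data-drivenly): shrink the full (MANOVA) covariance
# toward the ASCA assumption of equal, uncorrelated variances.

def regularized_cov(S, w):
    """w = 0 keeps the MANOVA covariance; w = 1 gives the ASCA-style target."""
    p = len(S)
    avg_var = sum(S[i][i] for i in range(p)) / p
    return [[(1.0 - w) * S[i][j] + (w * avg_var if i == j else 0.0)
             for j in range(p)] for i in range(p)]

S = [[4.0, 2.0], [2.0, 1.0]]      # singular sample covariance (rank 1)
print(regularized_cov(S, 0.5))    # off-diagonals shrink, diagonal averages in
```

Shrinking toward the diagonal target makes the matrix invertible even when samples are far fewer than variables, which is why the compromise remains usable where plain MANOVA breaks down.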
Affiliation(s)
- J Engel
- Radboud University Nijmegen, Institute for Molecules and Materials, Heyendaalseweg 135, Nijmegen, The Netherlands; Translational Metabolic Laboratory at the Department of Laboratory Medicine, Radboud University Medical Centre, Geert Grooteplein 10, Nijmegen, The Netherlands
- L Blanchet
- Radboud University Nijmegen, Institute for Molecules and Materials, Heyendaalseweg 135, Nijmegen, The Netherlands; Department of Biochemistry, Nijmegen Centre for Molecular Life Sciences, Radboud University Medical Centre, Geert Grooteplein 10, Nijmegen, The Netherlands
- B Bloemen
- Radboud University Nijmegen, Institute for Molecules and Materials, Heyendaalseweg 135, Nijmegen, The Netherlands
- L P van den Heuvel
- Translational Metabolic Laboratory at the Department of Laboratory Medicine, Radboud University Medical Centre, Geert Grooteplein 10, Nijmegen, The Netherlands
- U H F Engelke
- Translational Metabolic Laboratory at the Department of Laboratory Medicine, Radboud University Medical Centre, Geert Grooteplein 10, Nijmegen, The Netherlands
- R A Wevers
- Translational Metabolic Laboratory at the Department of Laboratory Medicine, Radboud University Medical Centre, Geert Grooteplein 10, Nijmegen, The Netherlands
- L M C Buydens
- Radboud University Nijmegen, Institute for Molecules and Materials, Heyendaalseweg 135, Nijmegen, The Netherlands

23
Kim J, Wozniak JR, Mueller BA, Shen X, Pan W. Comparison of statistical tests for group differences in brain functional networks. Neuroimage 2014; 101:681-94. [PMID: 25086298] [PMCID: PMC4165845] [DOI: 10.1016/j.neuroimage.2014.07.031]
Abstract
Brain functional connectivity has been studied by analyzing time series correlations in regional brain activities based on resting-state fMRI data. Brain functional connectivity can be depicted as a network or graph defined as a set of nodes linked by edges. Nodes represent brain regions and an edge measures the strength of functional correlation between two regions. Most existing work focuses on estimation of such a network. A key but inadequately addressed question is how to test for possible differences of the networks between two subject groups, say between healthy controls and patients. Here we illustrate and compare the performance of several state-of-the-art statistical tests drawn from the neuroimaging, genetics, ecology and high-dimensional data literatures. Both real and simulated data were used to evaluate the methods. We found that Network Based Statistic (NBS) performed well in many but not all situations, and its performance critically depends on the choice of its threshold parameter, which is unknown and difficult to choose in practice. Importantly, two adaptive statistical tests called adaptive sum of powered score (aSPU) and its weighted version (aSPUw) are easy to use and complementary to NBS, being higher powered than NBS in some situations. The aSPU and aSPUw tests can also be applied to adjust for covariates. The aSPU and aSPUw tests often, but not always, performed similarly, with neither one a uniform winner. On the other hand, Multivariate Matrix Distance Regression (MDMR) has been applied to detect group differences for brain connectivity; with the usual choice of the Euclidean distance, MDMR is a special case of the aSPU test. Consequently NBS, aSPU and aSPUw tests are recommended to test for group differences in functional connectivity.
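A single sum-of-powered-score (SPU) test can be sketched with a permutation p-value. The code below is our own simplified version, operating directly on edge-wise group mean differences; the published aSPU additionally takes the minimum p-value over several powers γ, calibrated with the same permutations.

```python
import random

# Our simplified permutation sketch of one SPU test (not the published aSPU,
# which adaptively combines several powers gamma over the same permutations).

def spu_stat(ga, gb, gamma):
    """Sum of powered edge-wise group mean differences."""
    diffs = [sum(e[i] for e in ga) / len(ga) - sum(e[i] for e in gb) / len(gb)
             for i in range(len(ga[0]))]
    if gamma == float("inf"):
        return max(abs(d) for d in diffs)
    return abs(sum(d ** gamma for d in diffs))

def spu_pvalue(ga, gb, gamma=2, n_perm=500, seed=1):
    """ga, gb: per-subject edge-weight vectors for the two groups."""
    rng = random.Random(seed)
    obs = spu_stat(ga, gb, gamma)
    pooled = ga + gb
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # relabel subjects under the null
        if spu_stat(pooled[:len(ga)], pooled[len(ga):], gamma) >= obs:
            hits += 1
    return (1 + hits) / (1 + n_perm)
```

Small even powers spread weight over many weak edge differences while large powers concentrate on the strongest edge, which is why the adaptive version, taking the best-performing power, is hard to beat uniformly.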
Affiliation(s)
- Junghi Kim
- Division of Biostatistics, University of Minnesota, USA
- Wei Pan
- Division of Biostatistics, University of Minnesota, USA

24
Magnotti JF, Billor N. Finding multivariate outliers in fMRI time-series data. Comput Biol Med 2014; 53:115-24. [PMID: 25129023] [DOI: 10.1016/j.compbiomed.2014.05.010]
Abstract
A fundamental challenge for researchers studying the brain is to explain how distributed patterns of brain activity relate to a specific representation or computation. Multivariate techniques are therefore becoming increasingly popular for pattern localization of functional magnetic resonance imaging (fMRI) data. The increased power of these techniques can be offset by their susceptibility to multivariate outliers, a problem not directly encountered when fMRI data are analyzed in more common univariate analysis techniques. We test how two algorithms, High Dimensional Blocked Adaptive Computationally Efficient Outlier Nominators (HD BACON) and Principal Component based Outlier detection (PCOut), can detect multivariate outliers in high-dimensional fMRI data, in which the number of variables is larger than the number of observations. We show how these methods can be applied to individual, voxel time-series to identify outlying voxels within a region of interest. Finally, we compare these methods with simulated data to identify which aspects of the data each method is most sensitive to. Voxels identified by both algorithms were primarily on the edges of univariate activation clusters and near the boundaries between different tissue types. Simulation results showed that PCOut outperformed HD BACON, maintaining both high sensitivity and specificity across a wide range of outlier contamination percentages. Our results suggest that multivariate analysis of fMRI can benefit from including multivariate outlier detection as a routine data quality check prior to model fitting.
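A much-simplified stand-in for these detectors is sketched below: per-dimension robust z-scores from the median and MAD, combined into one multivariate score. Both PCOut and HD BACON are considerably more sophisticated (PCOut works in a reduced principal-component space, HD BACON grows clean blocks adaptively); the cutoff and the 1.4826 scaling constant are conventional choices, not taken from the paper.

```python
# A much-simplified stand-in for the multivariate outlier detectors discussed
# above (NOT PCOut or HD BACON): robust per-dimension z-scores from the median
# and MAD, combined across dimensions into a single outlyingness score.

def mad_outliers(rows, cut=3.5):
    """Flag rows whose combined robust z-score exceeds `cut`."""
    cols = list(zip(*rows))
    meds = [sorted(c)[len(c) // 2] for c in cols]
    # median absolute deviation per column; guard against a zero MAD
    mads = [sorted(abs(x - m) for x in c)[len(c) // 2] or 1e-9
            for c, m in zip(cols, meds)]
    flags = []
    for r in rows:
        z2 = sum(((x - m) / (1.4826 * s)) ** 2
                 for x, m, s in zip(r, meds, mads))
        flags.append((z2 / len(r)) ** 0.5 > cut)
    return flags
```

Median/MAD scaling keeps the location and scale estimates themselves from being dragged by the outliers, the same robustness requirement that motivates the two published algorithms.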