101
A recursive framework for predicting the time-course of drug sensitivity. Sci Rep 2020; 10:17682. [PMID: 33077880 PMCID: PMC7573611 DOI: 10.1038/s41598-020-74725-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/26/2020] [Accepted: 10/05/2020] [Indexed: 11/08/2022] Open
Abstract
The biological processes involved in a drug’s mechanisms of action are oftentimes dynamic, complex and difficult to discern. Time-course gene expression data is a rich source of information that can be used to unravel these complex processes, identify biomarkers of drug sensitivity and predict the response to a drug. However, the majority of previous work has not fully utilized this temporal dimension. In these studies, the gene expression data is either considered at one time-point (before the administration of the drug) or two time-points (before and after the administration of the drug). This is clearly inadequate in modeling dynamic gene–drug interactions, especially for applications such as long-term drug therapy. In this work, we present a novel REcursive Prediction (REP) framework for drug response prediction by taking advantage of time-course gene expression data. Our goal is to predict drug response values at every stage of a long-term treatment, given the expression levels of genes collected in the previous time-points. To this end, REP employs a built-in recursive structure that exploits the intrinsic time-course nature of the data and integrates past values of drug responses for subsequent predictions. It also incorporates tensor completion that can not only alleviate the impact of noise and missing data, but also predict unseen gene expression levels (GEXs). These advantages enable REP to estimate drug response at any stage of a given treatment from some GEXs measured in the beginning of the treatment. Extensive experiments on two datasets corresponding to multiple sclerosis patients treated with interferon are included to showcase the effectiveness of REP.

102
Xu D, Zhang J, Xu H, Zhang Y, Chen W, Gao R, Dehmer M. Multi-scale supervised clustering-based feature selection for tumor classification and identification of biomarkers and targets on genomic data. BMC Genomics 2020; 21:650. [PMID: 32962626 PMCID: PMC7510277 DOI: 10.1186/s12864-020-07038-3] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 03/18/2020] [Accepted: 08/30/2020] [Indexed: 12/19/2022] Open
Abstract
Background: The small number of samples and the curse of dimensionality hamper the application of deep learning techniques to disease classification. Additionally, the performance of clustering-based feature selection algorithms is still far from satisfactory because they are limited to unsupervised learning methods. To enhance interpretability and overcome this problem, we developed a novel feature selection algorithm. Meanwhile, complex genomic data pose great challenges for the identification of biomarkers and therapeutic targets, and some current feature selection methods suffer from low sensitivity and specificity in this field. Results: We designed a multi-scale clustering-based feature selection algorithm named MCBFS, which simultaneously performs feature selection and model learning for genomic data analysis. Comparison with seven benchmark and six state-of-the-art supervised methods on eight data sets showed that MCBFS is robust and effective. The visualization results and the statistical test showed that MCBFS can capture informative genes and improve the interpretability and visualization of tumor gene expression and single-cell sequencing data. Additionally, we developed a general framework named McbfsNW that uses gene expression data and protein interaction data to identify robust biomarkers and therapeutic targets for the diagnosis and therapy of diseases. The framework incorporates the MCBFS algorithm, a network recognition ensemble algorithm and a feature selection wrapper. McbfsNW was applied to lung adenocarcinoma (LUAD) data sets. The preliminary results demonstrated that the identified biomarkers yield better prediction results on the independent LUAD data set, and we also constructed a drug-target network that may be useful for LUAD therapy.
Conclusions: The proposed feature selection method is robust and effective for gene selection, classification, and visualization. The McbfsNW framework is practical and helpful for the identification of biomarkers and targets in genomic data. We believe that the same methods and principles can be extended to other kinds of data sets.
Affiliation(s)
- Da Xu: School of Mathematics and Statistics, Shandong University, Weihai, 264209, China
- Jialin Zhang: School of Mathematics and Statistics, Shandong University, Weihai, 264209, China
- Hanxiao Xu: School of Mathematics and Statistics, Shandong University, Weihai, 264209, China
- Yusen Zhang: School of Mathematics and Statistics, Shandong University, Weihai, 264209, China
- Wei Chen: School of Mathematics and Statistics, Shandong University, Weihai, 264209, China
- Rui Gao: School of Control Science and Engineering, Shandong University, Jinan, 250061, China
- Matthias Dehmer: Institute for Intelligent Production, Faculty for Management, University of Applied Sciences Upper Austria, Steyr Campus, Steyr, Austria; College of Computer and Control Engineering, Nankai University, Tianjin, 300071, China; Department of Mechatronics and Biomedical Computer Science, UMIT, Hall in Tyrol, Austria

103
Zhang P, Xia Q, Liu L, Li S, Dong L. Current Opinion on Molecular Characterization for GBM Classification in Guiding Clinical Diagnosis, Prognosis, and Therapy. Front Mol Biosci 2020; 7:562798. [PMID: 33102518 PMCID: PMC7506064 DOI: 10.3389/fmolb.2020.562798] [Citation(s) in RCA: 104] [Impact Index Per Article: 20.8] [Received: 05/18/2020] [Accepted: 08/18/2020] [Indexed: 12/11/2022] Open
Abstract
Glioblastoma (GBM) is highly invasive and is the deadliest brain tumor in adults. It is characterized by inter-tumor and intra-tumor heterogeneity, short patient survival, and lack of effective treatment. Prognosis and therapy selection are driven by molecular data from gene transcription, genetic alterations and DNA methylation. The four GBM molecular subtypes are proneural, neural, classical, and mesenchymal. More effective personalized therapy depends heavily on higher-resolution molecular subtype signatures, combined with gene therapy, immunotherapy and organoid technology. In this review, we summarize the principal GBM molecular classifications that guide diagnosis, prognosis, and therapeutic recommendations.
Affiliation(s)
- Pei Zhang: School of Life Sciences, Beijing Institute of Technology, Beijing, China
- Qin Xia: School of Life Sciences, Beijing Institute of Technology, Beijing, China
- Liqun Liu: School of Life Sciences, Beijing Institute of Technology, Beijing, China
- Shouwei Li: Department of Neurosurgery, Sanbo Brain Hospital, Capital Medical University, Beijing, China
- Lei Dong: School of Life Sciences, Beijing Institute of Technology, Beijing, China

104
Classification of gene expression patterns using a novel type-2 fuzzy multigranulation-based SVM model for the recognition of cancer mediating biomarkers. Neural Comput Appl 2020. [DOI: 10.1007/s00521-020-05241-7] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 10/23/2022]

105
Brain Tumor Segmentation Using Deep Learning and Fuzzy K-Means Clustering for Magnetic Resonance Images. Neural Process Lett 2020. [DOI: 10.1007/s11063-020-10326-4] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 01/01/2023]

106
Gene Expression Clustering and Selected Head and Neck Cancer Gene Signatures Highlight Risk Probability Differences in Oral Premalignant Lesions. Cells 2020; 9:cells9081828. [PMID: 32756466 PMCID: PMC7466020 DOI: 10.3390/cells9081828] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 06/30/2020] [Revised: 07/27/2020] [Accepted: 07/31/2020] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND: Oral premalignant lesions (OPLs) represent the most common oral precancerous conditions. One of the major challenges in this field is the identification of OPLs at higher risk for oral squamous cell cancer (OSCC) development, by discovering molecular pathways deregulated in the early steps of malignant transformation. Analysis of deregulated levels of single genes and pathways has been successfully applied to head and neck squamous cell cancers (HNSCC) and OSCC with prognostic/predictive implications. Exploiting the availability of gene expression profiles and clinical follow-up information for a well-characterized cohort of OPL patients, we aim to dissect tissue OPL gene expression to identify molecular clusters/signatures associated with oral cancer free survival (OCFS). MATERIALS AND METHODS: The gene expression data of 86 OPL patients were challenged with: an HNSCC-specific model of 6 molecular subtypes (immune related: HPV related, Defense Response and Immunoreactive; Mesenchymal, Hypoxia and Classical); one OSCC-specific signature (13 genes); two metabolism-related signatures (3 genes and signatures raised from 6 metabolic pathways associated with prognosis in HNSCC and OSCC, respectively); and a hypoxia gene signature. The molecular stratification and high versus low expression of the signatures were correlated with OCFS by Kaplan-Meier analyses. The association of gene expression profiles among the tested biological models and clinical covariates was tested through variance partition analysis. RESULTS: Patients in the Mesenchymal, Hypoxia and Classical clusters showed a higher risk of malignant transformation than those in the immune-related clusters (log-rank test, p = 0.0052) and they expressed four enriched hallmarks: "TGF beta signaling", "angiogenesis", "unfolded protein response", and "apical junction". Overall, 54 cases fell into the immune-related clusters, while the remaining 32 cases belonged to the other clusters.
No other signatures showed an association with OCFS. Our variance partition analysis showed that clinical and molecular features explain only 21% of gene expression data variability, while the remaining 79% is attributed to residuals independent of known parameters. CONCLUSIONS: Applying the existing signatures derived from HNSCC to OPL, we identified only a protective effect for immune-related signatures. Other gene expression profiles derived from overt cancers were not able to identify the risk of malignant transformation, possibly because they are linked to later stages of cancer progression. A new well-characterized set of OPL patients and further research are needed to improve the identification of adequate prognosticators in OPLs.

107
Alharthi AM, Lee MH, Algamal ZY, Al-Fakih AM. Quantitative structure-activity relationship model for classifying the diverse series of antifungal agents using ratio weighted penalized logistic regression. SAR AND QSAR IN ENVIRONMENTAL RESEARCH 2020; 31:571-583. [PMID: 32628042 DOI: 10.1080/1062936x.2020.1782467] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 04/14/2020] [Accepted: 06/10/2020] [Indexed: 06/11/2023]
Abstract
One of the most challenging issues in building a quantitative structure-activity relationship (QSAR) classification model is descriptor selection. Penalized methods have gained popularity as a way to perform descriptor selection and QSAR classification model estimation simultaneously. However, penalized methods have drawbacks, such as biases and inconsistencies, that cause them to lack the oracle properties. This paper proposes an adaptive penalized logistic regression (APLR) to overcome these drawbacks. This is done by employing, for each descriptor, the ratio (BWR) of its between-groups sum of squares (BSS) to its within-groups sum of squares (WSS) as a weight inside the L1-norm. The proposed method was applied to a dataset consisting of a diverse series of antimicrobial agents with their respective bioactivities against Candida albicans. The experimental study showed that APLR was more efficient in descriptor selection and classification accuracy than competing methods that could be used to develop QSAR classification models. The method was also applied successfully to a second dataset. It can therefore be concluded that the APLR method had a significant impact on QSAR analysis and studies.
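The BSS/WSS ratio named in the abstract can be illustrated with a minimal sketch. It assumes the standard definitions of the between-groups and within-groups sums of squares; the function name `bss_wss_ratio` and the toy data are hypothetical, and the abstract does not say how the ratio enters the L1 penalty, so that step is left out.

```python
from collections import defaultdict


def bss_wss_ratio(X, y):
    """Return the BSS/WSS ratio for each descriptor (column) of X.

    X: list of rows (one per compound), each a list of descriptor values.
    y: list of class labels, one per row.
    Hypothetical helper for illustration; not the authors' implementation.
    """
    n_features = len(X[0])
    # Group row indices by class label.
    groups = defaultdict(list)
    for i, label in enumerate(y):
        groups[label].append(i)

    ratios = []
    for j in range(n_features):
        column = [row[j] for row in X]
        overall_mean = sum(column) / len(column)
        bss = 0.0
        wss = 0.0
        for idx in groups.values():
            values = [column[i] for i in idx]
            group_mean = sum(values) / len(values)
            # Between-groups: class size times squared distance of class mean from overall mean.
            bss += len(values) * (group_mean - overall_mean) ** 2
            # Within-groups: squared distances of samples from their own class mean.
            wss += sum((v - group_mean) ** 2 for v in values)
        # A descriptor that is constant within every class has WSS = 0.
        ratios.append(bss / wss if wss > 0 else float("inf"))
    return ratios


# Toy data (made up): descriptor 0 separates the classes, descriptor 1 does not.
X = [[1.0, 5.0], [2.0, 1.0], [9.0, 4.0], [10.0, 2.0]]
y = [0, 0, 1, 1]
ratios = bss_wss_ratio(X, y)
```

A large ratio marks a descriptor whose class means differ strongly relative to its within-class spread, which is why such a ratio is a natural candidate for a per-descriptor penalty weight.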
Affiliation(s)
- A M Alharthi: Department of Mathematical Sciences, Universiti Teknologi Malaysia, Skudai, Malaysia
- M H Lee: Department of Mathematical Sciences, Universiti Teknologi Malaysia, Skudai, Malaysia
- Z Y Algamal: Department of Statistics and Informatics, University of Mosul, Mosul, Iraq
- A M Al-Fakih: Department of Chemistry, Faculty of Science, Universiti Teknologi Malaysia, Johor, Malaysia

108
Zi X, Chen H. Robust tests of the equality of two high-dimensional covariance matrices. COMMUN STAT-THEOR M 2020. [DOI: 10.1080/03610926.2020.1788085] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/23/2022]
Affiliation(s)
- Xuemin Zi: School of Science, Tianjin University of Technology and Education, Tianjin, China
- Hui Chen: School of Statistics and Data Science, Nankai University, Tianjin, China

109
Wagala A, González-Farías G, Ramos R, Dalmau O. PLS Generalized Linear Regression and Kernel Multilogit Algorithm (KMA) for Microarray Data Classification Problem. REVISTA COLOMBIANA DE ESTADÍSTICA 2020. [DOI: 10.15446/rce.v43n2.81811] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/09/2022] Open
Abstract
This study implements extensions of partial least squares generalized linear regression (PLSGLR) by combining it with logistic regression and linear discriminant analysis, to obtain a partial least squares generalized linear regression-logistic regression model (PLSGLR-log) and a partial least squares generalized linear regression-linear discriminant analysis model (PLSGLRDA). A comparative study of the resulting classifiers with classical methodologies such as k-nearest neighbours (KNN), linear discriminant analysis (LDA), partial least squares discriminant analysis (PLSDA), ridge partial least squares (RPLS), and support vector machines (SVM) is then carried out. Furthermore, a new methodology known as the kernel multilogit algorithm (KMA) is also implemented and its performance compared with that of the other classifiers. The KMA emerged as the best classifier, with the lowest classification error rates, when applied to both types of data considered: unpreprocessed and preprocessed.

110

111
Meng X, Wang H, Feng L. The similarity-consensus regularized multi-view learning for dimension reduction. Knowl Based Syst 2020. [DOI: 10.1016/j.knosys.2020.105835] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 11/28/2022]

112
Yang X, Tian L, Chen Y, Yang L, Xu S, Wu W. Inverse Projection Representation and Category Contribution Rate for Robust Tumor Recognition. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:1262-1275. [PMID: 30575544 DOI: 10.1109/tcbb.2018.2886334] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/09/2023]
Abstract
Sparse representation based classification (SRC) methods have achieved remarkable results. SRC methods, however, still suffer from the need for sufficient training samples, insufficient use of test samples, and instability of representation. In this paper, a stable inverse projection representation based classification (IPRC) is presented to tackle these problems by effectively using test samples. An IPR is first proposed and its feasibility and stability are analyzed. A classification criterion named category contribution rate is constructed to match the IPR and complete classification. Moreover, a statistical measure is introduced to quantify the stability of representation-based classification methods. Based on the IPRC technique, a robust tumor recognition framework is presented by interpreting microarray gene expression data, where a two-stage hybrid gene selection method is introduced to select informative genes. Finally, a functional analysis of candidate pathogenicity-related genes is given. Extensive experiments on six public tumor microarray gene expression datasets demonstrate that the proposed technique is competitive with state-of-the-art methods.

113
Hu X, Hu Y, Wu F, Leung RWT, Qin J. Integration of single-cell multi-omics for gene regulatory network inference. Comput Struct Biotechnol J 2020; 18:1925-1938. [PMID: 32774787 PMCID: PMC7385034 DOI: 10.1016/j.csbj.2020.06.033] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Received: 03/01/2020] [Revised: 06/17/2020] [Accepted: 06/20/2020] [Indexed: 12/20/2022] Open
Abstract
The advancement of single-cell sequencing technology in recent years has provided an opportunity to reconstruct gene regulatory networks (GRNs) with data from thousands of single cells in one sample. This uncovers regulatory interactions in cells and speeds up the discovery of regulatory mechanisms in diseases and biological processes. Consequently, more methods have been proposed to reconstruct GRNs using single-cell sequencing data. In this review, we introduce technologies for sequencing the single-cell genome, transcriptome, and epigenome. We also present an overview of current GRN reconstruction strategies utilizing different single-cell sequencing data. Bioinformatics tools are grouped by their input data type and mathematical principles for the reader's convenience, and the fundamental mathematics inherent in each group is discussed. Furthermore, the adaptabilities and limitations of these different methods are summarized and compared, in the hope of helping researchers identify the most suitable tools for their needs.
Affiliation(s)
- Xinlin Hu: Shenzhen Key Laboratory of Advanced Machine Learning and Applications, College of Mathematics and Statistics, Shenzhen University, Shenzhen 518060, China
- Yaohua Hu: Shenzhen Key Laboratory of Advanced Machine Learning and Applications, College of Mathematics and Statistics, Shenzhen University, Shenzhen 518060, China
- Fanjie Wu: School of Pharmaceutical Sciences (Shenzhen), Sun Yat-sen University, Shenzhen 518107, China
- Ricky Wai Tak Leung: School of Pharmaceutical Sciences (Shenzhen), Sun Yat-sen University, Shenzhen 518107, China
- Jing Qin: School of Pharmaceutical Sciences (Shenzhen), Sun Yat-sen University, Shenzhen 518107, China

114
Prediction of Protein Tertiary Structure via Regularized Template Classification Techniques. Molecules 2020; 25:molecules25112467. [PMID: 32466409 PMCID: PMC7321371 DOI: 10.3390/molecules25112467] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 04/23/2020] [Revised: 05/21/2020] [Accepted: 05/22/2020] [Indexed: 11/24/2022] Open
Abstract
We discuss the use of regularized linear discriminant analysis (LDA) as a model reduction technique combined with particle swarm optimization (PSO) in protein tertiary structure prediction, followed by structure refinement based on singular value decomposition (SVD) and PSO. The algorithm presented in this paper belongs to the category of template-based modeling. The algorithm performs a preselection of protein templates before constructing a lower dimensional subspace via a regularized LDA. The protein coordinates in the reduced space are sampled using a highly explorative optimization algorithm, regressive–regressive PSO (RR-PSO). The obtained structure is then projected onto a reduced space via singular value decomposition and further optimized via RR-PSO to carry out a structure refinement. The final structures are similar to those predicted by the best structure prediction tools, such as the Rosetta and Zhang servers. The main advantage of our methodology is that it alleviates the ill-posed character of protein structure prediction problems related to high dimensional optimization. It is also capable of sampling a wide range of conformational space due to the application of a regularized linear discriminant analysis, which allows us to expand the differences over a reduced basis set.

115
Lu M, Fan Z, Xu B, Chen L, Zheng X, Li J, Znati T, Mi Q, Jiang J. Using machine learning to predict ovarian cancer. Int J Med Inform 2020; 141:104195. [PMID: 32485554 DOI: 10.1016/j.ijmedinf.2020.104195] [Citation(s) in RCA: 29] [Impact Index Per Article: 5.8] [Received: 03/01/2020] [Revised: 04/24/2020] [Accepted: 05/21/2020] [Indexed: 12/17/2022]
Abstract
OBJECTIVE: Ovarian cancer (OC) is one of the most common types of cancer in women. Accurate prediction of benign ovarian tumors (BOT) and OC has important practical value. METHODS: Our dataset consists of 349 Chinese patients with 49 variables including demographics, routine blood tests, general chemistry, and tumor markers. The Minimum Redundancy - Maximum Relevance (MRMR) machine learning feature selection method was applied to the data of 235 patients (89 BOT and 146 OC) to select the most relevant features, with which a simple decision tree model was constructed. The model was tested on the remaining 114 patients (89 BOT and 25 OC). The results were compared with the predictions produced by the risk of ovarian malignancy algorithm (ROMA) and a logistic regression model. RESULTS: Eight notable features were selected by MRMR, among which two were identified as the top features by the decision tree model: human epididymis protein 4 (HE4) and carcinoembryonic antigen (CEA). In particular, CEA is a valuable marker for OC prediction in patients with low HE4. The model also yields better prediction results than ROMA. CONCLUSION: Machine learning approaches were able to accurately classify BOT and OC. Our goal was to derive a simple predictive model that also performs well. Using our approach, we obtained a model that consists of just two biomarkers, HE4 and CEA. The model is simple to interpret and outperforms the existing OC prediction methods. It demonstrates that the machine learning approach has good potential in predictive modeling for complex diseases.
Affiliation(s)
- Mingyang Lu: Department of Tumor Biological Treatment, the Third Affiliated Hospital of Soochow University, Changzhou, Jiangsu, People's Republic of China; Jiangsu Engineering Research Center for Tumor Immunotherapy, Changzhou, Jiangsu, People's Republic of China; Institute of Cell Therapy, Soochow University, Changzhou, Jiangsu, People's Republic of China
- Zhenjiang Fan: Department of Computer Science, University of Pittsburgh, Pittsburgh, PA, USA
- Bin Xu: Department of Tumor Biological Treatment, the Third Affiliated Hospital of Soochow University, Changzhou, Jiangsu, People's Republic of China; Jiangsu Engineering Research Center for Tumor Immunotherapy, Changzhou, Jiangsu, People's Republic of China; Institute of Cell Therapy, Soochow University, Changzhou, Jiangsu, People's Republic of China
- Lujun Chen: Department of Tumor Biological Treatment, the Third Affiliated Hospital of Soochow University, Changzhou, Jiangsu, People's Republic of China; Jiangsu Engineering Research Center for Tumor Immunotherapy, Changzhou, Jiangsu, People's Republic of China; Institute of Cell Therapy, Soochow University, Changzhou, Jiangsu, People's Republic of China
- Xiao Zheng: Department of Tumor Biological Treatment, the Third Affiliated Hospital of Soochow University, Changzhou, Jiangsu, People's Republic of China; Jiangsu Engineering Research Center for Tumor Immunotherapy, Changzhou, Jiangsu, People's Republic of China; Institute of Cell Therapy, Soochow University, Changzhou, Jiangsu, People's Republic of China
- Jundong Li: Department of Electrical and Computer Engineering, University of Virginia, Charlottesville, VA, USA
- Taieb Znati: Department of Computer Science, University of Pittsburgh, Pittsburgh, PA, USA
- Qi Mi: Department of Sports Medicine and Nutrition, University of Pittsburgh, Pittsburgh, PA, USA
- Jingting Jiang: Department of Tumor Biological Treatment, the Third Affiliated Hospital of Soochow University, Changzhou, Jiangsu, People's Republic of China; Jiangsu Engineering Research Center for Tumor Immunotherapy, Changzhou, Jiangsu, People's Republic of China; Institute of Cell Therapy, Soochow University, Changzhou, Jiangsu, People's Republic of China

116

117
Complex Data Imputation by Auto-Encoders and Convolutional Neural Networks: A Case Study on Genome Gap-Filling. COMPUTERS 2020. [DOI: 10.3390/computers9020037] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 12/12/2022]
Abstract
Missing data imputation has been a hot topic in the past decade, and many state-of-the-art works have been presented to propose novel, interesting solutions that have been applied in a variety of fields. In the past decade, the successful results achieved by deep learning techniques have opened the way to their application for solving difficult problems where human skill is not able to provide a reliable solution. Not surprisingly, some deep learners, mainly exploiting encoder-decoder architectures, have also been designed and applied to the task of missing data imputation. However, most of the proposed imputation techniques have not been designed to tackle “complex data”, that is high dimensional data belonging to datasets with huge cardinality and describing complex problems. Precisely, they often need critical parameters to be manually set or exploit complex architecture and/or training phases that make their computational load impracticable. In this paper, after clustering the state-of-the-art imputation techniques into three broad categories, we briefly review the most representative methods and then describe our data imputation proposals, which exploit deep learning techniques specifically designed to handle complex data. Comparative tests on genome sequences show that our deep learning imputers outperform the state-of-the-art KNN-imputation method when filling gaps in human genome sequences.

118
Alaiz-Rodríguez R, Parnell AC. An information theoretic approach to quantify the stability of feature selection and ranking algorithms. Knowl Based Syst 2020. [DOI: 10.1016/j.knosys.2020.105745] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 01/11/2023]

119
Xia F, George SL, Ning J, Li L, Huang X. A Signature Enrichment Design with Bayesian Adaptive Randomization. J Appl Stat 2020; 48:1091-1110. [PMID: 34024982 DOI: 10.1080/02664763.2020.1757048] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 10/24/2022]
Abstract
Clinical trials in the era of precision cancer medicine aim to identify and validate biomarker signatures which can guide the assignment of individually optimal treatments to patients. In this article, we propose a group sequential randomized phase II design, which updates the biomarker signature as the trial goes on, utilizes enrichment strategies for patient selection, and uses Bayesian response-adaptive randomization for treatment assignment. To evaluate the performance of the new design, in addition to the commonly considered criteria of type I error and power, we propose four new criteria measuring the benefits and losses for individuals both inside and outside of the clinical trial. Compared with designs with equal randomization, the proposed design gives trial participants a better chance to receive their personalized optimal treatments and thus results in a higher response rate on the trial. This design increases the chance to discover a successful new drug by an adaptive enrichment strategy, i.e., identification and selective enrollment of a subset of patients who are sensitive to the experimental therapies. Simulation studies demonstrate these advantages of the proposed design. It is illustrated by an example based on an actual clinical trial in non-small-cell lung cancer.
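The idea of Bayesian response-adaptive randomization can be sketched generically. The example below is not the design proposed in the article: it assumes two arms, binary responses, independent uniform Beta(1, 1) priors, and an allocation probability equal to the posterior probability that arm A has the higher response rate, estimated by Monte Carlo. The function name `allocation_probability` and the counts are hypothetical.

```python
import random


def allocation_probability(successes_a, failures_a, successes_b, failures_b,
                           draws=20000, seed=0):
    """Estimate P(response rate of A > response rate of B | data).

    Assumes Beta(1, 1) priors on both response rates, so each posterior is
    Beta(1 + successes, 1 + failures). Illustrative only.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Draw one plausible response rate per arm from its posterior.
        p_a = rng.betavariate(1 + successes_a, 1 + failures_a)
        p_b = rng.betavariate(1 + successes_b, 1 + failures_b)
        if p_a > p_b:
            wins += 1
    return wins / draws


# Hypothetical interim data: arm A has 8 responders out of 10, arm B has 2 out of 10.
prob_a = allocation_probability(8, 2, 2, 8)
# The next patient would be assigned to arm A with probability prob_a.
```

With equal evidence on both arms the estimate stays near 0.5, which recovers equal randomization; as evidence accumulates for one arm, new patients are steered toward it.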
Affiliation(s)
- Fang Xia: Department of Biostatistics, The University of Texas MD Anderson Cancer Center
- Stephen L George: Department of Biostatistics and Bioinformatics, Duke University School of Medicine
- Jing Ning: Department of Biostatistics, The University of Texas MD Anderson Cancer Center
- Liang Li: Department of Biostatistics, The University of Texas MD Anderson Cancer Center
- Xuelin Huang: Department of Biostatistics, The University of Texas MD Anderson Cancer Center

120
Ishii A. A classifier under the strongly spiked eigenvalue model in high-dimension, low-sample-size context. COMMUN STAT-THEOR M 2020. [DOI: 10.1080/03610926.2018.1528365] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2022]
Affiliation(s)
- Aki Ishii: Department of Information Sciences, Tokyo University of Science, Chiba, Japan

121

122
Shen L, Yin Q. Data maximum dispersion classifier in projection space for high-dimension low-sample-size problems. Knowl Based Syst 2020. [DOI: 10.1016/j.knosys.2019.105420] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 12/19/2022]

123
Ekstrøm CT, Gerds TA, Jensen AK. Sequential rank agreement methods for comparison of ranked lists. Biostatistics 2020; 20:582-598. [PMID: 29868883 DOI: 10.1093/biostatistics/kxy017] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 05/23/2017] [Accepted: 04/22/2018] [Indexed: 11/14/2022] Open
Abstract
The comparison of alternative rankings of a set of items is a general and common task in applied statistics. Predictor variables are ranked according to magnitude of association with an outcome, prediction models rank subjects according to the personalized risk of an event, and genetic studies rank genes according to their difference in gene expression levels. We propose a sequential rank agreement measure to quantify the rank agreement among two or more ordered lists. This measure has an intuitive interpretation, it can be applied to any number of lists even if some are partially incomplete, and it provides information about the agreement along the lists. The sequential rank agreement can be evaluated analytically or be compared graphically to a permutation based reference set in order to identify changes in the list agreements. The usefulness of this measure is illustrated using gene rankings, and using data from two Danish ovarian cancer studies where we assess the within and between agreement of different statistical classification methods.
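One way to make the idea of agreement "along the lists" concrete is sketched below. It assumes complete rankings of the same items and takes, at each depth d, the pooled standard deviation of the ranks of all items that appear in the top d of at least one list. This is one plausible reading of the abstract, not necessarily the exact definition used in the article, and the function name and example lists are hypothetical.

```python
from statistics import variance


def sequential_rank_agreement(rankings):
    """Agreement curve for complete rankings of the same items (best first).

    Assumption: agreement at depth d is the pooled standard deviation of
    the ranks of every item placed in the top d by at least one list.
    Smaller values mean the lists agree more at that depth.
    """
    items = rankings[0]
    # Rank of each item in every list (1 = top of the list).
    ranks = {item: [r.index(item) + 1 for r in rankings] for item in items}
    curve = []
    for depth in range(1, len(items) + 1):
        # Items that some list ranks at or above the current depth.
        active = [item for item in items if min(ranks[item]) <= depth]
        pooled_var = sum(variance(ranks[item]) for item in active) / len(active)
        curve.append(pooled_var ** 0.5)
    return curve


# Two hypothetical rankings that disagree only on positions 2 and 3.
sra = sequential_rank_agreement([["a", "b", "c", "d"], ["a", "c", "b", "d"]])
print(sra)
```

The curve starts at zero because both lists agree on the top item, rises where the lists swap items, and falls again once the commonly ranked last item enters the pool.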
Affiliation(s)
- Claus Thorn Ekstrøm
  - Biostatistics, Department of Public Health, University of Copenhagen, Øster Farimagsgade 5 B, DK-1014 Copenhagen K, Denmark
- Thomas Alexander Gerds
  - Biostatistics, Department of Public Health, University of Copenhagen, Øster Farimagsgade 5 B, DK-1014 Copenhagen K, Denmark
- Andreas Kryger Jensen
  - Biostatistics, Department of Public Health, University of Copenhagen, Øster Farimagsgade 5 B, DK-1014 Copenhagen K, Denmark
|
124
|
Tronik-Le Roux D, Sautreuil M, Bentriou M, Vérine J, Palma MB, Daouya M, Bouhidel F, Lemler S, LeMaoult J, Desgrandchamps F, Cournède PH, Carosella ED. Comprehensive landscape of immune-checkpoints uncovered in clear cell renal cell carcinoma reveals new and emerging therapeutic targets. Cancer Immunol Immunother 2020; 69:1237-1252. [PMID: 32166404 DOI: 10.1007/s00262-020-02530-x] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Received: 06/26/2019] [Accepted: 02/18/2020] [Indexed: 12/18/2022]
Abstract
Clear cell renal cell carcinoma (ccRCC) constitutes the most common renal cell carcinoma subtype and has long been recognized as an immunogenic cancer. As such, significant attention has been directed toward optimizing immune-checkpoints (IC)-based therapies. Despite proven benefits, a substantial number of patients remain unresponsive to treatment, suggesting that yet unreported, immunosuppressive mechanisms coexist within tumors and their microenvironment. Here, we comprehensively analyzed and ranked forty-four immune-checkpoints expressed in ccRCC on the basis of in-depth analysis of RNAseq data collected from the TCGA database and advanced statistical methods designed to obtain the group of checkpoints that best discriminates tumor from healthy tissues. Immunohistochemistry and flow cytometry confirmed and enlarged the bioinformatics results. In particular, by using the recursive feature elimination method, we show that HLA-G, B7H3, PDL-1 and ILT2 are the most relevant genes that characterize ccRCC. Notably, ILT2 expression was detected for the first time on tumor cells. The levels of other ligand-receptor pairs such as CD70:CD27; 4-1BB:4-1BBL; CD40:CD40L; CD86:CTLA4; MHC-II:Lag3; CD200:CD200R; CD244:CD48 were also found highly expressed in tumors compared to adjacent non-tumor tissues. Collectively, our approach provides a comprehensible classification of forty-four IC expressed in ccRCC, some of which were never reported before to be co-expressed in ccRCC. In addition, the algorithms used allowed identifying the most relevant group that best discriminates tumor from healthy tissues. The data can potentially assist on the choice of valuable immune-therapy targets which hold potential for the development of more effective anti-tumor treatments.
Affiliation(s)
- Diana Tronik-Le Roux
  - Commissariat à L'Energie Atomique Et Aux Energies Alternatives (CEA), Direction de La Recherche Fondamentale (DRF), Service de Recherche en Hémato-Immunologie (SRHI), Hôpital Saint-Louis, Paris, France
  - Université de Paris, U976 HIPI Unit, Institut de Recherche Saint-Louis, 75010, Paris, France
  - CEA, Direction de La Recherche Fondamentale, Service de Recherche en Hémato-Immunologie, Hôpital Saint-Louis, IUH, 1, avenue Claude Vellefaux, 75010, Paris, France
- Mathilde Sautreuil
  - Laboratory of Mathematics and Informatics (MICS), CentraleSupélec, Université Paris-Saclay, 91190, Gif-sur-Yvette, France
- Mahmoud Bentriou
  - Laboratory of Mathematics and Informatics (MICS), CentraleSupélec, Université Paris-Saclay, 91190, Gif-sur-Yvette, France
- Jérôme Vérine
  - Commissariat à L'Energie Atomique Et Aux Energies Alternatives (CEA), Direction de La Recherche Fondamentale (DRF), Service de Recherche en Hémato-Immunologie (SRHI), Hôpital Saint-Louis, Paris, France
  - Service D'Anatomo-Pathologie, AP-HP, Hôpital Saint-Louis, Paris, France
- Maria Belén Palma
  - Cátedra de Citología, Histología Y Embriología A, Facultad de Ciencias Médicas, UNLP, Buenos Aires, Argentina
- Marina Daouya
  - Commissariat à L'Energie Atomique Et Aux Energies Alternatives (CEA), Direction de La Recherche Fondamentale (DRF), Service de Recherche en Hémato-Immunologie (SRHI), Hôpital Saint-Louis, Paris, France
  - Université de Paris, U976 HIPI Unit, Institut de Recherche Saint-Louis, 75010, Paris, France
- Fatiha Bouhidel
  - Service D'Anatomo-Pathologie, AP-HP, Hôpital Saint-Louis, Paris, France
- Sarah Lemler
  - Laboratory of Mathematics and Informatics (MICS), CentraleSupélec, Université Paris-Saclay, 91190, Gif-sur-Yvette, France
- Joel LeMaoult
  - Commissariat à L'Energie Atomique Et Aux Energies Alternatives (CEA), Direction de La Recherche Fondamentale (DRF), Service de Recherche en Hémato-Immunologie (SRHI), Hôpital Saint-Louis, Paris, France
  - Université de Paris, U976 HIPI Unit, Institut de Recherche Saint-Louis, 75010, Paris, France
- François Desgrandchamps
  - Commissariat à L'Energie Atomique Et Aux Energies Alternatives (CEA), Direction de La Recherche Fondamentale (DRF), Service de Recherche en Hémato-Immunologie (SRHI), Hôpital Saint-Louis, Paris, France
  - Service D'Urologie, AP-HP, Hôpital Saint-Louis, Paris, France
- Paul-Henry Cournède
  - Laboratory of Mathematics and Informatics (MICS), CentraleSupélec, Université Paris-Saclay, 91190, Gif-sur-Yvette, France
- Edgardo D Carosella
  - Commissariat à L'Energie Atomique Et Aux Energies Alternatives (CEA), Direction de La Recherche Fondamentale (DRF), Service de Recherche en Hémato-Immunologie (SRHI), Hôpital Saint-Louis, Paris, France
  - Université de Paris, U976 HIPI Unit, Institut de Recherche Saint-Louis, 75010, Paris, France
|
125
|
Abstract
OBJECTIVES: Modern critical care amasses unprecedented amounts of clinical data (so-called "big data") on a minute-by-minute basis. Innovative processing of these data has the potential to revolutionize clinical prognostics and decision support in the care of the critically ill but also forces clinicians to depend on new and complex tools of which they may have limited understanding and over which they have little control. This concise review aims to provide bedside clinicians with ways to think about common methods being used to extract information from clinical big datasets and to judge the quality and utility of that information.
DATA SOURCES: We searched the free-access search engines PubMed and Google Scholar using the MeSH terms "big data", "prediction", and "intensive care" with iterations of a range of additional potentially associated factors, along with published bibliographies, to find papers suggesting illustration of key points in the structuring and analysis of clinical "big data," with special focus on outcomes prediction and major clinical concerns in critical care.
STUDY SELECTION: Three reviewers independently screened preliminary citation lists.
DATA EXTRACTION: Summary data were tabulated for review.
DATA SYNTHESIS: To date, most relevant big data research has focused on development of and attempts to validate patient outcome scoring systems and has yet to fully make use of the potential for automation and novel uses of continuous data streams such as those available from clinical care monitoring devices.
CONCLUSIONS: Realizing the potential for big data to improve critical care patient outcomes will require unprecedented team building across disparate competencies. It will also require clinicians to develop statistical awareness and thinking as yet another critical judgment skill they bring to their patients' bedsides and to the array of evidence presented to them about their patients over the course of care.
|
126
|
Xie X, Zhang H, Wang J, Chang Q, Wang J, Pal NR. Learning Optimized Structure of Neural Networks by Hidden Node Pruning With L1 Regularization. IEEE TRANSACTIONS ON CYBERNETICS 2020; 50:1333-1346. [PMID: 31765323 DOI: 10.1109/tcyb.2019.2950105] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Indexed: 06/10/2023]
Abstract
We propose three different methods to determine the optimal number of hidden nodes based on L1 regularization for a multilayer perceptron network. The first two methods, respectively, use a set of multiplier functions and multipliers for the hidden-layer nodes and implement the L1 regularization on those, while the third method equipped with the same multipliers uses a smoothing approximation of the L1 regularization. Each of these methods begins with a given number of hidden nodes, then the network is trained to obtain an optimal architecture discarding redundant hidden nodes using the multiplier functions or multipliers. A simple and generic method, namely, the matrix-based convergence proving method (MCPM), is introduced to prove the weak and strong convergence of the presented smoothing algorithms. The performance of the three pruning methods has been tested on 11 different classification datasets. The results demonstrate the efficient pruning abilities and competitive generalization by the proposed methods. The theoretical results are also validated by the results.
|
127
|
Identification of Susceptibility Genes in Hepatic Cancer Using Whole Exome Sequencing and Risk Prediction Model Construction. REV ROMANA MED LAB 2020. [DOI: 10.2478/rrlm-2020-0008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2022]
Abstract
Objective: To identify the susceptible single nucleotide polymorphisms (SNPs) loci in HCC patients in Guangxi Region, screen biomarkers from differential SNPs loci by using predictors, and establish risk prediction models for HCC, to provide a basis of screening high-risk individuals of HCC.
Methods: Blood sample and clinical data of 50 normal participants and 50 hepatic cancer (HCC) patients in Rui Kang Hospital affiliated to Guangxi University of Traditional Chinese Medicine were collected. Normal participants and HCC patients were assigned to training set and testing set, respectively. Whole Exome Sequencing (WES) technique was employed to compare the exon sequence of the normal participants and HCC patients. Five predictors were used to screen the biomarkers and construct HCC prediction models. The prediction models were validated with both training and testing set.
Results: Two-hundred seventy SNPs were identified to be significantly different from HCC, among which 100 SNPs were selected as biomarkers for prediction models. Five prediction models constructed with the 100 SNPs showed good sensitivity and specificity for HCC prediction among the training set and testing set.
Conclusion: A series of SNPs were identified as susceptible genes for HCC. Some of these SNPs including CNN2, CD177, KMT2C, and HLADQB1 were consistent with the previously identified polymorphisms by targeted genes examination. The prediction models constructed with part of those SNPs could accurately predict HCC development.
|
128
|
Koçhan N, Tutuncu GY, Smyth GK, Gandolfo LC, Giner G. qtQDA: quantile transformed quadratic discriminant analysis for high-dimensional RNA-seq data. PeerJ 2020; 7:e8260. [PMID: 31976167 PMCID: PMC6967023 DOI: 10.7717/peerj.8260] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 09/23/2019] [Accepted: 11/20/2019] [Indexed: 11/26/2022] Open
Abstract
Classification on the basis of gene expression data derived from RNA-seq promises to become an important part of modern medicine. We propose a new classification method based on a model where the data is marginally negative binomial but dependent, thereby incorporating the dependence known to be present between measurements from different genes. The method, called qtQDA, works by first performing a quantile transformation (qt) then applying Gaussian quadratic discriminant analysis (QDA) using regularized covariance matrix estimates. We show that qtQDA has excellent performance when applied to real data sets and has advantages over some existing approaches. An R package implementing the method is also available on https://github.com/goknurginer/qtQDA.
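The quantile transformation step described in the abstract can be sketched as follows: a count is pushed through the negative binomial distribution function and then through the inverse standard normal distribution function, giving a value on the Gaussian scale where QDA can be applied. This is an illustrative sketch only. The parameterization (`size`, `prob`), the clipping constant `eps`, and the example counts are assumptions, and the authors' R package remains the reference for the actual method.

```python
import math
from statistics import NormalDist


def nb_cdf(x, size, prob):
    """P(X <= x) for a negative binomial with dispersion `size` and
    success probability `prob` (hypothetical parameter names)."""
    total = 0.0
    for k in range(x + 1):
        # Log of the pmf, using lgamma so that `size` may be non-integer.
        log_pmf = (
            math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
            + size * math.log(prob) + k * math.log(1 - prob)
        )
        total += math.exp(log_pmf)
    return min(total, 1.0)


def quantile_transform(count, size, prob, eps=1e-12):
    """Map a count to the standard normal scale via its NB quantile."""
    # Clip away from 0 and 1 so the inverse normal CDF stays finite.
    u = min(max(nb_cdf(count, size, prob), eps), 1 - eps)
    return NormalDist().inv_cdf(u)


# Hypothetical gene with size=2.0 and prob=0.1 (mean count of 18).
z_scores = [quantile_transform(c, size=2.0, prob=0.1) for c in (0, 5, 18, 60)]
print(z_scores)
```

After every gene has been transformed this way, the transformed vectors could be passed to a quadratic discriminant classifier with a regularized covariance estimate per class, which is the second step the abstract describes and is not shown here.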
Affiliation(s)
- Necla Koçhan
  - Department of Mathematics, Izmir University of Economics, Izmir, Turkey
- G Yazgi Tutuncu
  - Department of Mathematics, Izmir University of Economics, Izmir, Turkey
- Gordon K Smyth
  - Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research, Melbourne, VIC, Australia
  - School of Mathematics and Statistics, University of Melbourne, Melbourne, VIC, Australia
- Luke C Gandolfo
  - Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research, Melbourne, VIC, Australia
  - School of Mathematics and Statistics, University of Melbourne, Melbourne, VIC, Australia
- Göknur Giner
  - Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research, Melbourne, VIC, Australia
  - Department of Medical Biology, University of Melbourne, Melbourne, VIC, Australia
|
129
|
A topological approach for cancer subtyping from gene expression data. J Biomed Inform 2020; 102:103357. [PMID: 31893527 DOI: 10.1016/j.jbi.2019.103357] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 05/29/2019] [Revised: 11/27/2019] [Accepted: 12/12/2019] [Indexed: 12/27/2022]
Abstract
BACKGROUND: Gene expression data contains key information which can be used for subtyping cancer patients. However, computational methods suffer from 'curse of dimensionality' due to very high dimensionality of omics data and therefore are not able to clearly distinguish between the discovered subtypes in terms of separation of survival plots.
METHODS: To address this we propose a framework based on Topological Mapper algorithm. The novelty of this work is that we suggest a method for defining the filter function on which the mapper algorithm heavily depends. Survival analysis of the discovered cancer subtypes is carried out and evaluated in terms of minimum pairwise separation between the Kaplan-Meier plots. Furthermore, we present a method to measure the separation between the discovered subtypes based on hazard ratios.
RESULTS: Five cancer genomics datasets obtained from The Cancer Genome Atlas portal have been used for comparisons with Robust Sparse Correlation-Otrimle (RSC-Otrimle) algorithm and Similarity Network Fusion (SNF). Comparisons show that the minimum pairwise life expectancy difference (in days) between the discovered subtypes for lung, colon, breast, glioblastoma and kidney cancers is 107, 204, 20, 88 and 425 days, respectively, for the proposed methodology whereas it is only 69, 43, 6, 61 and 282 days for RSC-Otrimle and 9, 95, 18, 60 and 148 days for SNF. Hazard ratio analysis also shows that the proposed methodology performs better in four of the five datasets. A visual inspection of Kaplan-Meier plots reveals that the proposed methodology achieves lesser overlap in Kaplan-Meier plots especially for lung, breast and kidney cases. Furthermore, relevant genetic pathways for each subtype have been obtained and pathways which can be possible targets for treatment have been discussed.
CONCLUSION: The significance of this work lies in individualized understanding of cancer from patient to patient which is the backbone of Precision Medicine.
|
130
|
Warren S, Danaher P, Mashadi-Hossein A, Skewis L, Wallden B, Ferree S, Cesano A. Development of Gene Expression-Based Biomarkers on the nCounter® Platform for Immuno-Oncology Applications. Methods Mol Biol 2020; 2055:273-300. [PMID: 31502157 DOI: 10.1007/978-1-4939-9773-2_13] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 06/10/2023]
Abstract
Biomarkers based on transcriptional profiling can be useful in the measurement of complex and/or dynamic physiological states where other profiling strategies such as genomic or proteomic characterization are not able to adequately measure the biology. One particular advantage of transcriptional biomarkers is the ease with which they can be measured in the clinical setting using robust platforms such as the NanoString nCounter system. The nCounter platform enables digital quantitation of multiplexed RNA from small amounts of blood, formalin-fixed, paraffin-embedded tumors, or other such biological samples that are readily available from patients, and the chapter uses it as the primary example for diagnostic assay development. However, development of diagnostic assays based on RNA biomarkers on any platform requires careful consideration of all aspects of the final clinical assay a priori, as well as design and execution of the development program in a way that will maximize likelihood of future success. This chapter introduces transcriptional biomarkers and provides an overview of the design and development process that will lead to a locked diagnostic assay that is ready for validation of clinical utility.
Affiliation(s)
- Sarah Warren
  - NanoString Technologies, Inc., Seattle, WA, USA
- Sean Ferree
  - NanoString Technologies, Inc., Seattle, WA, USA
- Alessandra Cesano
  - NanoString Technologies, Inc., Seattle, WA, USA
  - ESSA Pharma, South San Francisco, CA, USA
|
131
|
Rahaman MM, Ahsan MA, Chen M. Data-mining Techniques for Image-based Plant Phenotypic Traits Identification and Classification. Sci Rep 2019; 9:19526. [PMID: 31862925 PMCID: PMC6925301 DOI: 10.1038/s41598-019-55609-6] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Received: 06/25/2019] [Accepted: 11/21/2019] [Indexed: 11/09/2022] Open
Abstract
Statistical data-mining (DM) and machine learning (ML) are promising tools to assist in the analysis of complex dataset. In recent decades, in the precision of agricultural development, plant phenomics study is crucial for high-throughput phenotyping of local crop cultivars. Therefore, integrated or a new analytical approach is needed to deal with these phenomics data. We proposed a statistical framework for the analysis of phenomics data by integrating DM and ML methods. The most popular supervised ML methods; Linear Discriminant Analysis (LDA), Random Forest (RF), Support Vector Machine with linear (SVM-l) and radial basis (SVM-r) kernel are used for classification/prediction plant status (stress/non-stress) to validate our proposed approach. Several simulated and real plant phenotype datasets were analyzed. The results described the significant contribution of the features (selected by our proposed approach) throughout the analysis. In this study, we showed that the proposed approach removed phenotype data analysis complexity, reduced computational time of ML algorithms, and increased prediction accuracy.
Affiliation(s)
- Md Matiur Rahaman
  - Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, 310058, China
  - Department of Statistics, Faculty of Science, Bangabandhu Sheikh Mujibur Rahman Science & Technology University, Gopalganj, 8100, Bangladesh
- Md Asif Ahsan
  - Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, 310058, China
- Ming Chen
  - Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, 310058, China
|
132
|
Wu Y, Qin Y, Zhu M. High‐dimensional covariance matrix estimation using a low‐rank and diagonal decomposition. CAN J STAT 2019. [DOI: 10.1002/cjs.11532] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Affiliation(s)
- Yilei Wu
  - Department of Statistics and Actuarial Science, University of Waterloo, 200 University Avenue West, Waterloo, Ontario, Canada N2L 3G1
- Yingli Qin
  - Department of Statistics and Actuarial Science, University of Waterloo, 200 University Avenue West, Waterloo, Ontario, Canada N2L 3G1
- Mu Zhu
  - Department of Statistics and Actuarial Science, University of Waterloo, 200 University Avenue West, Waterloo, Ontario, Canada N2L 3G1
|
133
|
Bhadra A, Datta J, Polson NG, Willard BT. The Horseshoe-Like Regularization for Feature Subset Selection. SANKHYA-SERIES B-APPLIED AND INTERDISCIPLINARY STATISTICS 2019. [DOI: 10.1007/s13571-019-00217-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 10/25/2022]
|
134
|
Crook OM, Gatto L, Kirk PD. Fast approximate inference for variable selection in Dirichlet process mixtures, with an application to pan-cancer proteomics. Stat Appl Genet Mol Biol 2019; 18:/j/sagmb.ahead-of-print/sagmb-2018-0065/sagmb-2018-0065.xml. [PMID: 31829970 PMCID: PMC7614016 DOI: 10.1515/sagmb-2018-0065] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 01/07/2023]
Abstract
The Dirichlet Process (DP) mixture model has become a popular choice for model-based clustering, largely because it allows the number of clusters to be inferred. The sequential updating and greedy search (SUGS) algorithm (Wang & Dunson, 2011) was proposed as a fast method for performing approximate Bayesian inference in DP mixture models, by posing clustering as a Bayesian model selection (BMS) problem and avoiding the use of computationally costly Markov chain Monte Carlo methods. Here we consider how this approach may be extended to permit variable selection for clustering, and also demonstrate the benefits of Bayesian model averaging (BMA) in place of BMS. Through an array of simulation examples and well-studied examples from cancer transcriptomics, we show that our method performs competitively with the current state-of-the-art, while also offering computational benefits. We apply our approach to reverse-phase protein array (RPPA) data from The Cancer Genome Atlas (TCGA) in order to perform a pan-cancer proteomic characterisation of 5157 tumour samples. We have implemented our approach, together with the original SUGS algorithm, in an open-source R package named sugsvarsel, which accelerates analysis by performing intensive computations in C++ and provides automated parallel processing. The R package is freely available from: https://github.com/ococrook/sugsvarsel.
Affiliation(s)
- Oliver M. Crook
  - Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, UK
  - Department of Biochemistry, Cambridge Centre for Proteomics, University of Cambridge, Cambridge, UK
  - MRC Biostatistics Unit, School of Clinical Medicine, University of Cambridge, Cambridge, UK
- Paul D.W. Kirk
  - MRC Biostatistics Unit, School of Clinical Medicine, University of Cambridge, Cambridge, UK
  - University of Cambridge, Cambridge Institute of Therapeutic Immunology & Infectious Disease (CITIID), Cambridge Biomedical Campus Cambridge, United Kingdom of Great Britain and Northern Ireland
|
135
|
3-Dimensional facial expression recognition in human using multi-points warping. BMC Bioinformatics 2019; 20:619. [PMID: 31791234 PMCID: PMC6889223 DOI: 10.1186/s12859-019-3153-2] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 05/02/2019] [Accepted: 10/11/2019] [Indexed: 11/28/2022] Open
Abstract
Background: Expression in H-sapiens plays a remarkable role when it comes to social communication. The identification of this expression by human beings is relatively easy and accurate. However, achieving the same result in 3D by machine remains a challenge in computer vision. This is due to the current challenges facing facial data acquisition in 3D; such as lack of homology and complex mathematical analysis for facial point digitization. This study proposes facial expression recognition in human with the application of Multi-points Warping for 3D facial landmark by building a template mesh as a reference object. This template mesh is thereby applied to each of the target mesh on Stirling/ESRC and Bosphorus datasets. The semi-landmarks are allowed to slide along tangents to the curves and surfaces until the bending energy between a template and a target form is minimal and localization error is assessed using Procrustes ANOVA. By using Principal Component Analysis (PCA) for feature selection, classification is done using Linear Discriminant Analysis (LDA).
Result: The localization error is validated on the two datasets with superior performance over the state-of-the-art methods and variation in the expression is visualized using Principal Components (PCs). The deformations show various expression regions in the faces. The results indicate that Sad expression has the lowest recognition accuracy on both datasets. The classifier achieved a recognition accuracy of 99.58 and 99.32% on Stirling/ESRC and Bosphorus, respectively.
Conclusion: The results demonstrate that the method is robust and in agreement with the state-of-the-art results.
|
136
|
Affiliation(s)
- Lo‐Bin Chang
  - Department of Statistics, The Ohio State University, Columbus, OH 43210‐1326, U.S.A.
|
137
|
Kang X, Deng X. An improved modified cholesky decomposition approach for precision matrix estimation. J STAT COMPUT SIM 2019. [DOI: 10.1080/00949655.2019.1687701] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Indexed: 10/25/2022]
Affiliation(s)
- Xiaoning Kang
  - International Business College and Institute of Supply Chain Analytics, Dongbei University of Finance and Economics, Dalian, People’s Republic of China
- Xinwei Deng
  - Department of Statistics, Virginia Tech, Blacksburg, VA, USA
|
138
|
Paul A, Sil J. Identification of Differentially Expressed Genes to Establish New Biomarker for Cancer Prediction. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:1970-1985. [PMID: 29994718 DOI: 10.1109/tcbb.2018.2837095] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/08/2023]
Abstract
The goal of the human genome project is to integrate genetic information into different clinical therapies. To achieve this goal, different computational algorithms are devised for identifying the biomarker genes, cause of complex diseases. However, most of the methods developed so far using DNA microarray data lack in interpreting biological findings and are less accurate in disease prediction. In the paper, we propose two parameters risk_factor and confusion_factor to identify the biologically significant genes for cancer development. First, we evaluate risk_factor of each gene and the genes with nonzero risk_factor result misclassification of data, therefore removed. Next, we calculate confusion_factor of the remaining genes which determines confusion of a gene in prediction due to closeness of the samples in the cancer and normal classes. We apply nondominated sorting genetic algorithm (NSGA-II) to select the maximally uncorrelated differentially expressed genes in the cancer class with minimum confusion_factor. The proposed Gene Selection Explore (GSE) algorithm is compared to well established feature selection algorithms using 10 microarray data with respect to sensitivity, specificity, and accuracy. The identified genes appear in KEGG pathway and have several biological importance.
|
139
|
Grinberg NF, Orhobor OI, King RD. An evaluation of machine-learning for predicting phenotype: studies in yeast, rice, and wheat. Mach Learn 2019; 109:251-277. [PMID: 32174648 PMCID: PMC7048706 DOI: 10.1007/s10994-019-05848-5] [Citation(s) in RCA: 53] [Impact Index Per Article: 8.8] [Received: 10/28/2015] [Revised: 09/17/2019] [Accepted: 09/19/2019] [Indexed: 11/01/2022]
Abstract
In phenotype prediction the physical characteristics of an organism are predicted from knowledge of its genotype and environment. Such studies, often called genome-wide association studies, are of the highest societal importance, as they are of central importance to medicine, crop-breeding, etc. We investigated three phenotype prediction problems: one simple and clean (yeast), and the other two complex and real-world (rice and wheat). We compared standard machine learning methods; elastic net, ridge regression, lasso regression, random forest, gradient boosting machines (GBM), and support vector machines (SVM), with two state-of-the-art classical statistical genetics methods; genomic BLUP and a two-step sequential method based on linear regression. Additionally, using the clean yeast data, we investigated how performance varied with the complexity of the biological mechanism, the amount of observational noise, the number of examples, the amount of missing data, and the use of different data representations. We found that for almost all the phenotypes considered, standard machine learning methods outperformed the methods from classical statistical genetics. On the yeast problem, the most successful method was GBM, followed by lasso regression, and the two statistical genetics methods; with greater mechanistic complexity GBM was best, while in simpler cases lasso was superior. In the wheat and rice studies the best two methods were SVM and BLUP. The most robust method in the presence of noise, missing data, etc. was random forests. The classical statistical genetics method of genomic BLUP was found to perform well on problems where there was population structure. This suggests that standard machine learning methods need to be refined to include population structure information when this is present. 
We conclude that the application of machine learning methods to phenotype prediction problems holds great promise, but that determining which methods is likely to perform well on any given problem is elusive and non-trivial.
Affiliation(s)
- Nastasiya F. Grinberg
- School of Computer Science, University of Manchester, Oxford Road, Manchester, M13 9PL UK
- Present Address: Department of Medicine, Cambridge Institute of Therapeutic Immunology & Infectious Disease, Jeffrey Cheah Biomedical Centre, Cambridge Biomedical Campus, University of Cambridge, Cambridge, CB2 0AW UK
| | | | - Ross D. King
- Department of Biology and Biological Engineering, Division of Systems and Synthetic Biology, Chalmers University of Technology, Kemivägen 10, SE-412 96 Gothenburg, Sweden
| |
|
140
|
Liang R, Xie J, Zhang C, Zhang M, Huang H, Huo H, Cao X, Niu B. Identifying Cancer Targets Based on Machine Learning Methods via Chou's 5-steps Rule and General Pseudo Components. Curr Top Med Chem 2019; 19:2301-2317. [PMID: 31622219 DOI: 10.2174/1568026619666191016155543] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/14/2019] [Revised: 07/19/2019] [Accepted: 08/26/2019] [Indexed: 01/09/2023]
Abstract
In recent years, the successful completion of the Human Genome Project has made people realize that genetic, environmental and lifestyle factors should be studied together in cancer research, owing to the complexity and varied forms of the disease. The increasing availability and growth rate of 'big data' derived from various omics open a new window for the study and therapy of cancer. In this paper, we introduce the application of machine learning methods to cancer big data, including artificial neural networks, support vector machines, ensemble learning and naïve Bayes classifiers.
Affiliation(s)
- Ruirui Liang
- School of Life Sciences, Shanghai University, Shanghai, 200444, China
| | - Jiayang Xie
- School of Life Sciences, Shanghai University, Shanghai, 200444, China
| | - Chi Zhang
- Foshan Huaxia Eye Hospital, Huaxia Eye Hospital Group, Foshan 528000, China
| | - Mengying Zhang
- School of Life Sciences, Shanghai University, Shanghai, 200444, China
| | - Hai Huang
- School of Life Sciences, Shanghai University, Shanghai, 200444, China
| | - Haizhong Huo
- Department of General Surgery, Shanghai Ninth People's Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai 200011, China
| | - Xin Cao
- Zhongshan Hospital, Institute of Clinical Science, Shanghai Medical College, Fudan University, Shanghai 200032, China
| | - Bing Niu
- School of Life Sciences, Shanghai University, Shanghai, 200444, China
| |
|
141
|
Rosenblatt JD, Benjamini Y, Gilron R, Mukamel R, Goeman JJ. Better-than-chance classification for signal detection. Biostatistics 2019; 22:365-380. [PMID: 31612223 DOI: 10.1093/biostatistics/kxz035] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/09/2018] [Revised: 08/09/2019] [Accepted: 08/14/2019] [Indexed: 11/13/2022] Open
Abstract
The estimated accuracy of a classifier is a random quantity with variability. A common practice in supervised machine learning is thus to test whether the estimated accuracy is significantly better than chance level. This method of signal detection is particularly popular in neuroimaging and genetics. We provide evidence that using a classifier's accuracy as a test statistic can be an underpowered strategy for finding differences between populations, compared to a bona fide statistical test. It is also computationally more demanding than a statistical test. Via simulation, we compare test statistics that are based on classification accuracy to others based on multivariate test statistics. We find that the probability of detecting differences between two distributions is lower for accuracy-based statistics. We examine several candidate causes for the low power of accuracy-tests. These causes include: the discrete nature of the accuracy-test statistic, the type of signal accuracy-tests are designed to detect, their inefficient use of the data, and their suboptimal regularization. When the purpose of the analysis is the evaluation of a particular classifier, not signal detection, we suggest several improvements to increase power, in particular replacing V-fold cross-validation with the Leave-One-Out Bootstrap.
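To make the "better-than-chance" procedure described in this abstract concrete, the following is a minimal sketch of an accuracy-based permutation test. It is not the authors' code: the nearest-centroid classifier, the leave-one-out scheme, the simulated Gaussian data, and all function names are assumptions chosen for illustration. The multivariate alternatives that the abstract compares against are not shown.

```python
import random

def centroid(points):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def loo_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier (labels 0/1)."""
    correct = 0
    for i in range(len(X)):
        train = [(x, lab) for j, (x, lab) in enumerate(zip(X, y)) if j != i]
        cents = {lab: centroid([x for x, l in train if l == lab]) for lab in (0, 1)}
        pred = min(cents, key=lambda lab: sq_dist(X[i], cents[lab]))
        correct += pred == y[i]
    return correct / len(X)

def permutation_p_value(X, y, n_perm=200, seed=0):
    """Test whether the observed accuracy exceeds chance by shuffling labels."""
    rng = random.Random(seed)
    observed = loo_accuracy(X, y)
    labels = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if loo_accuracy(X, labels) >= observed:
            hits += 1
    # The +1 terms count the observed labelling as one of the permutations.
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical data: two groups of 15 samples, shifted by 3 in both dimensions.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(15)]
X += [[data_rng.gauss(3, 1), data_rng.gauss(3, 1)] for _ in range(15)]
y = [0] * 15 + [1] * 15

observed, p_value = permutation_p_value(X, y)
print(f"LOO accuracy = {observed:.2f}, permutation p-value = {p_value:.3f}")
```

Because accuracy can only take the values 0, 1/n, ..., 1, the test statistic is discrete, which is one of the candidate causes of low power listed in the abstract.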
Affiliation(s)
- Jonathan D Rosenblatt
- Department of IE&M and Zlotowsky Center for Neuroscience, Ben Gurion University of the Negev, P.O. 653, Beer Sheva, 84105 Israel
| | - Yuval Benjamini
- Department of Statistics, Hebrew University, Mount Scopus, Jerusalem 9190501, Israel
| | - Roee Gilron
- Movement Disorders and Neuromodulation Center, University of California, 1635 Divisadero St, San Francisco, CA 94115, USA
| | - Roy Mukamel
- School of Psychological Sciences, and Sagol School of Neuroscience, Tel-Aviv University, Tel-Aviv 69978, Israel
| | - Jelle J Goeman
- Department of Biomedical Data Sciences, Leiden University Medical Center, Postbus 9600, 2300 RC Leiden, The Netherlands
| |
|
142
|
Preethi S, Aishwarya P. Combining Wavelet Texture Features and Deep Neural Network for Tumor Detection and Segmentation Over MRI. JOURNAL OF INTELLIGENT SYSTEMS 2019. [DOI: 10.1515/jisys-2017-0090] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022] Open
Abstract
Brain tumors are among the leading causes of cancer death because the brain is a very sensitive, complex, and central portion of the body. Proper and timely diagnosis can prolong the life of a person to some extent. Consequently, in this paper, we propose a brain tumor classification scheme that combines wavelet texture features and deep neural networks (DNNs). The system comprises four modules: (i) feature extraction, (ii) feature selection, (iii) tumor classification, and (iv) segmentation. First, we remove noise from the image. Then, the feature matrix is produced by combining wavelet texture features [gray-level co-occurrence matrix (GLCM)+wavelet GLCM]. Following that, we select the relevant features with the help of the oppositional flower pollination algorithm (OFPA), because a large number of features is a major obstacle to classification. Then, we categorize the brain image based on the selected features using the DNN. After the classification procedure, the proposed scheme extracts the tumor region from the tumor images with the help of the possibilistic fuzzy c-means clustering (PFCM) algorithm. The experimental results show that the proposed system outperforms the available methods.
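As an illustration of the GLCM texture features mentioned in this abstract, here is a minimal sketch that builds a gray-level co-occurrence matrix for a tiny image and derives three common texture descriptors from it. The toy image, the horizontal offset, the symmetric counting, and the function names are all assumptions. The wavelet decomposition, OFPA feature selection, DNN, and PFCM stages of the paper are not reproduced.

```python
from collections import Counter

def glcm(image, dx=1, dy=0, symmetric=True):
    """Normalised co-occurrence probabilities of gray-level pairs at offset (dx, dy)."""
    counts = Counter()
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                a, b = image[r][c], image[r2][c2]
                counts[(a, b)] += 1
                if symmetric:
                    # Count the pair in both directions so the matrix is symmetric.
                    counts[(b, a)] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}

def texture_features(p):
    """Contrast, energy and homogeneity computed from a normalised GLCM."""
    contrast = sum(prob * (i - j) ** 2 for (i, j), prob in p.items())
    energy = sum(prob ** 2 for prob in p.values())
    homogeneity = sum(prob / (1 + abs(i - j)) for (i, j), prob in p.items())
    return {"contrast": contrast, "energy": energy, "homogeneity": homogeneity}

# Hypothetical 4x4 image with four gray levels (0..3).
toy_image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]

matrix = glcm(toy_image)
features = texture_features(matrix)
print(features)
```

In a pipeline like the one described, such descriptors would be computed for several offsets (and, per the abstract, on wavelet sub-bands as well) and concatenated into the feature matrix before feature selection.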
|
143
|
Li W, Lederer J. Tuning parameter calibration for ℓ1-regularized logistic regression. J Stat Plan Inference 2019. [DOI: 10.1016/j.jspi.2019.01.006] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022]
|
144
|
Xu K, Hao X. A nonparametric test for block-diagonal covariance structure in high dimension and small samples. J MULTIVARIATE ANAL 2019. [DOI: 10.1016/j.jmva.2019.05.001] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
|
145
|
A Review of Computational Methods for Clustering Genes with Similar Biological Functions. Processes (Basel) 2019. [DOI: 10.3390/pr7090550] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/24/2022] Open
Abstract
Clustering techniques can group genes based on similarity in biological functions. However, the drawback of using clustering techniques is the inability to identify an optimal number of potential clusters beforehand. Several existing optimization techniques can address the issue. Besides, clustering validation can predict the possible number of potential clusters and hence increase the chances of identifying biologically informative genes. This paper reviews and provides examples of existing methods for clustering genes, optimization of the objective function, and clustering validation. Clustering techniques can be categorized into partitioning, hierarchical, grid-based, and density-based techniques. We also highlight the advantages and the disadvantages of each category. To optimize the objective function, here we introduce the swarm intelligence technique and compare the performances of other methods. Moreover, we discuss the differences of measurements between internal and external criteria to validate a cluster quality. We also investigate the performance of several clustering techniques by applying them on a leukemia dataset. The results show that grid-based clustering techniques provide better classification accuracy; however, partitioning clustering techniques are superior in identifying prognostic markers of leukemia. Therefore, this review suggests combining clustering techniques such as CLIQUE and k-means to yield high-quality gene clusters.
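Of the partitioning techniques this review mentions, k-means is the simplest to sketch. The example below is a minimal, assumption-laden illustration: the six "gene expression profiles" are invented, the number of clusters is fixed at two, and the within-cluster sum of squares stands in for the internal validation criteria the abstract discusses. Grid-based methods such as CLIQUE and the swarm-intelligence optimizers are not shown.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]

def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            new_centers.append(mean(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

def within_cluster_ss(points, labels, centers):
    """Internal quality criterion: total squared distance to assigned centers."""
    return sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))

# Hypothetical expression profiles (3 conditions) for six genes.
profiles = [
    [0.0, 0.1, 0.0], [0.2, 0.0, 0.1], [0.1, 0.2, 0.1],   # low-expression group
    [5.0, 5.1, 4.9], [5.2, 4.9, 5.0], [4.9, 5.0, 5.1],   # high-expression group
]

labels, centers = kmeans(profiles, k=2)
wcss = within_cluster_ss(profiles, labels, centers)
print(labels, round(wcss, 3))
```

The number of clusters has to be supplied up front, which is the drawback the abstract points out; comparing the within-cluster sum of squares across several values of k is one simple way to approach that choice.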
|
146
|
Romanes SE, Ormerod JT, Yang JYH. Diagonal Discriminant Analysis With Feature Selection for High-Dimensional Data. J Comput Graph Stat 2019. [DOI: 10.1080/10618600.2019.1637748] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/26/2022]
Affiliation(s)
- Sarah E. Romanes
- School of Mathematics and Statistics, University of Sydney, Sydney, Australia
| | - John T. Ormerod
- School of Mathematics and Statistics, University of Sydney, Sydney, Australia
- ARC Centre of Excellence for Mathematical & Statistical Frontiers, The University of Melbourne, Parkville VIC, Australia
| | - Jean Y. H. Yang
- School of Mathematics and Statistics, University of Sydney, Sydney, Australia
- The Judith and David Coffey Life Lab, Charles Perkins Centre, University of Sydney, Sydney, Australia
| |
|
147
|
Du L, Liu K, Yao X, Risacher SL, Guo L, Saykin AJ, Shen L. DIAGNOSIS STATUS GUIDED BRAIN IMAGING GENETICS VIA INTEGRATED REGRESSION AND SPARSE CANONICAL CORRELATION ANALYSIS. PROCEEDINGS. IEEE INTERNATIONAL SYMPOSIUM ON BIOMEDICAL IMAGING 2019; 2019:356-359. [PMID: 31844486 DOI: 10.1109/isbi.2019.8759489] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/07/2022]
Abstract
Brain imaging genetics uses imaging quantitative traits (QTs) as intermediate endophenotypes to identify the genetic basis of brain structure, function and abnormality. Regression and canonical correlation analysis (CCA) coupled with sparsity regularization are widely used in imaging genetics. Regression only selects relevant features for predictors. Sparse CCA (SCCA) overcomes this but is unsupervised and thus cannot make use of the diagnosis information. We propose a novel method integrating regression and SCCA to construct a supervised sparse bi-multivariate learning model. The regression part provides guidance for imaging QT selection, and the SCCA part focuses on selecting relevant genetic markers and imaging QTs. We propose an efficient algorithm based on the alternative search method. Our method obtains better feature selection results than both regression and SCCA on both synthetic and real neuroimaging data. This demonstrates that our method is a promising bi-multivariate tool for brain imaging genetics.
Affiliation(s)
- Lei Du
- School of Automation, Northwestern Polytechnical University, Xi'an, China
| | - Kefei Liu
- University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
| | - Xiaohui Yao
- University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
| | | | - Lei Guo
- School of Automation, Northwestern Polytechnical University, Xi'an, China
| | - Andrew J Saykin
- Indiana University School of Medicine, Indianapolis, IN, USA
| | - Li Shen
- University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
| | | |
|
148
|
Goksuluk D, Zararsiz G, Korkmaz S, Eldem V, Zararsiz GE, Ozcetin E, Ozturk A, Karaagaoglu AE. MLSeq: Machine learning interface for RNA-sequencing data. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2019; 175:223-231. [PMID: 31104710 DOI: 10.1016/j.cmpb.2019.04.007] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/23/2018] [Revised: 03/21/2019] [Accepted: 04/08/2019] [Indexed: 06/09/2023]
Abstract
BACKGROUND AND OBJECTIVE In the last decade, RNA-sequencing technology has become the method of choice and is preferred to microarray technology for gene expression based classification and differential expression analysis, since it produces less noisy data. Although many algorithms have been proposed for microarray data, the number of available algorithms and programs is limited for classification of RNA-sequencing data. For this reason, we developed MLSeq to bring together not only frequently used classification algorithms but also novel approaches, and to make them available for classification of RNA-sequencing data. The package is developed in the R language environment and distributed through the Bioconductor network. METHODS Classification of RNA-sequencing data is not straightforward since raw data should be preprocessed before downstream analysis. With the MLSeq package, researchers can easily preprocess (normalization, filtering, transformation etc.) and classify raw RNA-sequencing data using two strategies: (i) applying algorithms which are directly proposed for the RNA-sequencing data structure, or (ii) transforming RNA-sequencing data in order to bring it distributionally closer to the microarray data structure, and applying algorithms which are developed for microarray data. Moreover, we proposed novel algorithms such as voom (an acronym for variance modelling at observational level) based nearest shrunken centroids (voomNSC), diagonal linear discriminant analysis (voomDLDA), etc. through MLSeq. MATERIALS Three real RNA-sequencing datasets (i.e., cervical cancer, lung cancer and aging datasets) were used to evaluate model performances. Poisson linear discriminant analysis (PLDA) and negative binomial linear discriminant analysis (NBLDA) were selected as algorithms based on discrete distributions, and voomNSC, nearest shrunken centroids (NSC) and support vector machines (SVM) were selected as algorithms based on continuous distributions for model comparisons. Each algorithm is compared using classification accuracies and sparsities on an independent test set. RESULTS The algorithms which are based on discrete distributions performed better in cervical cancer and aging data, with accuracies above 0.92. In lung cancer data, most algorithms performed similarly, with accuracies of 0.88, except that SVM achieved an accuracy of 0.94. Our voomNSC algorithm was the sparsest algorithm, selecting 2.2% and 6.6% of all features for the cervical cancer and lung cancer datasets, respectively. However, in aging data, sparse classifiers were not able to select an optimal subset of all features. CONCLUSION MLSeq is a comprehensive and easy-to-use interface for classification of gene expression data. It allows researchers to perform both preprocessing and classification tasks through a single platform. With this property, MLSeq can be considered as a pipeline for the classification of RNA-sequencing data.
Affiliation(s)
- Dincer Goksuluk
- Department of Biostatistics, School of Medicine, Hacettepe University, 06100, Ankara, Turkey; Turcosa Analytics Solutions Ltd. Co., Erciyes Teknopark 5, 38030, Kayseri, Turkey
| | - Gokmen Zararsiz
- Department of Biostatistics, School of Medicine, Erciyes University, 38030, Kayseri, Turkey; Turcosa Analytics Solutions Ltd. Co., Erciyes Teknopark 5, 38030, Kayseri, Turkey.
| | - Selcuk Korkmaz
- Department of Biostatistics, School of Medicine, Trakya University, 22030, Edirne, Turkey; Turcosa Analytics Solutions Ltd. Co., Erciyes Teknopark 5, 38030, Kayseri, Turkey
| | - Vahap Eldem
- Department of Biology, Faculty of Science, Istanbul University, 34452, Istanbul, Turkey
| | - Gozde Erturk Zararsiz
- Department of Biostatistics, School of Medicine, Erciyes University, 38030, Kayseri, Turkey
| | - Erdener Ozcetin
- Department of Industrial Engineering, Faculty of Engineering, Hitit University, 19030, Corum, Turkey
| | - Ahmet Ozturk
- Department of Biostatistics, School of Medicine, Erciyes University, 38030, Kayseri, Turkey; Turcosa Analytics Solutions Ltd. Co., Erciyes Teknopark 5, 38030, Kayseri, Turkey
| | - Ahmet Ergun Karaagaoglu
- Department of Biostatistics, School of Medicine, Hacettepe University, 06100, Ankara, Turkey
| |
|
149
|
Yoo TK, Ryu IH, Lee G, Kim Y, Kim JK, Lee IS, Kim JS, Rim TH. Adopting machine learning to automatically identify candidate patients for corneal refractive surgery. NPJ Digit Med 2019; 2:59. [PMID: 31304405 PMCID: PMC6586803 DOI: 10.1038/s41746-019-0135-8] [Citation(s) in RCA: 36] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/07/2019] [Accepted: 05/30/2019] [Indexed: 12/26/2022] Open
Abstract
Recently, it has become more important to screen candidates that undergo corneal refractive surgery to prevent complications. Until now, there is still no definitive screening method to confront the possibility of a misdiagnosis. We evaluate the possibilities of machine learning as a clinical decision support to determine the suitability to corneal refractive surgery. A machine learning architecture was built with the aim of identifying candidates combining the large multi-instrument data from patients and clinical decisions of highly experienced experts. Five heterogeneous algorithms were used to predict candidates for surgery. Subsequently, an ensemble classifier was developed to improve the performance. Training (10,561 subjects) and internal validation (2640 subjects) were conducted using subjects who had visited between 2016 and 2017. External validation (5279 subjects) was performed using subjects who had visited in 2018. The best model, i.e., the ensemble classifier, had a high prediction performance with the area under the receiver operating characteristic curves of 0.983 (95% CI, 0.977-0.987) and 0.972 (95% CI, 0.967-0.976) when tested in the internal and external validation set, respectively. The machine learning models were statistically superior to classic methods including the percentage of tissue ablated and the Randleman ectatic score. Our model was able to correctly reclassify a patient with postoperative ectasia as an ectasia-risk group. Machine learning algorithms using a wide range of preoperative information achieved a comparable performance to screen candidates for corneal refractive surgery. An automated machine learning analysis of preoperative data can provide a safe and reliable clinical decision for refractive surgery.
Affiliation(s)
- Tae Keun Yoo
- B&VIIt Eye Center, Seoul, South Korea; Institute of Vision Research, Department of Ophthalmology, Yonsei University College of Medicine, Seoul, South Korea
| | | | | | | | | | | | | | - Tyler Hyungtaek Rim
- Singapore Eye Research Institute, Singapore National Eye Centre, Duke-NUS Medical School, Singapore, Singapore; Department of Ophthalmology, Yonsei University College of Medicine, Graduate School, Seoul, Korea
| |
|
150
|
Wang X, Zhang HH, Wu Y. Multiclass Probability Estimation With Support Vector Machines. J Comput Graph Stat 2019. [DOI: 10.1080/10618600.2019.1585260] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/27/2022]
Affiliation(s)
| | - Hao Helen Zhang
- Department of Mathematics, University of Arizona, Tucson, AZ
| | - Yichao Wu
- Department of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago, Chicago, IL
| |
|