51
Miao R, Xu K. Joint test for homogeneity of high-dimensional means and covariance matrices using maximum-type statistics. COMMUN STAT-SIMUL C 2022. [DOI: 10.1080/03610918.2022.2037641] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2022]
Affiliation(s)
- Runsheng Miao
- College of Communication and Art Design, University of Shanghai for Science and Technology, Shanghai, China
- Kai Xu
- School of Mathematics and Statistics, Anhui Normal University, Wuhu, China
52
Dong J, Peng L, Yang X, Zhang Z, Zhang P. XGBoost-based intelligence yield prediction and reaction factors analysis of amination reaction. J Comput Chem 2022; 43:289-302. [PMID: 34862652 DOI: 10.1002/jcc.26791] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 07/24/2021] [Revised: 11/19/2021] [Accepted: 11/22/2021] [Indexed: 11/10/2022]
Abstract
Buchwald-Hartwig amination reaction catalyzed by palladium plays an important role in drug synthesis. In the last few years, machine learning-assisted strategies emerged and quickly gained attention. In this article, an importance and relevance-based integrated feature screening method is proposed to effectively filter high-dimensional feature descriptor data. Then, a regularized machine learning boosting tree model, eXtreme Gradient Boosting, is introduced to intelligently predict reaction performance in multidimensional chemistry space. Furthermore, convergence, interpretability, generalization, and the internal association between reaction conditions and yields are explored, which provides intelligent assistance for the optimal design of coupling reaction system and evaluating the reaction conditions. Compared with recently published results, the proposed method requires fewer feature descriptors, takes less time, and achieves higher prediction accuracy.
Affiliation(s)
- Jing Dong
- Henan Engineering Research Center for Artificial Intelligence Theory and Algorithms, School of Mathematics and Statistics, Henan University, Kaifeng, China
- Lichao Peng
- National & Local Joint Engineering Research Center for Applied Technology of Hybrid Nanomaterials, Henan University, Kaifeng, China
- Xiaohui Yang
- Henan Engineering Research Center for Artificial Intelligence Theory and Algorithms, School of Mathematics and Statistics, Henan University, Kaifeng, China
- Zelin Zhang
- School of Computer and Information Engineering, Henan University, Kaifeng, China
- Puyu Zhang
- College of Chemistry and Chemical Engineering, Henan University, Kaifeng, China
53
Cao X, Gregory K, Wang D. Inference for sparse linear regression based on the leave-one-covariate-out solution path. COMMUN STAT-THEOR M 2022; 52:6640-6657. [PMID: 37840573 PMCID: PMC10572792 DOI: 10.1080/03610926.2022.2032171] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/17/2021] [Accepted: 01/17/2022] [Indexed: 11/03/2022]
Abstract
We propose a new measure of variable importance in high-dimensional regression based on the change in the LASSO solution path when one covariate is left out. The proposed procedure provides a novel way to calculate variable importance and conduct variable screening. In addition, our procedure allows for the construction of p-values for testing whether each coefficient is equal to zero as well as for testing hypotheses involving multiple regression coefficients simultaneously; bootstrap techniques are used to construct the null distribution. For low-dimensional linear models, our method can achieve higher power than the t-test. Extensive simulations are provided to show the effectiveness of our method. In the high-dimensional setting, our proposed solution path based test achieves greater power than some other recently developed high-dimensional inference methods. We extend our method to logistic regression and demonstrate in simulation that our leave-one-covariate-out solution path tests can provide accurate p-values.
Affiliation(s)
- Xiangyang Cao
- 216 LeConte College, 1523 Greene St, Columbia, SC 29201, USA
- Karl Gregory
- 216 LeConte College, 1523 Greene St, Columbia, SC 29201, USA
- Dewei Wang
- 216 LeConte College, 1523 Greene St, Columbia, SC 29201, USA
54
Gan Y, Ma J, Peng H, Zhu H, Ju Q, Chen Y. Ten ignored questions for stress psychology research. Psych J 2022; 11:132-141. [PMID: 35112503 DOI: 10.1002/pchj.520] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/28/2021] [Accepted: 12/30/2021] [Indexed: 01/06/2023]
Abstract
Stress psychology is an interesting and important interdisciplinary research field. In this perspective article, we briefly discuss 10 challenges related to the conceptual definition, research methodology, and translation in the field of stress that do not receive sufficient attention or are ignored entirely. Future research should attempt to integrate a comprehensive stress conceptual framework into a multidimensional comprehensive stress model, incorporating subjective and objective indicators as comprehensive measures. The popularity of machine learning, cognitive neuroscience, and gene epigenetics is a promising approach that brings innovation to the field of stress psychology. The development of wearable devices that precisely record physiological signals to assess stress responses in naturalistic situations, standardize real-life stressors, and measure baselines presents challenges to address in the future. Conducting large individualized and digital intervention studies could be crucial steps in enhancing the translation of research.
Affiliation(s)
- Yiqun Gan
- School of Psychological Cognitive Sciences, and Beijing Key Laboratory of Behavior and Mental Health, Peking University, Beijing, China
- Jinjin Ma
- School of Psychological Cognitive Sciences, and Beijing Key Laboratory of Behavior and Mental Health, Peking University, Beijing, China
- Huini Peng
- School of Psychological Cognitive Sciences, and Beijing Key Laboratory of Behavior and Mental Health, Peking University, Beijing, China
- Huanya Zhu
- School of Psychological Cognitive Sciences, and Beijing Key Laboratory of Behavior and Mental Health, Peking University, Beijing, China
- Qianqian Ju
- School of Psychological Cognitive Sciences, and Beijing Key Laboratory of Behavior and Mental Health, Peking University, Beijing, China
- Yidi Chen
- School of Psychological Cognitive Sciences, and Beijing Key Laboratory of Behavior and Mental Health, Peking University, Beijing, China
55
Bajo-Morales J, Galvez JM, Prieto-Prieto JC, Herrera LJ, Rojas I, Castillo-Secilla D. Heterogeneous Gene Expression Cross-Evaluation of Robust Biomarkers Using Machine Learning Techniques Applied to Lung Cancer. Curr Bioinform 2022. [DOI: 10.2174/1574893616666211005114934] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/24/2022]
Abstract
Background:
Nowadays, gene expression analysis is one of the most promising pillars for understanding and uncovering the mechanisms underlying the development and spread of cancer. In this sense, Next Generation Sequencing technologies, such as RNA-Seq, are currently leading the market due to their precision and cost. Nevertheless, there is still an enormous amount of non-analyzed data obtained from older technologies, such as Microarray, which could still be useful to extract relevant knowledge.
Methods:
Throughout this research, a complete machine learning methodology to cross-evaluate the compatibility between both RNA-Seq and Microarray sequencing technologies is described and implemented. In order to show a real application of the designed pipeline, a lung cancer case study is addressed by considering two detected subtypes: adenocarcinoma and squamous cell carcinoma. Transcriptomic datasets considered for our study have been obtained from the public repositories NCBI/GEO, ArrayExpress and GDC-Portal. From them, several gene experiments have been carried out with the aim of finding gene signatures for these lung cancer subtypes, linked to both transcriptomic technologies. With these DEGs selected, intelligent predictive models capable of classifying new samples belonging to these cancer subtypes have been developed.
Results:
The predictive models built using one technology are capable of discerning samples from a different technology. The classification results are evaluated in terms of accuracy, F1-score and ROC curves along with AUC. Finally, the biological information of the gene sets obtained and their relationship with lung cancer are reviewed, encountering strong biological evidence linking them to the disease.
Conclusion:
Our method has the capability of finding strong gene signatures which are also independent of the transcriptomic technology used to develop the analysis. In addition, our article highlights the potential of using heterogeneous transcriptomic data to increase the amount of samples for the studies, increasing the statistical significance of the results.
Affiliation(s)
- Javier Bajo-Morales
- Department of Computer Architecture and Technology, University of Granada, C.I.T.I.C., Periodista Rafael Gómez Montero, 2, 18014, Granada, Spain
- Juan Manuel Galvez
- Department of Computer Architecture and Technology, University of Granada, C.I.T.I.C., Periodista Rafael Gómez Montero, 2, 18014, Granada, Spain
- Juan Carlos Prieto-Prieto
- Nuclear Medicine Department, IMIBIC, University Hospital Reina Sofia, Menéndez Pidal Avenue, 14004, Córdoba, Spain
- Luis Javier Herrera
- Department of Computer Architecture and Technology, University of Granada, C.I.T.I.C., Periodista Rafael Gómez Montero, 2, 18014, Granada, Spain
- Ignacio Rojas
- Department of Computer Architecture and Technology, University of Granada, C.I.T.I.C., Periodista Rafael Gómez Montero, 2, 18014, Granada, Spain
- Daniel Castillo-Secilla
- Department of Computer Architecture and Technology, University of Granada, C.I.T.I.C., Periodista Rafael Gómez Montero, 2, 18014, Granada, Spain
56
Hulstaert E, Levanon K, Morlion A, Van Aelst S, Christidis AA, Zamar R, Anckaert J, Verniers K, Bahar-Shany K, Sapoznik S, Vandesompele J, Mestdagh P. RNA biomarkers from proximal liquid biopsy for diagnosis of ovarian cancer. Neoplasia 2022; 24:155-164. [PMID: 34998206 PMCID: PMC8740458 DOI: 10.1016/j.neo.2021.12.008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/02/2021] [Accepted: 12/20/2021] [Indexed: 10/29/2022]
Abstract
BACKGROUND Most ovarian cancer patients are diagnosed at an advanced stage and have a high mortality rate. Current screening strategies fail to improve prognosis because markers that are sensitive for early stage disease are lacking. This medical need justifies the search for novel approaches using utero-tubal lavage as a proximal liquid biopsy. METHODS In this study, we explore the extracellular transcriptome of utero-tubal lavage fluid obtained from 26 ovarian cancer patients and 48 controls using messenger RNA (mRNA) capture and small RNA sequencing. RESULTS We observed an enrichment of ovarian and fallopian tube specific messenger RNAs in utero-tubal lavage fluid compared to other human biofluids. Over 300 mRNAs and 41 miRNAs were upregulated in ovarian cancer samples compared with controls. Upregulated genes were enriched for genes involved in cell cycle activation and proliferation, hinting at a tumor-derived signal. CONCLUSION This is a proof-of-principle that mRNA capture sequencing of utero-tubal lavage fluid is technically feasible, and that the extracellular transcriptome of utero-tubal lavage should be further explored in larger cohorts to assess the diagnostic value of the biomarkers identified in this study. IMPACT Proximal liquid biopsy from the gynecologic tract is a promising source for mRNA and miRNA biomarkers for diagnosis of early-stage ovarian cancer.
Affiliation(s)
- Eva Hulstaert
- Department of Biomolecular Medicine, Ghent University, Corneel Heymanslaan 10, 9000 Ghent, Belgium; OncoRNALab, Cancer Research Institute Ghent (CRIG), Corneel Heymanslaan 10, 9000 Ghent, Belgium; Department of Dermatology, Ghent University Hospital, Corneel Heymanslaan 10, 9000 Ghent, Belgium
- Keren Levanon
- Sheba Cancer Research Center, Chaim Sheba Medical Center, Ramat Gan, Israel; Sackler Faculty of Medicine, Tel Aviv University, Ramat Aviv, Israel
- Annelien Morlion
- Department of Biomolecular Medicine, Ghent University, Corneel Heymanslaan 10, 9000 Ghent, Belgium; OncoRNALab, Cancer Research Institute Ghent (CRIG), Corneel Heymanslaan 10, 9000 Ghent, Belgium
- Ruben Zamar
- Department of Statistics, University of British Columbia, Vancouver, Canada
- Jasper Anckaert
- Department of Biomolecular Medicine, Ghent University, Corneel Heymanslaan 10, 9000 Ghent, Belgium; OncoRNALab, Cancer Research Institute Ghent (CRIG), Corneel Heymanslaan 10, 9000 Ghent, Belgium
- Kimberly Verniers
- Department of Biomolecular Medicine, Ghent University, Corneel Heymanslaan 10, 9000 Ghent, Belgium; OncoRNALab, Cancer Research Institute Ghent (CRIG), Corneel Heymanslaan 10, 9000 Ghent, Belgium
- Keren Bahar-Shany
- Sheba Cancer Research Center, Chaim Sheba Medical Center, Ramat Gan, Israel
- Stav Sapoznik
- Sheba Cancer Research Center, Chaim Sheba Medical Center, Ramat Gan, Israel
- Jo Vandesompele
- Department of Biomolecular Medicine, Ghent University, Corneel Heymanslaan 10, 9000 Ghent, Belgium; OncoRNALab, Cancer Research Institute Ghent (CRIG), Corneel Heymanslaan 10, 9000 Ghent, Belgium
- Pieter Mestdagh
- Department of Biomolecular Medicine, Ghent University, Corneel Heymanslaan 10, 9000 Ghent, Belgium; OncoRNALab, Cancer Research Institute Ghent (CRIG), Corneel Heymanslaan 10, 9000 Ghent, Belgium.
57
Krepel J, Kircher M, Kohls M, Jung K. Comparison of merging strategies for building machine learning models on multiple independent gene expression data sets. Stat Anal Data Min 2022. [DOI: 10.1002/sam.11549] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/11/2022]
Affiliation(s)
- Jessica Krepel
- Institute for Animal Breeding and Genetics University of Veterinary Medicine Hannover Hannover Germany
- Magdalena Kircher
- Institute for Animal Breeding and Genetics University of Veterinary Medicine Hannover Hannover Germany
- Moritz Kohls
- Institute for Animal Breeding and Genetics University of Veterinary Medicine Hannover Hannover Germany
- Klaus Jung
- Institute for Animal Breeding and Genetics University of Veterinary Medicine Hannover Hannover Germany
58
Zeng Y, Wei Z, Pan Z, Lu Y, Yang Y. A robust and scalable graph neural network for accurate single-cell classification. Brief Bioinform 2022; 23:6501353. [PMID: 35018408 DOI: 10.1093/bib/bbab570] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 10/28/2021] [Revised: 12/01/2021] [Accepted: 12/11/2021] [Indexed: 12/25/2022] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) techniques provide high-resolution data on cellular heterogeneity in diverse tissues, and a critical step for the data analysis is cell type identification. Traditional methods usually cluster the cells and manually identify cell clusters through marker genes, which is time-consuming and subjective. With the launch of several large-scale single-cell projects, millions of sequenced cells have been annotated and it is promising to transfer labels from the annotated datasets to newly generated datasets. One powerful way for the transferring is to learn cell relations through the graph neural network (GNN), but traditional GNNs are difficult to process millions of cells due to the expensive costs of the message-passing procedure at each training epoch. Here, we have developed a robust and scalable GNN-based method for accurate single-cell classification (GraphCS), where the graph is constructed to connect similar cells within and between labelled and unlabeled scRNA-seq datasets for propagation of shared information. To overcome the slow information propagation of GNN at each training epoch, the diffused information is pre-calculated via the approximate Generalized PageRank algorithm, enabling sublinear complexity over cell numbers. Compared with existing methods, GraphCS demonstrates better performance on simulated, cross-platform, cross-species and cross-omics scRNA-seq datasets. More importantly, our model provides a high speed and scalability on large datasets, and can achieve superior performance for 1 million cells within 50 min.
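The pre-computation idea in this abstract can be illustrated with a small sketch. It is not the GraphCS implementation: it assumes the textbook form of Generalized PageRank, a weighted sum of powers of a row-normalized adjacency matrix applied to the node (cell) features, and it computes that sum exactly on a toy graph instead of approximately. The function name `generalized_pagerank`, the toy graph and the weights are all hypothetical.

```python
def generalized_pagerank(adj, features, weights):
    """Return sum_k weights[k] * P^k X, where P is the row-normalized adjacency.

    adj: n x n list of lists (edge weights), features: n x d list of lists,
    weights: list of per-hop coefficients (assumed, not taken from the paper).
    """
    n = len(adj)
    dim = len(features[0])
    # Row-normalize the adjacency; rows without edges stay all-zero.
    P = []
    for row in adj:
        s = sum(row)
        P.append([v / s if s else 0.0 for v in row])
    current = [list(f) for f in features]          # P^0 X
    result = [[0.0] * dim for _ in range(n)]
    for w in weights:
        # Accumulate the contribution of the current hop.
        for i in range(n):
            for c in range(dim):
                result[i][c] += w * current[i][c]
        # Advance one hop: current <- P @ current.
        current = [
            [sum(P[i][j] * current[j][c] for j in range(n)) for c in range(dim)]
            for i in range(n)
        ]
    return result


# Toy example: two connected cells, one scalar feature each.
diffused = generalized_pagerank(
    adj=[[0, 1], [1, 0]],
    features=[[1.0], [0.0]],
    weights=[0.5, 0.5],
)
```

Because the diffused features depend only on the graph and the inputs, they can be computed once before training, which is the property the abstract relies on to avoid message passing at every epoch.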
Affiliation(s)
- Yuansong Zeng
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China
- Zhuoyi Wei
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China
- Zixiang Pan
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China
- Yutong Lu
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China
- Yuedong Yang
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China; Key Laboratory of Machine Intelligence and Advanced Computing (MOE), Guangzhou 510000, China
59
Li J, Liang K, Song X. Logistic regression with adaptive sparse group lasso penalty and its application in acute leukemia diagnosis. Comput Biol Med 2021; 141:105154. [PMID: 34952336 DOI: 10.1016/j.compbiomed.2021.105154] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 10/22/2021] [Revised: 12/14/2021] [Accepted: 12/15/2021] [Indexed: 01/15/2023]
Abstract
Cancer diagnosis based on gene expression profile data has attracted extensive attention in computational biology and medicine. It suffers from three challenges in practical applications: noise, gene grouping, and adaptive gene selection. This paper aims to solve the above problems by developing the logistic regression with adaptive sparse group lasso penalty (LR-ASGL). A noise information processing method for cancer gene expression profile data is first presented via robust principal component analysis. Genes are then divided into groups by performing weighted gene co-expression network analysis on the clean matrix. By approximating the relative value of the noise size, gene reliability criterion and robust evaluation criterion are proposed. Finally, LR-ASGL is presented for simultaneous cancer diagnosis and adaptive gene selection. The performance of the proposed method is compared with the other four methods in three simulation settings: Gaussian noise, uniformly distributed noise, and mixed noise. The acute leukemia data are adopted as an experimental example to demonstrate the advantages of LR-ASGL in prediction and gene selection.
Affiliation(s)
- Juntao Li
- College of Mathematics and Information Science, Henan Normal University, Xinxiang, 453007, China.
- Ke Liang
- College of Mathematics and Information Science, Henan Normal University, Xinxiang, 453007, China.
- Xuekun Song
- College of Information Technology, Henan University of Chinese Medicine, Zhengzhou, 450046, China.
60
Machine learning approaches for classification of colorectal cancer with and without feature selection method on microarray data. GENE REPORTS 2021. [DOI: 10.1016/j.genrep.2021.101419] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 01/16/2023]
61
Xiong J, He W. Identification of survival relevant genes with measurement error in gene expression incorporated. COMMUN STAT-THEOR M 2021. [DOI: 10.1080/03610926.2021.2004424] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
Affiliation(s)
- Juan Xiong
- Health Science Center, Shengzhen University, Shengzhen, Guangdong, P. R. China
- Wenqing He
- University of Western Ontario, London, Ontario, Canada
62
Band-based similarity indices for gene expression classification and clustering. Sci Rep 2021; 11:21609. [PMID: 34732744 PMCID: PMC8566472 DOI: 10.1038/s41598-021-00678-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/23/2021] [Accepted: 10/11/2021] [Indexed: 11/16/2022] Open
Abstract
The concept of depth induces an ordering from centre outwards in multivariate data. Most depth definitions are unfeasible for dimensions larger than three or four, but the Modified Band Depth (MBD) is a notable exception that has proven to be a valuable tool in the analysis of high-dimensional gene expression data. This depth definition relates the centrality of each individual to its (partial) inclusion in all possible bands formed by elements of the data set. We assess (dis)similarity between pairs of observations by accounting for such bands and constructing binary matrices associated to each pair. From these, contingency tables are calculated and used to derive standard similarity indices. Our approach is computationally efficient and can be applied to bands formed by any number of observations from the data set. We have evaluated the performance of several band-based similarity indices with respect to that of other classical distances in standard classification and clustering tasks in a variety of simulated and real data sets. However, the use of the method is not restricted to these, the extension to other similarity coefficients being straightforward. Our experiments show the benefits of our technique, with some of the selected indices outperforming, among others, the Euclidean distance.
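As an illustration of the depth notion this abstract builds on, the sketch below computes the Modified Band Depth for bands formed by pairs of observations. This is an assumption for the example, since the abstract notes that bands can be formed by any number of observations. For each observation it averages, over all pairs, the proportion of coordinates lying inside the band spanned by that pair. The function name `modified_band_depth` and the toy data are hypothetical, and the sketch does not reproduce the paper's similarity indices.

```python
from itertools import combinations


def modified_band_depth(data):
    """Modified Band Depth with two-observation bands.

    data: list of observations, each a list of d coordinates.
    Returns one depth value in [0, 1] per observation.
    """
    n = len(data)
    d = len(data[0])
    pairs = list(combinations(range(n), 2))
    depths = []
    for x in data:
        total = 0.0
        for j, k in pairs:
            # Count coordinates of x inside the band delimited by observations j and k.
            inside = sum(
                1
                for t in range(d)
                if min(data[j][t], data[k][t]) <= x[t] <= max(data[j][t], data[k][t])
            )
            total += inside / d
        depths.append(total / len(pairs))
    return depths


# Three flat profiles: the middle one lies inside every band, so it is the deepest.
curves = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]
depths = modified_band_depth(curves)
```

The loop over all pairs costs O(n² d) per observation, which is acceptable for a toy example but not for large data sets.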
63
Cellular, molecular, and therapeutic characterization of pilocarpine-induced temporal lobe epilepsy. Sci Rep 2021; 11:19102. [PMID: 34580351 PMCID: PMC8476594 DOI: 10.1038/s41598-021-98534-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 06/15/2021] [Accepted: 09/09/2021] [Indexed: 12/30/2022] Open
Abstract
Animal models have expanded our understanding of temporal lobe epilepsy (TLE). However, translating these to cell-specific druggable hypotheses is not explored. Herein, we conducted an integrative insilico-analysis of an available transcriptomics dataset obtained from animals with pilocarpine-induced-TLE. A set of 119 genes with subtle-to-moderate impact predicted most forms of epilepsy with ~ 97% accuracy and characteristically mapped to upregulated homeostatic and downregulated synaptic pathways. The deconvolution of cellular proportions revealed opposing changes in diverse cell types. The proportion of nonneuronal cells increased whereas that of interneurons, except for those expressing vasoactive intestinal peptide (Vip), decreased, and pyramidal neurons of the cornu-ammonis (CA) subfields showed the highest variation in proportion. A probabilistic Bayesian-network demonstrated an aberrant and oscillating physiological interaction between nonneuronal cells involved in the blood–brain-barrier and Vip interneurons in driving seizures, and their role was evaluated insilico using transcriptomic changes induced by valproic-acid, which showed opposing effects in the two cell-types. Additionally, we revealed novel epileptic and antiepileptic mechanisms and predicted drugs using causal inference, outperforming the present drug repurposing approaches. These well-powered findings not only expand the understanding of TLE and seizure oscillation, but also provide predictive biomarkers of epilepsy, cellular and causal micro-circuitry changes associated with it, and a drug-discovery method focusing on these events.
64
Automated Raman Micro-Spectroscopy of Epithelial Cell Nuclei for High-Throughput Classification. Cancers (Basel) 2021; 13:cancers13194767. [PMID: 34638253 PMCID: PMC8507544 DOI: 10.3390/cancers13194767] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 08/16/2021] [Revised: 09/15/2021] [Accepted: 09/16/2021] [Indexed: 11/25/2022] Open
Abstract
Simple Summary: We demonstrate an automated Raman cytology system designed for high-throughput and reproducibility. The system uses a Raman spectroscopy system integrated into a conventional microscope, all controlled electronically via an open source software, Micro-Manager. The system can automatically identify and probe epithelial cell nuclei for Raman spectroscopy. 6426 HT1197 (high-grade bladder cancer) cell spectra, and 7499 RT112 (low-grade bladder cancer) cell spectra were recorded. The data was subsequently culled and processed for denoising and artifact removal. We demonstrate, using multivariate statistical analysis, that the cells can be distinguished, using a variety of approaches with accuracy, sensitivity and specificity in excess of 95%.
Abstract: Raman micro-spectroscopy is a powerful technique for the identification and classification of cancer cells and tissues. In recent years, the application of Raman spectroscopy to detect bladder, cervical, and oral cytological samples has been reported to have an accuracy greater than that of standard pathology. However, despite being entirely non-invasive and relatively inexpensive, the slow recording time, and lack of reproducibility have prevented the clinical adoption of the technology. Here, we present an automated Raman cytology system that can facilitate high-throughput screening and improve reproducibility. The proposed system is designed to be integrated directly into the standard pathology clinic, taking into account their methodologies and consumables. The system employs image processing algorithms and integrated hardware/software architectures in order to achieve automation and is tested using the ThinPrep standard, including the use of glass slides, and a number of bladder cancer cell lines. The entire automation process is implemented, using the open source Micro-Manager platform and is made freely available. We believe that this code can be readily integrated into existing commercial Raman micro-spectrometers.
65
Wang J, Zhao Y, Tang LL, Mueller C, Li Q. A resample-replace lasso procedure for combining high-dimensional markers with limit of detection. J Appl Stat 2021; 49:4278-4293. [DOI: 10.1080/02664763.2021.1977785] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
Affiliation(s)
- Jinjuan Wang
- School of Mathematics and Statistics, Beijing Institute of Technology, Beijing, People's Republic of China
- Yunpeng Zhao
- School of Mathematical and Natural Sciences, Arizona State University, Tempe, AZ, USA
- Larry L. Tang
- Department of Statistics and National Center for Forensic Science, University of Central Florida, Orlando, FL, USA
- Department of Statistics, Rehabilitation Medicine Department, NIH Clinical Center, Bethesda, MD, USA
- Qizhai Li
- LSC Academy of Mathematics and Systems Science, Chinese Academy of Sciences and University of Chinese Academy of Sciences, Beijing, People's Republic of China
66
Machine Learning for Light Sensor Calibration. SENSORS 2021; 21:s21186259. [PMID: 34577466 PMCID: PMC8473444 DOI: 10.3390/s21186259] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/22/2021] [Revised: 09/06/2021] [Accepted: 09/09/2021] [Indexed: 11/16/2022]
Abstract
Sunlight incident on the Earth's atmosphere is essential for life, and it is the driving force of a host of photo-chemical and environmental processes, such as the radiative heating of the atmosphere. We report the description and application of a physical methodology relative to how an ensemble of very low-cost sensors (with a total cost of <$20, less than 0.5% of the cost of the reference sensor) can be used to provide wavelength resolved irradiance spectra with a resolution of 1 nm between 360-780 nm by calibrating against a reference sensor using machine learning. These low-cost sensor ensembles are calibrated using machine learning and can effectively reproduce the observations made by an NIST calibrated reference instrument (Konica Minolta CL-500A with a cost of around USD 6000). The correlation coefficient between the reference sensor and the calibrated low-cost sensor ensemble has been optimized to have R² > 0.99. Both the circuits used and the code have been made publicly available. By accurately calibrating the low-cost sensors, we are able to distribute a large number of low-cost sensors in a neighborhood scale area. It provides unprecedented spatial and temporal insights into the micro-scale variability of the wavelength resolved irradiance, which is relevant for air quality, environmental and agronomy applications.
67
Gumaei A, Sammouda R, Al-Rakhami M, AlSalman H, El-Zaart A. Feature selection with ensemble learning for prostate cancer diagnosis from microarray gene expression. Health Informatics J 2021; 27:1460458221989402. [PMID: 33570011 DOI: 10.1177/1460458221989402] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Indexed: 12/15/2022]
Abstract
Cancer diagnosis using machine learning algorithms is one of the main topics of research in computer-based medical science. Prostate cancer is considered one of the reasons that are leading to deaths worldwide. Data analysis of gene expression from microarray using machine learning and soft computing algorithms is a useful tool for detecting prostate cancer in medical diagnosis. Even though traditional machine learning methods have been successfully applied for detecting prostate cancer, the large number of attributes with a small sample size of microarray data is still a challenge that limits their ability for effective medical diagnosis. Selecting a subset of relevant features from all features and choosing an appropriate machine learning method can exploit the information of microarray data to improve the accuracy rate of detection. In this paper, we propose to use a correlation feature selection (CFS) method with random committee (RC) ensemble learning to detect prostate cancer from microarray data of gene expression. A set of experiments are conducted on a public benchmark dataset using 10-fold cross-validation technique to evaluate the proposed approach. The experimental results revealed that the proposed approach attains 95.098% accuracy, which is higher than related work methods on the same dataset.
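The evaluation protocol named in this abstract, k-fold cross-validation, can be sketched in a few lines. The code below only shows how sample indices are partitioned into folds. It does not reproduce the paper's feature selection or classifier, the helper name `k_fold_indices` is hypothetical, and the sample count in the example is arbitrary, not the size of the benchmark dataset.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of k contiguous folds.

    Fold sizes differ by at most one; every sample is tested exactly once.
    """
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# Arbitrary illustration: 23 samples split into 5 folds.
folds = list(k_fold_indices(23, k=5))
```

In practice the indices are usually shuffled, and often stratified by class, before splitting. With a small sample and many attributes, any feature selection step should be fitted on the training folds only so that the held-out fold does not leak into the model.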
Affiliation(s)
- Abdu Gumaei
- Research Chair of Pervasive and Mobile Computing, King Saud University, Saudi Arabia.,Taiz University, Yemen
- Mabrook Al-Rakhami
- Research Chair of Pervasive and Mobile Computing, King Saud University, Saudi Arabia
68
Fischer S, Crow M, Harris BD, Gillis J. Scaling up reproducible research for single-cell transcriptomics using MetaNeighbor. Nat Protoc 2021; 16:4031-4067. [PMID: 34234317 PMCID: PMC8826496 DOI: 10.1038/s41596-021-00575-5] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Received: 10/17/2020] [Accepted: 05/25/2021] [Indexed: 02/06/2023]
Abstract
Single-cell RNA-sequencing data have significantly advanced the characterization of cell-type diversity and composition. However, cell-type definitions vary across data and analysis pipelines, raising concerns about cell-type validity and generalizability. With MetaNeighbor, we proposed an efficient and robust quantification of cell-type replicability that preserves dataset independence and is highly scalable compared to dataset integration. In this protocol, we show how MetaNeighbor can be used to characterize cell-type replicability by following a simple three-step procedure: gene filtering, neighbor voting and visualization. We show how these steps can be tailored to quantify cell-type replicability, determine gene sets that contribute to cell-type identity and pretrain a model on a reference taxonomy to rapidly assess newly generated data. The protocol is based on an open-source R package available from Bioconductor and GitHub, requires basic familiarity with RStudio or the R command line and can typically be run in <5 min for millions of cells.
Affiliation(s)
- Stephan Fischer
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
- Megan Crow
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
- Benjamin D Harris
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
- Watson School of Biological Sciences, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
- Jesse Gillis
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA.
- Watson School of Biological Sciences, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA.
69
Lian H. Learning Rate for Convex Support Tensor Machines. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2021; 32:3755-3760. [PMID: 32833645 DOI: 10.1109/tnnls.2020.3015477] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/11/2023]
Abstract
Tensors are increasingly encountered in prediction problems. We extend previous results for high-dimensional least-squares convex tensor regression to classification problems with a hinge loss and establish its asymptotic statistical properties. Based on a general convex decomposable penalty, the rate depends on both the intrinsic dimension and the Rademacher complexity of the class of linear functions of tensor predictors.
70
A Holistic Performance Comparison for Lung Cancer Classification Using Swarm Intelligence Techniques. JOURNAL OF HEALTHCARE ENGINEERING 2021; 2021:6680424. [PMID: 34373776 PMCID: PMC8349254 DOI: 10.1155/2021/6680424] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 12/14/2020] [Accepted: 07/17/2021] [Indexed: 12/22/2022]
Abstract
In bioinformatics, feature selection for cancer classification is a primary area of research and is used to select the most informative genes from the thousands of genes in a microarray. Microarray data are generally noisy, highly redundant, and extremely asymmetric in dimensionality, as the majority of the genes are believed to be uninformative. The paper adopts a methodology for classifying high-dimensional lung cancer microarray data using feature selection and optimization techniques. The methodology is divided into two stages. In the first stage, each gene is ranked using standard gene selection techniques: Information Gain, the Relief-F test, the Chi-square statistic, and the T-statistic test. The top-scoring genes are gathered to form a new feature subset. In the second stage, the new feature subset is further optimized using swarm intelligence techniques, namely Grasshopper Optimization (GO), Moth Flame Optimization (MFO), Bacterial Foraging Optimization (BFO), Krill Herd Optimization (KHO), and Artificial Fish Swarm Optimization (AFSO), and the resulting optimized subset is used. The selected genes are used for classification with the Naïve Bayesian Classifier (NBC), Decision Trees (DT), Support Vector Machines (SVM), and K-Nearest Neighbour (KNN). The best results are obtained when the Relief-F test is combined with AFSO and a Decision Trees classifier for one hundred genes, giving the highest classification accuracy of 99.10%.
71
Racedo S, Portnoy I, Vélez JI, San-Juan-Vergara H, Sanjuan M, Zurek E. A new pipeline for structural characterization and classification of RNA-Seq microbiome data. BioData Min 2021; 14:31. [PMID: 34243809 PMCID: PMC8268467 DOI: 10.1186/s13040-021-00266-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/04/2020] [Accepted: 06/16/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND High-throughput sequencing enables the analysis of the composition of numerous biological systems, such as microbial communities. The identification of dependencies within these systems requires the analysis and assimilation of the underlying interaction patterns between all the variables that make up that system. However, this task poses a challenge when considering the compositional nature of the data coming from DNA-sequencing experiments because traditional interaction metrics (e.g., correlation) produce unreliable results when analyzing relative fractions instead of absolute abundances. The compositionality-associated challenges extend to the classification task, as it usually involves the characterization of the interactions between the principal descriptive variables of the datasets. The classification of new samples/patients into binary categories corresponding to dissimilar biological settings or phenotypes (e.g., control and cases) could help researchers in the development of treatments/drugs. RESULTS Here, we develop and exemplify a new approach, applicable to compositional data, for the classification of new samples into two groups with different biological settings. We propose a new metric to characterize and quantify the overall correlation structure deviation between these groups and a technique for dimensionality reduction to facilitate graphical representation. We conduct simulation experiments with synthetic data to assess the proposed method's classification accuracy. Moreover, we illustrate the performance of the proposed approach using Operational Taxonomic Unit (OTU) count tables obtained through 16S rRNA gene sequencing data from two microbiota experiments. We also compare our method's performance with that of two state-of-the-art methods. CONCLUSIONS Simulation experiments show that our method achieves a classification accuracy equal to or greater than 98% when using synthetic data. Finally, our method outperforms the other classification methods with real datasets from gene sequencing experiments.
Affiliation(s)
- Ivan Portnoy
- Universidad del Norte, Barranquilla, Colombia.
- Productivity and Innovation Department, Universidad de la Costa, Calle 58 # 55-56, Barranquilla, Colombia.
72
Nakagawa T, Watanabe H, Hyodo M. Kick-one-out-based variable selection method for Euclidean distance-based classifier in high-dimensional settings. J MULTIVARIATE ANAL 2021. [DOI: 10.1016/j.jmva.2021.104756] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/16/2022]
73
Song Q, Su J, Zhang W. scGCN is a graph convolutional networks algorithm for knowledge transfer in single cell omics. Nat Commun 2021; 12:3826. [PMID: 34158507 PMCID: PMC8219725 DOI: 10.1038/s41467-021-24172-y] [Citation(s) in RCA: 47] [Impact Index Per Article: 11.8] [Received: 09/13/2020] [Accepted: 06/07/2021] [Indexed: 12/20/2022] Open
Abstract
Single-cell omics is the fastest-growing type of genomics data in the literature and public genomics repositories. Leveraging the growing repository of labeled datasets and transferring labels from existing datasets to newly generated datasets will empower the exploration of single-cell omics data. However, the current label transfer methods have limited performance, largely due to the intrinsic heterogeneity among cell populations and extrinsic differences between datasets. Here, we present a robust graph artificial intelligence model, single-cell Graph Convolutional Network (scGCN), to achieve effective knowledge transfer across disparate datasets. Through benchmarking with other label transfer methods on a total of 30 single-cell omics datasets, scGCN consistently demonstrates superior accuracy in leveraging cells from different tissues, platforms, and species, as well as cells profiled at different molecular layers. scGCN is implemented as an integrated workflow in Python, which is available at https://github.com/QSong-github/scGCN.
Affiliation(s)
- Qianqian Song
- Center for Cancer Genomics and Precision Oncology, Wake Forest Baptist Comprehensive Cancer Center, Wake Forest Baptist Medical Center, Winston Salem, NC, USA
- Department of Cancer Biology, Wake Forest School of Medicine, Winston Salem, NC, USA
- Jing Su
- Department of Biostatistics and Health Data Science, Indiana University School of Medicine, Indianapolis, IN, USA.
- Section on Gerontology and Geriatric Medicine, Department of Internal Medicine, Wake Forest School of Medicine, Winston-Salem, NC, USA.
- Wei Zhang
- Center for Cancer Genomics and Precision Oncology, Wake Forest Baptist Comprehensive Cancer Center, Wake Forest Baptist Medical Center, Winston Salem, NC, USA.
- Department of Cancer Biology, Wake Forest School of Medicine, Winston Salem, NC, USA.
74
Common features of aging fail to occur in Drosophila raised without a bacterial microbiome. iScience 2021; 24:102703. [PMID: 34235409 PMCID: PMC8246586 DOI: 10.1016/j.isci.2021.102703] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 01/29/2021] [Revised: 04/30/2021] [Accepted: 06/07/2021] [Indexed: 02/07/2023] Open
Abstract
Lifespan is limited both by intrinsic decline in vigor with age and by accumulation of external insults. There exists a general picture of the deficits of aging, one that is reflected in a pattern of age-correlated changes in gene expression conserved across species. Here, however, by comparing gene expression profiling of Drosophila raised either conventionally, or free of bacteria, we show that ∼70% of these conserved, age-associated changes in gene expression fail to occur in germ-free flies. Among the processes that fail to show time-dependent change under germ-free conditions are two aging features that are observed across phylogeny, declining expression of stress response genes and increasing expression of innate immune genes. These comprise adaptive strategies the organism uses to respond to bacteria, rather than being inevitable components of age-dependent decline. Changes in other processes are independent of the microbiome and can serve as autonomous markers of aging of the individual.
75
OSN: Onion-ring support neighbors for correspondence selection. Inf Sci (N Y) 2021. [DOI: 10.1016/j.ins.2021.01.042] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
76
Abdlaty R, Abbass MA, Awadallah AM. High Precision Monitoring of Radiofrequency Ablation for Liver Using Hyperspectral Imaging. Ann Biomed Eng 2021; 49:2430-2440. [PMID: 34075450 DOI: 10.1007/s10439-021-02797-w] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 03/14/2021] [Accepted: 05/17/2021] [Indexed: 02/03/2023]
Abstract
Minimally invasive procedures are yielding better satisfaction in the treatment of liver cancers. Energy-based techniques have been studied as prospective alternatives to the gold standard of liver transplantation. Among these techniques, radiofrequency (RF) has been investigated for the selective ablation of liver tissue. In addition to optimizing the RF settings to avoid tissue perforation or inadequate ablation, an instrument that collects quantitative data on the intraoperative tissue status can aid the treatment procedure. This study demonstrates an innovative noninvasive technique using hyperspectral imaging (HSI) for monitoring RF ablative therapy in ex vivo liver tissue. The data cube generated by HSI provides spectral as well as spatial properties of the liver tissue included in each pixel of the field of view. In our study, the applied statistical analysis reduces the computational burden of multivariate analysis techniques. For this purpose, the spectral angle mapper, a logistic regression algorithm, and principal component analysis were applied. Of all spectral bands captured by the HSI camera, the bands centered at 760 and 960 nm were identified for predicting the ablated area. Based on statistical analysis, the threshold for predicting the ablated area of the liver samples was determined, provided that the specificity is kept at 90%.
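Of the techniques listed in the abstract above, the spectral angle mapper has a compact definition: the angle between a pixel spectrum and a reference spectrum, each treated as a vector. The sketch below shows that computation with a simple threshold rule. The three-band reflectance values, the threshold of 0.1 rad, and the function names are hypothetical and chosen only for illustration; they are not taken from the study.

```python
import math

def spectral_angle(pixel, reference):
    """Angle in radians between two spectra treated as vectors."""
    dot = sum(p * r for p, r in zip(pixel, reference))
    norm_p = math.sqrt(sum(p * p for p in pixel))
    norm_r = math.sqrt(sum(r * r for r in reference))
    if norm_p == 0 or norm_r == 0:
        raise ValueError("spectra must be non-zero")
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    cos_angle = max(-1.0, min(1.0, dot / (norm_p * norm_r)))
    return math.acos(cos_angle)

def classify_pixels(pixels, reference, threshold):
    """Flag pixels whose spectral angle to the reference is below threshold."""
    return [spectral_angle(p, reference) < threshold for p in pixels]

# Hypothetical 3-band reflectance values, for illustration only.
reference_ablated = [0.2, 0.4, 0.6]
pixels = [
    [0.1, 0.2, 0.3],  # same spectral shape as the reference, half the brightness
    [0.6, 0.4, 0.2],  # reversed spectral shape
]
flags = classify_pixels(pixels, reference_ablated, threshold=0.1)
```

Because the angle ignores vector length, the first pixel matches the reference despite being dimmer, which is the usual motivation for this measure when illumination varies across the field of view.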
Affiliation(s)
- Ramy Abdlaty
- Department of Biomedical Engineering, Military Technical College, Cairo, Egypt.
- Mohamed A Abbass
- Department of Biomedical Engineering, Military Technical College, Cairo, Egypt
- Ahmed M Awadallah
- Department of Biomedical Engineering, Military Technical College, Cairo, Egypt
77
Zhou Y, Zhang L, Xu J, Zhang J, Yan X. Category encoding method to select feature genes for the classification of bulk and single-cell RNA-seq data. Stat Med 2021; 40:4077-4089. [PMID: 34028849 DOI: 10.1002/sim.9015] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/27/2020] [Revised: 02/26/2021] [Accepted: 04/13/2021] [Indexed: 11/08/2022]
Abstract
Bulk and single-cell RNA-seq (scRNA-seq) data are being used as alternatives to traditional technology in biology and medicine research. These data are used, for example, for the detection of differentially expressed (DE) genes. Several statistical methods have been developed for the classification of bulk and single-cell RNA-seq data, and feature genes are vitally important for this classification. The majority of genes are not DE and are thus irrelevant for class distinction. To improve classification performance and save computation time, irrelevant genes must be removed, which also aids the detection of the important feature genes. Widely used schemes in the literature, such as the BSS/WSS (BW) method, assume that data are normally distributed and may not be suitable for bulk and single-cell RNA-seq data. In this article, a category encoding (CAEN) method is proposed to select feature genes for bulk and single-cell RNA-seq data classification. This novel method encodes categories by employing the rank of sequence samples for each gene in each class. Correlation coefficients between gene and class are computed from the rank of the samples and a new rank of the category. The genes with the highest correlation coefficients are taken as feature genes, which are the most effective for classifying bulk and single-cell RNA-seq datasets. The sure screening method was also established for rank consistency properties of the proposed CAEN method. Simulation studies show that the classifier using the proposed CAEN method performs better than, or at least as well as, the existing methods in most settings. Existing real datasets were analyzed, with the results demonstrating superior performance of the proposed method over current competitors. The application has been coded into an R package named "CAEN" to facilitate wide use.
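The BSS/WSS (BW) baseline mentioned in the abstract above scores each gene by the ratio of its between-class to within-class sum of squares and keeps the highest-scoring genes. The sketch below illustrates that baseline only, not the CAEN method itself. The gene names, expression values, and helper names are hypothetical.

```python
def bw_ratio(values, labels):
    """Ratio of between-class to within-class sum of squares for one gene."""
    overall_mean = sum(values) / len(values)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    bss = 0.0
    wss = 0.0
    for members in groups.values():
        class_mean = sum(members) / len(members)
        bss += len(members) * (class_mean - overall_mean) ** 2
        wss += sum((v - class_mean) ** 2 for v in members)
    if wss == 0:
        # No within-class spread: perfectly separating or completely flat gene.
        return float("inf") if bss > 0 else 0.0
    return bss / wss

def top_genes(expression, labels, n_keep):
    """Return the names of the n_keep genes with the largest BW ratio."""
    scores = {gene: bw_ratio(vals, labels) for gene, vals in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n_keep]

# Hypothetical expression values for four samples in two classes.
expression = {
    "geneA": [1.0, 2.0, 9.0, 10.0],  # class means differ strongly
    "geneB": [5.0, 6.0, 5.0, 6.0],   # class means are identical
}
labels = [0, 0, 1, 1]
selected = top_genes(expression, labels, n_keep=1)
```

The ratio compares squared deviations from means, which is the sense in which the abstract describes it as relying on a normality assumption that count-based RNA-seq data may not satisfy.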
Affiliation(s)
- Yan Zhou
- Shenzhen Key Laboratory of Advanced Machine Learning and Applications, Institute of Statistical Sciences, College of Mathematics and Statistics, Shenzhen University, Shenzhen, China
- Li Zhang
- Shenzhen Key Laboratory of Advanced Machine Learning and Applications, Institute of Statistical Sciences, College of Mathematics and Statistics, Shenzhen University, Shenzhen, China
- Jinfeng Xu
- Department of Mathematics, Hong Kong University, Pokfulam, Hong Kong
- Jun Zhang
- Shenzhen Key Laboratory of Advanced Machine Learning and Applications, Institute of Statistical Sciences, College of Mathematics and Statistics, Shenzhen University, Shenzhen, China
- Xiaodong Yan
- Zhongtai Securities Institute for Financial Studies, Shandong University, Jinan, China
78
Arbet J, Zhuang Y, Litkowski E, Saba L, Kechris K. Comparing Statistical Tests for Differential Network Analysis of Gene Modules. Front Genet 2021; 12:630215. [PMID: 34093641 PMCID: PMC8170128 DOI: 10.3389/fgene.2021.630215] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 11/17/2020] [Accepted: 04/19/2021] [Indexed: 11/13/2022] Open
Abstract
Genes often work together to perform complex biological processes, and "networks" provide a versatile framework for representing the interactions between multiple genes. Differential network analysis (DiNA) quantifies how this network structure differs between two or more groups/phenotypes (e.g., disease subjects and healthy controls), with the goal of determining whether differences in network structure can help explain differences between phenotypes. In this paper, we focus on gene co-expression networks, although in principle, the methods studied can be used for DiNA for other types of features (e.g., metabolome, epigenome, microbiome, proteome, etc.). Three common applications of DiNA involve (1) testing whether the connections to a single gene differ between groups, (2) testing whether the connection between a pair of genes differs between groups, or (3) testing whether the connections within a "module" (a subset of 3 or more genes) differ between groups. This article focuses on the latter, as there is a lack of studies comparing statistical methods for identifying differentially co-expressed modules (DCMs). Through extensive simulations, we compare several previously proposed test statistics and a new p-norm difference test (PND). We demonstrate that the true positive rate of the proposed PND test is competitive with and often higher than the other methods, while controlling the false positive rate. The R package discoMod (differentially co-expressed modules) implements the proposed method and provides a full pipeline for identifying DCMs: clustering tools to derive gene modules, tests to identify DCMs, and methods for visualizing the results.
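The abstract above does not give the formula for the p-norm difference test. One plausible reading, shown here purely as an assumption, is to take the p-norm of the differences between the two groups' pairwise gene correlations within a module and to calibrate it by permuting group labels. The sketch below is in Python rather than R, and it is not the discoMod implementation; the function names, the toy module, and the permutation settings are all hypothetical.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    denom = math.sqrt(sxx * syy)
    return 0.0 if denom == 0 else sxy / denom

def pnorm_difference(group1, group2, p=2):
    """p-norm of pairwise correlation differences between two groups.

    Each group is a list of samples; each sample is a list of gene values.
    """
    n_genes = len(group1[0])
    total = 0.0
    for i in range(n_genes):
        for j in range(i + 1, n_genes):
            r1 = pearson([s[i] for s in group1], [s[j] for s in group1])
            r2 = pearson([s[i] for s in group2], [s[j] for s in group2])
            total += abs(r1 - r2) ** p
    return total ** (1.0 / p)

def permutation_pvalue(group1, group2, p=2, n_perm=200, seed=0):
    """Permutation p-value: how often shuffled labels give a larger statistic."""
    rng = random.Random(seed)
    observed = pnorm_difference(group1, group2, p)
    pooled = group1 + group2
    n1 = len(group1)
    exceed = 0
    for _ in range(n_perm):
        shuffled = pooled[:]
        rng.shuffle(shuffled)
        if pnorm_difference(shuffled[:n1], shuffled[n1:], p) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return (exceed + 1) / (n_perm + 1)

# Hypothetical 3-gene module: genes 0 and 1 are positively correlated in
# group1 and negatively correlated in group2.
group1 = [[1, 1, 3], [2, 2, 1], [3, 3, 2], [4, 4, 5], [5, 5, 4]]
group2 = [[1, 5, 3], [2, 4, 1], [3, 3, 2], [4, 2, 5], [5, 1, 4]]
statistic = pnorm_difference(group1, group2, p=2)
p_value = permutation_pvalue(group1, group2, p=2, n_perm=200, seed=0)
```

The choice of p controls how much weight a few large correlation changes receive relative to many small ones; the value p=2 here is an arbitrary default for the example.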
Affiliation(s)
- Jaron Arbet
- Department of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Anschutz Medical Campus, Aurora, CO, United States
- Yaxu Zhuang
- Department of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Anschutz Medical Campus, Aurora, CO, United States
- Elizabeth Litkowski
- Department of Epidemiology, Colorado School of Public Health, University of Colorado Anschutz Medical Campus, Aurora, CO, United States
- Laura Saba
- Department of Pharmaceutical Sciences, Skaggs School of Pharmacy and Pharmaceutical Sciences, University of Colorado Anschutz Medical Campus, Aurora CO, United States
- Katerina Kechris
- Department of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Anschutz Medical Campus, Aurora, CO, United States
79
Nakayama Y. Robust support vector machine for high-dimensional imbalanced data. COMMUN STAT-SIMUL C 2021. [DOI: 10.1080/03610918.2019.1586922] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/27/2022]
Affiliation(s)
- Yugo Nakayama
- Graduate School of Pure and Applied Sciences, University of Tsukuba, Tsukuba-shi, Ibaraki, Japan
80
Auwul MR, Rahman MR, Gov E, Shahjaman M, Moni MA. Bioinformatics and machine learning approach identifies potential drug targets and pathways in COVID-19. Brief Bioinform 2021; 22:6220170. [PMID: 33839760 PMCID: PMC8083354 DOI: 10.1093/bib/bbab120] [Citation(s) in RCA: 58] [Impact Index Per Article: 14.5] [Received: 12/29/2020] [Revised: 02/15/2021] [Accepted: 03/13/2021] [Indexed: 12/12/2022] Open
Abstract
The current coronavirus disease-2019 (COVID-19) pandemic has caused massive loss of life. Clinical trials of vaccines and drugs are currently being conducted around the world; however, to date no effective drug is available for COVID-19. Identification of key genes and perturbed pathways in COVID-19 may uncover potential drug targets and biomarkers. We aimed to identify key gene modules and hub targets involved in COVID-19. We analyzed SARS-CoV-2 infected peripheral blood mononuclear cell (PBMC) transcriptomic data through gene coexpression analysis. We identified 1520 and 1733 differentially expressed genes (DEGs) from the GSE152418 and CRA002390 PBMC datasets, respectively (FDR < 0.05). We found four key gene modules and a hub gene signature based on module membership (MMhub) statistics and protein-protein interaction (PPI) networks (PPIhub). Functional annotation by enrichment analysis of the genes of these modules demonstrated immune and inflammatory response biological processes enriched by the DEGs. The pathway analysis revealed that the hub genes were enriched in the IL-17 signaling pathway and cytokine-cytokine receptor interaction pathways. Then, we demonstrated the classification performance of the hub genes (PLK1, AURKB, AURKA, CDK1, CDC20, KIF11, CCNB1, KIF2C, DTL and CDC6) with accuracy >0.90, suggesting the biomarker potential of the hub genes. The regulatory network analysis showed transcription factors and microRNAs that target these hub genes. Finally, drug-gene interaction analysis suggests amsacrine, BRD-K68548958, naproxol, palbociclib and teniposide as the top-scored repurposed drugs. The identified biomarkers and pathways might be therapeutic targets for COVID-19.
Affiliation(s)
- Md Rabiul Auwul
- School of Economics and Statistics, Guangzhou University, Guangzhou 510006, China
- Md Rezanur Rahman
- Department of Biochemistry and Biotechnology, School of Biomedical Science, Khwaja Yunus Ali University, Sirajgonj-6751, Bangladesh
- Esra Gov
- Department of Bioengineering, Adana Alparslan Turkes Science and Technology University, Adana-01250, Turkey
- Md Shahjaman
- Department of Statistics, Begum Rokeya University, Rangpur-5400, Bangladesh
- Mohammad Ali Moni
- WHO Collaborating Centre on eHealth, UNSW Digital Health, School of Public Health and Community Medicine, Faculty of Medicine, UNSW Sydney, Australia
81
Doo W, Kim H. Bayesian variable selection in clustering high-dimensional data via a mixture of finite mixtures. J STAT COMPUT SIM 2021. [DOI: 10.1080/00949655.2021.1902526] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/21/2022]
Affiliation(s)
- Woojin Doo
- Department of Industrial and Systems Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Republic of Korea
- Heeyoung Kim
- Department of Industrial and Systems Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Republic of Korea
82
Hameed SS, Hassan WH, Latiff LA, Muhammadsharif FF. A comparative study of nature-inspired metaheuristic algorithms using a three-phase hybrid approach for gene selection and classification in high-dimensional cancer datasets. Soft comput 2021. [DOI: 10.1007/s00500-021-05726-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 12/25/2022]
83
Albalawi R, Yeap TH, Benyoucef M. Using Topic Modeling Methods for Short-Text Data: A Comparative Analysis. Front Artif Intell 2021; 3:42. [PMID: 33733159 PMCID: PMC7861298 DOI: 10.3389/frai.2020.00042] [Citation(s) in RCA: 56] [Impact Index Per Article: 14.0] [Received: 02/28/2020] [Accepted: 05/14/2020] [Indexed: 11/21/2022] Open
Abstract
With the growth of online social network platforms and applications, large amounts of textual user-generated content are created daily in the form of comments, reviews, and short-text messages. As a result, users often find it challenging to discover useful information or learn more about the topic being discussed from such content. Machine learning and natural language processing algorithms are used to analyze the massive amount of textual social media data available online, including topic modeling techniques that have gained popularity in recent years. This paper investigates the topic modeling subject and its common application areas, methods, and tools. We also examine and compare five frequently used topic modeling methods, as applied to short textual social data, to show their practical benefits in detecting important topics. These methods are latent semantic analysis, latent Dirichlet allocation, non-negative matrix factorization, random projection, and principal component analysis. Two textual datasets were selected to evaluate the performance of the included topic modeling methods based on topic quality and some standard statistical evaluation metrics, such as recall, precision, F-score, and topic coherence. As a result, the latent Dirichlet allocation and non-negative matrix factorization methods delivered more meaningful extracted topics and obtained good results. The paper sheds light on some common topic modeling methods in a short-text context and provides direction for researchers who seek to apply these methods.
Affiliation(s)
- Rania Albalawi
- School of Information Technology and Engineering, University of Ottawa, Ottawa, ON, Canada
- Tet Hin Yeap
- School of Information Technology and Engineering, University of Ottawa, Ottawa, ON, Canada
- Morad Benyoucef
- Telfer School of Management, University of Ottawa, Ottawa, ON, Canada
84
Wang J, Zhao Y, Tang LL. Estimating the AUC with a Graphical Lasso Method for High-dimensional Biomarkers with LOD. BIOSTATISTICS & EPIDEMIOLOGY 2021; 5:189-206. [PMID: 35415380 PMCID: PMC9000202 DOI: 10.1080/24709360.2021.1898731] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2020] [Accepted: 10/13/2020] [Indexed: 06/14/2023]
Abstract
This manuscript estimates the area under the receiver operating characteristic curve (AUC) of combined biomarkers in a high-dimensional setting. We propose a penalization approach to the inference of precision matrices in the presence of the limit of detection (LOD). A new version of the expectation-maximization algorithm is then proposed for the penalized likelihood, with the use of numerical integration and the graphical lasso method. The estimated precision matrix is then applied to the inference of AUCs. The proposed method outperforms the existing methods in numerical studies. We apply the proposed method to a data set from a brain tumor study. The results show higher accuracy in the estimation of the AUC compared with the existing methods.
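For orientation on the quantity being estimated in the abstract above, the sketch below computes the plain empirical (Mann-Whitney) AUC of a linear combination of biomarkers. It is a generic baseline and not the penalized, LOD-aware estimator the manuscript proposes: it ignores the limit of detection and uses fixed weights rather than weights derived from an estimated precision matrix. The biomarker values, weights, and function names are hypothetical.

```python
def combine(biomarkers, weights):
    """Linear combination of one subject's biomarker values."""
    return sum(w * b for w, b in zip(weights, biomarkers))

def empirical_auc(case_scores, control_scores):
    """Mann-Whitney estimate: P(case > control) + 0.5 * P(tie)."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))

# Hypothetical two-biomarker measurements, for illustration only.
cases = [[2.0, 1.0], [3.0, 2.0], [1.0, 3.0]]
controls = [[1.0, 1.0], [2.0, 1.5], [0.5, 2.0]]
weights = [1.0, 1.0]  # assumed equal weights

case_scores = [combine(subject, weights) for subject in cases]
control_scores = [combine(subject, weights) for subject in controls]
auc = empirical_auc(case_scores, control_scores)
```

With these toy values, eight of the nine case-control pairs are ordered correctly, so the estimate is 8/9.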
Affiliation(s)
- Jirui Wang
- Department of Statistics, George Mason University
85
Wu YT, Zhang CJ, Mol BW, Kawai A, Li C, Chen L, Wang Y, Sheng JZ, Fan JX, Shi Y, Huang HF. Early Prediction of Gestational Diabetes Mellitus in the Chinese Population via Advanced Machine Learning. J Clin Endocrinol Metab 2021; 106:e1191-e1205. [PMID: 33351102 PMCID: PMC7947802 DOI: 10.1210/clinem/dgaa899] [Citation(s) in RCA: 81] [Impact Index Per Article: 20.3] [Received: 07/18/2020] [Indexed: 12/29/2022]
Abstract
CONTEXT Accurate methods for early prediction of gestational diabetes mellitus (GDM) (during the first trimester of pregnancy) in Chinese and other populations are lacking. OBJECTIVES This work aimed to establish effective models to predict early GDM. METHODS Pregnancy data for 73 variables during the first trimester were extracted from the electronic medical record system. Based on a machine learning (ML)-driven feature selection method, 17 variables were selected for early GDM prediction. To facilitate clinical application, 7 variables were selected from the 17-variable panel. Advanced ML approaches were then employed using the 7-variable data set and the 73-variable data set to build models predicting early GDM for different situations, respectively. RESULTS A total of 16 819 and 14 992 cases were included in the training and testing sets, respectively. Using 73 variables, the deep neural network model achieved high discriminative power, with an area under the curve (AUC) value of 0.80. The 7-variable logistic regression (LR) model also achieved effective discriminative power (AUC = 0.77). Low body mass index (BMI) (≤ 17) was related to an increased risk of GDM, compared to a BMI in the range of 17 to 18 (minimum risk interval) (11.8% vs 8.7%, P = .09). Total 3,3,5'-triiodothyronine (T3) and total thyroxine (T4) were superior to free T3 and free T4 in predicting GDM. Lipoprotein(a) demonstrated promising predictive value (AUC = 0.66). CONCLUSIONS We employed ML models that achieved high accuracy in predicting GDM in early pregnancy. A clinically cost-effective 7-variable LR model was simultaneously developed. The relationship of GDM with thyroxine and BMI was investigated in the Chinese population.
Affiliation(s)
- Yan-Ting Wu
- International Peace Maternity and Child Health Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China
- Shanghai Key Laboratory of Embryo Original Diseases, Shanghai, China
- Research Units of Embryo Original Diseases, Chinese Academy of Medical Sciences, Shanghai, China
| | - Chen-Jie Zhang
- International Peace Maternity and Child Health Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China
- Shanghai Key Laboratory of Embryo Original Diseases, Shanghai, China
- Research Units of Embryo Original Diseases, Chinese Academy of Medical Sciences, Shanghai, China
| | - Ben Willem Mol
- Department of Obstetrics and Gynecology, Monash University, Clayton, Australia
| | - Andrew Kawai
- Department of Obstetrics and Gynecology, Monash University, Clayton, Australia
| | - Cheng Li
- International Peace Maternity and Child Health Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China
- Shanghai Key Laboratory of Embryo Original Diseases, Shanghai, China
- Research Units of Embryo Original Diseases, Chinese Academy of Medical Sciences, Shanghai, China
| | - Lei Chen
- International Peace Maternity and Child Health Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China
| | - Yu Wang
- International Peace Maternity and Child Health Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China
- Shanghai Key Laboratory of Embryo Original Diseases, Shanghai, China
- Research Units of Embryo Original Diseases, Chinese Academy of Medical Sciences, Shanghai, China
| | - Jian-Zhong Sheng
- Department of Pathology and Pathophysiology, School of Medicine, Zhejiang University, Zhejiang, China
| | - Jian-Xia Fan
- International Peace Maternity and Child Health Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China
- Shanghai Key Laboratory of Embryo Original Diseases, Shanghai, China
- Research Units of Embryo Original Diseases, Chinese Academy of Medical Sciences, Shanghai, China
| | - Yi Shi
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Shanghai Jiao Tong University, Shanghai, China
- Correspondence: He-Feng Huang, MD, International Peace Maternity and Child Health Hospital, School of Medicine, Shanghai Jiao Tong University, No. 910, Hengshan Rd, Shanghai, 200030, China; or Yi Shi, PhD, Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Shanghai Jiao Tong University, 1954 Huashan Rd, Shanghai 200030, China.
| | - He-Feng Huang
- International Peace Maternity and Child Health Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China
- Shanghai Key Laboratory of Embryo Original Diseases, Shanghai, China
- Research Units of Embryo Original Diseases, Chinese Academy of Medical Sciences, Shanghai, China
|
86
|
Zhu J, Yuan Z, Shu L, Liao W, Zhao M, Zhou Y. Selecting Classification Methods for Small Samples of Next-Generation Sequencing Data. Front Genet 2021; 12:642227. [PMID: 33747051 PMCID: PMC7969809 DOI: 10.3389/fgene.2021.642227] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/15/2020] [Accepted: 02/01/2021] [Indexed: 12/13/2022] Open
Abstract
Next-generation sequencing has emerged as an essential technology for the quantitative analysis of gene expression. In medical research, RNA sequencing (RNA-seq) data are commonly used to identify which type of disease a patient has. Because of the discrete nature of RNA-seq data, the existing statistical methods that have been developed for microarray data cannot be directly applied to RNA-seq data. Existing statistical methods usually model RNA-seq data by a discrete distribution, such as the Poisson, the negative binomial, or the mixture distribution with a point mass at zero and a Poisson distribution to further allow for data with an excess of zeros. Consequently, analytic tools corresponding to the above three discrete distributions have been developed: Poisson linear discriminant analysis (PLDA), negative binomial linear discriminant analysis (NBLDA), and zero-inflated Poisson logistic discriminant analysis (ZIPLDA). However, it is unclear what the real distributions would be for these classifications when applied to a new and real dataset. Considering that count datasets are frequently characterized by excess zeros and overdispersion, this paper extends the existing distribution to a mixture distribution with a point mass at zero and a negative binomial distribution and proposes a zero-inflated negative binomial logistic discriminant analysis (ZINBLDA) for classification. More importantly, we compare the above four classification methods from the perspective of model parameters, as an understanding of parameters is necessary for selecting the optimal method for RNA-seq data. Furthermore, we determine that the above four methods could transform into each other in some cases. Using simulation studies, we compare and evaluate the performance of these classification methods in a wide range of settings, and we also present a decision tree model created to help us select the optimal classifier for a new RNA-seq dataset. 
The results on the two real datasets agree with the theory and the simulation results. The methods used in this work are implemented in open-source R scripts, with source code freely available at https://github.com/FocusPaka/ZINBLDA.
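The mixture described in this abstract, a point mass at zero combined with a negative binomial distribution, can be written down directly. The sketch below assumes the common mean/dispersion parameterization (variance = mu + phi * mu^2); the parameter names `mu`, `phi`, and `pi_zero` are illustrative and may not match the notation of the paper or its R scripts.

```python
import math


def nb_pmf(y, mu, phi):
    """Negative binomial pmf with mean mu > 0 and dispersion phi > 0."""
    r = 1.0 / phi  # "size" parameter
    log_p = (
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + r * math.log(r / (r + mu))
        + y * math.log(mu / (r + mu))
    )
    return math.exp(log_p)


def zinb_pmf(y, mu, phi, pi_zero):
    """Zero-inflated NB: extra mass pi_zero at zero, NB with weight 1 - pi_zero."""
    base = (1.0 - pi_zero) * nb_pmf(y, mu, phi)
    return pi_zero + base if y == 0 else base


# Illustrative parameter values, not estimates from any dataset.
p_zero = zinb_pmf(0, mu=3.0, phi=0.5, pi_zero=0.2)
total = sum(zinb_pmf(y, 3.0, 0.5, 0.2) for y in range(2000))
print(p_zero, total)
```

Under this parameterization, setting `pi_zero` to 0 recovers the plain negative binomial, and letting `phi` shrink toward 0 approaches the (zero-inflated) Poisson, which is one way to see how the four models can reduce to one another.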
Affiliation(s)
- Jiadi Zhu
- Department of Mathematics and Statistics, Xidian University, Xi'an, China
| | - Ziyang Yuan
- Shenzhen Key Laboratory of Advanced Machine Learning and Applications, College of Mathematics and Statistics, Institute of Statistical Sciences, Shenzhen University, Shenzhen, China
| | - Lianjie Shu
- Faculty of Business Administration, University of Macau, Macau, China
| | - Wenhui Liao
- GuangDong University of Finance, Guangzhou, China
| | - Mingtao Zhao
- Institute of Statistics and Applied Mathematics, Anhui University of Finance and Economics, Bengbu, China
| | - Yan Zhou
- Shenzhen Key Laboratory of Advanced Machine Learning and Applications, College of Mathematics and Statistics, Institute of Statistical Sciences, Shenzhen University, Shenzhen, China
|
87
|
Das J, Barman Mandal S. Classification of Homo sapiens gene behavior using linear discriminant analysis fused with minimum entropy mapping. Med Biol Eng Comput 2021; 59:673-691. [PMID: 33595791 DOI: 10.1007/s11517-021-02324-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/02/2020] [Accepted: 01/18/2021] [Indexed: 11/25/2022]
Abstract
Classification of Homo sapiens gene behavior employing computational biology is a recent research trend. However, monitoring gene activity profiles and genetic behavior from the alphabetic DNA sequence using a non-invasive method is a tremendous challenge in functional genomics. The present paper addresses this issue and attempts to differentiate Homo sapiens genes using the linear discriminant analysis (LDA) method. Annotated protein coding sequences of Homo sapiens genes, collected from NCBI, are taken as test samples. A minimum entropy-based mapping (MEM) technique helps to extract the most information from the numerical DNA sequences. The proposed LDA technique has successfully classified Homo sapiens genes based on the following features: composition of hydrophilic amino acids, dominance of the amino acid arginine, and magnitude and size of individual amino acids. The proposed algorithm is successfully tested on 84 Homo sapiens healthy and cancer genes of the prostate and breast cells. Classification performance of the proposed LDA technique is judged by sensitivity (89.12%), specificity (91.9%), accuracy (90.87%), F1 score (92.03%), Matthews correlation coefficient (81.04%), and miss rate (9.12%), and it outperforms four other existing classifiers. The results are cross-validated through the Rayleigh PDF and a mutual information technique. The Fisher test, 2-sample t-test, and relative entropy test are used to verify the efficacy of the present classifier.
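For readers unfamiliar with the performance measures listed in this abstract, the sketch below computes them from a single 2x2 confusion matrix using the standard textbook definitions. The counts are hypothetical and the function name `binary_metrics` is an assumption; how the authors aggregated their reported figures is not stated in the abstract.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Standard metrics from one confusion matrix (all margins assumed nonzero)."""
    sensitivity = tp / (tp + fn)              # true positive rate
    specificity = tn / (tn + fp)              # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)          # harmonic mean of precision and recall
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    miss_rate = fn / (fn + tp)                # false negative rate
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
        "mcc": mcc,
        "miss_rate": miss_rate,
    }


# Hypothetical counts, not taken from the paper.
metrics = binary_metrics(tp=45, fn=5, tn=40, fp=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```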
Affiliation(s)
- Joyshri Das
- Institute of Radio Physics & Electronics, University of Calcutta, Kolkata, India
| | - Soma Barman Mandal
- Institute of Radio Physics & Electronics, University of Calcutta, Kolkata, India
|
88
|
Risso D, Pagnotta SM. Per-sample standardization and asymmetric winsorization lead to accurate clustering of RNA-seq expression profiles. Bioinformatics 2021; 37:2356-2364. [PMID: 33560368 PMCID: PMC8388024 DOI: 10.1093/bioinformatics/btab091] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 06/08/2020] [Revised: 01/27/2021] [Accepted: 02/05/2021] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Data transformations are an important step in the analysis of RNA-seq data. Nonetheless, the impact of transformation on the outcome of unsupervised clustering procedures is still unclear. RESULTS Here, we present an Asymmetric Winsorization per Sample Transformation (AWST), which is robust to data perturbations and removes the need for selecting the most informative genes prior to sample clustering. Our procedure leads to robust and biologically meaningful clusters both in bulk and in single-cell applications. AVAILABILITY The AWST method is available at https://github.com/drisso/awst. The code to reproduce the analyses is available at https://github.com/drisso/awst_analysis. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Davide Risso
- Dept. of Statistical Sciences, Università degli Studi di Padova, Padova, Italy
|
89
|
Development of a Safety Management System Tracking the Weight of Heavy Objects Carried by Construction Workers Using FSR Sensors. APPLIED SCIENCES-BASEL 2021. [DOI: 10.3390/app11041378] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/16/2022]
Abstract
Carrying heavy objects that exceed a certain weight has been identified as a major factor that puts a physical burden on the musculoskeletal system of construction workers. However, due to the nature of the construction site, where a large number of workers work simultaneously in an irregular space, it is difficult to determine the weight of the object carried by a worker in real time or to keep track of workers who carry excess weight. This paper proposes a prototype system to track the weight of heavy objects carried by construction workers by developing smart safety shoes with FSR (Force Sensitive Resistor) sensors. The system consists of smart safety shoes with sensors attached, a mobile device for collecting initial sensing data, and a web-based server computer for storing, preprocessing and analyzing such data. The effectiveness and accuracy of the weight tracking system were verified through experiments in which each experimenter lifted a weight from +0 kg to +20 kg in 5 kg increments. The results of the experiment were analyzed by a newly developed machine learning-based model, which adopts classification algorithms such as decision tree, random forest, gradient boosting algorithm (GBM), and light GBM. The classification algorithms achieved similar, high average accuracy in classifying the weight, in the following order: random forest (90.9%), light GBM (90.5%), decision tree (90.3%), and GBM (89%). Overall, the proposed weight tracking system achieves an average accuracy of 90.2% in classifying how much weight each experimenter carries.
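The overall figure quoted in this abstract appears to be the unweighted mean of the four per-algorithm accuracies; that reading is an assumption, but the arithmetic is consistent with it, as this short check shows (the variable names are illustrative).

```python
# Per-algorithm accuracies (%) as reported in the abstract.
accuracies = {
    "random forest": 90.9,
    "light GBM": 90.5,
    "decision tree": 90.3,
    "GBM": 89.0,
}

# Unweighted mean across the four classifiers.
mean_accuracy = sum(accuracies.values()) / len(accuracies)

# Algorithms ordered from most to least accurate.
ranked = sorted(accuracies, key=accuracies.get, reverse=True)

print(f"mean accuracy = {mean_accuracy:.3f}%")
print("ranking:", ", ".join(ranked))
```

The mean works out to 360.7 / 4 = 90.175%, which matches the reported 90.2% after rounding to one decimal place.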
|
90
|
Ye T, Li S, Zhang Y. Genomic pan-cancer classification using image-based deep learning. Comput Struct Biotechnol J 2021; 19:835-846. [PMID: 33598099 PMCID: PMC7848437 DOI: 10.1016/j.csbj.2021.01.010] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 10/26/2020] [Revised: 01/05/2021] [Accepted: 01/08/2021] [Indexed: 12/24/2022] Open
Abstract
Accurate cancer type classification based on genetic mutation can significantly facilitate cancer-related diagnosis. However, existing methods usually use feature selection combined with simple classifiers to quantify key mutated genes, resulting in poor classification performance. To circumvent this problem, a novel image-based deep learning strategy is employed to distinguish different types of cancer. Unlike conventional methods, we first convert gene mutation data containing single nucleotide polymorphisms, insertions and deletions into a genetic mutation map, and then apply deep learning networks to classify different cancer types based on the mutation map. We outline these methods and present results obtained in training VGG-16, Inception-v3, ResNet-50 and Inception-ResNet-v2 neural networks to classify 36 types of cancer from 9047 patient samples. Our approach achieves overall higher accuracy (over 95%) compared with other widely adopted classification methods. Furthermore, we demonstrate the application of a Guided Grad-CAM visualization to generate heatmaps and identify the top-ranked tumor-type-specific genes and pathways. Experimental results on prostate and breast cancer demonstrate that our method can be applied to various types of cancer. Powered by deep learning, this approach can potentially provide a new solution for pan-cancer classification and cancer driver gene discovery. The source code and datasets supporting the study are available at https://github.com/yetaoyu/Genomic-pan-cancer-classification.
Affiliation(s)
- Taoyu Ye
- Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055, China
| | - Sen Li
- Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055, China
| | - Yang Zhang
- Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, 518055, China
|
91
|
Alharthi AM, Lee MH, Algamal ZY. Gene selection and classification of microarray gene expression data based on a new adaptive L1-norm elastic net penalty. INFORMATICS IN MEDICINE UNLOCKED 2021. [DOI: 10.1016/j.imu.2021.100622] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 11/29/2022] Open
|
92
|
LogSum + L2 penalized logistic regression model for biomarker selection and cancer classification. Sci Rep 2020; 10:22125. [PMID: 33335163 PMCID: PMC7747646 DOI: 10.1038/s41598-020-79028-0] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 05/21/2020] [Accepted: 11/25/2020] [Indexed: 12/11/2022] Open
Abstract
Biomarker selection and cancer classification play an important role in knowledge discovery using genomic data. Successful identification of gene biomarkers and biological pathways can significantly improve the accuracy of diagnosis and help machine learning models perform better in classifying different types of cancer. In this paper, we proposed a LogSum + L2 penalized logistic regression model, and furthermore used a coordinate descent algorithm to solve it. The results of simulations and real experiments indicate that the proposed method is highly competitive among several state-of-the-art methods. Our proposed model achieves excellent performance in group feature selection and classification problems.
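The abstract does not spell out the objective function. As a rough sketch, a LogSum + L2 penalized logistic regression objective is often written as the average logistic loss plus a log-sum term and a ridge term. The exact form below (the `log(1 + |b| / eps)` variant, the `eps` value, and the names `lam_logsum` and `lam_l2`) is an assumption for illustration and may differ from the paper's formulation.

```python
import math


def logsum_l2_penalty(beta, lam_logsum, lam_l2, eps=1e-2):
    """Assumed penalty: lam_logsum * sum(log(1 + |b|/eps)) + lam_l2 * sum(b^2)."""
    logsum = sum(math.log(1.0 + abs(b) / eps) for b in beta)
    ridge = sum(b * b for b in beta)
    return lam_logsum * logsum + lam_l2 * ridge


def penalized_logistic_loss(X, y, beta, lam_logsum, lam_l2, eps=1e-2):
    """Average negative log-likelihood of logistic regression plus the penalty."""
    nll = 0.0
    for row, target in zip(X, y):
        z = sum(b * x for b, x in zip(beta, row))
        # log(1 + exp(z)) evaluated in a numerically stable way
        softplus = max(z, 0.0) + math.log1p(math.exp(-abs(z)))
        nll += softplus - target * z
    return nll / len(y) + logsum_l2_penalty(beta, lam_logsum, lam_l2, eps)


# Toy data (hypothetical): two samples, two features.
X = [[1.0, 0.5], [-1.0, 2.0]]
y = [1, 0]
loss_zero = penalized_logistic_loss(X, y, [0.0, 0.0], lam_logsum=0.1, lam_l2=0.1)
loss_fit = penalized_logistic_loss(X, y, [0.5, -0.5], lam_logsum=0.1, lam_l2=0.1)
print(loss_zero, loss_fit)
```

A coordinate descent solver, as mentioned in the abstract, would minimize an objective of this kind by updating one coefficient at a time while holding the others fixed; the solver itself is not sketched here.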
|
93
|
Ghosh SK, Ghosh A. A Novel Human Diabetes Biomarker Recognition Approach Using Fuzzy Rough Multigranulation Nearest Neighbour Classifier Model. Interdiscip Sci 2020; 12:461-475. [PMID: 32920773 DOI: 10.1007/s12539-020-00391-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 05/09/2020] [Revised: 08/22/2020] [Accepted: 08/31/2020] [Indexed: 10/23/2022]
Abstract
The selection of gene identifiers from microarray databases is a challenging task, since microarrays contain a large number of gene attributes for only a few samples. This article proposes a novel fuzzy-rough set-based gene expression feature selection method using the fuzzy-rough reduct under multi-granular space for human diabetes patients. First, the fuzzy multi-granular gain is computed from the expression datasets via fuzzy entropy, which reduces the dimension of the database. Thereafter, the features are selected from the microarray using the fuzzy-rough reduct and information gain with respect to their expression patterns. To reduce the computational cost, a decision-making scheme is designed using a rough approximation of a fuzzy concept in a multi-granulation framework. Finally, we recognize the association among the genomes that have markedly different expression patterns between the controlled state and the diabetic state, with respect to their impression, using a modified fuzzy-rough nearest neighbour classifier (FRNNC). Five standard diabetic microarray datasets were used to quantify the efficiency of the designed FRNNC model, which was validated with the F measure using the NCBI diabetes gene expression database and performs better than existing methods.
Affiliation(s)
- Swarup Kr Ghosh
- Department of Computer Science and Engineering, Sister Nivedita University, Kolkata, India.
| | - Anupam Ghosh
- Department of Computer Science and Engineering, Netaji Subhash Engineering College, Kolkata, India
|
94
|
Zhong Y, Chalise P, He J. Nested cross-validation with ensemble feature selection and classification model for high-dimensional biological data. COMMUN STAT-SIMUL C 2020. [DOI: 10.1080/03610918.2020.1850790] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 10/22/2022]
Affiliation(s)
- Yi Zhong
- Department of Biostatistics and Data Science, University of Kansas Medical Center, Kansas City, KS, USA
| | - Prabhakar Chalise
- Department of Biostatistics and Data Science, University of Kansas Medical Center, Kansas City, KS, USA
| | - Jianghua He
- Department of Biostatistics and Data Science, University of Kansas Medical Center, Kansas City, KS, USA
|
95
|
Nam JH, Kim D, Chung D. Sparse Linear Discriminant Analysis using the Prior-Knowledge-Guided Block Covariance Matrix. CHEMOMETRICS AND INTELLIGENT LABORATORY SYSTEMS : AN INTERNATIONAL JOURNAL SPONSORED BY THE CHEMOMETRICS SOCIETY 2020; 206:104142. [PMID: 32968333 PMCID: PMC7505231 DOI: 10.1016/j.chemolab.2020.104142] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/10/2023]
Abstract
There are two key challenges when using linear discriminant analysis in the high-dimensional setting: singularity of the covariance matrix and difficulty of interpreting the resulting classifier. Although several methods have been proposed to address these problems, they focused only on identifying a parsimonious set of variables maximizing classification accuracy. However, most methods did not appropriately consider the dependency between variables and the efficacy of the selected variables. To address these limitations, here we propose a new approach that directly estimates the sparse discriminant vector without the need to estimate the whole inverse covariance matrix, by formulating a quadratic optimization problem. Furthermore, this approach also allows external information to be integrated to guide the structure of the covariance matrix. We evaluated the proposed model with simulation studies. We then applied it to a transcriptomic study that aims to identify genomic markers predictive of the response to cancer immunotherapy, where the covariance matrix was constructed based on the prior knowledge available in the pathway database.
Affiliation(s)
- Jin Hyun Nam
- Division of Biostatistics and Bioinformatics, Department of Public Health Sciences, Medical University of South Carolina, Charleston, SC 29412, United States of America
- School of Pharmacy, Sungkyunkwan University, Suwon, Republic of Korea
| | - Donguk Kim
- Department of Statistics, Sungkyunkwan University, Seoul, Republic of Korea
| | - Dongjun Chung
- Department of Biomedical Informatics, The Ohio State University, Columbus, Ohio 43210, United States of America
|
96
|
Rafique O, Mir AH. Weighted dimensionality reduction and robust Gaussian mixture model based cancer patient subtyping from gene expression data. J Biomed Inform 2020; 112:103620. [PMID: 33188907 DOI: 10.1016/j.jbi.2020.103620] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2020] [Revised: 10/23/2020] [Accepted: 11/04/2020] [Indexed: 10/23/2022]
Abstract
BACKGROUND The heterogeneous nature of cancer necessitates subtyping of cancer patients into distinct and well separated subgroups. However, computational issues arise because gene expression data is noisy and contains outliers apart from being high dimensional. As such, an attempt to subtype cancer patients from gene expression data leads to highly overlapping Kaplan-Meier (KM) survival plots and thus clear distinction among the discovered subtypes becomes difficult. Here we attempt to achieve a greater separation among the subtypes through a robust clustering pipeline. METHODS We propose a robust framework to achieve a better separation among the discovered subtypes. Our framework is based on dimensionality reduction of a weighted gene expression matrix using t-distributed Stochastic Neighbor Embedding (t-SNE) and a robust Gaussian mixture model based clustering approach. Every gene is weighted according to the median absolute deviation (MAD) of the gene before dimensionality reduction. The results are quantified by measuring the minimum pairwise separation among the KM plots and minimum hazard ratio among the subtypes. We also introduce a novel method, called cumulative survival separation, to quantify the separation among the discovered subtypes. RESULTS To validate the proposed methodology we obtained five cancer gene expression datasets from The Cancer Genome Atlas (TCGA) and comparisons with Consensus Clustering (CC), Consensus non-negative matrix factorization (CNMF), fast density-aware spectral clustering (Spectrum) and Neighborhood based Multi-Omics clustering (NEMO) methodologies show that the proposed method is able to achieve a greater separation compared to the aforementioned methods in literature. 
For instance, the minimum pairwise life expectancy difference between the discovered subtypes for GBM is 61 days for the proposed methodology with MAD scores, whereas it is only approximately 33, 19, 49 and 33 days for CC, Spectrum, NEMO and CNMF, respectively. Comparisons are also shown for the proposed framework with and without the MAD scores, and it is observed that the MAD score significantly improves the subtype separation. Hazard ratio analysis also shows that the proposed methodology performs better. Furthermore, pathway over-representation analyses were carried out to identify relevant genetic pathways that could be targets for treatment. CONCLUSION The results suggest that the use of median absolute deviation and a robust clustering methodology are helpful in achieving greater separation among the subtypes with better statistical and clinical significance.
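The gene-weighting step described in this abstract can be illustrated with a small sketch: compute the median absolute deviation (MAD) of each gene across samples and scale that gene's expression values by it. The abstract does not give the exact weighting formula, so multiplying each column by its raw MAD is an assumption, as are the function names and the toy matrix; the t-SNE and Gaussian mixture steps are not shown.

```python
import statistics


def mad(values):
    """Median absolute deviation of a sequence of numbers."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])


def mad_weighted(matrix):
    """Scale each gene (column) of a samples-by-genes matrix by its MAD."""
    genes = list(zip(*matrix))                 # transpose: one tuple per gene
    weights = [mad(g) for g in genes]
    weighted = [[x * w for x, w in zip(row, weights)] for row in matrix]
    return weights, weighted


# Hypothetical expression matrix: 3 samples (rows) x 3 genes (columns).
expr = [
    [1.0, 10.0, 5.0],
    [2.0, 30.0, 5.0],
    [3.0, 50.0, 5.0],
]
weights, weighted = mad_weighted(expr)
print(weights)
print(weighted[0])
```

In this toy example the constant third gene receives weight 0 and therefore contributes nothing to any later distance computation, while the most variable gene dominates, which is the intended effect of weighting by a robust measure of spread.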
Affiliation(s)
- Omar Rafique
- Machine Learning Lab, Department of Electronics and Communication Engineering, National Institute of Technology, Srinagar, JK, India.
| | - A H Mir
- Machine Learning Lab, Department of Electronics and Communication Engineering, National Institute of Technology, Srinagar, JK, India
|
97
|
Iliopoulos A, Beis G, Apostolou P, Papasotiriou I. Complex Networks, Gene Expression and Cancer Complexity: A Brief Review of Methodology and Applications. Curr Bioinform 2020. [DOI: 10.2174/1574893614666191017093504] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 12/28/2022]
Abstract
In this brief survey, various aspects of cancer complexity and how this complexity can be confronted using modern complex networks’ theory and gene expression datasets, are described. In particular, the causes and the basic features of cancer complexity, as well as the challenges it brought are underlined, while the importance of gene expression data in cancer research and in reverse engineering of gene co-expression networks is highlighted. In addition, an introduction to the corresponding theoretical and mathematical framework of graph theory and complex networks is provided. The basics of network reconstruction along with the limitations of gene network inference, the enrichment and survival analysis, evolution, robustness-resilience and cascades in complex networks, are described. Finally, an indicative and suggestive example of a cancer gene co-expression network inference and analysis is given.
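One simple way to reverse-engineer a gene co-expression network from expression data is to connect two genes whenever the absolute Pearson correlation of their profiles exceeds a threshold. The sketch below illustrates that generic idea only; it is not the example given in the review, and the gene names, expression values, and the 0.8 threshold are hypothetical.

```python
import math
from itertools import combinations


def pearson(a, b):
    """Pearson correlation of two equal-length profiles with nonzero variance."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def coexpression_edges(expression, threshold):
    """Return gene pairs whose absolute correlation reaches the threshold."""
    edges = []
    for g1, g2 in combinations(sorted(expression), 2):
        if abs(pearson(expression[g1], expression[g2])) >= threshold:
            edges.append((g1, g2))
    return edges


# Hypothetical expression profiles across four samples.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [4.0, 1.0, 3.0, 2.0],
}
edges = coexpression_edges(expression, threshold=0.8)

# Node degree: number of edges touching each gene.
degree = {g: sum(g in e for e in edges) for g in expression}
print(edges, degree)
```

Graph-theoretic quantities such as degree, computed on the resulting edge list, are the starting point for the kind of network analysis the survey describes.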
Affiliation(s)
- A.C. Iliopoulos
- Research and Development Department, Research Genetic Cancer Centre S.A., Florina, Greece
| | - G. Beis
- Research and Development Department, Research Genetic Cancer Centre S.A., Florina, Greece
| | - P. Apostolou
- Research and Development Department, Research Genetic Cancer Centre S.A., Florina, Greece
| | - I. Papasotiriou
- Research Genetic Cancer Centre International GmbH, Zug, Switzerland
|
98
|
Lung PY, Zhong D, Pang X, Li Y, Zhang J. Maximizing the reusability of gene expression data by predicting missing metadata. PLoS Comput Biol 2020; 16:e1007450. [PMID: 33156882 PMCID: PMC7673503 DOI: 10.1371/journal.pcbi.1007450] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 09/28/2019] [Revised: 11/18/2020] [Accepted: 10/09/2020] [Indexed: 11/18/2022] Open
Abstract
Reusability is part of the FAIR data principle, which aims to make data Findable, Accessible, Interoperable, and Reusable. One of the current efforts to increase the reusability of public genomics data has been to focus on the inclusion of quality metadata associated with the data. When necessary metadata are missing, most researchers will consider the data useless. In this study, we developed a framework to predict the missing metadata of gene expression datasets to maximize their reusability. We found that when using predicted data to conduct other analyses, it is not optimal to use all the predicted data. Instead, one should use only the subset of data that can be predicted accurately. We proposed a new metric called Proportion of Cases Accurately Predicted (PCAP), which is optimized in our specifically designed machine learning pipeline. The new approach performed better than pipelines using commonly used metrics such as F1-score in terms of maximizing the reusability of data with missing values. We also found that different variables might need to be predicted using different machine learning methods and/or different data processing protocols. Using differential gene expression analysis as an example, we showed that when missing variables are accurately predicted, the corresponding gene expression data can be reliably used in downstream analyses.
Affiliation(s)
- Pei-Yau Lung
- Department of Statistics, Florida State University, Tallahassee, United States of America
| | - Dongrui Zhong
- Department of Statistics, Florida State University, Tallahassee, United States of America
| | - Xiaodong Pang
- Insilicom LLC, Tallahassee, United States of America
| | - Yan Li
- Department of Breast Surgery, Peking Union Medical College Hospital, Peking Union Medical College, Chinese Academy of Medical Sciences, Beijing, China
| | - Jinfeng Zhang
- Department of Statistics, Florida State University, Tallahassee, United States of America
|
99
|
Yuan B, Yang D, Rothberg BEG, Chang H, Xu T. Unsupervised and supervised learning with neural network for human transcriptome analysis and cancer diagnosis. Sci Rep 2020; 10:19106. [PMID: 33154423 PMCID: PMC7644700 DOI: 10.1038/s41598-020-75715-0] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 08/23/2019] [Accepted: 10/15/2020] [Indexed: 11/09/2022] Open
Abstract
Deep learning analysis of images and text unfolds new horizons in medicine. However, analysis of transcriptomic data, the cause of biological and pathological changes, is hampered by structural complexity distinctive from images and text. Here we conduct unsupervised training on more than 20,000 human normal and tumor transcriptomic data and show that the resulting Deep-Autoencoder, DeepT2Vec, has successfully extracted informative features and embedded transcriptomes into 30-dimensional Transcriptomic Feature Vectors (TFVs). We demonstrate that the TFVs could recapitulate expression patterns and be used to track tissue origins. Trained on these extracted features only, a supervised classifier, DeepC, can effectively distinguish tumors from normal samples with an accuracy of 90% for Pan-Cancer and reach an average 94% for specific cancers. Training on a connected network, the accuracy is further increased to 96% for Pan-Cancer. Together, our study shows that deep learning with autoencoder is suitable for transcriptomic analysis, and DeepT2Vec could be successfully applied to distinguish cancers, normal tissues, and other potential traits with limited samples.
Affiliation(s)
- Bo Yuan
- Department of Genetics, Yale Cancer Center, Howard Hughes Medical Institute, Yale University School of Medicine, 295 Congress Avenue, New Haven, CT, 06510, USA
- Zhiyuan College, Shanghai Jiao Tong University, Shanghai, China
- Department of Cell Biology, Harvard Medical School, Boston, MA, 02138, USA
| | - Dong Yang
- Westlake Institute for Advanced Study, Westlake University, Hangzhou, China
- Department of Genetics, Yale Cancer Center, Howard Hughes Medical Institute, Yale University School of Medicine, 295 Congress Avenue, New Haven, CT, 06510, USA
| | - Bonnie E G Rothberg
- Medical Oncology, Department of Internal Medicine, Yale Cancer Center, Yale University School of Medicine, New Haven, USA
| | - Hao Chang
- Department of Genetics, Yale Cancer Center, Howard Hughes Medical Institute, Yale University School of Medicine, 295 Congress Avenue, New Haven, CT, 06510, USA
| | - Tian Xu
- Westlake Institute for Advanced Study, Westlake University, Hangzhou, China
- Department of Genetics, Yale Cancer Center, Howard Hughes Medical Institute, Yale University School of Medicine, 295 Congress Avenue, New Haven, CT, 06510, USA
|
100
|
Affine-transformation invariant clustering models. JOURNAL OF STATISTICAL DISTRIBUTIONS AND APPLICATIONS 2020. [DOI: 10.1186/s40488-020-00111-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/10/2022]
Abstract
We develop a cluster process which is invariant with respect to unknown affine transformations of the feature space without knowing the number of clusters in advance. Specifically, our proposed method can identify clusters invariant under (I) orthogonal transformations, (II) scaling-coordinate orthogonal transformations, and (III) arbitrary nonsingular linear transformations, corresponding to models I, II, and III, respectively, and represent clusters with the proposed heatmap of the similarity matrix. The proposed Metropolis-Hastings algorithm leads to an irreducible and aperiodic Markov chain, which is also efficient at identifying clusters reasonably well for various applications. Both the synthetic and real data examples show that the proposed method could be widely applied in many fields, especially for finding the number of clusters and identifying clusters of samples of interest in aerial photography and genomic data.
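The abstract names a Metropolis-Hastings algorithm but gives no details of its state space or proposal. As a generic illustration of the accept/reject mechanism only, the sketch below runs a random-walk Metropolis-Hastings sampler on a one-dimensional standard normal target. It is not the authors' sampler over cluster partitions, and every name and setting in it is an assumption.

```python
import math
import random


def metropolis_hastings(log_target, start, n_steps, step_size, seed=0):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    x = start
    log_p = log_target(x)
    samples, accepted = [], 0
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step_size)
        log_p_new = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(x)).
        if rng.random() < math.exp(min(0.0, log_p_new - log_p)):
            x, log_p = proposal, log_p_new
            accepted += 1
        samples.append(x)  # a rejected proposal repeats the current state
    return samples, accepted / n_steps


def log_std_normal(x):
    """Unnormalized log-density of a standard normal distribution."""
    return -0.5 * x * x


samples, acceptance_rate = metropolis_hastings(log_std_normal, 0.0, 20000, 1.0)
sample_mean = sum(samples) / len(samples)
print(f"acceptance rate = {acceptance_rate:.2f}, sample mean = {sample_mean:.3f}")
```

Because the Gaussian proposal can reach any point and rejections keep the chain in place with positive probability, this toy chain is irreducible and aperiodic, the same two properties the abstract claims for the proposed algorithm.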
|