151
Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E, Mauck WM, Hao Y, Stoeckius M, Smibert P, Satija R. Comprehensive Integration of Single-Cell Data. Cell 2019; 177:1888-1902.e21. [PMID: 31178118 PMCID: PMC6687398 DOI: 10.1016/j.cell.2019.05.031] [Citation(s) in RCA: 8922] [Impact Index Per Article: 1487.0] [Received: 11/05/2018] [Revised: 02/14/2019] [Accepted: 05/15/2019] [Indexed: 11/25/2022]
Abstract
Single-cell transcriptomics has transformed our ability to characterize cell states, but deep biological understanding requires more than a taxonomic listing of clusters. As new methods arise to measure distinct cellular modalities, a key analytical challenge is to integrate these datasets to better understand cellular identity and function. Here, we develop a strategy to "anchor" diverse datasets together, enabling us to integrate single-cell measurements not only across scRNA-seq technologies, but also across different modalities. After demonstrating improvement over existing methods for integrating scRNA-seq data, we anchor scRNA-seq experiments with scATAC-seq to explore chromatin differences in closely related interneuron subsets and project protein expression measurements onto a bone marrow atlas to characterize lymphocyte populations. Lastly, we harmonize in situ gene expression and scRNA-seq datasets, allowing transcriptome-wide imputation of spatial gene expression patterns. Our work presents a strategy for the assembly of harmonized references and transfer of information across datasets.
Affiliation(s)
- Tim Stuart
  - New York Genome Center, New York, NY, USA
- Andrew Butler
  - New York Genome Center, New York, NY, USA; Center for Genomics and Systems Biology, New York University, New York, NY, USA
- Efthymia Papalexi
  - New York Genome Center, New York, NY, USA; Center for Genomics and Systems Biology, New York University, New York, NY, USA
- William M Mauck
  - New York Genome Center, New York, NY, USA; Center for Genomics and Systems Biology, New York University, New York, NY, USA
- Yuhan Hao
  - New York Genome Center, New York, NY, USA; Center for Genomics and Systems Biology, New York University, New York, NY, USA
- Marlon Stoeckius
  - Technology Innovation Lab, New York Genome Center, New York, NY, USA
- Peter Smibert
  - Technology Innovation Lab, New York Genome Center, New York, NY, USA
- Rahul Satija
  - New York Genome Center, New York, NY, USA; Center for Genomics and Systems Biology, New York University, New York, NY, USA
152
Scrucca L, Serafini A. Projection Pursuit Based on Gaussian Mixtures and Evolutionary Algorithms. J Comput Graph Stat 2019. [DOI: 10.1080/10618600.2019.1598871] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 10/27/2022]
Affiliation(s)
- Luca Scrucca
  - Department of Economics, Università degli Studi di Perugia, Perugia, Italy
- Alessio Serafini
  - Department of Economics, Università degli Studi di Perugia, Perugia, Italy
153
Yue L, Li G, Lian H, Wan X. Regression adjustment for treatment effect with multicollinearity in high dimensions. Comput Stat Data Anal 2019. [DOI: 10.1016/j.csda.2018.11.002] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Indexed: 12/30/2022]
154
Analytical Validation of Multiplex Biomarker Assay to Stratify Colorectal Cancer into Molecular Subtypes. Sci Rep 2019; 9:7665. [PMID: 31113981 PMCID: PMC6529539 DOI: 10.1038/s41598-019-43492-0] [Citation(s) in RCA: 30] [Impact Index Per Article: 5.0] [Received: 01/24/2019] [Accepted: 04/23/2019] [Indexed: 01/09/2023] [Open Access]
Abstract
Previously, we classified colorectal cancers (CRCs) into five CRCAssigner (CRCA) subtypes with different prognoses and potential treatment responses, later consolidated into four consensus molecular subtypes (CMS). Here we demonstrate the analytical development and validation of a custom NanoString nCounter platform-based biomarker assay (NanoCRCA) to stratify CRCs into subtypes. To reduce costs, we switched from the standard nCounter protocol to a custom modified protocol. The assay included a reduced 38-gene panel that was selected using an in-house machine-learning pipeline. We applied NanoCRCA to 413 samples from 355 CRC patients. From the fresh frozen samples (n = 237), a subset had matched microarray/RNAseq profiles (n = 47) or formalin-fixed paraffin-embedded (FFPE) samples (n = 58). We also analyzed a further 118 FFPE samples. We compared the assay results with the CMS classifier, different platforms (microarrays/RNAseq) and gene-set classifiers (38 and the original 786 genes). The standard and modified protocols showed high correlation (> 0.88) for gene expression. Technical replicates were highly correlated (> 0.96). NanoCRCA classified fresh frozen and FFPE samples into all five CRCA subtypes with consistent classification of selected matched fresh frozen/FFPE samples. We demonstrate high and significant subtype concordance across protocols (100%), gene sets (95%), platforms (87%) and with CMS subtypes (75%) when evaluated across multiple datasets. Overall, our NanoCRCA assay with further validation may facilitate prospective validation of CRC subtypes in clinical trials and beyond.
155
Singh V, Verma NK, Cui Y. Type-2 Fuzzy PCA Approach in Extracting Salient Features for Molecular Cancer Diagnostics and Prognostics. IEEE Trans Nanobioscience 2019; 18:482-489. [PMID: 31107656 DOI: 10.1109/tnb.2019.2917814] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Indexed: 11/07/2022]
Abstract
Machine learning is becoming a powerful tool for cancer diagnosis and prognosis based on classification using high dimensional molecular data. However, extracting classification features from high-dimensional datasets remains a challenging problem. Principal component analysis (PCA) is a widely used method for dimensionality reduction. However, it is well-known that PCA and most PCA-based feature extraction methods are sensitive to noise, which may affect the accuracy of the subsequent classification. To address this problem, here we have proposed a robust fuzzy principal component analysis (PCA) with interval type-2 (IT-2) fuzzy membership functions for feature extraction. We have tested the performance of three widely used classifiers using the features extracted by proposed approaches and other feature extraction methods - PCA-based feature extraction methods (i.e. conventional PCA and fuzzy PCA), linear discriminant analysis (LDA), and support vector machine recursive feature elimination (SVM-RFE). The proposed feature extraction approaches showed better performance on cancer transcriptome and proteome datasets.
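For orientation, the conventional PCA baseline named in the abstract above can be sketched as follows. This is a minimal illustration of ordinary PCA (first component only, found by power iteration on the sample covariance matrix), not the interval type-2 fuzzy variant the authors propose. The function name `first_principal_component` and the iteration count are assumptions made for this sketch.

```python
import math

def first_principal_component(X, iters=200):
    """Return the unit-length first principal axis of X (rows = samples).

    Plain PCA via power iteration; assumes the data are not all identical,
    so the covariance matrix is non-zero.
    """
    n, d = len(X), len(X[0])
    # Center each feature at its sample mean.
    means = [sum(row[j] for row in X) / n for j in range(d)]
    Z = [[row[j] - means[j] for j in range(d)] for row in X]
    # Sample covariance matrix (d x d).
    cov = [[sum(z[a] * z[b] for z in Z) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v
```

Projecting each centered sample onto the returned axis gives a one-dimensional feature. Because every sample contributes equally to the covariance estimate, outliers can tilt the axis, which is the noise sensitivity the abstract refers to.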
156
Borcherding N, Voigt AP, Liu V, Link BK, Zhang W, Jabbari A. Single-Cell Profiling of Cutaneous T-Cell Lymphoma Reveals Underlying Heterogeneity Associated with Disease Progression. Clin Cancer Res 2019; 25:2996-3005. [PMID: 30718356 PMCID: PMC6659117 DOI: 10.1158/1078-0432.ccr-18-3309] [Citation(s) in RCA: 91] [Impact Index Per Article: 15.2] [Received: 10/09/2018] [Revised: 12/07/2018] [Accepted: 01/25/2019] [Indexed: 11/16/2022]
Abstract
PURPOSE Cutaneous T-cell lymphomas (CTCL), encompassing a spectrum of T-cell lymphoproliferative disorders involving the skin, have collectively increased in incidence over the last 40 years. Sézary syndrome is an aggressive form of CTCL characterized by significant presence of malignant cells in both the blood and skin. The guarded prognosis for Sézary syndrome reflects a lack of reliably effective therapy, due, in part, to an incomplete understanding of disease pathogenesis. EXPERIMENTAL DESIGN Using single-cell sequencing of RNA and the machine-learning reverse graph embedding approach in the Monocle package, we defined a model featuring distinct transcriptomic states within Sézary syndrome. Gene expression used to differentiate the unique transcriptional states were further used to develop a boosted tree classification for early versus late CTCL disease. RESULTS Our analysis showed the involvement of FOXP3 + malignant T cells during clonal evolution, transitioning from FOXP3 + T cells to GATA3 + or IKZF2 + (HELIOS) tumor cells. Transcriptomic diversities in a clonal tumor can be used to predict disease stage, and we were able to characterize a gene signature that predicts disease stage with close to 80% accuracy. FOXP3 was found to be the most important factor to predict early disease in CTCL, along with another 19 genes used to predict CTCL stage. CONCLUSIONS This work offers insight into the heterogeneity of Sézary syndrome, providing better understanding of the transcriptomic diversities within a clonal tumor. This transcriptional heterogeneity can predict tumor stage and thereby offer guidance for therapy.
Affiliation(s)
- Nicholas Borcherding
  - Department of Pathology, University of Iowa, College of Medicine, Iowa City, Iowa
  - Cancer Biology Graduate Program, University of Iowa, College of Medicine, Iowa City, Iowa
  - Medical Scientist Training Program, University of Iowa, College of Medicine, Iowa City, Iowa
  - Holden Comprehensive Cancer Center, University of Iowa, College of Medicine, Iowa City, Iowa
- Andrew P Voigt
  - Medical Scientist Training Program, University of Iowa, College of Medicine, Iowa City, Iowa
- Vincent Liu
  - Department of Pathology, University of Iowa, College of Medicine, Iowa City, Iowa
  - Holden Comprehensive Cancer Center, University of Iowa, College of Medicine, Iowa City, Iowa
  - Department of Dermatology, University of Iowa, College of Medicine, Iowa City, Iowa
- Brian K Link
  - Holden Comprehensive Cancer Center, University of Iowa, College of Medicine, Iowa City, Iowa
  - Department of Internal Medicine, University of Iowa, College of Medicine, Iowa City, Iowa
- Weizhou Zhang
  - Department of Pathology, University of Iowa, College of Medicine, Iowa City, Iowa
  - Cancer Biology Graduate Program, University of Iowa, College of Medicine, Iowa City, Iowa
  - Medical Scientist Training Program, University of Iowa, College of Medicine, Iowa City, Iowa
  - Holden Comprehensive Cancer Center, University of Iowa, College of Medicine, Iowa City, Iowa
  - Interdisciplinary Program in Immunology, University of Iowa, College of Medicine, Iowa City, Iowa
- Ali Jabbari
  - Cancer Biology Graduate Program, University of Iowa, College of Medicine, Iowa City, Iowa
  - Medical Scientist Training Program, University of Iowa, College of Medicine, Iowa City, Iowa
  - Holden Comprehensive Cancer Center, University of Iowa, College of Medicine, Iowa City, Iowa
  - Department of Dermatology, University of Iowa, College of Medicine, Iowa City, Iowa
  - Interdisciplinary Program in Immunology, University of Iowa, College of Medicine, Iowa City, Iowa
157
158
Zhou Y, Wan X, Zhang B, Tong T. Classifying next-generation sequencing data using a zero-inflated Poisson model. Bioinformatics 2019; 34:1329-1335. [PMID: 29186294 DOI: 10.1093/bioinformatics/btx768] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 07/18/2017] [Accepted: 11/24/2017] [Indexed: 11/14/2022] [Open Access]
Abstract
Motivation With the development of high-throughput techniques, RNA-sequencing (RNA-seq) is becoming increasingly popular as an alternative for gene expression analysis, such as RNAs profiling and classification. Identifying which type of diseases a new patient belongs to with RNA-seq data has been recognized as a vital problem in medical research. As RNA-seq data are discrete, statistical methods developed for classifying microarray data cannot be readily applied for RNA-seq data classification. Witten proposed a Poisson linear discriminant analysis (PLDA) to classify the RNA-seq data in 2011. Note, however, that the count datasets are frequently characterized by excess zeros in real RNA-seq or microRNA sequence data (i.e. when the sequence depth is not enough or small RNAs with the length of 18-30 nucleotides). Therefore, it is desired to develop a new model to analyze RNA-seq data with an excess of zeros. Results In this paper, we propose a Zero-Inflated Poisson Logistic Discriminant Analysis (ZIPLDA) for RNA-seq data with an excess of zeros. The new method assumes that the data are from a mixture of two distributions: one is a point mass at zero, and the other follows a Poisson distribution. We then consider a logistic relation between the probability of observing zeros and the mean of the genes and the sequencing depth in the model. Simulation studies show that the proposed method performs better than, or at least as well as, the existing methods in a wide range of settings. Two real datasets including a breast cancer RNA-seq dataset and a microRNA-seq dataset are also analyzed, and they coincide with the simulation results that our proposed method outperforms the existing competitors. Availability and implementation The software is available at http://www.math.hkbu.edu.hk/~tongt. Contact xwan@comp.hkbu.edu.hk or tongt@hkbu.edu.hk. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yan Zhou
  - College of Mathematics and Statistics, Institute of Statistical Sciences, Shenzhen University, Shenzhen 518060, China
- Xiang Wan
  - Department of Computer Science, and Institute of Computational and Theoretical Studies, Hong Kong Baptist University, Kowloon Tong, Hong Kong
- Baoxue Zhang
  - School of Statistics, Capital University of Economics and Business, Beijing 100070, China
- Tiejun Tong
  - Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Hong Kong
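The mixture described in the abstract above (a point mass at zero plus a Poisson component) can be illustrated with a short sketch. It is not the authors' ZIPLDA software: the extra-zero probability `pi` is held fixed here rather than modeled through the logistic relation with gene mean and sequencing depth, class priors are assumed equal, and all function names are hypothetical.

```python
import math

def zip_pmf(k, lam, pi):
    """P(X = k) for a zero-inflated Poisson: Poisson mean lam, extra-zero probability pi."""
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    if k == 0:
        # A zero can come from the point mass or from the Poisson component.
        return pi + (1.0 - pi) * poisson
    return (1.0 - pi) * poisson

def zip_log_likelihood(counts, lams, pi):
    """Log-likelihood of a count vector, treating genes as independent."""
    return sum(math.log(zip_pmf(k, lam, pi)) for k, lam in zip(counts, lams))

def classify(counts, class_means, pi):
    """Pick the class whose per-gene means give the highest log-likelihood (equal priors assumed)."""
    return max(class_means,
               key=lambda c: zip_log_likelihood(counts, class_means[c], pi))
```

For example, `classify([0, 5, 0], {"A": [1.0, 5.0, 1.0], "B": [5.0, 1.0, 5.0]}, 0.2)` selects class `"A"`, because the observed zeros and the count of 5 are far more probable under A's gene means.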
159
Braman N, Prasanna P, Whitney J, Singh S, Beig N, Etesami M, Bates DDB, Gallagher K, Bloch BN, Vulchi M, Turk P, Bera K, Abraham J, Sikov WM, Somlo G, Harris LN, Gilmore H, Plecha D, Varadan V, Madabhushi A. Association of Peritumoral Radiomics With Tumor Biology and Pathologic Response to Preoperative Targeted Therapy for HER2 (ERBB2)-Positive Breast Cancer. JAMA Netw Open 2019; 2:e192561. [PMID: 31002322 PMCID: PMC6481453 DOI: 10.1001/jamanetworkopen.2019.2561] [Citation(s) in RCA: 218] [Impact Index Per Article: 36.3] [Indexed: 12/16/2022] [Open Access]
Abstract
IMPORTANCE There has been significant recent interest in understanding the utility of quantitative imaging to delineate breast cancer intrinsic biological factors and therapeutic response. No clinically accepted biomarkers are as yet available for estimation of response to human epidermal growth factor receptor 2 (currently known as ERBB2, but referred to as HER2 in this study)-targeted therapy in breast cancer. OBJECTIVE To determine whether imaging signatures on clinical breast magnetic resonance imaging (MRI) could noninvasively characterize HER2-positive tumor biological factors and estimate response to HER2-targeted neoadjuvant therapy. DESIGN, SETTING, AND PARTICIPANTS In a retrospective diagnostic study encompassing 209 patients with breast cancer, textural imaging features extracted within the tumor and annular peritumoral tissue regions on MRI were examined as a means to identify increasingly granular breast cancer subgroups relevant to therapeutic approach and response. First, among a cohort of 117 patients who received an MRI prior to neoadjuvant chemotherapy (NAC) at a single institution from April 27, 2012, through September 4, 2015, imaging features that distinguished HER2+ tumors from other receptor subtypes were identified. Next, among a cohort of 42 patients with HER2+ breast cancers with available MRI and RNaseq data accumulated from a multicenter, preoperative clinical trial (BrUOG 211B), a signature of the response-associated HER2-enriched (HER2-E) molecular subtype within HER2+ tumors (n = 42) was identified. The association of this signature with pathologic complete response was explored in 2 patient cohorts from different institutions, where all patients received HER2-targeted NAC (n = 28, n = 50). Finally, the association between significant peritumoral features and lymphocyte distribution was explored in patients within the BrUOG 211B trial who had corresponding biopsy hematoxylin-eosin-stained slide images. Data analysis was conducted from January 15, 2017, to February 14, 2019. MAIN OUTCOMES AND MEASURES Evaluation of imaging signatures by the area under the receiver operating characteristic curve (AUC) in identifying HER2+ molecular subtypes and distinguishing pathologic complete response (ypT0/is) to NAC with HER2-targeting. RESULTS In the 209 patients included (mean [SD] age, 51.1 [11.7] years), features from the peritumoral regions better discriminated HER2-E tumors (maximum AUC, 0.85; 95% CI, 0.79-0.90; 9-12 mm from the tumor) compared with intratumoral features (AUC, 0.76; 95% CI, 0.69-0.84). A classifier combining peritumoral and intratumoral features identified the HER2-E subtype (AUC, 0.89; 95% CI, 0.84-0.93) and was significantly associated with response to HER2-targeted therapy in both validation cohorts (AUC, 0.80; 95% CI, 0.61-0.98 and AUC, 0.69; 95% CI, 0.53-0.84). Features from the 0- to 3-mm peritumoral region were significantly associated with the density of tumor-infiltrating lymphocytes (R² = 0.57; 95% CI, 0.39-0.75; P = .002). CONCLUSIONS AND RELEVANCE A combination of peritumoral and intratumoral characteristics appears to identify intrinsic molecular subtypes of HER2+ breast cancers from imaging, offering insights into immune response within the peritumoral environment and suggesting potential benefit for treatment guidance.
Affiliation(s)
- Nathaniel Braman
  - Department of Biomedical Engineering, Case Western Reserve University, Cleveland, Ohio
- Prateek Prasanna
  - Department of Biomedical Engineering, Case Western Reserve University, Cleveland, Ohio
- Jon Whitney
  - Department of Biomedical Engineering, Case Western Reserve University, Cleveland, Ohio
- Salendra Singh
  - Case Comprehensive Cancer Center, Case Western Reserve University, Cleveland, Ohio
- Niha Beig
  - Department of Biomedical Engineering, Case Western Reserve University, Cleveland, Ohio
- Maryam Etesami
  - Department of Radiology and Biomedical Imaging, Yale School of Medicine, New Haven, Connecticut
- David D. B. Bates
  - Department of Radiology, Memorial Sloan Kettering Cancer Center, New York, New York
- Katherine Gallagher
  - Department of Radiology, Memorial Sloan Kettering Cancer Center, New York, New York
- B. Nicolas Bloch
  - Department of Radiology, Boston Medical Center, Boston, Massachusetts
  - Department of Radiology, Boston University School of Medicine, Boston, Massachusetts
- Manasa Vulchi
  - Department of Hematology and Medical Oncology, The Cleveland Clinic, Cleveland, Ohio
- Paulette Turk
  - Department of Diagnostic Radiology, The Cleveland Clinic, Cleveland, Ohio
- Kaustav Bera
  - Department of Biomedical Engineering, Case Western Reserve University, Cleveland, Ohio
- Jame Abraham
  - Department of Hematology and Medical Oncology, The Cleveland Clinic, Cleveland, Ohio
- William M. Sikov
  - Program in Women’s Oncology, Women and Infants Hospital, Warren Alpert Medical School of Brown University, Providence, Rhode Island
- George Somlo
  - Department of Medical Oncology and Therapeutics Research, City of Hope National Medical Center, Duarte, California
  - Department of Hematology and Hematopoietic Cell Transplantation, City of Hope National Medical Center, Duarte, California
- Lyndsay N. Harris
  - Case Comprehensive Cancer Center, Case Western Reserve University, Cleveland, Ohio
  - National Cancer Institute, National Institutes of Health, Bethesda, Maryland
- Hannah Gilmore
  - Department of Pathology, University Hospitals Cleveland Medical Center, Cleveland, Ohio
- Donna Plecha
  - Department of Radiology, University Hospitals Cleveland Medical Center, Cleveland, Ohio
- Vinay Varadan
  - Case Comprehensive Cancer Center, Case Western Reserve University, Cleveland, Ohio
- Anant Madabhushi
  - Department of Biomedical Engineering, Case Western Reserve University, Cleveland, Ohio
  - Louis Stokes Cleveland Veterans Administration Medical Center, Cleveland, Ohio
160
He Y, Zhou J, Lin Y, Zhu T. A class imbalance-aware Relief algorithm for the classification of tumors using microarray gene expression data. Comput Biol Chem 2019; 80:121-127. [PMID: 30947070 DOI: 10.1016/j.compbiolchem.2019.03.017] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 03/08/2019] [Accepted: 03/23/2019] [Indexed: 11/25/2022]
Abstract
DNA microarray data has been widely used in cancer research due to the significant advantage helped to successfully distinguish between tumor classes. However, typical gene expression data usually presents a high-dimensional imbalanced characteristic, which poses severe challenge for traditional machine learning methods to construct a robust classifier performing well on both the minority and majority classes. As one of the most successful feature weighting techniques, Relief is considered to particularly suit to handle high-dimensional problems. Unfortunately, almost all relief-based methods have not taken the class imbalance distribution into account. This study identifies that existing Relief-based algorithms may underestimate the features with the discernibility ability of minority classes, and ignore the distribution characteristic of minority class samples. As a result, an additional bias towards being classified into the majority classes can be introduced. To this end, a new method, named imRelief, is proposed for efficiently handling high-dimensional imbalanced gene expression data. imRelief can correct the bias towards to the majority classes, and consider the scattered distributional characteristic of minority class samples in the process of estimating feature weights. This way, imRelief has the ability to reward the features which perform well at separating the minority classes from other classes. Experiments on four microarray gene expression data sets demonstrate the effectiveness of imRelief in both feature weighting and feature subset selection applications.
Affiliation(s)
- Yuanyu He
  - College of Information Science and Engineering, Hunan University, Changsha, China
- Junhai Zhou
  - College of Information Science and Engineering, Hunan University, Changsha, China
- Yaping Lin
  - College of Information Science and Engineering, Hunan University, Changsha, China
- Tuanfei Zhu
  - College of Information Science and Engineering, Hunan University, Changsha, China
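To make the feature-weighting idea in the abstract above concrete, here is a minimal sketch of the classic Relief update that imRelief builds on. It is not imRelief itself: it applies no correction for class imbalance, visits every sample once instead of sampling, and assumes features are already scaled to [0, 1] and that each class has at least two samples. The name `relief_weights` is hypothetical.

```python
def relief_weights(X, y):
    """Classic Relief feature weights for rows X with class labels y."""
    n, d = len(X), len(X[0])
    w = [0.0] * d

    def dist(a, b):
        # Manhattan distance between two samples.
        return sum(abs(p - q) for p, q in zip(a, b))

    for i in range(n):
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        hit = min(hits, key=lambda j: dist(X[i], X[j]))     # nearest same-class sample
        miss = min(misses, key=lambda j: dist(X[i], X[j]))  # nearest other-class sample
        for f in range(d):
            # Reward features that differ across classes and agree within a class.
            w[f] += (abs(X[i][f] - X[miss][f]) - abs(X[i][f] - X[hit][f])) / n
    return w
```

Because every sample contributes one hit and one miss regardless of class size, the majority class supplies most of the updates. That is one way to see the bias the abstract attributes to existing Relief-based methods.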
161
Li CN, Shang MQ, Shao YH, Xu Y, Liu LM, Wang Z. Sparse L1-norm two dimensional linear discriminant analysis via the generalized elastic net regularization. Neurocomputing 2019. [DOI: 10.1016/j.neucom.2019.01.049] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Indexed: 11/25/2022]
162
Wang Y, Yang S, Zhao J, Du W, Liang Y, Wang C, Zhou F, Tian Y, Ma Q. Using Machine Learning to Measure Relatedness Between Genes: A Multi-Features Model. Sci Rep 2019; 9:4192. [PMID: 30862804 PMCID: PMC6414665 DOI: 10.1038/s41598-019-40780-7] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 06/27/2018] [Accepted: 02/19/2019] [Indexed: 12/20/2022] [Open Access]
Abstract
Measuring conditional relatedness between a pair of genes is a fundamental technique and still a significant challenge in computational biology. Such relatedness can be assessed by gene expression similarities while suffering high false discovery rates. Meanwhile, other types of features, e.g., prior-knowledge based similarities, is only viable for measuring global relatedness. In this paper, we propose a novel machine learning model, named Multi-Features Relatedness (MFR), for accurately measuring conditional relatedness between a pair of genes by incorporating expression similarities with prior-knowledge based similarities in an assessment criterion. MFR is used to predict gene-gene interactions extracted from the COXPRESdb, KEGG, HPRD, and TRRUST databases by the 10-fold cross validation and test verification, and to identify gene-gene interactions collected from the GeneFriends and DIP databases for further verification. The results show that MFR achieves the highest area under curve (AUC) values for identifying gene-gene interactions in the development, test, and DIP datasets. Specifically, it obtains an improvement of 1.1% on average of precision for detecting gene pairs with both high expression similarities and high prior-knowledge based similarities in all datasets, comparing to other linear models and coexpression analysis methods. Regarding cancer gene networks construction and gene function prediction, MFR also obtains the results with more biological significances and higher average prediction accuracy, than other compared models and methods. A website of the MFR model and relevant datasets can be accessed from http://bmbl.sdstate.edu/MFR.
Affiliation(s)
- Yan Wang
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, 130012, China
- Sen Yang
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, 130012, China
- Jing Zhao
  - Population Health Group, Sanford Research, Sioux Falls, SD, 57104, USA; Department of Internal Medicine, Sanford School of Medicine, University of South Dakota, Sioux Falls, SD, 57105, USA
- Wei Du
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, 130012, China
- Yanchun Liang
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, 130012, China; Zhuhai Laboratory of Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, Department of Computer Science and Technology, Zhuhai College of Jilin University, Zhuhai, 519041, China
- Cankun Wang
  - Bioinformatics and Mathematical Biosciences Lab, Department of Agronomy, Horticulture, and Plant Science, Department of Mathematics and Statistics, South Dakota State University, Brookings, SD, 57006, USA
- Fengfeng Zhou
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, 130012, China
- Yuan Tian
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, 130012, China; School of Artificial Intelligence, Jilin University, Changchun, 130012, China
- Qin Ma
  - Bioinformatics and Mathematical Biosciences Lab, Department of Agronomy, Horticulture, and Plant Science, Department of Mathematics and Statistics, South Dakota State University, Brookings, SD, 57006, USA; Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, 43210, USA
163
Evaluation of the classification method using ancestry SNP markers for ethnic group. Communications for Statistical Applications and Methods 2019. [DOI: 10.29220/csam.2019.26.1.001] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 01/15/2023]
164
Yamada T, Himeno T. Estimation of multivariate 3rd moment for high-dimensional data and its application for testing multivariate normality. Comput Stat 2019. [DOI: 10.1007/s00180-018-00865-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/28/2022]
165
Liu C, Wong HS. Structured Penalized Logistic Regression for Gene Selection in Gene Expression Data Analysis. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2019; 16:312-321. [PMID: 29989970 DOI: 10.1109/tcbb.2017.2767589] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 06/08/2023]
Abstract
In gene expression data analysis, the problems of cancer classification and gene selection are closely related. Successfully selecting informative genes will significantly improve the classification performance. To identify informative genes from a large number of candidate genes, various methods have been proposed. However, the gene expression data may include some important correlation structures, and some of the genes can be divided into different groups based on their biological pathways. Many existing methods do not take into consideration the exact correlation structure within the data. Therefore, from both the knowledge discovery and biological perspectives, an ideal gene selection method should take this structural information into account. Moreover, the better generalization performance can be obtained by discovering correlation structure within data. In order to discover structure information among data and improve learning performance, we propose a structured penalized logistic regression model which simultaneously performs feature selection and model learning for gene expression data analysis. An efficient coordinate descent algorithm has been developed to optimize the model. The numerical simulation studies demonstrate that our method is able to select the highly correlated features. In addition, the results from real gene expression datasets show that the proposed method performs competitively with respect to previous approaches.
166
Gaynanova I, Wang T. Sparse quadratic classification rules via linear dimension reduction. J Multivariate Anal 2019; 169:278-299. [PMID: 31105355 PMCID: PMC6516858 DOI: 10.1016/j.jmva.2018.09.011] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Indexed: 12/16/2022]
Abstract
We consider the problem of high-dimensional classification between two groups with unequal covariance matrices. Rather than estimating the full quadratic discriminant rule, we propose to perform simultaneous variable selection and linear dimension reduction on the original data, with the subsequent application of quadratic discriminant analysis on the reduced space. In contrast to quadratic discriminant analysis, the proposed framework doesn't require the estimation of precision matrices; it scales linearly with the number of measurements, making it especially attractive for the use on high-dimensional datasets. We support the methodology with theoretical guarantees on variable selection consistency, and empirical comparisons with competing approaches. We apply the method to gene expression data of breast cancer patients, and confirm the crucial importance of the ESR1 gene in differentiating estrogen receptor status.
Affiliation(s)
- Irina Gaynanova
- Department of Statistics, Texas A&M University, 3143 TAMU, College Station, TX 77843, USA
| | - Tianying Wang
- Department of Statistics, Texas A&M University, 3143 TAMU, College Station, TX 77843, USA
|
167
|
Emura T, Matsui S, Chen HY. compound.Cox: Univariate feature selection and compound covariate for predicting survival. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2019; 168:21-37. [PMID: 30527130 DOI: 10.1016/j.cmpb.2018.10.020] [Citation(s) in RCA: 56] [Impact Index Per Article: 9.3] [Received: 03/28/2018] [Revised: 09/26/2018] [Accepted: 10/26/2018] [Indexed: 05/15/2023]
Abstract
BACKGROUND AND OBJECTIVE Univariate feature selection is one of the simplest and most commonly used techniques to develop a multigene predictor for survival. Presently, there is no software tailored to perform univariate feature selection and predictor construction. METHODS We develop the compound.Cox R package that implements univariate significance tests (via the Wald tests or score tests) for feature selection. We provide a cross-validation algorithm to measure predictive capability of selected genes and a permutation algorithm to assess the false discovery rate. We also provide three algorithms for constructing a multigene predictor (compound covariate, compound shrinkage, and copula-based methods), which are tailored to the subset of genes obtained from univariate feature selection. We demonstrate our package using survival data on the lung cancer patients. We examine the predictive capability of the developed algorithms by the lung cancer data and simulated data. RESULTS The developed R package, compound.Cox, is available on the CRAN repository. The statistical tools in compound.Cox allow researchers to determine an optimal significance level of the tests, thus providing researchers an optimal subset of genes for prediction. The package also allows researchers to compute the false discovery rate and various prediction algorithms.
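The compound covariate idea described in this abstract can be sketched in a few lines. This is an illustration only, not the compound.Cox API: it assumes the univariate Cox coefficients and their p-values have already been estimated elsewhere, and the function name, the toy inputs, and the 0.05 threshold are all hypothetical.

```python
def compound_covariate(expression, coefs, p_values, alpha=0.05):
    """Select genes by univariate p-value, then score each sample.

    expression: one row per sample, one column per gene (hypothetical data).
    coefs, p_values: assumed to come from per-gene univariate Cox fits.
    """
    # Univariate feature selection: keep genes whose test is significant.
    selected = [j for j, p in enumerate(p_values) if p < alpha]
    # Compound covariate: coefficient-weighted sum over the selected genes.
    scores = [sum(coefs[j] * row[j] for j in selected) for row in expression]
    return selected, scores


# Toy inputs, invented for illustration.
expression = [[1.0, 0.5, 2.0], [0.0, 1.5, 1.0]]
coefs = [0.8, -0.3, 0.1]
p_values = [0.01, 0.20, 0.03]
selected, scores = compound_covariate(expression, coefs, p_values)
```

A higher score would indicate a higher predicted hazard under these assumptions; the abstract notes that the package also offers shrinkage and copula-based alternatives, which this sketch does not cover.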
Affiliation(s)
- Takeshi Emura
- Graduate Institute of Statistics, National Central University, Zhongda Road, Zhongli District, Taoyuan 32001, Taiwan.
| | - Shigeyuki Matsui
- Department of Biostatistics, Nagoya University Graduate School of Medicine, 65 Tsurumai-cho, Showa-ku, Nagoya, 466-8550, Japan
| | - Hsuan-Yu Chen
- Institute of Statistical Science, Academia Sinica, 128 Academia Road Sec.2, Nankang Taipei 115, Taiwan
|
168
|
Abstract
Cluster analysis has been widely applied by researchers from several scientific fields over the last decades. Advances in knowledge of biological phenomena have revived interest in cluster analysis, due in part to the large amount of microarray data. Apart from requiring user-defined parameters, traditional clustering algorithms show clear limitations in handling microarray data owing to its inherent characteristics: it is high-dimensional with low sample size, highly redundant, and noisy. This has motivated the study of clustering algorithms tailored to microarray data, which continue to be developed and adapted. The present chapter reviews clustering methods based on different cluster analysis approaches in the challenging context of microarray data. Furthermore, the validation of clustering results is briefly discussed by means of validity indexes used to assess the quality of the number of clusters and the induced cluster assignments.
Affiliation(s)
| | - Juana-María Vivo
- Department of Statistics and Operations Research, University of Murcia, Murcia, Spain.
|
169
|
Feature Selection Applied to Microarray Data. Methods Mol Biol 2019; 1986:123-152. [PMID: 31115887 DOI: 10.1007/978-1-4939-9442-7_6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/04/2023]
Abstract
A typical characteristic of microarray data is that it has a very high number of features (in the order of thousands) while the number of examples is usually less than 100. In the context of microarray classification, this poses a challenge for machine learning methods, which can suffer overfitting and thus degradation in their performance. A common solution is to apply a dimensionality reduction technique before classification, to reduce the number of features. This chapter will be focused on one of the most famous dimensionality reduction techniques: feature selection. We will see how feature selection can help improve the classification accuracy in several microarray data scenarios.
|
170
|
Jeong SJ, Lee HJ, Lee SD, Lee SH, Park SJ, Kim JS, Lee JW. Classification of Common Relationships Based on Short Tandem Repeat Profiles Using Data Mining. Korean Journal of Legal Medicine 2019. [DOI: 10.7580/kjlm.2019.43.3.97] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/05/2022]
Affiliation(s)
- Su Jin Jeong
- Department of Statistics, Korea University, Seoul, Korea
| | | | - Soong Deok Lee
- Department of Forensic Medicine, Seoul National University College of Medicine, Seoul, Korea
| | - Seung Hwan Lee
- Forensic Science Division 2, Supreme Prosecutor's Office, Seoul, Korea
| | - Su Jeong Park
- Forensic Science Division 2, Supreme Prosecutor's Office, Seoul, Korea
| | - Jong Sik Kim
- Forensic Science Division 2, Supreme Prosecutor's Office, Seoul, Korea
| | - Jae Won Lee
- Department of Statistics, Korea University, Seoul, Korea
|
171
|
Yuan M, Yang Z, Ji G. Partial maximum correlation information: A new feature selection method for microarray data classification. Neurocomputing 2019. [DOI: 10.1016/j.neucom.2018.09.084] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Indexed: 10/28/2022]
|
172
|
Yahya AA. Swarm intelligence-based approach for educational data classification. JOURNAL OF KING SAUD UNIVERSITY - COMPUTER AND INFORMATION SCIENCES 2019. [DOI: 10.1016/j.jksuci.2017.08.002] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Indexed: 11/26/2022]
|
173
|
Ranjbar S, Velgos SN, Dueck AC, Geda YE, Mitchell JR. Brain MR Radiomics to Differentiate Cognitive Disorders. J Neuropsychiatry Clin Neurosci 2019; 31:210-219. [PMID: 30636564 PMCID: PMC6626704 DOI: 10.1176/appi.neuropsych.17120366] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Indexed: 02/06/2023]
Abstract
OBJECTIVE Subtle and gradual changes occur in the brain years before cognitive impairment due to age-related neurodegenerative disorders. The authors examined the utility of hippocampal texture analysis and volumetric features extracted from brain magnetic resonance (MR) data to differentiate between three cognitive groups (cognitively normal individuals, individuals with mild cognitive impairment, and individuals with Alzheimer's disease) and neuropsychological scores on the Clinical Dementia Rating (CDR) scale. METHODS Data from 173 unique patients with 3-T T1-weighted MR images from the Alzheimer's Disease Neuroimaging Initiative database were analyzed. A variety of texture and volumetric features were extracted from bilateral hippocampal regions and were used to perform binary classification of cognitive groups and CDR scores. The authors used diagonal quadratic discriminant analysis in a leave-one-out cross-validation scheme. Sensitivity, specificity, and area under the receiver operating characteristic curve were used to assess the performance of models. RESULTS The results show promise for hippocampal texture analysis to distinguish between no impairment and early stages of impairment. Volumetric features were more successful at differentiating between no impairment and advanced stages of impairment. CONCLUSIONS MR radiomics may be a promising tool to classify various cognitive groups.
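The classifier named in the methods, diagonal quadratic discriminant analysis evaluated by leave-one-out cross-validation, can be sketched as below. This is a minimal pure-Python illustration under stated assumptions: the function names and the toy two-feature data are hypothetical, and the study's actual features, preprocessing, and software are not reproduced.

```python
import math


def fit_dqda(X, y):
    """Estimate per-class feature means, diagonal variances, and priors."""
    model = {}
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        n = len(rows)  # needs at least two samples per class
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # Each class keeps its own variances (the "quadratic" part);
        # covariances are ignored (the "diagonal" part).
        variances = [
            max(sum((v - m) ** 2 for v in col) / (n - 1), 1e-8)
            for col, m in zip(columns, means)
        ]
        model[label] = (means, variances, n / len(y))
    return model


def predict_dqda(model, x):
    """Return the label with the largest Gaussian log-posterior score."""
    def score(params):
        means, variances, prior = params
        return math.log(prior) - 0.5 * sum(
            math.log(s2) + (v - m) ** 2 / s2
            for v, m, s2 in zip(x, means, variances)
        )
    return max(model, key=lambda label: score(model[label]))


def loocv_accuracy(X, y):
    """Hold out each sample once, refit on the rest, and count correct calls."""
    hits = 0
    for i in range(len(X)):
        model = fit_dqda(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        hits += predict_dqda(model, X[i]) == y[i]
    return hits / len(X)


# Toy data, invented for illustration: two well-separated groups.
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [3.0, 0.5], [3.2, 0.7], [2.8, 0.3]]
y = [0, 0, 0, 1, 1, 1]
accuracy = loocv_accuracy(X, y)
```

Refitting the model inside each fold keeps the held-out sample out of the mean and variance estimates, which is the point of the leave-one-out scheme.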
Affiliation(s)
| | - Stefanie N. Velgos
- Center for Clinical and Translational Science, Mayo Clinic Graduate School of Biomedical Sciences, Mayo Clinic Arizona
| | | | - Yonas E. Geda
- Department of Psychiatry and Psychology, Mayo Clinic Arizona
- Department of Neurology, Mayo Clinic Arizona
| | - J. Ross Mitchell
- Department of Physiology and Biomedical Engineering, Mayo Clinic Arizona
- Corresponding author: J. Ross Mitchell, Department of Physiology and Biomedical Engineering, Mayo Clinic Arizona, 5777 E. Mayo Boulevard, Phoenix, AZ 85054; phone: 480-301-5177
|
174
|
Tutino VM, Poppenberg KE, Li L, Shallwani H, Jiang K, Jarvis JN, Sun Y, Snyder KV, Levy EI, Siddiqui AH, Kolega J, Meng H. Biomarkers from circulating neutrophil transcriptomes have potential to detect unruptured intracranial aneurysms. J Transl Med 2018; 16:373. [PMID: 30593281 PMCID: PMC6310942 DOI: 10.1186/s12967-018-1749-3] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.0] [Received: 09/13/2018] [Accepted: 12/17/2018] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND Intracranial aneurysms (IAs) are dangerous because of their potential to rupture and cause deadly subarachnoid hemorrhages. Previously, we found significant RNA expression differences in circulating neutrophils between patients with unruptured IAs and aneurysm-free controls. Searching for circulating biomarkers for unruptured IAs, we tested the feasibility of developing classification algorithms that use neutrophil RNA expression levels from blood samples to predict the presence of an IA. METHODS Neutrophil RNA extracted from blood samples from 40 patients (20 with angiography-confirmed unruptured IA, 20 angiography-confirmed IA-free controls) was subjected to next-generation RNA sequencing to obtain neutrophil transcriptomes. In a randomly-selected training cohort of 30 of the 40 samples (15 with IA, 15 controls), we performed differential expression analysis. Significantly differentially expressed transcripts (false discovery rate < 0.05, fold change ≥ 1.5) were used to construct prediction models for IA using four well-known supervised machine-learning approaches (diagonal linear discriminant analysis, cosine nearest neighbors, nearest shrunken centroids, and support vector machines). These models were tested in a testing cohort of the remaining 10 neutrophil samples from the 40 patients (5 with IA, 5 controls), and model performance was assessed by receiver-operating-characteristic (ROC) curves. Real-time quantitative polymerase chain reaction (PCR) was used to corroborate expression differences of a subset of model transcripts in neutrophil samples from a new, separate validation cohort of 10 patients (5 with IA, 5 controls). RESULTS The training cohort yielded 26 highly significantly differentially expressed neutrophil transcripts. Models using these transcripts identified IA patients in the testing cohort with accuracy ranging from 0.60 to 0.90. The best performing model was the diagonal linear discriminant analysis classifier (area under the ROC curve = 0.80 and accuracy = 0.90). Six of seven differentially expressed genes we tested were confirmed by quantitative PCR using isolated neutrophils from the separate validation cohort. CONCLUSIONS Our findings demonstrate the potential of machine-learning methods to classify IA cases and create predictive models for unruptured IAs using circulating neutrophil transcriptome data. Future studies are needed to replicate these findings in larger cohorts.
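The transcript filter in the methods (false discovery rate < 0.05, fold change ≥ 1.5) can be sketched as follows. Several points are assumptions rather than details from the abstract: the Benjamini-Hochberg procedure is used as the FDR method, fold changes are taken as plain ratios with down-regulation treated symmetrically, and the function names and toy numbers are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_transcripts(p_values, fold_changes, fdr=0.05, min_fc=1.5):
    """Keep indices passing both the FDR and the fold-change threshold."""
    q_values = benjamini_hochberg(p_values)
    return [
        i for i, q in enumerate(q_values)
        # A ratio of 0.5 counts as a 2-fold change (assumed symmetric rule).
        if q < fdr and max(fold_changes[i], 1 / fold_changes[i]) >= min_fc
    ]


# Toy inputs, invented for illustration.
p_values = [0.001, 0.01, 0.02, 0.5]
fold_changes = [2.0, 1.2, 0.5, 3.0]
selected = select_transcripts(p_values, fold_changes)
```

In this toy case the second transcript is significant but changes too little, and the fourth changes a lot but is not significant, so only the first and third survive both filters.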
Affiliation(s)
- Vincent M. Tutino
- Canon Stroke and Vascular Research Center, University at Buffalo, Clinical and Translational Research Center, 875 Ellicott Street, Buffalo, NY 14214 USA
- Department of Biomedical Engineering, University at Buffalo, Buffalo, NY USA
| | - Kerry E. Poppenberg
- Canon Stroke and Vascular Research Center, University at Buffalo, Clinical and Translational Research Center, 875 Ellicott Street, Buffalo, NY 14214 USA
- Department of Biomedical Engineering, University at Buffalo, Buffalo, NY USA
| | - Lu Li
- Department of Computer Science and Engineering, University at Buffalo, Buffalo, NY USA
| | - Hussain Shallwani
- Department of Neurosurgery, Jacobs School of Medicine and Biomedical Sciences, University at Buffalo, Buffalo, NY USA
| | - Kaiyu Jiang
- Genetics, Genomics, and Bioinformatics Program, University at Buffalo, Buffalo, NY USA
| | - James N. Jarvis
- Genetics, Genomics, and Bioinformatics Program, University at Buffalo, Buffalo, NY USA
- Department of Pediatrics, Jacobs School of Medicine and Biomedical Sciences, University at Buffalo, Buffalo, NY USA
| | - Yijun Sun
- Genetics, Genomics, and Bioinformatics Program, University at Buffalo, Buffalo, NY USA
- Department of Microbiology and Immunology, University at Buffalo, Buffalo, NY USA
| | - Kenneth V. Snyder
- Canon Stroke and Vascular Research Center, University at Buffalo, Clinical and Translational Research Center, 875 Ellicott Street, Buffalo, NY 14214 USA
- Department of Neurosurgery, Jacobs School of Medicine and Biomedical Sciences, University at Buffalo, Buffalo, NY USA
- Department of Radiology, Jacobs School of Medicine and Biomedical Sciences, University at Buffalo, Buffalo, NY USA
- Department of Neurology, Jacobs School of Medicine and Biomedical Sciences, University at Buffalo, Buffalo, NY USA
| | - Elad I. Levy
- Canon Stroke and Vascular Research Center, University at Buffalo, Clinical and Translational Research Center, 875 Ellicott Street, Buffalo, NY 14214 USA
- Department of Neurosurgery, Jacobs School of Medicine and Biomedical Sciences, University at Buffalo, Buffalo, NY USA
- Department of Radiology, Jacobs School of Medicine and Biomedical Sciences, University at Buffalo, Buffalo, NY USA
| | - Adnan H. Siddiqui
- Canon Stroke and Vascular Research Center, University at Buffalo, Clinical and Translational Research Center, 875 Ellicott Street, Buffalo, NY 14214 USA
- Department of Neurosurgery, Jacobs School of Medicine and Biomedical Sciences, University at Buffalo, Buffalo, NY USA
- Department of Radiology, Jacobs School of Medicine and Biomedical Sciences, University at Buffalo, Buffalo, NY USA
| | - John Kolega
- Canon Stroke and Vascular Research Center, University at Buffalo, Clinical and Translational Research Center, 875 Ellicott Street, Buffalo, NY 14214 USA
- Department of Pathology and Anatomical Sciences, Jacobs School of Medicine and Biomedical Sciences, University at Buffalo, Buffalo, NY USA
| | - Hui Meng
- Canon Stroke and Vascular Research Center, University at Buffalo, Clinical and Translational Research Center, 875 Ellicott Street, Buffalo, NY 14214 USA
- Department of Biomedical Engineering, University at Buffalo, Buffalo, NY USA
- Department of Neurosurgery, Jacobs School of Medicine and Biomedical Sciences, University at Buffalo, Buffalo, NY USA
- Department of Mechanical & Aerospace Engineering, University at Buffalo, Buffalo, NY USA
|
175
|
Feltes BC, Grisci BI, Poloni JDF, Dorn M. Perspectives and applications of machine learning for evolutionary developmental biology. Mol Omics 2018; 14:289-306. [PMID: 30168572 DOI: 10.1039/c8mo00111a] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Indexed: 12/16/2022]
Abstract
Evolutionary Developmental Biology (Evo-Devo) is an ever-expanding field that aims to understand how development was modulated by the evolutionary process. In this sense, "omic" studies emerged as a powerful ally to unravel the molecular mechanisms underlying development. In this scenario, bioinformatics tools become necessary to analyze the growing amount of information. Among computational approaches, machine learning stands out as a promising field to generate knowledge and trace new research perspectives for bioinformatics. In this review, we aim to expose the current advances of machine learning applied to evolution and development. We draw clear perspectives and argue how evolution impacted machine learning techniques.
Affiliation(s)
- Bruno César Feltes
- Institute of Informatics, Federal University of Rio Grande do Sul, Porto Alegre, Brazil.
|
176
|
Slama P, Hoopmann MR, Moritz RL, Geman D. Robust determination of differential abundance in shotgun proteomics using nonparametric statistics. Mol Omics 2018; 14:424-436. [PMID: 30259924 PMCID: PMC6490964 DOI: 10.1039/c8mo00077h] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/21/2022]
Abstract
Label-free shotgun mass spectrometry enables the detection of significant changes in protein abundance between different conditions. Due to often limited cohort sizes or replication, large ratios of potential protein markers to number of samples, as well as multiple null measurements pose important technical challenges to conventional parametric models. From a statistical perspective, a scenario similar to that of unlabeled proteomics is encountered in genomics when looking for differentially expressed genes. Still, the difficulty of detecting a large fraction of the true positives without a high false discovery rate is arguably greater in proteomics due to even smaller sample sizes and peptide-to-peptide variability in detectability. These constraints argue for nonparametric (or distribution-free) tests on normalized peptide values, thus minimizing the number of free parameters, as well as for measuring significance with permutation testing. We propose such a procedure with a class-based statistic, no parametric assumptions, and no parameters to select other than a nominal false discovery rate. Our method was tested on a new dataset which is available via ProteomeXchange with identifier PXD006447. The dataset was prepared using a standard proteolytic digest of a human protein mixture at 1.5-fold to 3-fold protein concentration changes and diluted into a constant background of yeast proteins. We demonstrate its superiority relative to other approaches in terms of the realized sensitivity and realized false discovery rates determined by ground truth, and recommend it for detecting differentially abundant proteins from MS data.
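The general idea of a distribution-free test whose significance is measured by permutation can be sketched as below. This is not the class-based statistic proposed in the paper: a plain rank-sum statistic is swapped in for illustration, all relabelings are enumerated exactly (feasible only for small samples), and the function names and toy values are hypothetical.

```python
from itertools import combinations


def average_ranks(values):
    """Rank values from 1 (smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def exact_permutation_p(values, group_a):
    """Two-sided p-value: share of relabelings at least as extreme as observed."""
    ranks = average_ranks(values)
    n, k = len(values), len(group_a)
    expected = k * (n + 1) / 2  # mean rank sum of group A under the null
    observed = abs(sum(ranks[i] for i in group_a) - expected)
    relabelings = list(combinations(range(n), k))
    extreme = sum(
        abs(sum(ranks[i] for i in idx) - expected) >= observed
        for idx in relabelings
    )
    return extreme / len(relabelings)


# Toy normalized abundances, invented for illustration: samples 0-2 vs 3-5.
values = [5.1, 4.8, 5.3, 2.0, 2.2, 1.9]
p = exact_permutation_p(values, (0, 1, 2))
```

With three samples per condition there are only 20 relabelings, so the smallest attainable two-sided p-value is 2/20, which illustrates why small cohorts limit what any permutation test can detect.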
Affiliation(s)
- Patrick Slama
- Center for Imaging Science, Institute for Computational Medicine, Johns Hopkins University, USA.
- Independent Researcher, Paris, France
| | | | - Robert L. Moritz
- Institute for Systems Biology, 401 Terry Avenue N, Seattle, WA, USA 98109
| | - Donald Geman
- Center for Imaging Science, Institute for Computational Medicine, Johns Hopkins University, USA.
- Department of Applied Mathematics and Statistics, Johns Hopkins University, 3400 N. Charles St., Baltimore MD, 21218
|
177
|
Ding J, Gu C, Huang L, Tan R. Discrimination and Geographical Origin Prediction of Cynomorium songaricum Rupr. from Different Growing Areas in China by an Electronic Tongue. JOURNAL OF ANALYTICAL METHODS IN CHEMISTRY 2018; 2018:5894082. [PMID: 30595938 PMCID: PMC6282117 DOI: 10.1155/2018/5894082] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 09/11/2018] [Accepted: 10/31/2018] [Indexed: 05/29/2023]
Abstract
Cynomorium songaricum Rupr. is a well-known and widespread plant in China with high medicinal value. This study aimed to discriminate C. songaricum samples from the major growing areas in China and to predict their geographical origin. An electronic tongue was used to analyze C. songaricum based on flavor. Discrimination was achieved by principal component analysis and linear discriminant analysis. Moreover, a prediction model was established that classified C. songaricum by geographical origin with 100% accuracy. The identification method presented will therefore be helpful for further study of C. songaricum.
Affiliation(s)
- Jiaji Ding
- College of Medicine, Southwest Jiaotong University, Chengdu 610031, China
- Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing 100193, China
| | - Caimei Gu
- Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing 100193, China
| | - Linfang Huang
- Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing 100193, China
| | - Rui Tan
- College of Medicine, Southwest Jiaotong University, Chengdu 610031, China
|
178
|
Banizs AB, Silverman JF. The utility of combined mutation analysis and microRNA classification in reclassifying cancer risk of cytologically indeterminate thyroid nodules. Diagn Cytopathol 2018; 47:268-274. [DOI: 10.1002/dc.24087] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Received: 07/03/2018] [Revised: 08/15/2018] [Accepted: 09/10/2018] [Indexed: 01/11/2023]
Affiliation(s)
- Anna B. Banizs
- Department of Pathology; Allegheny General Hospital; Pittsburgh Pennsylvania
| | - Jan F. Silverman
- Department of Pathology; Allegheny General Hospital; Pittsburgh Pennsylvania
|
179
|
Shi Y, Li F, Liu T, Beyette FR, Song W. Dynamic Time-frequency Feature Extraction for Brain Activity Recognition. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2018; 2018:3104-3107. [PMID: 30441051 DOI: 10.1109/embc.2018.8512914] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Indexed: 11/06/2022]
Abstract
Classification accuracy for motor imagery biomedical signals is not always satisfactory, partly because not all the important features are effectively extracted. This paper proposes an improved dynamic feature extraction approach based on a time-frequency representation and an optimal sequence similarity measurement. Because wavelet packet decomposition (WPD) generates more detailed signal variation information and dynamic time warping (DTW) helps measure sequence similarity optimally, more important features are kept for classification. We apply the features extracted by our proposed method to electroencephalogram (EEG)-based motor imagery recorded through the OpenBCI device and obtain higher classification accuracy. Compared with traditional feature extraction methods, classification accuracy improves significantly, from 83.53% to 90.89%. Our work demonstrates the importance of advanced feature extraction in time series data analysis, e.g., for biomedical signals.
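The dynamic time warping step mentioned here can be illustrated with the textbook dynamic-programming recurrence. This is a generic sketch, not the authors' feature extraction pipeline: the function name and the absolute-difference local cost are assumptions, and no wavelet packet decomposition is included.

```python
def dtw_distance(a, b):
    """Classic DTW cost between two numeric sequences.

    Uses the absolute difference as the local cost and allows the three
    standard moves: match, stretch a, stretch b.
    """
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a[i-1] is matched again to an earlier b
                cost[i][j - 1],      # b[j-1] is matched again to an earlier a
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[len(a)][len(b)]


# The second sequence repeats one sample, so warping aligns them at no cost.
same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
# A constant offset cannot be warped away entirely.
shifted = dtw_distance([1, 2, 3], [2, 3, 4])
```

Because the alignment may stretch either sequence, two signals with the same shape but different timing receive a small distance, which is the property that makes DTW attractive for comparing time series of varying pace.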
|
180
|
Tal O, Tran TD. New perspectives on multilocus ancestry informativeness. Math Biosci 2018; 306:60-81. [PMID: 30385120 DOI: 10.1016/j.mbs.2018.10.011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/12/2018] [Revised: 10/24/2018] [Accepted: 10/25/2018] [Indexed: 10/28/2022]
Abstract
We present an axiomatic approach for multilocus informativeness measures for determining the amount of information that a set of polymorphic genetic markers provides about individual ancestry. We then reveal several surprising properties of a decision-theoretic based measure that is consistent with the set of proposed criteria for multilocus informativeness. In particular, these properties highlight the interplay between information originating from population priors and the information extractable from the population genetic variants. This analysis then reveals a certain deficiency of mutual information based multilocus informativeness measures when such population priors are incorporated. Finally, we analyse and quantify the inevitable inherent decrease in informativeness due to learning from finite population samples.
Affiliation(s)
- Omri Tal
- Max-Planck-Institute for Mathematics in the Sciences, Inselstrasse 22, Leipzig D-04103 Germany.
| | - Tat Dat Tran
- Max-Planck-Institute for Mathematics in the Sciences, Inselstrasse 22, Leipzig D-04103 Germany.
|
181
|
Shah K, Patel S, Mirza S, Rawal RM. A multi-gene expression profile panel for predicting liver metastasis: An algorithmic approach. PLoS One 2018; 13:e0206400. [PMID: 30383826 PMCID: PMC6211708 DOI: 10.1371/journal.pone.0206400] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 06/22/2018] [Accepted: 10/14/2018] [Indexed: 12/17/2022] Open
Abstract
Background & aim: Liver metastasis has been found to affect outcome in prostate, pancreatic and colorectal cancers, but its role in lung cancer is unclear. The 5-year survival rate remains very low owing to intrinsic resistance to conventional therapy, which can be attributed to the genetic modulators involved in the pathogenesis of the disease. This study therefore aims to generate a model for early diagnosis and timely treatment of liver metastasis in lung cancer patients. Methods: mRNA expression of 15 genes was quantified by real-time PCR on lung cancer specimens with (n = 32) and without (n = 30) liver metastasis and their normal counterparts. Principal component analysis (PCA), linear discriminant analysis (LDA) and hierarchical clustering were conducted to obtain a predictive model. The accuracy of the models was tested by receiver operating characteristic (ROC) curve analysis. Results: The expression profiles of all 15 genes were subjected to PCA and LDA, and 5 models were generated. ROC curve analysis was performed for all the models and the individual genes. Only 8 of the 15 genes showed significant sensitivity and specificity. Another model consisting of the selected eight genes was generated, showing a specificity and sensitivity of 90.0 and 96.87, respectively (p < 0.0001). Moreover, hierarchical clustering showed that tumors with a greater fold change lead to poor prognosis. Conclusion: Our study led to the generation of a concise, biologically relevant multi-gene panel that significantly and non-invasively predicts liver metastasis in lung cancer patients.
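The reported specificity and sensitivity can be related to the group sizes with a short calculation. The confusion-matrix counts below are inferred, not reported by the authors: they are simply the counts that, with 32 metastatic and 30 non-metastatic specimens, reproduce the published figures if those figures are percentages.

```python
def sensitivity(true_pos, false_neg):
    """Share of truly positive cases that the model flags as positive."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Share of truly negative cases that the model flags as negative."""
    return true_neg / (true_neg + false_pos)


# Assumed counts: 31 of 32 metastatic and 27 of 30 non-metastatic
# specimens classified correctly (an inference from the abstract).
sens_percent = 100 * sensitivity(true_pos=31, false_neg=1)
spec_percent = 100 * specificity(true_neg=27, false_pos=3)
```

Under these assumed counts the sensitivity is 96.875% and the specificity is 90%, which agrees with the reported 96.87 and 90.0 up to rounding.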
Affiliation(s)
- Kanisha Shah
- Division of Medicinal Chemistry & Pharmacogenomics, Department of Cancer Biology, The Gujarat Cancer & Research Institute, Ahmedabad, Gujarat, India
| | - Shanaya Patel
- Division of Medicinal Chemistry & Pharmacogenomics, Department of Cancer Biology, The Gujarat Cancer & Research Institute, Ahmedabad, Gujarat, India
| | - Sheefa Mirza
- Division of Medicinal Chemistry & Pharmacogenomics, Department of Cancer Biology, The Gujarat Cancer & Research Institute, Ahmedabad, Gujarat, India
| | - Rakesh M. Rawal
- Division of Medicinal Chemistry & Pharmacogenomics, Department of Cancer Biology, The Gujarat Cancer & Research Institute, Ahmedabad, Gujarat, India
|
182
|
Hu Z, Tong T, Genton MG. Diagonal likelihood ratio test for equality of mean vectors in high-dimensional data. Biometrics 2018; 75:256-267. [PMID: 30325005 DOI: 10.1111/biom.12984] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 10/18/2017] [Accepted: 09/14/2018] [Indexed: 11/27/2022]
Abstract
We propose a likelihood ratio test framework for testing normal mean vectors in high-dimensional data under two common scenarios: the one-sample test and the two-sample test with equal covariance matrices. We derive the test statistics under the assumption that the covariance matrices follow a diagonal matrix structure. In comparison with the diagonal Hotelling's tests, our proposed test statistics display some interesting characteristics. In particular, they are a summation of the log-transformed squared t-statistics rather than a direct summation of those components. More importantly, to derive the asymptotic normality of our test statistics under the null and local alternative hypotheses, we do not need the requirement that the covariance matrices follow a diagonal matrix structure. As a consequence, our proposed test methods are very flexible and readily applicable in practice. Simulation studies and a real data analysis are also carried out to demonstrate the advantages of our likelihood ratio test methods.
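The contrast between summing log-transformed squared t-statistics and summing them directly can be sketched for the one-sample case. The form used below, n·log(1 + t²/(n−1)) per coordinate, is the standard likelihood ratio statistic for a single normal mean with unknown variance, summed over coordinates as a diagonal covariance implies; the exact normalization and centering used in the paper may differ, and the function names and toy data are hypothetical.

```python
import math


def squared_t_statistics(X, mu0):
    """Per-coordinate squared one-sample t-statistics for H0: mean = mu0."""
    n = len(X)
    result = []
    for col, m0 in zip(zip(*X), mu0):
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / (n - 1)
        result.append(n * (mean - m0) ** 2 / var)
    return result


def diagonal_lrt_one_sample(X, mu0):
    """Sum of log-transformed squared t-statistics (assumed form)."""
    n = len(X)
    return sum(
        n * math.log(1 + t_sq / (n - 1))
        for t_sq in squared_t_statistics(X, mu0)
    )


def diagonal_hotelling_one_sample(X, mu0):
    """Direct sum of squared t-statistics, shown for comparison."""
    return sum(squared_t_statistics(X, mu0))


# Toy data, invented for illustration: 3 samples, 2 coordinates.
X = [[1.0, 0.0], [2.0, 1.0], [3.0, -1.0]]
lrt_stat = diagonal_lrt_one_sample(X, [0.0, 0.0])
hotelling_stat = diagonal_hotelling_one_sample(X, [0.0, 0.0])
```

The logarithm dampens the contribution of any single very large t-statistic, so one extreme coordinate influences the log-based sum less than it influences the direct sum.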
Affiliation(s)
- Zongliang Hu
- College of Mathematics and Statistics, Shenzhen University, Shenzhen, 518060, China
| | - Tiejun Tong
- Department of Mathematics, Hong Kong Baptist University, Hong Kong
| | - Marc G Genton
- Statistics Program, King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia
|
183
|
Cai J, Huang X. Modified Sparse Linear-Discriminant Analysis via Nonconvex Penalties. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2018; 29:4957-4966. [PMID: 29994754 DOI: 10.1109/tnnls.2017.2785324] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/08/2023]
Abstract
This paper considers the linear-discriminant analysis (LDA) problem in the undersampled situation, in which the number of features is very large and the number of observations is limited. Sparsity is often incorporated in the solution of LDA to make the results easier to interpret. However, most existing sparse LDA algorithms pursue sparsity by means of the $\ell_{1}$-norm. In this paper, we give an elaborate analysis of nonconvex penalties, including the $\ell_{0}$-based and the sorted $\ell_{1}$-based LDA methods. The latter can be regarded as a bridge between the $\ell_{0}$ and $\ell_{1}$ penalties. These nonconvex penalty-based LDA algorithms are evaluated on gene expression array and face databases, showing high classification accuracy on real-world problems.
|
184
|
Zarei S, Mohammadpour A, Rezakhah S. Finite population Bayesian bootstrapping in high-dimensional classification via logistic regression. INTELL DATA ANAL 2018. [DOI: 10.3233/ida-173536] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
|
185
|
Atrash S, Zhang Q, Papanikolaou X, Stein C, Abdallah AO, Barlogie B. Clinical Presentation and Gene Expression Profiling of Immunoglobulin M Multiple Myeloma Compared With Other Myeloma Subtypes and Waldenström Macroglobulinemia. J Glob Oncol 2018; 4:1-8. [PMID: 30241189 PMCID: PMC6180798 DOI: 10.1200/jgo.2016.008003] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/20/2022] Open
Abstract
PURPOSE Multiple myeloma (MM) is a clonal bone marrow disease characterized by the neoplastic transformation of differentiated postgerminal B cells. It is a heterogeneous disease both at the genetic level and in terms of clinical outcome. Immunoglobulin M (IgM) MM is a rare subtype of myeloma. Similar to Waldenström macroglobulinemia (WM), patients with MM experience IgM monoclonal gammopathy; however, both diseases are distinct in terms of treatment and clinical behavior. MATERIALS AND METHODS To shed light on the presentation of IgM MM, its prognosis, and its gene expression profiling, we identified and characterized 21 patients with IgM MM from our database. RESULTS One of these patients presented with a rare IgM monoclonal gammopathy of undetermined significance that progressed to smoldering myeloma. The median survival of the 21 patients was 4.9 years, which was comparable to a matched group of patients with non-IgM MM with similar myeloma prognostic factors (age, gender, albumin, creatinine, anemia, lactate dehydrogenase, β2-microglobulin, cytogenetics abnormalities), but much less than the median survival reported for patients with WM (9 years). We identified a cluster of genes that differ in their expression profile between MM and WM and found that the patients with IgM MM displayed a gene expression profile most similar to patients with non-IgM MM, confirming that IgM MM is a subtype of MM that should be differentiated from WM. CONCLUSION Because the prognosis of IgM MM and WM differ significantly, an accurate diagnosis is essential. Our gene expression model can assist with the differential diagnosis in controversial cases.
Affiliation(s)
- Shebli Atrash, Qing Zhang, Xenofon Papanikolaou, Caleb Stein, Al-Ola Abdallah, and Bart Barlogie, University of Arkansas for Medical Sciences, Little Rock, Arkansas; and Qing Zhang, Levine Cancer Institute, Charlotte, NC
|
186
|
Bazzoli C, Lambert-Lacroix S. Classification based on extensions of LS-PLS using logistic regression: application to clinical and multiple genomic data. BMC Bioinformatics 2018; 19:314. [PMID: 30189832 PMCID: PMC6127926 DOI: 10.1186/s12859-018-2311-2] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 01/09/2018] [Accepted: 08/13/2018] [Indexed: 01/02/2023] Open
Abstract
BACKGROUND To address high-dimensional genomic data, most of the proposed prediction methods make use of genomic data alone without considering clinical data, which are often available and known to have predictive value. Recent studies suggest that combining clinical and genomic information may improve predictions. We consider here methods for classification purposes that simultaneously use both types of variables but apply dimensionality reduction only to the high-dimensional genomic ones. RESULTS Using partial least squares (PLS), we propose some one-step approaches based on three extensions of the least squares (LS)-PLS method for logistic regression. A comparison of their prediction performances via a simulation and on real data sets from cancer studies is conducted. CONCLUSION In general, those methods using only clinical data or only genomic data perform poorly. The advantage of using LS-PLS methods for classification and their performances are shown and then used to analyze clinical and genomic data. The corresponding prediction results are encouraging and stable regardless of the data set and/or number of selected features. These extensions have been implemented in the R package lsplsGlm to enhance their use.
Affiliation(s)
- Caroline Bazzoli
- Laboratoire Jean Kuntzmann, Univ. Grenoble-Alpes, 700 avenue centrale, Saint Martin d’Hères, 38401 France
| | | |
|
187
|
Rauf Ahmad M, Pavlenko T. A U-classifier for high-dimensional data under non-normality. J MULTIVARIATE ANAL 2018. [DOI: 10.1016/j.jmva.2018.05.008] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/28/2022]
|
188
|
Rodríguez-Girondo M, Salo P, Burzykowski T, Perola M, Houwing-Duistermaat J, Mertens B. Sequential double cross-validation for assessment of added predictive ability in high-dimensional omic applications. Ann Appl Stat 2018. [DOI: 10.1214/17-aoas1125] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/19/2022]
|
189
|
Ashour AS, Hawas AR, Guo Y. Comparative study of multiclass classification methods on light microscopic images for hepatic schistosomiasis fibrosis diagnosis. Health Inf Sci Syst 2018; 6:7. [PMID: 30151186 DOI: 10.1007/s13755-018-0047-z] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 05/29/2018] [Accepted: 08/07/2018] [Indexed: 01/26/2023] Open
Abstract
Hepatic schistosomiasis is a prolonged disease resulting mainly from the host's granulomatous cell-mediated immune response to the soluble egg antigen of schistosomiasis infection. Irreversible fibrosis results from the progression of schistosomal hepatopathy. Sensitive diagnosis of this disease is based on the investigation of microscopy images, liver tissues, and egg identification. Early diagnosis of schistosomiasis at its initial infection stage is vital to avoid egg-induced irreparable pathological reactions. Several classification approaches can be used for liver fibrosis staging. However, it is unclear which approaches can achieve high accuracy when analyzing and classifying liver microscopic images. Consequently, this work studies the performance of different machine learning classifiers for accurate fibrosis staging of granulomas, namely cellular, fibrocellular, and fibrotic granulomas, as well as normal samples. The classifiers include a multi-layer perceptron neural network, a decision tree, discriminant analysis, a support vector machine (SVM), nearest neighbor, and an ensemble of classifiers. Statistical features of the microscopic images are extracted for each of these fibrosis levels and for the normal samples. The maximum classification accuracy of 90% was achieved by the subspace discriminant ensemble, the quadratic SVM, the linear SVM, and the linear discriminant classifier. However, the linear discriminant classifier can be considered the superior classifier, as it achieved the best area under the curve, 0.96, when classifying the cellular and fibrocellular granuloma fibrosis levels.
Affiliation(s)
- Amira S Ashour
- Department of Electronics and Electrical Communication Engineering, Faculty of Engineering, Tanta University, Tanta, Egypt
| | - Ahmed Refaat Hawas
- Department of Electronics and Electrical Communication Engineering, Faculty of Engineering, Tanta University, Tanta, Egypt
| | - Yanhui Guo
- Department of Computer Science, University of Illinois at Springfield, Springfield, IL, USA
| |
|
190
|
Alshamlan HM. DQB: A novel dynamic quantitive classification model using artificial bee colony algorithm with application on gene expression profiles. Saudi J Biol Sci 2018; 25:932-946. [PMID: 30108444 PMCID: PMC6087852 DOI: 10.1016/j.sjbs.2018.01.017] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 12/27/2017] [Revised: 01/30/2018] [Accepted: 01/31/2018] [Indexed: 02/01/2023] Open
Abstract
In the medical domain, rule-based classification models are valuable because they produce a comprehensible model that accounts for its predictions. It is desirable to know not only the classification decisions but also what leads to them. In this paper, we propose a novel dynamic quantitative rule-based classification model, DQB, which integrates quantitative association rule mining and the Artificial Bee Colony (ABC) algorithm to give users greater understandability and interpretability through an accurate class quantitative association rule-based classifier. As far as we know, this is the first attempt to apply the ABC algorithm to mining quantitative rule-based classifier models, and the first attempt to use quantitative rule-based classification models for classifying microarray gene expression profiles. We also developed a new dynamic local search strategy, DLS, which improves the local search of the ABC algorithm. The performance of the proposed model has been compared with well-known quantitative-based classification methods and bio-inspired meta-heuristic classification algorithms, using six gene expression profiles for binary and multi-class cancer datasets. From the results, it can be concluded that DQB obtains a considerable increase in classification accuracy compared with other available algorithms in the literature, and that it provides an interpretable model for biologists. This confirms the significance of the proposed algorithm in constructing a rule-based classifier model, and accordingly shows that these rules capture meaningful knowledge extracted from the training set: all subsets of quantitative rules report close to 100% classification accuracy with a minimum number of genes. Notably, to the best of our knowledge, several genes were identified that have not been reported in past studies. Regarding applicability, based on the results of the microarray gene expression analysis, we conclude that DQB can be adopted in different real-world applications with some modifications.
Affiliation(s)
- Hala M Alshamlan
- Information Technology Department, King Saud University, Riyadh, Saudi Arabia
- Department of Mechanical Engineering, Massachusetts Institute of Technology, Cambridge, MA, United States
| |
|
191
|
Berrar D, Lopes P, Dubitzky W. Incorporating domain knowledge in machine learning for soccer outcome prediction. Mach Learn 2018. [DOI: 10.1007/s10994-018-5747-8] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Indexed: 11/30/2022]
|
192
|
Zhi X, Yan H, Fan J, Zheng S. Efficient discriminative clustering via QR decomposition-based Linear Discriminant Analysis. Knowl Based Syst 2018. [DOI: 10.1016/j.knosys.2018.04.031] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Indexed: 10/17/2022]
|
193
|
Mbah C, De Neve J, Thas O. High-dimensional prediction of binary outcomes in the presence of between-study heterogeneity. Stat Methods Med Res 2018; 28:2848-2867. [PMID: 30051767 DOI: 10.1177/0962280218787544] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
Many prediction methods have been proposed in the literature, but most of them ignore heterogeneity between populations. Either only data from a single study or population is available for model building and evaluation, or, when data from multiple studies make up the training dataset, studies are pooled before model building. As a result, prediction models might perform worse than expected when applied to new subjects from new study populations. We propose a linear method for building prediction models with high-dimensional data from multiple studies. Our method explicitly addresses between-population variability and tends to select predictors that are predictive in most of the study populations. We employ empirical Bayes estimators and hence avoid selection bias during the variable selection process. Simulation results demonstrate that the new method works better than other linear prediction methods that ignore the between-study variability. Our method is developed for classification into two groups.
Affiliation(s)
- Chamberlain Mbah
- Department of Radiotherapy and Experimental Cancer Research, Ghent University, Ghent, Belgium
| | - Jan De Neve
- Department of Data Analysis, Faculty of Psychology and Educational Sciences, Ghent University, Ghent, Belgium
| | - Olivier Thas
- Center for Statistics, Hasselt University, Diepenbeek, Belgium
- National Institute for Applied Statistics Research Australia, University of Wollongong, New South Wales, Australia
| |
|
194
|
Lee UJ, Tzeng S, Chen YC, Chen JJ. Prognostic and predictive signatures for treatment decisions. Biomark Med 2018; 12:849-859. [PMID: 30022678 DOI: 10.2217/bmm-2017-0320] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/21/2022] Open
Abstract
AIM We develop a subgroup selection procedure using both prognostic and predictive biomarkers to identify four patient subpopulations: low- and high-risk responders, and low- and high-risk nonresponders. METHODS We utilize three regression models to identify three sets of biomarkers: S, prognostic biomarkers; T, predictive biomarkers; and U, prognostic and predictive biomarkers. The prognostic signature C(S) combines with a predictive signature, either C(T) or C(U), to develop two procedures, C(S,T) and C(S,U), for identification of the four subgroups. RESULTS Simulation experiments showed that the proposed models for identifying the biomarker sets S and U performed well, as did the procedure C(S,U) for subgroup identification. CONCLUSION The proposed model provides a more comprehensive characterization of patient subpopulations and better accuracy in patient treatment assignment.
Affiliation(s)
- Un Jung Lee
- Division of Biochemical Toxicology, National Center for Toxicological Research, US FDA, 3900 NCTR Road, Jefferson, AR 72079, USA
| | - ShengLi Tzeng
- Institute of Statistical Science, Academia Sinica, 128 Academia Road, Section 2, Nankang, Taipei 11529, Taiwan
| | - Yu-Chuan Chen
- Division of Bioinformatics & Biostatistics, National Center for Toxicological Research, US FDA, 3900 NCTR Road, Jefferson, AR 72029, USA
| | - James J Chen
- Department of Biostatistics, University of Arkansas for Medical Sciences, Little Rock, AR 72205, USA
| |
|
195
|
Yang A, Jiang X, Shu L, Liu P. Sparse Bayesian kernel multinomial probit regression model for high-dimensional data classification. COMMUN STAT-THEOR M 2018. [DOI: 10.1080/03610926.2018.1463385] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 10/28/2022]
Affiliation(s)
- Aijun Yang
- College of Economics and Management, Nanjing Forestry University, Nanjing, China
- Key Laboratory of Statistical Information Technology and Data Mining, State Statistics Bureau, Chengdu, China
| | - Xuejun Jiang
- Department of Mathematics, South University of Science and Technology of China, Shenzhen, China
| | - Lianjie Shu
- Faculty of Business Administration, University of Macau, Macau, China
| | - Pengfei Liu
- School of Mathematics and Statistics, Jiangsu Normal University, Xuzhou, China
| |
|
196
|
Hong S, Kwon H, Choi SH, Park KS. Intelligent system for drowsiness recognition based on ear canal electroencephalography with photoplethysmography and electrocardiography. Inf Sci (N Y) 2018. [DOI: 10.1016/j.ins.2018.04.003] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Indexed: 11/29/2022]
|
197
|
Abstract
In this paper, we consider high-dimensional quadratic classifiers in non-sparse settings. The quadratic classifiers proposed in this paper draw information about heterogeneity effectively through both the differences of growing mean vectors and covariance matrices. We show that they hold a consistency property in which misclassification rates tend to zero as the dimension goes to infinity under non-sparse settings. We also propose a quadratic classifier after feature selection by using both the differences of mean vectors and covariance matrices. We discuss the performance of the classifiers in numerical simulations and actual data analyses. Finally, we give concluding remarks about the choice of the classifiers for high-dimensional, non-sparse data.
|
198
|
Li L, Yao W. Fully Bayesian logistic regression with hyper-LASSO priors for high-dimensional feature selection. J STAT COMPUT SIM 2018. [DOI: 10.1080/00949655.2018.1490418] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 10/28/2022]
Affiliation(s)
- Longhai Li
- Department of Mathematics and Statistics, University of Saskatchewan, Saskatoon, SK, Canada
| | - Weixin Yao
- Department of Statistics, University of California at Riverside, Riverside, CA, USA
| |
|
199
|
RNA sequencing data from neutrophils of patients with cystic fibrosis reveals potential for developing biomarkers for pulmonary exacerbations. J Cyst Fibros 2018; 18:194-202. [PMID: 29941318 DOI: 10.1016/j.jcf.2018.05.014] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Received: 07/21/2017] [Revised: 05/01/2018] [Accepted: 05/22/2018] [Indexed: 01/16/2023]
Abstract
BACKGROUND There is no effective way to predict cystic fibrosis (CF) pulmonary exacerbations (CFPE) before they become symptomatic or to assess satisfactory treatment responses. METHODS RNA sequencing of peripheral blood neutrophils from CF patients before and after therapy for CFPE was used to create transcriptome profiles. Transcripts with an average transcripts per million (TPM) level > 1.0 and a false discovery rate (FDR) < 0.05 were used in a cosine K-nearest neighbor (KNN) model. Real-time PCR was used to corroborate RNA sequencing expression differences in both neutrophils and whole blood samples from an independent cohort of CF patients. Furthermore, sandwich ELISA was conducted to assess plasma levels of MRP8/14 complexes in CF patients before and after therapy. RESULTS We found differential expression of 136 transcripts and 83 isoforms when we compared neutrophils from CF patients before and after therapy (>1.5 fold change, FDR-adjusted P < 0.05). The model separated CF flare samples from those taken from the same patients in convalescence with an accuracy of 0.75 in both the training and testing cohorts. Six differentially expressed genes were confirmed by real-time PCR using both isolated neutrophils and whole blood from an independent cohort of CF patients before and after therapy, even though levels of myeloid-related protein MRP8/14 dimers in the plasma of CF patients were essentially unchanged by therapy. CONCLUSIONS Our findings demonstrate the potential of machine learning approaches for classifying disease states and thus developing sensitive biomarkers that can be used to monitor pulmonary disease activity in CF.
|
200
|
Butler A, Hoffman P, Smibert P, Papalexi E, Satija R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat Biotechnol 2018; 36:411-420. [PMID: 29608179 PMCID: PMC6700744 DOI: 10.1038/nbt.4096] [Citation(s) in RCA: 7485] [Impact Index Per Article: 1069.3] [Received: 07/04/2017] [Accepted: 02/09/2018] [Indexed: 02/06/2023]
Abstract
Computational single-cell RNA-seq (scRNA-seq) methods have been successfully applied to experiments representing a single condition, technology, or species to discover and define cellular phenotypes. However, identifying subpopulations of cells that are present across multiple data sets remains challenging. Here, we introduce an analytical strategy for integrating scRNA-seq data sets based on common sources of variation, enabling the identification of shared populations across data sets and downstream comparative analysis. We apply this approach, implemented in our R toolkit Seurat (http://satijalab.org/seurat/), to align scRNA-seq data sets of peripheral blood mononuclear cells under resting and stimulated conditions, hematopoietic progenitors sequenced using two profiling technologies, and pancreatic cell 'atlases' generated from human and mouse islets. In each case, we learn distinct or transitional cell states jointly across data sets, while boosting statistical power through integrated analysis. Our approach facilitates general comparisons of scRNA-seq data sets, potentially deepening our understanding of how distinct cell states respond to perturbation, disease, and evolution.
Affiliation(s)
- Andrew Butler
- New York Genome Center, New York, NY 10013, USA
- Center for Genomics and Systems Biology, New York University, New York, NY 10003-6688, USA
| | | | | | - Efthymia Papalexi
- New York Genome Center, New York, NY 10013, USA
- Center for Genomics and Systems Biology, New York University, New York, NY 10003-6688, USA
| | - Rahul Satija
- New York Genome Center, New York, NY 10013, USA
- Center for Genomics and Systems Biology, New York University, New York, NY 10003-6688, USA
| |
|