1
Ho SY, Wong L, Goh WWB. Avoid Oversimplifications in Machine Learning: Going beyond the Class-Prediction Accuracy. PATTERNS 2020; 1:100025. [PMID: 33205097 PMCID: PMC7660406 DOI: 10.1016/j.patter.2020.100025] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 12/14/2022]
Abstract
Class-prediction accuracy provides a quick but superficial way of determining classifier performance. It does not inform on the reproducibility of the findings or whether the selected or constructed features used are meaningful and specific. Furthermore, the class-prediction accuracy oversummarizes and does not inform on how training and learning have been accomplished: two classifiers providing the same performance in one validation can disagree on many future validations. It does not provide explainability in its decision-making process and is not objective, as its value is also affected by class proportions in the validation set. Despite these issues, this does not mean we should omit the class-prediction accuracy. Instead, it needs to be enriched with accompanying evidence and tests that supplement and contextualize the reported accuracy. This additional evidence serves as augmentations and can help us perform machine learning better while avoiding naive reliance on oversimplified metrics. There is a huge potential for machine learning, but blind reliance on oversimplified metrics can mislead. Class-prediction accuracy is a common metric used for determining classifier performance. This article provides examples to show how the class-prediction accuracy is superficial and even misleading. We propose some augmentative measures to supplement the class-prediction accuracy. This in turn helps us to better understand the quality of learning of the classifier.
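The abstract's point that accuracy depends on class proportions can be illustrated with a little arithmetic. The sketch below is not taken from the article: the function names and the sensitivity/specificity values are hypothetical, chosen only to show that one classifier with fixed per-class behaviour reports different accuracies on validation sets with different class mixes.

```python
def accuracy(sensitivity, specificity, positive_fraction):
    """Expected accuracy of a classifier with fixed per-class rates.

    positive_fraction is the share of positive samples in the validation set.
    """
    return sensitivity * positive_fraction + specificity * (1.0 - positive_fraction)


def balanced_accuracy(sensitivity, specificity):
    """Mean of the per-class rates; it does not depend on class proportions."""
    return (sensitivity + specificity) / 2.0


if __name__ == "__main__":
    # Hypothetical classifier: detects 60% of positives, 95% of negatives.
    sens, spec = 0.60, 0.95
    for frac in (0.1, 0.5, 0.9):
        print(f"positives={frac:.0%}  accuracy={accuracy(sens, spec, frac):.3f}")
    print(f"balanced accuracy={balanced_accuracy(sens, spec):.3f}")
```

Reporting a proportion-independent summary such as balanced accuracy alongside the raw accuracy is one possible supplement; the article itself may recommend other measures.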
Affiliation(s)
- Sung Yang Ho
  - School of Biological Sciences, Nanyang Technological University, Singapore 637551, Singapore
- Limsoon Wong
  - Department of Computer Science, National University of Singapore, Singapore 117417, Singapore
- Wilson Wen Bin Goh
  - School of Biological Sciences, Nanyang Technological University, Singapore 637551, Singapore
2
Maleki F, Ovens KL, Hogan DJ, Rezaei E, Rosenberg AM, Kusalik AJ. Measuring consistency among gene set analysis methods: A systematic study. J Bioinform Comput Biol 2019; 17:1940010. [DOI: 10.1142/s0219720019400109] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Indexed: 12/23/2022]
Abstract
Gene set analysis is a quantitative approach for generating biological insight from gene expression datasets. The abundance of gene set analysis methods speaks to their popularity, but raises the question of the extent to which results are affected by the choice of method. Our systematic analysis of 13 popular methods using 6 different datasets, from both DNA microarray and RNA-Seq origin, shows that this choice matters a great deal. We observed that the overall number of gene sets reported by each method differed by up to 2 orders of magnitude, and there was a bias toward reporting large gene sets with some methods. Furthermore, there was substantial disagreement between the 20 most statistically significant gene sets reported by the methods. This was also observed when expanding to the 100 most statistically significant reported gene sets. For different datasets of the same phenotype/condition, the top 20 and top 100 most significant results also showed little to no agreement even when using the same method. GAGE, PAGE, and ORA were the only methods able to achieve relatively high reproducibility when comparing the 20 and 100 most statistically significant gene sets. Biological validation on a juvenile idiopathic arthritis (JIA) dataset showed wide variation in terms of the relevance of the top 20 and top 100 most significant gene sets to known biology of the disease, where GAGE predicted the most relevant gene sets, followed by GSEA, ORA, and PAGE.
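Agreement between the top-ranked gene sets of two methods can be quantified in several ways. The abstract does not say which measure the study used, so the sketch below assumes a plain Jaccard index over the top-k entries of each ranking; the function names and the toy rankings are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def top_k_agreement(ranking_a, ranking_b, k):
    """Overlap between the k most significant gene sets of two rankings.

    Each ranking is a list of gene set names, most significant first.
    """
    return jaccard(set(ranking_a[:k]), set(ranking_b[:k]))


if __name__ == "__main__":
    # Toy rankings from two hypothetical methods.
    method_a = ["s1", "s2", "s3", "s4"]
    method_b = ["s3", "s1", "s5", "s6"]
    for k in (2, 4):
        print(k, round(top_k_agreement(method_a, method_b, k), 3))
```

A value near 0 for k = 20 or k = 100 would correspond to the "little to no agreement" described in the abstract.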
Affiliation(s)
- Farhad Maleki
  - Department of Computer Science, University of Saskatchewan, 110 Science Place, Saskatoon SK S7N 5C9, Canada
- Katie L. Ovens
  - Department of Computer Science, University of Saskatchewan, 110 Science Place, Saskatoon SK S7N 5C9, Canada
- Daniel J. Hogan
  - Department of Computer Science, University of Saskatchewan, 110 Science Place, Saskatoon SK S7N 5C9, Canada
- Elham Rezaei
  - Department of Pediatrics, Royal University Hospital, Saskatoon SK S7N 0W8, Canada
- Alan M. Rosenberg
  - Department of Pediatrics, Royal University Hospital, Saskatoon SK S7N 0W8, Canada
- Anthony J. Kusalik
  - Department of Computer Science, University of Saskatchewan, 110 Science Place, Saskatoon SK S7N 5C9, Canada
3
Saberi Ansar E, Eslahchii C, Rahimi M, Geranpayeh L, Ebrahimi M, Aghdam R, Kerdivel G. Significant random signatures reveals new biomarker for breast cancer. BMC Med Genomics 2019; 12:160. [PMID: 31703592 PMCID: PMC6842262 DOI: 10.1186/s12920-019-0609-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 05/09/2019] [Accepted: 10/24/2019] [Indexed: 02/07/2023]
Abstract
BACKGROUND In 2012, Venet et al. proposed that, at least in the case of breast cancer, most published signatures are not significantly more associated with outcome than randomly generated signatures. They suggested that the nominal p-value is not a good estimator of the significance of a signature. Therefore, one can reasonably postulate that some information might be present in such significant random signatures.
METHODS In this research, we first show that, using an empirical p-value, these published signatures are more significant than their nominal p-values suggest. In other words, the proposed empirical p-value can be considered a complementary criterion to the nominal p-value for distinguishing random signatures from significant ones. Secondly, we develop a novel computational method to extract the information embedded within significant random signatures. In our method, a score is assigned to each gene based on the number of times it appears in significant random signatures. These scores are then diffused through a protein-protein interaction network, and a permutation procedure is used to determine the genes with significant scores. The genes with significant scores are considered the set of significant genes.
RESULTS First, we applied our method to the breast cancer dataset NKI to obtain a set of significant genes in breast cancer based on significant random signatures. Secondly, the prognostic performance of the computed set of significant genes was evaluated using DMFS and RFS datasets. We observed that the top-ranked genes from this set can successfully separate patients with poor prognosis from those with good prognosis. Finally, we investigated the expression pattern of TAT, the first gene reported in our set, in malignant breast cancer vs. adjacent normal tissue and mammospheres.
CONCLUSION Applying the method, we found a set of significant genes in breast cancer, including TAT, a gene that has never been reported as an important gene in breast cancer. Our results show that the expression of TAT is repressed in tumors, suggesting that this gene could act as a tumor suppressor in breast cancer and could be used as a new biomarker.
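Two steps of the method described above lend themselves to a short sketch: comparing a signature against random signatures, and counting how often each gene appears in significant random signatures. The abstract does not give the exact formula for the empirical p-value, so the add-one estimate below is an assumption (a common permutation-test convention), and all function names and toy inputs are hypothetical. The network diffusion and the final permutation step are not reproduced here.

```python
from collections import Counter


def empirical_p_value(observed_score, random_scores):
    """Share of random signatures scoring at least as well as the observed one.

    Uses the add-one correction so the estimate is never exactly zero.
    Assumes higher scores mean stronger association with outcome.
    """
    exceed = sum(1 for s in random_scores if s >= observed_score)
    return (exceed + 1) / (len(random_scores) + 1)


def gene_scores(significant_random_signatures):
    """Count, for each gene, the significant random signatures that contain it."""
    counts = Counter()
    for signature in significant_random_signatures:
        counts.update(set(signature))  # each signature counts a gene at most once
    return counts


if __name__ == "__main__":
    print(empirical_p_value(3.0, [1.0, 2.0, 3.5, 0.5]))
    print(gene_scores([["A", "B"], ["B", "C"], ["B"]]).most_common())
```

In practice the list of random scores would come from many randomly drawn gene sets of the same size as the published signature.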
Affiliation(s)
- Elnaz Saberi Ansar
  - Curie Institute, INSERM U830, Translational Research Department, PSL Research University, Paris, 75005 France
  - School of Biological Sciences, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
- Changiz Eslahchii
  - Department of Computer Sciences, Faculty of Mathematical Sciences, Shahid-Beheshti University, GC, Tehran, Iran
  - School of Biological Sciences, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
- Mahsa Rahimi
  - Department of Stem Cells and Developmental Biology, Cell Science Research Center, Royan Institute for Stem Cell Biology and Technology, ACECR, Tehran, Iran
- Lobat Geranpayeh
  - Department of Surgery, Sina Hospital, Tehran University of Medical Sciences, Tehran, Iran
- Marzieh Ebrahimi
  - Department of Stem Cells and Developmental Biology, Cell Science Research Center, Royan Institute for Stem Cell Biology and Technology, ACECR, Tehran, Iran
- Rosa Aghdam
  - Department of Computer Sciences, Faculty of Mathematical Sciences, Shahid-Beheshti University, GC, Tehran, Iran
  - School of Biological Sciences, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
- Gwenneg Kerdivel
  - Institut Cochin, Department Development, Reproduction, Inserm U1016, CNRS, UMR 8104, Université Paris Descartes UMR-S1016, Paris, 75014 France
4
Proteomic investigation of intra-tumor heterogeneity using network-based contextualization - A case study on prostate cancer. J Proteomics 2019; 206:103446. [PMID: 31323421 DOI: 10.1016/j.jprot.2019.103446] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 04/17/2019] [Revised: 06/12/2019] [Accepted: 07/08/2019] [Indexed: 12/26/2022]
Abstract
Cancer is a heterogeneous disease, confounding the identification of relevant markers and drug targets. Network-based analysis is robust against noise, potentially offering a promising approach towards biomarker identification. We describe here the application of two network-based methods, qPSP (Quantitative Proteomics Signature Profiling) and PFSNet (Paired Fuzzy SubNetworks), in an intra-tissue proteome data set of prostate tissue samples. Despite high basal variation, we find that traditional statistical analysis may exaggerate the extent of heterogeneity. We also report that network-based analysis outperforms protein-based feature selection with concomitantly higher cross-validation accuracy. Overall, network-based analysis provides emergent signal that boosts sensitivity while retaining good precision. It is a potential means of circumventing heterogeneity for stable biomarker discovery.
5
Khaire UM, Dhanalakshmi R. Stability of feature selection algorithm: A review. JOURNAL OF KING SAUD UNIVERSITY-COMPUTER AND INFORMATION SCIENCES 2019. [DOI: 10.1016/j.jksuci.2019.06.012] [Citation(s) in RCA: 46] [Impact Index Per Article: 7.7] [Indexed: 10/26/2022]
6
Zhao Y, Sue ACH, Goh WWB. Deeper investigation into the utility of functional class scoring in missing protein prediction from proteomics data. J Bioinform Comput Biol 2019; 17:1950013. [DOI: 10.1142/s0219720019500136] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 11/18/2022]
Abstract
Functional Class Scoring (FCS) is a network-based approach previously demonstrated to be powerful in missing protein prediction (MPP). We update its performance evaluation using data derived from new proteomics technology (SWATH) and also check reproducibility using two independent datasets profiling the kidney tissue proteome. We also evaluate the objectivity of the FCS p-value and follow up on the value of MPP from predicted complexes. Our results suggest that (1) FCS p-values are non-objective and are strongly confounded by complex size; (2) the best recovery performance does not necessarily lie at standard p-value cutoffs; (3) while predicted complexes may be used for augmenting MPP, they are inferior to real complexes and are further confounded by issues relating to network coverage and quality; and (4) moderate-sized complexes of size 5 to 10 still exhibit considerable instability, and FCS works best with big complexes. While FCS is a powerful approach, blind reliance on its non-objective p-value is ill-advised.
Affiliation(s)
- Yaxing Zhao
  - School of Pharmaceutical Science and Technology, Tianjin University, No. 92, Weijin Road, 300072 Tianjin, P. R. China
- Andrew Chi-Hau Sue
  - School of Pharmaceutical Science and Technology, Tianjin University, No. 92, Weijin Road, 300072 Tianjin, P. R. China
- Wilson Wen Bin Goh
  - School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, 637551, Singapore
7
Tian S, Wang C, Wang B. Incorporating Pathway Information into Feature Selection towards Better Performed Gene Signatures. BIOMED RESEARCH INTERNATIONAL 2019; 2019:2497509. [PMID: 31073522 PMCID: PMC6470448 DOI: 10.1155/2019/2497509] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 07/23/2018] [Accepted: 03/07/2019] [Indexed: 12/29/2022]
Abstract
To analyze gene expression data with sophisticated grouping structures and to extract hidden patterns from such data, feature selection is of critical importance. It is well known that genes do not function in isolation but rather work together within various metabolic, regulatory, and signaling pathways. If the biological knowledge contained within these pathways is taken into account, the resulting method is a pathway-based algorithm. Studies have demonstrated that a pathway-based method usually outperforms its gene-based counterpart, in which no biological knowledge is considered. In this article, pathway-based feature selection is first divided into three major categories, namely, pathway-level selection, bilevel selection, and pathway-guided gene selection. With bilevel selection methods regarded as a special case of the pathway-guided gene selection process, we discuss pathway-guided gene selection methods in detail, along with the importance of penalization in such methods. Last, we point out the potential use of pathway-guided gene selection in one active research avenue, namely, the analysis of longitudinal gene expression data. We believe this article provides valuable insights for computational biologists and biostatisticians so that they can make biology more computable.
Affiliation(s)
- Suyan Tian
  - Division of Clinical Research, The First Hospital of Jilin University, 71 Xinmin Street, Changchun, Jilin 130021, China
- Chi Wang
  - Department of Biostatistics, Markey Cancer Center, The University of Kentucky, 800 Rose St., Lexington, KY 40536, USA
- Bing Wang
  - School of Life Science, Jilin University, 2699 Qianjin Street, Changchun, Jilin 130012, China
8
Lualdi M, Fasano M. Statistical analysis of proteomics data: A review on feature selection. J Proteomics 2019; 198:18-26. [DOI: 10.1016/j.jprot.2018.12.004] [Citation(s) in RCA: 40] [Impact Index Per Article: 6.7] [Received: 10/12/2018] [Revised: 11/27/2018] [Accepted: 12/05/2018] [Indexed: 12/19/2022]
9
Tian S, Wang C, Chang HH. A longitudinal feature selection method identifies relevant genes to distinguish complicated injury and uncomplicated injury over time. BMC Med Inform Decis Mak 2018; 18:115. [PMID: 30526581 PMCID: PMC6284265 DOI: 10.1186/s12911-018-0685-8] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Indexed: 01/30/2023]
Abstract
BACKGROUND Feature selection and gene set analysis are of increasing interest in the field of bioinformatics. While these two approaches have been developed for different purposes, we describe how some gene set analysis methods can be utilized to conduct feature selection. METHODS We adopted a gene set analysis method, the significance analysis of microarray gene set reduction (SAMGSR) algorithm, to carry out feature selection for longitudinal gene expression data. RESULTS Using a real-world application and simulated data, it is demonstrated that the proposed SAMGSR extension outperforms other relevant methods. In this study, we illustrate that a gene's expression profiles over time can be regarded as a gene set and then a suitable gene set analysis method can be utilized directly to select relevant genes associated with the phenotype of interest over time. CONCLUSIONS We believe this work will motivate more research to bridge feature selection and gene set analysis, with the development of novel algorithms capable of carrying out feature selection for longitudinal gene expression data.
Affiliation(s)
- Suyan Tian
  - Division of Clinical Research, The First Hospital of Jilin University, 71 Xinmin Street, Changchun, 130021, Jilin, China
- Chi Wang
  - Department of Biostatistics, Markey Cancer Center, The University of Kentucky, 800 Rose St, Lexington, KY, 40536, USA
- Howard H Chang
  - Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, 1518 Clifton Road NE, Atlanta, GA, 30322, USA
10
Belorkar A, Vadigepalli R, Wong L. SPSNet: subpopulation-sensitive network-based analysis of heterogeneous gene expression data. BMC SYSTEMS BIOLOGY 2018; 12:28. [PMID: 29560831 PMCID: PMC5861489 DOI: 10.1186/s12918-018-0538-1] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 12/15/2022]
Abstract
BACKGROUND Transcriptomic datasets often contain undeclared heterogeneity arising from biological variation, such as diversity of disease subtypes, treatment subgroups, time-series gene expression, and nested experimental conditions, as well as technical variation due to batch effects, platform differences in integrated meta-analyses, etc. However, current analysis approaches are primarily designed to handle comparisons between experimental conditions represented by homogeneous samples, thus precluding the discovery of underlying subphenotypes. Unsupervised methods for subtype identification are typically based on individual gene-level analysis, which often results in irreproducible gene signatures for potential subtypes. Emerging methods to study heterogeneity have been largely developed in the context of single-cell datasets containing hundreds to thousands of samples, limiting their use to select contexts.
RESULTS We present a novel analysis method, SPSNet, which identifies subtype-specific gene expression signatures based on the activity of subnetworks in biological pathways. SPSNet identifies the gene subnetworks capturing the diversity of underlying biological mechanisms, indicating potential sample subphenotypes. In the presence of extrinsic or non-biological heterogeneity (e.g. batch effects), SPSNet identifies subnetworks that are particularly affected by such variation, thus helping eliminate factors irrelevant to the biology of the phenotypes under study.
CONCLUSION Using multiple publicly available datasets, we illustrate that SPSNet is able to consistently uncover patterns within gene expression data that correspond to meaningful heterogeneity of various origins. We also demonstrate the performance of SPSNet as a sensitive and reliable tool for understanding the structure and nature of such heterogeneity.
Affiliation(s)
- Abha Belorkar
  - School of Computing, National University of Singapore, 13 Computing Drive, Singapore, 117417, Singapore
  - Daniel Baugh Institute for Functional Genomics and Computational Biology, Department of Pathology, Anatomy, and Cell Biology, Thomas Jefferson University, 1020 Locust Street, Philadelphia, 19107, Pennsylvania, USA
- Rajanikanth Vadigepalli
  - Daniel Baugh Institute for Functional Genomics and Computational Biology, Department of Pathology, Anatomy, and Cell Biology, Thomas Jefferson University, 1020 Locust Street, Philadelphia, 19107, Pennsylvania, USA
- Limsoon Wong
  - School of Computing, National University of Singapore, 13 Computing Drive, Singapore, 117417, Singapore
11
Teh DBL, Prasad A, Jiang W, Ariffin MZ, Khanna S, Belorkar A, Wong L, Liu X, All AH. Transcriptome Analysis Reveals Neuroprotective aspects of Human Reactive Astrocytes induced by Interleukin 1β. Sci Rep 2017; 7:13988. [PMID: 29070875 PMCID: PMC5656635 DOI: 10.1038/s41598-017-13174-w] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.0] [Received: 02/06/2017] [Accepted: 09/21/2017] [Indexed: 12/13/2022]
Abstract
Reactive astrogliosis is a critical process in neuropathological conditions and neurotrauma. Although it has been suggested that it confers neuroprotective effects, the exact genomic mechanism has not been explored. The prevailing dogma that astrogliosis inhibits axonal regeneration has been challenged by recent findings in rodent models of spinal cord injury, which demonstrate its neuroprotective and axonal regeneration properties. We examined whether these neuroprotective and axonal regeneration potentials can be identified in human spinal cord reactive astrocytes in vitro. Here, reactive astrogliosis was induced with IL1β. Within 24 hours of IL1β induction, astrocytes acquired reactive characteristics. Transcriptome analysis of over 40000 gene transcripts and PFSnet subnetwork analysis revealed upregulation of chemokines and axonal permissive factors including FGF2, BDNF, and NGF. In addition, most genes regulating axonal inhibitory molecules, including ROBO1 and ROBO2, were downregulated. There was no increase in the gene expression of "Chondroitin Sulfate Proteoglycans" (CSPGs) clusters. This suggests that reactive astrocytes may not be the main contributor of CSPGs in the glial scar. PFSnet analysis also indicated an upregulation of the "Axonal Guidance Signaling" pathway. Our results suggest that human spinal cord reactive astrocytes are potentially neuroprotective at an early onset of reactive astrogliosis.
Affiliation(s)
- Daniel Boon Loong Teh
  - Singapore Institute of Neurotechnology (SINAPSE), National University of Singapore, 28 Medical Drive, 5-COR, Singapore, 117456, Singapore
- Ankshita Prasad
  - Department of Biomedical Engineering, National University of Singapore, E4, 4 Engineering Drive 3, Singapore, 117583, Singapore
- Wenxuan Jiang
  - Department of Orthopaedic Surgery, National University of Singapore, 1E Kent Ridge Road, Singapore, 119228, Singapore
- Mohd Zacky Ariffin
  - Department of Physiology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
- Sanjay Khanna
  - Department of Physiology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
- Abha Belorkar
  - Department of Computer Science, National University of Singapore, 13 Computing Drive, Singapore, 117417, Singapore
- Limsoon Wong
  - Department of Computer Science, National University of Singapore, 13 Computing Drive, Singapore, 117417, Singapore
- Xiaogang Liu
  - Department of Chemistry, National University of Singapore, 3 Science Drive 3, Singapore, 117543, Singapore
- Angelo H All
  - Singapore Institute of Neurotechnology (SINAPSE), National University of Singapore, 28 Medical Drive, 5-COR, Singapore, 117456, Singapore
  - Department of Biomedical Engineering and Johns Hopkins School of Medicine, 701C Rutland Avenue 720, Baltimore, MD 21205, USA
  - Department of Neurology, Johns Hopkins School of Medicine, 701C Rutland Avenue 720, Baltimore, MD 21205, USA
12
Peng J, Lu J, Shang X, Chen J. Identifying consistent disease subnetworks using DNet. Methods 2017; 131:104-110. [PMID: 28807723 DOI: 10.1016/j.ymeth.2017.07.024] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Received: 05/01/2017] [Revised: 07/25/2017] [Accepted: 07/26/2017] [Indexed: 12/12/2022]
Abstract
It is critical to identify disease-specific subnetworks from the vastly available genome-wide gene expression data for elucidating how genes perform high-level biological functions together. Various algorithms have been developed for disease gene identification. However, the topological structure of the disease networks (or even the fraction of the networks) has been left largely unexplored. In this article, we present DNet, a method for the identification of significant disease subnetworks by integrating both the network structure and gene expression information. Our work will lead to the identification of missing key disease genes, which are highly expressed in a disease-specific gene expression dataset. The experimental evaluation of our method on both the Leukemia and the Duchenne Muscular Dystrophy gene expression datasets shows that DNet performs better than the existing state-of-the-art methods. In addition, literature support was found for the discovered disease subnetworks in a case study.
Affiliation(s)
- Jiajie Peng
  - School of Computer Science, Northwestern Polytechnical University, Xi'an, China
- Junya Lu
  - School of Computer Science, Northwestern Polytechnical University, Xi'an, China
- Xuequn Shang
  - School of Computer Science, Northwestern Polytechnical University, Xi'an, China
- Jin Chen
  - Institute for Biomedical Informatics, University of Kentucky, Lexington, USA
  - Department of Internal Medicine, University of Kentucky, Lexington, USA
  - Department of Computer Science, University of Kentucky, Lexington, USA
13
Abstract
Protein complex-based feature selection (PCBFS) provides unparalleled reproducibility with high phenotypic relevance on proteomics data. Currently, there are five PCBFS paradigms, but not all representative methods have been implemented or made readily available. To allow general users to take advantage of these methods, we developed the R-package NetProt, which provides implementations of representative feature-selection methods. NetProt also provides methods for generating simulated differential data and generating pseudocomplexes for complex-based performance benchmarking. The NetProt open source R package is available for download from https://github.com/gohwils/NetProt/releases/, and online documentation is available at http://rpubs.com/gohwils/204259.
Affiliation(s)
- Wilson Wen Bin Goh
  - School of Pharmaceutical Science and Technology, Tianjin University, 92 Weijin Road, Tianjin 300072, China
  - School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551
  - Department of Computer Science, National University of Singapore, 13 Computing Drive, Singapore 117417
- Limsoon Wong
  - Department of Computer Science, National University of Singapore, 13 Computing Drive, Singapore 117417
  - Department of Pathology, National University of Singapore, 5 Lower Kent Ridge Road, Singapore 119074
14
Goh WWB, Wong L. Class-paired Fuzzy SubNETs: A paired variant of the rank-based network analysis family for feature selection based on protein complexes. Proteomics 2017; 17:e1700093. [PMID: 28390171 DOI: 10.1002/pmic.201700093] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 03/09/2017] [Accepted: 04/05/2017] [Indexed: 01/12/2023]
Abstract
Identifying reproducible yet relevant protein features in proteomics data is a major challenge. Analysis at the level of protein complexes can resolve this issue and we have developed a suite of feature-selection methods collectively referred to as Rank-Based Network Analysis (RBNA). RBNAs differ in their individual statistical test setup but are similar in the sense that they deploy rank-defined weights among proteins per sample. This procedure is known as gene fuzzy scoring. Currently, no RBNA exists for paired-sample scenarios where both control and test tissues originate from the same source (e.g. same patient). It is expected that paired tests, when used appropriately, are more powerful than approaches intended for unpaired samples. We report that the class-paired RBNA, PPFSNET, dominates in both simulated and real data scenarios. Moreover, for the first time, we explicitly incorporate batch-effect resistance as an additional evaluation criterion for feature-selection approaches. Batch effects are class irrelevant variations arising from different handlers or processing times, and can obfuscate analysis. We demonstrate that PPFSNET and an earlier RBNA, PFSNET, are particularly resistant against batch effects, and only select features strongly correlated with class but not batch.
Affiliation(s)
- Wilson Wen Bin Goh
  - School of Pharmaceutical Science and Technology, Tianjin University, P. R. China
  - Department of Computer Science, National University of Singapore, Singapore
- Limsoon Wong
  - Department of Computer Science, National University of Singapore, Singapore
  - Department of Pathology, National University of Singapore, Singapore
15
Identification of prognostic genes and gene sets for early-stage non-small cell lung cancer using bi-level selection methods. Sci Rep 2017; 7:46164. [PMID: 28387364 PMCID: PMC5384004 DOI: 10.1038/srep46164] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 11/18/2016] [Accepted: 03/09/2017] [Indexed: 12/18/2022]
Abstract
In contrast to feature selection and gene set analysis, bi-level selection is a process of selecting not only important gene sets but also important genes within those gene sets. Depending on the order of selections, a bi-level selection method can be classified into three categories – forward selection, which first selects relevant gene sets and then selects relevant individual genes; backward selection, which takes the reverse order; and simultaneous selection, which performs the two tasks simultaneously, usually with the aid of a penalized regression model. To test the existence of subtype-specific prognostic genes for non-small cell lung cancer (NSCLC), we had previously proposed the Cox-filter method, which examines the association between patients’ survival time after diagnosis and one specific gene, the disease subtypes, and their interaction terms. In this study, we further extend it to carry out forward and backward bi-level selection. Using simulations and an NSCLC application, we demonstrate that the forward selection outperforms the backward selection and other relevant algorithms in our setting. Both proposed methods are readily understandable and interpretable. Therefore, they represent useful tools for researchers who are interested in exploring the prognostic value of gene expression data for specific subtypes or stages of a disease.
16
Goh WWB, Wong L. Protein complex-based analysis is resistant to the obfuscating consequences of batch effects --- a case study in clinical proteomics. BMC Genomics 2017; 18:142. [PMID: 28361693 PMCID: PMC5374662 DOI: 10.1186/s12864-017-3490-3] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Indexed: 01/06/2023]
Abstract
Background: In proteomics, batch effects are technical sources of variation that confound proper analysis, preventing effective deployment in clinical and translational research. Results: Using simulated and real data, we demonstrate that existing batch effect-correction methods do not always eradicate all batch effects. Worse still, they may alter data integrity and introduce false positives. Moreover, although principal component analysis (PCA) is commonly used for detecting batch effects, the principal components (PCs) themselves may be used as differential features, from which relevant differential proteins may be effectively traced. Batch effects are removable by identifying PCs highly correlated with batch but not class effect. However, neither PC-based nor existing batch effect-correction methods adequately address subtle batch effects, which are difficult to eradicate, and both involve data transformation and/or projection, which is error-prone. To address this, we introduce the concept of batch-effect resistant methods and demonstrate how such methods incorporating protein complexes are particularly resistant to batch effects without compromising data integrity. Conclusions: Protein complex-based analyses are powerful, offering unparalleled differential protein-selection reproducibility and high prediction accuracy. We demonstrate for the first time their innate resistance against batch effects, even subtle ones. As complex-based analyses require no prior data transformation (e.g. batch-effect correction), data integrity is protected. Individual checks on top-ranked protein complexes confirm strong association with phenotype classes and not batch. Therefore, the constituent proteins of these complexes are more likely to be clinically relevant. Electronic supplementary material: The online version of this article (doi:10.1186/s12864-017-3490-3) contains supplementary material, which is available to authorized users.
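The idea of flagging PCs that track batch but not class can be illustrated with a small sketch. This is not the authors' implementation: the toy data, the helper names `pearson` and `first_pc_scores`, the 0.8 and 0.3 thresholds, and the restriction to the leading PC (found by power iteration) are all assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def first_pc_scores(rows, iters=100):
    """Project samples onto the leading principal component (power iteration)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    X = [[r[j] - means[j] for j in range(p)] for r in rows]  # column-centred data
    v = [1.0 / math.sqrt(p)] * p                              # arbitrary start vector
    for _ in range(iters):
        s = [sum(x[j] * v[j] for j in range(p)) for x in X]             # X v
        w = [sum(X[i][j] * s[i] for i in range(n)) for j in range(p)]   # X^T (X v)
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return [sum(x[j] * v[j] for j in range(p)) for x in X]

# Hypothetical toy data: feature 0 carries a large batch shift,
# feature 1 a modest class shift, feature 2 only noise.
rows = [
    [0.1, 0.0, 0.2], [-0.1, 1.0, -0.1], [0.2, 0.1, 0.0], [-0.2, 0.9, 0.1],
    [10.1, 0.0, -0.2], [9.9, 1.1, 0.1], [10.2, -0.1, 0.0], [9.8, 1.0, -0.1],
]
batch = [0, 0, 0, 0, 1, 1, 1, 1]
cls = [0, 1, 0, 1, 0, 1, 0, 1]

scores = first_pc_scores(rows)
r_batch = abs(pearson(scores, batch))   # sign of a PC is arbitrary, so use |r|
r_class = abs(pearson(scores, cls))
# Illustrative thresholds (assumptions): a PC is "batch-dominated" when it
# tracks batch strongly and class weakly.
batch_dominated = r_batch > 0.8 and r_class < 0.3
```

In this toy case the leading PC is driven almost entirely by the batch shift, so it would be a candidate for removal; as the abstract cautions, such a check is unlikely to catch subtle batch effects.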
Affiliation(s)
- Wilson Wen Bin Goh
- School of Pharmaceutical Science and Technology, Tianjin University, 92 Weijin Road, Nankai District, Tianjin, 300072, People's Republic of China; Department of Computer Science, National University of Singapore, 13 Computing Drive, Singapore, 117417, Singapore.
- Limsoon Wong
- Department of Computer Science, National University of Singapore, 13 Computing Drive, Singapore, 117417, Singapore; Department of Pathology, National University of Singapore, Singapore, Singapore.
|
17
|
Abstract
Background: Gene expression data produced on high-throughput platforms such as microarrays is susceptible to much variation that obscures useful biological information. Therefore, preprocessing data with a suitable normalization method is necessary, and has a direct and massive impact on the quality of downstream data analysis. However, it is known that standard normalization methods perform poorly, especially in the presence of substantial batch effects and heterogeneity in gene expression data. Results: We present Gene Fuzzy Score (GFS), a simple preprocessing technique that is able to largely reduce obscuring variation while retaining useful biological information. Using four sets of publicly available datasets containing batch effects and heterogeneity, we compare GFS with three standard normalization techniques as well as raw gene expression. Each method is evaluated with respect to the quality, consistency, and biological coherence of its processed output. It is found that GFS outperforms other transformation techniques in all three aspects. Conclusion: Our approach to preprocessing is a stronger alternative to popular normalization techniques. We demonstrate that it achieves the essential goal of preprocessing: it is effective at making expression values from multiple samples comparable, even when they are from separate platforms, in independent batches, or belong to a heterogeneous phenotype.
Affiliation(s)
- Abha Belorkar
- School of Computing, National University of Singapore, 13 Computing Drive, Singapore, 117417, Republic of Singapore.
- Limsoon Wong
- School of Computing, National University of Singapore, 13 Computing Drive, Singapore, 117417, Republic of Singapore.
|
18
|
Wang W, Sue ACH, Goh WWB. Feature selection in clinical proteomics: with great power comes great reproducibility. Drug Discov Today 2016; 22:912-918. [PMID: 27988358 DOI: 10.1016/j.drudis.2016.12.006] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.8] [Received: 10/12/2016] [Revised: 11/27/2016] [Accepted: 12/08/2016] [Indexed: 01/17/2023]
Abstract
In clinical proteomics, reproducible feature selection is unattainable given the standard statistical hypothesis-testing framework. This leads to irreproducible signatures with no diagnostic power. Instability stems from high P-value variability (p_var), which is inevitable and insolvable. The impact of p_var can be reduced by increasing power, for example by increasing sample size and measurement accuracy. However, these are not realistic solutions in practice. Instead, workarounds using existing data, such as signal-boosting transformation techniques and network-based statistical testing, are more practical. Furthermore, it is useful to consider other metrics alongside P-values, including confidence intervals, effect sizes and cross-validation accuracies, to make informed inferences.
Affiliation(s)
- Wei Wang
- School of Pharmaceutical Science and Technology, Tianjin University, China
- Andrew C-H Sue
- School of Pharmaceutical Science and Technology, Tianjin University, China
- Wilson W B Goh
- School of Pharmaceutical Science and Technology, Tianjin University, China; Department of Bioengineering, Tianjin University, China; Department of Computer Science, National University of Singapore, Singapore.
|
19
|
Goh WWB. Fuzzy-FishNET: a highly reproducible protein complex-based approach for feature selection in comparative proteomics. BMC Med Genomics 2016; 9:67. [PMID: 28117654 PMCID: PMC5260792 DOI: 10.1186/s12920-016-0228-z] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Indexed: 01/20/2023] Open
Abstract
Background: The hypergeometric enrichment analysis approach typically fares poorly in feature-selection stability due to its upstream reliance on the t-test to generate differential protein lists before testing for enrichment on a protein complex, subnetwork or gene group. Methods: Swapping the t-test in favour of a fuzzy rank-based weight system similar to that used in network-based methods like Quantitative Proteomics Signature Profiling (QPSP), Fuzzy SubNets (FSNET) and paired FSNET (PFSNET) produces dramatic improvements. Results: This approach, Fuzzy-FishNET, exhibits high precision-recall over three sets of simulated data (with simulated protein complexes) while excelling in feature-selection reproducibility on real data (based on evaluation with real protein complexes). Overlap comparisons with PFSNET show that Fuzzy-FishNET selects the most significant complexes, which are also strongly class-discriminative. Cross-validation further demonstrates that Fuzzy-FishNET selects class-relevant protein complexes. Conclusions: Based on evaluation with simulated and real datasets, Fuzzy-FishNET is a significant upgrade of the traditional hypergeometric enrichment approach and a powerful new entrant amongst comparative proteomics analysis methods. Electronic supplementary material: The online version of this article (doi:10.1186/s12920-016-0228-z) contains supplementary material, which is available to authorized users.
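For orientation, the conventional hypergeometric enrichment test that this abstract takes as its baseline can be sketched as follows. This is the textbook upper-tail test, not Fuzzy-FishNET itself; the function name and the toy numbers are assumptions for illustration.

```python
from math import comb

def hypergeom_enrichment_p(n_universe, n_selected, complex_size, overlap):
    """Upper-tail hypergeometric p-value.

    Probability of seeing at least `overlap` members of a complex among
    `n_selected` proteins drawn without replacement from `n_universe`
    proteins, of which `complex_size` belong to the complex.
    """
    upper = min(n_selected, complex_size)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(complex_size, k) * comb(n_universe - complex_size, n_selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

# Hypothetical example: 10 measured proteins, 3 called differential upstream,
# a complex of 4 proteins, and all 3 differential proteins fall in the complex.
p_value = hypergeom_enrichment_p(10, 3, 4, 3)   # 4/120
```

The test only ever sees the pre-selected differential list (`n_selected` and `overlap`), which is why instability in the upstream selection step carries straight through to the enrichment result.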
Affiliation(s)
- Wilson Wen Bin Goh
- School of Pharmaceutical Science and Technology, Tianjin University, Tianjin, People's Republic of China.
|
20
|
Goh WWB, Wong L. Integrating Networks and Proteomics: Moving Forward. Trends Biotechnol 2016; 34:951-959. [DOI: 10.1016/j.tibtech.2016.05.015] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Received: 04/19/2016] [Revised: 05/23/2016] [Accepted: 05/24/2016] [Indexed: 11/28/2022]
|
21
|
Zhang L, Wang L, Tian P, Tian S. Identification of Genes Discriminating Multiple Sclerosis Patients from Controls by Adapting a Pathway Analysis Method. PLoS One 2016; 11:e0165543. [PMID: 27846233 PMCID: PMC5112852 DOI: 10.1371/journal.pone.0165543] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 02/17/2016] [Accepted: 09/13/2016] [Indexed: 11/18/2022] Open
Abstract
The focus of analyzing data from microarray experiments has shifted from the identification of associated individual genes to that of associated biological pathways or gene sets. In bioinformatics, a feature selection algorithm is usually used to cope with the high dimensionality of microarray data. In addition to those algorithms that use the biological information contained within a gene set as prior knowledge to facilitate the process of feature selection, various gene set analysis methods can be applied directly or modified readily for the purpose of feature selection. The significance analysis of microarray to gene-set reduction analysis (SAM-GSR) algorithm, a novel direction of gene set analysis, is one such method. Here, we explore the feature selection property of SAM-GSR and provide a modification to better achieve the goal of feature selection. In a multiple sclerosis (MS) microarray data application, both SAM-GSR and our modification of SAM-GSR perform well. Our results show that SAM-GSR can indeed carry out feature selection, and that the modified SAM-GSR outperforms SAM-GSR. Given that pathway information is far from complete, a statistical method capable of constructing biologically meaningful gene networks is of interest. Consequently, both SAM-GSR algorithms will be continuously re-evaluated in our future work, and thus better characterized.
Affiliation(s)
- Lei Zhang
- College of Life Science, Jilin University, 2699 Qianjin Street, Changchun, Jilin, China, 130012
- Department of Neurology, The Second Hospital of Jilin University, 218 Ziqiang Street, Changchun, Jilin, China, 130041
- Linlin Wang
- College of Life Science, Jilin University, 2699 Qianjin Street, Changchun, Jilin, China, 130012
- Pu Tian
- College of Life Science, Jilin University, 2699 Qianjin Street, Changchun, Jilin, China, 130012
- Suyan Tian
- Division of Clinical Research, The First Hospital of Jilin University, 71 Xinmin Street, Changchun, Jilin, China, 130021
|
22
|
Tian S, Chang HH, Wang C. Weighted-SAMGSR: combining significance analysis of microarray-gene set reduction algorithm with pathway topology-based weights to select relevant genes. Biol Direct 2016; 11:50. [PMID: 27681389 PMCID: PMC5041498 DOI: 10.1186/s13062-016-0152-3] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 05/13/2016] [Accepted: 09/20/2016] [Indexed: 01/15/2023] Open
Abstract
BACKGROUND: It has been demonstrated that a pathway-based feature selection method that incorporates biological information within pathways during the process of feature selection usually outperforms a gene-based feature selection algorithm in terms of predictive accuracy and stability. The significance analysis of microarray-gene set reduction algorithm (SAMGSR), an extension to a gene set analysis method with further reduction of the selected pathways to their respective core subsets, can be regarded as a pathway-based feature selection method. METHODS: In SAMGSR, whether a gene is selected is mainly determined by its expression difference between the phenotypes, and partially by the number of pathways to which this gene belongs. It ignores the topology information among pathways. In this study, we propose a weighted version of the SAMGSR algorithm by constructing weights based on the connectivity among genes and then combining these weights with the test statistics. RESULTS: Using both simulated and real-world data, we evaluate the performance of the proposed SAMGSR extension and demonstrate that the weighted version outperforms its original version. CONCLUSIONS: The additional gene connectivity information does facilitate feature selection. REVIEWERS: This article was reviewed by Drs. Limsoon Wong, Lev Klebanov, and I. King Jordan.
Affiliation(s)
- Suyan Tian
- Division of Clinical Research, The First Hospital of Jilin University, 71 Xinmin Street, Changchun, Jilin, China, 130021; School of Mathematics, Jilin University, 2699 Qianjin Street, Changchun, Jilin, China, 130012.
- Howard H Chang
- Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, 1518 Clifton Road NE, Atlanta, GA, 30322, USA
- Chi Wang
- Department of Biostatistics, Markey Cancer Center, The University of Kentucky, 800 Rose St., Lexington, KY, 40536, USA
|
23
|
Goh WWB, Wong L. Advancing Clinical Proteomics via Analysis Based on Biological Complexes: A Tale of Five Paradigms. J Proteome Res 2016; 15:3167-79. [DOI: 10.1021/acs.jproteome.6b00402] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Indexed: 01/03/2023]
Affiliation(s)
- Wilson Wen Bin Goh
- School of Pharmaceutical Science and Technology, Tianjin University, 92 Weijin Road, Nankai District, Tianjin 300072, China
- Department of Computer Science, National University of Singapore, 13 Computing Drive, Singapore 117417
- Limsoon Wong
- Department of Computer Science, National University of Singapore, 13 Computing Drive, Singapore 117417
- Department of Pathology, National University of Singapore, 5 Lower Kent Ridge Road, Singapore 117417
|
24
|
Goh WWB, Wong L. Evaluating feature-selection stability in next-generation proteomics. J Bioinform Comput Biol 2016; 14:1650029. [PMID: 27640811 DOI: 10.1142/s0219720016500293] [Citation(s) in RCA: 40] [Impact Index Per Article: 4.4] [Indexed: 01/10/2023]
Abstract
Identifying reproducible yet relevant features is a major challenge in biological research. This is well documented in genomics data. Using a proposed set of three reliability benchmarks, we find that this issue exists also in proteomics for commonly used feature-selection methods, e.g. the t-test and recursive feature elimination. Moreover, due to high test variability, selecting the top proteins based on p-value ranks, even when restricted to high-abundance proteins, does not improve reproducibility. Statistical testing based on networks is believed to be more robust, but this does not always hold true: the commonly used hypergeometric enrichment that tests for enrichment of protein subnets performs abysmally due to its dependence on unstable protein pre-selection steps. We demonstrate here for the first time the utility of a novel suite of network-based algorithms called rank-based network algorithms (RBNAs) on proteomics. These were originally introduced and tested extensively on genomics data. We show here that they are highly stable, reproducible and select relevant features when applied to proteomics data. It is also evident from these results that statistical feature testing on protein expression data should be executed with due caution. Careless use of networks does not resolve poor-performance issues, and can even mislead. We recommend augmenting statistical feature-selection methods with concurrent analysis of stability and reproducibility to improve the quality of the selected features prior to experimental validation.
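One simple way to quantify the kind of feature-selection stability discussed above is the mean pairwise Jaccard overlap between feature sets selected on different resamples. The abstract does not spell out its three reliability benchmarks, so the measure below is a commonly used stand-in rather than the authors' metric; the function names and protein labels are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two feature collections (1.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def selection_stability(feature_sets):
    """Mean pairwise Jaccard similarity across selected feature sets.

    Expects at least two feature sets, e.g. one per resample or per batch.
    """
    pairs = list(combinations(feature_sets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical selections from three resamples of the same dataset.
runs = [{"P1", "P2", "P3"}, {"P1", "P2", "P4"}, {"P1", "P2", "P3"}]
stability = selection_stability(runs)   # (0.5 + 1.0 + 0.5) / 3
```

A value near 1 means the method picks nearly the same features each time; a value near 0 suggests that the selection is driven largely by sampling noise.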
Affiliation(s)
- Wilson Wen Bin Goh
- School of Pharmaceutical Science and Technology, Tianjin University, 92 Weijin Road, Tianjin 300072, China; Department of Computer Science, National University of Singapore, 13 Computing Drive, Singapore 117417, Singapore
- Limsoon Wong
- School of Pharmaceutical Science and Technology, Tianjin University, 92 Weijin Road, Tianjin 300072, China; Department of Computer Science, National University of Singapore, 13 Computing Drive, Singapore 117417, Singapore
|
25
|
Design principles for clinical network-based proteomics. Drug Discov Today 2016; 21:1130-8. [DOI: 10.1016/j.drudis.2016.05.013] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 09/06/2015] [Revised: 04/18/2016] [Accepted: 05/20/2016] [Indexed: 01/10/2023]
|
26
|
Classification of Non-Small Cell Lung Cancer Using Significance Analysis of Microarray-Gene Set Reduction Algorithm. BIOMED RESEARCH INTERNATIONAL 2016; 2016:2491671. [PMID: 27446945 PMCID: PMC4944087 DOI: 10.1155/2016/2491671] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 12/03/2015] [Revised: 05/09/2016] [Accepted: 06/05/2016] [Indexed: 01/15/2023]
Abstract
Among non-small cell lung cancers (NSCLC), adenocarcinoma (AC) and squamous cell carcinoma (SCC) are the two major histological subtypes, accounting for roughly 40% and 30% of all lung cancer cases, respectively. Since AC and SCC differ in their cell of origin, location within the lung, and growth pattern, they are considered distinct diseases. Gene expression signatures have been demonstrated to be an effective tool for distinguishing AC and SCC. Gene set analysis is regarded as irrelevant to the identification of gene expression signatures. Nevertheless, we found that one specific gene set analysis method, significance analysis of microarray-gene set reduction (SAMGSR), can be adopted directly to select relevant features and to construct gene expression signatures. In this study, we applied SAMGSR to an NSCLC gene expression dataset. When compared with several novel feature selection algorithms, for example, LASSO, SAMGSR has equivalent or better performance in terms of predictive ability and model parsimony. Therefore, SAMGSR is indeed a feature selection algorithm. Additionally, we applied SAMGSR to the AC and SCC subtypes separately to discriminate their respective stages, that is, stage II versus stage I. The small overlap between the two resulting gene signatures illustrates that AC and SCC are technically distinct diseases. Therefore, stratified analyses on subtypes are recommended when diagnostic or prognostic signatures of these two NSCLC subtypes are constructed.
|
27
|
Engchuan W, Meechai A, Tongsima S, Doungpan N, Chan JH. Gene-set activity toolbox (GAT): A platform for microarray-based cancer diagnosis using an integrative gene-set analysis approach. J Bioinform Comput Biol 2016; 14:1650015. [PMID: 27102089 DOI: 10.1142/s0219720016500153] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 11/18/2022]
Abstract
Cancer is a complex disease that cannot be diagnosed reliably using only single gene expression analysis. Using gene-set analysis on high-throughput gene expression profiling controlled by various environmental factors is a commonly adopted technique in the cancer research community. This work develops a comprehensive gene expression analysis tool, the gene-set activity toolbox (GAT), which is implemented with a data retriever, traditional data pre-processing, several gene-set analysis methods, network visualization and data mining tools. The gene-set analysis methods are used to identify subsets of phenotype-relevant genes that will be used to build a classification model. To evaluate GAT performance, we performed a cross-dataset validation study on three common cancers, namely colorectal, breast and lung cancers. The results show that GAT can be used to build a reasonable disease diagnostic model and that the predicted markers have biological relevance. GAT can be accessed from http://gat.sit.kmutt.ac.th where GAT's java library for gene-set analysis, simple classification and a database with three cancer benchmark datasets can be downloaded.
Affiliation(s)
- Worrawat Engchuan
- Data and Knowledge Engineering Laboratory, School of Information Technology, King Mongkut's University of Technology Thonburi, Bangkok, Thailand
- Asawin Meechai
- Department of Chemical Engineering, Faculty of Engineering, King Mongkut's University of Technology Thonburi, Bangkok, Thailand
- Sissades Tongsima
- Biostatistics and Informatics Laboratory, Genome Technology Research Unit, National Center for Genetic Engineering and Biotechnology
- Narumol Doungpan
- Biological Engineering Program, Faculty of Engineering, King Mongkut's University of Technology Thonburi, Bangkok, Thailand
- Jonathan H Chan
- Data and Knowledge Engineering Laboratory, School of Information Technology, King Mongkut's University of Technology Thonburi, Bangkok, Thailand
|
28
|
Long J, Liu Z, Wu X, Xu Y, Ge C. Screening for genes and subnetworks associated with pancreatic cancer based on the gene expression profile. Mol Med Rep 2016; 13:3779-86. [PMID: 27035224 PMCID: PMC4838159 DOI: 10.3892/mmr.2016.5007] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 04/16/2015] [Accepted: 02/17/2016] [Indexed: 12/27/2022] Open
Abstract
The present study aimed to screen for potential genes and subnetworks associated with pancreatic cancer (PC) using the gene expression profile. The expression profile GSE16515 was downloaded from the Gene Expression Omnibus database, which included 36 PC tissue samples and 16 normal samples. The Limma package in R was used to screen differentially expressed genes (DEGs), which were grouped as up‑ and downregulated genes. Then, PFSNet was applied to perform subnetwork analysis for all the DEGs. Moreover, Gene Ontology (GO) and REACTOME pathway enrichment analysis of up‑ and downregulated genes was performed, followed by protein‑protein interaction (PPI) network construction using the Search Tool for the Retrieval of Interacting Genes. In total, 1,989 DEGs including 1,461 up‑ and 528 downregulated genes were screened out. Subnetworks including pancreatic cancer in PC tissue samples and intercellular adhesion in normal samples were identified, respectively. A total of 8 significant REACTOME pathways for upregulated DEGs, such as 'hemostasis' and 'cell cycle, mitotic', were identified. Moreover, 4 significant REACTOME pathways for downregulated DEGs, including regulation of β‑cell development and transmembrane transport of small molecules, were screened out. Additionally, DEGs with high connectivity degrees, such as CCNA2 (cyclin A2) and PBK (PDZ binding kinase), of the module in the protein‑protein interaction network were mainly enriched in the cell‑division cycle. CCNA2 and PBK of the module and their related pathway, the cell‑division cycle, and two subnetworks (pancreatic cancer and intercellular adhesion subnetworks) may be pivotal for further understanding of the molecular mechanism of PC.
Affiliation(s)
- Jin Long
- Department of General Surgery, The First Hospital of China Medical University, Shenyang, Liaoning 110001, P.R. China
- Zhe Liu
- Department of General Surgery, The First Hospital of China Medical University, Shenyang, Liaoning 110001, P.R. China
- Xingda Wu
- Department of General Surgery, The First Hospital of China Medical University, Shenyang, Liaoning 110001, P.R. China
- Yuanhong Xu
- Department of General Surgery, The First Hospital of China Medical University, Shenyang, Liaoning 110001, P.R. China
- Chunlin Ge
- Department of General Surgery, The First Hospital of China Medical University, Shenyang, Liaoning 110001, P.R. China
|
29
|
Li B, Sun Z, He Q, Zhu Y, Qin ZS. Bayesian inference with historical data-based informative priors improves detection of differentially expressed genes. Bioinformatics 2016; 32:682-9. [PMID: 26519502 DOI: 10.1093/bioinformatics/btv631] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 04/15/2015] [Accepted: 10/26/2015] [Indexed: 12/13/2022] Open
Abstract
MOTIVATION: Modern high-throughput biotechnologies such as microarrays are capable of producing a massive amount of information for each sample. However, in a typical high-throughput experiment, only a limited number of samples are assayed, hence the classical 'large p, small n' problem. On the other hand, rapid propagation of these high-throughput technologies has resulted in a substantial collection of data, often generated on the same platform and using the same protocol. It is highly desirable to utilize the existing data when performing analysis and inference on a new dataset. RESULTS: Utilizing existing data can be carried out in a straightforward fashion under the Bayesian framework, in which the repository of historical data can be exploited to build informative priors that are then used in new data analysis. In this work, using microarray data, we investigate the feasibility and effectiveness of deriving informative priors from historical data and using them in the problem of detecting differentially expressed genes. Through simulation and real data analysis, we show that the proposed strategy significantly outperforms existing methods, including the popular and state-of-the-art Bayesian hierarchical model-based approaches. Our work illustrates the feasibility and benefits of exploiting the increasingly available genomics big data in statistical inference and presents a promising practical strategy for dealing with the 'large p, small n' problem. AVAILABILITY AND IMPLEMENTATION: Our method is implemented in the R package IPBT, which is freely available from https://github.com/benliemory/IPBT. CONTACT: yuzhu@purdue.edu; zhaohui.qin@emory.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ben Li
- Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA 30322, USA
- Zhaonan Sun
- Department of Statistics, Purdue University, West Lafayette, IN 47906, USA
- Qing He
- Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA 30322, USA
- Yu Zhu
- Department of Statistics, Purdue University, West Lafayette, IN 47906, USA
- Zhaohui S Qin
- Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA 30322, USA; Department of Biomedical Informatics, Emory University School of Medicine, Atlanta, GA 30322, USA
|
30
|
Bin Goh WW, Guo T, Aebersold R, Wong L. Quantitative proteomics signature profiling based on network contextualization. Biol Direct 2015; 10:71. [PMID: 26666224 PMCID: PMC4678536 DOI: 10.1186/s13062-015-0098-x] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Received: 10/15/2015] [Accepted: 11/30/2015] [Indexed: 12/02/2022] Open
Abstract
Background: We present a network-based method, namely quantitative proteomic signature profiling (qPSP), that improves the biological content of proteomic data by converting protein expressions into hit-rates in protein complexes. Results: We demonstrate, using two clinical proteomics datasets, that qPSP produces robust discrimination between phenotype classes (e.g. normal vs. disease) and uncovers phenotype-relevant protein complexes. Regardless of acquisition paradigm, comparisons of qPSP against conventional methods (e.g. t-test or hypergeometric test) demonstrate that it produces more stable and consistent predictions, even at small sample size. We show that qPSP is theoretically robust to noise, and that this robustness to noise is also observable in practice. Comparative analysis of hit-rates and protein expressions in significant complexes reveals that hit-rates are a useful means of summarizing differential behavior in a complex-specific manner. Conclusions: Given qPSP’s ability to discriminate phenotype classes even at small sample sizes, high robustness to noise, and better summary statistics, it can be deployed towards analysis of highly heterogeneous clinical proteomics data. Reviewers: This article was reviewed by Frank Eisenhaber and Sebastian Maurer-Stroh. Electronic supplementary material: The online version of this article (doi:10.1186/s13062-015-0098-x) contains supplementary material, which is available to authorized users.
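The conversion of protein expressions into per-complex hit-rates can be pictured with a deliberately simplified sketch. The abstract gives no thresholds or weighting scheme, so everything concrete here is an assumption: the function name `hit_rates`, the 20% cut-off, the binary (non-fuzzy) membership of the top-ranked set, and the toy proteins and complexes.

```python
def hit_rates(sample, complexes, top_fraction=0.2):
    """Fraction of each complex's proteins found among a sample's top-ranked proteins.

    sample:    dict mapping protein id -> abundance for ONE sample
    complexes: dict mapping complex name -> list of member protein ids
    """
    ranked = sorted(sample, key=sample.get, reverse=True)  # most abundant first
    n_top = max(1, int(len(ranked) * top_fraction))
    top = set(ranked[:n_top])
    return {
        name: sum(protein in top for protein in members) / len(members)
        for name, members in complexes.items()
    }

# Hypothetical sample with ten proteins, P0 the most abundant.
sample = {f"P{i}": 10 - i for i in range(10)}
complexes = {"complex_A": ["P0", "P1", "P5"], "complex_B": ["P8", "P9"]}
rates = hit_rates(sample, complexes)
# complex_A: 2 of 3 members are in the top 20%; complex_B: none are.
```

Because the hit-rate depends on within-sample ranks rather than raw abundances, each sample is summarized as a vector of complex-level scores, which can then be compared between phenotype classes.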
Affiliation(s)
- Wilson Wen Bin Goh
- School of Pharmaceutical Science and Technology, Tianjin University, 92 Weijin Road, Tianjin City, 300072, China; Center for Interdisciplinary Cardiovascular Sciences, Harvard Medical School, Boston, USA; Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Zurich, Switzerland; School of Computing, National University of Singapore, Singapore, Singapore.
- Tiannan Guo
- Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Zurich, Switzerland.
- Ruedi Aebersold
- Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Zurich, Switzerland; Faculty of Science, University of Zurich, Zurich, Switzerland.
- Limsoon Wong
- School of Computing, National University of Singapore, Singapore, Singapore.
|
31
|
Lim K, Li Z, Choi KP, Wong L. A quantum leap in the reproducibility, precision, and sensitivity of gene expression profile analysis even when sample size is extremely small. J Bioinform Comput Biol 2015; 13:1550018. [DOI: 10.1142/s0219720015500183] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Indexed: 12/31/2022]
Abstract
Transcript-level quantification is often measured across two groups of patients to aid the discovery of biomarkers and detection of biological mechanisms involving these biomarkers. Statistical tests lack power and the false discovery rate is high when sample size is small. Yet, many experiments have very few samples (≤ 5). This creates the impetus for a method to discover biomarkers and mechanisms under very small sample sizes. We present a powerful method, ESSNet, that is able to identify subnetworks consistently across independent datasets of the same disease phenotypes even under very small sample sizes. The key idea of ESSNet is to fragment large pathways into smaller subnetworks and compute a statistic that discriminates the subnetworks in two phenotypes. We do not greedily select genes to be included based on differential expression but rely on gene-expression-level ranking within a phenotype, which is shown to be stable even under extremely small sample sizes. We test our subnetworks on null distributions obtained by array rotation; this preserves the gene–gene correlation structure and is suitable for datasets with small sample size, allowing us to consistently predict relevant subnetworks even when sample size is small. For most other methods, this consistency drops to less than 10% when we test them on datasets with only two samples from each phenotype, whereas ESSNet is able to achieve an average consistency of 58% (72% when we consider genes within the subnetworks) and continues to be superior when sample size is large. We further show that the subnetworks identified by ESSNet are highly correlated to many references in the biological literature. ESSNet and supplementary material are available at: http://compbio.ddns.comp.nus.edu.sg:8080/essnet.
Affiliation(s)
- Kevin Lim
  - School of Computing, National University of Singapore, 13 Computing Drive, Singapore 117417, Singapore
- Zhenhua Li
  - Department of Pediatrics, National University of Singapore, 10 Medical Drive, Singapore 117597, Singapore
- Kwok Pui Choi
  - Department of Statistics and Applied Probability, National University of Singapore, 6 Science Drive 2, Singapore 117546, Singapore
- Limsoon Wong
  - School of Computing, National University of Singapore, 13 Computing Drive, Singapore 117417, Singapore
|
32
|
Saha A, Jeon M, Tan AC, Kang J. iCOSSY: An Online Tool for Context-Specific Subnetwork Discovery from Gene Expression Data. PLoS One 2015; 10:e0131656. [PMID: 26147457 PMCID: PMC4492968 DOI: 10.1371/journal.pone.0131656] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 12/22/2014] [Accepted: 06/04/2015] [Indexed: 12/22/2022]
Abstract
Pathway analyses help reveal underlying molecular mechanisms of complex biological phenotypes. Biologists tend to perform multiple pathway analyses on the same dataset, as there is no single answer. It is often inefficient for them to implement and/or install all the algorithms by themselves. Online tools can help the community in this regard. Here we present an online gene expression analytical tool called iCOSSY which implements a novel pathway-based COntext-specific Subnetwork discoverY (COSSY) algorithm. iCOSSY also includes a few modifications of COSSY to increase its reliability and interpretability. Users can upload their gene expression datasets, and discover important subnetworks of closely interacting molecules to differentiate between two phenotypes (context). They can also interactively visualize the resulting subnetworks. iCOSSY is a web server that finds subnetworks that are differentially expressed in two phenotypes. Users can visualize the subnetworks to understand the biology of the difference.
Affiliation(s)
- Ashis Saha
  - Department of Computer Science and Engineering, Korea University, Seoul, Korea
- Minji Jeon
  - Department of Computer Science and Engineering, Korea University, Seoul, Korea
- Aik Choon Tan
  - Department of Medicine/Medical Oncology, University of Colorado Anschutz Medical Campus, Aurora, Colorado, United States of America
- Jaewoo Kang
  - Department of Computer Science and Engineering, Korea University, Seoul, Korea
  - Interdisciplinary Graduate Program in Bioinformatics, Korea University, Seoul, Korea
|