1
Trasierras AM, Luna JM, Ventura S. Improving the understanding of cancer in a descriptive way: An emerging pattern mining-based approach. Int J Intell Syst 2021. [DOI: 10.1002/int.22503] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/24/2022]
Affiliation(s)
- José María Luna
- Department of Computer Science and Numerical Analysis, Andalusian Research Institute in Data Science and Computational Intelligence (DaSCI), University of Cordoba, Córdoba, Spain
- Sebastián Ventura
- Department of Computer Science and Numerical Analysis, Andalusian Research Institute in Data Science and Computational Intelligence (DaSCI), University of Cordoba, Córdoba, Spain
2
Arostegui I, Gonzalez N, Fernández-de-Larrea N, Lázaro-Aramburu S, Baré M, Redondo M, Sarasqueta C, Garcia-Gutierrez S, Quintana JM. Combining statistical techniques to predict postsurgical risk of 1-year mortality for patients with colon cancer. Clin Epidemiol 2018; 10:235-251. [PMID: 29563837 PMCID: PMC5846756 DOI: 10.2147/clep.s146729] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 12/19/2022] Open
Abstract
Introduction: Colorectal cancer is one of the most frequently diagnosed malignancies and a common cause of cancer-related mortality. The aim of this study was to develop and validate a clinical predictive model for 1-year mortality among patients with colon cancer who survive for at least 30 days after surgery.
Methods: Patients diagnosed with colon cancer who had surgery for the first time and who survived 30 days after the surgery were selected prospectively. The outcome was mortality within 1 year. Random forest, genetic algorithms and classification and regression trees were combined in order to identify the variables and partition points that optimally classify patients by risk of mortality. The resulting decision tree was categorized into four risk categories. Split-sample and bootstrap validation were performed. ClinicalTrials.gov Identifier: NCT02488161.
Results: A total of 1945 patients were enrolled in the study. The variables identified as the main predictors of 1-year mortality were presence of residual tumor, American Society of Anesthesiologists Physical Status Classification System risk score, pathologic tumor staging, Charlson Comorbidity Index, intraoperative complications, adjuvant chemotherapy and recurrence of tumor. The model was internally validated; area under the receiver operating characteristic curve (AUC) was 0.896 in the derivation sample and 0.835 in the validation sample. Risk categorization led to AUC values of 0.875 and 0.832 in the derivation and validation samples, respectively. The optimal cut-off point of estimated risk had a sensitivity of 0.889 and a specificity of 0.758.
Conclusion: The decision tree was a simple, interpretable, valid and accurate prediction rule of 1-year mortality among colon cancer patients who survived for at least 30 days after surgery.
Affiliation(s)
- Inmaculada Arostegui
- Department of Applied Mathematics, Statistics and Operations Research, University of the Basque Country UPV/EHU, Leioa, Bizkaia, Spain; Health Services Research on Chronic Patients Network (REDISSEC), Galdakao, Bizkaia, Spain; Basque Center for Applied Mathematics - BCAM, Bilbao, Bizkaia, Spain
- Nerea Gonzalez
- Health Services Research on Chronic Patients Network (REDISSEC), Galdakao, Bizkaia, Spain; Research Unit, Galdakao-Usansolo Hospital, Galdakao, Bizkaia, Spain
- Nerea Fernández-de-Larrea
- Environmental and Cancer Epidemiology Unit, National Center of Epidemiology, Instituto de Salud Carlos III, Madrid, Spain; Consortium for Biomedical Research in Epidemiology and Public Health (CIBERESP), Madrid, Spain
- Marisa Baré
- Health Services Research on Chronic Patients Network (REDISSEC), Galdakao, Bizkaia, Spain; Clinical Epidemiology and Cancer Screening Unit, Parc Taulí Sabadell-Hospital Universitari, UAB, Sabadell, Barcelona, Spain
- Maximino Redondo
- Health Services Research on Chronic Patients Network (REDISSEC), Galdakao, Bizkaia, Spain; Research Unit, Costa del Sol Hospital, Marbella, Malaga, Spain
- Cristina Sarasqueta
- Health Services Research on Chronic Patients Network (REDISSEC), Galdakao, Bizkaia, Spain; Research Unit, Donostia Hospital, Donostia-San Sebastián, Gipuzkoa, Spain
- Susana Garcia-Gutierrez
- Health Services Research on Chronic Patients Network (REDISSEC), Galdakao, Bizkaia, Spain; Research Unit, Galdakao-Usansolo Hospital, Galdakao, Bizkaia, Spain
- José M Quintana
- Health Services Research on Chronic Patients Network (REDISSEC), Galdakao, Bizkaia, Spain; Research Unit, Galdakao-Usansolo Hospital, Galdakao, Bizkaia, Spain
3
Raddatz BB, Spitzbarth I, Matheis KA, Kalkuhl A, Deschl U, Baumgärtner W, Ulrich R. Microarray-Based Gene Expression Analysis for Veterinary Pathologists: A Review. Vet Pathol 2017. [PMID: 28641485 DOI: 10.1177/0300985817709887] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Indexed: 02/06/2023]
Abstract
High-throughput, genome-wide transcriptome analysis is now commonly used in all fields of life science research and is on the cusp of medical and veterinary diagnostic application. Transcriptomic methods such as microarrays and next-generation sequencing generate enormous amounts of data. The pathogenetic expertise acquired from understanding of general pathology provides veterinary pathologists with a profound background, which is essential in translating transcriptomic data into meaningful biological knowledge, thereby leading to a better understanding of underlying disease mechanisms. The scientific literature concerning high-throughput data-mining techniques usually addresses mathematicians or computer scientists as the target audience. In contrast, the present review provides the reader with a clear and systematic basis from a veterinary pathologist's perspective. Therefore, the aims are (1) to introduce the reader to the necessary methodological background; (2) to introduce the sequential steps commonly performed in a microarray analysis including quality control, annotation, normalization, selection of differentially expressed genes, clustering, gene ontology and pathway analysis, analysis of manually selected genes, and biomarker discovery; and (3) to provide references to publicly available and user-friendly software suites. In summary, the data analysis methods presented within this review will enable veterinary pathologists to analyze high-throughput transcriptome data obtained from their own experiments, supplemental data that accompany scientific publications, or public repositories in order to obtain a more in-depth insight into underlying disease mechanisms.
Affiliation(s)
- Barbara B Raddatz
- Department of Pathology, University of Veterinary Medicine Hannover, Hannover, Germany; Center of Systems Neuroscience, Hannover, Germany
- Ingo Spitzbarth
- Department of Pathology, University of Veterinary Medicine Hannover, Hannover, Germany; Center of Systems Neuroscience, Hannover, Germany
- Katja A Matheis
- Department of Nonclinical Drug Safety, Boehringer Ingelheim Pharma GmbH & Co KG, Biberach (Riß), Germany
- Arno Kalkuhl
- Department of Nonclinical Drug Safety, Boehringer Ingelheim Pharma GmbH & Co KG, Biberach (Riß), Germany
- Ulrich Deschl
- Department of Nonclinical Drug Safety, Boehringer Ingelheim Pharma GmbH & Co KG, Biberach (Riß), Germany
- Wolfgang Baumgärtner
- Department of Pathology, University of Veterinary Medicine Hannover, Hannover, Germany; Center of Systems Neuroscience, Hannover, Germany
- Reiner Ulrich
- Department of Pathology, University of Veterinary Medicine Hannover, Hannover, Germany; Center of Systems Neuroscience, Hannover, Germany; Department of Experimental Animal Facilities and Biorisk Management, Friedrich-Loeffler-Institute, Greifswald, Germany
4
Using contrast patterns between true complexes and random subgraphs in PPI networks to predict unknown protein complexes. Sci Rep 2016; 6:21223. [PMID: 26868667 PMCID: PMC4751475 DOI: 10.1038/srep21223] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Received: 10/08/2015] [Accepted: 01/19/2016] [Indexed: 02/02/2023] Open
Abstract
Most protein complex detection methods utilize unsupervised techniques to cluster densely connected nodes in a protein-protein interaction (PPI) network, in spite of the fact that many true complexes are not dense subgraphs. Supervised methods have been proposed recently, but they do not answer why a group of proteins are predicted as a complex, and they have not investigated how to detect new complexes of one species by training the model on the PPI data of another species. We propose a novel supervised method to address these issues. The key idea is to discover emerging patterns (EPs), a type of contrast pattern, which can clearly distinguish true complexes from random subgraphs in a PPI network. An integrative score of EPs is defined to measure how likely a subgraph of proteins can form a complex. New complexes thus can grow from our seed proteins by iteratively updating this score. The performance of our method is tested on eight benchmark PPI datasets and compared with seven unsupervised methods, two supervised and one semi-supervised methods under five standards to assess the quality of the predicted complexes. The results show that in most cases our method achieved a better performance, sometimes significantly.
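As a minimal sketch of the emerging-pattern idea described in this abstract, assume each subgraph has already been reduced to a set of discrete feature items. The item names, the toy data, and the helper names `support` and `growth_rate` are hypothetical and are not taken from the paper. A pattern's growth rate is commonly defined as the ratio of its support in one class to its support in the other, and a pattern with a large growth rate is treated as an emerging pattern:

```python
from typing import FrozenSet, List

Itemset = FrozenSet[str]


def support(pattern: Itemset, dataset: List[Itemset]) -> float:
    """Fraction of records in `dataset` that contain every item of `pattern`."""
    if not dataset:
        return 0.0
    return sum(pattern <= record for record in dataset) / len(dataset)


def growth_rate(pattern: Itemset, positives: List[Itemset], negatives: List[Itemset]) -> float:
    """Support ratio of `pattern` between the positive and negative class."""
    pos = support(pattern, positives)
    neg = support(pattern, negatives)
    if neg == 0.0:
        # The pattern never occurs in the negative class.
        return float("inf") if pos > 0.0 else 0.0
    return pos / neg


# Hypothetical, already discretized subgraph features (illustration only).
true_complexes = [
    frozenset({"density=high", "size=small"}),
    frozenset({"density=high", "size=large"}),
    frozenset({"density=low", "size=small"}),
    frozenset({"density=high", "size=small"}),
]
random_subgraphs = [
    frozenset({"density=low", "size=small"}),
    frozenset({"density=low", "size=large"}),
    frozenset({"density=high", "size=large"}),
    frozenset({"density=low", "size=large"}),
]

single_item = frozenset({"density=high"})
two_items = frozenset({"density=high", "size=small"})
print(growth_rate(single_item, true_complexes, random_subgraphs))
print(growth_rate(two_items, true_complexes, random_subgraphs))
```

The paper's integrative score combines many such patterns; that scoring step is not reproduced here because the abstract does not specify it.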
5
Liu X, Wu J, Gu F, Wang J, He Z. Discriminative pattern mining and its applications in bioinformatics. Brief Bioinform 2014; 16:884-900. [PMID: 25433466 DOI: 10.1093/bib/bbu042] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Received: 07/16/2014] [Indexed: 11/13/2022] Open
Abstract
Discriminative pattern mining is one of the most important techniques in data mining. This challenging task is concerned with finding a set of patterns that occur with disproportionate frequency in data sets with various class labels. Such patterns are of great value for group difference detection and classifier construction. Research on finding interesting discriminative patterns in class-labeled data evolves rapidly and lots of algorithms have been proposed to specifically address this problem. Discriminative pattern mining techniques have proven their considerable value in biological data analysis. The archetypical applications in bioinformatics include phosphorylation motif discovery, differentially expressed gene identification, discriminative genotype pattern detection, etc. In this article, we present an overview of discriminative pattern mining and the corresponding effective methods, and subsequently we illustrate their applications to tackling the bioinformatics problems. In the end, we give a general discussion of potential challenges and future work for this task.
6
Geman D, Ochs M, Price ND, Tomasetti C, Younes L. An argument for mechanism-based statistical inference in cancer. Hum Genet 2014; 134:479-95. [PMID: 25381197 DOI: 10.1007/s00439-014-1501-x] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Received: 04/21/2014] [Accepted: 10/14/2014] [Indexed: 01/07/2023]
Abstract
Cancer is perhaps the prototypical systems disease, and as such has been the focus of extensive study in quantitative systems biology. However, translating these programs into personalized clinical care remains elusive and incomplete. In this perspective, we argue that realizing this agenda—in particular, predicting disease phenotypes, progression and treatment response for individuals—requires going well beyond standard computational and bioinformatics tools and algorithms. It entails designing global mathematical models over network-scale configurations of genomic states and molecular concentrations, and learning the model parameters from limited available samples of high-dimensional and integrative omics data. As such, any plausible design should accommodate: biological mechanism, necessary for both feasible learning and interpretable decision making; stochasticity, to deal with uncertainty and observed variation at many scales; and a capacity for statistical inference at the patient level. This program, which requires a close, sustained collaboration between mathematicians and biologists, is illustrated in several contexts, including learning biomarkers, metabolism, cell signaling, network inference and tumorigenesis.
Affiliation(s)
- Donald Geman
- Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD, 21210, USA
7
Afsari B, Braga-Neto UM, Geman D. Rank discriminants for predicting phenotypes from RNA expression. Ann Appl Stat 2014. [DOI: 10.1214/14-aoas738] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Indexed: 11/19/2022]
8
Ulfenborg B, Klinga-Levan K, Olsson B. Classification of tumor samples from expression data using decision trunks. Cancer Inform 2013; 12:53-66. [PMID: 23467331 PMCID: PMC3579425 DOI: 10.4137/cin.s10356] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/08/2022] Open
Abstract
We present a novel machine learning approach for the classification of cancer samples using expression data. We refer to the method as “decision trunks,” since it is loosely based on decision trees, but contains several modifications designed to achieve an algorithm that: (1) produces smaller and more easily interpretable classifiers than decision trees; (2) is more robust in varying application scenarios; and (3) achieves higher classification accuracy. The decision trunk algorithm has been implemented and tested on 26 classification tasks, covering a wide range of cancer forms, experimental methods, and classification scenarios. This comprehensive evaluation indicates that the proposed algorithm performs at least as well as the current state of the art algorithms in terms of accuracy, while producing classifiers that include on average only 2–3 markers. We suggest that the resulting decision trunks have clear advantages over other classifiers due to their transparency, interpretability, and their correspondence with human decision-making and clinical testing practices.
Affiliation(s)
- Benjamin Ulfenborg
- Systems Biology Research Centre, School of Life Sciences, University of Skövde, Skövde, Sweden
9
Hengpraprohm S. GA-Based Classifier with SNR Weighted Features for Cancer Microarray Data Classification. International Journal of Signal Processing Systems 2013. [DOI: 10.12720/ijsps.1.1.29-33] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Indexed: 11/03/2022]
10
Genotype and phenotypes of an intestine-adapted Escherichia coli K-12 mutant selected by animal passage for superior colonization. Infect Immun 2011; 79:2430-9. [PMID: 21422176 DOI: 10.1128/iai.01199-10] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.1] [Indexed: 11/20/2022] Open
Abstract
We previously isolated a spontaneous mutant of Escherichia coli K-12, strain MG1655, following passage through the streptomycin-treated mouse intestine, that has colonization traits superior to the wild-type parent strain (M. P. Leatham et al., Infect. Immun. 73:8039-8049, 2005). This intestine-adapted strain (E. coli MG1655*) grew faster on several different carbon sources than the wild type and was nonmotile due to deletion of the flhD gene. We now report the results of several high-throughput genomic analysis approaches to further characterize E. coli MG1655*. Whole-genome pyrosequencing did not reveal any changes on its genome, aside from the deletion at the flhDC locus, that could explain the colonization advantage of E. coli MG1655*. Microarray analysis revealed modest yet significant induction of catabolic gene systems across the genome in both E. coli MG1655* and an isogenic flhD mutant constructed in the laboratory. Catabolome analysis with Biolog GN2 microplates revealed an enhanced ability of both E. coli MG1655* and the isogenic flhD mutant to oxidize a variety of carbon sources. The results show that intestine-adapted E. coli MG1655* is more fit than the wild type for intestinal colonization, because loss of FlhD results in elevated expression of genes involved in carbon and energy metabolism, resulting in more efficient carbon source utilization and a higher intestinal population. Hence, mutations that enhance metabolic efficiency confer a colonization advantage.
11
Eddy JA, Sung J, Geman D, Price ND. Relative expression analysis for molecular cancer diagnosis and prognosis. Technol Cancer Res Treat 2010; 9:149-59. [PMID: 20218737 DOI: 10.1177/153303461000900204] [Citation(s) in RCA: 87] [Impact Index Per Article: 6.2] [Indexed: 12/27/2022] Open
Abstract
The enormous amount of biomolecule measurement data generated from high-throughput technologies has brought an increased need for computational tools in biological analyses. Such tools can enhance our understanding of human health and genetic diseases, such as cancer, by accurately classifying phenotypes, detecting the presence of disease, discriminating among cancer sub-types, predicting clinical outcomes, and characterizing disease progression. In the case of gene expression microarray data, standard statistical learning methods have been used to identify classifiers that can accurately distinguish disease phenotypes. However, these mathematical prediction rules are often highly complex, and they lack the convenience and simplicity desired for extracting underlying biological meaning or transitioning into the clinic. In this review, we survey a powerful collection of computational methods for analyzing transcriptomic microarray data that address these limitations. Relative Expression Analysis (RXA) is based only on the relative orderings among the expressions of a small number of genes. Specifically, we provide a description of the first and simplest example of RXA, the K-TSP classifier, which is based on K pairs of genes; the case K = 1 is the TSP classifier. Given their simplicity and ease of biological interpretation, as well as their invariance to data normalization and parameter-fitting, these classifiers have been widely applied in aiding molecular diagnostics in a broad range of human cancers. We review several studies which demonstrate accurate classification of disease phenotypes (e.g., cancer vs. normal), cancer subclasses (e.g., AML vs. ALL, GIST vs. LMS), disease outcomes (e.g., metastasis, survival), and diverse human pathologies assayed through blood-borne leukocytes. The studies presented demonstrate that RXA, specifically the TSP and K-TSP classifiers, is a promising new class of computational methods for analyzing high-throughput data and has the potential to contribute significantly to molecular cancer diagnosis and prognosis.
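The K = 1 case mentioned in this abstract can be illustrated with a short sketch. It assumes the usual top-scoring-pair formulation, in which each gene pair (i, j) is scored by the absolute difference between the two classes in the frequency of the ordering x[i] < x[j]. The function names `fit_tsp` and `predict_tsp` and the toy expression values are hypothetical, and ties between equally scoring pairs are broken simply by taking the first pair found:

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Model = Tuple[int, int, int]  # (gene i, gene j, class predicted when x[i] < x[j])


def fit_tsp(samples: Sequence[Sequence[float]], labels: Sequence[int]) -> Model:
    """Pick the gene pair whose ordering differs most between class 0 and class 1.

    Assumes binary labels (0 or 1) and at least one sample per class.
    """
    n_genes = len(samples[0])
    best: Model = (0, 1, 0)
    best_score = -1.0
    for i, j in combinations(range(n_genes), 2):
        freq: List[float] = []
        for cls in (0, 1):
            rows = [x for x, y in zip(samples, labels) if y == cls]
            # Frequency of the ordering x[i] < x[j] within this class.
            freq.append(sum(x[i] < x[j] for x in rows) / len(rows))
        score = abs(freq[0] - freq[1])
        if score > best_score:
            best_score = score
            best = (i, j, 0 if freq[0] > freq[1] else 1)
    return best


def predict_tsp(model: Model, x: Sequence[float]) -> int:
    """Classify one sample using only the relative order of the selected pair."""
    i, j, class_if_less = model
    return class_if_less if x[i] < x[j] else 1 - class_if_less


# Hypothetical expression values for three genes (illustration only).
samples = [
    [1.0, 5.0, 3.0],
    [2.0, 6.0, 1.0],
    [0.5, 4.0, 9.0],
    [7.0, 2.0, 3.0],
    [8.0, 1.0, 6.0],
    [9.0, 3.0, 2.0],
]
labels = [0, 0, 0, 1, 1, 1]

model = fit_tsp(samples, labels)
print(model, predict_tsp(model, [10.0, 20.0, 0.0]))
```

Because the decision depends only on which of the two values is larger, rescaling a sample by any positive constant leaves the prediction unchanged, which is the normalization invariance the abstract refers to.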
Affiliation(s)
- James A Eddy
- Institute for Genomic Biology, University of Illinois, Urbana, IL 61801, USA
12
Forest classification trees and forest support vector machines algorithms: Demonstration using microarray data. Comput Biol Med 2010; 40:519-24. [DOI: 10.1016/j.compbiomed.2010.03.006] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Received: 06/27/2009] [Revised: 01/09/2010] [Accepted: 03/22/2010] [Indexed: 11/22/2022]
13
Geurts P, Irrthum A, Wehenkel L. Supervised learning with decision tree-based methods in computational and systems biology. Mol Biosyst 2009; 5:1593-605. [PMID: 20023720 DOI: 10.1039/b907946g] [Citation(s) in RCA: 124] [Impact Index Per Article: 8.3] [Indexed: 12/17/2022]
Abstract
At the intersection between artificial intelligence and statistics, supervised learning allows algorithms to automatically build predictive models from observations of a system alone. During the last twenty years, supervised learning has been a tool of choice to analyze the ever-growing and increasingly complex data generated in the context of molecular biology, with successful applications in genome annotation, function prediction, or biomarker discovery. Among supervised learning methods, decision tree-based methods stand out as nonparametric methods that have the unique feature of combining interpretability, efficiency, and, when used in ensembles of trees, excellent accuracy. The goal of this paper is to provide an accessible and comprehensive introduction to this class of methods. The first part of the review is devoted to an intuitive but complete description of decision tree-based methods and a discussion of their strengths and limitations with respect to other supervised learning methods. The second part of the review provides a survey of their applications in the context of computational and systems biology.
Affiliation(s)
- Pierre Geurts
- Department of EE and CS & GIGA-Research, University of Liège, Belgium.
14
Tang LJ, Du W, Fu HY, Jiang JH, Wu HL, Shen GL, Yu RQ. New Variable Selection Method Using Interval Segmentation Purity with Application to Blockwise Kernel Transform Support Vector Machine Classification of High-Dimensional Microarray Data. J Chem Inf Model 2009; 49:2002-9. [DOI: 10.1021/ci900032q] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 11/30/2022]
Affiliation(s)
- Li-Juan Tang
- State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, Hunan University, Changsha 410082, P. R. China
- Wen Du
- State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, Hunan University, Changsha 410082, P. R. China
- Hai-Yan Fu
- State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, Hunan University, Changsha 410082, P. R. China
- Jian-Hui Jiang
- State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, Hunan University, Changsha 410082, P. R. China
- Hai-Long Wu
- State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, Hunan University, Changsha 410082, P. R. China
- Guo-Li Shen
- State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, Hunan University, Changsha 410082, P. R. China
- Ru-Qin Yu
- State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, Hunan University, Changsha 410082, P. R. China
15
Classification and regression tree (CART) analyses of genomic signatures reveal sets of tetramers that discriminate temperature optima of archaea and bacteria. Archaea 2009; 2:159-67. [PMID: 19054742 DOI: 10.1155/2008/829730] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/17/2022]
Abstract
Classification and regression tree (CART) analysis was applied to genome-wide tetranucleotide frequencies (genomic signatures) of 195 archaea and bacteria. Although genomic signatures have typically been used to classify evolutionary divergence, in this study, convergent evolution was the focus. Temperature optima for most of the organisms examined could be distinguished by CART analyses of tetranucleotide frequencies. This suggests that pervasive (nonlinear) qualities of genomes may reflect certain environmental conditions (such as temperature) in which those genomes evolved. The predominant use of GAGA and AGGA as the discriminating tetramers in CART models suggests that purine-loading and codon biases of thermophiles may explain some of the results.
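The input features named in this abstract, genome-wide tetranucleotide frequencies, can be sketched as follows. This is a simplified illustration under stated assumptions: it counts overlapping 4-mers on a single strand, skips windows containing characters other than A, C, G or T, and normalizes by the number of counted windows. The function name `tetranucleotide_frequencies` and the toy sequence are hypothetical, and the study's exact normalization is not given in the abstract:

```python
from collections import Counter
from typing import Dict

VALID_BASES = set("ACGT")


def tetranucleotide_frequencies(sequence: str) -> Dict[str, float]:
    """Relative frequency of each overlapping 4-mer in `sequence` (single strand)."""
    seq = sequence.upper()
    counts: Counter = Counter()
    for start in range(len(seq) - 3):
        tetramer = seq[start:start + 4]
        # Skip windows with ambiguous bases such as N.
        if set(tetramer) <= VALID_BASES:
            counts[tetramer] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tetramer: n / total for tetramer, n in counts.items()}


# Hypothetical toy sequence (illustration only).
signature = tetranucleotide_frequencies("gagagaga")
print(signature)
```

A vector of such frequencies per genome, with up to 256 entries, is the kind of table a CART model could then be trained on.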
16
Classification tree based protein structure distances for testing sequence–structure correlation. Comput Biol Med 2008; 38:469-74. [DOI: 10.1016/j.compbiomed.2008.01.006] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 05/27/2007] [Accepted: 01/15/2008] [Indexed: 11/21/2022]
17
Sanden SV, Lin D, Burzykowski T. Performance of Gene Selection and Classification Methods in a Microarray Setting: A Simulation Study. Commun Stat Simul Comput 2008. [DOI: 10.1080/03610910701792554] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 10/22/2022]
18
Abstract
Within the growing body of proteomics studies, issues addressing problems of ecotoxicology are on the rise. Generally speaking, ecotoxicology uses quantitative expression changes of distinct proteins known to be involved in toxicological responses as biomarkers. Unlike these directed approaches, proteomics examines how multiple expression changes are associated with a contamination that is suspected to be detrimental. Consequently, proteins involved in toxicological responses that have not been described previously may be revealed. Following identification of key proteins indicating exposure or effect, proteomics can potentially be employed in environmental risk assessment. To this end, bioinformatics may unveil protein patterns specific to an environmental stress that would constitute a classifier able to distinguish an exposure from a control state. The combined use of sets of marker proteins associated with a given pollution impact may prove to be more reliable, as they are based not only on a few unique markers which are measured independently, but reflect the complexity of a toxicological response. Such a proteomic pattern might also integrate some of the already established biomarkers of environmental toxicity. Proteomics applications in ecotoxicology may also comprise functional examination of known classes of proteins, such as glutathione transferases or metallothioneins, to elucidate their toxicological responses.
Affiliation(s)
- Tiphaine Monsinjon
- Laboratoire d'Ecotoxicologie - Milieux Aquatiques, Université du Havre, Le Havre, France
19
Li J, Yang Q. Strong Compound-Risk Factors: Efficient Discovery Through Emerging Patterns and Contrast Sets. IEEE Trans Inf Technol Biomed 2007; 11:544-52. [PMID: 17912971 DOI: 10.1109/titb.2007.891163] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.0] [Indexed: 11/10/2022]
Abstract
Odds ratio (OR), relative risk (RR) (risk ratio), and absolute risk reduction (ARR) (risk difference) are biostatistical measures that are widely used for identifying significant risk factors in dichotomous groups of subjects. In the past, they have often been used to assess simple risk factors. In this paper, we introduce the concept of compound-risk factors to broaden the applicability of these statistical tests for assessing factor interplays. We observe that compound-risk factors with a high risk ratio or a large risk difference have a one-to-one correspondence to strong emerging patterns or strong contrast sets, two types of patterns that have been extensively studied in the data mining field. Such a relationship has been unknown to researchers in the past, and efficient algorithms for discovering strong compound-risk factors have been lacking. In this paper, we propose a theoretical framework and a new algorithm that unify the discovery of compound-risk factors that have a strong OR, risk ratio, or risk difference. Our method guarantees that all patterns meeting a certain test threshold can be efficiently discovered. Our contribution thus represents the first of its kind in linking the risk ratios and ORs to pattern mining algorithms, making it possible to find compound-risk factors in large-scale data sets. In addition, we show that using compound-risk factors can improve classification accuracy in probabilistic learning algorithms on several disease data sets, because these compound-risk factors capture the interdependency between important data attributes.
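The three measures named in this abstract follow the standard 2x2 contingency-table definitions, and a compound-risk factor can be read as a conjunction of attribute values. The sketch below illustrates only those definitions, not the paper's discovery algorithm. The helper names `contingency` and `risk_measures`, the attribute names, and the toy records are hypothetical, and the code assumes that no cell of the table makes a denominator zero:

```python
from typing import Dict, List, Tuple


def contingency(records: List[Dict[str, bool]], outcomes: List[bool],
                factor: Dict[str, bool]) -> Tuple[int, int, int, int]:
    """Return (a, b, c, d): exposed cases, exposed non-cases,
    unexposed cases, unexposed non-cases for a compound factor."""
    a = b = c = d = 0
    for record, diseased in zip(records, outcomes):
        # A subject is exposed only if every attribute of the factor matches.
        exposed = all(record.get(key) == value for key, value in factor.items())
        if exposed and diseased:
            a += 1
        elif exposed:
            b += 1
        elif diseased:
            c += 1
        else:
            d += 1
    return a, b, c, d


def risk_measures(a: int, b: int, c: int, d: int) -> Dict[str, float]:
    """Relative risk, odds ratio and risk difference (assumes nonzero denominators)."""
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return {
        "relative_risk": risk_exposed / risk_unexposed,
        "odds_ratio": (a * d) / (b * c),
        "risk_difference": risk_exposed - risk_unexposed,
    }


# Hypothetical toy cohort (illustration only).
records = [
    {"smoker": True, "obese": True},
    {"smoker": True, "obese": True},
    {"smoker": True, "obese": True},
    {"smoker": True, "obese": True},
    {"smoker": True, "obese": False},
    {"smoker": False, "obese": True},
    {"smoker": False, "obese": False},
    {"smoker": False, "obese": False},
]
outcomes = [True, True, True, False, True, False, False, False]

table = contingency(records, outcomes, {"smoker": True, "obese": True})
measures = risk_measures(*table)
print(table, measures)
```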
Affiliation(s)
- Jinyan Li
- Institute for Infocomm Research, Singapore 119613.
20
Chich JF, David O, Villers F, Schaeffer B, Lutomski D, Huet S. Statistics for proteomics: Experimental design and 2-DE differential analysis. J Chromatogr B Analyt Technol Biomed Life Sci 2007; 849:261-72. [PMID: 17081811 DOI: 10.1016/j.jchromb.2006.09.033] [Citation(s) in RCA: 72] [Impact Index Per Article: 4.2] [Received: 06/06/2006] [Revised: 08/25/2006] [Accepted: 09/08/2006] [Indexed: 11/24/2022]
Abstract
Proteomics relies on the separation of complex protein mixtures using bidimensional electrophoresis. This approach is widely used to detect variations in the expression of proteins prepared from two or more samples. Recently, attention has been drawn to the reliability of the results published in the literature. Among the critical points identified were experimental design, differential analysis and the problem of missing data, all problems where statistics can be of help. Using examples and terms understandable by biologists, we describe how a collaboration between biologists and statisticians can improve the reliability of results and confidence in conclusions.
Affiliation(s)
- Jean-François Chich
- INRA, Biologie Physico-Chimique des Prions, VIM 78352 Jouy-en-Josas Cedex, France.
21
Zintzaras E, Bai M, Douligeris C, Kowald A, Kanavaros P. A tree-based decision rule for identifying profile groups of cases without predefined classes: application in diffuse large B-cell lymphomas. Comput Biol Med 2006; 37:637-41. [PMID: 16895724 DOI: 10.1016/j.compbiomed.2006.06.001] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Received: 07/28/2005] [Revised: 05/06/2006] [Accepted: 06/05/2006] [Indexed: 10/24/2022]
Abstract
In this paper, we examined the utility of a forward growing classification tree as a supplement to cluster analysis for deriving a decision rule for the identification of profile groups when the cases do not belong to predefined classes. The technique was applied for the identification of low and high proliferation profile groups of diffuse large B-cell lymphomas according to the immunohistochemical expression levels of proliferation proteins. In a forward growing classification tree method, the size of the tree is controlled by the improvement (threshold value) in the apparent misclassification rate after each split. The classes used in the tree were defined using k-means clustering. The decision rule consisted of the splitting points of the split variables used. The methodology was applied to the histology data from 79 cases of diffuse large B-cell lymphomas. Ten classes of individual cases were derived from k-means clustering. Then, a classification tree with a threshold of 2% was used to derive the decision rule. Branches at the left side of the tree consisted of individuals with a low proliferation profile and branches at the right side of the tree consisted of cases with a high proliferation profile. The classification tree, as a supplement method, not only identified but also provided decision rules for identifying profile groups. Finally, it also allowed for exploration of the data structure.
Affiliation(s)
- Elias Zintzaras
- Department of Biomathematics, University of Thessaly School of Medicine, Larissa, Greece.
|
22
|
Sidhu A, Yang ZR. Prediction of signal peptides using bio-basis function neural networks and decision trees. Appl Bioinformatics 2006; 5:13-9. [PMID: 16539533 DOI: 10.2165/00822942-200605010-00002] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Indexed: 11/02/2022]
Abstract
Signal peptide identification is of immense importance in drug design. Accurate identification of signal peptides is the first critical step to be able to change the direction of the targeting proteins and use the designed drug to target a specific organelle to correct a defect. Experimental identification is the most accurate method but is expensive and time-consuming, so an efficient and affordable automated system is of great interest. In this article, we propose using an adapted neural network, called a bio-basis function neural network, and decision trees for predicting signal peptides. The bio-basis function neural network model and decision trees achieved 97.16% and 97.63% accuracy, respectively, demonstrating that the methods work well for the prediction of signal peptides. Moreover, decision trees revealed that position P(1'), which is important in forming signal peptides, most commonly comprises either leucine or alanine. This concurs with the (P(3)-P(1)-P(1')) coupling model.
Affiliation(s)
- Ateesh Sidhu
- Biological Science, University of Warwick, Coventry, UK.
|
23
|
Boulesteix AL, Tutz G. Identification of interaction patterns and classification with applications to microarray data. Comput Stat Data Anal 2006. [DOI: 10.1016/j.csda.2004.10.004] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Indexed: 11/29/2022]
|
24
|
Alexe G, Alexe S, Axelrod DE, Bonates TO, Lozina II, Reiss M, Hammer PL. Breast cancer prognosis by combinatorial analysis of gene expression data. Breast Cancer Res 2006; 8:R41. [PMID: 16859500 PMCID: PMC1779471 DOI: 10.1186/bcr1512] [Citation(s) in RCA: 47] [Impact Index Per Article: 2.6] [Received: 02/27/2005] [Revised: 06/15/2006] [Accepted: 06/15/2006] [Indexed: 01/25/2023] Open
Abstract
INTRODUCTION The potential of applying data analysis tools to microarray data for diagnosis and prognosis is illustrated on the recent breast cancer dataset of van 't Veer and coworkers. We re-examine that dataset using the novel technique of logical analysis of data (LAD), with the double objective of discovering patterns characteristic for cases with good or poor outcome, using them for accurate and justifiable predictions; and deriving novel information about the role of genes, the existence of special classes of cases, and other factors. METHOD Data were analyzed using the combinatorics and optimization-based method of LAD, recently shown to provide highly accurate diagnostic and prognostic systems in cardiology, cancer proteomics, hematology, pulmonology, and other disciplines. RESULTS LAD identified a subset of 17 of the 25,000 genes, capable of fully distinguishing between patients with poor and good prognoses. An extensive list of 'patterns' or 'combinatorial biomarkers' (that is, combinations of genes and limitations on their expression levels) was generated, and 40 patterns were used to create a prognostic system, shown to have 100% and 92.9% weighted accuracy on the training and test sets, respectively. The prognostic system uses fewer genes than other methods, and has similar or better accuracy than those reported in other studies. Out of the 17 genes identified by LAD, three (respectively, five) were shown to play a significant role in determining poor (respectively, good) prognosis. Two new classes of patients (described by similar sets of covering patterns, gene expression ranges, and clinical features) were discovered. As a by-product of the study, it is shown that the training and the test sets of van 't Veer have differing characteristics. CONCLUSION The study shows that LAD provides an accurate and fully explanatory prognostic system for breast cancer using genomic data (that is, a system that, in addition to predicting good or poor prognosis, provides an individualized explanation of the reasons for that prognosis for each patient). Moreover, the LAD model provides valuable insights into the roles of individual and combinatorial biomarkers, allows the discovery of new classes of patients, and generates a vast library of biomedical research hypotheses.
Affiliation(s)
- Gabriela Alexe
- RUTCOR (Rutgers University Center for Operations Research), Piscataway, New Jersey, USA
- Computational Biology Center, TJ Watson IBM Research, Yorktown Heights, New York, USA
- The Simons Center for Systems Biology, Institute for Advanced Study, Princeton, New Jersey, USA
- Sorin Alexe
- RUTCOR (Rutgers University Center for Operations Research), Piscataway, New Jersey, USA
- David E Axelrod
- Department of Genetics, Rutgers University, Piscataway, New Jersey, USA
- The Cancer Institute of New Jersey, New Brunswick, New Jersey, USA
- Tibérius O Bonates
- RUTCOR (Rutgers University Center for Operations Research), Piscataway, New Jersey, USA
- Irina I Lozina
- RUTCOR (Rutgers University Center for Operations Research), Piscataway, New Jersey, USA
- Michael Reiss
- The Cancer Institute of New Jersey, New Brunswick, New Jersey, USA
- Division of Medical Oncology, UMDNJ-Robert Wood Johnson Medical School, New Brunswick, New Jersey, USA
- Peter L Hammer
- RUTCOR (Rutgers University Center for Operations Research), Piscataway, New Jersey, USA
|
25
|
Berchuck A, Iversen ES, Lancaster JM, Pittman J, Luo J, Lee P, Murphy S, Dressman HK, Febbo PG, West M, Nevins JR, Marks JR. Patterns of gene expression that characterize long-term survival in advanced stage serous ovarian cancers. Clin Cancer Res 2005; 11:3686-96. [PMID: 15897565 DOI: 10.1158/1078-0432.ccr-04-2398] [Citation(s) in RCA: 212] [Impact Index Per Article: 11.2] [Indexed: 11/16/2022]
Abstract
PURPOSE A better understanding of the underlying biology of invasive serous ovarian cancer is critical for the development of early detection strategies and new therapeutics. The objective of this study was to define gene expression patterns associated with favorable survival. EXPERIMENTAL DESIGN RNA from 65 serous ovarian cancers was analyzed using Affymetrix U133A microarrays. This included 54 stage III/IV cases (30 short-term survivors who lived <3 years and 24 long-term survivors who lived >7 years) and 11 stage I/II cases. Genes were screened on the basis of their level of and variability in expression, leaving 7,821 for use in developing a predictive model for survival. A composite predictive model was developed that combines Bayesian classification tree and multivariate discriminant models. Leave-one-out cross-validation was used to select and evaluate models. RESULTS Patterns of genes were identified that distinguish short-term and long-term ovarian cancer survivors. The expression model developed for advanced stage disease classified all 11 early-stage ovarian cancers as long-term survivors. The MAL gene, which has been shown to confer resistance to cancer therapy, was most highly overexpressed in short-term survivors (3-fold compared with long-term survivors, and 29-fold compared with early-stage cases). These results suggest that gene expression patterns underlie differences in outcome, and an examination of the genes that provide this discrimination reveals that many are implicated in processes that define the malignant phenotype. CONCLUSIONS Differences in survival of advanced ovarian cancers are reflected by distinct patterns of gene expression. This biological distinction is further emphasized by the finding that early-stage cancers share expression patterns with the advanced stage long-term survivors, suggesting a shared favorable biology.
Affiliation(s)
- Andrew Berchuck
- Department of Obstetrics and Gynecology/Division of Gynecologic Oncology, Institute of Statistics and Decision Sciences, Center for Applied Genomics and Technology, Duke University Medical Center, Durham, North Carolina, USA.
|
26
|
Yang ZR. Mining SARS-CoV protease cleavage data using non-orthogonal decision trees: a novel method for decisive template selection. Bioinformatics 2005; 21:2644-50. [PMID: 15797903 PMCID: PMC7197706 DOI: 10.1093/bioinformatics/bti404] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Received: 11/14/2004] [Revised: 02/07/2005] [Accepted: 03/22/2005] [Indexed: 12/02/2022] Open
Abstract
MOTIVATION Although the outbreak of the severe acute respiratory syndrome (SARS) is currently over, it is expected that it will return to attack human beings. A critical challenge to scientists from various disciplines worldwide is to study the specificity of cleavage activity of SARS-related coronavirus (SARS-CoV) and use the knowledge obtained from the study for effective inhibitor design to fight the disease. The most commonly used inductive programming methods for knowledge discovery from data assume that the elements of input patterns are orthogonal to each other. Suppose a sub-sequence is denoted as P2-P1-P1'-P2'; the conventional inductive programming method may result in a rule like 'if P1 = Q, then the sub-sequence is cleaved, otherwise non-cleaved'. If the site P1 is not orthogonal to the others (for instance, P2, P1' and P2'), the prediction power of these kinds of rules may be limited. Therefore this study is aimed at developing a novel method for constructing non-orthogonal decision trees for mining protease data. RESULT Eighteen sequences of coronavirus polyprotein were downloaded from NCBI (http://www.ncbi.nlm.nih.gov). Among these sequences, 252 cleavage sites were experimentally determined. These sequences were scanned using a sliding window with size k to generate about 50,000 k-mer sub-sequences (for short, k-mers). The value of k varies from 4 to 12 with a gap of two. The bio-basis function proposed by Thomson et al. is used to transform the k-mers to a high-dimensional numerical space on which an inductive programming method is applied for the purpose of deriving a decision tree for decision-making. The process of this transform is referred to as a bio-mapping. The constructed decision trees select about 10 out of 50,000 k-mers. This small set of selected k-mers is regarded as a set of decisive templates. By doing so, non-orthogonal decision trees are constructed using the selected templates and the prediction accuracy is significantly improved.
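The sliding-window scan mentioned in this abstract can be illustrated with a short sketch. It only covers k-mer generation for window sizes 4 to 12 in steps of two; the bio-basis transform and tree construction are not shown. The function name, the one-residue step, and the toy fragment are assumptions made for illustration, not details taken from the paper.

```python
def sliding_kmers(sequence, k):
    """Return every length-k sub-sequence, sliding the window one residue at a time."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# Window sizes 4, 6, 8, 10, 12 ("from 4 to 12 with a gap of two").
WINDOW_SIZES = list(range(4, 13, 2))

# Hypothetical 12-residue fragment, not taken from the study data.
toy_sequence = "AVLQSGFRKMAF"
kmers_by_size = {k: sliding_kmers(toy_sequence, k) for k in WINDOW_SIZES}
```

A sequence of length n yields n - k + 1 windows for each k, which is how a modest number of polyprotein sequences can produce tens of thousands of k-mers.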
Affiliation(s)
- Zheng Rong Yang
- Department of Computer Science, Exeter University, United Kingdom.
|
27
|
Geman D, d'Avignon C, Naiman DQ, Winslow RL. Classifying gene expression profiles from pairwise mRNA comparisons. Stat Appl Genet Mol Biol 2004; 3:Article19. [PMID: 16646797 PMCID: PMC1989150 DOI: 10.2202/1544-6115.1071] [Citation(s) in RCA: 226] [Impact Index Per Article: 11.3] [Indexed: 01/11/2023]
Abstract
We present a new approach to molecular classification based on mRNA comparisons. Our method, referred to as the top-scoring pair(s) (TSP) classifier, is motivated by current technical and practical limitations in using gene expression microarray data for class prediction, for example to detect disease, identify tumors or predict treatment response. Accurate statistical inference from such data is difficult due to the small number of observations, typically tens, relative to the large number of genes, typically thousands. Moreover, conventional methods from machine learning lead to decisions which are usually very difficult to interpret in simple or biologically meaningful terms. In contrast, the TSP classifier provides decision rules which i) involve very few genes and only relative expression values (e.g., comparing the mRNA counts within a single pair of genes); ii) are both accurate and transparent; and iii) provide specific hypotheses for follow-up studies. In particular, the TSP classifier achieves prediction rates with standard cancer data that are as high as those of previous studies which use considerably more genes and complex procedures. Finally, the TSP classifier is parameter-free, thus avoiding the type of over-fitting and inflated estimates of performance that result when all aspects of learning a predictor are not properly cross-validated.
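A decision rule based on the relative expression of a single gene pair can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' code: it scores each pair by how much the frequency of the ordering X_i < X_j differs between two classes, which is one common way the top-scoring pair idea is described. The function names, tie handling, and toy data are hypothetical.

```python
from itertools import combinations

def fit_tsp(samples, labels):
    """Pick the gene pair (i, j) whose ordering X_i < X_j differs most between two classes."""
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    best = None
    for i, j in combinations(range(len(samples[0])), 2):
        freq = []
        for c in classes:
            rows = [x for x, y in zip(samples, labels) if y == c]
            freq.append(sum(x[i] < x[j] for x in rows) / len(rows))
        score = abs(freq[0] - freq[1])
        if best is None or score > best[0]:  # ties keep the first pair found
            best = (score, i, j, freq)
    score, i, j, freq = best
    # The class predicted when X_i < X_j is the one in which that ordering is more frequent.
    if_less = classes[0] if freq[0] > freq[1] else classes[1]
    if_not = classes[1] if if_less == classes[0] else classes[0]
    return {"pair": (i, j), "score": score, "if_less": if_less, "if_not": if_not}

def predict_tsp(model, sample):
    """Classify one sample by comparing the two selected genes only."""
    i, j = model["pair"]
    return model["if_less"] if sample[i] < sample[j] else model["if_not"]

# Hypothetical expression values for three genes in four samples.
samples = [[1.0, 5.0, 3.0], [2.0, 6.0, 1.5], [5.0, 1.0, 3.0], [7.0, 2.0, 2.0]]
labels = ["normal", "normal", "tumor", "tumor"]
model = fit_tsp(samples, labels)
```

Because the rule depends only on which of two values is larger within a sample, it is unaffected by any monotone rescaling of that sample's measurements, and the pair selection involves no tuning parameters.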
Affiliation(s)
- Donald Geman
- Center for Cardiovascular Bioinformatics and Modeling, Whitaker Biomedical Engineering Institute and Department of Applied Mathematics and Statistics, Johns Hopkins University
- Christian d'Avignon
- Center for Cardiovascular Bioinformatics and Modeling, Whitaker Biomedical Engineering Institute and Department of Biomedical Engineering, Johns Hopkins University
- Daniel Q. Naiman
- Center for Cardiovascular Bioinformatics and Modeling, Whitaker Biomedical Engineering Institute and Department of Applied Mathematics and Statistics, Johns Hopkins University
- Raimond L. Winslow
- Center for Cardiovascular Bioinformatics and Modeling, Whitaker Biomedical Engineering Institute, and Department of Biomedical Engineering, Johns Hopkins University
|