1
Madill-Thomsen K, Halloran P. Precision diagnostics in transplanted organs using microarray-assessed gene expression: concepts and technical methods of the Molecular Microscope® Diagnostic System (MMDx). Clin Sci (Lond) 2024; 138:663-685. [PMID: 38819301 PMCID: PMC11147747 DOI: 10.1042/cs20220530] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/22/2024] [Revised: 04/26/2024] [Accepted: 05/02/2024] [Indexed: 06/01/2024]
Abstract
There is a major unmet need for improved accuracy and precision in the assessment of transplant rejection and tissue injury. Diagnoses relying on histologic and visual assessments demonstrate significant variation between expert observers (as represented by low kappa values) and have limited ability to assess many biological processes that produce little histologic change, for example, acute injury. Consensus rules and guidelines for histologic diagnosis are useful but may have errors. Risks of over- or under-treatment can be serious: many therapies for transplant rejection or primary diseases are expensive and carry risk for significant adverse effects. Improved diagnostic methods could alleviate healthcare costs by reducing treatment errors, increase treatment efficacy, and serve as useful endpoints for clinical trials of new agents that can improve outcomes. Molecular diagnostic assessments using microarrays combined with machine learning algorithms for interpretation have shown promise for increasing diagnostic precision via probabilistic assessments, recalibrating standard-of-care diagnostic methods, clarifying ambiguous cases, and identifying potentially missed cases of rejection. This review describes the development and application of the Molecular Microscope® Diagnostic System (MMDx), and discusses the history and reasoning behind many common methods, statistical practices, and computational decisions employed to ensure that MMDx scores are as accurate and precise as possible. MMDx provides insights into disease processes and highly reproducible results from a comparatively small amount of tissue and constitutes a general approach that is useful in many areas of medicine, including kidney, heart, lung, and liver transplants, with the possibility of extrapolating lessons for understanding native organ disease states.
Affiliation(s)
- Katelynn S. Madill-Thomsen
- Department of Medicine, University of Alberta, Edmonton, AB, Canada
- Alberta Transplant Applied Genomics Center, University of Alberta, Edmonton, AB, Canada
- Philip F. Halloran
- Department of Medicine, University of Alberta, Edmonton, AB, Canada
- Alberta Transplant Applied Genomics Center, University of Alberta, Edmonton, AB, Canada
2
Gene Expression Signature Associated with Clinical Outcome in ALK-Positive Anaplastic Large Cell Lymphoma. Cancers (Basel) 2021; 13:cancers13215523. [PMID: 34771686 PMCID: PMC8582782 DOI: 10.3390/cancers13215523] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/27/2021] [Revised: 10/15/2021] [Accepted: 10/28/2021] [Indexed: 11/17/2022]
Abstract
Simple Summary: Anaplastic large cell lymphomas associated with ALK translocation have a good outcome after CHOP treatment; however, the 2-year relapse rate remains at 30%. Microarray gene-expression profiling, high throughput RT-qPCR, and RNA sequencing of 48 ALK-positive anaplastic large cell lymphoma (ALK+ ALCL) samples obtained at diagnosis enable the identification of genes associated with clinical outcome. More particularly, our molecular signatures indicate that the FN1 gene, a matrix key regulator, might also be involved in the prognosis and the therapeutic response in anaplastic lymphomas.
Abstract: Anaplastic large cell lymphomas associated with ALK translocation have a good outcome after CHOP treatment; however, the 2-year relapse rate remains at 30%. Microarray gene-expression profiling of 48 samples obtained at diagnosis was used to identify 47 genes that were differentially expressed between patients with early relapse/progression and no relapse. In the relapsing group, the most significant overrepresented genes were related to the regulation of the immune response and T-cell activation while those in the non-relapsing group were involved in the extracellular matrix. Fluidigm technology gave concordant results for 29 genes, of which FN1, FAM179A, and SLC40A1 had the strongest predictive power after logistic regression and two classification algorithms. In parallel with 39 samples, we used a Kallisto/Sleuth pipeline to analyze RNA sequencing data and identified 20 genes common to the 28 genes validated by Fluidigm technology, notably the FAM179A and FN1 genes. Interestingly, FN1 also belongs to the gene signature predicting longer survival in diffuse large B-cell lymphomas treated with CHOP. Thus, our molecular signatures indicate that the FN1 gene, a matrix key regulator, might also be involved in the prognosis and the therapeutic response in anaplastic lymphomas.
3
Lu M. An embedded method for gene identification problems involving unwanted data heterogeneity. Hum Genomics 2019; 13:45. [PMID: 31639059 PMCID: PMC6805328 DOI: 10.1186/s40246-019-0228-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/10/2022]
Abstract
BACKGROUND: Modern applications such as bioinformatics, which collect data in various ways, can easily produce heterogeneous data. Traditional variable selection methods assume samples are independent and identically distributed, an assumption that does not suit these applications. Some existing statistical models capable of handling unwanted variation were developed for gene identification involving heterogeneous data, but they lack model predictability and suffer from variable redundancy.
RESULTS: By accounting for the unwanted heterogeneity effectively, our method has shown its superiority over several state-of-the-art methods, as validated by experimental results in both unsupervised and supervised gene identification problems. Moreover, we also applied our method to a pan-cancer study, where it identified the most discriminative genes that best distinguish different cancer types.
CONCLUSIONS: This article provides an alternative gene identification method that can account for unwanted data heterogeneity. It is a promising method to provide new insights into complex cancer biology and clues for understanding tumorigenesis and tumor progression.
Affiliation(s)
- Meng Lu
- Department of Information Management, Tianjin University, Tianjin, China
4
Weber LM, Saelens W, Cannoodt R, Soneson C, Hapfelmeier A, Gardner PP, Boulesteix AL, Saeys Y, Robinson MD. Essential guidelines for computational method benchmarking. Genome Biol 2019; 20:125. [PMID: 31221194 PMCID: PMC6584985 DOI: 10.1186/s13059-019-1738-8] [Citation(s) in RCA: 77] [Impact Index Per Article: 15.4] [Indexed: 12/14/2022]
Abstract
In computational biology and other sciences, researchers are frequently faced with a choice between several computational methods for performing data analyses. Benchmarking studies aim to rigorously compare the performance of different methods using well-characterized benchmark datasets, to determine the strengths of each method or to provide recommendations regarding suitable choices of methods for an analysis. However, benchmarking studies must be carefully designed and implemented to provide accurate, unbiased, and informative results. Here, we summarize key practical guidelines and recommendations for performing high-quality benchmarking analyses, based on our experiences in computational biology.
Affiliation(s)
- Lukas M Weber
- Institute of Molecular Life Sciences, University of Zurich, 8057, Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, University of Zurich, 8057, Zurich, Switzerland
- Wouter Saelens
- Data Mining and Modelling for Biomedicine, VIB Center for Inflammation Research, 9052, Ghent, Belgium
- Department of Applied Mathematics, Computer Science and Statistics, Ghent University, 9000, Ghent, Belgium
- Robrecht Cannoodt
- Data Mining and Modelling for Biomedicine, VIB Center for Inflammation Research, 9052, Ghent, Belgium
- Department of Applied Mathematics, Computer Science and Statistics, Ghent University, 9000, Ghent, Belgium
- Charlotte Soneson
- Institute of Molecular Life Sciences, University of Zurich, 8057, Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, University of Zurich, 8057, Zurich, Switzerland
- Present address: Friedrich Miescher Institute for Biomedical Research and SIB Swiss Institute of Bioinformatics, 4058, Basel, Switzerland
- Alexander Hapfelmeier
- Institute of Medical Informatics, Statistics and Epidemiology, Technical University of Munich, 81675, Munich, Germany
- Paul P Gardner
- Department of Biochemistry, University of Otago, Dunedin, 9016, New Zealand
- Anne-Laure Boulesteix
- Institute for Medical Information Processing, Biometry and Epidemiology, Ludwig-Maximilians-University, 81377, Munich, Germany
- Yvan Saeys
- Data Mining and Modelling for Biomedicine, VIB Center for Inflammation Research, 9052, Ghent, Belgium
- Department of Applied Mathematics, Computer Science and Statistics, Ghent University, 9000, Ghent, Belgium
- Mark D Robinson
- Institute of Molecular Life Sciences, University of Zurich, 8057, Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, University of Zurich, 8057, Zurich, Switzerland
5
Tian L, Dong X, Freytag S, Lê Cao KA, Su S, JalalAbadi A, Amann-Zalcenstein D, Weber TS, Seidi A, Jabbari JS, Naik SH, Ritchie ME. Benchmarking single cell RNA-sequencing analysis pipelines using mixture control experiments. Nat Methods 2019; 16:479-487. [DOI: 10.1038/s41592-019-0425-8] [Citation(s) in RCA: 183] [Impact Index Per Article: 36.6] [Received: 10/03/2018] [Accepted: 04/18/2019] [Indexed: 11/09/2022]
6
Ahmed W, Malik MFA, Saeed M, Haq F. Copy number profiling of Oncotype DX genes reveals association with survival of breast cancer patients. Mol Biol Rep 2018; 45:2185-2192. [PMID: 30225582 DOI: 10.1007/s11033-018-4379-1] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 08/07/2018] [Accepted: 09/10/2018] [Indexed: 12/17/2022]
Abstract
Copy number variations (CNVs) are key contributors to breast cancer initiation and progression. However, to date, no CNV-based gene signature has been developed for breast cancer. The 21-gene Oncotype DX, a clinically validated signature, was identified using only RNA expression data in breast cancer patients. In this study, we evaluated whether CNVs of Oncotype DX genes can be used to predict the prognosis of breast cancer patients. Transcriptomic data of 547 and genomic data of 816 breast cancer patients were downloaded from The Cancer Genome Atlas database. To establish the prognostic relevance between the CNVs of Oncotype DX genes and clinicopathological features, statistical analyses including Pearson correlation, Fisher's exact, chi-square, Kaplan-Meier survival and Cox regression analyses were performed. 86% of the genes showed positive CNV-expression correlation. CNVs in 52% and 47.6% of the genes showed association with ER+ and PR+ status, respectively. 71% of the genes (including ERBB2, CTSV, CD68, GRB7, MKI67, MMP1, PGR, RPLP0, TFRC, BAG1, BCL2, BIRC5, FLNB, GSTM1 and SCUBE2) showed association with poor overall survival. 14% of the genes (including CTSV, RPLP0 and BIRC5) showed association with disease-free survival. Cox regression analysis revealed ESR1, metastasis and node stage as independent prognostic factors for overall survival of breast cancer patients. The results suggested that a CNV-based assay of Oncotype DX genes can be used to predict the survival of breast cancer patients. In the future, identifying new gene signatures for better breast cancer prognosis using CNV-level information will be worth investigating.
Affiliation(s)
- Washaakh Ahmed
- National Center for Bioinformatics, Quaid-i-Azam University, Islamabad, Pakistan
- Department of Biosciences, COMSATS University, Islamabad, Pakistan
- Muhammad Saeed
- Department of Biosciences, COMSATS University, Islamabad, Pakistan
- Farhan Haq
- Department of Biosciences, COMSATS University, Islamabad, Pakistan
7
Transcriptomic analysis of the heat stress response for a commercial baker's yeast Saccharomyces cerevisiae. Genes Genomics 2018; 40:137-150. [PMID: 29892925 DOI: 10.1007/s13258-017-0616-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 05/31/2017] [Accepted: 10/01/2017] [Indexed: 10/18/2022]
Abstract
The aim of this study is to explore the effects of heat stress on global gene expression profiles and to identify candidate genes for the heat stress response in commercial baker's yeast (Saccharomyces cerevisiae) by using microarray technology and comparative statistical data analyses. The data from all hybridizations and array normalization were analyzed using GeneSpringGX 12.1 (Agilent) and the R 2.15.2 programming language. In the analysis, all required statistical methods were performed comparatively. For the normalization step, among alternatives, the RMA (Robust Multi-array Average) results were used. To determine differentially expressed genes under heat stress treatments, the fold-change and hypothesis testing approaches were executed under various cut-off values via different multiple testing procedures; the up/down regulated probes were then functionally categorized via the PAMSAM clustering. The analysis showed that the transcriptome changes under heat shock. Moreover, the temperature-shift stress treatments show that the number of differentially up-regulated genes among the heat shock proteins and transcription factors changed significantly. Finally, the change in temperature is one of the important environmental conditions affecting propagation and industrial application of baker's yeast. This study statistically analyzes this effect via one-channel microarray data.
8
Holik AZ, Law CW, Liu R, Wang Z, Wang W, Ahn J, Asselin-Labat ML, Smyth GK, Ritchie ME. RNA-seq mixology: designing realistic control experiments to compare protocols and analysis methods. Nucleic Acids Res 2017; 45:e30. [PMID: 27899618 PMCID: PMC5389713 DOI: 10.1093/nar/gkw1063] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.7] [Received: 07/18/2016] [Accepted: 10/24/2016] [Indexed: 11/25/2022]
Abstract
Carefully designed control experiments provide a gold standard for benchmarking different genomics research tools. A shortcoming of many gene expression control studies is that replication involves profiling the same reference RNA sample multiple times. This leads to low, pure technical noise that is atypical of regular studies. To achieve a more realistic noise structure, we generated an RNA-sequencing mixture experiment using two cell lines of the same cancer type. Variability was added by extracting RNA from independent cell cultures and degrading particular samples. The systematic gene expression changes induced by this design allowed benchmarking of different library preparation kits (standard poly-A versus total RNA with Ribozero depletion) and analysis pipelines. Data generated using the total RNA kit had more signal for introns and various RNA classes (ncRNA, snRNA, snoRNA) and less variability after degradation. For differential expression analysis, voom with quality weights marginally outperformed other popular methods, while for differential splicing, DEXSeq was simultaneously the most sensitive and the most inconsistent method. For sample deconvolution analysis, DeMix outperformed IsoPure convincingly. Our RNA-sequencing data set provides a valuable resource for benchmarking different protocols and data pre-processing workflows. The extra noise mimics routine lab experiments more closely, ensuring any conclusions are widely applicable.
Affiliation(s)
- Aliaksei Z Holik
- ACRF Stem Cells and Cancer Division, The Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Victoria 3052, Australia
- Department of Medical Biology, The University of Melbourne, Parkville, Victoria 3010, Australia
- Charity W Law
- Department of Medical Biology, The University of Melbourne, Parkville, Victoria 3010, Australia
- Molecular Medicine Division, The Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Victoria 3052, Australia
- Ruijie Liu
- Molecular Medicine Division, The Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Victoria 3052, Australia
- Zeya Wang
- Statistics Department, George R. Brown School of Engineering, Rice University, 6100 Main Street, Duncan Hall 2124, Houston, TX 77005, USA
- Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, 1515 Holcombe Boulevard, Houston, TX 77030, USA
- Wenyi Wang
- Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, 1515 Holcombe Boulevard, Houston, TX 77030, USA
- Jaeil Ahn
- Department of Biostatistics, Bioinformatics and Biomathematics, Georgetown University School of Medicine, 4000 Reservoir Road NW, Washington, DC 20057, USA
- Marie-Liesse Asselin-Labat
- ACRF Stem Cells and Cancer Division, The Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Victoria 3052, Australia
- Department of Medical Biology, The University of Melbourne, Parkville, Victoria 3010, Australia
- Gordon K Smyth
- Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Victoria 3052, Australia
- School of Mathematics and Statistics, The University of Melbourne, Parkville, Victoria 3010, Australia
- Matthew E Ritchie
- Department of Medical Biology, The University of Melbourne, Parkville, Victoria 3010, Australia
- Molecular Medicine Division, The Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Victoria 3052, Australia
- School of Mathematics and Statistics, The University of Melbourne, Parkville, Victoria 3010, Australia
9
Strbenac D, Zhong L, Raftery MJ, Wang P, Wilson SR, Armstrong NJ, Yang JYH. Quantitative Performance Evaluator for Proteomics (QPEP): Web-based Application for Reproducible Evaluation of Proteomics Preprocessing Methods. J Proteome Res 2017; 16:2359-2369. [DOI: 10.1021/acs.jproteome.6b00882] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Indexed: 01/15/2023]
Affiliation(s)
- Dario Strbenac
- School of Mathematics and Statistics, University of Sydney, Sydney, New South Wales 2006, Australia
- Ling Zhong
- Bioanalytical Mass Spectrometry Facility, University of New South Wales, Sydney, New South Wales 2052, Australia
- Mark J. Raftery
- Bioanalytical Mass Spectrometry Facility, University of New South Wales, Sydney, New South Wales 2052, Australia
- Penghao Wang
- School of Mathematics and Statistics, University of Sydney, Sydney, New South Wales 2006, Australia
- Susan R. Wilson
- School of Mathematics & Statistics, University of New South Wales, Sydney, New South Wales 2052, Australia
- Centre for Mathematics and its Applications, Mathematical Sciences Institute, Australian National University, Canberra, Australian Capital Territory 0200, Australia
- Nicola J. Armstrong
- School of Mathematics and Statistics, University of Sydney, Sydney, New South Wales 2006, Australia
- Jean Y. H. Yang
- School of Mathematics and Statistics, University of Sydney, Sydney, New South Wales 2006, Australia
10
Pan JC, Huang Y, Hwang JG. Estimation of selected parameters. Comput Stat Data Anal 2017. [DOI: 10.1016/j.csda.2016.11.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
11
Teng M, Love MI, Davis CA, Djebali S, Dobin A, Graveley BR, Li S, Mason CE, Olson S, Pervouchine D, Sloan CA, Wei X, Zhan L, Irizarry RA. A benchmark for RNA-seq quantification pipelines. Genome Biol 2016; 17:74. [PMID: 27107712 PMCID: PMC4842274 DOI: 10.1186/s13059-016-0940-1] [Citation(s) in RCA: 119] [Impact Index Per Article: 14.9] [Received: 08/06/2015] [Accepted: 04/08/2016] [Indexed: 02/07/2023]
Abstract
Obtaining RNA-seq measurements involves a complex data analytical process with a large number of competing algorithms as options. There is much debate about which of these methods provides the best approach. Unfortunately, it is currently difficult to evaluate their performance due in part to a lack of sensitive assessment metrics. We present a series of statistical summaries and plots to evaluate the performance in terms of specificity and sensitivity, available as an R/Bioconductor package (http://bioconductor.org/packages/rnaseqcomp). Using two independent datasets, we assessed seven competing pipelines. Performance was generally poor, with two methods clearly underperforming and RSEM slightly outperforming the rest.
Affiliation(s)
- Mingxiang Teng
- Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, 450 Brookline Avenue, Boston, MA, 02215, USA
- Department of Biostatistics, Harvard TH Chan School of Public Health, 677 Huntington Avenue, Boston, MA, 02115, USA
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Michael I Love
- Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, 450 Brookline Avenue, Boston, MA, 02215, USA
- Department of Biostatistics, Harvard TH Chan School of Public Health, 677 Huntington Avenue, Boston, MA, 02115, USA
- Carrie A Davis
- Functional Genomics Group, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY, 11724, USA
- Sarah Djebali
- Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG) and UPF, Doctor Aiguader, 88, Barcelona, 08003, Spain
- Alexander Dobin
- Functional Genomics Group, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY, 11724, USA
- Brenton R Graveley
- Department of Genetics and Genome Sciences, Institute for System Genomics, UConn Health Center, Farmington, CT, 06030, USA
- Sheng Li
- Department of Physiology and Biophysics, Weill Cornell Medical College, New York, New York, USA
- Christopher E Mason
- Department of Physiology and Biophysics, Weill Cornell Medical College, New York, New York, USA
- Sara Olson
- Department of Genetics and Genome Sciences, Institute for System Genomics, UConn Health Center, Farmington, CT, 06030, USA
- Dmitri Pervouchine
- Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG) and UPF, Doctor Aiguader, 88, Barcelona, 08003, Spain
- Cricket A Sloan
- Department of Genetics, Stanford University, 300 Pasteur Drive, MC-5477, Stanford, CA, 94305, USA
- Xintao Wei
- Department of Genetics and Genome Sciences, Institute for System Genomics, UConn Health Center, Farmington, CT, 06030, USA
- Lijun Zhan
- Department of Genetics and Genome Sciences, Institute for System Genomics, UConn Health Center, Farmington, CT, 06030, USA
- Rafael A Irizarry
- Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, 450 Brookline Avenue, Boston, MA, 02215, USA
- Department of Biostatistics, Harvard TH Chan School of Public Health, 677 Huntington Avenue, Boston, MA, 02115, USA
12
Depiereux S, De Meulder B, Bareke E, Berger F, Le Gac F, Depiereux E, Kestemont P. Adaptation of a Bioinformatics Microarray Analysis Workflow for a Toxicogenomic Study in Rainbow Trout. PLoS One 2015; 10:e0128598. [PMID: 26186543 PMCID: PMC4506078 DOI: 10.1371/journal.pone.0128598] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 09/30/2014] [Accepted: 04/28/2015] [Indexed: 12/26/2022]
Abstract
Sex steroids play a key role in triggering sex differentiation in fish, with exogenous hormone treatment leading to partial or complete sex reversal. This phenomenon has attracted attention since the discovery that even low environmental doses of exogenous steroids can adversely affect gonad morphology (ovotestis development) and induce reproductive failure. Modern genomic-based technologies have enhanced opportunities to uncover mechanisms of action (MOA) and identify biomarkers related to the toxic action of a compound. However, high throughput data interpretation relies on statistical analysis, species genomic resources, and bioinformatics tools. The goals of this study are to improve the knowledge of feminisation in fish by analysing molecular responses in the gonads of rainbow trout fry after chronic exposure to several doses (0.01, 0.1, 1 and 10 μg/L) of ethynylestradiol (EE2), and to offer target genes as potential biomarkers of ovotestis development. We successfully adapted a bioinformatics microarray analysis workflow developed on human data to a toxicogenomic study using rainbow trout, a fish species lacking accurate functional annotation and genomic resources. The workflow yielded lists of genes expected to be enriched in true positive differentially expressed genes (DEGs), which were subjected to over-representation analysis (ORA) methods. Several pathways and ontologies, mostly related to cell division and metabolism, sexual reproduction and steroid production, were found significantly enriched in our analyses. Moreover, two sets of potential ovotestis biomarkers were selected using several criteria. The first group displayed specific potential biomarkers belonging to pathways/ontologies highlighted in the experiment. Among them, the early ovarian differentiation gene foxl2a was overexpressed. The second group, which was highly sensitive but not specific, included the DEGs presenting the highest fold change and lowest p-value of the statistical workflow output. The methodology can be generalized to other (non-model) species and various types of microarray platforms.
Affiliation(s)
- Sophie Depiereux
- Unit of research in Environmental and Evolutionary Biology (URBE-NARILIS), Laboratory of Ecophysiology and Ecotoxicology, University of Namur, Namur, Belgium
- Bertrand De Meulder
- Unit of Research in Molecular Biology (URBM-NARILIS), University of Namur, Namur, Belgium
- Eric Bareke
- Unit of Research in Molecular Biology (URBM-NARILIS), University of Namur, Namur, Belgium
- Sainte-Justine UHC Research Centre, University of Montreal, Montréal (Québec), H3T 1C5, Canada
- Fabrice Berger
- Unit of Research in Molecular Biology (URBM-NARILIS), University of Namur, Namur, Belgium
- Florence Le Gac
- Institut National de la Recherche Agronomique, INRA-LPGP, UPR1037, Campus de Beaulieu, 35042, Rennes, France
- Eric Depiereux
- Unit of Research in Molecular Biology (URBM-NARILIS), University of Namur, Namur, Belgium
- Patrick Kestemont
- Unit of research in Environmental and Evolutionary Biology (URBE-NARILIS), Laboratory of Ecophysiology and Ecotoxicology, University of Namur, Namur, Belgium
13
Boareto M, Caticha N. t-Test at the Probe Level: An Alternative Method to Identify Statistically Significant Genes for Microarray Data. Microarrays 2014; 3:340-51. [PMID: 27600352 PMCID: PMC4979051 DOI: 10.3390/microarrays3040340] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 10/27/2014] [Revised: 11/21/2014] [Accepted: 12/09/2014] [Indexed: 11/16/2022]
Abstract
Microarray data analysis typically consists in identifying a list of differentially expressed genes (DEG), i.e., the genes that are differentially expressed between two experimental conditions. Variance shrinkage methods have been considered a better choice than the standard t-test for selecting the DEG because they correct the dependence of the error with the expression level. This dependence is mainly caused by errors in background correction, which more severely affects genes with low expression values. Here, we propose a new method for identifying the DEG that overcomes this issue and does not require background correction or variance shrinkage. Unlike current methods, our methodology is easy to understand and implement. It consists of applying the standard t-test directly on the normalized intensity data, which is possible because the probe intensity is proportional to the gene expression level and because the t-test is scale- and location-invariant. This methodology considerably improves the sensitivity and robustness of the list of DEG when compared with the t-test applied to preprocessed data and to the most widely used shrinkage methods, Significance Analysis of Microarrays (SAM) and Linear Models for Microarray Data (LIMMA). Our approach is useful especially when the genes of interest have small differences in expression and therefore get ignored by standard variance shrinkage methods.
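The invariance property that the abstract relies on can be checked numerically. The sketch below is only an illustration of that property, not the authors' probe-level pipeline: the intensity values are made up, and `student_t` and `rescale` are hypothetical helper names. It computes a pooled-variance two-sample t statistic, then recomputes it after applying the same positive scale factor and the same shift to both groups.

```python
import math
from statistics import mean, variance


def student_t(a, b):
    """Two-sample Student t statistic with pooled variance."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))


def rescale(values, scale, shift):
    """Apply the same affine map x -> scale * x + shift to every value."""
    return [scale * v + shift for v in values]


# Made-up intensities for one probe under two conditions (illustrative only).
control = [7.1, 6.8, 7.4, 7.0, 6.9]
treated = [7.9, 8.3, 7.7, 8.1, 8.0]

t_raw = student_t(control, treated)
# Same positive scale and same shift applied to both groups.
t_rescaled = student_t(rescale(control, 3.0, 50.0), rescale(treated, 3.0, 50.0))

# The two statistics agree up to floating-point rounding.
print(math.isclose(t_raw, t_rescaled, rel_tol=1e-9))
```

The scale factor multiplies both the mean difference and the pooled standard deviation, and the shift cancels in both, so the ratio is unchanged. This holds only when the same transformation is applied to both groups.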
Affiliation(s)
- Marcelo Boareto
- Institute of Physics, University of São Paulo, São Paulo, SP 05508-900, Brazil
- Nestor Caticha
- Institute of Physics, University of São Paulo, São Paulo, SP 05508-900, Brazil
14
Richard AC, Lyons PA, Peters JE, Biasci D, Flint SM, Lee JC, McKinney EF, Siegel RM, Smith KGC. Comparison of gene expression microarray data with count-based RNA measurements informs microarray interpretation. BMC Genomics 2014; 15:649. [PMID: 25091430 PMCID: PMC4143561 DOI: 10.1186/1471-2164-15-649] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.6] [Received: 04/04/2014] [Accepted: 07/17/2014] [Indexed: 01/02/2023]
Abstract
Background: Although numerous investigations have compared gene expression microarray platforms, preprocessing methods and batch correction algorithms using constructed spike-in or dilution datasets, there remains a paucity of studies examining the properties of microarray data using diverse biological samples. Most microarray experiments seek to identify subtle differences between samples with variable background noise, a scenario poorly represented by constructed datasets. Thus, microarray users lack important information regarding the complexities introduced in real-world experimental settings. The recent development of a multiplexed, digital technology for nucleic acid measurement enables counting of individual RNA molecules without amplification and, for the first time, permits such a study.
Results: Using a set of human leukocyte subset RNA samples, we compared previously acquired microarray expression values with RNA molecule counts determined by the nCounter Analysis System (NanoString Technologies) in selected genes. We found that gene measurements across samples correlated well between the two platforms, particularly for high-variance genes, while genes deemed unexpressed by the nCounter generally had both low expression and low variance on the microarray. Confirming previous findings from spike-in and dilution datasets, this “gold-standard” comparison demonstrated signal compression that varied dramatically by expression level and, to a lesser extent, by dataset. Most importantly, examination of three different cell types revealed that noise levels differed across tissues.
Conclusions: Microarray measurements generally correlate with relative RNA molecule counts within optimal ranges but suffer from expression-dependent accuracy bias and precision that varies across datasets. We urge microarray users to consider expression-level effects in signal interpretation and to evaluate noise properties in each dataset independently.
Electronic supplementary material: The online version of this article (doi:10.1186/1471-2164-15-649) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Kenneth G C Smith
- Cambridge Institute for Medical Research and Department of Medicine, University of Cambridge, Cambridge, UK.
|
15
|
De Meulder B, Berger F, Bareke E, Depiereux S, Michiels C, Depiereux E. Meta-analysis and gene set analysis of archived microarrays suggest implication of the spliceosome in metastatic and hypoxic phenotypes. PLoS One 2014; 9:e86699. [PMID: 24497970 PMCID: PMC3908947 DOI: 10.1371/journal.pone.0086699] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 09/19/2013] [Accepted: 12/10/2013] [Indexed: 12/17/2022]
Abstract
We propose to make use of the wealth of underused DNA chip data available in public repositories to study the molecular mechanisms behind the adaptation of cancer cells to hypoxic conditions leading to the metastatic phenotype. We have developed new bioinformatics tools and adapted others to identify with maximum sensitivity those genes which are expressed differentially across several experiments. The comparison of two analytical approaches, based on either Over Representation Analysis or Functional Class Scoring, by a meta-analysis-based approach, led to the retrieval of known information about the biological situation - thus validating the model - but also more importantly to the discovery of the previously unknown implication of the spliceosome, the cellular machinery responsible for mRNA splicing, in the development of metastasis.
Affiliation(s)
- Bertrand De Meulder
- Microorganism Biology Research Unit -NARILIS, University of Namur, Namur, Belgium
- Fabrice Berger
- Microorganism Biology Research Unit -NARILIS, University of Namur, Namur, Belgium
- Eric Bareke
- Sainte Justine University Hospital Center Research Center, University of Montreal, Montreal, Canada
- Sophie Depiereux
- Environmental and Evolutional Research Unit, University of Namur, Namur, Belgium
- Carine Michiels
- Cellular Biology Research Unit - NARILIS, University of Namur, Namur, Belgium
- Eric Depiereux
- Microorganism Biology Research Unit -NARILIS, University of Namur, Namur, Belgium
|
16
|
Khamiakova T, Shkedy Z, Amaratunga D, Talloen W, Göhlmann H, Bijnens L, Kasim A. Quality control of Platinum Spike dataset by probe-level mixed models. Math Biosci 2014; 248:1-10. [DOI: 10.1016/j.mbs.2013.11.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2013] [Revised: 11/20/2013] [Accepted: 11/21/2013] [Indexed: 10/25/2022]
|
17
|
Hossain A, Willan AR, Beyene J. An Improved Method on Wilcoxon Rank Sum Test for Gene Selection from Microarray Experiments. Commun Stat-Simul C 2013. [DOI: 10.1080/03610918.2012.667479] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 10/27/2022]
|
18
|
Welsh EA, Eschrich SA, Berglund AE, Fenstermacher DA. Iterative rank-order normalization of gene expression microarray data. BMC Bioinformatics 2013; 14:153. [PMID: 23647742 PMCID: PMC3651355 DOI: 10.1186/1471-2105-14-153] [Citation(s) in RCA: 96] [Impact Index Per Article: 8.7] [Received: 11/28/2012] [Accepted: 04/29/2013] [Indexed: 11/25/2022]
Abstract
Background: Many gene expression normalization algorithms exist for Affymetrix GeneChip microarrays. The most popular of these is RMA, primarily due to the precision and low noise produced during the process. A significant strength of this and similar approaches is the use of the entire set of arrays during both normalization and model-based estimation of signal. However, this leads to differing estimates of expression based on the starting set of arrays, and estimates can change when a single, additional chip is added to the set. Additionally, outlier chips can impact the signals of other arrays, and can themselves be skewed by the majority of the population. Results: We developed an approach, termed IRON, which uses the best-performing techniques from each of several popular processing methods while retaining the ability to incrementally renormalize data without altering previously normalized expression. This combination of approaches results in a method that performs comparably to existing approaches on artificial benchmark datasets (i.e. spike-in) and demonstrates promising improvements in segregating true signals within biologically complex experiments. Conclusions: By combining approaches from existing normalization techniques, the IRON method offers several advantages. First, IRON normalization occurs pair-wise, thereby avoiding the need for all chips to be normalized together, which can be important for large data analyses. Secondly, the technique does not require similarity in signal distribution across chips for normalization, which can be important for maintaining biologically relevant differences in a heterogeneous background. Lastly, IRON introduces fewer post-processing artifacts, particularly in data whose behavior violates common assumptions. Thus, the IRON method provides a practical solution to common needs of expression analysis. A software implementation of IRON is available at [http://gene.moffitt.org/libaffy/].
Affiliation(s)
- Eric A Welsh
- H Lee Moffitt Cancer Center and Research Institute, University of South Florida, Tampa, FL 33612, USA.
|
19
|
Hossain A, Willan AR, Beyene J. A flexible nonparametric approach to find candidate genes associated with disease in microarray experiments. J Bioinform Comput Biol 2013; 11:1250021. [PMID: 23600812 DOI: 10.1142/s0219720012500217] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 02/05/2023]
Abstract
Very often biologists are interested to know the biological function of a particular gene. Its true biological function may depend on other genes. Finding other genes in the same biological pathway of that gene may enhance further understanding of its biological function. Therefore, we are interested in finding other candidate genes whose expression values are highly correlated with that of a "seed" gene. The "seed" gene, which is known and associated with a disease, is used as a reference to extract candidate genes from microarray experiments and enriched pathways. We propose a nonparametric procedure for selecting the candidate genes. The expression levels for these candidate genes are correlated with that of a "seed" gene in microarray experiments. The proposed test statistic compares two Area Under Receiver Operating Characteristic Curves (AUC) for gene pairs, taking implicit correlation between two AUCs into account. The performance of our method is compared to the other well-known methods through the use of simulation and real data analysis.
Affiliation(s)
- Ahmed Hossain
- Dalla Lana School of Public Health, University of Toronto, 155 College Street, Toronto, ON M5T 3M7, Canada.
|
20
|
Lahti L, Torrente A, Elo LL, Brazma A, Rung J. A fully scalable online pre-processing algorithm for short oligonucleotide microarray atlases. Nucleic Acids Res 2013; 41:e110. [PMID: 23563154 PMCID: PMC3664815 DOI: 10.1093/nar/gkt229] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.2] [Indexed: 01/26/2023]
Abstract
Rapid accumulation of large and standardized microarray data collections is opening up novel opportunities for holistic characterization of genome function. The limited scalability of current preprocessing techniques has, however, formed a bottleneck for full utilization of these data resources. Although short oligonucleotide arrays constitute a major source of genome-wide profiling data, scalable probe-level techniques have been available only for few platforms based on pre-calculated probe effects from restricted reference training sets. To overcome these key limitations, we introduce a fully scalable online-learning algorithm for probe-level analysis and pre-processing of large microarray atlases involving tens of thousands of arrays. In contrast to the alternatives, our algorithm scales up linearly with respect to sample size and is applicable to all short oligonucleotide platforms. The model can use the most comprehensive data collections available to date to pinpoint individual probes affected by noise and biases, providing tools to guide array design and quality control. This is the only available algorithm that can learn probe-level parameters based on sequential hyperparameter updates at small consecutive batches of data, thus circumventing the extensive memory requirements of the standard approaches and opening up novel opportunities to take full advantage of contemporary microarray collections.
Affiliation(s)
- Leo Lahti
- Department of Veterinary Bioscience, University of Helsinki, Agnes Sjöbergin katu 2, PO Box 66, FI-00014 University of Helsinki, Finland.
|
21
|
Correction of spatial bias in oligonucleotide array data. Adv Bioinformatics 2013; 2013:167915. [PMID: 23573083 PMCID: PMC3610395 DOI: 10.1155/2013/167915] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 07/06/2012] [Accepted: 02/02/2013] [Indexed: 01/17/2023]
Abstract
Background. Oligonucleotide microarrays allow for high-throughput gene expression profiling assays. The technology relies on the fundamental assumption that observed hybridization signal intensities (HSIs) for each intended target, on average, correlate with their target's true concentration in the sample. However, systematic, nonbiological variation from several sources undermines this hypothesis. Background hybridization signal has been previously identified as one such important source, one manifestation of which appears in the form of spatial autocorrelation. Results. We propose an algorithm, pyn, for the elimination of spatial autocorrelation in HSIs, exploiting the duality of desirable mutual information shared by probes in a common probe set and undesirable mutual information shared by spatially proximate probes. We show that this correction procedure reduces spatial autocorrelation in HSIs; increases HSI reproducibility across replicate arrays; increases differentially expressed gene detection power; and performs better than previously published methods. Conclusions. The proposed algorithm increases both precision and accuracy, while requiring virtually no changes to users' current analysis pipelines: the correction consists merely of a transformation of raw HSIs (e.g., CEL files for Affymetrix arrays). A free, open-source implementation is provided as an R package, compatible with standard Bioconductor tools. The approach may also be tailored to other platform types and other sources of bias.
|
22
|
Jang GW, Lee KT, Park JE, Kim H, Kim TH, Choi BH, Kim MJ, Lim D. Gene Expression Profiling in Hepatic Tissue of two Pig Breeds. Journal of Animal Science and Technology 2012. [DOI: 10.5187/jast.2012.54.6.383] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2022]
|
23
|
Faust D, Vondráček J, Krčmář P, Šmerdová L, Procházková J, Hrubá E, Hulinková P, Kaina B, Dietrich C, Machala M. AhR-mediated changes in global gene expression in rat liver progenitor cells. Arch Toxicol 2012. [DOI: 10.1007/s00204-012-0979-z] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.3] [Indexed: 01/12/2023]
|
24
|
Nair PS, Vihinen M. VariBench: A Benchmark Database for Variations. Hum Mutat 2012; 34:42-9. [DOI: 10.1002/humu.22204] [Citation(s) in RCA: 106] [Impact Index Per Article: 8.8] [Received: 01/20/2012] [Accepted: 07/31/2012] [Indexed: 12/21/2022]
|
25
|
Ghorbel MT, Mokhtari A, Sheikh M, Angelini GD, Caputo M. Controlled reoxygenation cardiopulmonary bypass is associated with reduced transcriptomic changes in cyanotic tetralogy of Fallot patients undergoing surgery. Physiol Genomics 2012; 44:1098-106. [PMID: 22991208 DOI: 10.1152/physiolgenomics.00072.2012] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Indexed: 11/22/2022]
Abstract
In cyanotic patients undergoing repair of heart defects, high level of oxygen during cardiopulmonary bypass (CPB) leads to greater susceptibility to myocardial ischemia and reoxygenation injury. This study investigates the effects of controlled reoxygenation CPB on gene expression changes in cyanotic hearts of patients undergoing surgical correction of tetralogy of Fallot (TOF). We randomized 49 cyanotic TOF patients undergoing corrective cardiac surgery to receive either controlled reoxygenation or hyperoxic/standard CPB. Ventricular myocardium biopsies were obtained immediately after starting and before discontinuing CPB. Microarray analyses were performed on samples, and array results validated with real-time PCR. Gene expression profiles before and after hyperoxic/standard CPB revealed 35 differentially expressed genes with three upregulated and 32 downregulated. Upregulated genes included two E3 Ubiquitin ligases. The products of downregulated genes included intracellular signaling kinases, metabolic process proteins, and transport factors. In contrast, gene expression profiles before and after controlled reoxygenation CPB revealed only 11 differentially expressed genes with 10 upregulated including extracellular matrix proteins, transport factors, and one downregulated. The comparison of gene expression following hyperoxic/standard vs. controlled reoxygenation CPB revealed 59 differentially expressed genes, with six upregulated and 53 downregulated. Upregulated genes included PDE1A, MOSC1, and CRIP3. Downregulated genes functionally clustered into four major classes: extracellular matrix/cell adhesion, transcription, transport, and cellular metabolic process. This study provides direct evidence that hyperoxic CPB decreases the adaptation and remodeling capacity in cyanotic patients undergoing TOF repair. This simple CPB strategy of controlled reoxygenation reduced the number of genes whose expression was altered following hyperoxic/standard CPB.
Affiliation(s)
- Mohamed T Ghorbel
- Bristol Heart Institute, School of Clinical Science, University of Bristol, Bristol, United Kingdom
|
26
|
McCall MN, Almudevar A. Affymetrix GeneChip microarray preprocessing for multivariate analyses. Brief Bioinform 2012; 13:536-46. [PMID: 22210854 PMCID: PMC3431718 DOI: 10.1093/bib/bbr072] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Received: 09/15/2011] [Revised: 11/20/2011] [Indexed: 11/15/2022]
Abstract
Affymetrix GeneChip microarrays are the most widely used high-throughput technology to measure gene expression, and a wide variety of preprocessing methods have been developed to transform probe intensities reported by a microarray scanner into gene expression estimates. There have been numerous comparisons of these preprocessing methods, focusing on the most common analyses: detection of differential expression and gene or sample clustering. Recently, more complex multivariate analyses, such as gene co-expression, differential co-expression, gene set analysis and network modeling, are becoming more common; however, the same preprocessing methods are typically applied. In this article, we examine the effect of preprocessing methods on some of these multivariate analyses and provide guidance to the user as to which methods are most appropriate.
Affiliation(s)
- Matthew N McCall
- Department of Biostatistics and Computational Biology, University of Rochester Medical Center, NY, USA.
|
27
|
Assessing numerical dependence in gene expression summaries with the jackknife expression difference. PLoS One 2012; 7:e39570. [PMID: 22876276 PMCID: PMC3411624 DOI: 10.1371/journal.pone.0039570] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 10/11/2011] [Accepted: 05/27/2012] [Indexed: 11/19/2022]
Abstract
Statistical methods to test for differential expression traditionally assume that each gene's expression summaries are independent across arrays. When certain preprocessing methods are used to obtain those summaries, this assumption is not necessarily true. In general, the erroneous assumption of independence results in a loss of statistical power. We introduce a diagnostic measure of numerical dependence for gene expression summaries from any preprocessing method and discuss the relative performance of several common preprocessing methods with respect to this measure. Some common preprocessing methods introduce non-trivial levels of numerical dependence. The issue of (between-array) dependence has received little if any attention in the literature, and researchers working with gene expression data should not take such properties for granted, or they risk unnecessarily losing statistical power.
|
28
|
Zhao Z, Gene Hwang JT. Empirical Bayes false coverage rate controlling confidence intervals. J R Stat Soc Series B Stat Methodol 2012. [DOI: 10.1111/j.1467-9868.2012.01033.x] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Indexed: 12/31/2022]
|
29
|
Vihinen M. How to evaluate performance of prediction methods? Measures and their interpretation in variation effect analysis. BMC Genomics 2012; 13 Suppl 4:S2. [PMID: 22759650 PMCID: PMC3303716 DOI: 10.1186/1471-2164-13-s4-s2] [Citation(s) in RCA: 175] [Impact Index Per Article: 14.6] [Indexed: 11/26/2022]
Abstract
Background: Prediction methods are increasingly used in biosciences to forecast diverse features and characteristics. Binary two-state classifiers are the most common applications. They are usually based on machine learning approaches. For the end user it is often problematic to evaluate the true performance and applicability of computational tools as some knowledge about computer science and statistics would be needed. Results: Instructions are given on how to interpret and compare method evaluation results. Systematic analysis of method performance requires established benchmark datasets that contain cases with known outcomes, as well as suitable evaluation measures. The criteria for benchmark datasets are discussed along with their implementation in VariBench, a benchmark database for variations. There is no single measure that alone could describe all the aspects of method performance. Predictions of genetic variation effects on DNA, RNA and protein level are important as information about variants can be produced much faster than their disease relevance can be experimentally verified. Numerous prediction tools have therefore been developed; however, systematic analyses of their performance and comparison have only just started to emerge. Conclusions: The end users of prediction tools should be able to understand how evaluation is done and how to interpret the results. Six main performance evaluation measures are introduced. These include sensitivity, specificity, positive predictive value, negative predictive value, accuracy and Matthews correlation coefficient. Together with receiver operating characteristics (ROC) analysis they provide a good picture about the performance of methods and allow their objective and quantitative comparison. A checklist of items to look at is provided. Comparisons of methods for missense variant tolerance, protein stability changes due to amino acid substitutions, and effects of variations on mRNA splicing are presented.
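The six measures named in this abstract can all be derived from the four cells of a binary confusion matrix. The following is a minimal sketch using the standard textbook definitions; the function name `performance_measures` and the example counts are illustrative assumptions and are not taken from the cited article.

```python
import math


def performance_measures(tp, tn, fp, fn):
    """Return six common binary-classifier measures from confusion-matrix counts.

    tp, tn, fp, fn are the numbers of true positives, true negatives,
    false positives and false negatives. A measure whose denominator is
    zero is reported as NaN rather than raising an error.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    # Denominator of the Matthews correlation coefficient (MCC).
    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "sensitivity": ratio(tp, tp + fn),   # true positive rate
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "ppv": ratio(tp, tp + fp),           # positive predictive value
        "npv": ratio(tn, tn + fn),           # negative predictive value
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Hypothetical counts, chosen only to exercise the function.
example = performance_measures(tp=40, tn=45, fp=5, fn=10)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

Reporting the measures together, rather than accuracy alone, reflects the abstract's point that no single number captures every aspect of performance.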
Affiliation(s)
- Mauno Vihinen
- Institute of Biomedical Technology, University of Tampere, Finland.
|
30
|
An integrated framework to model cellular phenotype as a component of biochemical networks. Adv Bioinformatics 2011; 2011:608295. [PMID: 22190923 PMCID: PMC3235418 DOI: 10.1155/2011/608295] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/23/2011] [Accepted: 08/26/2011] [Indexed: 11/25/2022]
Abstract
Identification of regulatory molecules in signaling pathways is critical for understanding cellular behavior. Given the complexity of the transcriptional gene network, the relationship between molecular expression and phenotype is difficult to determine using reductionist experimental methods. Computational models provide the means to characterize regulatory mechanisms and predict phenotype in the context of gene networks. Integrating gene expression data with phenotypic data in transcriptional network models enables systematic identification of critical molecules in a biological network. We developed an approach based on fuzzy logic to model cell budding in Saccharomyces cerevisiae using time series expression microarray data of the cell cycle. Cell budding is a phenotype of viable cells undergoing division. Predicted interactions between gene expression and phenotype reflected known biological relationships. Dynamic simulation analysis reproduced the behavior of the yeast cell cycle and accurately identified genes and interactions which are essential for cell viability.
|
31
|
Lim D, Lee KT, Park JE, Kim H, Kim TH, Choi BH, Kim MJ, Park HS, Jang GW. Analysis of gene expression profiles from subcutaneous adipose tissue of two pig breeds. Genes Genomics 2011. [DOI: 10.1007/s13258-011-0083-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 01/07/2023]
|
32
|
Berger F, Carlon E. From hybridization theory to microarray data analysis: performance evaluation. BMC Bioinformatics 2011; 12:464. [PMID: 22136743 PMCID: PMC3267830 DOI: 10.1186/1471-2105-12-464] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 08/11/2011] [Accepted: 12/02/2011] [Indexed: 02/05/2023]
Abstract
Background: Several preprocessing methods are available for the analysis of Affymetrix GeneChip arrays. The most popular algorithms analyze the measured fluorescence intensities with statistical methods. Here we focus on a novel algorithm, AffyILM, available from Bioconductor, which relies on inputs from hybridization thermodynamics and uses an extended Langmuir isotherm model to compute transcript concentrations. These concentrations are then employed in the statistical analysis. We compared the performance of AffyILM and other traditional methods both in the old and in the newest generation of GeneChips. Results: Tissue mixture and Latin Square datasets (provided by Affymetrix) were used to assess the performance of the differential expression analysis depending on the preprocessing strategy. A correlation analysis conducted on the tissue mixture data reveals that the median-polish algorithm best summarizes AffyILM concentrations computed at the probe level. Those correlation results are equivalent to the best correlations observed using popular preprocessing methods relying on intensity values. The performance of each tested preprocessing algorithm was quantified using the Latin Square HG-U133A dataset, by comparing the differential analysis results with the list of spiked genes. The figures of merit generated illustrate that the performance associated with AffyILM(medianpolish), inferred from the present statistical analysis, is comparable to the best performing strategies previously reported. Conclusions: Converting probe intensities to estimates of target concentrations prior to the statistical analysis, AffyILM(medianpolish) is one of the best performing strategies currently available. Using hybridization theory, probe-level estimates of target concentrations should be identically distributed. In the future, a probe-level multivariate analysis of the concentrations should be compared to the univariate analysis of probe-set summarized expression data.
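Median polish, named in this abstract as the summarization step, is Tukey's iterative procedure for fitting an additive row-plus-column model with medians. The following is a minimal, generic sketch, assuming a table with probes in rows and arrays in columns on a log scale; the function name `median_polish` and the toy table are hypothetical, and this is not the AffyILM or Bioconductor implementation.

```python
from statistics import median


def median_polish(table, max_iter=10, tol=1e-6):
    """Fit table[i][j] ~ overall + row_eff[i] + col_eff[j] using medians.

    Returns (overall, row_eff, col_eff, residuals). Rows are swept first,
    then columns, until the sweeps stop changing the fit.
    """
    n_rows, n_cols = len(table), len(table[0])
    resid = [[float(value) for value in row] for row in table]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols

    for _ in range(max_iter):
        # Row sweep: move each row's median from the residuals to row_eff.
        row_med = [median(row) for row in resid]
        for i in range(n_rows):
            row_eff[i] += row_med[i]
            resid[i] = [value - row_med[i] for value in resid[i]]
        # Re-centre the column effects, shifting their median into overall.
        shift = median(col_eff)
        overall += shift
        col_eff = [effect - shift for effect in col_eff]

        # Column sweep: move each column's median to col_eff.
        col_med = [median(resid[i][j] for i in range(n_rows))
                   for j in range(n_cols)]
        for j in range(n_cols):
            col_eff[j] += col_med[j]
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
        # Re-centre the row effects in the same way.
        shift = median(row_eff)
        overall += shift
        row_eff = [effect - shift for effect in row_eff]

        # Stop once a full pass no longer moves anything.
        if max(abs(m) for m in row_med + col_med) < tol:
            break

    return overall, row_eff, col_eff, resid


# Toy log-scale table: 3 probes (rows) by 3 arrays (columns).
toy = [[10, 20, 30],
       [11, 21, 31],
       [12, 22, 32]]
overall, probe_effects, array_effects, residuals = median_polish(toy)
# Under the stated layout, overall + array_effects[j] would serve as
# the probe-set summary for array j.
summaries = [overall + effect for effect in array_effects]
```

Medians rather than means are used so that a few outlying probes have limited influence on the per-array summary.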
Affiliation(s)
- Fabrice Berger
- Institute for Theoretical Physics, KULeuven, Celestijnenlaan 200D, B-3001 Leuven, Belgium.
|
33
|
Owzar K, Barry WT, Jung SH. Statistical considerations for analysis of microarray experiments. Clin Transl Sci 2011; 4:466-77. [PMID: 22212230 DOI: 10.1111/j.1752-8062.2011.00309.x] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.5] [Indexed: 11/30/2022]
Abstract
Microarray technologies enable the simultaneous interrogation of expressions from thousands of genes from a biospecimen sample taken from a patient. This large set of expressions generates a genetic profile of the patient that may be used to identify potential prognostic or predictive genes or genetic models for clinical outcomes. The aim of this article is to provide a broad overview of some of the major statistical considerations for the design and analysis of microarray experiments conducted as correlative science studies to clinical trials. An emphasis will be placed on how the lack of understanding and improper use of statistical concepts and methods will lead to noise discovery and misinterpretation of experimental results.
Affiliation(s)
- Kouros Owzar
- Department of Biostatistics and Bioinformatics, Duke University CALGB Statistical Center, Duke University, Durham, North Carolina, USA
|
34
|
Moffitt RA, Yin-Goen Q, Stokes TH, Parry RM, Torrance JH, Phan JH, Young AN, Wang MD. caCORRECT2: Improving the accuracy and reliability of microarray data in the presence of artifacts. BMC Bioinformatics 2011; 12:383. [PMID: 21957981 PMCID: PMC3230913 DOI: 10.1186/1471-2105-12-383] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Received: 02/01/2011] [Accepted: 09/29/2011] [Indexed: 11/10/2022]
Abstract
BACKGROUND: In previous work, we reported the development of caCORRECT, a novel microarray quality control system built to identify and correct spatial artifacts commonly found on Affymetrix arrays. We have made recent improvements to caCORRECT, including the development of a model-based data-replacement strategy and integration with typical microarray workflows via caCORRECT's web portal and caBIG grid services. In this report, we demonstrate that caCORRECT improves the reproducibility and reliability of experimental results across several common Affymetrix microarray platforms. caCORRECT represents an advance over state-of-the-art quality control methods such as Harshlighting, and acts to improve gene expression calculation techniques such as PLIER, RMA and MAS5.0, because it incorporates spatial information into outlier detection as well as outlier information into probe normalization. The ability of caCORRECT to recover accurate gene expressions from low-quality probe intensity data is assessed using a combination of real and synthetic artifacts with PCR follow-up confirmation and the affycomp spike-in data. The caCORRECT tool can be accessed at the website: http://cacorrect.bme.gatech.edu. RESULTS: We demonstrate that (1) caCORRECT's artifact-aware normalization avoids the undesirable global data warping that happens when any damaged chips are processed without caCORRECT; (2) when used upstream of RMA, PLIER, or MAS5.0, the data imputation of caCORRECT generally improves the accuracy of microarray gene expression in the presence of artifacts more than using Harshlighting or not using any quality control; (3) biomarkers selected from artifactual microarray data which have undergone the quality control procedures of caCORRECT are more likely to be reliable, as shown by both spike-in and PCR validation experiments. Finally, we present a case study of the use of caCORRECT to reliably identify biomarkers for renal cell carcinoma, yielding two diagnostic biomarkers with potential clinical utility, PRKAB1 and NNMT. CONCLUSIONS: caCORRECT is shown to improve the accuracy of gene expression, and the reproducibility of experimental results in clinical application. This study suggests that caCORRECT will be useful to clean up possible artifacts in new as well as archived microarray data.
Affiliation(s)
- Richard A Moffitt
- The Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, 313 Ferst Drive, Atlanta, GA 30332, USA
|
35
|
Kadota K, Shimizu K. Evaluating methods for ranking differentially expressed genes applied to microArray quality control data. BMC Bioinformatics 2011; 12:227. [PMID: 21639945 PMCID: PMC3128035 DOI: 10.1186/1471-2105-12-227] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.9] [Received: 11/09/2010] [Accepted: 06/06/2011] [Indexed: 11/12/2022]
Abstract
Background: Statistical methods for ranking differentially expressed genes (DEGs) from gene expression data should be evaluated with regard to high sensitivity, specificity, and reproducibility. In our previous studies, we evaluated eight gene ranking methods applied to only Affymetrix GeneChip data. A more general evaluation that also includes other microarray platforms, such as the Agilent or Illumina systems, is desirable for determining which methods are suitable for each platform and which method has better inter-platform reproducibility. Results: We compared the eight gene ranking methods using the MicroArray Quality Control (MAQC) datasets produced by five manufacturers: Affymetrix, Applied Biosystems, Agilent, GE Healthcare, and Illumina. The area under the curve (AUC) was used as a measure for both sensitivity and specificity. Although the highest AUC values can vary with the definition of "true" DEGs, the best methods were, in most cases, either the weighted average difference (WAD), rank products (RP), or intensity-based moderated t statistic (ibmT). The percentages of overlapping genes (POGs) across different test sites were mainly evaluated as a measure for both intra- and inter-platform reproducibility. The POG values for WAD were the highest overall, irrespective of the choice of microarray platform. The high intra- and inter-platform reproducibility of WAD was also observed at a higher biological function level. Conclusion: These results for the five microarray platforms were consistent with our previous ones based on 36 real experimental datasets measured using the Affymetrix platform. Thus, recommendations made using the MAQC benchmark data might be universally applicable.
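The percentage of overlapping genes (POG) used here as a reproducibility measure can be illustrated with a short calculation. This is a minimal sketch, assuming POG is defined as the share of genes common to the top-ranked lists of equal length from two test sites; the function name, the gene labels, and the list length are hypothetical, and the cited study's exact definition may differ.

```python
def percentage_of_overlapping_genes(ranked_a, ranked_b, top_n):
    """Return the percentage of genes shared by the top_n of two rankings.

    ranked_a and ranked_b are gene identifiers ordered from most to least
    differentially expressed, for example from two test sites.
    """
    if top_n <= 0 or top_n > min(len(ranked_a), len(ranked_b)):
        raise ValueError("top_n must be between 1 and the shorter list length")
    shared = set(ranked_a[:top_n]) & set(ranked_b[:top_n])
    return 100.0 * len(shared) / top_n


# Hypothetical rankings from two test sites.
site_1 = ["GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"]
site_2 = ["GENE_B", "GENE_A", "GENE_E", "GENE_F", "GENE_C"]

pog_top3 = percentage_of_overlapping_genes(site_1, site_2, top_n=3)
pog_top5 = percentage_of_overlapping_genes(site_1, site_2, top_n=5)
print(f"POG (top 3): {pog_top3:.1f}%")
print(f"POG (top 5): {pog_top5:.1f}%")
```

Because the value depends on the chosen list length, POG is usually inspected over a range of `top_n` values rather than at a single cutoff.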
Affiliation(s)
- Koji Kadota
- Agricultural Bioinformatics Research Unit, Graduate School of Agricultural and Life Sciences, The University of Tokyo, Yayoi, Bunkyo-ku, Japan.
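The percentage of overlapping genes (POG) that the abstract above uses as its reproducibility measure can be illustrated with a short sketch. It assumes the simplest reading, namely the share of genes common to the top-k positions of two ranked lists; the function name `pog`, the gene labels, and the choice of k are illustrative and not taken from the paper, whose exact definition may differ (for example in how the direction of change is handled).

```python
def pog(ranking_a, ranking_b, k):
    """Percentage of overlapping genes among the top-k of two ranked gene lists.

    ranking_a, ranking_b: sequences of gene identifiers, best-ranked first.
    k: number of top-ranked genes compared (assumed > 0 and <= list lengths).
    """
    if k <= 0 or k > min(len(ranking_a), len(ranking_b)):
        raise ValueError("k must be between 1 and the length of the shorter list")
    top_a = set(ranking_a[:k])
    top_b = set(ranking_b[:k])
    # Overlap is counted on gene identity only; rank order within the top-k is ignored.
    return 100.0 * len(top_a & top_b) / k


# Hypothetical rankings of the same genes from two test sites.
site_1 = ["g1", "g2", "g3", "g4"]
site_2 = ["g2", "g1", "g5", "g3"]
print(pog(site_1, site_2, 2))
print(pog(site_1, site_2, 3))
```

Sweeping k over a range of list sizes gives a reproducibility curve rather than a single number, which is often easier to compare between ranking methods.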
|
36
|
Lytkin NI, McVoy L, Weitkamp JH, Aliferis CF, Statnikov A. Expanding the understanding of biases in development of clinical-grade molecular signatures: a case study in acute respiratory viral infections. PLoS One 2011; 6:e20662. [PMID: 21673802 PMCID: PMC3105991 DOI: 10.1371/journal.pone.0020662] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Received: 12/03/2010] [Accepted: 05/06/2011] [Indexed: 01/21/2023]
Abstract
BACKGROUND: The promise of modern personalized medicine is to use molecular and clinical information to better diagnose, manage, and treat disease, on an individual patient basis. These functions are predominantly enabled by molecular signatures, which are computational models for predicting phenotypes and other responses of interest from high-throughput assay data. Data-analytics is a central component of molecular signature development and can jeopardize the entire process if conducted incorrectly. While exploratory data analysis may tolerate suboptimal protocols, clinical-grade molecular signatures are subject to vastly stricter requirements. Closing the gap between standards for exploratory versus clinically successful molecular signatures entails a thorough understanding of possible biases in the data analysis phase and developing strategies to avoid them. METHODOLOGY AND PRINCIPAL FINDINGS: Using a recently introduced data-analytic protocol as a case study, we provide an in-depth examination of the poorly studied biases of the data-analytic protocols related to signature multiplicity, biomarker redundancy, data preprocessing, and validation of signature reproducibility. The methodology and results presented in this work are aimed at expanding the understanding of these data-analytic biases that affect development of clinically robust molecular signatures. CONCLUSIONS AND SIGNIFICANCE: Several recommendations follow from the current study. First, all molecular signatures of a phenotype should be extracted to the extent possible, in order to provide comprehensive and accurate grounds for understanding disease pathogenesis. Second, redundant genes should generally be removed from final signatures to facilitate reproducibility and decrease manufacturing costs. Third, data preprocessing procedures should be designed so as not to bias biomarker selection. Finally, molecular signatures developed and applied on different phenotypes and populations of patients should be treated with great caution.
Affiliation(s)
- Nikita I. Lytkin
- Center for Health Informatics and Bioinformatics, New York University School of Medicine, New York, New York, United States of America
- Lauren McVoy
- Department of Pathology, New York University School of Medicine, New York, New York, United States of America
- Jörn-Hendrik Weitkamp
- Division of Neonatology, Department of Pediatrics, Vanderbilt University School of Medicine and Monroe Carell Jr. Children's Hospital at Vanderbilt, Nashville, Tennessee, United States of America
- Constantin F. Aliferis
- Center for Health Informatics and Bioinformatics, New York University School of Medicine, New York, New York, United States of America
- Department of Pathology, New York University School of Medicine, New York, New York, United States of America
- Department of Biostatistics, Vanderbilt University, Nashville, Tennessee, United States of America
- Alexander Statnikov
- Center for Health Informatics and Bioinformatics, New York University School of Medicine, New York, New York, United States of America
- Department of Medicine, New York University School of Medicine, New York, New York, United States of America
|
37
|
Shakya K, Ruskin HJ, Kerr G, Crane M, Becker J. Comparison of microarray preprocessing methods. Adv Exp Med Biol 2011; 680:139-47. [PMID: 20865495 DOI: 10.1007/978-1-4419-5913-3_16] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Indexed: 03/07/2023]
Abstract
Data preprocessing in microarray technology is a crucial initial step before data analysis is performed. Many preprocessing methods have been proposed but none has proved to be ideal to date. Frequently, datasets are limited by laboratory constraints so that the need is for guidelines on quality and robustness, to inform further experimentation while data are yet restricted. In this paper, we compared the performance of four popular methods, namely MAS5, Li & Wong pmonly (LWPM), Li & Wong subtractMM (LWMM), and Robust Multichip Average (RMA). The comparison is based on the analysis carried out on sets of laboratory-generated data from the Bioinformatics Lab, National Institute of Cellular Biotechnology (NICB), Dublin City University, Ireland. These experiments were designed to examine the effect of Bromodeoxyuridine (5-bromo-2-deoxyuridine, BrdU) treatment in deep lamellar keratoplasty (DLKP) cells. The methodology employed is to assess dispersion across the replicates and analyze the false discovery rate. From the dispersion analysis, we found that variability is reduced more effectively by LWPM and RMA methods. From the false positive analysis, and for both parametric and nonparametric approaches, LWMM is found to perform best. Based on a complementary q-value analysis, LWMM approach again is the strongest candidate. The indications are that, while LWMM is marginally less effective than LWPM and RMA in terms of variance reduction, it has considerably improved discrimination overall.
Affiliation(s)
- K Shakya
- Dublin City University, Dublin 9, Ireland.
|
38
|
Taub MA, Corrada Bravo H, Irizarry RA. Overcoming bias and systematic errors in next generation sequencing data. Genome Med 2010; 2:87. [PMID: 21144010 PMCID: PMC3025429 DOI: 10.1186/gm208] [Citation(s) in RCA: 74] [Impact Index Per Article: 5.3] [Indexed: 12/12/2022]
Abstract
Considerable time and effort has been spent in developing analysis and quality assessment methods to allow the use of microarrays in a clinical setting. As is the case for microarrays and other high-throughput technologies, data from new high-throughput sequencing technologies are subject to technological and biological biases and systematic errors that can impact downstream analyses. Only when these issues can be readily identified and reliably adjusted for will clinical applications of these new technologies be feasible. Although much work remains to be done in this area, we describe consistently observed biases that should be taken into account when analyzing high-throughput sequencing data. In this article, we review current knowledge about these biases, discuss their impact on analysis results, and propose solutions.
Affiliation(s)
- Margaret A Taub
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, 615 North Wolfe Street, E3527, Baltimore, MD 21205, USA
- Hector Corrada Bravo
- Department of Computer Science, University of Maryland Institute for Advanced Computer Studies and Center for Bioinformatics and Computational Biology, Biomolecular Sciences Building 296, College Park, MD 20742, USA
- Rafael A Irizarry
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, 615 North Wolfe Street, E3527, Baltimore, MD 21205, USA
|
39
|
Dai M, Thompson RC, Maher C, Contreras-Galindo R, Kaplan MH, Markovitz DM, Omenn G, Meng F. NGSQC: cross-platform quality analysis pipeline for deep sequencing data. BMC Genomics 2010; 11 Suppl 4:S7. [PMID: 21143816 PMCID: PMC3005923 DOI: 10.1186/1471-2164-11-s4-s7] [Citation(s) in RCA: 84] [Impact Index Per Article: 6.0] [Indexed: 11/24/2022]
Abstract
Background: While the accuracy and precision of deep sequencing data are significantly better than those obtained by the earlier generation of hybridization-based high throughput technologies, the digital nature of deep sequencing output often leads to unwarranted confidence in their reliability. Results: The NGSQC (Next Generation Sequencing Quality Control) pipeline provides a set of novel quality control measures for quickly detecting a wide variety of quality issues in deep sequencing data derived from two dimensional surfaces, regardless of the assay technology used. It also enables researchers to determine whether sequencing data related to their most interesting biological discoveries are caused by sequencing quality issues. Conclusions: Next generation sequencing platforms have their own share of quality issues and there can be significant lab-to-lab, batch-to-batch and even within chip/slide variations. NGSQC can help to ensure that biological conclusions, in particular those based on relatively rare sequence alterations, are not caused by low quality sequencing.
Affiliation(s)
- Manhong Dai
- Department of Psychiatry, University of Michigan, Ann Arbor, MI 48109, USA.
|
40
|
Kohl M, Deigner HP. Preprocessing of gene expression data by optimally robust estimators. BMC Bioinformatics 2010; 11:583. [PMID: 21118506 PMCID: PMC3744637 DOI: 10.1186/1471-2105-11-583] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Received: 04/01/2010] [Accepted: 11/30/2010] [Indexed: 11/17/2022]
Abstract
Background: The preprocessing of gene expression data obtained from several platforms routinely includes the aggregation of multiple raw signal intensities to one expression value. Examples are the computation of a single expression measure based on the perfect match (PM) and mismatch (MM) probes for the Affymetrix technology, the summarization of bead level values to bead summary values for the Illumina technology or the aggregation of replicated measurements in the case of other technologies including real-time quantitative polymerase chain reaction (RT-qPCR) platforms. The summarization of technical replicates is also performed in other "-omics" disciplines like proteomics or metabolomics. Preprocessing methods like MAS 5.0, Illumina's default summarization method, RMA, or VSN show that the use of robust estimators is widely accepted in gene expression analysis. However, the selection of robust methods seems to be mainly driven by their high breakdown point and not by efficiency. Results: We describe how optimally robust radius-minimax (rmx) estimators, i.e. estimators that minimize an asymptotic maximum risk on shrinking neighborhoods about an ideal model, can be used for the aggregation of multiple raw signal intensities to one expression value for Affymetrix and Illumina data. With regard to the Affymetrix data, we have implemented an algorithm which is a variant of MAS 5.0. Using datasets from the literature and Monte-Carlo simulations we provide some reasoning for assuming approximate log-normal distributions of the raw signal intensities by means of the Kolmogorov distance, at least for the discussed datasets, and compare the results of our preprocessing algorithms with the results of Affymetrix's MAS 5.0 and Illumina's default method. The numerical results indicate that when using rmx estimators an accuracy improvement of about 10-20% is obtained compared to Affymetrix's MAS 5.0 and about 1-5% compared to Illumina's default method. The improvement is also visible in the analysis of technical replicates where the reproducibility of the values (in terms of Pearson and Spearman correlation) is increased for all Affymetrix and almost all Illumina examples considered. Our algorithms are implemented in the R package named RobLoxBioC which is publicly available via CRAN, The Comprehensive R Archive Network (http://cran.r-project.org/web/packages/RobLoxBioC/). Conclusions: Optimally robust rmx estimators have a high breakdown point and are computationally feasible. They can lead to a considerable gain in efficiency for well-established bioinformatics procedures and thus, can increase the reproducibility and power of subsequent statistical analysis.
Affiliation(s)
- Matthias Kohl
- Department of Mechanical and Process Engineering, Furtwangen University, Jakob-Kienzle-Str, 17, 78054 Villingen-Schwenningen, Germany.
|
41
|
Küppers M, Ittrich C, Faust D, Dietrich C. The transcriptional programme of contact-inhibition. J Cell Biochem 2010; 110:1234-43. [PMID: 20564218 DOI: 10.1002/jcb.22638] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.1] [Indexed: 11/10/2022]
Abstract
Proliferation of non-transformed cells is regulated by cell-cell contacts, which are referred to as contact-inhibition. Vice versa, transformed cells are characterised by a loss of contact-inhibition. Despite its generally accepted importance for cell-cycle control, little is known about the intracellular signalling pathways involved in contact-inhibition. Unravelling the molecular mechanisms of contact-inhibition and its loss during tumourigenesis will be an important step towards the identification of novel target genes in tumour diagnosis and treatment. To better understand the underlying molecular mechanisms we identified the transcriptional programme of contact-inhibition in NIH3T3 fibroblasts using high-density microarrays. Setting the cut off at ≥1.5-fold, P ≤ 0.05, 853 genes and 73 cDNA sequences were differentially expressed in confluent compared to exponentially growing cultures. Importing these data into GenMAPP software revealed a comprehensive list of cell-cycle regulatory genes mediating G0/G1 arrest, which was confirmed by RT-PCR and Western blot. In a narrow analysis (cut off: ≥2-fold, P ≤ 0.002), we found 110 transcripts to be differentially expressed representing 107 genes and 3 cDNA sequences involved, for example, in proliferation, signal transduction, transcriptional regulation, cell adhesion and communication. Interestingly, the majority of genes was upregulated indicating that contact-inhibition is not a passive state, but actively induced. Furthermore, we confirmed differential expression of eight genes by semi-quantitative RT-PCR and identified the potential tumour suppressor transforming growth factor-beta (TGF-beta)-1-induced clone 22 (TSC-22; tgfb1i4) as a novel protein to be induced in contact-inhibited cells.
Affiliation(s)
- Monika Küppers
- Institute of Toxicology, Medical Center of the Johannes Gutenberg-University, Obere Zahlbacherstr 67, 55131 Mainz, Germany
|
42
|
Giorgi FM, Bolger AM, Lohse M, Usadel B. Algorithm-driven artifacts in median polish summarization of microarray data. BMC Bioinformatics 2010; 11:553. [PMID: 21070630 PMCID: PMC2998528 DOI: 10.1186/1471-2105-11-553] [Citation(s) in RCA: 44] [Impact Index Per Article: 3.1] [Received: 01/05/2010] [Accepted: 11/11/2010] [Indexed: 12/27/2022]
Abstract
BACKGROUND: High-throughput measurement of transcript intensities using Affymetrix type oligonucleotide microarrays has produced a massive quantity of data during the last decade. Different preprocessing techniques exist to convert the raw signal intensities measured by these chips into gene expression estimates. Although these techniques have been widely benchmarked in the context of differential gene expression analysis, there are only a few examples where their performance has been assessed with respect to coexpression-based studies such as sample classification. RESULTS: In the present paper we benchmark the three most used normalization procedures (MAS5, RMA and GCRMA) in the context of inter-array correlation analysis, confirming and extending the finding that RMA and GCRMA consistently overestimate sample similarity upon normalization. We determine that median polish summarization is responsible for generating a large proportion of these over-similarity artifacts. Furthermore, we show that most affected probesets also show internal signal disagreement, and tend to be composed of individual probes hitting different gene transcripts. We finally provide a correction to the RMA/GCRMA summarization procedure that massively reduces inter-array correlation artifacts, without affecting the detection of differentially expressed genes. CONCLUSIONS: We propose tRMA as a modification of RMA to normalize microarray experiments for correlation-based analysis.
Affiliation(s)
- Federico M Giorgi
- Max Planck Institute of Molecular Plant Physiology, Am Muehlenberg 1, 14476 Golm, Germany
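The median polish summarization that the abstract above identifies as a source of over-similarity artifacts can be sketched in a few lines. What follows is a generic Tukey median polish on a probes-by-arrays matrix of log2 intensities, written for illustration only: the function name, iteration limit, and tolerance are assumptions, and it is neither the RMA implementation nor the tRMA correction proposed by the authors.

```python
import numpy as np


def median_polish(x, max_iter=10, tol=1e-6):
    """Tukey median polish of a 2-D array (rows = probes, columns = arrays).

    Returns (overall, row_eff, col_eff, resid) such that
    x == overall + row_eff[:, None] + col_eff[None, :] + resid.
    """
    resid = np.asarray(x, dtype=float).copy()
    overall = 0.0
    row_eff = np.zeros(resid.shape[0])
    col_eff = np.zeros(resid.shape[1])
    for _ in range(max_iter):
        # Sweep row medians out of the residuals into the row (probe) effects.
        row_med = np.median(resid, axis=1)
        resid -= row_med[:, None]
        row_eff += row_med
        # Re-centre the column effects, moving their median into the overall term.
        shift = np.median(col_eff)
        col_eff -= shift
        overall += shift
        # Sweep column medians out of the residuals into the column (array) effects.
        col_med = np.median(resid, axis=0)
        resid -= col_med[None, :]
        col_eff += col_med
        # Re-centre the row effects in the same way.
        shift = np.median(row_eff)
        row_eff -= shift
        overall += shift
        if np.abs(row_med).max() < tol and np.abs(col_med).max() < tol:
            break
    return overall, row_eff, col_eff, resid


# Hypothetical probeset: 3 probes (rows) measured on 2 arrays (columns).
probes = np.array([[10.0, 20.0], [11.0, 21.0], [12.0, 22.0]])
overall, row_eff, col_eff, resid = median_polish(probes)
# In RMA-style summarization the per-array expression estimate is overall + column effect.
expression = overall + col_eff
print(expression)
```

Because each array's estimate is built from medians shared across all arrays in the batch, probe-level disagreement is absorbed into the residuals, which is one plausible route to the inflated inter-array similarity the abstract describes.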
|
43
|
Abstract
The rapid advances in biotechnology have given rise to a variety of high-dimensional data. Many of these data, including DNA microarray data, mass spectrometry protein data, and high-throughput screening (HTS) assay data, are generated by complex experimental procedures that involve multiple steps such as sample extraction, purification and/or amplification, labeling, fragmentation, and detection. Therefore, the quantity of interest is not directly obtained and a number of preprocessing procedures are necessary to convert the raw data into the format with biological relevance. This also makes exploratory data analysis and visualization essential steps to detect possible defects, anomalies or distortion of the data, to test underlying assumptions and thus ensure data quality. The characteristics of the data structure revealed in exploratory analysis often motivate decisions in preprocessing procedures to produce data suitable for downstream analysis. In this chapter we review the common techniques in exploring and visualizing high-dimensional data and introduce the basic preprocessing procedures.
|
44
|
Aniba MR, Poch O, Thompson JD. Issues in bioinformatics benchmarking: the case study of multiple sequence alignment. Nucleic Acids Res 2010; 38:7353-63. [PMID: 20639539 PMCID: PMC2995051 DOI: 10.1093/nar/gkq625] [Citation(s) in RCA: 45] [Impact Index Per Article: 3.2] [Received: 04/27/2010] [Revised: 06/10/2010] [Accepted: 06/29/2010] [Indexed: 11/13/2022]
Abstract
The post-genomic era presents many new challenges for the field of bioinformatics. Novel computational approaches are now being developed to handle the large, complex and noisy datasets produced by high throughput technologies. Objective evaluation of these methods is essential (i) to assure high quality, (ii) to identify strong and weak points of the algorithms, (iii) to measure the improvements introduced by new methods and (iv) to enable non-specialists to choose an appropriate tool. Here, we discuss the development of formal benchmarks, designed to represent the current problems encountered in the bioinformatics field. We consider several criteria for building good benchmarks and the advantages to be gained when they are used intelligently. To illustrate these principles, we present a more detailed discussion of benchmarks for multiple alignments of protein sequences. As in many other domains, significant progress has been achieved in the multiple alignment field and the datasets have become progressively more challenging as the existing algorithms have evolved. Finally, we propose directions for future developments that will ensure that the bioinformatics benchmarks correspond to the challenges posed by the high throughput data.
Affiliation(s)
- Mohamed Radhouene Aniba
- Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC), Department of Structural Biology and Genomics, Institut National de la Santé et de la Recherche Médicale (INSERM), U596, The Centre National de la Recherche Scientifique (CNRS), UMR7104, F-67400 Illkirch and Université de Strasbourg, F-67000 Strasbourg, France
- Olivier Poch
- Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC), Department of Structural Biology and Genomics, Institut National de la Santé et de la Recherche Médicale (INSERM), U596, The Centre National de la Recherche Scientifique (CNRS), UMR7104, F-67400 Illkirch and Université de Strasbourg, F-67000 Strasbourg, France
- Julie D. Thompson
- Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC), Department of Structural Biology and Genomics, Institut National de la Santé et de la Recherche Médicale (INSERM), U596, The Centre National de la Recherche Scientifique (CNRS), UMR7104, F-67400 Illkirch and Université de Strasbourg, F-67000 Strasbourg, France
|
45
|
Skrzypczak M, Goryca K, Rubel T, Paziewska A, Mikula M, Jarosz D, Pachlewski J, Oledzki J, Ostrowski J. Modeling oncogenic signaling in colon tumors by multidirectional analyses of microarray data directed for maximization of analytical reliability. PLoS One 2010; 5. [PMID: 20957034 PMCID: PMC2948500 DOI: 10.1371/journal.pone.0013091] [Citation(s) in RCA: 270] [Impact Index Per Article: 19.3] [Received: 03/17/2010] [Accepted: 09/08/2010] [Indexed: 12/16/2022]
Abstract
Background: Clinical progression of colorectal cancers (CRC) may occur in parallel with distinctive signaling alterations. We designed multidirectional analyses integrating microarray-based data with biostatistics and bioinformatics to elucidate the signaling and metabolic alterations underlying CRC development in the adenoma-carcinoma sequence. Methodology/Principal Findings: Studies were performed on normal mucosa, adenoma, and carcinoma samples obtained during surgery or colonoscopy. Collections of cryostat sections prepared from the tissue samples were evaluated by a pathologist to control the relative cell type content. The measurements were done using Affymetrix GeneChip HG-U133plus2, and probe set data was generated using two normalization algorithms: MAS5.0 and GCRMA with least-variant set (LVS). The data was evaluated using pair-wise comparisons and data decomposition into singular value decomposition (SVD) modes. The method selected for the functional analysis used the Kolmogorov-Smirnov test. Expressional profiles obtained in 105 samples of whole tissue sections were used to establish oncogenic signaling alterations in progression of CRC, while those representing 40 microdissected specimens were used to select differences in KEGG pathways between epithelium and mucosa. Based on a consensus of the results obtained by two normalization algorithms, and two probe set sorting criteria, we identified 14 and 17 KEGG signaling and metabolic pathways that are significantly altered between normal and tumor samples and between benign and malignant tumors, respectively. Several of them were also selected from the raw microarray data of 2 recently published studies (GSE4183 and GSE8671). Conclusion/Significance: Although the proposed strategy is computationally complex and labor-intensive, it may reduce the number of false results.
Affiliation(s)
- Magdalena Skrzypczak
- Department of Gastroenterology and Hepatology, Medical Center for Postgraduate Education, Warsaw, Poland
- Krzysztof Goryca
- Department of Gastroenterology and Hepatology, Medical Center for Postgraduate Education, Warsaw, Poland
- Tymon Rubel
- Laboratory of Bioinformatics and Systems Biology, Maria Sklodowska-Curie Memorial Cancer Center and Institute of Oncology, Warsaw, Poland
- Agnieszka Paziewska
- Department of Gastroenterology and Hepatology, Medical Center for Postgraduate Education, Warsaw, Poland
- Michal Mikula
- Department of Oncological Genetics, Maria Sklodowska-Curie Memorial Cancer Center and Institute of Oncology, Warsaw, Poland
- Dorota Jarosz
- Department of Gastroenterology and Hepatology, Medical Center for Postgraduate Education, Warsaw, Poland
- Jacek Pachlewski
- Department of Gastroenterology and Hepatology, Medical Center for Postgraduate Education, Warsaw, Poland
- Janusz Oledzki
- Department of Colorectal Cancer, Maria Sklodowska-Curie Memorial Cancer Center and Institute of Oncology, Warsaw, Poland
- Jerzy Ostrowski
- Department of Gastroenterology and Hepatology, Medical Center for Postgraduate Education, Warsaw, Poland
- Department of Oncological Genetics, Maria Sklodowska-Curie Memorial Cancer Center and Institute of Oncology, Warsaw, Poland
|
46
|
Eronen VP, Lindén RO, Lindroos A, Kanerva M, Aittokallio T. Genome-wide scoring of positive and negative epistasis through decomposition of quantitative genetic interaction fitness matrices. PLoS One 2010; 5:e11611. [PMID: 20657656 PMCID: PMC2904709 DOI: 10.1371/journal.pone.0011611] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 03/26/2010] [Accepted: 06/22/2010] [Indexed: 01/07/2023]
Abstract
Recent technological developments in genetic screening approaches have offered the means to start exploring quantitative genotype-phenotype relationships on a large-scale. What remains unclear is the extent to which the quantitative genetic interaction datasets can distinguish the broad spectrum of interaction classes, as compared to existing information on mutation pairs associated with both positive and negative interactions, and whether the scoring of varying degrees of such epistatic effects could be improved by computational means. To address these questions, we introduce here a computational approach for improving the quantitative discrimination power encoded in the genetic interaction screening data. Our matrix approximation model decomposes the original double-mutant fitness matrix into separate components, representing variability across the array and query mutants, which can be utilized for estimating and correcting the single-mutant fitness effects, respectively. When applied to three large-scale quantitative interaction datasets in yeast, we could improve the accuracy of scoring various interaction classes beyond that obtained with the original fitness data, especially in synthetic genetic array (SGA) and in genetic interaction mapping (GIM) datasets. In addition to the known pairs of interactions used in the evaluation of the computational approach, a number of novel interaction pairs were also predicted, along with underlying biological mechanisms, which remained undetected by the original datasets. It was shown that the optimal choice of the scoring function depends heavily on the screening approach and on the interaction class under analysis. Moreover, a simple preprocessing of the fitness matrix could further enhance the discrimination power of the epistatic miniarray profiling (E-MAP) dataset. These systematic evaluation results provide in-depth information on the optimal analysis of the future, large-scale screening experiments. 
In general, the modeling framework, enabling accurate identification and classification of genetic interactions, provides a solid basis for completing and mining the genetic interaction networks in yeast and other organisms.
Affiliation(s)
- Ville-Pekka Eronen
- Biomathematics Research Group, Department of Mathematics, University of Turku, Turku, Finland
- Rolf O. Lindén
- Biomathematics Research Group, Department of Mathematics, University of Turku, Turku, Finland
- Data Mining and Modeling Group, Turku Centre for Biotechnology, University of Turku, Turku, Finland
- Anna Lindroos
- Division of Genetics and Physiology, Department of Biology, University of Turku, Turku, Finland
- Mirella Kanerva
- Division of Genetics and Physiology, Department of Biology, University of Turku, Turku, Finland
- Tero Aittokallio
- Biomathematics Research Group, Department of Mathematics, University of Turku, Turku, Finland
- Data Mining and Modeling Group, Turku Centre for Biotechnology, University of Turku, Turku, Finland
|
47
|
Schmid R, Baum P, Ittrich C, Fundel-Clemens K, Huber W, Brors B, Eils R, Weith A, Mennerich D, Quast K. Comparison of normalization methods for Illumina BeadChip HumanHT-12 v3. BMC Genomics 2010; 11:349. [PMID: 20525181 PMCID: PMC3091625 DOI: 10.1186/1471-2164-11-349] [Citation(s) in RCA: 62] [Impact Index Per Article: 4.4] [Received: 03/09/2010] [Accepted: 06/02/2010] [Indexed: 11/26/2022]
Abstract
Background: Normalization of microarrays is a standard practice to account for and minimize effects which are not due to the controlled factors in an experiment. There is an overwhelming number of different methods that can be applied, none of which is ideally suited for all experimental designs. Thus, it is important to identify a normalization method appropriate for the experimental setup under consideration that is neither too negligent nor too stringent. The major aim is to derive optimal results from the underlying experiment. Comparisons of different normalization methods have already been conducted, but none, to our knowledge, compared more than a handful of methods. Results: In the present study, 25 different ways of pre-processing Illumina Sentrix BeadChip array data are compared. Among others, methods provided by the BeadStudio software are taken into account. Looking at different statistical measures, we point out the ideal versus the actual observations. Additionally, we compare qRT-PCR measurements of transcripts from different ranges of expression intensities to the respective normalized values of the microarray data. Taking together all different kinds of measures, the ideal method for our dataset is identified. Conclusions: Pre-processing of microarray gene expression experiments has been shown to influence further downstream analysis to a great extent and thus has to be carefully chosen based on the design of the experiment. This study provides a recommendation for deciding which normalization method is best suited for a particular experimental setup.
Affiliation(s)
- Ramona Schmid
- Boehringer Ingelheim Pharma GmbH & Co, KG, Birkendorfer Str, 65, 88397 Biberach/Riss, Germany
|
48
|
Kim C, Choi J, Park H, Park Y, Park J, Park T, Cho K, Yang Y, Yoon S. Global analysis of microarray data reveals intrinsic properties in gene expression and tissue selectivity. Bioinformatics 2010; 26:1723-30. [PMID: 20511364 DOI: 10.1093/bioinformatics/btq279] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 11/14/2022]
Abstract
MOTIVATION: It is expected that individual genes have intrinsically different variability in the global expressional trend among them. Thus, the consideration of gene-specific expressional properties will help us to distinguish target-selective gene expression over non-selective over-expression. RESULTS: The re-standardization and integration of heterogeneous microarray datasets, available from public databases, have enabled us to determine the global expression properties of individual genes across a wide variety of experimental conditions and samples. The global averages and SDs of expression for each gene in the integrated microarray datasets were found to be intrinsic properties, which were consistent among independent collections of datasets using different microarray platforms. Using the gene-specific intrinsic parameters to rescale the microarray data, we were able to distinguish novel selective gene expression [cartilage oligomeric matrix protein (COMP) and Collagen X] in breast cancer tissues from non-selective over-expression, a difference that has not been detectable by conventional methods. AVAILABILITY AND IMPLEMENTATION: The web-based tool for GS-LAGE is available at http://lage.sookmyung.ac.kr
Affiliation(s)
- Changsik Kim
- Department of Biological Sciences, Sookmyung Women's University, Seoul, Republic of Korea
|
49
|
Affiliation(s)
- Mark Reimers
- Department of Biostatistics, Virginia Commonwealth University, Richmond, Virginia, United States of America.
|
50
|
Zhu CQ, Pintilie M, John T, Strumpf D, Shepherd FA, Der SD, Jurisica I, Tsao MS. Understanding prognostic gene expression signatures in lung cancer. Clin Lung Cancer 2010; 10:331-40. [PMID: 19808191 DOI: 10.3816/clc.2009.n.045] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.5] [Indexed: 11/20/2022]
Abstract
In non-small-cell lung cancer (NSCLC), molecular profiling of tumors has led to the identification of gene expression patterns that are associated with specific phenotypes and prognosis. Such correlations could identify early-stage patients who are at increased risk of disease recurrence and death after complete surgical resection and who might benefit from adjuvant therapy. Profiling may also identify aberrant molecular pathways that might lead to specific molecularly targeted therapies. The technology behind the capturing and correlating of molecular profiles with clinical and biologic endpoints has evolved rapidly since microarrays were first developed a decade ago. In this review, we discuss multiple methods that have been used to derive prognostic gene expression signatures in NSCLC. Despite the diversity in the approaches used, 3 main steps are followed. First, the expression levels of several hundred to tens of thousands of genes are quantified by microarray or quantitative polymerase chain reaction techniques; the data are then preprocessed, normalized, and possibly filtered. In the second step, expression data are combined and grouped by clustering, risk score generation, or other means, to generate a gene signature that correlates with a clinical outcome, usually survival. Finally, the signature is validated in datasets of independent cohorts. This review discusses the concepts and methodologies involved in these analytical steps, primarily to facilitate the understanding of reports on large dataset gene expression studies that focus on prognostic signatures in NSCLC.
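The three steps named in this abstract can be sketched as a toy pipeline. It assumes a weighted-sum risk score with a median cutoff, which is only one of the options the abstract mentions. The gene names, weights, and cohort values are hypothetical, and a real validation step would compare survival between groups rather than only assigning labels.

```python
from statistics import mean, median, pstdev

def standardize(cohort):
    """Step 1 (sketch): z-score each gene within a cohort.

    cohort: list of dicts mapping gene name -> expression value.
    """
    genes = list(cohort[0].keys())
    stats = {}
    for gene in genes:
        values = [patient[gene] for patient in cohort]
        stats[gene] = (mean(values), pstdev(values))
    return [
        {g: (p[g] - stats[g][0]) / stats[g][1] if stats[g][1] > 0 else 0.0 for g in genes}
        for p in cohort
    ]

def risk_score(patient, weights):
    """Step 2 (sketch): weighted sum of standardized expression values."""
    return sum(weights[gene] * patient[gene] for gene in weights)

def assign_groups(cohort, weights, cutoff):
    """Label each patient high or low risk against a fixed cutoff."""
    return ["high" if risk_score(p, weights) > cutoff else "low" for p in cohort]

# Hypothetical signature weights (in practice derived from outcome data).
weights = {"G1": 1.0, "G2": -0.5}

training = standardize([
    {"G1": 1.0, "G2": 5.0},
    {"G1": 2.0, "G2": 4.0},
    {"G1": 3.0, "G2": 3.0},
    {"G1": 4.0, "G2": 2.0},
])
cutoff = median(risk_score(p, weights) for p in training)
training_groups = assign_groups(training, weights, cutoff)

# Step 3 (sketch): apply the frozen weights and cutoff to an independent cohort.
validation = standardize([
    {"G1": 1.5, "G2": 4.5},
    {"G1": 3.5, "G2": 2.5},
])
validation_groups = assign_groups(validation, weights, cutoff)
```

The weights and cutoff are fixed on the training cohort and reused unchanged on the validation cohort. Keeping them fixed is what makes the third step an independent check and not a refit.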
Affiliation(s)
- Chang-Qi Zhu
- University Health Network, Ontario Cancer Institute/Princess Margaret Hospital, Ontario, Canada
|