1. Forouzandeh A, Rutar A, Kalmady SV, Greiner R. Analyzing biomarker discovery: Estimating the reproducibility of biomarker sets. PLoS One 2022; 17:e0252697. [PMID: 35901020 PMCID: PMC9333302 DOI: 10.1371/journal.pone.0252697] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 05/18/2021] [Accepted: 06/29/2022] [Indexed: 11/19/2022] Open
Abstract
Many researchers try to understand a biological condition by identifying biomarkers. This is typically done using univariate hypothesis testing over a labeled dataset, declaring a feature to be a biomarker if there is a significant statistical difference between its values for the subjects with different outcomes. However, such sets of proposed biomarkers are often not reproducible – subsequent studies often fail to identify the same sets. Indeed, there is often only a very small overlap between the biomarkers proposed in pairs of related studies that explore the same phenotypes over the same distribution of subjects. This paper first defines the Reproducibility Score for a labeled dataset as a measure (taking values between 0 and 1) of the reproducibility of the results produced by a specified fixed biomarker discovery process for a given distribution of subjects. We then provide ways to reliably estimate this score by defining algorithms that produce an over-bound and an under-bound for this score for a given dataset and biomarker discovery process, for the case of univariate hypothesis testing on dichotomous groups. We confirm that these approximations are meaningful by providing empirical results on a large number of datasets and show that these predictions match known reproducibility results. To encourage others to apply this technique to analyze their biomarker sets, we have also created a publicly available website, https://biomarker.shinyapps.io/BiomarkerReprod/, that produces these Reproducibility Score approximations for any given dataset (with continuous or discrete features and binary class labels).
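The idea of scoring how reproducible a univariate discovery process is can be illustrated with a split-half experiment. The sketch below is not the over-bound or under-bound algorithm from the paper: it assumes overlap is measured with the Jaccard index, uses a fixed cutoff on Welch's t statistic (`t_crit`, a hypothetical stand-in for a p-value threshold), and all function names are illustrative.

```python
import random
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples (each needs >= 2 values)."""
    se2 = variance(a) / len(a) + variance(b) / len(b)
    if se2 == 0:  # both samples constant: no spread to scale the difference by
        return 0.0 if mean(a) == mean(b) else float("inf")
    return (mean(a) - mean(b)) / se2 ** 0.5


def discover(rows, labels, t_crit=2.0):
    """Indices of features with |t| > t_crit (assumed stand-in for p < alpha)."""
    hits = set()
    for j in range(len(rows[0])):
        pos = [r[j] for r, y in zip(rows, labels) if y == 1]
        neg = [r[j] for r, y in zip(rows, labels) if y == 0]
        if abs(welch_t(pos, neg)) > t_crit:
            hits.add(j)
    return hits


def jaccard(a, b):
    """Overlap of two sets in [0, 1]; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def split_half_overlap(rows, labels, n_splits=20, seed=0):
    """Mean Jaccard overlap of biomarker sets found on random disjoint halves."""
    rng = random.Random(seed)
    by_class = {c: [i for i, y in enumerate(labels) if y == c] for c in (0, 1)}
    scores = []
    for _ in range(n_splits):
        half_a, half_b = [], []
        for idx in by_class.values():  # stratify so both halves keep both classes
            idx = idx[:]
            rng.shuffle(idx)
            half_a += idx[: len(idx) // 2]
            half_b += idx[len(idx) // 2:]
        sets = [discover([rows[i] for i in h], [labels[i] for i in h])
                for h in (half_a, half_b)]
        scores.append(jaccard(*sets))
    return mean(scores)
```

A value near 1 means the two halves mostly agree on which features are biomarkers; a value near 0 means the proposed sets barely overlap. Each half must contain at least two subjects per class for the t statistic to be defined.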
Affiliation(s)
- Amir Forouzandeh
  - Department of Computing Science, University of Alberta, Edmonton, Canada
- Alex Rutar
  - Department of Pure Math, University of Waterloo, Waterloo, ON, Canada
- Sunil V. Kalmady
  - Department of Computing Science, University of Alberta, Edmonton, Canada
  - Canadian VIGOUR Centre, University of Alberta, Edmonton, Canada
- Russell Greiner
  - Department of Computing Science, University of Alberta, Edmonton, Canada
  - Alberta Machine Intelligence Institute, Edmonton, Canada
2. Rezaei Tavirani M, Zamanian Azodi M, Rostami-Nejad M, Morravej H, Razzaghi Z, Okhovatian F, Rezaei-Tavirani M. Introducing Serine as Cardiovascular Disease Biomarker Candidate via Pathway Analysis. Galen Med J 2020; 9:e1696. [PMID: 34466570 PMCID: PMC8343801 DOI: 10.31661/gmj.v9i0.1696] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 08/21/2019] [Revised: 10/01/2019] [Accepted: 12/03/2019] [Indexed: 11/25/2022] Open
Abstract
Background: The rate of death due to cardiovascular disease (CVD) is growing. Previous investigations of CVD have reported a variety of associated metabolites. The main aim of this study is to screen these metabolites for those likely to be useful in future clinical applications. Materials and Methods: Thirty-four CVD-related metabolites were extracted from the literature and analyzed for interactions using MetScape V 3.1.3. The compound-reaction-enzyme-gene network was constructed and the pathways were analyzed. The critical compounds were determined based on the presence of metabolites in the pathways. Results: Pathway analysis revealed 18 disturbed pathways related to CVD. The glycerophospholipid metabolism pathway, which includes 27 compounds, is related to 9 of the queried metabolites. L-Serine, which was common to 5 pathways and was also present in the largest pathway, was identified as the critical compound. Conclusion: L-Serine is a suitable biomarker candidate for CVD diagnosis and for patient follow-up.
Affiliation(s)
- Mostafa Rezaei Tavirani
  - Proteomics Research Center, Faculty of Paramedical Sciences, Shahid Beheshti University of Medical Sciences, Tehran, Iran
- Mona Zamanian Azodi
  - Proteomics Research Center, Shahid Beheshti University of Medical Sciences, Tehran, Iran
  - Correspondence to: Mona Zamanian Azodi, Proteomics Research Center, Shahid Beheshti University of Medical Sciences, Tehran, Iran. Telephone Number: +982122714248
- Mohammad Rostami-Nejad
  - Gastroenterology and Liver Diseases Research Center, Research Institute for Gastroenterology and Liver Diseases, Shahid Beheshti University of Medical Sciences, Tehran, Iran
- Hamideh Morravej
  - Skin Research Center, Shahid Beheshti University of Medical Sciences, Tehran, Iran
- Zahra Razzaghi
  - Laser Application in Medical Sciences Research Center, Shahid Beheshti University of Medical Sciences, Tehran, Iran
- Farshad Okhovatian
  - Physiotherapy Research Center, Shahid Beheshti University of Medical Sciences, Tehran, Iran
- Majid Rezaei-Tavirani
  - Proteomics Research Center, Shahid Beheshti University of Medical Sciences, Tehran, Iran
3. Zhang Y, Zhang J, Ju S, Qiu L. Identifying biomarker candidates of influenza infection based on scalable time-course big data of gene expression. Comput Intell 2019. [DOI: 10.1111/coin.12226] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 12/24/2022]
Affiliation(s)
- Yuan Zhang
  - School of Information Science and Engineering, University of Jinan, Jinan, China
  - Shandong Provincial Key Laboratory of Network Based Intelligent Computing, University of Jinan, Jinan, China
- Jin Zhang
  - School of Information Science and Engineering, University of Jinan, Jinan, China
  - Shandong Provincial Key Laboratory of Network Based Intelligent Computing, University of Jinan, Jinan, China
- Shan Ju
  - School of International Trade and Economics, Shandong University of Finance and Economics, Jinan, China
- Lu Qiu
  - School of Finance and Business, Shanghai Normal University, Shanghai, China
4. Novianti PW, Jong VL, Roes KCB, Eijkemans MJC. Meta-analysis approach as a gene selection method in class prediction: does it improve model performance? A case study in acute myeloid leukemia. BMC Bioinformatics 2017; 18:210. [PMID: 28399794 PMCID: PMC5387259 DOI: 10.1186/s12859-017-1619-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 10/05/2016] [Accepted: 03/30/2017] [Indexed: 12/22/2022] Open
Abstract
BACKGROUND Aggregating gene expression data across experiments via meta-analysis is expected to increase the precision of the effect estimates and to increase the statistical power to detect a certain fold change. This study evaluates the potential benefit of using a meta-analysis approach as a gene selection method prior to predictive modeling in gene expression data. RESULTS Six raw datasets from different gene expression experiments in acute myeloid leukemia (AML) and 11 different classification methods were used to build classification models to classify samples as either AML or healthy control. First, the classification models were trained on gene expression data from single experiments using conventional supervised variable selection and externally validated with the other five gene expression datasets (referred to as the individual-classification approach). Next, gene selection was performed through meta-analysis on four datasets, and predictive models were trained with the selected genes on the fifth dataset and validated on the sixth dataset. For some datasets, gene selection through meta-analysis helped classification models to achieve higher performance as compared to predictive modeling based on a single dataset; but for others, there was no major improvement. Synthetic datasets were generated from nine simulation scenarios. The effect of sample size, fold change and pairwise correlation between differentially expressed (DE) genes on the difference between MA- and individual-classification model was evaluated. The fold change and pairwise correlation significantly contributed to the difference in performance between the two methods. The gene selection via meta-analysis approach was more effective when it was conducted using a set of data with low fold change and high pairwise correlation on the DE genes. 
CONCLUSION Gene selection through meta-analysis on previously published studies potentially improves the performance of a predictive model on a given gene expression data.
Affiliation(s)
- Putri W. Novianti
  - Biostatistics & Research Support, Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, 3508 GA Utrecht, The Netherlands
  - Department of Epidemiology and Biostatistics, VU University medical center, Amsterdam, The Netherlands
  - Department of Pathology, VU University medical center, Amsterdam, The Netherlands
- Victor L. Jong
  - Biostatistics & Research Support, Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, 3508 GA Utrecht, The Netherlands
  - Viroscience Laboratory, Erasmus Medical Center Rotterdam, 3015 CE Rotterdam, The Netherlands
- Kit C. B. Roes
  - Biostatistics & Research Support, Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, 3508 GA Utrecht, The Netherlands
- Marinus J. C. Eijkemans
  - Biostatistics & Research Support, Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, 3508 GA Utrecht, The Netherlands
5. Canevari RA, Marchi FA, Domingues MAC, de Andrade VP, Caldeira JRF, Verjovski-Almeida S, Rogatto SR, Reis EM. Identification of novel biomarkers associated with poor patient outcomes in invasive breast carcinoma. Tumour Biol 2016; 37:13855-13870. [PMID: 27485113 DOI: 10.1007/s13277-016-5133-8] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 12/24/2015] [Accepted: 07/06/2016] [Indexed: 12/20/2022] Open
Abstract
Breast carcinoma (BC) corresponds to 23 % of all cancers in women, with 1.38 million new cases and 460,000 deaths worldwide annually. Despite the significant advances in the identification of molecular markers and different modalities of treatment for primary BC, the ability to predict its metastatic behavior is still limited. The purpose of this study was to identify novel molecular markers associated with distinct clinical outcomes in a Brazilian cohort of BC patients. We generated global gene expression profiles using tumor samples from 24 patients with invasive ductal BC who were followed for at least 5 years, including a group of 15 patients with favorable outcomes and another with nine patients who developed metastasis. We identified a set of 58 differentially expressed genes (p ≤ 0.01) between the two groups. The prognostic value of this metastasis signature was corroborated by its ability to stratify independent BC patient datasets according to disease-free survival and overall survival. The upregulation of B3GNT7, PPM1D, TNKS2, PHB, and GTSE1 in patients with poor outcomes was confirmed by quantitative reverse transcription polymerase chain reaction (RT-qPCR) in an independent sample of patients with BC (47 with good outcomes and eight that presented metastasis). The expression of BCL2-associated agonist of cell death (BAD) protein was determined in 1276 BC tissue samples by immunohistochemistry and was consistent with the reduced BAD mRNA expression levels in metastatic cases, as observed in the oligoarray data. These findings point to novel prognostic markers that can distinguish breast carcinomas with metastatic potential from those with favorable outcomes.
Affiliation(s)
- Renata A Canevari
  - Instituto de Pesquisa e Desenvolvimento, Universidade do Vale do Paraíba, São José dos Campos, SP, 12244-000, Brazil
- Fabio A Marchi
  - CIPE - AC Camargo Cancer Center, São Paulo, SP, 01508-010, Brazil
- Maria A C Domingues
  - Departamento de Patologia, Faculdade de Medicina, Universidade do Estado de São Paulo - UNESP, Botucatu, SP, 18618-000, Brazil
- José R F Caldeira
  - Departamento de Senologia, Hospital Amaral Carvalho, Jaú, SP, 17210-080, Brazil
- Sergio Verjovski-Almeida
  - Departamento de Bioquímica, Instituto de Química, Universidade de São Paulo - USP, Av. Prof. Lineu Prestes, 748, Cidade Universitaria, São Paulo, SP, 05508-900, Brazil
  - Instituto Butantan, São Paulo, SP, 05503-900, Brazil
- Silvia R Rogatto
  - CIPE - AC Camargo Cancer Center, São Paulo, SP, 01508-010, Brazil
  - Department of Clinical Genetics Vejle Sygehus, Vejle, Denmark
  - Institute of Regional Health, University of Southern Denmark, Vejle, Denmark
- Eduardo M Reis
  - Departamento de Bioquímica, Instituto de Química, Universidade de São Paulo - USP, Av. Prof. Lineu Prestes, 748, Cidade Universitaria, São Paulo, SP, 05508-900, Brazil
6. Wei H, Zhou Y, Jiang S, Huang F, Peng J, Jiang S. Transcriptional response of porcine skeletal muscle to feeding a linseed-enriched diet to growing pigs. J Anim Sci Biotechnol 2016; 7:6. [PMID: 26862397 PMCID: PMC4746901 DOI: 10.1186/s40104-016-0064-1] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Received: 05/19/2015] [Accepted: 01/22/2016] [Indexed: 12/20/2022] Open
Abstract
BACKGROUND To investigate the effect of feeding a linseed-enriched diet to growing-finishing pigs on gene expression in skeletal muscle, pigs were fed a linseed-enriched diet for 0, 30, 60 and 90 d. Transcriptional profiles of longissimus dorsi muscle were measured using Affymetrix Genechip. RESULTS A total of 264 genes were identified as differentially expressed genes (DEGs). The strongest transcriptional response was observed at 30 d. DEGs were assigned to several main functional terms, including transcription, apoptosis, intracellular receptor-mediated signaling, muscle organ development, fatty acid metabolic process, cell motion, regulation of glucose metabolic process, spermatogenesis and regulation of myeloid cell differentiation. We also found that transcriptional changes of several transcription cofactors might contribute to n-3 PUFA-regulated gene expression. In addition, the increased expression of IGF-1, the insulin signaling pathway and the metabolism of amino acids might be involved in the muscle growth induced by feeding a linseed-enriched diet. The results also provide new evidence that the expression changes of PTPN1, HK2 and PGC-1α might contribute to the regulation of insulin sensitivity by n-3 PUFAs. CONCLUSIONS Our findings provide correlative evidence that feeding a linseed-enriched diet affects the expression of genes involved in the insulin signaling pathway and the metabolism of amino acids.
Affiliation(s)
- Hongkui Wei
  - Department of Animal Nutrition and Feed Science, College of Animal Science and Technology, Huazhong Agricultural University, Wuhan, 430070 P. R. China
- Yuanfei Zhou
  - Department of Animal Nutrition and Feed Science, College of Animal Science and Technology, Huazhong Agricultural University, Wuhan, 430070 P. R. China
- Shuzhong Jiang
  - Department of Animal Nutrition and Feed Science, College of Animal Science and Technology, Huazhong Agricultural University, Wuhan, 430070 P. R. China
- Feiruo Huang
  - Department of Animal Nutrition and Feed Science, College of Animal Science and Technology, Huazhong Agricultural University, Wuhan, 430070 P. R. China
- Jian Peng
  - Department of Animal Nutrition and Feed Science, College of Animal Science and Technology, Huazhong Agricultural University, Wuhan, 430070 P. R. China
- Siwen Jiang
  - Key Lab of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education & Key Lab of Swine Genetics and Breeding of Ministry of Agriculture, Huazhong Agricultural University, Wuhan, 430070 P. R. China
7. Jagga Z, Gupta D. Machine learning for biomarker identification in cancer research - developments toward its clinical application. Per Med 2015; 12:371-387. [PMID: 29771660 DOI: 10.2217/pme.15.5] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Indexed: 02/06/2023]
Abstract
The patterns identified from the systematically collected molecular profiles of patient tumor samples, along with clinical metadata, can assist personalized treatments for effective management of cancer patients with similar molecular subtypes. There is an unmet need to develop computational algorithms for cancer diagnosis, prognosis and therapeutics that can identify complex patterns and help in classifications based on the plethora of emerging cancer research outcomes in the public domain. Machine learning, a branch of artificial intelligence, holds great potential for pattern recognition in cryptic cancer datasets, as is evident from a recent literature survey. In this review, we focus on the current status of machine learning applications in cancer research, highlighting trends and analyzing major achievements, roadblocks and challenges toward its implementation in clinics.
Affiliation(s)
- Zeenia Jagga
  - Bioinformatics Laboratory, Structural & Computational Biology Group, International Centre for Genetic Engineering & Biotechnology (ICGEB), Aruna Asaf Ali Marg, New Delhi 110 067, India
- Dinesh Gupta
  - Bioinformatics Laboratory, Structural & Computational Biology Group, International Centre for Genetic Engineering & Biotechnology (ICGEB), Aruna Asaf Ali Marg, New Delhi 110 067, India
8. Ow GS, Kuznetsov VA. Multiple signatures of a disease in potential biomarker space: Getting the signatures consensus and identification of novel biomarkers. BMC Genomics 2015; 16 Suppl 7:S2. [PMID: 26100469 PMCID: PMC4474413 DOI: 10.1186/1471-2164-16-s7-s2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 12/02/2022] Open
Abstract
Background The lack of consensus among reported gene signature subsets (GSSs) in multi-gene biomarker discovery studies is often a concern for researchers and clinicians. Subsequently, it discourages larger scale prospective studies, prevents the translation of such knowledge into a practical clinical setting and ultimately hinders the progress of the field of biomarker-based disease classification, prognosis and prediction. Methods We define all "gene identificators" (gIDs) as constituents of the entire potential disease biomarker space. For each gID in a GSS of interest ("tested GSS"/tGSS), our method counts the empirical frequency of gID co-occurrences/overlaps in other reference GSSs (rGSSs) and compares it with the expected frequency generated via implementation of a randomized sampling procedure. Comparison of the empirical frequency distribution (EFD) with the expected background frequency distribution (BFD) allows dichotomization of statistically novel (SN) and common (SC) gIDs within the tGSS. Results We identify SN or SC biomarkers for tGSSs obtained from previous studies of high-grade serous ovarian cancer (HG-SOC) and breast cancer (BC). For each tGSS, the EFD of gID co-occurrences/overlaps with other rGSSs is characterized by scale and context-dependent Pareto-like frequency distribution function. Our results indicate that while independently there is little overlap between our tGSS with individual rGSSs, comparison of the EFD with BFD suggests that beyond a confidence threshold, tested gIDs become more common in rGSSs than expected. This validates the use of our tGSS as individual or combined prognostic factors. Our method identifies SN and SC genes of a 36-gene prognostic signature that stratify HG-SOC patients into subgroups with low, intermediate or high-risk of the disease outcome. Using 70 BC rGSSs, the method also predicted SN and SC BC prognostic genes from the tested obesity and IGF1 pathway GSSs. 
Conclusions Our method provides a strategy that identifies/predicts, within a tGSS of interest, gID subsets that are either SN or SC when compared to other rGSSs. Practically, our results suggest that there is a stronger association of the IGF1 signature genes with the 70 BC rGSSs than for the obesity-associated signature. Furthermore, both SC and SN genes in both signatures could be considered as prospective prognostic biomarkers of BCs that stratify the patients into low or high risk of cancer development.
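The comparison of empirical and background co-occurrence frequencies described in the Methods can be sketched as follows. This is a simplified illustration rather than the authors' procedure: it assumes the background is built by drawing random gene sets of the same sizes as the reference sets from a known universe, and it assumes a gene is called "common" when its count exceeds a chosen background quantile. The function names and the `quantile` parameter are hypothetical.

```python
import random


def empirical_counts(tested, references):
    """For each gene in the tested set, count the reference sets containing it."""
    return {g: sum(g in ref for ref in references) for g in tested}


def background_counts(universe, ref_sizes, n_trials=500, seed=0):
    """Co-occurrence counts for a random gene against random sets of matching sizes."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_trials):
        probe = rng.choice(universe)
        counts.append(sum(probe in rng.sample(universe, k) for k in ref_sizes))
    return counts


def split_common_novel(tested, references, universe, quantile=0.95, seed=0):
    """Label genes 'common' when their count exceeds the background quantile."""
    bg = sorted(background_counts(universe, [len(r) for r in references], seed=seed))
    cutoff = bg[min(int(quantile * len(bg)), len(bg) - 1)]
    emp = empirical_counts(tested, references)
    common = {g for g, c in emp.items() if c > cutoff}
    return common, set(tested) - common
```

Here `universe` must be a list of all candidate gene identifiers. A gene that appears in many more reference signatures than a randomly chosen gene would lands in the first returned set; the rest land in the second.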
9. Hewitt SC, Winuthayanon W, Pockette B, Kerns RT, Foley JF, Flagler N, Ney E, Suksamrarn A, Piyachaturawat P, Bushel PR, Korach KS. Development of phenotypic and transcriptional biomarkers to evaluate relative activity of potentially estrogenic chemicals in ovariectomized mice. Environ Health Perspect 2015; 123:344-352. [PMID: 25575267 PMCID: PMC4383572 DOI: 10.1289/ehp.1307935] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 11/26/2013] [Accepted: 01/08/2015] [Indexed: 05/30/2023]
Abstract
BACKGROUND Concerns regarding potential endocrine-disrupting chemicals (EDCs) have led to a need for methods to evaluate candidate estrogenic chemicals. Our previous evaluations of two such EDCs revealed a response similar to that of estradiol (E2) at 2 hr, but a less robust response at 24 hr, similar to the short-acting estrogen estriol (E3). OBJECTIVES Microarray analysis using tools to recognize patterns of response has been utilized in the cancer field to develop biomarker panels of transcripts for diagnosis and selection of treatments most likely to be effective. Biological effects elicited by long- versus short-acting estrogens greatly affect the risks associated with exposures; therefore, we sought to develop tools to predict the ability of chemicals to maintain estrogenic responses. METHODS We used biological end points in uterine tissue and a signature pattern-recognizing tool that identified coexpressed transcripts to develop and test a panel of transcripts in order to classify potentially estrogenic compounds using an in vivo system. The end points used are relevant to uterine tissue, but the resulting classification of the compounds is important for other sensitive tissues and species. RESULTS We evaluated biological and transcriptional end points with proven short- and long-acting estrogens and verified the use of our approach using a phytoestrogen. With our model, we were able to classify the diarylheptanoid D3 as a short-acting estrogen. CONCLUSIONS We have developed a panel of transcripts as biomarkers which, together with biological end points, might be used to screen and evaluate potentially estrogenic chemicals and infer mode of activity.
Affiliation(s)
- Sylvia C Hewitt
  - Receptor Biology, Reproductive and Developmental Biology Laboratory
10. Zeng T, Zhang W, Yu X, Liu X, Li M, Liu R, Chen L. Edge biomarkers for classification and prediction of phenotypes. Sci China Life Sci 2014; 57:1103-14. [PMID: 25326072 DOI: 10.1007/s11427-014-4757-4] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.7] [Received: 06/09/2014] [Accepted: 08/07/2014] [Indexed: 12/19/2022]
Abstract
In general, a disease manifests not from malfunction of individual molecules but from failure of the relevant system or network, which can be considered as a set of interactions or edges among molecules. Thus, instead of individual molecules, networks or edges are stable forms to reliably characterize complex diseases. This paper reviews both traditional node biomarkers and edge biomarkers, which have been newly proposed. These biomarkers are classified in terms of their contained information. In particular, we show that edge and network biomarkers provide novel ways of stably and reliably diagnosing the disease state of a sample. First, we categorize the biomarkers based on the information used in the learning and prediction steps. We then briefly introduce conventional node biomarkers, or molecular biomarkers without network information, and their computational approaches. The main focus of this paper is edge and network biomarkers, which exploit network information to improve the accuracy of diagnosis and prognosis. Moreover, by extracting both network and dynamic information from the data, we can develop dynamical network and edge biomarkers. These biomarkers not only diagnose the immediate pre-disease state but also detect the critical molecules or networks by which the biological system progresses from the healthy to the disease state. The identified critical molecules can be used as drug targets, and the critical state indicates the critical point of disease control. The paper also discusses representative biomarker-based methods.
Affiliation(s)
- Tao Zeng
  - Key Laboratory of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, 200031, China
11. Fang S, Fang X, Xiong M. Psoriasis prediction from genome-wide SNP profiles. BMC Dermatol 2011; 11:1. [PMID: 21214922 PMCID: PMC3022824 DOI: 10.1186/1471-5945-11-1] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Received: 04/06/2010] [Accepted: 01/07/2011] [Indexed: 11/10/2022]
Abstract
BACKGROUND With the availability of large-scale genome-wide association study (GWAS) data, choosing an optimal set of SNPs for disease susceptibility prediction is a challenging task. This study aimed to use single nucleotide polymorphisms (SNPs) to predict psoriasis by searching GWAS data. METHODS In total, we had 2,798 samples and 451,724 SNPs. The process of searching for a set of SNPs to predict susceptibility to psoriasis consisted of two steps. The first was to search for the top 1,000 SNPs with high accuracy for prediction of psoriasis from the GWAS dataset. The second was to search for an optimal SNP subset for predicting psoriasis. The sequential information bottleneck (sIB) method was compared with classical linear discriminant analysis (LDA) for classification performance. RESULTS The best test harmonic mean of sensitivity and specificity for predicting psoriasis by sIB was 0.674 (95% CI: 0.650-0.698), while only 0.520 (95% CI: 0.472-0.524) was reported for predicting disease by LDA. Our results indicate that the new classifier sIB performs better than LDA in the study. CONCLUSIONS The fact that a small set of SNPs can predict disease status with average accuracy of 68% makes it possible to use SNP data for psoriasis prediction.
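The performance measure reported above, the harmonic mean of sensitivity and specificity, can be computed from confusion-matrix counts as in the short sketch below. The function name and the example counts are hypothetical and are not taken from the study.

```python
def sens_spec_harmonic_mean(tp, fn, tn, fp):
    """Harmonic mean of sensitivity and specificity from confusion-matrix counts."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    sensitivity = tp / (tp + fn)  # fraction of true cases detected
    specificity = tn / (tn + fp)  # fraction of controls correctly rejected
    if sensitivity + specificity == 0:
        return 0.0
    return 2 * sensitivity * specificity / (sensitivity + specificity)


# Hypothetical counts: sensitivity 0.8, specificity 0.6
score = sens_spec_harmonic_mean(tp=8, fn=2, tn=6, fp=4)
```

Unlike plain accuracy, this measure stays low when either sensitivity or specificity is low, which matters when cases and controls are unbalanced.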
Affiliation(s)
- Shenying Fang
  - Department of Epidemiology, The University of Texas M. D. Anderson Cancer Center, Houston, Texas 77030, USA
- Xiangzhong Fang
  - Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, Houston, Texas 77030, USA
  - School of Mathematical Sciences, Peking University, Beijing 100871, P.R. China
- Momiao Xiong
  - Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, Houston, Texas 77030, USA
12. Dawany NB, Tozeren A. Asymmetric microarray data produces gene lists highly predictive of research literature on multiple cancer types. BMC Bioinformatics 2010; 11:483. [PMID: 20875095 PMCID: PMC2949900 DOI: 10.1186/1471-2105-11-483] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Received: 03/15/2010] [Accepted: 09/27/2010] [Indexed: 12/02/2022] Open
Abstract
Background Much of the public access cancer microarray data is asymmetric, belonging to datasets containing no samples from normal tissue. Asymmetric data cannot be used in standard meta-analysis approaches (such as the inverse variance method) to obtain large sample sizes for statistical power enrichment. Noting that plenty of normal tissue microarray samples exist in studies not involving cancer, we investigated the viability and accuracy of an integrated microarray analysis approach based on significance analysis of microarrays (merged SAM) using a collection of data from separate diseased and normal samples. Results We focused on five solid cancer types (colon, kidney, liver, lung, and pancreas), where available microarray data allowed us to compare meta-analysis and integrated approaches. Our results from the merged SAM significantly overlapped gene lists from the validated inverse-variance method. Both meta-analysis and merged SAM approaches successfully captured the aberrances in the cell cycle that commonly occur in the different cancer types. However, the integrated SAM analysis replicated the known cancer literature (excluding microarray studies) with much more accuracy than the meta-analysis. Conclusion The merged SAM test is a powerful, robust approach for combining data from similar platforms and for analyzing asymmetric datasets, including those with only normal or only cancer samples that cannot be utilized by meta-analysis methods. The integrated SAM approach can also be used in comparing global gene expression between various subtypes of cancer arising from the same tissue.
Affiliation(s)
- Noor B Dawany
  - Center for Integrated Bioinformatics, Drexel University, Bossone Research Building 711, 3102 Market Street, Philadelphia, PA 19104, USA
13. Liu Y, Tozeren A. Modular composition predicts kinase/substrate interactions. BMC Bioinformatics 2010; 11:349. [PMID: 20579376 PMCID: PMC2912303 DOI: 10.1186/1471-2105-11-349] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 12/04/2009] [Accepted: 06/25/2010] [Indexed: 01/14/2023] Open
Abstract
Background Phosphorylation events direct the flow of signals and metabolites along cellular protein networks. Current annotations of kinase-substrate binding events are far from complete. In this study, we scanned the entire set of human protein sequences using the PROSITE domain annotation tool to identify patterns of domain composition in kinases and their substrates. We identified statistically enriched pairs of strings of domains (signature pairs) in kinase-substrate couples presented in the 2006 version of the PTM database. Results The signature pairs enriched in kinase-substrate binding interactions turned out to be highly specific to kinase subtypes. The resulting list of signature pairs predicted kinase-substrate interactions with high statistical accuracy in a validation dataset not used in learning. Conclusions The method presented here produces predictions of protein phosphorylation events with high accuracy and mid-level coverage. Our method can be used in expanding the currently available drafts of cell signaling pathways and thus will be an important tool in the development of combination drug therapies targeting complex diseases.
Affiliation(s)
- Yichuan Liu
- Center for Integrated Bioinformatics, Drexel University, Philadelphia, PA 19104, USA
|
14
|
Blazadonakis ME, Zervakis ME. Comparison and unification of genomic signatures in breast cancer. Annual International Conference of the IEEE Engineering in Medicine and Biology Society 2010; 2009:3869-72. [PMID: 19963602 DOI: 10.1109/iembs.2009.5332633] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/05/2022]
Abstract
The concept of deriving a gene signature in breast cancer has been addressed by different research groups, each one proposing a different solution with minor overlap among them. Unifying results among different research groups remains an open issue. In this study we evaluate two published signatures, namely the 70-gene signature of the Netherlands group and a 57-gene signature published in our previous study, and propose an evaluation platform under which the underlined signatures could be compared effectively. After such an evaluation, we proceed with a unified signature and assess its performance with improved efficiency over the initial signatures.
|
15
|
Zervakis M, Blazadonakis ME, Tsiliki G, Danilatou V, Tsiknakis M, Kafetzopoulos D. Outcome prediction based on microarray analysis: a critical perspective on methods. BMC Bioinformatics 2009; 10:53. [PMID: 19200394 PMCID: PMC2667512 DOI: 10.1186/1471-2105-10-53] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.6] [Received: 08/04/2008] [Accepted: 02/07/2009] [Indexed: 11/26/2022] Open
Abstract
Background: Information extraction from microarrays has not yet been widely used in diagnostic or prognostic decision-support systems, due to the diversity of results produced by the available techniques, their instability on different data sets and the inability to relate statistical significance with biological relevance. Thus, there is an urgent need to address the statistical framework of microarray analysis and identify its drawbacks and limitations, which will enable us to thoroughly compare methodologies under the same experimental set-up and associate results with confidence intervals meaningful to clinicians. In this study we consider gene-selection algorithms with the aim to reveal inefficiencies in performance evaluation and address aspects that can reduce uncertainty in algorithmic validation.
Results: A computational study is performed related to the performance of several gene selection methodologies on publicly available microarray data. Three basic types of experimental scenarios are evaluated, i.e. the independent test-set and the 10-fold cross-validation (CV) using maximum and average performance measures. Feature selection methods behave differently under different validation strategies. The performance results from CV do not match those from the independent test-set well, except for the support vector machines (SVM) and the least squares SVM methods. However, these wrapper methods achieve variable (often low) performance, whereas the hybrid methods attain consistently higher accuracies. The use of an independent test-set within CV is important for the evaluation of the predictive power of algorithms. The optimal size of the selected gene-set also appears to be dependent on the evaluation scheme. The consistency of selected genes over variation of the training-set is another aspect important in reducing uncertainty in the evaluation of the derived gene signature. In all cases the presence of outlier samples can seriously affect algorithmic performance.
Conclusion: Multiple parameters can influence the selection of a gene-signature and its predictive power, thus possible biases in validation methods must always be accounted for. This paper illustrates that independent test-set evaluation reduces the bias of CV, and case-specific measures reveal stability characteristics of the gene-signature over changes of the training set. Moreover, frequency measures on gene selection address the algorithmic consistency in selecting the same gene signature under different training conditions. These issues contribute to the development of an objective evaluation framework and aid the derivation of statistically consistent gene signatures that could eventually be correlated with biological relevance. The benefits of the proposed framework are supported by the evaluation results and methodological comparisons performed for several gene-selection algorithms on three publicly available datasets.
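The idea of a frequency measure on gene selection can be sketched as follows. This is only an illustration under stated assumptions, not the method evaluated in the paper: the selector (top-k genes by absolute difference in class means), the deterministic fold assignment, and both function names are hypothetical choices made to keep the example short.

```python
from collections import Counter


def top_k_by_mean_difference(rows, labels, k):
    """Select the k features with the largest absolute difference in class means.

    A deliberately simple stand-in for a real gene-selection method.
    Assumes binary labels (0/1) and that both classes are present in `rows`.
    """
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        pos = [r[j] for r, y in zip(rows, labels) if y == 1]
        neg = [r[j] for r, y in zip(rows, labels) if y == 0]
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return set(ranked[:k])


def selection_frequency(rows, labels, k, n_folds):
    """Fraction of training folds in which each feature is selected.

    Sample i is held out of fold (i % n_folds); selection is rerun on the
    remaining samples. A frequency near 1.0 means the feature is chosen
    regardless of which samples are left out of training.
    """
    counts = Counter()
    n_samples = len(rows)
    for fold in range(n_folds):
        train = [i for i in range(n_samples) if i % n_folds != fold]
        train_rows = [rows[i] for i in train]
        train_labels = [labels[i] for i in train]
        counts.update(top_k_by_mean_difference(train_rows, train_labels, k))
    return {j: counts[j] / n_folds for j in range(len(rows[0]))}
```

Features whose frequency stays high across folds form the stable part of a signature, while features selected in only a few folds depend on the particular training samples.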
Affiliation(s)
- Michalis Zervakis
- Technical University of Crete, Department of Electronic and Computer Engineering, University Campus, Chania, Crete, Greece.
|
16
|
Meta-analysis of gene expression profiles related to relapse-free survival in 1,079 breast cancer patients. Breast Cancer Res Treat 2008; 118:433-41. [PMID: 19052860 DOI: 10.1007/s10549-008-0242-8] [Citation(s) in RCA: 73] [Impact Index Per Article: 4.3] [Received: 08/08/2008] [Accepted: 10/28/2008] [Indexed: 10/21/2022]
Abstract
The transcriptome of breast cancers has been extensively screened with microarrays and large sets of genes associated with clinical features have been established. The aim of this study was to validate original gene sets on a large cohort of raw breast cancer microarray data with known clinical follow-up. We recovered 20 publications and matched them to Affymetrix HGU133A annotations. Raw Affymetrix HGU133A microarray data were extracted from GEO and MAS5 normalized. For classifying patients using the selected gene sets, we applied prediction analysis of microarrays and constructed Kaplan-Meier plots. A new classification including all patients was generated using supervised principal components analysis. Seven studies including 1,470 patients were downloaded from GEO. Notably, we uncovered 641 microarrays representing 251 individual tumor specimens among them, which were repeatedly described under independent GEO identifiers. We excluded all redundant data and used the remaining 1,079 samples. Eight of the 20 gene sets were able to predict response at a significance of P < 0.05. The discrimination of good and poor prognosis groups exclusively relying on gene expression data resulted in high significance (P = 1.8E-12). A model including genes fitted by both gene expression and clinical covariates (lymph node status and grade) contains 44 genes and can predict response at P = 9.5E-7. The outcome provides a ranking of the gene lists regarding applicability on an independent dataset. We established a consensus predictor combining the available clinical and gene expression data. The database comprising expression profiles of 1,079 breast cancers can be used to classify individual patients.
|
17
|
Sivaraksa M, Lowe D. Predictive gene lists for breast cancer prognosis: a topographic visualisation study. BMC Med Genomics 2008; 1:8. [PMID: 18419801 PMCID: PMC2375896 DOI: 10.1186/1755-8794-1-8] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 11/26/2007] [Accepted: 04/17/2008] [Indexed: 11/10/2022] Open
Abstract
Background: The controversy surrounding the non-uniqueness of predictive gene lists (PGL) of small selected subsets of genes from very large potential candidates as available in DNA microarray experiments is now widely acknowledged [1]. Many of these studies have focused on constructing discriminative semi-parametric models and as such are also subject to the issue of random correlations of sparse model selection in high dimensional spaces. In this work we outline a different approach based around an unsupervised patient-specific nonlinear topographic projection in predictive gene lists.
Methods: We construct nonlinear topographic projection maps based on inter-patient gene-list relative dissimilarities. The Neuroscale, the Stochastic Neighbor Embedding (SNE) and the Locally Linear Embedding (LLE) techniques have been used to construct two-dimensional projective visualisation plots of 70-dimensional PGLs per patient. Classifiers are also constructed to identify the prognosis indicator of each patient using the resulting projections from those visualisation techniques, and to investigate whether, a posteriori, two prognosis groups are separable on the evidence of the gene lists. A literature-proposed predictive gene list for breast cancer is benchmarked against a separate gene list using the above methods. Generalisation ability is investigated by using the mapping capability of Neuroscale to visualise the follow-up study, but based on the projections derived from the original dataset.
Results: The results indicate that small subsets of patient-specific PGLs have insufficient prognostic dissimilarity to permit a distinction between two prognosis patients. Uncertainty and diversity across multiple gene expressions prevent unambiguous or even confident patient grouping. Comparative projections across different PGLs provide similar results.
Conclusion: The random correlation effect to an arbitrary outcome induced by small subset selection from very high dimensional interrelated gene expression profiles leads to an outcome with associated uncertainty. This continuum and uncertainty preclude any attempts at constructing discriminative classifiers. However, a patient's gene expression profile could possibly be used in treatment planning, based on knowledge of other patients' responses. We conclude that many of the patients involved in such medical studies are intrinsically unclassifiable on the basis of provided PGL evidence. This additional category of 'unclassifiable' should be accommodated within medical decision support systems if serious errors and unnecessary adjuvant therapy are to be avoided.
|