1
Bouwmeester R, Richardson K, Denny R, Wilson ID, Degroeve S, Martens L, Vissers JPC. Predicting ion mobility collision cross sections and assessing prediction variation by combining conventional and data driven modeling. Talanta 2024; 274:125970. [PMID: 38621320 DOI: 10.1016/j.talanta.2024.125970] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/02/2023] [Revised: 03/01/2024] [Accepted: 03/20/2024] [Indexed: 04/17/2024]
Abstract
The use of collision cross section (CCS) values derived from ion mobility studies is proving to be an increasingly important tool in the characterization and identification of molecules detected in complex mixtures. Here, a novel machine learning (ML) based method for predicting CCS integrating both molecular modeling (MM) and ML methodologies has been devised and shown to be able to accurately predict CCS values for singly charged small molecular weight molecules from a broad range of chemical classes. The model performed favorably compared to existing models, improving compound identifications for isobaric analytes in terms of ranking and assigning identification probability values to the annotation. Furthermore, charge localization was seen to be correlated with CCS prediction accuracy and with gas-phase proton affinity demonstrating the potential to provide a proxy for prediction error based on chemical structural properties. The presented approach and findings represent a further step towards accurate prediction and application of computationally generated CCS values.
Affiliation(s)
- Robbin Bouwmeester
- VIB-UGent Center for Medical Biotechnology, Ghent, Belgium; Department of Biomolecular Medicine, Ghent University, Ghent, Belgium.
- Ian D Wilson
- Computational & Systems Medicine, Department of Metabolism, Digestion and Reproduction, Imperial College, United Kingdom
- Sven Degroeve
- VIB-UGent Center for Medical Biotechnology, Ghent, Belgium; Department of Biomolecular Medicine, Ghent University, Ghent, Belgium
- Lennart Martens
- VIB-UGent Center for Medical Biotechnology, Ghent, Belgium; Department of Biomolecular Medicine, Ghent University, Ghent, Belgium
2
Rentroia-Pacheco B, Tjien-Fooh FJ, Quattrocchi E, Kobic A, Wever R, Bellomo D, Meves A, Hieken TJ. Clinicopathologic models predicting non-sentinel lymph node metastasis in cutaneous melanoma patients: Are they useful for patients with a single positive sentinel node? J Surg Oncol 2021; 125:516-524. [PMID: 34735719 PMCID: PMC8799494 DOI: 10.1002/jso.26736] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/02/2021] [Revised: 10/20/2021] [Accepted: 10/27/2021] [Indexed: 12/03/2022]
Abstract
Background and Objectives: Of clinically node‐negative (cN0) cutaneous melanoma patients with sentinel lymph node (SLN) metastasis, between 10% and 30% harbor additional metastases in non‐sentinel lymph nodes (NSLNs). Approximately 80% of SLN‐positive patients have a single positive SLN. Methods: To assess whether state‐of‐the‐art clinicopathologic models predicting NSLN metastasis had adequate performance, we studied a single‐institution cohort of 143 patients with cN0 SLN‐positive primary melanoma who underwent subsequent completion lymph node dissection. We used sensitivity (SE) and positive predictive value (PPV) to characterize the ability of the models to identify patients at high risk for NSLN disease. Results: Across Stage III patients, all clinicopathologic models tested had comparable performances. The best performing model identified 52% of NSLN‐positive patients (SE = 52%, PPV = 37%). However, for the single SLN‐positive subgroup (78% of cohort), none of the models identified high‐risk patients (SE > 20%, PPV > 20%) irrespective of the chosen probability threshold used to define the binary risk labels. Thus, we designed a new model to identify high‐risk patients with a single positive SLN, which achieved a sensitivity of 49% (PPV = 26%). Conclusion: For the largest SLN‐positive subgroup, those with a single positive SLN, current model performance is inadequate. New approaches are needed to better estimate nodal disease burden of these patients.
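The two metrics named in this abstract, sensitivity (SE) and positive predictive value (PPV), follow directly from counts of true positives, false negatives, and false positives. The sketch below is only an illustration of those standard definitions; the function name and the toy labels are hypothetical and are not taken from the study.

```python
def sensitivity_and_ppv(y_true, y_pred):
    """Return (SE, PPV) for binary labels, where 1 marks a positive case.

    SE  = TP / (TP + FN): share of truly positive cases flagged as high risk.
    PPV = TP / (TP + FP): share of flagged cases that are truly positive.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # Guard against empty denominators (no positives, or nothing flagged).
    se = tp / (tp + fn) if (tp + fn) else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return se, ppv


# Toy labels (hypothetical): 3 positive cases, 3 flagged as high risk.
se, ppv = sensitivity_and_ppv([1, 1, 0, 0, 1, 0], [1, 0, 1, 0, 1, 0])
print(f"SE = {se:.2f}, PPV = {ppv:.2f}")
```

Because both values depend on the probability threshold used to turn a model score into a binary risk label, they are usually reported together, as in the abstract.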
Affiliation(s)
- Ajdin Kobic
- Department of Dermatology, Mayo Clinic, Rochester, Minnesota, USA
- Renske Wever
- Division of Bioinformatics, SkylineDx B.V., Rotterdam, The Netherlands
- Domenico Bellomo
- Division of Bioinformatics, SkylineDx B.V., Rotterdam, The Netherlands
- Alexander Meves
- Department of Dermatology, Mayo Clinic, Rochester, Minnesota, USA; Department of Biochemistry and Molecular Biology, Mayo Clinic, Rochester, Minnesota, USA
- Tina J Hieken
- Division of Breast and Melanoma Surgical Oncology, Department of Surgery, Mayo Clinic, Rochester, Minnesota, USA
3
Gumaei A, Sammouda R, Al-Rakhami M, AlSalman H, El-Zaart A. Feature selection with ensemble learning for prostate cancer diagnosis from microarray gene expression. Health Informatics J 2021; 27:1460458221989402. [PMID: 33570011 DOI: 10.1177/1460458221989402] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Indexed: 12/15/2022]
Abstract
Cancer diagnosis using machine learning algorithms is one of the main topics of research in computer-based medical science. Prostate cancer is considered one of the leading causes of death worldwide. Data analysis of gene expression from microarrays using machine learning and soft computing algorithms is a useful tool for detecting prostate cancer in medical diagnosis. Even though traditional machine learning methods have been successfully applied to detecting prostate cancer, the large number of attributes combined with the small sample size of microarray data remains a challenge that limits their effectiveness in medical diagnosis. Selecting a subset of relevant features from all features and choosing an appropriate machine learning method can exploit the information in microarray data to improve the detection accuracy. In this paper, we propose to use a correlation feature selection (CFS) method with random committee (RC) ensemble learning to detect prostate cancer from microarray gene expression data. A set of experiments is conducted on a public benchmark dataset using a 10-fold cross-validation technique to evaluate the proposed approach. The experimental results revealed that the proposed approach attains 95.098% accuracy, which is higher than that of methods reported in related work on the same dataset.
Affiliation(s)
- Abdu Gumaei
- Research Chair of Pervasive and Mobile Computing, King Saud University, Saudi Arabia.,Taiz University, Yemen
- Mabrook Al-Rakhami
- Research Chair of Pervasive and Mobile Computing, King Saud University, Saudi Arabia
4
Eggermont AMM, Bellomo D, Arias-Mejias SM, Quattrocchi E, Sominidi-Damodaran S, Bridges AG, Lehman JS, Hieken TJ, Jakub JW, Murphree DH, Pittelkow MR, Sluzevich JC, Cappel MA, Bagaria SP, Perniciaro C, Tjien-Fooh FJ, Rentroia-Pacheco B, Wever R, van Vliet MH, Dwarkasing J, Meves A. Identification of stage I/IIA melanoma patients at high risk for disease relapse using a clinicopathologic and gene expression model. Eur J Cancer 2020; 140:11-18. [PMID: 33032086 PMCID: PMC7655519 DOI: 10.1016/j.ejca.2020.08.029] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Received: 04/16/2020] [Revised: 07/09/2020] [Accepted: 08/16/2020] [Indexed: 12/25/2022]
Abstract
PURPOSE: Patients with stage I/IIA cutaneous melanoma (CM) are currently not eligible for adjuvant therapies despite uncertainty in relapse risk. Here, we studied the ability of a recently developed model which combines clinicopathologic and gene expression variables (CP-GEP) to identify stage I/IIA melanoma patients who have a high risk for disease relapse. PATIENTS AND METHODS: Archival specimens from a cohort of 837 consecutive primary CMs were used for assessing the prognostic performance of CP-GEP. The CP-GEP model combines Breslow thickness and patient age, with the expression of eight genes in the primary tumour. Our specific patient group, represented by 580 stage I/IIA patients, was stratified based on their risk of relapse: CP-GEP High Risk and CP-GEP Low Risk. The main clinical end-point of this study was five-year relapse-free survival (RFS). RESULTS: Within the stage I/IIA melanoma group, CP-GEP identified a high-risk patient group (47% of total stage I/IIA patients) which had a considerably worse five-year RFS than the low-risk patient group; 74% (95% confidence interval [CI]: 67%-80%) versus 89% (95% CI: 84%-93%); hazard ratio [HR] = 2.98 (95% CI: 1.78-4.98); P < 0.0001. Of patients in the high-risk group, those who relapsed were most likely to do so within the first 3 years. CONCLUSION: The CP-GEP model can be used to identify stage I/IIA patients who have a high risk for disease relapse. These patients may benefit from adjuvant therapy.
Affiliation(s)
- Mark A Cappel
- Mayo Clinic, Jacksonville, FL, USA; Gulf Coast Dermatopathology Laboratory, Tampa, FL, USA
5
Bellomo D, Bridges AG, Hieken TJ, Meves A. Reply to E. K. Bartlett et al and A. H. R. Varey et al. JCO Precis Oncol 2020; 4:992-994. [PMID: 32914042 PMCID: PMC7480899 DOI: 10.1200/po.20.00289] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Accepted: 07/26/2020] [Indexed: 11/20/2022] Open
Affiliation(s)
- Alina G Bridges
- Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN, USA
- Department of Dermatology, Mayo Clinic, Rochester, MN, USA
- Tina J Hieken
- Department of Surgery, Mayo Clinic, Rochester, MN, USA
6
Profiling of the known-unknown Passiflora variant complement by liquid chromatography - Ion mobility - Mass spectrometry. Talanta 2020; 221:121311. [PMID: 33076047 DOI: 10.1016/j.talanta.2020.121311] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 04/09/2020] [Revised: 06/16/2020] [Accepted: 06/18/2020] [Indexed: 01/01/2023]
Abstract
Liquid Chromatography - Ion Mobility - Mass Spectrometry (LC-IM-MS) was utilized for non-targeted screening analysis to understand the variance in the composition of Passiflora species. Multivariate analysis was employed to explore a chemometric processing strategy for IM based Passiflora variant differentiation. This approach was applied to the comparative analyses of extracts of the medicinal plants Passiflora alata, Passiflora edulis, Passiflora incarnata and Passiflora caerulea. In total, 255 occurrences of IM-MS resolved coeluting marker isomers and isobaric species were detected, providing increased coverage and specificity of species component markers compared to conventional LC-MS. A large proportion of medicinal plant phytochemical analysis information often remains redundant in that it is not phenotypic specific. Here, generation of Passiflora variant 'known-unknown' libraries has been used to compare Passiflora species to investigate unique variant features. Investigations of predicted collision cross section have enabled comparison of an element of the 'known-unknown' IM isomeric complement to be performed, facilitating a reduction in the number of possible variant unique isomeric identifications. In combination with spectral interpretation, it has been possible to reassign isomeric 'known-unknowns' as 'knowns'. The strategies employed illustrate the potential to facilitate identification of medicinal plant phytochemical components.
7
Liu XY, Wang S, Zhang H, Zhang H, Yang ZY, Liang Y. Novel Regularization Method for Biomarker Selection and Cancer Classification. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:1329-1340. [PMID: 30716046 DOI: 10.1109/tcbb.2019.2897301] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 06/09/2023]
Abstract
Variable selection has attracted increasing attention in the big data and machine learning fields. In high dimensional data analysis, many relevant variables or variable groups are widely found. For example, more attention is being paid to biological pathways and regulatory networks in microarray gene expression data. In recent years, regularization methods have become commonly used approaches for variable selection. Existing regularization methods generally use an L2 penalty to evaluate the grouping effect and an Lq penalty with a fixed value of q to evaluate the variable sparsity, respectively. These methods typically produce a good performance with high efficiency, but they often require the data to satisfy a certain probability distribution. In this paper, we propose a novel complex harmonic regularization (CHR) penalty function, which can approximate the combination of Lp and Lq regularizations with adjustable p and q to select the groups of the relevant variables. The CHR penalty function can be effectively solved by a direct path seeking algorithm. We demonstrate that the proposed CHR penalty function performs better than the state-of-the-art regularization methods in selecting groups of relevant variables and classification.
8
Bellomo D, Arias-Mejias SM, Ramana C, Heim JB, Quattrocchi E, Sominidi-Damodaran S, Bridges AG, Lehman JS, Hieken TJ, Jakub JW, Pittelkow MR, DiCaudo DJ, Pockaj BA, Sluzevich JC, Cappel MA, Bagaria SP, Perniciaro C, Tjien-Fooh FJ, van Vliet MH, Dwarkasing J, Meves A. Model Combining Tumor Molecular and Clinicopathologic Risk Factors Predicts Sentinel Lymph Node Metastasis in Primary Cutaneous Melanoma. JCO Precis Oncol 2020; 4:319-334. [PMID: 32405608 PMCID: PMC7220172 DOI: 10.1200/po.19.00206] [Citation(s) in RCA: 63] [Impact Index Per Article: 15.8] [Indexed: 01/01/2023] Open
Abstract
Purpose: More than 80% of patients who undergo sentinel lymph node (SLN) biopsy have no nodal metastasis. Here we describe a model that combines clinicopathologic and molecular variables to identify patients with thin and intermediate thickness melanomas who may forgo the SLN biopsy procedure due to their low risk of nodal metastasis. Patients and Methods: Genes with functional roles in melanoma metastasis were discovered by analysis of next generation sequencing data and case control studies. We then used PCR to quantify gene expression in diagnostic biopsy tissue across a prospectively designed archival cohort of 754 consecutive thin and intermediate thickness primary cutaneous melanomas. Outcome of interest was SLN biopsy metastasis within 90 days of melanoma diagnosis. A penalized maximum likelihood estimation algorithm was used to train logistic regression models in a repeated cross validation scheme to predict the presence of SLN metastasis from molecular, clinical and histologic variables. Results: Expression of genes with roles in epithelial-to-mesenchymal transition (glia derived nexin, growth differentiation factor 15, integrin β3, interleukin 8, lysyl oxidase homolog 4, TGFβ receptor type 1 and tissue-type plasminogen activator) and melanosome function (melanoma antigen recognized by T cells 1) were associated with SLN metastasis. The predictive ability of a model that only considered clinicopathologic or gene expression variables was outperformed by a model which included molecular variables in combination with the clinicopathologic predictors Breslow thickness and patient age; AUC, 0.82; 95% CI, 0.78-0.86; SLN biopsy reduction rate of 42% at a negative predictive value of 96%. Conclusion: A combined model including clinicopathologic and gene expression variables improved the identification of melanoma patients who may forgo the SLN biopsy procedure due to their low risk of nodal metastasis.
Affiliation(s)
- Mark A Cappel
- Mayo Clinic, Jacksonville, FL, USA; Gulf Coast Dermatopathology Laboratory, Tampa, FL, USA
9
Rodrigues V, Deusdado S. Deterministic Classifiers Accuracy Optimization for Cancer Microarray Data. PRACTICAL APPLICATIONS OF COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, 13TH INTERNATIONAL CONFERENCE 2020. [DOI: 10.1007/978-3-030-23873-5_19] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/22/2022]
10
Nye LC, Williams JP, Munjoma NC, Letertre MP, Coen M, Bouwmeester R, Martens L, Swann JR, Nicholson JK, Plumb RS, McCullagh M, Gethings LA, Lai S, Langridge JI, Vissers JP, Wilson ID. A comparison of collision cross section values obtained via travelling wave ion mobility-mass spectrometry and ultra high performance liquid chromatography-ion mobility-mass spectrometry: Application to the characterisation of metabolites in rat urine. J Chromatogr A 2019; 1602:386-396. [DOI: 10.1016/j.chroma.2019.06.056] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 04/29/2019] [Revised: 06/24/2019] [Accepted: 06/26/2019] [Indexed: 01/01/2023]
11
Allahyar A, Ubels J, de Ridder J. A data-driven interactome of synergistic genes improves network-based cancer outcome prediction. PLoS Comput Biol 2019; 15:e1006657. [PMID: 30726216 PMCID: PMC6380593 DOI: 10.1371/journal.pcbi.1006657] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 05/13/2018] [Revised: 02/19/2019] [Accepted: 11/20/2018] [Indexed: 12/13/2022] Open
Abstract
Robustly predicting outcome for cancer patients from gene expression is an important challenge on the road to better personalized treatment. Network-based outcome predictors (NOPs), which consider the cellular wiring diagram in the classification, hold much promise to improve performance, stability and interpretability of identified marker genes. Problematically, reports on the efficacy of NOPs are conflicting and, for instance, suggest that utilizing random networks performs on par with networks that describe biologically relevant interactions. In this paper we turn the prediction problem around: instead of using a given biological network in the NOP, we aim to identify the network of genes that truly improves outcome prediction. To this end, we propose SyNet, a gene network constructed ab initio from synergistic gene pairs derived from survival-labelled gene expression data. To obtain SyNet, we evaluate synergy for all 69 million pairwise combinations of genes, resulting in a network that is specific to the dataset and phenotype under study and can be used in a NOP model. We evaluated SyNet and 11 other networks on a compendium dataset of >4000 survival-labelled breast cancer samples. For this purpose, we used cross-study validation, which more closely emulates real world application of these outcome predictors. We find that SyNet is the only network that truly improves performance, stability and interpretability in several existing NOPs. We show that SyNet overlaps significantly with existing gene networks, and can be confidently predicted (~85% AUC) from graph-topological descriptions of these networks, in particular the breast tissue-specific network. Due to its data-driven nature, SyNet is not biased to well-studied genes and thus facilitates post-hoc interpretation. We find that SyNet is highly enriched for known breast cancer genes and genes related to e.g. histological grade and tamoxifen resistance, suggestive of a role in determining breast cancer outcome.
Affiliation(s)
- Amin Allahyar
- Department of Genetics, Center for Molecular Medicine, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands
- Delft Bioinformatics Lab, Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, Delft, The Netherlands
- Joske Ubels
- Department of Genetics, Center for Molecular Medicine, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands
- Skyline DX, Rotterdam
- Department of Hematology, Erasmus MC Cancer Institute, Rotterdam
- Jeroen de Ridder
- Department of Genetics, Center for Molecular Medicine, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands
12
Yan S, Zhang L, Song C. Applying a new maximum local asymmetry feature analysis method to improve near-term breast cancer risk prediction. Phys Med Biol 2018; 63:205010. [PMID: 30255850 DOI: 10.1088/1361-6560/aae452] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/11/2022]
Abstract
Quantitative assessment of mammographic asymmetry has been investigated for breast cancer risk prediction. A new asymmetry feature extraction method was proposed in this study to enhance the risk prediction accuracy of near-term breast cancer. Breast areas in each pair of bilateral mammographic images were divided into several pairs of matched local annular regions and the maximum local asymmetry features (MLAF) were extracted from these regions. Radial basis function network (RBFN) was used to merge these features for breast cancer risk prediction. The dataset included 560 negative subjects. The risk prediction performance was tested using a leave-one-case-out (LOCO) cross-validation method. Area under the receiver operating characteristic curve (AUC) was used as the risk prediction performance evaluation index. AUC = 0.898 ± 0.013 was obtained by using the MLAFs extracted from the annular regions, which was significantly higher than the AUC value of 0.505 ± 0.025 achieved by using global asymmetry features computed from the whole breast regions (p < 0.05, DeLong's test) and much higher than the AUC values of 0.825 ± 0.017 and 0.717 ± 0.021 achieved by using MLAFs extracted from horizontal strip regions and vertical strip regions. The study demonstrated that near-term breast cancer risk prediction could be improved by using the proposed feature extraction method.
Affiliation(s)
- Shiju Yan
- School of Medical Instrument and Food Engineering, University of Shanghai for Science and Technology, 516 Jungong Road, Shanghai 200093, People's Republic of China. Author to whom any correspondence should be addressed
13
Häberle L, Hack CC, Heusinger K, Wagner F, Jud SM, Uder M, Beckmann MW, Schulz-Wendtland R, Wittenberg T, Fasching PA. Using automated texture features to determine the probability for masking of a tumor on mammography, but not ultrasound. Eur J Med Res 2017; 22:30. [PMID: 28854966 PMCID: PMC5577694 DOI: 10.1186/s40001-017-0270-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 03/08/2017] [Accepted: 08/11/2017] [Indexed: 01/04/2023] Open
Abstract
BACKGROUND: Tumors in radiologically dense breasts were overlooked on mammograms more often than tumors in low-density breasts. A fast, reproducible and automated method of assessing percentage mammographic density (PMD) would be desirable to support decisions on whether ultrasonography should be provided for women in addition to mammography in diagnostic mammography units. PMD assessment has still not been included in clinical routine work, as there are issues of interobserver variability and the procedure is quite time consuming. This study investigated whether fully automatically generated texture features of mammograms can replace time-consuming semi-automatic PMD assessment to predict a patient's risk of having an invasive breast tumor that is visible on ultrasound but masked on mammography (mammography failure). METHODS: This observational study included 1334 women with invasive breast cancer treated at a hospital-based diagnostic mammography unit. Ultrasound was available for the entire cohort as part of routine diagnosis. Computer-based threshold PMD assessments ("observed PMD") were carried out and 363 texture features were obtained from each mammogram. Several variable selection and regression techniques (univariate selection, lasso, boosting, random forest) were applied to predict PMD from the texture features. The predicted PMD values were each used as a new predictor for masking in logistic regression models together with clinical predictors. These four logistic regression models with predicted PMD were compared among themselves and with a logistic regression model with observed PMD. The most accurate masking prediction was determined by cross-validation. RESULTS: About 120 of the 363 texture features were selected for predicting PMD. Density predictions with boosting were the best substitute for observed PMD to predict masking. Overall, the corresponding logistic regression model performed better (cross-validated AUC, 0.747) than one without mammographic density (0.734), but less well than the one with the observed PMD (0.753). However, in patients with an assigned mammography failure risk >10%, covering about half of all masked tumors, the boosting-based model performed at least as accurately as the original PMD model. CONCLUSION: Automatically generated texture features can replace semi-automatically determined PMD in a prediction model for mammography failure, such that more than 50% of masked tumors could be discovered.
Affiliation(s)
- Lothar Häberle
- University Breast Center for Franconia, Department of Gynecology and Obstetrics, Erlangen University Hospital, Friedrich Alexander University of Erlangen-Nuremberg, Comprehensive Cancer Center Erlangen-EMN, Erlangen, Germany; Biostatistics Unit, Department of Gynecology and Obstetrics, Erlangen University Hospital, Erlangen, Germany
- Carolin C Hack
- University Breast Center for Franconia, Department of Gynecology and Obstetrics, Erlangen University Hospital, Friedrich Alexander University of Erlangen-Nuremberg, Comprehensive Cancer Center Erlangen-EMN, Erlangen, Germany
- Katharina Heusinger
- University Breast Center for Franconia, Department of Gynecology and Obstetrics, Erlangen University Hospital, Friedrich Alexander University of Erlangen-Nuremberg, Comprehensive Cancer Center Erlangen-EMN, Erlangen, Germany
- Florian Wagner
- Fraunhofer Institute for Integrated Circuits IIS, Erlangen, Germany
- Sebastian M Jud
- University Breast Center for Franconia, Department of Gynecology and Obstetrics, Erlangen University Hospital, Friedrich Alexander University of Erlangen-Nuremberg, Comprehensive Cancer Center Erlangen-EMN, Erlangen, Germany
- Michael Uder
- University Breast Center for Franconia, Institute of Radiology, Comprehensive Cancer Center EMN, Erlangen University Hospital, Friedrich Alexander University of Erlangen-Nuremberg, Erlangen, Germany
- Matthias W Beckmann
- University Breast Center for Franconia, Department of Gynecology and Obstetrics, Erlangen University Hospital, Friedrich Alexander University of Erlangen-Nuremberg, Comprehensive Cancer Center Erlangen-EMN, Erlangen, Germany
- Rüdiger Schulz-Wendtland
- University Breast Center for Franconia, Institute of Radiology, Comprehensive Cancer Center EMN, Erlangen University Hospital, Friedrich Alexander University of Erlangen-Nuremberg, Erlangen, Germany
- Peter A Fasching
- University Breast Center for Franconia, Department of Gynecology and Obstetrics, Erlangen University Hospital, Friedrich Alexander University of Erlangen-Nuremberg, Comprehensive Cancer Center Erlangen-EMN, Erlangen, Germany.,Division Hematology/Oncology, Department of Medicine, David Geffen School of Medicine, University of California at Los Angeles, Los Angeles, CA, USA
14
Naue J, Hoefsloot HCJ, Mook ORF, Rijlaarsdam-Hoekstra L, van der Zwalm MCH, Henneman P, Kloosterman AD, Verschure PJ. Chronological age prediction based on DNA methylation: Massive parallel sequencing and random forest regression. Forensic Sci Int Genet 2017; 31:19-28. [PMID: 28841467 DOI: 10.1016/j.fsigen.2017.07.015] [Citation(s) in RCA: 105] [Impact Index Per Article: 15.0] [Received: 07/10/2017] [Revised: 07/26/2017] [Accepted: 07/30/2017] [Indexed: 01/24/2023]
Abstract
The use of DNA methylation (DNAm) to obtain additional information in forensic investigations has proven to be a promising field of growing interest. Prediction of the chronological age based on age-dependent changes in the DNAm of specific CpG sites within the genome is one such potential application. Here we present an age-prediction tool for whole blood based on massive parallel sequencing (MPS) and a random forest machine learning algorithm. MPS allows accurate DNAm determination of pre-selected markers and neighboring CpG-sites to identify the best age-predictive markers for the age-prediction tool. 15 age-dependent markers of different loci were initially chosen based on publicly available 450K microarray data, and 13 finally selected for the age tool based on MPS (DDO, ELOVL2, F5, GRM2, HOXC4, KLF14, LDB2, MEIS1-AS3, NKIRAS2, RPA2, SAMD10, TRIM59, ZYG11A). Whole blood samples of 208 individuals were used for training of the algorithm and a further 104 individuals were used for model evaluation (age 18-69). In the case of KLF14, LDB2, SAMD10, and GRM2, neighboring CpG sites and not the initial 450K sites were chosen for the final model. Cross-validation of the training set leads to a mean absolute deviation (MAD) of 3.21 years and a root-mean square error (RMSE) of 3.97 years. Evaluation of model performance using the test set showed a comparable result (MAD 3.16 years, RMSE 3.93 years). A reduced model based on only the top 4 markers (ELOVL2, F5, KLF14, and TRIM59) resulted in a RMSE of 4.19 years and MAD of 3.24 years for the test set (cross validation training set: RMSE 4.63 years, MAD 3.64 years). The amplified region was additionally investigated for occurrence of SNPs in case of an aberrant DNAm result, which in some cases can be an indication for a deviation in DNAm. Our approach uncovered well-known DNAm age-dependent markers, as well as additional new age-dependent sites for improvement of the model, and allowed the creation of a reliable and accurate epigenetic tool for age-prediction without restriction to a linear change in DNAm with age.
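The two error measures reported in this abstract, mean absolute deviation (MAD) and root-mean square error (RMSE), can be computed from paired true and predicted ages as sketched below. This is a minimal illustration of the standard formulas only; the function name and the example ages are hypothetical and do not reproduce the study's data.

```python
import math


def mad_and_rmse(true_ages, predicted_ages):
    """Return (MAD, RMSE) in the same unit as the inputs (here: years)."""
    if len(true_ages) != len(predicted_ages) or not true_ages:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [p - t for t, p in zip(true_ages, predicted_ages)]
    # MAD: average size of the error, ignoring its sign.
    mad = sum(abs(e) for e in errors) / len(errors)
    # RMSE: squares the errors first, so large misses weigh more heavily.
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mad, rmse


# Hypothetical example: three individuals, errors of +3, -4 and 0 years.
mad, rmse = mad_and_rmse([20, 30, 40], [23, 26, 40])
print(f"MAD = {mad:.2f} years, RMSE = {rmse:.2f} years")
```

RMSE is never smaller than MAD for the same predictions, which is consistent with the paired values quoted in the abstract.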
Affiliation(s)
- Jana Naue
- University of Amsterdam, Swammerdam Institute for Life Sciences, Science Park 904, 1098XH Amsterdam, The Netherlands.
- Huub C J Hoefsloot
- University of Amsterdam, Swammerdam Institute for Life Sciences, Science Park 904, 1098XH Amsterdam, The Netherlands
- Olaf R F Mook
- Amsterdam Medical Center, Clinical Genetics, Meibergdreef 9, 1105AZ, Amsterdam, The Netherlands
| | - Laura Rijlaarsdam-Hoekstra
- University of Amsterdam, Swammerdam Institute for Life Sciences, Science Park 904, 1098XH Amsterdam, The Netherlands
| | - Marloes C H van der Zwalm
- University of Amsterdam, Swammerdam Institute for Life Sciences, Science Park 904, 1098XH Amsterdam, The Netherlands
| | - Peter Henneman
- Amsterdam Medical Center, Clinical Genetics, Meibergdreef 9, 1105AZ, Amsterdam, The Netherlands
| | - Ate D Kloosterman
- Netherlands Forensic Institute, Biological Traces, Laan van Ypenburg 6, 2497GB Den Haag, The Netherlands; University of Amsterdam, Institute for Biodiversity and Dynamics, Science Park 904, 1098XH Amsterdam, The Netherlands
| | - Pernette J Verschure
- University of Amsterdam, Swammerdam Institute for Life Sciences, Science Park 904, 1098XH Amsterdam, The Netherlands.
|
15
|
Applying a new bilateral mammographic density segmentation method to improve accuracy of breast cancer risk prediction. Int J Comput Assist Radiol Surg 2017; 12:1819-1828. [PMID: 28726117 DOI: 10.1007/s11548-017-1648-8] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 01/21/2017] [Accepted: 07/12/2017] [Indexed: 10/19/2022]
Abstract
PURPOSE How to optimally detect bilateral mammographic asymmetry and improve risk prediction accuracy remains a difficult and unsolved issue. Our aim was to find an effective mammographic density segmentation method to improve accuracy of breast cancer risk prediction. METHODS A dataset including 168 negative mammography screening cases was used. We applied a mutual threshold to bilateral mammograms of left and right breasts to segment the dense breast regions. The mutual threshold was determined by the median grayscale value of all pixels in both left and right breast regions. For each case, we then computed three types of image features representing asymmetry, mean and the maximum of the image features, respectively. A two-stage classification scheme was developed to fuse the three types of features. The risk prediction performance was tested using a leave-one-case-out cross-validation method. RESULTS By using the new density segmentation method, the computed area under the receiver operating characteristic curve was 0.830 ± 0.033 and overall prediction accuracy was 81.0%, significantly higher than those of 0.633 ± 0.043 and 57.1% achieved by using the previous density segmentation method ([Formula: see text], t-test). CONCLUSIONS A new mammographic density segmentation method based on a bilateral mutual threshold can be used to more effectively detect bilateral mammographic density asymmetry and help significantly improve accuracy of near-term breast cancer risk prediction.
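The mutual-threshold step described in METHODS can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that pixels brighter than the shared median count as dense, and the asymmetry value computed at the end is a hypothetical feature chosen only to show how a shared threshold makes the two breasts comparable.

```python
from statistics import median


def mutual_threshold(left_pixels, right_pixels):
    """Median grayscale value over all pixels of both breast regions."""
    return median(list(left_pixels) + list(right_pixels))


def dense_fraction(pixels, threshold):
    """Fraction of pixels above the threshold (assumed definition of 'dense')."""
    dense = [value for value in pixels if value > threshold]
    return len(dense) / len(pixels)


# Hypothetical grayscale values for illustration only.
left = [10, 40, 80, 120, 200]
right = [20, 30, 60, 90, 100]

threshold = mutual_threshold(left, right)
left_fraction = dense_fraction(left, threshold)
right_fraction = dense_fraction(right, threshold)

# Hypothetical asymmetry feature: absolute difference of dense fractions.
asymmetry = abs(left_fraction - right_fraction)
print(threshold, left_fraction, right_fraction, round(asymmetry, 3))
```

Using one threshold for both images, rather than one per image, means a difference in dense fraction reflects a difference between the breasts and not a difference in thresholds.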
|
16
|
Häberle L, Hein A, Rübner M, Schneider M, Ekici AB, Gass P, Hartmann A, Schulz-Wendtland R, Beckmann MW, Lo WY, Schroth W, Brauch H, Fasching PA, Wunderle M. Predicting Triple-Negative Breast Cancer Subtype Using Multiple Single Nucleotide Polymorphisms for Breast Cancer Risk and Several Variable Selection Methods. Geburtshilfe Frauenheilkd 2017; 77:667-678. [PMID: 28757654 DOI: 10.1055/s-0043-111602] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Received: 04/07/2017] [Revised: 05/15/2017] [Accepted: 05/16/2017] [Indexed: 12/22/2022] Open
Abstract
INTRODUCTION Studies of triple-negative breast cancer have recently been extending the inclusion criteria and incorporating additional molecular markers into the selection criteria, opening up scope for targeted therapies. The screening phases required for studies of this type are often prolonged, since the process of determining the molecular subtype and carrying out additional biomarker assessment is time-consuming. Parameters such as germline genotypes capable of predicting the molecular subtype before it becomes available from pathology might be helpful for treatment planning and optimizing the timing and cost of screening phases. This appears to be feasible, as rapid and low-cost genotyping methods are becoming increasingly available. The aim of this study was to identify single nucleotide polymorphisms (SNPs) for breast cancer risk capable of predicting triple negativity, in addition to clinical predictors, in breast cancer patients. METHODS This cross-sectional observational study included 1271 women with invasive breast cancer who were treated at a university hospital. A total of 76 validated breast cancer risk SNPs were successfully genotyped. Univariate associations between each SNP and triple negativity were explored using logistic regression analyses. Several variable selection and regression techniques were applied to identify a set of SNPs that together improve the prediction of triple negativity in addition to the clinical predictors of age at diagnosis and body mass index (BMI). The most accurate prediction method was determined by cross-validation. RESULTS The SNP rs10069690 (TERT, CLPTM1L) was the only significant SNP (corrected p = 0.02) after correction of p values for multiple testing in the univariate analyses. This SNP and three additional SNPs from the genes RAD51B, CCND1, and FGFR2 were selected for prediction of triple negativity. 
The addition of these SNPs to clinical predictors increased the cross-validated area under the curve (AUC) from 0.618 to 0.625. Age at diagnosis was the strongest predictor, stronger than any genetic characteristics. CONCLUSION Prediction of triple-negative breast cancer can be improved if SNPs associated with breast cancer risk are added to a prediction rule based on age at diagnosis and BMI. This finding could be used for prescreening purposes in complex molecular therapy studies for triple-negative breast cancer.
Affiliation(s)
- Lothar Häberle
- Department of Gynecology and Obstetrics, Erlangen University Hospital, University Breast Center for Franconia, Comprehensive Cancer Center Erlangen-EMN, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany; Biostatistics Unit, Department of Gynecology and Obstetrics, Erlangen University Hospital, Erlangen, Germany
| | - Alexander Hein
- Department of Gynecology and Obstetrics, Erlangen University Hospital, University Breast Center for Franconia, Comprehensive Cancer Center Erlangen-EMN, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany
| | - Matthias Rübner
- Department of Gynecology and Obstetrics, Erlangen University Hospital, University Breast Center for Franconia, Comprehensive Cancer Center Erlangen-EMN, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany
| | - Michael Schneider
- Department of Gynecology and Obstetrics, Erlangen University Hospital, University Breast Center for Franconia, Comprehensive Cancer Center Erlangen-EMN, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany
| | - Arif B Ekici
- Institute of Human Genetics, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany
| | - Paul Gass
- Department of Gynecology and Obstetrics, Erlangen University Hospital, University Breast Center for Franconia, Comprehensive Cancer Center Erlangen-EMN, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany
| | - Arndt Hartmann
- Institute of Pathology, Erlangen University Hospital, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany
| | - Rüdiger Schulz-Wendtland
- Institute of Diagnostic Radiology, Erlangen University Hospital, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany
| | - Matthias W Beckmann
- Department of Gynecology and Obstetrics, Erlangen University Hospital, University Breast Center for Franconia, Comprehensive Cancer Center Erlangen-EMN, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany
| | - Wing-Yee Lo
- Dr. Margarete Fischer-Bosch Institute of Clinical Pharmacology, Stuttgart, Germany; University of Tübingen, Tübingen, Germany
| | - Werner Schroth
- Dr. Margarete Fischer-Bosch Institute of Clinical Pharmacology, Stuttgart, Germany; University of Tübingen, Tübingen, Germany
| | - Hiltrud Brauch
- Dr. Margarete Fischer-Bosch Institute of Clinical Pharmacology, Stuttgart, Germany; University of Tübingen, Tübingen, Germany; German Cancer Consortium (DKTK), German Cancer Research Center (DKFZ), Heidelberg, Germany
| | - Peter A Fasching
- Department of Gynecology and Obstetrics, Erlangen University Hospital, University Breast Center for Franconia, Comprehensive Cancer Center Erlangen-EMN, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany
| | - Marius Wunderle
- Department of Gynecology and Obstetrics, Erlangen University Hospital, University Breast Center for Franconia, Comprehensive Cancer Center Erlangen-EMN, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany
|
17
|
Lottaz C, Gronwald W, Spang R, Engelmann JC. High-Dimensional Profiling for Computational Diagnosis. Methods Mol Biol 2017; 1526:205-229. [PMID: 27896744 DOI: 10.1007/978-1-4939-6613-4_12] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/06/2023]
Abstract
New technologies allow for high-dimensional profiling of patients. For instance, genome-wide gene expression analysis in tumors or in blood is feasible with microarrays, if all transcripts are known, or even without this restriction using high-throughput RNA sequencing. Other technologies like NMR fingerprinting allow for high-dimensional profiling of metabolites in blood or urine. Such technologies for high-dimensional patient profiling represent novel possibilities for molecular diagnostics. In clinical profiling studies, researchers aim to predict disease type, survival, or treatment response for new patients using high-dimensional profiles. In this process, they encounter a series of obstacles and pitfalls. We review fundamental issues from machine learning and recommend a procedure for the computational aspects of a clinical profiling study.
Affiliation(s)
- Claudio Lottaz
- Institute of Functional Genomics, University of Regensburg, Regensburg, Germany.
| | - Wolfram Gronwald
- Institute of Functional Genomics, University of Regensburg, Regensburg, Germany
| | - Rainer Spang
- Institute of Functional Genomics, University of Regensburg, Regensburg, Germany
| | - Julia C Engelmann
- Institute of Functional Genomics, University of Regensburg, Regensburg, Germany
|
18
|
InFlo: a novel systems biology framework identifies cAMP-CREB1 axis as a key modulator of platinum resistance in ovarian cancer. Oncogene 2016; 36:2472-2482. [PMID: 27819677 PMCID: PMC5415943 DOI: 10.1038/onc.2016.398] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Received: 05/27/2016] [Revised: 08/23/2016] [Accepted: 09/18/2016] [Indexed: 01/05/2023]
Abstract
Characterizing the complex interplay of cellular processes in cancer would enable the discovery of key mechanisms underlying its development and progression. Published approaches to decipher driver mechanisms do not explicitly model tissue-specific changes in pathway networks and the regulatory disruptions related to genomic aberrations in cancers. We therefore developed InFlo, a novel systems biology approach for characterizing complex biological processes using a unique multidimensional framework integrating transcriptomic, genomic and/or epigenomic profiles for any given cancer sample. We show that InFlo robustly characterizes tissue-specific differences in activities of signalling networks on a genome scale using unique probabilistic models of molecular interactions on a per-sample basis. Using large-scale multi-omics cancer datasets, we show that InFlo exhibits higher sensitivity and specificity in detecting pathway networks associated with specific disease states when compared to published pathway network modelling approaches. Furthermore, InFlo's ability to infer the activity of unmeasured signalling network components was also validated using orthogonal gene expression signatures. We then evaluated multi-omics profiles of primary high-grade serous ovarian cancer tumours (N=357) to delineate mechanisms underlying resistance to frontline platinum-based chemotherapy. InFlo was the only algorithm to identify hyperactivation of the cAMP-CREB1 axis as a key mechanism associated with resistance to platinum-based therapy, a finding that we subsequently experimentally validated. We confirmed that inhibition of CREB1 phosphorylation potently sensitized resistant cells to platinum therapy and was effective in killing ovarian cancer stem cells that contribute to both platinum-resistance and tumour recurrence. 
Thus, we propose InFlo to be a scalable, widely applicable, and robust integrative network modelling framework for the discovery of evidence-based biomarkers and therapeutic targets.
|
19
|
CAFÉ-Map: Context Aware Feature Mapping for mining high dimensional biomedical data. Comput Biol Med 2016; 79:68-79. [PMID: 27764717 DOI: 10.1016/j.compbiomed.2016.10.006] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 05/09/2016] [Revised: 10/05/2016] [Accepted: 10/10/2016] [Indexed: 12/18/2022]
Abstract
Feature selection and ranking is of great importance in the analysis of biomedical data. In addition to reducing the number of features used in classification or other machine learning tasks, it allows us to extract meaningful biological and medical information from a machine learning model. Most existing approaches in this domain do not directly model the fact that the relative importance of features can be different in different regions of the feature space. In this work, we present a context aware feature ranking algorithm called CAFÉ-Map. CAFÉ-Map is a locally linear feature ranking framework that allows recognition of important features in any given region of the feature space or for any individual example. This allows for simultaneous classification and feature ranking in an interpretable manner. We have benchmarked CAFÉ-Map on a number of toy and real world biomedical data sets. Our comparative study with a number of published methods shows that CAFÉ-Map achieves better accuracies on these data sets. The top ranking features obtained through CAFÉ-Map in a gene profiling study correlate very well with the importance of different genes reported in the literature. Furthermore, CAFÉ-Map provides a more in-depth analysis of feature ranking at the level of individual examples. AVAILABILITY CAFÉ-Map Python code is available at: http://faculty.pieas.edu.pk/fayyaz/software.html#cafemap . The CAFÉ-Map package supports parallelization and sparse data and provides example scripts for classification. This code can be used to reconstruct the results given in this paper.
|
20
|
Hamm A, Prenen H, Van Delm W, Di Matteo M, Wenes M, Delamarre E, Schmidt T, Weitz J, Sarmiento R, Dezi A, Gasparini G, Rothé F, Schmitz R, D'Hoore A, Iserentant H, Hendlisz A, Mazzone M. Tumour-educated circulating monocytes are powerful candidate biomarkers for diagnosis and disease follow-up of colorectal cancer. Gut 2016; 65:990-1000. [PMID: 25814648 DOI: 10.1136/gutjnl-2014-308988] [Citation(s) in RCA: 57] [Impact Index Per Article: 7.1] [Received: 12/08/2014] [Accepted: 03/06/2015] [Indexed: 12/23/2022]
Abstract
OBJECTIVE Cancer immunology is a growing field of research whose aim is to develop innovative therapies and diagnostic tests. Starting from the hypothesis that immune cells promptly respond to harmful stimuli, we used peripheral blood monocytes in order to characterise a distinct gene expression profile and to evaluate its potential as a candidate diagnostic biomarker in patients with colorectal cancer (CRC), a still unmet clinical need. DESIGN We performed a case-control study including 360 peripheral blood monocyte samples from four European oncological centres and defined a gene expression profile specific to CRC. The robustness of the genetic profile and disease specificity were assessed in an independent setting. RESULTS This screen returned 43 putative diagnostic markers, which we refined and validated in the confirmative multicentric analysis to 23 genes with outstanding diagnostic accuracy (area under the curve (AUC)=0.99 (0.99 to 1.00), Se=100.0% (100.0% to 100.0%), Sp=92.9% (78.6% to 100.0%) in multiple-gene receiver operating characteristic analysis). The diagnostic accuracy was robustly maintained in prospectively collected independent samples (AUC=0.95 (0.85 to 1.00), Se=92.6% (81.5% to 100.0%), Sp=92.3% (76.9% to 100.0%)). This monocyte signature was expressed at early disease onset, remained robust over the course of disease progression, and was specific for the monocytic fraction of mononuclear cells. The gene modulation was induced specifically by soluble factors derived from transformed colon epithelium in comparison to normal colon or other cancer histotypes. Moreover, expression changes were plastic and reversible, as they were abrogated upon withdrawal of these tumour-released factors. Consistently, the modified set of genes reverted to normal expression upon curative treatment and was specific for CRC. CONCLUSIONS Our study is the first to demonstrate monocyte plasticity in response to tumour-released soluble factors. 
The identified distinct signature in tumour-educated monocytes might be used as a candidate biomarker in CRC diagnosis and harbours the potential for disease follow-up and therapeutic monitoring.
Affiliation(s)
- Alexander Hamm
- Laboratory of Molecular Oncology and Angiogenesis, Vesalius Research Center, VIB, Leuven, Belgium; Laboratory of Molecular Oncology and Angiogenesis, Department of Oncology, Vesalius Research Center, KU Leuven, Leuven, Belgium
| | - Hans Prenen
- Digestive Oncology, University Hospitals Leuven and Department of Oncology, KU Leuven, Leuven, Belgium
| | | | - Mario Di Matteo
- Laboratory of Molecular Oncology and Angiogenesis, Vesalius Research Center, VIB, Leuven, Belgium; Laboratory of Molecular Oncology and Angiogenesis, Department of Oncology, Vesalius Research Center, KU Leuven, Leuven, Belgium
| | - Mathias Wenes
- Laboratory of Molecular Oncology and Angiogenesis, Vesalius Research Center, VIB, Leuven, Belgium; Laboratory of Molecular Oncology and Angiogenesis, Department of Oncology, Vesalius Research Center, KU Leuven, Leuven, Belgium
| | - Estelle Delamarre
- Laboratory of Molecular Oncology and Angiogenesis, Vesalius Research Center, VIB, Leuven, Belgium; Laboratory of Molecular Oncology and Angiogenesis, Department of Oncology, Vesalius Research Center, KU Leuven, Leuven, Belgium
| | - Thomas Schmidt
- Department of General, Visceral, and Transplantation Surgery, University of Heidelberg, Heidelberg, Germany
| | - Jürgen Weitz
- Department of General, Visceral, and Transplantation Surgery, University of Heidelberg, Heidelberg, Germany; Department of Visceral, Thoracic, and Vascular Surgery, University Hospital Carl Gustav Carus, Technical University Dresden, Dresden, Germany
| | | | - Angelo Dezi
- Department of Oncology, San Filippo Neri, Rome, Italy
| | | | - Françoise Rothé
- Medical Oncology Clinic, Institut Jules Bordet, Brussels, Belgium
| | - Robin Schmitz
- Department of General, Visceral, and Transplantation Surgery, University of Heidelberg, Heidelberg, Germany
| | - André D'Hoore
- Department of Abdominal Surgery, University Hospitals Leuven, KU Leuven, Leuven, Belgium
| | | | - Alain Hendlisz
- Medical Oncology Clinic, Institut Jules Bordet, Brussels, Belgium
| | - Massimiliano Mazzone
- Laboratory of Molecular Oncology and Angiogenesis, Vesalius Research Center, VIB, Leuven, Belgium; Laboratory of Molecular Oncology and Angiogenesis, Department of Oncology, Vesalius Research Center, KU Leuven, Leuven, Belgium
|
21
|
Huang HH, Liu XY, Liang Y. Feature Selection and Cancer Classification via Sparse Logistic Regression with the Hybrid L1/2 +2 Regularization. PLoS One 2016; 11:e0149675. [PMID: 27136190 PMCID: PMC4852916 DOI: 10.1371/journal.pone.0149675] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.9] [Received: 09/18/2015] [Accepted: 02/02/2016] [Indexed: 11/18/2022] Open
Abstract
Cancer classification and feature (gene) selection play an important role in knowledge discovery in genomic data. Although logistic regression is one of the most popular classification methods, it does not induce feature selection. In this paper, we presented a new hybrid L1/2 +2 regularization (HLR) function, a linear combination of L1/2 and L2 penalties, to select the relevant genes in the logistic regression. The HLR approach inherits some fascinating characteristics from the L1/2 (sparsity) and L2 (grouping effect, where highly correlated variables are in or out of a model together) penalties. We also proposed a novel univariate HLR thresholding approach to update the estimated coefficients and developed the coordinate descent algorithm for the HLR penalized logistic regression model. The empirical results and simulations indicate that the proposed method is highly competitive amongst several state-of-the-art methods.
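The idea of a penalty that linearly combines L1/2 and L2 terms can be sketched as below. The abstract does not give the exact weighting, so the parametrization with an overall strength `lam` and a mixing weight `alpha` is an assumption made for illustration, as are the function name and the example coefficients.

```python
import math


def hlr_penalty(coefficients, lam, alpha):
    """Assumed form: lam * (alpha * sum(|b|^(1/2)) + (1 - alpha) * sum(b^2))."""
    l_half_term = sum(math.sqrt(abs(b)) for b in coefficients)  # encourages sparsity
    l2_term = sum(b * b for b in coefficients)  # encourages grouping of correlated variables
    return lam * (alpha * l_half_term + (1.0 - alpha) * l2_term)


# Hypothetical regression coefficients for illustration only.
beta = [0.0, 4.0, -1.0]

penalty = hlr_penalty(beta, lam=0.5, alpha=0.5)
pure_l_half = hlr_penalty(beta, lam=0.5, alpha=1.0)
print(penalty, pure_l_half)
```

Under this assumed form, setting `alpha` to 1 leaves only the L1/2 term and setting it to 0 leaves only the L2 term, so the mixing weight trades sparsity against the grouping effect.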
Affiliation(s)
- Hai-Hui Huang
- Faculty of Information Technology & State Key Laboratory of Quality Research in Chinese Medicines, Macau University of Science and Technology, Avenida Wai Long, Taipa, Macau, 999078, China
| | - Xiao-Ying Liu
- Faculty of Information Technology & State Key Laboratory of Quality Research in Chinese Medicines, Macau University of Science and Technology, Avenida Wai Long, Taipa, Macau, 999078, China
| | - Yong Liang
- Faculty of Information Technology & State Key Laboratory of Quality Research in Chinese Medicines, Macau University of Science and Technology, Avenida Wai Long, Taipa, Macau, 999078, China
|
22
|
Jong VL, Novianti PW, Roes KCB, Eijkemans MJC. Selecting a classification function for class prediction with gene expression data. Bioinformatics 2016; 32:1814-22. [PMID: 26873933 DOI: 10.1093/bioinformatics/btw034] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 10/12/2015] [Accepted: 01/15/2016] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Class prediction with gene expression is widely used to generate diagnostic and/or prognostic models. The literature reveals that classification functions perform differently across gene expression datasets. The question of which classification function should be used for a given dataset remains to be answered. In this study, a predictive model for choosing an optimal function for class prediction on a given dataset was devised. RESULTS To achieve this, gene expression data were simulated for different values of gene-pair correlations, sample size, genes' variances, differentially expressed genes and fold changes. For each simulated dataset, ten classifiers were built and evaluated using ten classification functions. The resulting accuracies from 1152 different simulation scenarios by ten classification functions were then modeled using a linear mixed effects regression on the studied data characteristics, yielding a model that predicts the accuracy of the functions on a given dataset. An application of our model on eight real-life datasets showed positive correlations (0.33-0.82) between the predicted and expected accuracies. CONCLUSION The predictive model presented here might serve as a guide to choose an optimal classification function among the 10 studied functions, for any given gene expression dataset. AVAILABILITY AND IMPLEMENTATION The R source code for the analysis and an R-package 'SPreFuGED' are available at Bioinformatics online. CONTACT v.l.jong@umcutecht.nl SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Victor L Jong
- Biostatistics & Research Support, Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, 3508 GA, Utrecht, The Netherlands; Viroscience Lab, Erasmus Medical Center Rotterdam, Rotterdam, CE 3015, The Netherlands
| | - Putri W Novianti
- Biostatistics & Research Support, Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, 3508 GA, Utrecht, The Netherlands; Epidemiology & Biostatistics Department, Vrije University Medical Center Amsterdam, HV Amsterdam 1081, The Netherlands
| | - Kit C B Roes
- Biostatistics & Research Support, Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, 3508 GA, Utrecht, The Netherlands
| | - Marinus J C Eijkemans
- Biostatistics & Research Support, Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, 3508 GA, Utrecht, The Netherlands
|
23
|
Jong VL, Novianti PW, Roes KCB, Eijkemans MJC. Exploring homogeneity of correlation structures of gene expression datasets within and between etiological disease categories. Stat Appl Genet Mol Biol 2015; 13:717-32. [PMID: 25503674 DOI: 10.1515/sagmb-2014-0003] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Indexed: 12/21/2022]
Abstract
The literature shows that classifiers perform differently across datasets and that correlations within datasets affect the performance of classifiers. The question that arises is whether the correlation structure within datasets differs significantly across diseases. In this study, we evaluated the homogeneity of correlation structures within and between datasets of six etiological disease categories: inflammatory, immune, infectious, degenerative, hereditary and acute myeloid leukemia (AML). We also assessed the effect of two filtering methods, detection call and variance filtering, on correlation structures. We downloaded microarray datasets from ArrayExpress for experiments meeting predefined criteria and ended up with 12 datasets for non-cancerous diseases and six for AML. The datasets were preprocessed by a common procedure incorporating platform-specific recommendations and the two filtering methods mentioned above. Homogeneity of correlation matrices between and within datasets of etiological diseases was assessed using Box's M statistic on permuted samples. We found that correlation structures significantly differ between datasets of the same and/or different etiological disease categories and that variance filtering eliminates more uncorrelated probesets than detection call filtering and thus renders the data highly correlated.
|
24
|
Factors affecting the accuracy of a class prediction model in gene expression data. BMC Bioinformatics 2015; 16:199. [PMID: 26093633 PMCID: PMC4475623 DOI: 10.1186/s12859-015-0610-4] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 01/08/2015] [Accepted: 04/30/2015] [Indexed: 01/12/2023] Open
Abstract
BACKGROUND Class prediction models have been shown to have varying performances in clinical gene expression datasets. Previous evaluation studies, mostly done in the field of cancer, showed that the accuracy of class prediction models differs from dataset to dataset and depends on the type of classification function. While a substantial amount of information is known about the characteristics of classification functions, little has been done to determine which characteristics of gene expression data have an impact on the performance of a classifier. This study aims to empirically identify data characteristics that affect the predictive accuracy of classification models, outside of the field of cancer. RESULTS Datasets from twenty-five studies meeting predefined inclusion and exclusion criteria were downloaded. Nine classification functions were chosen, falling within the categories: discriminant analyses or Bayes classifiers, tree-based, regularization and shrinkage, and nearest neighbors methods. Consequently, nine class prediction models were built for each dataset using the same procedure and their performances were evaluated by calculating their accuracies. The characteristics of each experiment were recorded (i.e., observed disease, medical question, tissue/cell types and sample size) together with characteristics of the gene expression data, namely the number of differentially expressed genes, the fold changes and the within-class correlations. Their effects on the accuracy of a class prediction model were statistically assessed by random effects logistic regression. The number of differentially expressed genes and the average fold change had a significant impact on the accuracy of a classification model and gave individual explained variation in prediction accuracy of up to 72% and 57%, respectively. 
Multivariable random effects logistic regression with forward selection yielded the two aforementioned study factors and the within class correlation as factors affecting the accuracy of classification functions, explaining 91.5% of the between study variation. CONCLUSIONS We evaluated study- and data-related factors that might explain the varying performances of classification functions in non-cancerous datasets. Our results showed that the number of differentially expressed genes, the fold change, and the correlation in gene expression data significantly affect the accuracy of class prediction models.
|
25
|
Matamala N, Vargas MT, González-Cámpora R, Miñambres R, Arias JI, Menéndez P, Andrés-León E, Gómez-López G, Yanowsky K, Calvete-Candenas J, Inglada-Pérez L, Martínez-Delgado B, Benítez J. Tumor microRNA expression profiling identifies circulating microRNAs for early breast cancer detection. Clin Chem 2015; 61:1098-106. [PMID: 26056355 DOI: 10.1373/clinchem.2015.238691] [Citation(s) in RCA: 148] [Impact Index Per Article: 16.4] [Received: 01/28/2015] [Accepted: 05/07/2015] [Indexed: 01/20/2023]
Abstract
BACKGROUND The identification of novel biomarkers for early breast cancer detection would be a great advance. Because of their role in tumorigenesis and stability in body fluids, microRNAs (miRNAs) are emerging as a promising diagnostic tool. Our aim was to identify miRNAs deregulated in breast tumors and evaluate the potential of circulating miRNAs in breast cancer detection. METHODS We conducted miRNA expression profiling of 1919 human miRNAs in paraffin-embedded tissue from 122 breast tumors and 11 healthy breast tissue samples. Differential expression analysis was performed, and a microarray classifier was generated. The most relevant miRNAs were analyzed in plasma from 26 healthy individuals and 83 patients with breast cancer (36 before and 47 after treatment) and validated in 116 healthy individuals and 114 patients before treatment. RESULTS We identified a large number of miRNAs deregulated in breast cancer and generated a 25-miRNA microarray classifier that discriminated breast tumors with high diagnostic sensitivity and specificity. Ten miRNAs were selected for further investigation, of which 4 (miR-505-5p, miR-125b-5p, miR-21-5p, and miR-96-5p) were significantly overexpressed in pretreated patients with breast cancer compared with healthy individuals in 2 different series of plasma. MiR-505-5p and miR-96-5p were the most valuable biomarkers (area under the curve 0.72). Moreover, the expression levels of miR-3656, miR-505-5p, and miR-21-5p were decreased in a group of treated patients. CONCLUSIONS Circulating miRNAs reflect the presence of breast tumors. The identification of deregulated miRNAs in plasma of patients with breast cancer supports the use of circulating miRNAs as a method for early breast cancer detection.
Affiliation(s)
- Eduardo Andrés-León
  - Bioinformatics Unit, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
- Gonzalo Gómez-López
  - Bioinformatics Unit, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
- Lucía Inglada-Pérez
  - Human Cancer Genetics Programme and Spanish Network in Rare Diseases (CIBERER), Madrid, Spain
- Beatriz Martínez-Delgado
  - Molecular Genetics Unit, Research Institute of Rare Diseases (IIER), Instituto de Salud Carlos III (ISCIII), Madrid, Spain
- Javier Benítez
  - Human Cancer Genetics Programme and Spanish Network in Rare Diseases (CIBERER), Madrid, Spain
|
26
|
van den Berg BA, Reinders MJT, de Ridder D, de Beer TAP. Insight into neutral and disease-associated human genetic variants through interpretable predictors. PLoS One 2015; 10:e0120729. [PMID: 25826299 PMCID: PMC4380319 DOI: 10.1371/journal.pone.0120729] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 11/15/2014] [Accepted: 01/14/2015] [Indexed: 11/30/2022] [Open access]
Abstract
A variety of methods that predict human nonsynonymous single nucleotide polymorphisms (SNPs) to be neutral or disease-associated have been developed over the last decade. These methods are used for pinpointing disease-associated variants in the many variants obtained with next-generation sequencing technologies. The high performances of current sequence-based predictors indicate that sequence data contains valuable information about a variant being neutral or disease-associated. However, most predictors do not readily disclose this information, and so it remains unclear what sequence properties are most important. Here, we show how we can obtain insight into sequence characteristics of variants and their surroundings by interpreting predictors. We used an extensive range of features derived from the variant itself, its surrounding sequence, sequence conservation, and sequence annotation, and employed linear support vector machine classifiers to enable extracting feature importance from trained predictors. Our approach is useful for providing additional information about what features are most important for the predictions made. Furthermore, for large sets of known variants, it can provide insight into the mechanisms responsible for variants being disease-associated.
Affiliation(s)
- Bastiaan A. van den Berg
  - Delft Bioinformatics Lab, Department of Intelligent Systems, Faculty Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, Mekelweg 4, 2628CD, Delft, The Netherlands
  - Netherlands Bioinformatics Centre, Nijmegen, The Netherlands
  - Kluyver Centre for Genomics of Industrial Fermentation, Delft, The Netherlands
- Marcel J. T. Reinders
  - Delft Bioinformatics Lab, Department of Intelligent Systems, Faculty Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, Mekelweg 4, 2628CD, Delft, The Netherlands
  - Netherlands Bioinformatics Centre, Nijmegen, The Netherlands
  - Kluyver Centre for Genomics of Industrial Fermentation, Delft, The Netherlands
- Dick de Ridder
  - Delft Bioinformatics Lab, Department of Intelligent Systems, Faculty Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, Mekelweg 4, 2628CD, Delft, The Netherlands
  - Bioinformatics Group, Wageningen University, Droevendaalsesteeg 1, 6708PB, Wageningen, The Netherlands
  - Netherlands Bioinformatics Centre, Nijmegen, The Netherlands
  - Kluyver Centre for Genomics of Industrial Fermentation, Delft, The Netherlands
- Tjaart A. P. de Beer
  - European Bioinformatics Institute (EMBL-EBI) European Molecular Biology Laboratory, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
  - Biozentrum, University of Basel, Basel 4056, Switzerland
  - SIB Swiss Institute of Bioinformatics, Basel 4056, Switzerland
|
27
|
Taskesen E, Babaei S, Reinders MMJ, de Ridder J. Integration of gene expression and DNA-methylation profiles improves molecular subtype classification in acute myeloid leukemia. BMC Bioinformatics 2015; 16 Suppl 4:S5. [PMID: 25734246 PMCID: PMC4347619 DOI: 10.1186/1471-2105-16-s4-s5] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Indexed: 12/02/2022] [Open access]
Abstract
Background Acute Myeloid Leukemia (AML) is characterized by various cytogenetic and molecular abnormalities. Detection of these abnormalities is important in the risk-classification of patients but requires laborious experimentation. Various studies showed that gene expression profiles (GEP), and the gene signatures derived from GEP, can be used for the prediction of subtypes in AML. Similarly, successful prediction was also achieved by exploiting DNA-methylation profiles (DMP). There are, however, no studies that compared classification accuracy and performance between GEP and DMP, nor are there studies that integrated both types of data to determine whether predictive power can be improved. Approach Here, we used 344 well-characterized AML samples for which both gene expression and DNA-methylation profiles are available. We created three different classification strategies including early, late and no integration of these datasets and used them to predict AML subtypes using a logistic regression model with Lasso regularization. Results We illustrate that both gene expression and DNA-methylation profiles contain distinct patterns that contribute to discriminating AML subtypes and that an integration strategy can exploit these patterns to achieve synergy between both data types. We show that concatenation of features from both data sets, i.e. early integration, improves the predictive power compared to classifiers trained on GEP or DMP alone. A more sophisticated strategy, i.e. the late integration strategy, employs a two-layer classifier which outperforms the early integration strategy. Conclusion We demonstrate that prediction of known cytogenetic and molecular abnormalities in AML can be further improved by integrating GEP and DMP.
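The difference between the early and late integration strategies described in this abstract can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' pipeline: a nearest-centroid classifier is swapped in for the Lasso-regularized logistic regression, binary labels 0/1 are assumed, and all function names and toy values are hypothetical.

```python
from math import dist


def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def fit_centroids(X, y):
    """Return {label: centroid} for a nearest-centroid classifier."""
    return {label: centroid([x for x, t in zip(X, y) if t == label])
            for label in sorted(set(y))}


def margin(model, x):
    """Signed score: positive when x is closer to the class-1 centroid."""
    return dist(x, model[0]) - dist(x, model[1])


def early_integration(gep, dmp, y):
    """Concatenate both feature sets, then train a single classifier."""
    model = fit_centroids([g + d for g, d in zip(gep, dmp)], y)
    return lambda g, d: int(margin(model, g + d) > 0)


def late_integration(gep, dmp, y):
    """Train one classifier per data type, then a second-layer classifier
    on the two first-layer scores."""
    m_gep, m_dmp = fit_centroids(gep, y), fit_centroids(dmp, y)
    scores = [[margin(m_gep, g), margin(m_dmp, d)] for g, d in zip(gep, dmp)]
    top = fit_centroids(scores, y)
    return lambda g, d: int(
        margin(top, [margin(m_gep, g), margin(m_dmp, d)]) > 0)


# Hypothetical toy profiles (NOT study data): 2 expression features and
# 1 methylation feature per sample, with binary subtype labels.
gep = [[1.0, 0.2], [0.9, 0.1], [0.1, 0.8], [0.2, 1.0]]
dmp = [[0.1], [0.2], [0.9], [0.8]]
y = [0, 0, 1, 1]

early = early_integration(gep, dmp, y)
late = late_integration(gep, dmp, y)
print(early([0.95, 0.15], [0.15]), late([0.15, 0.9], [0.85]))
```

In practice the second layer of a late-integration model would normally be trained on held-out first-layer scores rather than on scores from the same training samples, to limit overfitting; that step is omitted here for brevity.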
|
28
|
Kempowsky-Hamon T, Valle C, Lacroix-Triki M, Hedjazi L, Trouilh L, Lamarre S, Labourdette D, Roger L, Mhamdi L, Dalenc F, Filleron T, Favre G, François JM, Le Lann MV, Anton-Leberre V. Fuzzy logic selection as a new reliable tool to identify molecular grade signatures in breast cancer--the INNODIAG study. BMC Med Genomics 2015; 8:3. [PMID: 25888889 PMCID: PMC4342216 DOI: 10.1186/s12920-015-0077-1] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 01/17/2014] [Accepted: 01/12/2015] [Indexed: 01/22/2023] [Open access]
Abstract
BACKGROUND Personalized medicine has become a priority in breast cancer patient management. In addition to the routinely used clinicopathological characteristics, clinicians will have to face an increasing amount of data derived from tumor molecular profiling. The aims of this study were to develop a new gene selection method based on a fuzzy logic selection and classification algorithm, and to validate the gene signatures obtained on breast cancer patient cohorts. METHODS We analyzed data from four published gene expression datasets for breast carcinomas. We identified the best discriminating genes by comparing molecular expression profiles between histologic grade 1 and 3 tumors for each of the training datasets. The most pertinent probes were selected and used to define fuzzy molecular grade 1-like (good prognosis) and fuzzy molecular grade 3-like (poor prognosis) profiles. To evaluate the prognostic performance of the fuzzy grade signatures in breast cancer tumors, a Kaplan-Meier analysis was conducted to compare the relapse-free survival deduced from histologic grade and fuzzy molecular grade classification. RESULTS We applied the fuzzy logic selection on breast cancer databases and obtained four new gene signatures. Analysis in the training public sets showed good performance of these gene signatures for grade (sensitivity from 90% to 95%, specificity 67% to 93%). To validate these gene signatures, we designed probes on custom microarrays and tested them on 150 invasive breast carcinomas. Good performance was obtained with an error rate of less than 10%. For one gene signature, among 74 histologic grade 3 and 18 grade 1 tumors, 88 cases (96%) were correctly assigned. Interestingly, histologic grade 2 tumors (n = 58) were split between these two molecular grade categories. CONCLUSION We confirmed the use of fuzzy logic selection as a new tool to identify gene signatures with good reliability and increased classification power. This method, based on artificial intelligence algorithms, was successfully applied to breast cancer molecular grade classification, allowing histologic grade 2 tumors to be classified as grade 1-like or grade 3-like to improve patient prognosis. It opens the way to further development for identification of new biomarker combinations in other applications such as prediction of treatment response.
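As a quick arithmetic check of the validation figures reported in this abstract, the 74 grade 3 and 18 grade 1 tumors give 92 cases, of which 88 were correctly assigned. The short Python sketch below reproduces the rounded percentage; the variable names are illustrative only.

```python
# Counts taken from the abstract: histologic grade 3 and grade 1 tumors
# and the number correctly assigned by one gene signature.
grade3, grade1, correct = 74, 18, 88

total = grade3 + grade1        # 92 tumors with histologic grade 1 or 3
accuracy = correct / total     # about 0.9565
error_rate = 1 - accuracy      # about 0.0435

# prints "96% correct, 4.3% misassigned"
print(f"{accuracy:.0%} correct, {error_rate:.1%} misassigned")
```

The resulting error rate of roughly 4% is consistent with the statement that the error rate was below 10%.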
Affiliation(s)
- Tatiana Kempowsky-Hamon
  - CNRS, LAAS, F-31400, Toulouse, France.
  - Université de Toulouse; INSA, UPS, INP; LISBP, F-31077, Toulouse, France.
- Carine Valle
  - Université de Toulouse; INSA, UPS, INP; LISBP, F-31077, Toulouse, France.
  - INRA, UMR792, Ingénierie des Systèmes Biologiques et des Procédés, F-31400, Toulouse, France.
  - CNRS, UMR5504, F-31400, Toulouse, France.
- Magali Lacroix-Triki
  - Institut Claudius Regaud, Biology and Pathology Department; INSERM UMR1037, Toulouse, France.
- Lyamine Hedjazi
  - CNRS, LAAS, F-31400, Toulouse, France.
  - Université de Toulouse; INSA, UPS, INP; LISBP, F-31077, Toulouse, France.
- Lidwine Trouilh
  - Université de Toulouse; INSA, UPS, INP; LISBP, F-31077, Toulouse, France.
  - INRA, UMR792, Ingénierie des Systèmes Biologiques et des Procédés, F-31400, Toulouse, France.
  - CNRS, UMR5504, F-31400, Toulouse, France.
- Sophie Lamarre
  - Université de Toulouse; INSA, UPS, INP; LISBP, F-31077, Toulouse, France.
  - INRA, UMR792, Ingénierie des Systèmes Biologiques et des Procédés, F-31400, Toulouse, France.
  - CNRS, UMR5504, F-31400, Toulouse, France.
- Delphine Labourdette
  - Université de Toulouse; INSA, UPS, INP; LISBP, F-31077, Toulouse, France.
  - INRA, UMR792, Ingénierie des Systèmes Biologiques et des Procédés, F-31400, Toulouse, France.
  - CNRS, UMR5504, F-31400, Toulouse, France.
- Laurence Roger
  - Institut Claudius Regaud, Biology and Pathology Department; INSERM UMR1037, Toulouse, France.
- Loubna Mhamdi
  - Institut Claudius Regaud, Biology and Pathology Department; INSERM UMR1037, Toulouse, France.
- Thomas Filleron
  - Institut Claudius Regaud, Oncology Department, Toulouse, France.
- Gilles Favre
  - Institut Claudius Regaud, Biology and Pathology Department; INSERM UMR1037, Toulouse, France.
- Jean-Marie François
  - Université de Toulouse; INSA, UPS, INP; LISBP, F-31077, Toulouse, France.
  - INRA, UMR792, Ingénierie des Systèmes Biologiques et des Procédés, F-31400, Toulouse, France.
  - CNRS, UMR5504, F-31400, Toulouse, France.
  - Dendris SAS, 8 Rue de Cugnaux, 31300, Toulouse, France.
- Marie-Véronique Le Lann
  - CNRS, LAAS, F-31400, Toulouse, France.
  - Université de Toulouse; INSA, UPS, INP; LISBP, F-31077, Toulouse, France.
- Véronique Anton-Leberre
  - Université de Toulouse; INSA, UPS, INP; LISBP, F-31077, Toulouse, France.
  - INRA, UMR792, Ingénierie des Systèmes Biologiques et des Procédés, F-31400, Toulouse, France.
  - CNRS, UMR5504, F-31400, Toulouse, France.
|
29
|
Ma C, Zhang HH, Wang X. Machine learning for Big Data analytics in plants. Trends Plant Sci 2014; 19:798-808. [PMID: 25223304 DOI: 10.1016/j.tplants.2014.08.004] [Citation(s) in RCA: 93] [Impact Index Per Article: 9.3] [Received: 05/28/2014] [Revised: 07/30/2014] [Accepted: 08/20/2014] [Indexed: 05/19/2023]
Abstract
Rapid advances in high-throughput genomic technology have enabled biology to enter the era of 'Big Data' (large datasets). The plant science community not only needs to build its own Big-Data-compatible parallel computing and data management infrastructures, but also to seek novel analytical paradigms to extract information from the overwhelming amounts of data. Machine learning offers promising computational and analytical solutions for the integrative analysis of large, heterogeneous and unstructured datasets on the Big-Data scale, and is gradually gaining popularity in biology. This review introduces the basic concepts and procedures of machine-learning applications and envisages how machine learning could interface with Big Data technology to facilitate basic research and biotechnology in the plant sciences.
Affiliation(s)
- Chuang Ma
  - School of Plant Sciences, University of Arizona, 1140 E. South Campus Drive, Tucson, AZ 85721, USA
- Hao Helen Zhang
  - Department of Mathematics, University of Arizona, 617 North Santa Rita Ave, Tucson, AZ 85721, USA
- Xiangfeng Wang
  - School of Plant Sciences, University of Arizona, 1140 E. South Campus Drive, Tucson, AZ 85721, USA; Department of Plant Genetics and Breeding, College of Agronomy and Biotechnology, China Agricultural University, Beijing 100193, China.
|
30
|
Tanić M, Yanowski K, Andrés E, Gómez-López G, Socorro MRP, Pisano DG, Martinez-Delgado B, Benítez J. miRNA expression profiling of formalin-fixed paraffin-embedded (FFPE) hereditary breast tumors. Genom Data 2014; 3:75-9. [PMID: 26484152 PMCID: PMC4535901 DOI: 10.1016/j.gdata.2014.11.008] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 10/30/2014] [Revised: 11/13/2014] [Accepted: 11/17/2014] [Indexed: 10/28/2022]
Abstract
Hereditary breast cancer constitutes only 5-10% of all breast cancer cases and is characterized by a strong family history of breast and/or other associated cancer types. Only ~25% of hereditary breast cancer cases carry a mutation in the BRCA1 or BRCA2 gene, while mutations in other rare high- and moderate-risk genes and common low-penetrance variants may account for an additional 20% of the cases. Thus, the majority of cases are still unaccounted for and designated as BRCAX tumors. MicroRNAs are small non-coding RNAs that play important roles as regulators of gene expression and are deregulated in cancer. To characterize hereditary breast tumors based on their miRNA expression profiles, we performed global microarray miRNA expression profiling on a retrospective cohort of 80 FFPE breast tissues, including 66 hereditary breast tumors (13 BRCA1, 10 BRCA2 and 43 BRCAX), 10 sporadic breast carcinomas and 4 normal breast tissues, using Exiqon miRCURY LNA™ microRNA Array v.11.0. Here we describe in detail the miRNA microarray expression data and tumor samples used for the study of BRCAX tumor heterogeneity (Tanic et al., 2013) and biomarkers associated with positive BRCA1/2 mutation status (Tanic et al., 2014). Additionally, we provide the R code for data preprocessing and quality control.
Affiliation(s)
- Miljana Tanić
  - Human Genetics Group, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
- Kira Yanowski
  - Human Genetics Group, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
- Eduardo Andrés
  - Bioinformatics Unit, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
- Gonzalo Gómez-López
  - Bioinformatics Unit, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
- David G Pisano
  - Bioinformatics Unit, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
- Javier Benítez
  - Human Genetics Group, Spanish National Cancer Research Centre (CNIO), Madrid, Spain; Centro de Investigación Biomédica en Red de Enfermedades Raras (CIBERER), Madrid, Spain
|
31
|
Tanic M, Yanowski K, Gómez-López G, Rodriguez-Pinilla MS, Marquez-Rodas I, Osorio A, Pisano DG, Martinez-Delgado B, Benítez J. MicroRNA expression signatures for the prediction of BRCA1/2 mutation-associated hereditary breast cancer in paraffin-embedded formalin-fixed breast tumors. Int J Cancer 2014; 136:593-602. [PMID: 24917463 DOI: 10.1002/ijc.29021] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Received: 01/15/2014] [Accepted: 05/26/2014] [Indexed: 01/07/2023]
Abstract
Screening for germline mutations in breast cancer-associated genes BRCA1 and BRCA2 is indicated for patients with breast cancer from high-risk breast cancer families and influences both treatment options and clinical management. However, only 25% of selected patients test positive for a BRCA1/2 mutation, indicating that additional diagnostic biomarkers are necessary. We analyzed 124 formalin-fixed paraffin-embedded (FFPE) tumor samples from patients with hereditary (104) and sporadic (20) invasive breast cancer, divided into two series (A and B). Microarray expression profiling of 829 human miRNAs was performed on 76 samples (Series A), and the bioinformatics tool Prophet was used to develop and test a microarray classifier. Samples were stratified into a training set (n = 38) for microarray classifier generation and a test set (n = 38) for signature validation. A 35-miRNA microarray classifier was generated for the prediction of BRCA1/2 mutation status with a reported 95% (95% CI: 0.88-1.0) and 92% (95% CI: 0.84-1.0) accuracy in the training and the test set, respectively. Differential expression of 12 miRNAs between BRCA1/2 mutation carriers and noncarriers was validated by qPCR in an independent tumor series B (n = 48). A logistic regression model based on the expression of six miRNAs (miR-142-3p, miR-505*, miR-1248, miR-181a-2*, miR-25* and miR-340*) discriminated between tumors from BRCA1/2 mutation carriers and noncarriers with 92% (95% CI: 0.84-0.99) accuracy. In conclusion, we identified miRNA expression signatures predictive of BRCA1/2 mutation status in routinely available FFPE breast tumor samples, which may be useful to complement current patient selection criteria for gene testing by identifying individuals with a high likelihood of being BRCA1/2 mutation carriers.
Affiliation(s)
- Miljana Tanic
  - Human Genetics Group, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
|
32
|
Emura T, Chen YH. Gene selection for survival data under dependent censoring: A copula-based approach. Stat Methods Med Res 2014; 25:2840-2857. [PMID: 24821000 DOI: 10.1177/0962280214533378] [Citation(s) in RCA: 45] [Impact Index Per Article: 4.5] [Indexed: 11/15/2022]
Abstract
Dependent censoring arises in biomedical studies when the survival outcome of interest is censored by competing risks. In survival data with microarray gene expressions, gene selection based on univariate Cox regression analyses has been used extensively in medical research; this approach, however, is only valid under the independent censoring assumption. In this paper, we first consider a copula-based framework to investigate the bias caused by dependent censoring on gene selection. Then, we utilize the copula-based dependence model to develop an alternative gene selection procedure. Simulations show that the proposed procedure adjusts for the effect of dependent censoring and thus outperforms the existing method when dependent censoring is indeed present. The non-small-cell lung cancer data are analyzed to demonstrate the usefulness of our proposal. We implemented the proposed method in the R package "compound.Cox".
Affiliation(s)
- Takeshi Emura
  - Graduate Institute of Statistics, National Central University, Jhongli, Taiwan
- Yi-Hau Chen
  - Institute of Statistical Science, Academia Sinica, Taipei, Taiwan
|
33
|
Novianti PW, Roes KCB, Eijkemans MJC. Evaluation of gene expression classification studies: factors associated with classification performance. PLoS One 2014; 9:e96063. [PMID: 24770439 PMCID: PMC4000205 DOI: 10.1371/journal.pone.0096063] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 12/01/2013] [Accepted: 04/03/2014] [Indexed: 12/22/2022] [Open access]
Abstract
Classification methods used in microarray studies for gene expression are diverse in the way they deal with the underlying complexity of the data, as well as in the technique used to build the classification model. The MAQC II study on cancer classification problems has found that performance was affected by factors such as the classification algorithm, cross-validation method, number of genes, and gene selection method. In this paper, we study the hypothesis that the disease under study significantly determines which method is optimal, and that additionally sample size, class imbalance, type of medical question (diagnostic, prognostic or treatment response), and microarray platform are potentially influential. A systematic literature review was used to extract the information from 48 published articles on non-cancer microarray classification studies. The impact of the various factors on the reported classification accuracy was analyzed through random-intercept logistic regression. The type of medical question and method of cross-validation dominated the explained variation in accuracy among studies, followed by disease category and microarray platform. In total, 42% of the between-study variation was explained by all the study-specific and problem-specific factors that we studied together.
Affiliation(s)
- Putri W Novianti
  - Biostatistics & Research Support, Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht, The Netherlands
- Kit C B Roes
  - Biostatistics & Research Support, Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht, The Netherlands
- Marinus J C Eijkemans
  - Biostatistics & Research Support, Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht, The Netherlands
|
34
|
van den Berg BA, Reinders MJT, Roubos JA, de Ridder D. SPiCE: a web-based tool for sequence-based protein classification and exploration. BMC Bioinformatics 2014; 15:93. [PMID: 24685258 PMCID: PMC4021553 DOI: 10.1186/1471-2105-15-93] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Received: 10/07/2013] [Accepted: 03/26/2014] [Indexed: 12/16/2022] [Open access]
Abstract
Background Amino acid sequences and features extracted from such sequences have been used to predict many protein properties, such as subcellular localization or solubility, using classifier algorithms. Although software tools are available for both feature extraction and classifier construction, their application is not straightforward, requiring users to install various packages and to convert data into different formats. This lack of easily accessible software hampers quick, explorative use of sequence-based classification techniques by biologists. Results We have developed the web-based software tool SPiCE for exploring sequence-based features of proteins in predefined classes. It offers data upload/download, sequence-based feature calculation, data visualization and protein classifier construction and testing in a single integrated, interactive environment. To illustrate its use, two example datasets are included showing the identification of differences in amino acid composition between proteins yielding low and high production levels in fungi and low and high expression levels in yeast, respectively. Conclusions SPiCE is an easy-to-use online tool for extracting and exploring sequence-based features of sets of proteins, allowing non-experts to apply advanced classification techniques. The tool is available at http://helix.ewi.tudelft.nl/spice.
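Amino acid composition, the sequence-based feature used in the example datasets mentioned above, is simply the relative frequency of each residue in a protein. The Python sketch below shows one way to compute it; it is not SPiCE's own code, and the function name and toy sequence are assumptions made for illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Return the relative frequency of each standard amino acid.

    Characters outside the 20 standard one-letter codes are ignored.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


# Toy sequence (hypothetical, not one of the SPiCE example proteins).
composition = amino_acid_composition("MKKLLA")
print(composition["K"], composition["L"])
```

The resulting 20-element vector can be fed to any standard classifier as a fixed-length representation of a variable-length sequence.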
Affiliation(s)
- Bastiaan A van den Berg
  - Delft Bioinformatics Lab, Department of Intelligent Systems, Faculty Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, Mekelweg 4, 2628CD, Delft, The Netherlands.
|
35
|
Breast cancer subtype specific classifiers of response to neoadjuvant chemotherapy do not outperform classifiers trained on all subtypes. PLoS One 2014; 9:e88551. [PMID: 24558399 PMCID: PMC3928239 DOI: 10.1371/journal.pone.0088551] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Received: 10/28/2013] [Accepted: 01/06/2014] [Indexed: 11/19/2022] [Open access]
Abstract
Introduction Despite continuous efforts, not a single predictor of breast cancer chemotherapy resistance has made it into the clinic yet. However, it has become clear in recent years that breast cancer is a collection of molecularly distinct diseases. With ever-increasing amounts of breast cancer data becoming available, we set out to study whether gene expression-based predictors of chemotherapy resistance that are specific for breast cancer subtypes can improve upon the performance of generic predictors. Methods We trained predictors of resistance that were specific for a subtype and generic predictors that were not specific for a particular subtype, i.e. trained on all subtypes simultaneously. Through a rigorous double-loop cross-validation we compared the performance of these two types of predictors on the different subtypes on a large set of tumors all profiled on the same expression platform (n = 394). We evaluated predictors based on either mRNA gene expression or clinical features. Results For HER2+, ER− breast cancer, the subtype-specific predictor based on clinical features outperformed the generic, non-specific predictor. This can be explained by the fact that the generic predictor included HER2 and ER status, features that are predictive over the whole set, but not within this subtype. In all other scenarios the generic predictors outperformed the subtype-specific predictors or showed equal performance. Conclusions Since which type of predictor (subtype-specific or generic) performed better depends on the specific context, it is highly recommended to evaluate both specific and generic predictors when attempting to predict treatment response in breast cancer.
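The double-loop (nested) cross-validation mentioned in the Methods keeps model tuning and performance estimation separate: an inner loop chooses hyperparameters using only the outer-training samples, and the outer loop scores the tuned model on samples it has never seen. The Python sketch below illustrates the idea under loud assumptions: a one-dimensional k-nearest-neighbour classifier stands in for the study's predictors, and the fold scheme, function names and toy data are all hypothetical.

```python
def k_folds(n, k):
    """Split indices 0..n-1 into k interleaved folds."""
    return [list(range(i, n, k)) for i in range(k)]


def knn_predict(train, x, k):
    """Majority vote among the k nearest training points (1-D feature)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return int(2 * votes > len(nearest))


def accuracy(train, test, k):
    """Fraction of test points whose predicted label matches the truth."""
    return sum(knn_predict(train, x, k) == y for x, y in test) / len(test)


def nested_cv(data, outer_k=3, inner_k=2, candidates=(1, 3)):
    """Outer loop estimates performance; inner loop picks the number of
    neighbours using only the outer-training samples."""
    outer_scores = []
    for test_idx in k_folds(len(data), outer_k):
        held_out = set(test_idx)
        outer_train = [d for i, d in enumerate(data) if i not in held_out]
        outer_test = [data[i] for i in test_idx]

        def inner_score(k):
            scores = []
            for val_idx in k_folds(len(outer_train), inner_k):
                val = set(val_idx)
                fit = [d for i, d in enumerate(outer_train) if i not in val]
                check = [outer_train[i] for i in val_idx]
                scores.append(accuracy(fit, check, k))
            return sum(scores) / len(scores)

        best_k = max(candidates, key=inner_score)
        outer_scores.append(accuracy(outer_train, outer_test, best_k))
    return sum(outer_scores) / len(outer_scores)


# Hypothetical, well-separated toy data: (feature, label) pairs.
data = [(0.10, 0), (0.90, 1), (0.20, 0), (0.80, 1), (0.15, 0), (0.85, 1),
        (0.30, 0), (0.70, 1), (0.25, 0), (0.75, 1), (0.05, 0), (0.95, 1)]
print(nested_cv(data))
```

Because the held-out fold never influences the choice of `best_k`, the averaged outer score is not inflated by tuning, which is the point of using two loops instead of one.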
|
36
|
Staiger C, Cadot S, Györffy B, Wessels LFA, Klau GW. Current composite-feature classification methods do not outperform simple single-genes classifiers in breast cancer prognosis. Front Genet 2013; 4:289. [PMID: 24391662 PMCID: PMC3870302 DOI: 10.3389/fgene.2013.00289] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.2] [Received: 09/15/2013] [Accepted: 11/28/2013] [Indexed: 01/21/2023] [Open access]
Abstract
Integrating gene expression data with secondary data such as pathway or protein-protein interaction data has been proposed as a promising approach for improved outcome prediction of cancer patients. Methods employing this approach usually aggregate the expression of genes into new composite features, while the secondary data guide this aggregation. Previous studies were limited to few data sets with a small number of patients. Moreover, each study used different data and evaluation procedures. This makes it difficult to objectively assess the gain in classification performance. Here we introduce the Amsterdam Classification Evaluation Suite (ACES). ACES is a Python package to objectively evaluate classification and feature-selection methods and contains methods for pooling and normalizing Affymetrix microarrays from different studies. It is simple to use and therefore facilitates the comparison of new approaches to best-in-class approaches. In addition to the methods described in our earlier study (Staiger et al., 2012), we have included two prominent prognostic gene signatures specific for breast cancer outcome, one more composite feature selection method and two network-based gene ranking methods. Employing the evaluation pipeline we show that current composite-feature classification methods do not outperform simple single-genes classifiers in predicting outcome in breast cancer. Furthermore, we find that also the stability of features across different data sets is not higher for composite features. Most stunningly, we observe that prediction performances are not affected when extracting features from randomized PPI networks.
Affiliation(s)
- Christine Staiger
  - Life Sciences, Centrum Wiskunde & Informatica Amsterdam, Netherlands; Computational Cancer Biology, Division of Molecular Carcinogenesis, Netherlands Cancer Institute Amsterdam, Netherlands
- Sidney Cadot
  - Computational Cancer Biology, Division of Molecular Carcinogenesis, Netherlands Cancer Institute Amsterdam, Netherlands
- Balázs Györffy
  - Research Laboratory of Pediatrics and Nephrology, Hungarian Academy of Sciences Budapest, Hungary
- Lodewyk F A Wessels
  - Computational Cancer Biology, Division of Molecular Carcinogenesis, Netherlands Cancer Institute Amsterdam, Netherlands; Cancer Systems Biology Center, Netherlands Cancer Institute Amsterdam, Netherlands; Delft Bioinformatics Lab, Faculty of Electrical Engineering, Mathematics and Computer Science, TU Delft Delft, Netherlands
- Gunnar W Klau
  - Life Sciences, Centrum Wiskunde & Informatica Amsterdam, Netherlands; Operations Research and Bioinformatics, Faculty of Sciences, VU University Amsterdam Amsterdam, Netherlands
|
37
|
Van den broeck A, Vankelecom H, Van Delm W, Gremeaux L, Wouters J, Allemeersch J, Govaere O, Roskams T, Topal B. Human pancreatic cancer contains a side population expressing cancer stem cell-associated and prognostic genes. PLoS One 2013; 8:e73968. [PMID: 24069258 PMCID: PMC3775803 DOI: 10.1371/journal.pone.0073968] [Citation(s) in RCA: 56] [Impact Index Per Article: 5.1] [Received: 05/15/2013] [Accepted: 07/23/2013] [Indexed: 12/17/2022] [Open access]
Abstract
In many types of cancers, a side population (SP) has been identified based on high efflux capacity, thereby enriching for chemoresistant cells as well as for candidate cancer stem cells (CSC). Here, we explored whether human pancreatic ductal adenocarcinoma (PDAC) contains a SP, and whether its gene expression profile is associated with chemoresistance, CSC and prognosis. After dispersion into single cells and incubation with Hoechst dye, we analyzed human PDAC resection specimens using flow cytometry (FACS). We identified a SP and main population (MP) in all human PDAC resection specimens (n = 52) analyzed, but detected immune (CD45+) and endothelial (CD31+) cells in this fraction together with tumor cells. The SP and MP cells, or more purified fractions depleted of CD31+/CD45+ cells (pSP and pMP), were sorted by FACS and subjected to whole-genome expression analysis. This revealed upregulation of genes associated with therapy resistance and of markers identified before in putative pancreatic CSC. pSP gene signatures of 32 or 10 up- or downregulated genes were developed and tested for discriminatory competence between pSP and pMP in different sets of PDAC samples. The prognostic value of the pSP genes was validated in a large independent series of PDAC patients (n = 78) using nCounter analysis of expression (in tumor versus surrounding pancreatic tissue) and Cox regression for disease-free and overall survival. Of these genes, expression levels of ABCB1 and CXCR4 were correlated with worse patient survival. Thus, our study demonstrates for the first time that human PDAC contains a SP. This tumor subpopulation may represent a valuable therapeutic target given its chemoresistance- and CSC-associated gene expression characteristics with potential prognostic value.
MESH Headings
- ATP Binding Cassette Transporter, Subfamily B
- ATP Binding Cassette Transporter, Subfamily B, Member 1/genetics
- ATP Binding Cassette Transporter, Subfamily B, Member 1/metabolism
- Adult
- Aged
- Aged, 80 and over
- Carcinoma, Pancreatic Ductal/genetics
- Carcinoma, Pancreatic Ductal/metabolism
- Carcinoma, Pancreatic Ductal/mortality
- Case-Control Studies
- Female
- Gene Expression Profiling
- Humans
- Immunophenotyping
- Male
- Middle Aged
- Neoplastic Stem Cells/metabolism
- Pancreatic Neoplasms/genetics
- Pancreatic Neoplasms/metabolism
- Pancreatic Neoplasms/mortality
- Prognosis
- Receptors, CXCR4/genetics
- Receptors, CXCR4/metabolism
- Side-Population Cells/metabolism
Affiliation(s)
- Anke Van den broeck
- Department of Abdominal Surgery, University Hospitals Leuven, Leuven, Belgium
- Laboratory of Tissue Plasticity, Research Unit of Embryo and Stem Cells, Department of Development & Regeneration, University of Leuven (KU Leuven), Leuven, Belgium
- Hugo Vankelecom
- Laboratory of Tissue Plasticity, Research Unit of Embryo and Stem Cells, Department of Development & Regeneration, University of Leuven (KU Leuven), Leuven, Belgium
- Wouter Van Delm
- VIB Nucleomics Core, University of Leuven (KU Leuven), Leuven, Belgium
- Lies Gremeaux
- Laboratory of Tissue Plasticity, Research Unit of Embryo and Stem Cells, Department of Development & Regeneration, University of Leuven (KU Leuven), Leuven, Belgium
- Jasper Wouters
- Laboratory of Tissue Plasticity, Research Unit of Embryo and Stem Cells, Department of Development & Regeneration, University of Leuven (KU Leuven), Leuven, Belgium
- Joke Allemeersch
- VIB Nucleomics Core, University of Leuven (KU Leuven), Leuven, Belgium
- Olivier Govaere
- Department of Pathology, University Hospitals Leuven, Leuven, Belgium
- Tania Roskams
- Department of Pathology, University Hospitals Leuven, Leuven, Belgium
- Baki Topal
- Department of Abdominal Surgery, University Hospitals Leuven, Leuven, Belgium
- * E-mail:
|
38
|
Hedjazi L, Le Lann MV, Kempowsky T, Dalenc F, Aguilar-Martin J, Favre G. Symbolic data analysis to defy low signal-to-noise ratio in microarray data for breast cancer prognosis. J Comput Biol 2013; 20:610-20. [PMID: 23899014 DOI: 10.1089/cmb.2012.0249] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 12/20/2022] Open
Abstract
Microarray profiling has recently generated the hope to gain new insights into breast cancer biology and thereby improve the performance of current prognostic tools. However, it also poses several serious challenges to classical data analysis techniques related to the characteristics of resulting data, mainly high dimensionality and low signal-to-noise ratio. Despite the tremendous research work performed to handle the first challenge in the feature selection framework, very little attention has been directed to address the second one. We propose in this article to address both issues simultaneously based on symbolic data analysis capabilities in order to derive more accurate genetic marker-based prognostic models. In particular, interval data representation is employed to model various uncertainties in microarray measurements. A recent feature selection algorithm that handles symbolic interval data is used then to derive a genetic signature. The predictive value of the derived signature is then assessed by following a rigorous experimental setup and compared with existing prognostic approaches in terms of predictive performance and estimated survival probability. It is shown that the derived signature (GenSym) performs significantly better than other prognostic models, including the 70-gene signature, St. Gallen, and National Institutes of Health criteria.
|
39
|
García-Closas M, Gail MH, Kelsey KT, Ziegler RG. Searching for blood DNA methylation markers of breast cancer risk and early detection. J Natl Cancer Inst 2013; 105:678-80. [PMID: 23578855 DOI: 10.1093/jnci/djt090] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 12/22/2022] Open
|
40
|
Xu Z, Bolick SCE, DeRoo LA, Weinberg CR, Sandler DP, Taylor JA. Epigenome-wide association study of breast cancer using prospectively collected sister study samples. J Natl Cancer Inst 2013; 105:694-700. [PMID: 23578854 DOI: 10.1093/jnci/djt045] [Citation(s) in RCA: 109] [Impact Index Per Article: 9.9] [Indexed: 12/13/2022] Open
Abstract
BACKGROUND Previous studies have suggested DNA methylation in blood is a potential epigenetic marker of cancer risk, but this has not been evaluated on a genome-wide scale in prospective studies for breast cancer. METHODS We measured DNA methylation at 27578 CpGs in blood samples from 298 women who developed breast cancer 0 to 5 years after enrollment in the Sister Study cohort and compared them with a random sample of 612 cohort women who remained cancer free. We also genotyped women for nine common polymorphisms associated with breast cancer. RESULTS We identified 250 differentially methylated CpGs (dmCpGs) between case subjects and noncase subjects (false discovery rate [FDR] Q < 0.05). Of these dmCpGs, 75.2% were undermethylated in case subjects relative to noncase subjects. Women diagnosed within 1 year of blood draw had small but consistently greater divergence from noncase subjects than did women diagnosed at more than 1 year. Gene set enrichment analysis identified Kyoto Encyclopedia of Genes and Genomes cancer pathways at the recommended FDR of Q less than 0.25. Receiver operating characteristic analysis estimated a prediction accuracy of 65.8% (95% confidence interval = 61.0% to 70.5%) for methylation, compared with 56.0% for the Gail model and 58.8% for genome-wide association study polymorphisms. The prediction accuracy of just five dmCpGs (64.1%) was almost as good as the larger panel and was similar (63.1%) when replicated in a small sample of 81 women with diverse ethnic backgrounds. CONCLUSIONS Methylation profiling of blood holds promise for breast cancer detection and risk prediction.
Affiliation(s)
- Zongli Xu
- Epidemiology Branch, National Institute of Environmental Health Sciences, National Institutes of Health, Research Triangle Park, NC 27709, USA
|
41
|
de Ridder D, de Ridder J, Reinders MJT. Pattern recognition in bioinformatics. Brief Bioinform 2013; 14:633-47. [DOI: 10.1093/bib/bbt020] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.5] [Indexed: 11/14/2022] Open
|
42
|
van Vliet MH, Burgmer P, de Quartel L, Brand JPL, de Best LCM, Viëtor H, Löwenberg B, Valk PJM, van Beers EH. Detection of CEBPA double mutants in acute myeloid leukemia using a custom gene expression array. Genet Test Mol Biomarkers 2013; 17:395-400. [PMID: 23485358 DOI: 10.1089/gtmb.2012.0437] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 11/12/2022] Open
Abstract
Double (bi-allelic) mutations in the gene encoding the CCAAT/enhancer-binding protein-alpha (CEBPA) transcription factor have a favorable prognostic impact in acute myeloid leukemia (AML). Double mutations in CEBPA can be detected using various techniques, but it is a notoriously difficult gene to sequence due to its high GC-content. Here we developed a two-step gene expression classifier for accurate and standardized detection of CEBPA double mutations. The key feature of the two-step classifier is that it explicitly removes cases with low CEBPA expression, thereby excluding CEBPA hypermethylated cases whose gene expression profiles resemble those of CEBPA double mutants and would otherwise result in false-positive predictions. In the second step, we have developed a 55-gene signature to identify the true CEBPA double-mutation cases. This two-step classifier was tested on a cohort of 505 unselected AML cases, including 26 CEBPA double mutants, 12 CEBPA single mutants, and seven CEBPA promoter hypermethylated cases, on which its performance was estimated by a double-loop cross-validation protocol. The two-step classifier achieves a sensitivity of 96.2% (95% confidence interval [CI] 81.1 to 99.3) and specificity of 100.0% (95% CI 99.2 to 100.0). There are no false-positive detections. This two-step CEBPA double-mutation classifier has been incorporated on a microarray platform that can simultaneously detect other relevant molecular biomarkers, which allows for a standardized comprehensive diagnostic assay. In conclusion, gene expression profiling provides a reliable method for CEBPA double-mutation detection in patients with AML for clinical use.
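The sensitivity and specificity figures in this abstract can be checked with a short Python sketch. It rests on two assumptions that the abstract does not state: that 25 of the 26 double mutants were detected (which gives 96.2%) with no positive calls among the remaining 479 cases (505 minus 26), and that the intervals are Wilson score intervals. The helper name `wilson_interval` is illustrative only.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z ~ 1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot at p = 0 or p = 1.
    return max(0.0, centre - half), min(1.0, centre + half)


# Assumed confusion-matrix counts (inferred from the percentages, not given in the abstract).
tp, fn = 25, 1    # CEBPA double mutants: detected / missed
tn, fp = 479, 0   # all other cases: correctly negative / false positive

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
sens_ci = wilson_interval(tp, tp + fn)
spec_ci = wilson_interval(tn, tn + fp)

print(f"sensitivity {sensitivity:.1%}, 95% CI {sens_ci[0]:.1%} to {sens_ci[1]:.1%}")
print(f"specificity {specificity:.1%}, 95% CI {spec_ci[0]:.1%} to {spec_ci[1]:.1%}")
```

Under these assumed counts the Wilson interval is consistent with the reported bounds (81.1 to 99.3 and 99.2 to 100.0), although the abstract does not say which interval method was used.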
|
43
|
Cranford SW, de Boer J, van Blitterswijk C, Buehler MJ. Materiomics: an -omics approach to biomaterials research. Adv Mater 2013; 25:802-24. [PMID: 23297023 DOI: 10.1002/adma.201202553] [Citation(s) in RCA: 83] [Impact Index Per Article: 7.5] [Received: 06/23/2012] [Revised: 10/13/2012] [Indexed: 05/20/2023]
Abstract
The past fifty years have seen a surge in the use of materials for clinical application, but in order to understand and exploit their full potential, the scientific complexity at both sides of the interface--the material on the one hand and the living organism on the other hand--needs to be considered. Technologies such as combinatorial chemistry, recombinant DNA as well as computational multi-scale methods can generate libraries with a very large number of material properties whereas on the other side, the body will respond to them depending on the biological context. Typically, biological systems are investigated using both holistic and reductionist approaches such as whole genome expression profiling, systems biology and high throughput genetic or compound screening, as already seen, for example, in pharmacology and genetics. The field of biomaterials research is only beginning to develop and adopt these approaches, an effort which we refer to as "materiomics". In this review, we describe the current status of the field, and its past and future impact on the biomedical sciences. We outline how materiomics sets the stage for a transformative change in the approach to biomaterials research to enable the design of tailored and functional materials for a variety of properties in fields as diverse as tissue engineering, disease diagnosis and de novo materials design, by combining powerful computational modelling and screening with advanced experimental techniques.
Affiliation(s)
- Steven W Cranford
- Laboratory for Atomistic and Molecular Mechanics, Department of Civil and Environmental Engineering, Center for Materials Science and Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
|
44
|
Urquidi V, Goodison S, Cai Y, Sun Y, Rosser CJ. A candidate molecular biomarker panel for the detection of bladder cancer. Cancer Epidemiol Biomarkers Prev 2012; 21:2149-58. [PMID: 23097579 DOI: 10.1158/1055-9965.epi-12-0428] [Citation(s) in RCA: 67] [Impact Index Per Article: 5.6] [Indexed: 11/16/2022] Open
Abstract
BACKGROUND Bladder cancer is among the five most common malignancies worldwide, and due to high rates of recurrence, one of the most prevalent. Improvements in noninvasive urine-based assays to detect bladder cancer would benefit both patients and health care systems. In this study, the goal was to identify urothelial cell transcriptomic signatures associated with bladder cancer. METHODS Gene expression profiling (Affymetrix U133 Plus 2.0 arrays) was applied to exfoliated urothelia obtained from a cohort of 92 subjects with known bladder disease status. Computational analyses identified candidate biomarkers of bladder cancer and an optimal predictive model was derived. Selected targets from the profiling analyses were monitored in an independent cohort of 81 subjects using quantitative real-time PCR (RT-PCR). RESULTS Transcriptome profiling data analysis identified 52 genes associated with bladder cancer (P ≤ 0.001) and gene models that optimally predicted class label were derived. RT-PCR analysis of 48 selected targets in an independent cohort identified a 14-gene diagnostic signature that predicted the presence of bladder cancer with high accuracy. CONCLUSIONS Exfoliated urothelia sampling provides a robust analyte for the evaluation of patients with suspected bladder cancer. The refinement and validation of the multigene urothelial cell signatures identified in this preliminary study may lead to accurate, noninvasive assays for the detection of bladder cancer. IMPACT The development of an accurate, noninvasive bladder cancer detection assay would benefit both the patient and health care systems through better detection, monitoring, and control of disease.
Affiliation(s)
- Virginia Urquidi
- Cancer Research Institute, M.D. Anderson Cancer Center Orlando, Orlando, FL, USA
|
45
|
Leung YY, Chang CQ, Hung YS. An integrated approach for identifying wrongly labelled samples when performing classification in microarray data. PLoS One 2012; 7:e46700. [PMID: 23082127 PMCID: PMC3474777 DOI: 10.1371/journal.pone.0046700] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 06/07/2012] [Accepted: 09/03/2012] [Indexed: 01/05/2023] Open
Abstract
Background Using a hybrid approach for gene selection and classification is common, as the results obtained are generally better than performing the two tasks independently. Yet, for some microarray datasets, both classification accuracy and stability of the gene sets obtained still have room for improvement. This may be due to the presence of samples with wrong class labels (i.e. outliers). Outlier detection algorithms proposed so far are either not suitable for microarray data, or only solve the outlier detection problem on their own. Results We tackle the outlier detection problem based on a previously proposed Multiple-Filter-Multiple-Wrapper (MFMW) model, which was demonstrated to yield promising results when compared to other hybrid approaches (Leung and Hung, 2010). To incorporate outlier detection and overcome limitations of the existing MFMW model, three new features are introduced in our proposed MFMW-outlier approach: 1) an unbiased external Leave-One-Out Cross-Validation framework is developed to replace internal cross-validation in the previous MFMW model; 2) wrongly labeled samples are identified within the MFMW-outlier model; and 3) a stable set of genes is selected using an L1-norm SVM that removes any redundant genes present. Six binary-class microarray datasets were tested. Compared with outlier detection studies on the same datasets, MFMW-outlier could detect all the outliers found in the original paper (for which the data was provided for analysis), and the genes selected after outlier removal were proven to have biological relevance. We also compared MFMW-outlier with PRAPIV (Zhang et al., 2006) on the same synthetic datasets. MFMW-outlier gave better average precision and recall values on three different settings. Lastly, artificially flipped microarray datasets were created by removing our detected outliers and flipping some of the remaining samples' labels. Almost all the ‘wrong’ (artificially flipped) samples were detected, suggesting that MFMW-outlier was sufficiently powerful to detect outliers in high-dimensional microarray datasets.
Affiliation(s)
- Yuk Yee Leung
- Department of Electrical and Electronic Engineering, The University of Hong Kong, Hong Kong Special Administrative Region, China.
|
46
|
Glaab E, Bacardit J, Garibaldi JM, Krasnogor N. Using rule-based machine learning for candidate disease gene prioritization and sample classification of cancer gene expression data. PLoS One 2012; 7:e39932. [PMID: 22808075 PMCID: PMC3394775 DOI: 10.1371/journal.pone.0039932] [Citation(s) in RCA: 82] [Impact Index Per Article: 6.8] [Received: 01/29/2012] [Accepted: 05/29/2012] [Indexed: 12/19/2022] Open
Abstract
Microarray data analysis has been shown to provide an effective tool for studying cancer and genetic diseases. Although classical machine learning techniques have successfully been applied to find informative genes and to predict class labels for new samples, common restrictions of microarray analysis such as small sample sizes, a large attribute space and high noise levels still limit its scientific and clinical applications. Increasing the interpretability of prediction models while retaining a high accuracy would help to exploit the information content in microarray data more effectively. For this purpose, we evaluate our rule-based evolutionary machine learning systems, BioHEL and GAssist, on three public microarray cancer datasets, obtaining simple rule-based models for sample classification. A comparison with other benchmark microarray sample classifiers based on three diverse feature selection algorithms suggests that these evolutionary learning techniques can compete with state-of-the-art methods like support vector machines. The obtained models reach accuracies above 90% in two-level external cross-validation, with the added value of facilitating interpretation by using only combinations of simple if-then-else rules. As a further benefit, a literature mining analysis reveals that prioritizations of informative genes extracted from BioHEL's classification rule sets can outperform gene rankings obtained from a conventional ensemble feature selection in terms of the pointwise mutual information between relevant disease terms and the standardized names of top-ranked genes.
Affiliation(s)
- Enrico Glaab
- Interdisciplinary Computing and Complex Systems (ICOS) Research Group, University of Nottingham, Nottingham, United Kingdom
- Jaume Bacardit
- Interdisciplinary Computing and Complex Systems (ICOS) Research Group, University of Nottingham, Nottingham, United Kingdom
- Jonathan M. Garibaldi
- Intelligent Modeling and Analysis (IMA) Research Group, University of Nottingham, Nottingham, United Kingdom
- Natalio Krasnogor
- Interdisciplinary Computing and Complex Systems (ICOS) Research Group, University of Nottingham, Nottingham, United Kingdom
|
47
|
van Vliet MH, Horlings HM, van de Vijver MJ, Reinders MJT, Wessels LFA. Integration of clinical and gene expression data has a synergetic effect on predicting breast cancer outcome. PLoS One 2012; 7:e40358. [PMID: 22808140 PMCID: PMC3394805 DOI: 10.1371/journal.pone.0040358] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.7] [Received: 01/08/2012] [Accepted: 06/06/2012] [Indexed: 12/12/2022] Open
Abstract
Breast cancer outcome can be predicted using models derived from gene expression data or clinical data. Only a few studies have created a single prediction model using both gene expression and clinical data. These studies often remain inconclusive regarding an obtained improvement in prediction performance. We rigorously compare three different integration strategies (early, intermediate, and late integration) as well as classifiers employing no integration (only one data type) using five classifiers of varying complexity. We perform our analysis on a set of 295 breast cancer samples, for which gene expression data and an extensive set of clinical parameters are available, as well as four breast cancer datasets containing 521 samples that we used as independent validation. On the 295 samples, a nearest mean classifier employing a logical OR operation (late integration) on clinical and expression classifiers significantly outperforms all other classifiers. Moreover, regardless of the integration strategy, the nearest mean classifier achieves the best performance. All five classifiers achieve their best performance when integrating clinical and expression data. Repeating the experiments using the 521 samples from the four independent validation datasets also indicated a significant performance improvement when integrating clinical and gene expression data. Whether integration also improves performance on other datasets (e.g. other tumor types) has not been investigated, but seems worth pursuing. Our work suggests that future models for predicting breast cancer outcome should exploit both data types by employing a late OR or intermediate integration strategy based on nearest mean classifiers.
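The late OR integration named in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy feature values, the 0/1 label coding (1 taken to mean poor outcome), the Euclidean distance, and the function names `fit_nearest_mean`, `predict_nearest_mean` and `late_or` are all assumptions made for illustration.

```python
from math import dist
from statistics import fmean


def fit_nearest_mean(X, y):
    """Return {label: centroid} computed from training rows X with labels y."""
    centroids = {}
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [fmean(col) for col in zip(*rows)]
    return centroids


def predict_nearest_mean(centroids, x):
    """Assign x to the class whose centroid is closest (Euclidean distance)."""
    return min(centroids, key=lambda label: dist(x, centroids[label]))


def late_or(clin_centroids, expr_centroids, x_clin, x_expr):
    """Late integration: predict 1 if either single-source classifier predicts 1."""
    clin = predict_nearest_mean(clin_centroids, x_clin)
    expr = predict_nearest_mean(expr_centroids, x_expr)
    return 1 if 1 in (clin, expr) else 0


# Hypothetical toy training data: one clinical feature and two expression features.
y_train = [0, 0, 1, 1]
clin_train = [[0.0], [0.2], [1.0], [1.2]]
expr_train = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.2, 0.9]]

clin_model = fit_nearest_mean(clin_train, y_train)
expr_model = fit_nearest_mean(expr_train, y_train)

# Three hypothetical test samples, each described by both data types.
clin_test = [[0.1], [1.0], [0.0]]
expr_test = [[0.0, 0.0], [0.1, 0.1], [1.0, 1.0]]
combined = [late_or(clin_model, expr_model, c, e) for c, e in zip(clin_test, expr_test)]
print(combined)
```

Each data type gets its own classifier and only the final decisions are merged, which is what distinguishes late integration from early integration, where the features would be concatenated before a single classifier is trained.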
Affiliation(s)
- Martin H van Vliet
- Delft Bioinformatics Laboratory, Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, Mekelweg, Delft, The Netherlands.
|
48
|
Staiger C, Cadot S, Kooter R, Dittrich M, Müller T, Klau GW, Wessels LFA. A critical evaluation of network and pathway-based classifiers for outcome prediction in breast cancer. PLoS One 2012; 7:e34796. [PMID: 22558100 PMCID: PMC3338754 DOI: 10.1371/journal.pone.0034796] [Citation(s) in RCA: 52] [Impact Index Per Article: 4.3] [Received: 10/14/2011] [Accepted: 03/09/2012] [Indexed: 12/19/2022] Open
Abstract
Recently, several classifiers that combine primary tumor data, like gene expression data, and secondary data sources, such as protein-protein interaction networks, have been proposed for predicting outcome in breast cancer. In these approaches, new composite features are typically constructed by aggregating the expression levels of several genes. The secondary data sources are employed to guide this aggregation. Although many studies claim that these approaches improve classification performance over single genes classifiers, the gain in performance is difficult to assess. This stems mainly from the fact that different breast cancer data sets and validation procedures are employed to assess the performance. Here we address these issues by employing a large cohort of six breast cancer data sets as benchmark set and by performing an unbiased evaluation of the classification accuracies of the different approaches. Contrary to previous claims, we find that composite feature classifiers do not outperform simple single genes classifiers. We investigate the effect of (1) the number of selected features; (2) the specific gene set from which features are selected; (3) the size of the training set and (4) the heterogeneity of the data set on the performance of composite feature and single genes classifiers. Strikingly, we find that randomization of secondary data sources, which destroys all biological information in these sources, does not result in a deterioration in performance of composite feature classifiers. Finally, we show that when a proper correction for gene set size is performed, the stability of single genes sets is similar to the stability of composite feature sets. Based on these results there is currently no reason to prefer prognostic classifiers based on composite features over single genes classifiers for predicting outcome in breast cancer.
Affiliation(s)
- Christine Staiger
- Centrum Wiskunde & Informatica, Life Sciences Group, The Netherlands
- Bioinformatics and Statistics, The Netherlands Cancer Institute, Amsterdam, The Netherlands
- * E-mail: (CS); (GWK); (LFAW)
- Sidney Cadot
- Bioinformatics and Statistics, The Netherlands Cancer Institute, Amsterdam, The Netherlands
- Raul Kooter
- Delft Bioinformatics Lab, Faculty of Electrical Engineering, Mathematics and Computer Science, Delft, The Netherlands
- Marcus Dittrich
- Department of Bioinformatics, Biocenter, University of Würzburg, Würzburg, Germany
- Tobias Müller
- Department of Bioinformatics, Biocenter, University of Würzburg, Würzburg, Germany
- Gunnar W. Klau
- Centrum Wiskunde & Informatica, Life Sciences Group, The Netherlands
- Netherlands Institute for Systems Biology, Amsterdam, The Netherlands
- * E-mail: (CS); (GWK); (LFAW)
- Lodewyk F. A. Wessels
- Bioinformatics and Statistics, The Netherlands Cancer Institute, Amsterdam, The Netherlands
- Delft Bioinformatics Lab, Faculty of Electrical Engineering, Mathematics and Computer Science, Delft, The Netherlands
- Cancer Systems Biology Center, The Netherlands Cancer Institute, Amsterdam, The Netherlands
- * E-mail: (CS); (GWK); (LFAW)
|
49
|
van Iterson M, van Haagen HHHBM, Goeman JJ. Resolving confusion of tongues in statistics and machine learning: A primer for biologists and bioinformaticians. Proteomics 2012; 12:543-9. [DOI: 10.1002/pmic.201100395] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 08/02/2011] [Revised: 11/09/2011] [Accepted: 11/14/2011] [Indexed: 11/06/2022]
|
50
|
Robust two-gene classifiers for cancer prediction. Genomics 2011; 99:90-5. [PMID: 22138042 DOI: 10.1016/j.ygeno.2011.11.003] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Received: 09/06/2011] [Revised: 11/04/2011] [Accepted: 11/09/2011] [Indexed: 11/23/2022]
Abstract
Two-gene classifiers have attracted broad interest for their simplicity and practicality. Most existing two-gene classification algorithms rely on exhaustive search, which makes them slow. In this study, we proposed two new two-gene classification algorithms that use a simple univariate gene selection strategy and construct simple classification rules based on optimal cut-points for the two selected genes. We detected the optimal cut-point with the information entropy principle. We applied the two-gene classification models to eleven cancer gene expression datasets and compared their classification performance to that of some established two-gene classification models like the top-scoring pairs model and the greedy pairs model, as well as standard methods including Diagonal Linear Discriminant Analysis, k-Nearest Neighbor, Support Vector Machine and Random Forest. These comparisons indicated that the performance of our two-gene classifiers was comparable to or better than that of the compared models.
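To make the entropy-based cut-point idea concrete, here is a minimal Python sketch for a single gene. It assumes the cut-point is the threshold that minimises the size-weighted class entropy of the two resulting groups, which is one common reading of the information entropy principle; the abstract does not give the exact criterion or say how the cut-points of the two genes are combined into a rule. The function names and the toy expression values are hypothetical.

```python
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n) for c in set(labels))


def best_cut_point(values, labels):
    """Return (cut, weighted_entropy) for the threshold that best separates the classes.

    Candidate cuts are midpoints between consecutive distinct sorted values.
    """
    pairs = sorted(zip(values, labels))
    best_cut, best_h = None, float("inf")
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate identical values
        cut = (lo + hi) / 2
        left = [lab for v, lab in pairs if v <= cut]
        right = [lab for v, lab in pairs if v > cut]
        h = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if h < best_h:
            best_cut, best_h = cut, h
    return best_cut, best_h


# Hypothetical expression values for one gene in six samples (0 = normal, 1 = tumour).
gene_expr = [1.0, 1.5, 2.0, 4.0, 4.5, 5.0]
classes = [0, 0, 0, 1, 1, 1]
cut, h = best_cut_point(gene_expr, classes)
print(cut, h)
```

Scanning the sorted values of one gene costs far less than an exhaustive search over gene pairs, which is the efficiency argument the abstract makes for univariate selection.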
|