1
Ali SH, Shehata M. A New Breast Cancer Discovery Strategy: A Combined Outlier Rejection Technique and an Ensemble Classification Method. Bioengineering (Basel) 2024; 11:1148. [PMID: 39593808 PMCID: PMC11591806 DOI: 10.3390/bioengineering11111148] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2024] [Revised: 11/08/2024] [Accepted: 11/12/2024] [Indexed: 11/28/2024]
Abstract
Annually, many people worldwide lose their lives due to breast cancer, making it one of the most prevalent cancers in the world. Since the disease is becoming more common, early detection of breast cancer is essential to avoiding serious complications and possibly death. This research provides a novel Breast Cancer Discovery (BCD) strategy to aid patients by providing prompt and sensitive detection of breast cancer. The two primary steps that form the BCD are the Pre-processing Step (P2S) and the Breast Cancer Discovery Step (BCDS). In the P2S, the needed data are separated from non-informative data using three primary operations: data normalization, feature selection, and outlier rejection. Only then is the diagnostic model in the BCDS trained for precise diagnosis. The primary contribution of this research is the novel outlier rejection technique known as the Combined Outlier Rejection Technique (CORT). CORT is divided into two primary phases: (i) the Quick Rejection Phase (QRP), which is a quick phase utilizing a statistical method, and (ii) the Accurate Rejection Phase (ARP), which is a precise phase using an optimization method. Outliers are rapidly eliminated during the QRP using the standard deviation, and the remaining outliers are thoroughly eliminated during the ARP via Binary Harris Hawk Optimization (BHHO). Within the P2S, data normalization rescales the numeric values in the dataset so that they fall into a predetermined range. Information Gain (IG) is then used to choose the optimal subset of features, and CORT is used to reject incorrect training data. Furthermore, based on the filtered data from the P2S, an Ensemble Classification Method (ECM) is utilized in the BCDS to identify breast cancer patients. This method consists of three classifiers: Naïve Bayes (NB), K-Nearest Neighbors (KNN), and Support Vector Machine (SVM).
The Wisconsin Breast Cancer Database (WBCD) dataset, which contains digital images of fine-needle aspiration samples collected from patients' breast masses, is used herein to compare the BCD strategy against several contemporary strategies. According to the outcomes of the experiment, the suggested method is very competitive. It achieves 0.987 accuracy, 0.013 error, 0.98 recall, 0.984 precision, and a run time of 3 s, outperforming all other methods from the literature.
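The pre-processing ideas named in this abstract (rescaling to a fixed range, a quick standard-deviation filter, and combining three classifiers) can be sketched as follows. This is only an illustration under stated assumptions: the function names, the cutoff `k`, and the use of a simple majority vote are hypothetical choices, since the abstract does not give the exact QRP rule or how the ECM combines its classifiers, and the BHHO phase is not shown.

```python
from collections import Counter
from statistics import mean, pstdev


def min_max_normalize(column):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


def quick_reject(rows, k=3.0):
    """Keep rows whose every feature lies within k standard deviations
    of that feature's mean (k is an assumed cutoff, not from the paper)."""
    columns = list(zip(*rows))
    stats = [(mean(c), pstdev(c)) for c in columns]
    kept = []
    for row in rows:
        if all(sd == 0 or abs(x - mu) <= k * sd for x, (mu, sd) in zip(row, stats)):
            kept.append(row)
    return kept


def majority_vote(labels):
    """Combine the labels predicted by several classifiers for one sample
    (assumed combination rule; ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0]
```

For example, `quick_reject([[1.0]] * 10 + [[50.0]], k=2.0)` drops only the row containing 50.0, and `majority_vote(["M", "B", "M"])` returns `"M"`.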
Affiliation(s)
- Shereen H. Ali
  - Communications & Electronics Engineering Department, Delta Higher Institute for Engineering & Technology, Mansoura 35511, Egypt
- Mohamed Shehata
  - Department of Bioengineering, Speed School of Engineering, University of Louisville, Louisville, KY 40292, USA
2
Manes NP, Song J, Nita-Lazar A. EnsMOD: A Software Program for Omics Sample Outlier Detection. J Comput Biol 2023; 30:726-735. [PMID: 37042708 PMCID: PMC10282819 DOI: 10.1089/cmb.2022.0243] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/13/2023]
Abstract
Detection of omics sample outliers is important for preventing erroneous biological conclusions, developing robust experimental protocols, and discovering rare biological states. Two recent publications describe robust algorithms for detecting transcriptomic sample outliers, but neither algorithm had been incorporated into a software tool for scientists. Here we describe Ensemble Methods for Outlier Detection (EnsMOD) which incorporates both algorithms. EnsMOD calculates how closely the quantitation variation follows a normal distribution, plots the density curves of each sample to visualize anomalies, performs hierarchical cluster analyses to calculate how closely the samples cluster with each other, and performs robust principal component analyses to statistically test if any sample is an outlier. The probabilistic threshold parameters can be easily adjusted to tighten or loosen the outlier detection stringency. EnsMOD can be used to analyze any omics dataset with normally distributed variance. Here it was used to analyze a simulated proteomics dataset, a multiomic (proteome and transcriptome) dataset, a single-cell proteomics dataset, and a phosphoproteomics dataset. EnsMOD successfully identified all of the simulated outliers, and subsequent removal of a detected outlier improved data quality for downstream statistical analyses.
Affiliation(s)
- Nathan P. Manes
  - Laboratory of Immune System Biology, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, Maryland, USA
- Jian Song
  - Laboratory of Immune System Biology, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, Maryland, USA
- Aleksandra Nita-Lazar
  - Laboratory of Immune System Biology, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, Maryland, USA
3
Sialyl Lewis X/A and Cytokeratin Crosstalk in Triple Negative Breast Cancer. Cancers (Basel) 2023; 15:731. [PMID: 36765690 PMCID: PMC9913872 DOI: 10.3390/cancers15030731] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 11/24/2022] [Revised: 12/31/2022] [Accepted: 01/23/2023] [Indexed: 01/27/2023]
Abstract
Triple-negative breast cancer (TNBC) encompasses multiple entities and is generally highly aggressive and metastatic. We aimed to determine the clinical and biological relevance of Sialyl-Lewis X and A (sLeX/A), a fucosylated glycan involved in metastasis, in TNBC. Here, we studied tissues from 50 TNBC patients, transcripts from a TNBC dataset from The Cancer Genome Atlas (TCGA) database, and a primary breast cancer cell line. All 50 TNBC tissue samples analysed expressed sLeX/A. Patients with high expression of sLeX/A had 3 years less disease-free survival than patients with lower expression. In tissue, sLeX/A negatively correlated with cytokeratins 5/6 (CK5/6), which was corroborated by the inverse correlation between fucosyltransferases and CK5/6 genes. Our observations were confirmed in vitro when inhibition of sLeX/A remarkably increased expression of CK5/6, followed by a decreased proliferation and invasion capacity. Among the reported glycoproteins bearing sLeX/A and based on the STRING tool, α6 integrin showed the highest interaction score with CK5/6. This is the first report on sLeX/A expression in TNBC, highlighting its association with lower disease-free survival and its inverse crosstalk with CK5/6, with α6 integrin as a mediator. All in all, sLeX/A is critical for TNBC malignancy and is a potential prognostic biomarker and therapeutic target.
4
Ren M, Zhang S, Ma S, Zhang Q. Gene-environment interaction identification via penalized robust divergence. Biom J 2022; 64:461-480. [PMID: 34725857 PMCID: PMC9386692 DOI: 10.1002/bimj.202000157] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/24/2020] [Revised: 06/01/2021] [Accepted: 08/23/2021] [Indexed: 12/11/2022]
Abstract
In high-throughput cancer studies, gene-environment interactions associated with outcomes have important implications. Some commonly adopted identification methods do not respect the "main effect, interaction" hierarchical structure. In addition, they can be challenged by data contamination and/or long-tailed distributions, which are not uncommon. In this article, robust methods based on $\gamma$-divergence and density power divergence are proposed to accommodate contaminated data/long-tailed distributions. A hierarchical sparse group penalty is adopted for regularized estimation and selection and can identify important gene-environment interactions and respect the "main effect, interaction" hierarchical structure. The proposed methods are implemented using an effective group coordinate descent algorithm. Simulation shows that when contamination occurs, the proposed methods can significantly outperform the existing alternatives with more accurate identification. The proposed approach is applied to the analysis of The Cancer Genome Atlas (TCGA) triple-negative breast cancer data and Gene Environment Association Studies (GENEVA) Type 2 Diabetes data.
Affiliation(s)
- Mingyang Ren
  - School of Mathematics Sciences, University of Chinese Academy of Sciences, Beijing, P. R. China
  - Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing, P. R. China
- Sanguo Zhang
  - School of Mathematics Sciences, University of Chinese Academy of Sciences, Beijing, P. R. China
  - Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing, P. R. China
- Shuangge Ma
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
- Qingzhao Zhang
  - Department of Statistics and Data Science, School of Economics, Wang Yanan Institute for Studies in Economics, Fujian Key Lab of Statistics, Xiamen University, Fujian, P. R. China
5
Jensch A, Lopes MB, Vinga S, Radde N. ROSIE: RObust Sparse ensemble for outlIEr detection and gene selection in cancer omics data. Stat Methods Med Res 2022; 31:947-958. [PMID: 35072570 PMCID: PMC9014683 DOI: 10.1177/09622802211072456] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
Abstract
The extraction of novel information from omics data is a challenging task, in particular since the number of features (e.g. genes) often far exceeds the number of samples. In such a setting, conventional parameter estimation leads to ill-posed optimization problems, and regularization may be required. In addition, outliers can largely impact classification accuracy. Here we introduce ROSIE, an ensemble classification approach, which combines three sparse and robust classification methods for outlier detection and feature selection and further performs a bootstrap-based validity check. Outliers of ROSIE are determined by the rank product test using outlier rankings of all three methods, and important features are selected as features commonly selected by all methods. We apply ROSIE to RNA-Seq data from The Cancer Genome Atlas (TCGA) to classify observations into Triple-Negative Breast Cancer (TNBC) and non-TNBC tissue samples. The pre-processed dataset consists of 16,600 genes and more than 1,000 samples. We demonstrate that ROSIE selects important features and outliers in a robust way. Identified outliers are concordant with the distribution of the commonly selected genes by the three methods, and results are in line with other independent studies. Furthermore, we discuss the association of some of the selected genes with the TNBC subtype in other investigations. In summary, ROSIE constitutes a robust and sparse procedure to identify outliers and important genes through binary classification. Our approach is ad hoc applicable to other datasets, fulfilling the overall goal of simultaneously identifying outliers and candidate disease biomarkers to be targeted in therapy research and personalized medicine frameworks.
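The rank product statistic mentioned in this abstract can be illustrated with a small sketch. It only computes the statistic (the geometric mean of each sample's ranks across methods); the significance test that ROSIE applies on top of it is not reproduced here, and the function name and the input layout are assumptions made for illustration.

```python
from math import prod


def rank_product(rankings):
    """Geometric mean of each sample's outlier ranks across methods.

    `rankings` is a list with one dict per method, mapping a sample id
    to its rank (1 = most outlying). Smaller scores mean the methods
    agree that the sample is an outlier.
    """
    n_methods = len(rankings)
    samples = rankings[0].keys()
    return {s: prod(r[s] for r in rankings) ** (1.0 / n_methods) for s in samples}
```

With three hypothetical rankings in which sample `"a"` is ranked 1, 1 and 2, its score is the cube root of 2, lower than that of any sample ranked worse by all methods.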
Affiliation(s)
- Antje Jensch
  - Institute for Systems Theory and Automatic Control, University of Stuttgart, Germany
- Marta B Lopes
  - Center for Mathematics and Applications (CMA), NOVA School of Science and Technology, Caparica, Portugal
  - NOVA Laboratory for Computer Science and Informatics (NOVA LINCS), NOVA School of Science and Technology, Caparica, Portugal
- Susana Vinga
  - INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Portugal
  - IDMEC, Instituto Superior Técnico, Universidade de Lisboa, Portugal
- Nicole Radde
  - Institute for Systems Theory and Automatic Control, University of Stuttgart, Germany
6
An Efficient Algorithm for the Detection of Outliers in Mislabeled Omics Data. Computational and Mathematical Methods in Medicine 2022; 2021:9436582. [PMID: 34976114 PMCID: PMC8716222 DOI: 10.1155/2021/9436582] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/13/2021] [Accepted: 11/30/2021] [Indexed: 11/18/2022]
Abstract
High dimensionality and noise have made it difficult to detect related biomarkers in omics data. Previous studies have shown that penalized maximum trimmed likelihood estimation is effective in identifying mislabeled samples in high-dimensional data with labeling errors. However, the algorithm commonly used in these studies is the concentration step (C-step), and the C-step algorithm applied to robust penalized regression does not ensure that the criterion function improves at every iteration, because the regularization parameters change during the iteration. This makes the C-step algorithm run very slowly, especially when dealing with high-dimensional omics data. To address this, the AR-Cstep (C-step combined with an acceptance-rejection scheme) algorithm is proposed. In simulation experiments, the AR-Cstep algorithm converged faster (the average computation time was only 2% of that of the C-step algorithm) and was more accurate in terms of variable selection and outlier identification than the C-step algorithm. The two algorithms were further compared on triple negative breast cancer (TNBC) RNA-seq data. AR-Cstep can solve the problem of the C-step not converging and ensures that the iterative process moves in the direction that improves the criterion function. As an improvement of the C-step algorithm, the AR-Cstep algorithm can be extended to other robust models with regularized parameters.
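To make the idea of a concentration step concrete, here is a minimal sketch on a deliberately simple problem: unpenalized least trimmed squares for a straight line. It is not the AR-Cstep algorithm, whose acceptance-rejection scheme is not specified in the abstract; the guard that accepts a step only when the trimmed loss decreases is an illustrative stand-in, and all function names are hypothetical.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


def squared_residual(point, model):
    x, y = point
    slope, intercept = model
    return (y - (slope * x + intercept)) ** 2


def trimmed_loss(points, model, h):
    """Sum of the h smallest squared residuals (the trimmed criterion)."""
    return sum(sorted(squared_residual(p, model) for p in points)[:h])


def guarded_c_steps(points, h, max_iter=50):
    """Repeat C-steps, accepting a step only if the trimmed loss decreases."""
    model = fit_line(points[:h])  # naive starting subset: the first h points
    loss = trimmed_loss(points, model, h)
    for _ in range(max_iter):
        # C-step: refit on the h points that the current model fits best.
        closest = sorted(points, key=lambda p: squared_residual(p, model))[:h]
        candidate = fit_line(closest)
        candidate_loss = trimmed_loss(points, candidate, h)
        if candidate_loss >= loss:  # reject a step that does not improve
            break
        model, loss = candidate, candidate_loss
    return model, loss
```

The returned loss can never exceed the loss of the starting subset, which is the monotonicity property the abstract says a plain C-step loses once regularization parameters change between iterations.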
7
Ko J, Jeong J, Son S, Lee J. Cellular and biomolecular detection based on suspended microchannel resonators. Biomed Eng Lett 2021; 11:367-382. [PMID: 34616583 DOI: 10.1007/s13534-021-00207-7] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 06/29/2021] [Revised: 08/23/2021] [Accepted: 09/03/2021] [Indexed: 12/31/2022]
Abstract
Suspended microchannel resonators (SMRs) have been developed to measure the buoyant mass of single micro-/nanoparticles and cells suspended in a liquid. They have significantly improved the mass resolution with the aid of vacuum packaging and also increased measurement throughput by fast resonance frequency tracking while target objects travel through the microchannel without stopping or even slowing down. Since their invention, various biological applications have been enabled, including simultaneous measurements of cell growth and cell cycle progression, and measurements of disease associated physicochemical change, to name a few. Extension and advancement towards other promising applications with SMRs are continuously ongoing by adding multiple functionalities or incorporating other complementary analytical metrologies. In this paper, we will thoroughly review the development history, basic and advanced operations, and key applications of SMRs to introduce them to researchers working in biological and biomedical sciences who mostly rely on classical and conventional methodologies. We will also provide future perspectives and projections for SMR technologies.
Affiliation(s)
- Juhee Ko
  - Department of Mechanical Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daehak-ro 291, Daejeon, South Korea
- Jaewoo Jeong
  - Department of Mechanical Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daehak-ro 291, Daejeon, South Korea
- Sukbom Son
  - Department of Mechanical Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daehak-ro 291, Daejeon, South Korea
- Jungchul Lee
  - Department of Mechanical Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daehak-ro 291, Daejeon, South Korea
8
Erlic Z, Reel P, Reel S, Amar L, Pecori A, Larsen CK, Tetti M, Pamporaki C, Prehn C, Adamski J, Prejbisz A, Ceccato F, Scaroni C, Kroiss M, Dennedy MC, Deinum J, Langton K, Mulatero P, Reincke M, Lenzini L, Gimenez-Roqueplo AP, Assié G, Blanchard A, Zennaro MC, Jefferson E, Beuschlein F. Targeted Metabolomics as a Tool in Discriminating Endocrine From Primary Hypertension. J Clin Endocrinol Metab 2021; 106:1111-1128. [PMID: 33382876 PMCID: PMC7993566 DOI: 10.1210/clinem/dgaa954] [Citation(s) in RCA: 26] [Impact Index Per Article: 6.5] [Received: 10/02/2020] [Indexed: 12/11/2022]
Abstract
CONTEXT: Identification of patients with endocrine forms of hypertension (EHT) (primary hyperaldosteronism [PA], pheochromocytoma/paraganglioma [PPGL], and Cushing syndrome [CS]) provides the basis to implement individualized therapeutic strategies. Targeted metabolomics (TM) have revealed promising results in profiling cardiovascular diseases and endocrine conditions associated with hypertension.
OBJECTIVE: Use TM to identify distinct metabolic patterns between primary hypertension (PHT) and EHT and test its discriminating ability.
METHODS: Retrospective analyses of PHT and EHT patients from a European multicenter study (ENSAT-HT). TM was performed on stored blood samples using liquid chromatography mass spectrometry. To identify discriminating metabolites, a "classical approach" (CA) (performing a series of univariate and multivariate analyses) and a "machine learning approach" (MLA) (using random forest) were used. The study included 282 adult patients (52% female; mean age 49 years) with proven PHT (n = 59) and EHT (n = 223 with 40 CS, 107 PA, and 76 PPGL), respectively.
RESULTS: From 155 metabolites eligible for statistical analyses, 31 were identified discriminating between PHT and EHT using the CA and 27 using the MLA, of which 16 metabolites (C9, C16, C16:1, C18:1, C18:2, arginine, aspartate, glutamate, ornithine, spermidine, lysoPCaC16:0, lysoPCaC20:4, lysoPCaC24:0, PCaeC42:0, SM C18:1, SM C20:2) were found by both approaches. The receiver operating characteristic curve built on the top 15 metabolites from the CA provided an area under the curve (AUC) of 0.86, which was similar to the performance of the 15 metabolites from MLA (AUC 0.83).
CONCLUSION: TM identifies a distinct metabolic pattern between PHT and EHT, providing promising discriminating performance.
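For readers unfamiliar with the AUC figures quoted in this abstract, the following sketch shows one standard way to compute an area under the ROC curve: as the fraction of positive/negative pairs in which the positive case receives the higher score, with ties counted as one half. The function name and the example scores are hypothetical and unrelated to the study's data.

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve from classifier scores.

    Equals the probability that a randomly chosen positive case scores
    higher than a randomly chosen negative case (ties count as 0.5).
    Both lists must be non-empty.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

An AUC of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation of the two groups.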
Affiliation(s)
- Zoran Erlic
  - Klinik für Endokrinologie, Diabetologie und Klinische Ernährung, UniversitätsSpital Zürich, Zurich, Switzerland
- Parminder Reel
  - Division of Population Health and Genomics, School of Medicine, University of Dundee, Dundee, UK
- Smarti Reel
  - Division of Population Health and Genomics, School of Medicine, University of Dundee, Dundee, UK
- Laurence Amar
  - Université de Paris, PARCC, INSERM, Paris, France
  - Assistance Publique-Hôpitaux de Paris, Hôpital Européen Georges Pompidou, Unité Hypertension artérielle, Paris, France
- Alessio Pecori
  - Division of Internal Medicine and Hypertension Unit, Department of Medical Sciences, University of Torino, Italy
- Martina Tetti
  - Division of Internal Medicine and Hypertension Unit, Department of Medical Sciences, University of Torino, Italy
- Christina Pamporaki
  - Institute of Clinical Chemistry and Laboratory Medicine, Universitätsklinikum Carl Gustav Carus, Dresden, Germany
- Cornelia Prehn
  - Research Unit Molecular Endocrinology and Metabolism, Genome Analysis Center, Helmholtz Zentrum München, German Research Center for Environmental Health, Neuherberg, Germany
- Jerzy Adamski
  - Research Unit Molecular Endocrinology and Metabolism, Genome Analysis Center, Helmholtz Zentrum München, German Research Center for Environmental Health, Neuherberg, Germany
  - Lehrstuhl für Experimentelle Genetik, Technische Universität München, Freising-Weihenstephan, Germany
  - Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, 8 Medical Drive, Singapore, Singapore
- Aleksander Prejbisz
  - Department of Hypertension, National Institute of Cardiology, Warsaw, Poland
- Filippo Ceccato
  - UOC Endocrinologia, Dipartimento di Medicina DIMED, Azienda Ospedaliera-Università di Padova, Padua, Italy
- Carla Scaroni
  - UOC Endocrinologia, Dipartimento di Medicina DIMED, Azienda Ospedaliera-Università di Padova, Padua, Italy
- Matthias Kroiss
  - Clinical Chemistry and Laboratory Medicine, Core Unit Clinical Mass Spectrometry, Universitätsklinikum Würzburg, Germany
  - Schwerpunkt Endokrinologie/Diabetologie, Medizinische Klinik und Poliklinik I, Universitätsklinikum Würzburg, Germany
  - Comprehensive Cancer Center Mainfranken, Universität Würzburg, Würzburg, Germany
  - Medizinische Klinik und Poliklinik IV, Klinikum der Universität München, LMU München, Munich, Germany
- Michael C Dennedy
  - The Discipline of Pharmacology and Therapeutics, School of Medicine, National University of Ireland 33 Galway, Ireland
- Jaap Deinum
  - Department of Medicine, Section of Vascular Medicine, Radboud University Medical Center, Nijmegen, The Netherlands
- Katharina Langton
  - Institute of Clinical Chemistry and Laboratory Medicine, Universitätsklinikum Carl Gustav Carus, Dresden, Germany
- Paolo Mulatero
  - Division of Internal Medicine and Hypertension Unit, Department of Medical Sciences, University of Torino, Italy
- Martin Reincke
  - Medizinische Klinik und Poliklinik IV, Klinikum der Universität München, LMU München, Munich, Germany
- Livia Lenzini
  - Clinica dell’Ipertensione Arteriosa, Department of Medicine-DIMED, University of Padua, Padua
- Anne-Paule Gimenez-Roqueplo
  - Université de Paris, PARCC, INSERM, Paris, France
  - Assistance Publique-Hôpitaux de Paris, Hôpital Européen Georges Pompidou, Service de Génétique, Paris, France
- Guillaume Assié
  - Université de Paris, Institut Cochin, INSERM, CNRS, PARIS, France
  - Department of Endocrinology, Center for Rare Adrenal Diseases, AP-HP, Hôpital Cochin, Paris, France
  - Department of Endocrinology, Center for Rare Adrenal Diseases, Assistance Publique–Hôpitaux de Paris, Hôpital Cochin, Paris, France
- Anne Blanchard
  - Assistance Publique-Hôpitaux de Paris, Hôpital Européen Georges Pompidou, Centre d’Investigations Cliniques 9201, Paris, France
- Maria Christina Zennaro
  - Université de Paris, PARCC, INSERM, Paris, France
  - Assistance Publique-Hôpitaux de Paris, Hôpital Européen Georges Pompidou, Service de Génétique, Paris, France
- Emily Jefferson
  - Division of Population Health and Genomics, School of Medicine, University of Dundee, Dundee, UK
- Felix Beuschlein
  - Klinik für Endokrinologie, Diabetologie und Klinische Ernährung, UniversitätsSpital Zürich, Zurich, Switzerland
  - Medizinische Klinik und Poliklinik IV, Klinikum der Universität München, LMU München, Munich, Germany
9
Masoudi-Sobhanzadeh Y, Motieghader H, Omidi Y, Masoudi-Nejad A. A machine learning method based on the genetic and world competitive contests algorithms for selecting genes or features in biological applications. Sci Rep 2021; 11:3349. [PMID: 33558580 PMCID: PMC7870651 DOI: 10.1038/s41598-021-82796-y] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 08/09/2020] [Accepted: 01/25/2021] [Indexed: 01/30/2023]
Abstract
Gene/feature selection is an essential preprocessing step for creating models using machine learning techniques. It also plays a critical role in different biological applications such as the identification of biomarkers. Although many feature/gene selection algorithms and methods have been introduced, they may suffer from problems such as parameter tuning or low level of performance. To tackle such limitations, in this study, a universal wrapper approach is introduced based on our introduced optimization algorithm and the genetic algorithm (GA). In the proposed approach, candidate solutions have variable lengths, and a support vector machine scores them. To show the usefulness of the method, thirteen classification and regression-based datasets with different properties were chosen from various biological scopes, including drug discovery, cancer diagnostics, clinical applications, etc. Our findings confirmed that the proposed method outperforms most of the other currently used approaches and can also free the users from difficulties related to the tuning of various parameters. As a result, users may optimize their biological applications such as obtaining a biomarker diagnostic kit with the minimum number of genes and maximum separability power.
Affiliation(s)
- Yosef Masoudi-Sobhanzadeh
  - Research Center for Pharmaceutical Nanotechnology, Biomedicine Institute, Tabriz University of Medical Sciences, Tabriz, Iran
- Habib Motieghader
  - Department of Bioinformatics, Biotechnology Research Center, Tabriz Branch, Islamic Azad University, Tabriz, Iran
  - Department of Basic Sciences, Gowgan Educational Center, Tabriz Branch, Islamic Azad University, Tabriz, Iran
- Yadollah Omidi
  - Department of Pharmaceutical Sciences, College of Pharmacy, Nova Southeastern University, Fort Lauderdale, Florida, 33328 USA
- Ali Masoudi-Nejad
  - Laboratory of Systems Biology and Bioinformatics (LBB), Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran
10
Benedetti F, Poletti S, Vai B, Mazza MG, Lorenzi C, Brioschi S, Aggio V, Branchi I, Colombo C, Furlan R, Zanardi R. Higher baseline interleukin-1β and TNF-α hamper antidepressant response in major depressive disorder. Eur Neuropsychopharmacol 2021; 42:35-44. [PMID: 33191075 DOI: 10.1016/j.euroneuro.2020.11.009] [Citation(s) in RCA: 26] [Impact Index Per Article: 6.5] [Received: 06/15/2020] [Revised: 10/18/2020] [Accepted: 11/06/2020] [Indexed: 01/06/2023]
Abstract
Raised pro-inflammatory immune/inflammatory setpoints, leading to an increased production of peripheral cytokines, have been associated with Major Depressive Disorder (MDD) and with failure to respond to first-line antidepressant drugs. However, the usefulness of these biomarkers in clinical psychopharmacology has been questioned because single findings did not translate into clinical practice, where patients are prescribed treatments upon clinical need. We studied a panel of 27 inflammatory biomarkers in a sample of 108 inpatients with MDD, treated with antidepressant monotherapy for 4 weeks upon clinical need in a specialized hospital setting, and assessed the predictive effect of baseline peripheral measures of inflammation on antidepressant efficacy (response rates and time-lagged pattern of decrease of depression severity) using a machine-learning approach with elastic net penalized regression, and multivariate analyses in the context of the general linear model. When considering both categorical and continuous measures of response, baseline levels of IL-1β predicted non-response to antidepressants, with the predicted probability to respond being highly dispersed at low levels of IL-1β, and stratifying toward non-response when IL-1β is high. Significant negative effects were also detected for TNF-α, while IL-12 weakly predicted response. These findings support the usefulness of inflammatory biomarkers in the clinical psychopharmacology of depression, and add to ongoing research efforts aiming at defining reliable cutoff values to identify depressed patients in clinical settings with high inflammation and low probability to respond.
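As background for the elastic net penalized regression named in this abstract, the estimator is commonly written as a loss plus a blend of the lasso and ridge penalties. The form below follows the widely used parameterization; the symbols $L$, $\lambda$ and $\alpha$ are notation assumed here for illustration, and the abstract does not report the values the authors used.

```latex
\hat{\beta} \;=\; \arg\min_{\beta}\;
\Bigl\{ L(\beta) \;+\; \lambda \Bigl[ \alpha \,\lVert \beta \rVert_{1}
\;+\; \tfrac{1-\alpha}{2}\, \lVert \beta \rVert_{2}^{2} \Bigr] \Bigr\},
\qquad \lambda \ge 0,\; 0 \le \alpha \le 1 .
```

Setting $\alpha = 1$ gives the lasso and $\alpha = 0$ gives ridge regression; intermediate values shrink correlated predictors together while still allowing some coefficients to be set exactly to zero.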
Affiliation(s)
- Francesco Benedetti
  - Psychiatry and Clinical Psychobiology, Division of Neuroscience, IRCCS Scientific Institute Ospedale San Raffaele, Milano, Italy; Vita-Salute San Raffaele University, Milano, Italy
- Sara Poletti
  - Psychiatry and Clinical Psychobiology, Division of Neuroscience, IRCCS Scientific Institute Ospedale San Raffaele, Milano, Italy; Vita-Salute San Raffaele University, Milano, Italy
- Benedetta Vai
  - Psychiatry and Clinical Psychobiology, Division of Neuroscience, IRCCS Scientific Institute Ospedale San Raffaele, Milano, Italy; Vita-Salute San Raffaele University, Milano, Italy; Fondazione Centro San Raffaele, Milano, Italy
- Mario Gennaro Mazza
  - Psychiatry and Clinical Psychobiology, Division of Neuroscience, IRCCS Scientific Institute Ospedale San Raffaele, Milano, Italy; Vita-Salute San Raffaele University, Milano, Italy
- Cristina Lorenzi
  - Psychiatry and Clinical Psychobiology, Division of Neuroscience, IRCCS Scientific Institute Ospedale San Raffaele, Milano, Italy
- Silvia Brioschi
  - Psychiatry and Clinical Psychobiology, Division of Neuroscience, IRCCS Scientific Institute Ospedale San Raffaele, Milano, Italy
- Veronica Aggio
  - Psychiatry and Clinical Psychobiology, Division of Neuroscience, IRCCS Scientific Institute Ospedale San Raffaele, Milano, Italy; Vita-Salute San Raffaele University, Milano, Italy
- Igor Branchi
  - Center for Behavioral Sciences and Mental Health, Istituto Superiore di Sanità, Rome, Italy
- Cristina Colombo
  - Psychiatry and Clinical Psychobiology, Division of Neuroscience, IRCCS Scientific Institute Ospedale San Raffaele, Milano, Italy; Vita-Salute San Raffaele University, Milano, Italy
- Roberto Furlan
  - Vita-Salute San Raffaele University, Milano, Italy; Clinical Neuroimmunology, Division of Neuroscience, IRCCS Scientific Institute Ospedale San Raffaele, Milano, Italy
- Raffaella Zanardi
  - Psychiatry and Clinical Psychobiology, Division of Neuroscience, IRCCS Scientific Institute Ospedale San Raffaele, Milano, Italy; Vita-Salute San Raffaele University, Milano, Italy
11
TCox: Correlation-Based Regularization Applied to Colorectal Cancer Survival Data. Biomedicines 2020; 8:488. [PMID: 33182598 PMCID: PMC7696515 DOI: 10.3390/biomedicines8110488] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 09/17/2020] [Revised: 10/26/2020] [Accepted: 11/06/2020] [Indexed: 01/29/2023]
Abstract
Colorectal cancer (CRC) is one of the leading causes of mortality and morbidity in the world. Being a heterogeneous disease, cancer therapy and prognosis represent a significant challenge to medical care. The molecular information improves the accuracy with which patients are classified and treated since similar pathologies may show different clinical outcomes and other responses to treatment. However, the high dimensionality of gene expression data makes the selection of novel genes a problematic task. We propose TCox, a novel penalization function for Cox models, which promotes the selection of genes that have distinct correlation patterns in normal vs. tumor tissues. We compare TCox to other regularized survival models, Elastic Net, HubCox, and OrphanCox. Gene expression and clinical data of CRC and normal (TCGA) patients are used for model evaluation. Each model is tested 100 times. Within a specific run, eighteen of the features selected by TCox are also selected by the other survival regression models tested, therefore undoubtedly being crucial players in the survival of colorectal cancer patients. Moreover, the TCox model exclusively selects genes able to categorize patients into significant risk groups. Our work demonstrates the ability of the proposed weighted regularizer TCox to disclose novel molecular drivers in CRC survival by accounting for correlation-based network information from both tumor and normal tissue. The results presented support the relevance of network information for biomarker identification in high-dimensional gene expression data and foster new directions for the development of network-based feature selection methods in precision oncology.
|
12
|
Ennour-Idrissi K, Dragic D, Issa E, Michaud A, Chang SL, Provencher L, Durocher F, Diorio C. DNA Methylation and Breast Cancer Risk: An Epigenome-Wide Study of Normal Breast Tissue and Blood. Cancers (Basel) 2020; 12:cancers12113088. [PMID: 33113958 PMCID: PMC7690691 DOI: 10.3390/cancers12113088] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Received: 08/31/2020] [Revised: 10/07/2020] [Accepted: 10/19/2020] [Indexed: 02/07/2023]
Abstract
Differential DNA methylation is a potential marker of breast cancer risk. Few studies have investigated DNA methylation changes in normal breast tissue and were largely confounded by cancer field effects. To detect methylation changes in normal breast epithelium that are causally associated with breast cancer occurrence, we used a nested case-control study design based on a prospective cohort of patients diagnosed with a primary invasive hormone receptor-positive breast cancer. Twenty patients diagnosed with a contralateral breast cancer (CBC) were matched (1:1) with 20 patients who did not develop a CBC on relevant risk factors. Differentially methylated Cytosine-phosphate-Guanines (CpGs) and regions in normal breast epithelium were identified using an epigenome-wide DNA methylation assay and robust linear regressions. Analyses were replicated in two independent sets of normal breast tissue and blood. We identified 7315 CpGs (FDR < 0.05), 52 passing strict Bonferroni correction (p < 1.22 × 10⁻⁷) and 43 mapping to known genes involved in metabolic diseases with significant enrichment (p < 0.01) of pathways involving fatty acids metabolic processes. Four differentially methylated genes were detected in both site-specific and regions analyses (LHX2, TFAP2B, JAKMIP1, SEPT9), and three genes overlapped all three datasets (POM121L2, KCNQ1, CLEC4C). Once validated, the seven differentially methylated genes distinguishing women who developed and who did not develop a sporadic breast cancer could be used to enhance breast cancer risk-stratification, and allow implementation of targeted screening and preventive strategies that would ultimately improve breast cancer prognosis.
Affiliation(s)
- Kaoutar Ennour-Idrissi
- Département de Médecine Sociale et Préventive, Faculté de Médecine, Université Laval, Québec, QC G1V 0A6, Canada; (K.E.-I.); (D.D.)
- Centre de Recherche sur le Cancer, Centre de Recherche du CHU de Québec-Université Laval, Québec, QC G1R 3S3, Canada; (E.I.); (A.M.); (S.-L.C.); (F.D.)
- Département de Biologie Moléculaire, de Biochimie Médicale et de Pathologie de l’Université Laval, Québec, QC G1V 0A6, Canada
| | - Dzevka Dragic
- Département de Médecine Sociale et Préventive, Faculté de Médecine, Université Laval, Québec, QC G1V 0A6, Canada; (K.E.-I.); (D.D.)
- Centre de Recherche sur le Cancer, Centre de Recherche du CHU de Québec-Université Laval, Québec, QC G1R 3S3, Canada; (E.I.); (A.M.); (S.-L.C.); (F.D.)
| | - Elissar Issa
- Centre de Recherche sur le Cancer, Centre de Recherche du CHU de Québec-Université Laval, Québec, QC G1R 3S3, Canada; (E.I.); (A.M.); (S.-L.C.); (F.D.)
- Département de Médecine Moléculaire, Faculté de Médecine, Université Laval, Québec, QC G1V 0A6, Canada
| | - Annick Michaud
- Centre de Recherche sur le Cancer, Centre de Recherche du CHU de Québec-Université Laval, Québec, QC G1R 3S3, Canada; (E.I.); (A.M.); (S.-L.C.); (F.D.)
| | - Sue-Ling Chang
- Centre de Recherche sur le Cancer, Centre de Recherche du CHU de Québec-Université Laval, Québec, QC G1R 3S3, Canada; (E.I.); (A.M.); (S.-L.C.); (F.D.)
| | - Louise Provencher
- Centre des Maladies du sein du CHU de Québec-Université Laval, Québec, QC G1S 4L8, Canada;
| | - Francine Durocher
- Centre de Recherche sur le Cancer, Centre de Recherche du CHU de Québec-Université Laval, Québec, QC G1R 3S3, Canada; (E.I.); (A.M.); (S.-L.C.); (F.D.)
- Département de Médecine Moléculaire, Faculté de Médecine, Université Laval, Québec, QC G1V 0A6, Canada
| | - Caroline Diorio
- Département de Médecine Sociale et Préventive, Faculté de Médecine, Université Laval, Québec, QC G1V 0A6, Canada; (K.E.-I.); (D.D.)
- Centre de Recherche sur le Cancer, Centre de Recherche du CHU de Québec-Université Laval, Québec, QC G1R 3S3, Canada; (E.I.); (A.M.); (S.-L.C.); (F.D.)
- Centre des Maladies du sein du CHU de Québec-Université Laval, Québec, QC G1S 4L8, Canada;
- Correspondence: ; Tel.: +1-418-682-7511-84726
|
13
|
Robust high-dimensional regression for data with anomalous responses. ANN I STAT MATH 2020. [DOI: 10.1007/s10463-020-00764-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]
|
14
|
Sun H, Cui Y, Wang H, Liu H, Wang T. Comparison of methods for the detection of outliers and associated biomarkers in mislabeled omics data. BMC Bioinformatics 2020; 21:357. [PMID: 32795265 PMCID: PMC7646480 DOI: 10.1186/s12859-020-03653-9] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 07/30/2019] [Accepted: 07/10/2020] [Indexed: 02/08/2023]
Abstract
Background Previous studies have reported that labeling errors are not uncommon in omics data. Potential outliers may severely undermine the correct classification of patients and the identification of reliable biomarkers for a particular disease. Three methods have been proposed to address the problem: sparse label-noise-robust logistic regression (Rlogreg), robust elastic net based on the least trimmed square (enetLTS), and Ensemble. Ensemble is an ensemble classification method based on distinct feature selection and modeling strategies. The accuracy of biomarker selection and outlier detection of these methods needs to be evaluated and compared so that the appropriate method can be chosen. Results The accuracy of variable selection, outlier identification, and prediction of the three methods (Ensemble, enetLTS, Rlogreg) was compared on simulated datasets and an RNA-seq dataset. On simulated datasets, Ensemble had the highest variable selection accuracy, as measured by a comprehensive index, and the lowest false discovery rate among the three methods. When the sample size was large and the proportion of outliers was ≤5%, the positive selection rate of Ensemble was similar to that of enetLTS. However, when the proportion of outliers was 10% or 15%, Ensemble missed some variables that affected the response variables. Overall, enetLTS had the best outlier detection accuracy with false positive rates < 0.05 and high sensitivity, and enetLTS still performed well when the proportion of outliers was relatively large. With 1% or 2% outliers, Ensemble showed high outlier detection accuracy, but with higher proportions of outliers Ensemble missed many mislabeled samples. Rlogreg and Ensemble were less accurate in identifying outliers than enetLTS. The prediction accuracy of enetLTS was better than that of Rlogreg. Running Ensemble on a subset of data after removing the outliers identified by enetLTS improved the variable selection accuracy of Ensemble.
Conclusions When the proportion of outliers is ≤5%, Ensemble can be used for variable selection. When the proportion of outliers is > 5%, Ensemble can be used for variable selection on a subset after removing outliers identified by enetLTS. For outlier identification, enetLTS is the recommended method. In practice, the proportion of outliers can be estimated according to the inaccuracy of the diagnostic methods used.
Affiliation(s)
- Hongwei Sun
- Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan City, 030001, Shanxi, China; Department of Health Statistics, School of Public Health and Management, Binzhou Medical University, City, Yantai, 264003, Shandong, China
| | - Yuehua Cui
- Department of Statistics and Probability, Michigan State University, East Lansing, MI, 48824, USA
| | - Hui Wang
- Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan City, 030001, Shanxi, China
| | - Haixia Liu
- Department of Health Statistics, School of Public Health and Management, Binzhou Medical University, City, Yantai, 264003, Shandong, China
| | - Tong Wang
- Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan City, 030001, Shanxi, China.
|
15
|
Chen X, Zhang B, Wang T, Bonni A, Zhao G. Robust principal component analysis for accurate outlier sample detection in RNA-Seq data. BMC Bioinformatics 2020; 21:269. [PMID: 32600248 PMCID: PMC7324992 DOI: 10.1186/s12859-020-03608-0] [Citation(s) in RCA: 33] [Impact Index Per Article: 6.6] [Received: 01/06/2020] [Accepted: 06/16/2020] [Indexed: 01/07/2023]
Abstract
BACKGROUND High-throughput RNA sequencing is a powerful approach to study gene expression. Because data acquisition involves complex multi-step protocols, a sample may deviate extremely from other samples of the same treatment group owing to technical variation or true biological differences. The high dimensionality of the data, combined with few biological replicates, makes it challenging to accurately detect those samples, and this issue is currently not well studied in the literature. Robust statistics is a family of theories and techniques that aim to detect outliers by first fitting the majority of the data and then flagging data points that deviate from the fit. Robust statistics have been widely used in multivariate data analysis for outlier detection in chemometrics and engineering. Here we apply robust statistics to RNA-seq data analysis. RESULTS We report the use of two robust principal component analysis (rPCA) methods, PcaHubert and PcaGrid, to detect outlier samples in multiple simulated and real biological RNA-seq data sets with positive control outlier samples. PcaGrid achieved 100% sensitivity and 100% specificity in all the tests using positive control outliers with varying degrees of divergence. We applied rPCA methods and classical principal component analysis (cPCA) on an RNA-Seq data set profiling gene expression of the external granule layer in the cerebellum of control and conditional SnoN knockout mice. Both rPCA methods detected the same two outlier samples but cPCA failed to detect any. We performed differentially expressed gene detection before and after outlier removal as well as with and without batch effect modeling. We validated gene expression changes using quantitative reverse transcription PCR and used the result as reference to compare the performance of eight different data analysis strategies. Removing outliers without batch effect modeling performed the best in terms of detecting biologically relevant differentially expressed genes.
CONCLUSIONS rPCA implemented in the PcaGrid function is an accurate and objective method to detect outlier samples. It is well suited for high-dimensional data with small sample sizes like RNA-seq data. Outlier removal can significantly improve the performance of differential gene detection and downstream functional analysis.
Affiliation(s)
- Xiaoying Chen
- Department of Neuroscience, Washington University School of Medicine, St. Louis, MO, USA
| | - Bo Zhang
- Center of Regenerative Medicine, Department of Developmental Biology, Washington University School of Medicine, St. Louis, MO, USA
| | - Ting Wang
- Department of Genetics, Washington University School of Medicine, St. Louis, MO, USA
- The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, MO, USA
| | - Azad Bonni
- Department of Neuroscience, Washington University School of Medicine, St. Louis, MO, USA
| | - Guoyan Zhao
- Department of Neuroscience, Washington University School of Medicine, St. Louis, MO, USA.
|
16
|
A Two-Level Approach based on Integration of Bagging and Voting for Outlier Detection. JOURNAL OF DATA AND INFORMATION SCIENCE 2020. [DOI: 10.2478/jdis-2020-0014] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 11/20/2022]
Abstract
Purpose
The main aim of this study is to build a robust novel approach that is able to detect outliers in the datasets accurately. To serve this purpose, a novel approach is introduced to determine the likelihood of an object to be extremely different from the general behavior of the entire dataset.
Design/methodology/approach
This paper proposes a novel two-level approach based on the integration of bagging and voting techniques for anomaly detection problems. The proposed approach, named Bagged and Voted Local Outlier Detection (BV-LOF), benefits from the Local Outlier Factor (LOF) as the base algorithm and improves its detection rate by using ensemble methods.
Findings
Several experiments have been performed on ten benchmark outlier detection datasets to demonstrate the effectiveness of the BV-LOF method. According to the results, the BV-LOF approach significantly outperformed LOF, on average, on 9 of the 10 datasets.
Research limitations
In the BV-LOF approach, the base algorithm is applied to each data subset multiple times, with a different neighborhood size (k) in each case and with different ensemble sizes (T). In our study, we have chosen the value ranges of k and T as [1–100]; however, these ranges can be changed according to the dataset handled and the problem addressed.
Practical implications
The proposed method can be applied to datasets from different domains (e.g., health, finance, manufacturing) without requiring any prior information. Since the BV-LOF method includes two-level ensemble operations, it may require more computational time than single-level ensemble methods; however, this drawback can be overcome by parallelization and by using a proper data structure such as an R*-tree or KD-tree.
Originality/value
The proposed approach (BV-LOF) investigates multiple neighborhood sizes (k), which reveals instances with different local densities and thereby increases the chance of detecting outliers that LOF alone may miss. It also brings many benefits such as easy implementation, improved capability, higher applicability, and interpretability.
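The two-level idea described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how subsets are drawn or how votes are combined, so random feature subsets, a fixed LOF cutoff of 1.5, and the names `lof_scores` and `bv_lof_like` are all assumptions made for illustration. The base scorer is a plain textbook LOF.

```python
import math
import random

def lof_scores(points, k):
    """Textbook Local Outlier Factor for every point (values near 1 = inlier)."""
    n = len(points)
    # k nearest neighbours of each point as (distance, index) pairs
    neigh = [sorted((math.dist(points[i], p), j)
                    for j, p in enumerate(points) if j != i)[:k]
             for i in range(n)]
    kdist = [nb[-1][0] for nb in neigh]          # distance to the k-th neighbour
    lrd = []                                     # local reachability density
    for i in range(n):
        reach = sum(max(kdist[j], d) for d, j in neigh[i])
        lrd.append(len(neigh[i]) / (reach or 1e-12))   # guard against duplicates
    return [sum(lrd[j] for _, j in neigh[i]) / (len(neigh[i]) * lrd[i])
            for i in range(n)]

def bv_lof_like(points, ks, n_bags, threshold=1.5, seed=0):
    """Fraction of (bag, k) runs in which each point is voted an outlier.

    Level 1 (bagging): random feature subsets -- an assumption of this sketch.
    Level 2 (voting): one vote per neighbourhood size k when LOF > threshold.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    votes = [0] * n
    for _ in range(n_bags):
        size = rng.randint(max(1, dim // 2), dim)
        feats = rng.sample(range(dim), size)
        sub = [tuple(p[f] for f in feats) for p in points]
        for k in ks:
            for i, score in enumerate(lof_scores(sub, k)):
                votes[i] += int(score > threshold)
    return [v / (n_bags * len(ks)) for v in votes]

# Toy data: a 4x4 grid of inliers plus one distant point (index 16).
grid = [(float(x), float(y)) for x in range(4) for y in range(4)]
points = grid + [(10.0, 10.0)]
scores = bv_lof_like(points, ks=(3, 5), n_bags=4)
```

On this toy data the distant point collects the most votes. Varying `ks` is what lets the ensemble respond to different local densities, which is the property the abstract highlights.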
|
17
|
Lopes MB, Casimiro S, Vinga S. Twiner: correlation-based regularization for identifying common cancer gene signatures. BMC Bioinformatics 2019; 20:356. [PMID: 31238876 PMCID: PMC6593597 DOI: 10.1186/s12859-019-2937-8] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 03/20/2019] [Accepted: 06/06/2019] [Indexed: 12/27/2022]
Abstract
Background Breast and prostate cancers are typical examples of hormone-dependent cancers, showing remarkable similarities at the hormone-related signaling pathways level, and exhibiting a high tropism to bone. While the identification of genes playing a specific role in each cancer type brings invaluable insights for gene therapy research by targeting disease-specific cell functions not accounted for so far, identifying a common gene signature to breast and prostate cancers could unravel new targets to tackle shared hormone-dependent disease features, like bone relapse. This would potentially allow the development of new targeted therapies directed to genes regulating both cancer types, with a consequent positive impact on cancer management and health economics. Results We address the challenge of extracting gene signatures from transcriptomic data of prostate adenocarcinoma (PRAD) and breast invasive carcinoma (BRCA) samples, particularly estrogen positive (ER+), and androgen positive (AR+) triple-negative breast cancer (TNBC), using sparse logistic regression. The introduction of gene network information based on the distances between BRCA and PRAD correlation matrices is investigated, through the proposed twin networks recovery (twiner) penalty, as a strategy to ensure that gene features with similar correlation patterns in the two diseases are less penalized during the feature selection procedure. Conclusions Our analysis led to the identification of genes that show a similar correlation pattern in BRCA and PRAD transcriptomic data, and are selected as key players in the classification of breast and prostate samples into ER+ BRCA/AR+ TNBC/PRAD tumor and normal tissues, and also associated with survival time distributions. The results obtained are supported by the literature and are expected to unveil the similarities between the diseases, disclose common disease biomarkers, and help in the definition of new strategies for more effective therapies.
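The correlation-distance weighting described in this abstract can be sketched as follows. The abstract states only that distances between the two correlation matrices determine how strongly each gene is penalized; the angular distance, the normalisation by the maximum, the function names, and the toy matrices `brca` and `prad` are assumptions of this sketch, not details taken from the paper.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-constant inputs)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(samples):
    """Gene-by-gene correlation matrix; rows of `samples` are patients."""
    genes = list(zip(*samples))
    return [[pearson(gi, gj) for gj in genes] for gi in genes]

def angular_distance(u, v):
    """Angle between two correlation profiles (assumed distance measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / norm)))

def twiner_like_weights(data_a, data_b):
    """Per-gene penalty weights in [0, 1]: small when a gene's correlation
    profile is similar in both datasets, so that gene is penalized less."""
    corr_a, corr_b = correlation_matrix(data_a), correlation_matrix(data_b)
    d = [angular_distance(ra, rb) for ra, rb in zip(corr_a, corr_b)]
    d_max = max(d)
    return [x / d_max for x in d] if d_max > 0 else [0.0] * len(d)

def weighted_l1(beta, weights):
    """Weighted lasso-type penalty term: sum_j w_j * |beta_j|."""
    return sum(w * abs(b) for b, w in zip(beta, weights))

# Made-up toy data: 5 patients x 3 genes. Gene 2 flips its correlation
# with genes 0 and 1 between the two datasets.
brca = [(1, 2, 3), (2, 4, 5), (3, 6, 7), (4, 8, 9), (5, 10, 11)]
prad = [(1, 2, 11), (2, 4, 9), (3, 6, 7), (4, 8, 5), (5, 10, 3)]
w = twiner_like_weights(brca, prad)
```

Here gene 2 receives the largest weight, so a sparse model using `weighted_l1` would shrink its coefficient hardest, while genes 0 and 1, whose profiles change less, are penalized less.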
Affiliation(s)
- Marta B Lopes
- Instituto de Telecomunicações, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, Lisboa, 1049-001, Portugal; INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Rua Alves Redol 9, Lisboa, 1000-029, Portugal.
| | - Sandra Casimiro
- Luis Costa Lab, Instituto de Medicina Molecular, Faculdade de Medicina da Universidade de Lisboa, Avenida Professor Egas Moniz, Lisboa, 1649-028, Portugal
| | - Susana Vinga
- INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Rua Alves Redol 9, Lisboa, 1000-029, Portugal; IDMEC, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, Lisboa, 1049-001, Portugal
|
18
|
Sun L, Kong X, Xu J, Xue Z, Zhai R, Zhang S. A Hybrid Gene Selection Method Based on ReliefF and Ant Colony Optimization Algorithm for Tumor Classification. Sci Rep 2019; 9:8978. [PMID: 31222027 PMCID: PMC6586811 DOI: 10.1038/s41598-019-45223-x] [Citation(s) in RCA: 25] [Impact Index Per Article: 4.2] [Received: 11/28/2018] [Accepted: 06/04/2019] [Indexed: 12/20/2022]
Abstract
For DNA microarray datasets, tumor classification based on gene expression profiles has drawn great attention, and gene selection plays a significant role in improving the classification performance of microarray data. In this study, an effective hybrid gene selection method based on ReliefF and the Ant colony optimization (ACO) algorithm for tumor classification is proposed. First, for the ReliefF algorithm, the average distance among k nearest or k non-nearest neighbor samples is introduced to estimate the difference among samples, based on which the distances between the samples in the same class or the different classes are defined, and then it can more effectively evaluate the weight values of genes for samples. To obtain stable results in emergencies, a distance coefficient is developed to construct a new formula of updating weight coefficient of genes to further reduce the instability during calculations. When decreasing the distance between the same samples and increasing the distance between the different samples, the weight division is more obvious. Thus, the ReliefF algorithm can be improved to reduce the initial dimensionality of gene expression datasets and obtain a candidate gene subset. Second, a new pruning rule is designed to reduce dimensionality and obtain a new candidate subset with a smaller number of genes. The probability formula of the next point in the path selected by the ants is presented to highlight the closeness of the correlation relationship between the reaction variables. To increase the pheromone concentration of important genes, a new phenotype updating formula of the ACO algorithm is adopted to prevent the pheromone left by the ants that are overwhelmed with time, and then the weight coefficients of the genes are applied here to eliminate the interference of difference data as much as possible.
It follows that the improved ACO algorithm has strong positive feedback, which quickly converges to an optimal solution through the accumulation and the updating of pheromone. Finally, by combining the improved ReliefF algorithm and the improved ACO method, a hybrid filter-wrapper-based gene selection algorithm called RFACO-GS is proposed. The experimental results on several public gene expression datasets demonstrate that the proposed method is very effective: it can significantly reduce the dimensionality of gene expression datasets and select the most relevant genes with high classification accuracy.
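For readers unfamiliar with the filter stage, the following Python sketch shows the basic ReliefF-style weight update that the abstract builds on: a feature gains weight when it differs between a sample and its nearest neighbours from the other class (misses) and loses weight when it differs from its nearest neighbours in the same class (hits). This is a two-class simplification of the standard idea only. It does not include the paper's average-distance estimate, distance coefficient, pruning rule, or ACO stage, and the function name and toy data are hypothetical.

```python
def relieff_weights(X, y, k=2):
    """ReliefF-style feature weights (two-class simplification).

    X: list of numeric feature tuples, y: list of class labels.
    Assumes every class has more than k samples.
    """
    n, d = len(X), len(X[0])
    # Per-feature range, used to scale differences into [0, 1]
    span = [(max(r[f] for r in X) - min(r[f] for r in X)) or 1.0
            for f in range(d)]

    def diff(f, a, b):
        return abs(a[f] - b[f]) / span[f]

    def dist(a, b):
        return sum(diff(f, a, b) for f in range(d))

    w = [0.0] * d
    for i in range(n):
        # Other samples ordered by distance to sample i
        ranked = sorted((dist(X[i], X[j]), j) for j in range(n) if j != i)
        hits = [j for _, j in ranked if y[j] == y[i]][:k]
        misses = [j for _, j in ranked if y[j] != y[i]][:k]
        for f in range(d):
            hit_term = sum(diff(f, X[i], X[j]) for j in hits) / max(len(hits), 1)
            miss_term = sum(diff(f, X[i], X[j]) for j in misses) / max(len(misses), 1)
            w[f] += (miss_term - hit_term) / n
    return w

# Made-up toy data: feature 0 separates the classes, feature 1 is noise.
X = [(0.10, 0.5), (0.20, 0.1), (0.15, 0.9),
     (0.90, 0.4), (0.80, 0.8), (0.85, 0.2)]
y = [0, 0, 0, 1, 1, 1]
weights = relieff_weights(X, y, k=2)
```

Ranking genes by these weights and keeping the top ones yields a candidate subset of the kind the abstract then passes to the wrapper stage.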
Affiliation(s)
- Lin Sun
- College of Computer and Information Engineering, Henan Normal University, Xinxiang, 453007, China.
- Post-doctoral Mobile Station of Biology, College of Life Science, Henan Normal University, Xinxiang, China.
| | - Xianglin Kong
- College of Computer and Information Engineering, Henan Normal University, Xinxiang, 453007, China
| | - Jiucheng Xu
- College of Computer and Information Engineering, Henan Normal University, Xinxiang, 453007, China.
- Post-doctoral Mobile Station of Biology, College of Life Science, Henan Normal University, Xinxiang, China.
| | - Zhan'ao Xue
- College of Computer and Information Engineering, Henan Normal University, Xinxiang, 453007, China
| | - Ruibing Zhai
- College of Computer and Information Engineering, Henan Normal University, Xinxiang, 453007, China
| | - Shiguang Zhang
- College of Computer and Information Engineering, Henan Normal University, Xinxiang, 453007, China
- School of Computer Science and Technology, Tianjin University, Tianjin, 300072, China
|
19
|
Segaert P, Lopes MB, Casimiro S, Vinga S, Rousseeuw PJ. Robust identification of target genes and outliers in triple-negative breast cancer data. Stat Methods Med Res 2018; 28:3042-3056. [PMID: 30146936 PMCID: PMC6745616 DOI: 10.1177/0962280218794722] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Indexed: 12/16/2022]
Abstract
Correct classification of breast cancer subtypes is of high importance as it directly affects the therapeutic options. We focus on triple-negative breast cancer, which has the worst prognosis among breast cancer types. Using cutting-edge methods from the field of robust statistics, we analyze Breast Invasive Carcinoma transcriptomic data publicly available from The Cancer Genome Atlas data portal. Our analysis identifies statistical outliers that may correspond to misdiagnosed patients. Furthermore, it is illustrated that classical statistical methods may fail to identify outliers due to their heavy influence, prompting the need for robust statistics. Using robust sparse logistic regression, we obtain 36 relevant genes, of which ca. 60% have been previously reported as biologically relevant to triple-negative breast cancer, reinforcing the validity of the method. The remaining 14 genes identified are new potential biomarkers for triple-negative breast cancer. Out of these, JAM3, SFT2D2, and PAPSS1 were previously associated with breast tumors or other types of cancer. The relevance of these genes is confirmed by the new DetectDeviatingCells outlier detection technique. A comparison of gene networks on the selected genes showed significant differences between triple-negative breast cancer and non-triple-negative breast cancer data. The individual role of FOXA1 in triple-negative breast cancer and non-triple-negative breast cancer, and the strong FOXA1-AGR2 connection in triple-negative breast cancer stand out. The goal of our paper is to contribute to the breast cancer/triple-negative breast cancer understanding and management. At the same time, it demonstrates that robust regression and outlier detection constitute key strategies to cope with high-dimensional clinical data such as omics data.
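The remark that classical methods may miss outliers because of the outliers' own influence (often called masking) can be seen in a small univariate Python example. This is only an illustration with made-up numbers and hypothetical function names; the paper itself uses robust sparse logistic regression and DetectDeviatingCells, which are not reproduced here. The sketch compares a mean/standard-deviation z-score with a median/MAD score.

```python
from statistics import mean, median, stdev

def classical_z(values):
    """z-scores from the mean and standard deviation (both outlier-sensitive)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def robust_z(values):
    """z-scores from the median and the median absolute deviation (MAD).

    The factor 1.4826 makes the MAD comparable to the standard deviation
    for normally distributed data. Assumes the MAD is non-zero.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return [(v - med) / (1.4826 * mad) for v in values]

def flag(scores, cutoff=3.0):
    """Indices whose absolute score exceeds the cutoff."""
    return [i for i, z in enumerate(scores) if abs(z) > cutoff]

# Made-up measurements: seven typical values and two gross outliers.
data = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 25.0, 26.0]

classical_flags = flag(classical_z(data))   # outliers inflate the mean and sd
robust_flags = flag(robust_z(data))         # median and MAD are barely affected
```

With these numbers the two extreme values pull the mean up and inflate the standard deviation so much that neither exceeds the cutoff of 3, whereas the median/MAD score flags both (indices 7 and 8).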
Affiliation(s)
| | - Marta B Lopes
- IDMEC, Instituto de Engenharia Mecânica, Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal
| | - Sandra Casimiro
- Luís Costa Lab, Instituto de Medicina Molecular, Faculdade de Medicina da Universidade de Lisboa, Lisboa, Portugal
| | - Susana Vinga
- IDMEC, Instituto de Engenharia Mecânica, Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal; INESC-ID, Instituto de Engenharia de Sistemas e Computadores, Investigação e Desenvolvimento, Lisboa, Portugal
|