1
Yang C, Camargo Tavares L, Lee HC, Steele JR, Ribeiro RV, Beale AL, Yiallourou S, Carrington MJ, Kaye DM, Head GA, Schittenhelm RB, Marques FZ. Faecal metaproteomics analysis reveals a high cardiovascular risk profile across healthy individuals and heart failure patients. Gut Microbes 2025; 17:2441356. [PMID: 39709554 DOI: 10.1080/19490976.2024.2441356] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/30/2024] [Revised: 11/18/2024] [Accepted: 11/26/2024] [Indexed: 12/23/2024] Open
Abstract
The gut microbiota is a crucial link between diet and cardiovascular disease (CVD). Using fecal metaproteomics, a method that concurrently captures human gut and microbiome proteins, we determined the crosstalk between gut microbiome, diet, gut health, and CVD. Traditional CVD risk factors (age, BMI, sex, blood pressure) explained < 10% of the proteome variance. However, unsupervised human protein-based clustering analysis revealed two distinct CVD risk clusters (low-risk and high-risk) with different blood pressure (by 9 mmHg) and sex-dependent dietary potassium and fiber intake. In the human proteome, the low-risk group had lower angiotensin-converting enzymes, inflammatory proteins associated with neutrophil extracellular trap formation and auto-immune diseases. In the microbial proteome, the low-risk group had higher expression of phosphate acetyltransferase that produces SCFAs, particularly in fiber-fermenting bacteria. This model identified severity across phenotypes in heart failure patients and long-term risk of cardiovascular events in a large population-based cohort. These findings underscore multifactorial gut-to-host mechanisms that may underlie risk factors for CVD.
Affiliation(s)
- Chaoran Yang
  - Hypertension Research Laboratory, School of Biological Sciences, Faculty of Science, Monash, Clayton, Australia
- Leticia Camargo Tavares
  - Hypertension Research Laboratory, School of Biological Sciences, Faculty of Science, Monash, Clayton, Australia
- Han-Chung Lee
  - Monash Proteomics & Metabolomics Platform, Monash Biomedicine Discovery Institute & Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Australia
- Joel R Steele
  - Monash Proteomics & Metabolomics Platform, Monash Biomedicine Discovery Institute & Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Australia
- Anna L Beale
  - Heart Failure Research Laboratory, Baker Heart and Diabetes Institute, Melbourne, Australia
  - Department of Cardiology, Alfred Hospital, Melbourne, Australia
- Stephanie Yiallourou
  - Preclinical Disease and Prevention Unit, Baker Heart and Diabetes Institute, Melbourne, Australia
- Melinda J Carrington
  - Preclinical Disease and Prevention Unit, Baker Heart and Diabetes Institute, Melbourne, Australia
- David M Kaye
  - Heart Failure Research Laboratory, Baker Heart and Diabetes Institute, Melbourne, Australia
  - Department of Cardiology, Alfred Hospital, Melbourne, Australia
  - School of Translational Medicine, Faculty of Medicine Nursing and Health Sciences, Monash University, Melbourne, Australia
- Geoffrey A Head
  - Neuropharmacology Laboratory, Baker Heart and Diabetes Institute, Melbourne, Australia
  - Department of Pharmacology, Faculty of Medicine Nursing and Health Sciences, Monash University, Melbourne, Australia
- Ralf B Schittenhelm
  - Monash Proteomics & Metabolomics Platform, Monash Biomedicine Discovery Institute & Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Australia
- Francine Z Marques
  - Hypertension Research Laboratory, School of Biological Sciences, Faculty of Science, Monash, Clayton, Australia
  - Heart Failure Research Laboratory, Baker Heart and Diabetes Institute, Melbourne, Australia
  - Victorian Heart Institute, Monash University, Clayton, Australia
2
Wang L, Su J, Liu Z, Ding S, Li Y, Hou B, Hu Y, Dong Z, Tang J, Liu H, Liu W. Identification of immune-associated biomarkers of diabetes nephropathy tubulointerstitial injury based on machine learning: a bioinformatics multi-chip integrated analysis. BioData Min 2024; 17:20. [PMID: 38951833 PMCID: PMC11218417 DOI: 10.1186/s13040-024-00369-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/17/2024] [Accepted: 06/10/2024] [Indexed: 07/03/2024] Open
Abstract
BACKGROUND Diabetic nephropathy (DN) is a major microvascular complication of diabetes and has become the leading cause of end-stage renal disease worldwide. A considerable number of DN patients have experienced irreversible end-stage renal disease progression due to the inability to diagnose the disease early. Therefore, reliable biomarkers that are helpful for early diagnosis and treatment need to be identified. The migration of immune cells to the kidney is considered to be a key step in the progression of DN-related vascular injury. Therefore, finding markers in this process may be more helpful for the early diagnosis and progression prediction of DN. METHODS The gene chip data were retrieved from the GEO database using the search term 'diabetic nephropathy'. The 'limma' software package was used to identify differentially expressed genes (DEGs) between DN and control samples. Gene set enrichment analysis (GSEA) was performed on genes obtained from the Molecular Signatures Database (MSigDB). The R package 'WGCNA' was used to identify gene modules associated with tubulointerstitial injury in DN, and these modules were intersected with immune-related DEGs to identify target genes. Gene ontology (GO) enrichment analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis were performed on differentially expressed genes using the 'ClusterProfiler' software package in R. Three methods, least absolute shrinkage and selection operator (LASSO), support vector machine recursive feature elimination (SVM-RFE) and random forest (RF), were used to select immune-related biomarkers for diagnosis. We retrieved the tubulointerstitial dataset from the Nephroseq database to construct an external validation dataset. Unsupervised clustering analysis of the expression levels of immune-related biomarkers was performed using the 'ConsensusClusterPlus' R software package. The urine of patients who visited Dongzhimen Hospital of Beijing University of Chinese Medicine from September 2021 to March 2023 was collected, and ELISA was used to detect the mRNA expression level of immune-related biomarkers in urine. Pearson correlation analysis was used to detect the effect of immune-related biomarker expression on renal function in DN patients. RESULTS Four microarray datasets from the GEO database were included in the analysis: GSE30122, GSE47185, GSE99340 and GSE104954. These datasets included 63 DN patients and 55 healthy controls. A total of 9415 genes were detected in the data set. We found 153 differentially expressed immune-related genes, of which 112 genes were up-regulated, 41 genes were down-regulated, and 119 overlapping genes were identified. GO analysis showed that they were involved in various biological processes including leukocyte-mediated immunity. KEGG analysis showed that these target genes were mainly involved in the formation of phagosomes in Staphylococcus aureus infection. Among these 119 overlapping genes, machine learning results identified AGR2, CCR2, CEBPD, CISH, CX3CR1, DEFB1 and FSTL1 as potential tubulointerstitial immune-related biomarkers. External validation suggested that the above markers showed diagnostic efficacy in distinguishing DN patients from healthy controls. Clinical studies have shown that the expression of AGR2, CX3CR1 and FSTL1 in urine samples of DN patients is negatively correlated with GFR, the expression of CX3CR1 and FSTL1 in urine samples of DN is positively correlated with serum creatinine, while the expression of DEFB1 in urine samples of DN is negatively correlated with serum creatinine. In addition, the expression of CX3CR1 in DN urine samples was positively correlated with proteinuria, while the expression of DEFB1 in DN urine samples was negatively correlated with proteinuria. Finally, according to the level of proteinuria, DN patients were divided into nephrotic proteinuria group (n = 24) and subrenal proteinuria group. There were significant differences in urinary AGR2, CCR2 and DEFB1 between the two groups by unpaired t test (P < 0.05). CONCLUSIONS Our study provides new insights into the role of immune-related biomarkers in DN tubulointerstitial injury and provides potential targets for early diagnosis and treatment of DN patients. Seven different genes (AGR2, CCR2, CEBPD, CISH, CX3CR1, DEFB1, FSTL1), as promising sensitive biomarkers, may affect the progression of DN by regulating immune inflammatory response. However, further comprehensive studies are needed to fully understand their exact molecular mechanisms and functional pathways in DN.
Affiliation(s)
- Lin Wang
  - Key Laboratory of Chinese Internal Medicine of Ministry of Education and Beijing, Dongzhimen Hospital, Beijing University of Chinese Medicine, Beijing, China
  - Renal Research Institution of Beijing University of Chinese Medicine, Dongzhimen Hospital, Affiliated to Beijing University of Chinese Medicine, Beijing, China
  - Beijing University of Chinese Medicine, Beijing, China
- Jiaming Su
  - Renal Research Institution of Beijing University of Chinese Medicine, Dongzhimen Hospital, Affiliated to Beijing University of Chinese Medicine, Beijing, China
  - Beijing University of Chinese Medicine, Beijing, China
- Zhongjie Liu
  - Beijing University of Chinese Medicine, Beijing, China
- Shaowei Ding
  - Renal Research Institution of Beijing University of Chinese Medicine, Dongzhimen Hospital, Affiliated to Beijing University of Chinese Medicine, Beijing, China
  - Beijing University of Chinese Medicine, Beijing, China
- Yaotan Li
  - Renal Research Institution of Beijing University of Chinese Medicine, Dongzhimen Hospital, Affiliated to Beijing University of Chinese Medicine, Beijing, China
  - Beijing University of Chinese Medicine, Beijing, China
- Baoluo Hou
  - Renal Research Institution of Beijing University of Chinese Medicine, Dongzhimen Hospital, Affiliated to Beijing University of Chinese Medicine, Beijing, China
  - Beijing University of Chinese Medicine, Beijing, China
- Yuxin Hu
  - Renal Research Institution of Beijing University of Chinese Medicine, Dongzhimen Hospital, Affiliated to Beijing University of Chinese Medicine, Beijing, China
  - Beijing University of Chinese Medicine, Beijing, China
- Zhaoxi Dong
  - Renal Research Institution of Beijing University of Chinese Medicine, Dongzhimen Hospital, Affiliated to Beijing University of Chinese Medicine, Beijing, China
  - Beijing University of Chinese Medicine, Beijing, China
- Jingyi Tang
  - Renal Research Institution of Beijing University of Chinese Medicine, Dongzhimen Hospital, Affiliated to Beijing University of Chinese Medicine, Beijing, China
  - Beijing University of Chinese Medicine, Beijing, China
- Hongfang Liu
  - Key Laboratory of Chinese Internal Medicine of Ministry of Education and Beijing, Dongzhimen Hospital, Beijing University of Chinese Medicine, Beijing, China
  - Renal Research Institution of Beijing University of Chinese Medicine, Dongzhimen Hospital, Affiliated to Beijing University of Chinese Medicine, Beijing, China
- Weijing Liu
  - Key Laboratory of Chinese Internal Medicine of Ministry of Education and Beijing, Dongzhimen Hospital, Beijing University of Chinese Medicine, Beijing, China
  - Renal Research Institution of Beijing University of Chinese Medicine, Dongzhimen Hospital, Affiliated to Beijing University of Chinese Medicine, Beijing, China
  - Beijing University of Chinese Medicine, Beijing, China
3
Lange E, Kranert L, Krüger J, Benndorf D, Heyer R. Microbiome modeling: a beginner's guide. Front Microbiol 2024; 15:1368377. [PMID: 38962127 PMCID: PMC11220171 DOI: 10.3389/fmicb.2024.1368377] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2024] [Accepted: 05/27/2024] [Indexed: 07/05/2024] Open
Abstract
Microbiomes, comprised of diverse microbial species and viruses, play pivotal roles in human health, environmental processes, and biotechnological applications and interact with each other, their environment, and hosts via ecological interactions. Our understanding of microbiomes is still limited and hampered by their complexity. A concept improving this understanding is systems biology, which focuses on the holistic description of biological systems utilizing experimental and computational methods. An important set of such experimental methods are metaomics methods which analyze microbiomes and output lists of molecular features. These lists of data are integrated, interpreted, and compiled into computational microbiome models, to predict, optimize, and control microbiome behavior. There exists a gap in understanding between microbiologists and modelers/bioinformaticians, stemming from a lack of interdisciplinary knowledge. This knowledge gap hinders the establishment of computational models in microbiome analysis. This review aims to bridge this gap and is tailored for microbiologists, researchers new to microbiome modeling, and bioinformaticians. To achieve this goal, it provides an interdisciplinary overview of microbiome modeling, starting with fundamental knowledge of microbiomes, metaomics methods, common modeling formalisms, and how models facilitate microbiome control. It concludes with guidelines and repositories for modeling. Each section provides entry-level information, example applications, and important references, serving as a valuable resource for comprehending and navigating the complex landscape of microbiome research and modeling.
Affiliation(s)
- Emanuel Lange
  - Multidimensional Omics Data Analysis, Department for Bioanalytics, Leibniz-Institut für Analytische Wissenschaften - ISAS - e.V., Dortmund, Germany
  - Graduate School Digital Infrastructure for the Life Sciences, Bielefeld Institute for Bioinformatics Infrastructure (BIBI), Faculty of Technology, Bielefeld University, Bielefeld, Germany
- Lena Kranert
  - Institute for Automation Engineering, Otto von Guericke University Magdeburg, Magdeburg, Germany
- Jacob Krüger
  - Engineering of Software-Intensive Systems, Department of Mathematics and Computer Science, Eindhoven University of Technology, Eindhoven, Netherlands
- Dirk Benndorf
  - Applied Biosciences and Bioprocess Engineering, Anhalt University of Applied Sciences, Köthen, Germany
- Robert Heyer
  - Multidimensional Omics Data Analysis, Department for Bioanalytics, Leibniz-Institut für Analytische Wissenschaften - ISAS - e.V., Dortmund, Germany
  - Graduate School Digital Infrastructure for the Life Sciences, Bielefeld Institute for Bioinformatics Infrastructure (BIBI), Faculty of Technology, Bielefeld University, Bielefeld, Germany
  - Multidimensional Omics Data Analysis, Faculty of Technology, Bielefeld University, Bielefeld, Germany
4
He H, Duo H, Hao Y, Zhang X, Zhou X, Zeng Y, Li Y, Li B. Computational drug repurposing by exploiting large-scale gene expression data: Strategy, methods and applications. Comput Biol Med 2023; 155:106671. [PMID: 36805225 DOI: 10.1016/j.compbiomed.2023.106671] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 11/25/2022] [Revised: 02/05/2023] [Accepted: 02/10/2023] [Indexed: 02/18/2023]
Abstract
De novo drug development is an extremely complex, time-consuming and costly task. Urgent needs for therapies of various diseases have greatly accelerated searches for more effective drug development methods. Luckily, drug repurposing provides a new and effective perspective on disease treatment. Rapidly increased large-scale transcriptome data paints a detailed prospect of gene expression during disease onset and thus has received wide attention in the field of computational drug repurposing. However, how to efficiently mine transcriptome data and identify new indications for old drugs remains a critical challenge. This review discussed the irreplaceable role of transcriptome data in computational drug repurposing and summarized some representative databases, tools and strategies. More importantly, it proposed a practical guideline through establishing the correspondence between three gene expression data types and five strategies, which would facilitate researchers to adopt appropriate strategies to deeply mine large-scale transcriptome data and discover more effective therapies.
Affiliation(s)
- Hao He
  - College of Life Sciences, Chongqing Normal University, Chongqing, 400044, PR China
  - State Key Laboratory of Medical Neurobiology and MOE Frontiers Center for Brain Science, Institutes of Brain Science, Fudan University, Shanghai, 200032, PR China
- Hongrui Duo
  - College of Life Sciences, Chongqing Normal University, Chongqing, 400044, PR China
- Youjin Hao
  - College of Life Sciences, Chongqing Normal University, Chongqing, 400044, PR China
- Xiaoxi Zhang
  - College of Life Sciences, Chongqing Normal University, Chongqing, 400044, PR China
- Xinyi Zhou
  - College of Life Sciences, Chongqing Normal University, Chongqing, 400044, PR China
- Yujie Zeng
  - College of Life Sciences, Chongqing Normal University, Chongqing, 400044, PR China
- Yinghong Li
  - The Key Laboratory on Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, 400065, PR China
- Bo Li
  - College of Life Sciences, Chongqing Normal University, Chongqing, 400044, PR China
5
Alshawaqfeh M, Rababah S, Hayajneh A, Gharaibeh A, Serpedin E. MetaAnalyst: a user-friendly tool for metagenomic biomarker detection and phenotype classification. BMC Med Res Methodol 2022; 22:336. [PMID: 36577938 PMCID: PMC9795700 DOI: 10.1186/s12874-022-01812-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/24/2022] [Accepted: 11/28/2022] [Indexed: 12/29/2022] Open
Abstract
BACKGROUND Many metagenomic studies have linked the imbalance in microbial abundance profiles to a wide range of diseases. These studies suggest utilizing the microbial abundance profiles as potential markers for metagenomic-associated conditions. Due to the inevitable importance of biomarkers in understanding the disease progression and the development of possible therapies, various computational tools have been proposed for metagenomic biomarker detection. However, most existing tools require prior scripting knowledge and lack user-friendly interfaces, so installing, configuring, and running them takes considerable time and effort. Besides, there is no available all-in-one solution for running and comparing various metagenomic biomarker detection tools simultaneously. In addition, most of these tools just present the suggested biomarkers without any statistical evaluation of their quality. RESULTS To overcome these limitations, this work presents MetaAnalyst, a software package with a simple graphical user interface (GUI) that (i) automates the installation and configuration of 28 state-of-the-art tools, (ii) supports flexible study design to enable studying the dataset under different scenarios smoothly, (iii) runs and evaluates several algorithms simultaneously, (iv) supports different input formats and provides the user with several preprocessing capabilities, (v) provides a variety of metrics to evaluate the quality of the suggested markers, and (vi) presents the outcomes in the form of publication-quality plots with various formatting capabilities as well as Excel sheets. CONCLUSIONS The utility of this tool has been verified through studying a metagenomic dataset under four scenarios. The executable file for MetaAnalyst along with its user manual are made available at https://github.com/mshawaqfeh/MetaAnalyst.
Affiliation(s)
- Mustafa Alshawaqfeh
  - School of Electrical Engineering and Information Technology, German Jordanian University, Amman, Jordan (GRID: grid.440896.7; ISNI: 0000 0004 0418 154X)
- Salahelden Rababah
  - School of Electrical Engineering and Information Technology, German Jordanian University, Amman, Jordan (GRID: grid.440896.7; ISNI: 0000 0004 0418 154X)
  - Department of Systems Science and Industrial Engineering, State University of New York at Binghamton, Binghamton, NY, USA (GRID: grid.264260.4; ISNI: 0000 0001 2164 4508)
- Abdullah Hayajneh
  - Electrical and Computer Engineering Department, Texas A&M University, College Station, TX, USA (GRID: grid.264756.4; ISNI: 0000 0004 4687 2082)
- Ammar Gharaibeh
  - School of Electrical Engineering and Information Technology, German Jordanian University, Amman, Jordan (GRID: grid.440896.7; ISNI: 0000 0004 0418 154X)
- Erchin Serpedin
  - Electrical and Computer Engineering Department, Texas A&M University, College Station, TX, USA (GRID: grid.264756.4; ISNI: 0000 0004 4687 2082)
6
Vijayan A, Fatima S, Sowmya A, Vafaee F. Blood-based transcriptomic signature panel identification for cancer diagnosis: benchmarking of feature extraction methods. Brief Bioinform 2022; 23:6658855. [PMID: 35945147 DOI: 10.1093/bib/bbac315] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2022] [Revised: 07/11/2022] [Accepted: 07/12/2022] [Indexed: 11/13/2022] Open
Abstract
Liquid biopsy has shown promise for cancer diagnosis due to its minimally invasive nature and the potential for novel biomarker discovery. However, the low concentration of relevant blood-based biosources and the heterogeneity of samples (i.e. the variability of relative abundance of molecules identified), pose major challenges to biomarker discovery. Moreover, the number of molecular measurements or features (e.g. transcript read counts) per sample could be in the order of several thousand, whereas the number of samples is often substantially lower, leading to the curse of dimensionality. These challenges, among others, elucidate the importance of a robust biomarker panel identification or feature extraction step wherein relevant molecular measurements are identified prior to classification for cancer detection. In this work, we performed a benchmarking study on 12 feature extraction methods using transcriptomic profiles derived from different blood-based biosources. The methods were assessed both in terms of their predictive performance and the robustness of the biomarker panels in diagnosing cancer or stratifying cancer subtypes. While performing the comparison, the feature extraction methods are categorized into feature subset selection methods and transformation methods. A transformation feature extraction method, namely partial least square discriminant analysis, was found to perform consistently superior in terms of classification performance. As part of the benchmarking study, a generic pipeline has been created and made available as an R package to ensure reproducibility of the results and allow for easy extension of this study to other datasets (https://github.com/VafaeeLab/bloodbased-pancancer-diagnosis).
Affiliation(s)
- Abhishek Vijayan
  - School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW, Australia
  - School of Computer Science and Engineering, University of New South Wales, Sydney, NSW, Australia
- Shadma Fatima
  - School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW, Australia
  - Ingham Institute, NSW, Australia
- Arcot Sowmya
  - School of Computer Science and Engineering, University of New South Wales, Sydney, NSW, Australia
  - UNSW Data Science Hub, University of New South Wales, Sydney, NSW, Australia
- Fatemeh Vafaee
  - School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW, Australia
  - UNSW Data Science Hub, University of New South Wales, Sydney, NSW, Australia
7
An ensemble framework for microarray data classification based on feature subspace partitioning. Comput Biol Med 2022; 148:105820. [PMID: 35872409 DOI: 10.1016/j.compbiomed.2022.105820] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 03/20/2022] [Revised: 06/05/2022] [Accepted: 07/03/2022] [Indexed: 12/14/2022]
Abstract
Feature selection is exposed to the curse of dimensionality risk, and it is even more exacerbated with high-dimensional data such as microarrays. Moreover, the low-instance/high-feature (LIHF) property of microarray data needs considerable processing time to do some calculations and comparisons among features to choose the best subset of them, which has led to many efforts to subdue the LIHF property of such genomic medicine data. Due to the promising results of the ensemble models in machine learning problems, this paper presents a novel framework, named feature-level aggregation-based ensemble based on overlapped feature subspace partitioning (FLAE-OFSP) for microarray data classification. The proposed ensemble has three main steps: after generating several subsets by the proposed partitioning approach, a feature selection algorithm (i.e., a feature ranker) is applied on each subset, and finally, their results are combined into a single ranked list using six defined aggregation functions. Evaluation of the presented framework based on seven microarray datasets and using four measures, including stability, classification accuracy, runtime, and Modscore shows substantial runtime improvement and also quality results in other evaluated measures compared to individual methods.
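The three steps named in this abstract (overlapped partitioning, per-subset ranking, aggregation into one ranked list) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' FLAE-OFSP implementation: the contiguous partitioning scheme, the variance-based ranker, and the use of `max` as the aggregation function are all hypothetical stand-ins, since the abstract does not specify the partitioning rule, the ranker, or the six aggregation functions.

```python
from statistics import mean


def overlapped_partitions(n_features, n_parts, overlap):
    """Split feature indices into contiguous blocks, each extended by
    `overlap` indices into the next block (assumed scheme, for illustration)."""
    size = -(-n_features // n_parts)  # ceiling division
    parts = []
    for start in range(0, n_features, size):
        stop = min(start + size + overlap, n_features)
        parts.append(list(range(start, stop)))
    return parts


def variance_ranker(X, feature_ids):
    """Toy feature ranker (assumption): score = variance across samples."""
    scores = {}
    for j in feature_ids:
        col = [row[j] for row in X]
        m = mean(col)
        scores[j] = sum((v - m) ** 2 for v in col) / len(col)
    return scores


def aggregate(score_dicts, how=max):
    """Merge per-subset scores into one ranked list. Features that sit in an
    overlap get several scores; `how` (max, min, mean, ...) reduces them."""
    pooled = {}
    for scores in score_dicts:
        for j, s in scores.items():
            pooled.setdefault(j, []).append(s)
    combined = {j: how(v) for j, v in pooled.items()}
    # Highest score first; ties broken by feature index.
    return sorted(combined, key=lambda j: (-combined[j], j))


# Made-up data: 3 samples x 6 features.
X = [
    [0.0, 5.0, 1.0, 2.0, 9.0, 1.0],
    [0.0, 1.0, 1.1, 2.0, 1.0, 1.0],
    [0.0, 3.0, 0.9, 2.0, 5.0, 1.0],
]
parts = overlapped_partitions(n_features=6, n_parts=3, overlap=1)
ranking = aggregate([variance_ranker(X, p) for p in parts])
```

Because each ranker call only sees one subset, the per-subset work is independent and could be run in parallel, which is one plausible source of the runtime gain the abstract reports.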
8
Plancade S, Berland M, Blein-Nicolas M, Langella O, Bassignani A, Juste C. A combined test for feature selection on sparse metaproteomics data-an alternative to missing value imputation. PeerJ 2022; 10:e13525. [PMID: 35769140 PMCID: PMC9235818 DOI: 10.7717/peerj.13525] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/20/2021] [Accepted: 05/11/2022] [Indexed: 01/18/2023] Open
Abstract
One of the difficulties encountered in the statistical analysis of metaproteomics data is the high proportion of missing values, which are usually treated by imputation. Nevertheless, imputation methods are based on restrictive assumptions regarding missingness mechanisms, namely "at random" or "not at random". To circumvent these limitations in the context of feature selection in a multi-class comparison, we propose a univariate selection method that combines a test of association between missingness and classes, and a test for difference of observed intensities between classes. This approach implicitly handles both missingness mechanisms. We performed a quantitative and qualitative comparison of our procedure with imputation-based feature selection methods on two experimental data sets, as well as simulated data with various scenarios regarding the missingness mechanisms and the nature of the difference of expression (differential intensity or differential presence). Whereas we observed similar performances in terms of prediction on the experimental data set, the feature ranking and selection from various imputation-based methods were strongly divergent. We showed that the combined test reaches a compromise by correlating reasonably with other methods, and remains efficient in all simulated scenarios unlike imputation-based feature selection methods.
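The idea of pairing a missingness-versus-class test with an observed-intensity test can be illustrated with a small two-class sketch. Everything specific here is an assumption made for illustration: the abstract does not say which statistics are used or how the two tests are combined, so this version uses simple permutation tests and a Bonferroni-style minimum of the two p-values, and the data are made up.

```python
import random


def missing_gap(values, labels):
    """Absolute difference in missing-value proportion between classes 0 and 1."""
    rates = []
    for c in (0, 1):
        grp = [v for v, l in zip(values, labels) if l == c]
        rates.append(sum(v is None for v in grp) / len(grp))
    return abs(rates[0] - rates[1])


def intensity_gap(values, labels):
    """Absolute difference in mean observed intensity between classes 0 and 1."""
    means = []
    for c in (0, 1):
        obs = [v for v, l in zip(values, labels) if l == c and v is not None]
        if not obs:  # a class with no observed value carries no intensity signal
            return 0.0
        means.append(sum(obs) / len(obs))
    return abs(means[0] - means[1])


def permutation_p(stat, values, labels, n_perm=2000, seed=0):
    """Permutation p-value: share of label shuffles with a statistic at
    least as large as the observed one (with the usual +1 correction)."""
    rng = random.Random(seed)
    observed = stat(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if stat(values, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def combined_p(values, labels, **kw):
    """Assumed combination rule: twice the smaller p-value, capped at 1."""
    p_miss = permutation_p(missing_gap, values, labels, **kw)
    p_int = permutation_p(intensity_gap, values, labels, **kw)
    return min(1.0, 2 * min(p_miss, p_int))


labels = [0] * 6 + [1] * 6
# Differential presence: the feature is mostly missing (None) in class 1.
present_in_0 = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, None, None, None, None, None, 10.0]
# No difference: same missingness rate and same intensities in both classes.
no_signal = [10.0, None, 10.2, 9.9, None, 10.1, 10.1, None, 9.9, 10.2, None, 10.0]

p_signal = combined_p(present_in_0, labels)
p_null = combined_p(no_signal, labels)
```

In the first example the intensity test alone has almost nothing to work with, because class 1 has a single observed value; the missingness test is what drives the small combined p-value, which is the situation imputation-free selection is meant to handle.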
Affiliation(s)
- Sandra Plancade
  - UR875 MIAT, Université fédérale de Toulouse, INRAE, Castanet-Tolosan, France
- Magali Berland
  - Université Paris-Saclay, INRAE, MGP, Jouy en Josas, France
- Mélisande Blein-Nicolas
  - Université Paris-Saclay, CNRS, INRAE, AgroParisTech, GQE-Le Moulon, Gif-sur-Yvette, France
  - Université Paris-Saclay, CNRS, INRAE, AgroParisTech, PAPPSO, Gif-sur-Yvette, France
- Olivier Langella
  - Université Paris-Saclay, CNRS, INRAE, AgroParisTech, GQE-Le Moulon, Gif-sur-Yvette, France
  - Université Paris-Saclay, CNRS, INRAE, AgroParisTech, PAPPSO, Gif-sur-Yvette, France
- Ariane Bassignani
  - Université Paris-Saclay, INRAE, MGP, Jouy en Josas, France
  - Université Paris-Saclay, CNRS, INRAE, AgroParisTech, PAPPSO, Gif-sur-Yvette, France
- Catherine Juste
  - Micalis Institute, Université Paris-Saclay, INRAE, AgroParis Tech, Jouy-en-Josas, France
9
Understanding the mutational frequency in SARS-CoV-2 proteome using structural features. Comput Biol Med 2022; 147:105708. [PMID: 35714506 PMCID: PMC9173821 DOI: 10.1016/j.compbiomed.2022.105708] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 03/06/2022] [Revised: 04/26/2022] [Accepted: 06/04/2022] [Indexed: 01/18/2023]
Abstract
The prolonged transmission of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) virus in the human population has led to demographic divergence and the emergence of several location-specific clusters of viral strains. Although the effect of mutation(s) on severity and survival of the virus is still unclear, it is evident that certain sites in the viral proteome are more/less prone to mutations. In fact, millions of SARS-CoV-2 sequences collected all over the world have provided us with a unique opportunity to understand viral protein mutations and develop novel computational approaches to predict mutational patterns. In this study, we have classified the mutation sites into low and high mutability classes based on the count of viral isolates containing mutations. The physicochemical features and structural analysis of the SARS-CoV-2 proteins showed that features including residue type, surface accessibility, residue bulkiness, stability and sequence conservation at the mutation site were able to classify the low and high mutability sites. We further developed machine learning models using the above-mentioned features to predict low and high mutability sites at different selection thresholds (ranging 5-30% of topmost and bottommost mutated sites) and observed an improvement in performance as the selection threshold is reduced (prediction accuracy ranging from 65 to 77%). The analysis will be useful for early detection of SARS-CoV-2 variants of concern, and can also be applied to other existing and emerging viruses to help prevent future pandemics.
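The labelling step described here, where only the topmost and bottommost fraction of sites by isolate count are kept as the two classes, can be sketched as follows. This is a hedged illustration only: the function name, the site identifiers, and the counts are made up, and the exact ranking and tie-handling used in the study are not given in the abstract.

```python
def label_mutability(isolate_counts, threshold):
    """Label the top `threshold` fraction of sites (by number of isolates
    carrying a mutation) 'high' and the bottom fraction 'low'; sites in the
    middle are dropped. Assumes 0 < threshold <= 0.5."""
    ranked = sorted(isolate_counts, key=lambda s: (isolate_counts[s], s))
    k = max(1, int(len(ranked) * threshold))
    labels = {site: "low" for site in ranked[:k]}
    labels.update({site: "high" for site in ranked[-k:]})
    return labels


# Hypothetical site identifiers and isolate counts, for illustration only.
counts = {
    "site_01": 3, "site_02": 5, "site_03": 7, "site_04": 12, "site_05": 40,
    "site_06": 2500, "site_07": 60000, "site_08": 350000,
    "site_09": 400000, "site_10": 900000,
}
labels_30 = label_mutability(counts, 0.30)
labels_10 = label_mutability(counts, 0.10)
```

A smaller threshold keeps only the most extreme sites, so the two classes are further apart; that is consistent with the abstract's observation that accuracy improves as the threshold is reduced, at the cost of fewer training examples.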
10
Xu C, Zhang R, Duan M, Zhou Y, Bao J, Lu H, Wang J, Hu M, Hu Z, Zhou F, Zhu W. A polygenic stacking classifier revealed the complicated platelet transcriptomic landscape of adult immune thrombocytopenia. MOLECULAR THERAPY - NUCLEIC ACIDS 2022; 28:477-487. [PMID: 35505964 PMCID: PMC9046129 DOI: 10.1016/j.omtn.2022.04.004] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/23/2021] [Accepted: 04/01/2022] [Indexed: 01/19/2023]
Abstract
Immune thrombocytopenia (ITP) is an autoimmune disease whose typical symptom is a low platelet count in the blood. ITP demonstrates age and sex biases in both occurrence and prognosis, and adult ITP is mainly induced by the living environment. The current diagnosis guideline lacks the integration of molecular heterogeneity. This study recruited the largest cohort of platelet transcriptome samples. A comprehensive procedure of feature selection, feature engineering, and stacking classification was carried out to detect ITP biomarkers using RNA sequencing (RNA-seq) transcriptomes. The 40 detected biomarkers were used to train the final ITP detection model, with an overall accuracy of 0.974. The biomarkers suggested that ITP onset may be associated with various transcribed components, including protein-coding genes, long intergenic non-coding RNA (lincRNA) genes, and pseudogenes with apparent transcription. The delivered ITP detection model may also be utilized as a complementary ITP diagnosis tool. The code and the example dataset are freely available at http://www.healthinformaticslab.org/supp/resources.php
Affiliation(s)
- Chengfeng Xu
- Department of Hematology, Yueyang Hospital of Integrated Traditional Chinese and Western Medicine, Shanghai University of Traditional Chinese Medicine, 110 Ganhe Road, Hongkou District, Shanghai 200437, China
| | - Ruochi Zhang
- College of Computer Science and Technology, Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin 130012, China
| | - Meiyu Duan
- College of Computer Science and Technology, Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin 130012, China
| | - Yongming Zhou
- Department of Hematology, Yueyang Hospital of Integrated Traditional Chinese and Western Medicine, Shanghai University of Traditional Chinese Medicine, 110 Ganhe Road, Hongkou District, Shanghai 200437, China
| | - Jizhang Bao
- Department of Hematology, Yueyang Hospital of Integrated Traditional Chinese and Western Medicine, Shanghai University of Traditional Chinese Medicine, 110 Ganhe Road, Hongkou District, Shanghai 200437, China
| | - Hao Lu
- Department of Hematology, Yueyang Hospital of Integrated Traditional Chinese and Western Medicine, Shanghai University of Traditional Chinese Medicine, 110 Ganhe Road, Hongkou District, Shanghai 200437, China
| | - Jie Wang
- Department of Hematology, Yueyang Hospital of Integrated Traditional Chinese and Western Medicine, Shanghai University of Traditional Chinese Medicine, 110 Ganhe Road, Hongkou District, Shanghai 200437, China
| | - Minghui Hu
- Department of Hematology, Yueyang Hospital of Integrated Traditional Chinese and Western Medicine, Shanghai University of Traditional Chinese Medicine, 110 Ganhe Road, Hongkou District, Shanghai 200437, China
| | - Zhaoyang Hu
- Fun-Med Pharmaceutical Technology (Shanghai) Co., Ltd., RM. A310, 115 Xinjunhuan Road, Minhang District, Shanghai 201100, China
- Corresponding author Zhaoyang Hu, PhD, Fengneng Pharmaceutical Technology (Shanghai) Co., Ltd., RM. A310, 115 Xinjunhuan Road, Minhang District, Shanghai 201100, China.
| | - Fengfeng Zhou
- College of Computer Science and Technology, Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin 130012, China
- Corresponding author Fengfeng Zhou, PhD, College of Computer Science and Technology, and Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin 130012, China.
| | - Wenwei Zhu
- Department of Hematology, Yueyang Hospital of Integrated Traditional Chinese and Western Medicine, Shanghai University of Traditional Chinese Medicine, 110 Ganhe Road, Hongkou District, Shanghai 200437, China
- Corresponding author Wenwei Zhu, PhD, Department of Hematology, Yueyang Hospital of Integrated Traditional Chinese and Western Medicine, Shanghai University of Traditional Chinese Medicine, 110 Ganhe Road, Hongkou District, Shanghai 200437, China.
| |
|
11
|
Su Y, Du K, Wang J, Wei JM, Liu J. Multi-variable AUC for sifting complementary features and its biomedical application. Brief Bioinform 2022; 23:6536295. [PMID: 35212712 DOI: 10.1093/bib/bbac029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/22/2021] [Revised: 01/14/2022] [Accepted: 01/18/2022] [Indexed: 11/13/2022] Open
Abstract
Although sifting functional genes has been discussed for years, traditional selection methods tend to be ineffective in capturing potential specific genes. First, typical methods focus on finding features (genes) relevant to class while irrelevant to each other. However, the features that can offer rich discriminative information are more likely to be the complementary ones. Next, almost all existing methods assess feature relations in pairs, yielding an inaccurate local estimation and lacking a global exploration. In this paper, we introduce multi-variable Area Under the receiver operating characteristic Curve (AUC) to globally evaluate the complementarity among features by employing Area Above the receiver operating characteristic Curve (AAC). Due to AAC, the class-relevant information newly provided by a candidate feature and that preserved by the selected features can be achieved beyond pairwise computation. Furthermore, we propose an AAC-based feature selection algorithm, named Multi-variable AUC-based Combined Features Complementarity, to screen discriminative complementary feature combinations. Extensive experiments on public datasets demonstrate the effectiveness of the proposed approach. Besides, we provide a gene set about prostate cancer and discuss its potential biological significance from the machine learning aspect and based on the existing biomedical findings of some individual genes.
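As background for the AUC and AAC quantities named in this abstract, here is a minimal single-feature sketch. It is not the multi-variable method proposed in the paper, whose details the abstract does not give; it only shows the standard pairwise (rank-based) AUC estimate and takes AAC to be the area of the unit square left above the ROC curve, that is, 1 - AUC. The function names and toy scores are hypothetical.

```python
def auc(scores, labels):
    """Pairwise AUC estimate for one feature.

    Counts the fraction of (positive, negative) pairs in which the
    positive sample has the higher score; ties count as half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def aac(scores, labels):
    """Area above the ROC curve within the unit square."""
    return 1.0 - auc(scores, labels)


# Toy feature values and class labels (illustrative only).
toy_scores = [0.9, 0.8, 0.4, 0.3]
toy_labels = [1, 0, 1, 0]
print(auc(toy_scores, toy_labels), aac(toy_scores, toy_labels))
```

In this reading, AAC measures the class-relevant information a feature fails to capture, which is presumably what a complementary feature would be expected to supply.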
Affiliation(s)
- Yue Su
- College of Computer Science at Nankai University, China
| | - Keyu Du
- College of Computer Science at Nankai University, China
| | - Jun Wang
- College of Mathematics and Statistics Science at Ludong University, China
| | - Jin-Mao Wei
- College of Computer Science at Nankai University, China
| | - Jian Liu
- College of Computer Science at Nankai University, China
| |
|
12
|
Alvarez-Gonzalez R, Mendez-Vazquez A. Deep Learning Architecture Reduction for fMRI Data. Brain Sci 2022; 12:brainsci12020235. [PMID: 35203997 PMCID: PMC8870362 DOI: 10.3390/brainsci12020235] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/16/2021] [Accepted: 01/12/2022] [Indexed: 11/16/2022] Open
Abstract
In recent years, deep learning models have demonstrated an inherently better ability to tackle non-linear classification tasks, due to advances in deep learning architectures. However, much remains to be achieved, especially in designing deep convolutional neural network (CNN) configurations. The number of hyper-parameters that need to be optimized to achieve accuracy in classification problems increases with every layer used, and the selection of kernels in each CNN layer has an impact on the overall CNN performance in the training stage, as well as in the classification process. When a popular classifier fails to perform acceptably in practical applications, it may be due to deficiencies in the algorithm and data processing. Thus, understanding the feature extraction process provides insights to help optimize pre-trained architectures, better generalize the models, and obtain the context of each layer’s features. In this work, we aim to improve feature extraction through the use of a texture amortization map (TAM). An algorithm was developed to obtain characteristics from the filters amortizing the filter’s effect depending on the texture of the neighboring pixels. From the initial algorithm, a novel geometric classification score (GCS) was developed, in order to obtain a measure that indicates the effect of one class on another in a classification problem, in terms of the complexity of the learnability in every layer of the deep learning architecture. For this, we assume that all the data transformations in the inner layers still belong to a Euclidean space. In this scenario, we can evaluate which layers provide the best transformations in a CNN, allowing us to reduce the weights of the deep learning architecture using the geometric hypothesis.
|
13
|
Wang R, Wang Z, Li Z, Lee TY. Residue-Residue Contact Can Be a Potential Feature for the Prediction of Lysine Crotonylation Sites. Front Genet 2022; 12:788467. [PMID: 35058968 PMCID: PMC8764140 DOI: 10.3389/fgene.2021.788467] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 10/12/2021] [Accepted: 11/23/2021] [Indexed: 11/13/2022] Open
Abstract
Lysine crotonylation (Kcr) is involved in many activities in the human body. Various technologies have been developed for Kcr prediction. Existing methods typically adopt sequence-based features, which consider only the composition of linearly neighboring amino acids. However, modified Kcr sites are neighbored not only by linearly adjacent amino acids but also by residues that spatially surround the target site. In this paper, we used residue-residue contact as a new feature for Kcr prediction, encoding not only the linearly surrounding residues but also those spatially near the target site. The spatially surrounding residues were used for the first time as a feature encoding scheme, yielding residue-residue composition (RRC) and residue-residue pair composition (RRPC), which were used in supervised learning classification for Kcr prediction. RRC achieved an accuracy of 0.77 and an area under curve (AUC) value of 0.78, while RRPC achieved an accuracy of 0.74 and an AUC value of 0.80. To show that the spatial feature is as significant as other sequence-based features, feature selection was carried out on those sequence-based features together with the RRPC feature. In addition, different radii for the surrounding amino acid compositions were used to compare performance. The RRC and RRPC features performed competitively with the others, and in some cases were around 0.20 higher in accuracy or 0.3 higher in AUC than sequence-based features.
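One way to picture a spatial composition feature of this kind is sketched below. The abstract does not give the exact RRC definition, so this is an assumption: the feature is taken to be the amino acid composition of all residues whose coordinates lie within a chosen radius of the target lysine. The function name `residue_residue_composition`, the input layout, and the toy coordinates are hypothetical.

```python
import math

# The 20 standard amino acids, in a fixed order for the feature vector.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def residue_residue_composition(residues, target_index, radius):
    """Composition of residues spatially near a target residue.

    residues     : list of (one_letter_code, (x, y, z)) tuples, one per
                   residue (for example, alpha-carbon coordinates)
    target_index : index of the target lysine in `residues`
    radius       : distance cutoff, in the same units as the coordinates
    Returns a 20-element list of fractions (all zeros if no neighbors).
    """
    _, center = residues[target_index]
    counts = dict.fromkeys(AMINO_ACIDS, 0)
    for i, (aa, xyz) in enumerate(residues):
        if i == target_index or aa not in counts:
            continue  # skip the target itself and non-standard residues
        if math.dist(center, xyz) <= radius:
            counts[aa] += 1
    total = sum(counts.values())
    return [counts[a] / total if total else 0.0 for a in AMINO_ACIDS]


# Toy structure: a lysine at the origin with four other residues.
toy_residues = [
    ("K", (0, 0, 0)),
    ("A", (1, 0, 0)),
    ("A", (0, 2, 0)),
    ("G", (0, 0, 3)),
    ("W", (10, 0, 0)),  # too far away to count at radius 3.5
]
vector = residue_residue_composition(toy_residues, 0, 3.5)
print(vector)
```

Varying `radius` changes which residues count as neighbors, which corresponds to the comparison across different radii mentioned in the abstract.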
Affiliation(s)
- Rulan Wang
- School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, China
| | - Zhuo Wang
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, China
| | - Zhongyan Li
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, China.,School of Life and Health Sciences, The Chinese University of Hong Kong, Shenzhen, China
| | - Tzong-Yi Lee
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, China.,School of Life and Health Sciences, The Chinese University of Hong Kong, Shenzhen, China
| |
|
14
|
Yin J, Li X, Li F, Lu Y, Zeng S, Zhu F. Identification of the key target profiles underlying the drugs of narrow therapeutic index for treating cancer and cardiovascular disease. Comput Struct Biotechnol J 2021; 19:2318-2328. [PMID: 33995923 PMCID: PMC8105181 DOI: 10.1016/j.csbj.2021.04.035] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 12/16/2020] [Revised: 04/09/2021] [Accepted: 04/15/2021] [Indexed: 12/14/2022] Open
Abstract
An appropriate therapeutic index is crucial for drug discovery and development since narrow therapeutic index (NTI) drugs with slight dosage variation may induce severe adverse drug reactions or potential treatment failure. To date, the shared characteristics underlying the targets of NTI drugs have been explored by several studies, which have been applied to identify potential drug targets. However, the association between the drug therapeutic index and the related disease has not been dissected, which is important for revealing the NTI drug mechanism and optimizing drug design. Therefore, in this study, two classes of disease (cancers and cardiovascular disorders) with the largest number of NTI drugs were selected, and the target property of the corresponding NTI drugs was analyzed. By calculating the biological system profiles and human protein–protein interaction (PPI) network properties of drug targets and adopting an AI-based algorithm, differentiated features between two diseases were discovered to reveal the distinct underlying mechanisms of NTI drugs in different diseases. Consequently, ten shared features and four unique features were identified for both diseases to distinguish NTI from NNTI drug targets. These computational discoveries, as well as the newly found features, suggest that in the clinical study of avoiding narrow therapeutic index in those diseases, the ability of target to be a hub and the efficiency of target signaling in the human PPI network should be considered, and it could thus provide novel guidance in the drug discovery and clinical research process and help to estimate the drug safety of cancer and cardiovascular disease.
Affiliation(s)
- Jiayi Yin
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
| | - Xiaoxu Li
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
| | - Fengcheng Li
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
| | - Yinjing Lu
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
| | - Su Zeng
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China.,Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Hangzhou 310018, China
| | - Feng Zhu
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China.,Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Hangzhou 310018, China.,Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou 330110, China
| |
|
15
|
Fu J, Zhang Y, Liu J, Lian X, Tang J, Zhu F. Pharmacometabonomics: data processing and statistical analysis. Brief Bioinform 2021; 22:6236068. [PMID: 33866355 DOI: 10.1093/bib/bbab138] [Citation(s) in RCA: 47] [Impact Index Per Article: 11.8] [Received: 01/17/2021] [Revised: 02/09/2021] [Accepted: 03/23/2021] [Indexed: 12/14/2022] Open
Abstract
Individual variations in drug efficacy, side effects and adverse drug reactions remain challenges that cannot be ignored in drug research and development. The aim of pharmacometabonomics is to better understand the pharmacokinetic properties of drugs and to monitor drug effects on specific metabolic pathways. Here, we systematically reviewed the recent technological advances in pharmacometabonomics for better understanding the pathophysiological mechanisms of diseases as well as the metabolic effects of drugs on the body. First, the advantages and disadvantages of all mainstream analytical techniques were compared. Second, many data processing strategies, including filtering, missing value imputation, quality control-based correction, transformation and normalization, were discussed together with the methods implemented in each step. Third, various feature selection and feature extraction algorithms commonly applied in pharmacometabonomics were described. Finally, the databases that facilitate current pharmacometabonomics were collected and discussed. All in all, this review provides guidance for researchers engaged in pharmacometabonomics and metabolomics, and it should promote the wide application of metabolomics in drug research and personalized medicine.
Affiliation(s)
- Jianbo Fu
- College of Pharmaceutical Sciences in Zhejiang University, China
| | - Ying Zhang
- College of Pharmaceutical Sciences in Zhejiang University, China
| | - Jin Liu
- College of Pharmaceutical Sciences in Zhejiang University, China
| | - Xichen Lian
- College of Pharmaceutical Sciences in Zhejiang University, China
| | - Jing Tang
- Department of Bioinformatics in Chongqing Medical University, China
| | - Feng Zhu
- College of Pharmaceutical Sciences in Zhejiang University, China
| |
|
16
|
RIFS2D: A two-dimensional version of a randomly restarted incremental feature selection algorithm with an application for detecting low-ranked biomarkers. Comput Biol Med 2021; 133:104405. [PMID: 33930763 DOI: 10.1016/j.compbiomed.2021.104405] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 02/03/2021] [Revised: 04/13/2021] [Accepted: 04/13/2021] [Indexed: 12/20/2022]
Abstract
The era of big data introduces both opportunities and challenges for biomedical researchers. One of the inherent difficulties in the biomedical research field is to recruit large cohorts of samples, while high-throughput biotechnologies may produce thousands or even millions of features for each sample. Researchers tend to evaluate the individual correlation of each feature with the class label and use the incremental feature selection (IFS) strategy to select the top-ranked features with the best prediction performance. Recent experimental data showed that a subset of continuously ranked features randomly restarted from a low-ranked feature (an RIFS block) may outperform the subset of top-ranked features. This study proposed a feature selection Algorithm RIFS2D by integrating multiple RIFS blocks. A comprehensive comparative experiment was conducted with the IFS, RIFS and existing feature selection algorithms and demonstrated that a subset of low-ranked features may also achieve promising prediction performance. This study suggested that a prediction model with promising performance may be trained by low-ranked features, even when top-ranked features did not achieve satisfying prediction performance. Further comparative experiments were conducted between RIFS2D and t-tests for the detection of early-stage breast cancer. The data showed that the RIFS2D-recommended features achieved better prediction accuracy and were targeted by more drugs than the t-test top-ranked features.
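The contrast between incremental feature selection (IFS) and a restarted block can be sketched as follows. This is a simplified illustration under stated assumptions, not the RIFS2D algorithm itself, whose integration of multiple blocks the abstract does not specify. The `evaluate` callback stands in for whatever cross-validated performance measure is used, and the names `best_block`, `ifs`, `restarted_search`, and `toy_evaluate` are hypothetical.

```python
import random


def best_block(ranked, evaluate, start):
    """Best run of consecutively ranked features beginning at `start`."""
    best_end = max(
        range(start + 1, len(ranked) + 1),
        key=lambda end: evaluate(ranked[start:end]),
    )
    return ranked[start:best_end]


def ifs(ranked, evaluate):
    """Incremental feature selection: the best top-ranked prefix."""
    return best_block(ranked, evaluate, 0)


def restarted_search(ranked, evaluate, restarts, seed=0):
    """Try the top-ranked start plus several random restart positions."""
    rng = random.Random(seed)
    starts = [0] + [rng.randrange(len(ranked)) for _ in range(restarts)]
    blocks = [best_block(ranked, evaluate, s) for s in starts]
    return max(blocks, key=evaluate)


# Toy setup: features are listed by individual rank, but only two
# lower-ranked ones are jointly useful; each extra feature costs 0.1.
toy_ranked = ["f1", "f2", "f3", "f4", "f5"]
toy_useful = {"f3", "f4"}


def toy_evaluate(subset):
    return len(toy_useful & set(subset)) - 0.1 * len(subset)


print(ifs(toy_ranked, toy_evaluate))
print(best_block(toy_ranked, toy_evaluate, 2))
```

In this toy case the block starting at the third-ranked feature scores higher than the best top-ranked prefix, which mirrors the abstract's observation that low-ranked features can outperform top-ranked ones. Because the search always includes the start at rank 0, it can never do worse than IFS under the same `evaluate`.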
|
17
|
Zhang S, Amahong K, Sun X, Lian X, Liu J, Sun H, Lou Y, Zhu F, Qiu Y. The miRNA: a small but powerful RNA for COVID-19. Brief Bioinform 2021; 22:1137-1149. [PMID: 33675361 PMCID: PMC7989616 DOI: 10.1093/bib/bbab062] [Citation(s) in RCA: 108] [Impact Index Per Article: 27.0] [Received: 12/28/2020] [Revised: 02/05/2021] [Accepted: 02/08/2021] [Indexed: 12/12/2022] Open
Abstract
Coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), is a severe and rapidly evolving epidemic. Although a few drugs and vaccines have been proven for its treatment and prevention, there has been little systematic commentary explaining human susceptibility to it. A few scattered studies have used bioinformatics methods to explore the role of microRNA (miRNA) in COVID-19 infection. Combining these timely reports with previous studies on viruses and miRNA, we comb through the available clues and make a seemingly reasonable case that SARS-CoV-2 exploits the interplay between small miRNAs and other biomolecules to avoid being effectively recognized and attacked by host immune protection, as well as to deactivate functional genes that are crucial for the immune system. Specifically, SARS-CoV-2 can be regarded as a sponge that adsorbs host immune-related miRNAs, forcing the host immune system into a dysfunctional state. In addition, SARS-CoV-2 encodes its own miRNAs, which can enter host cells without being perceived by the host's immune system and subsequently target host functional genes to cause illness. Therefore, this article presents the viewpoint that the miRNA-based interplay between the host and SARS-CoV-2 may be the primary means by which SARS-CoV-2 accesses and attacks host cells.
Affiliation(s)
- Song Zhang
- College of Pharmaceutical Sciences in Zhejiang University and the First Affiliated Hospital of Zhejiang University School of Medicine, China
| | | | - Xiuna Sun
- College of Pharmaceutical Sciences in Zhejiang University, China
| | - Xichen Lian
- College of Pharmaceutical Sciences in Zhejiang University, China
| | - Jin Liu
- College of Pharmaceutical Sciences in Zhejiang University, China
| | - Huaicheng Sun
- College of Pharmaceutical Sciences in Zhejiang University, China
| | - Yan Lou
- Key Laboratory for Diagnosis and Treatment of Infectious Diseases, Zhejiang Provincial Key Laboratory for Drug Clinical Research and Evaluation, the First Affiliated Hospital, Zhejiang University School of Medicine, China
| | - Feng Zhu
- College of Pharmaceutical Sciences in Zhejiang University, China
| | - Yunqing Qiu
- State Key Laboratory for Diagnosis and Treatment of Infectious Diseases, Zhejiang Provincial Key Laboratory for Drug Clinical Research and Evaluation, the First Affiliated Hospital, Zhejiang University School of Medicine, China
| |
|
18
|
Yang Q, Li B, Chen S, Tang J, Li Y, Li Y, Zhang S, Shi C, Zhang Y, Mou M, Xue W, Zhu F. MMEASE: Online meta-analysis of metabolomic data by enhanced metabolite annotation, marker selection and enrichment analysis. J Proteomics 2021; 232:104023. [PMID: 33130111 DOI: 10.1016/j.jprot.2020.104023] [Citation(s) in RCA: 57] [Impact Index Per Article: 14.3] [Received: 12/23/2019] [Revised: 10/12/2020] [Accepted: 10/22/2020] [Indexed: 12/17/2022]
Abstract
Large-scale and long-term metabolomic studies have attracted widespread attention in biomedical research yet remain challenging despite recent technical progress. In particular, ineffective integration of experiments and limited capacity for metabolite annotation are known issues. Herein, we constructed an online tool, MMEASE, enabling the integration of multiple analytical experiments with enhanced metabolite annotation and enrichment analysis (https://idrblab.org/mmease/). MMEASE is unique in being capable of (1) integrating multiple analytical blocks; (2) providing enriched annotation for >330 thousand metabolites; and (3) conducting enrichment analysis using various categories/sub-categories. All in all, MMEASE aims to supply a comprehensive service for large-scale and long-term metabolomics, which might provide valuable guidance to current biomedical studies. SIGNIFICANCE: To facilitate large-scale and long-term metabolomic analysis, MMEASE was developed to (1) achieve the online integration of multiple datasets from different analytical experiments, (2) provide the most diverse strategies for marker discovery, enabling performance assessment, and (3) significantly amplify metabolite annotation and subsequent enrichment analysis.
Affiliation(s)
- Qingxia Yang
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, China; Department of Bioinformatics, Smart Health Big Data Analysis and Location Services Engineering Lab of Jiangsu Province, School of Geographic and Biologic Information, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
| | - Bo Li
- College of Life Sciences, Chongqing Normal University, Chongqing, Chongqing 401331, China
| | - Sijie Chen
- School of Pharmaceutical Sciences, School of Big Data and Software Engineering, Chongqing University, Chongqing, Chongqing 401331, China
| | - Jing Tang
- Department of Bioinformatics, Chongqing Medical University, Chongqing, Chongqing 400016, China
| | - Yinghong Li
- School of Pharmaceutical Sciences, School of Big Data and Software Engineering, Chongqing University, Chongqing, Chongqing 401331, China
| | - Yi Li
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, China
| | - Song Zhang
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, China
| | - Cheng Shi
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, China
| | - Ying Zhang
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, China
| | - Minjie Mou
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, China
| | - Weiwei Xue
- School of Pharmaceutical Sciences, School of Big Data and Software Engineering, Chongqing University, Chongqing, Chongqing 401331, China
| | - Feng Zhu
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, China; School of Pharmaceutical Sciences, School of Big Data and Software Engineering, Chongqing University, Chongqing, Chongqing 401331, China.
| |
|
19
|
Fu J, Luo Y, Mou M, Zhang H, Tang J, Wang Y, Zhu F. Advances in Current Diabetes Proteomics: From the Perspectives of Label-free Quantification and Biomarker Selection. Curr Drug Targets 2021; 21:34-54. [PMID: 31433754 DOI: 10.2174/1389450120666190821160207] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 05/06/2019] [Revised: 07/17/2019] [Accepted: 07/24/2019] [Indexed: 12/13/2022]
Abstract
BACKGROUND: Due to its prevalence and negative impacts on both the economy and society, diabetes mellitus (DM) has emerged as a worldwide concern. In light of this, label-free quantification (LFQ) proteomics and diabetic marker selection methods have been applied to elucidate the underlying mechanisms associated with insulin resistance, explore novel protein biomarkers, and discover innovative therapeutic protein targets. OBJECTIVE: The purpose of this manuscript is to review and analyze the recent computational advances and development of label-free quantification and diabetic marker selection in diabetes proteomics. METHODS: The Web of Science database, the PubMed database and Google Scholar were searched for label-free quantification, computational advances, feature selection and diabetes proteomics. RESULTS: In this study, we systematically review the computational advances of label-free quantification and diabetic marker selection methods that have been applied to understand DM pathological mechanisms. Firstly, the popular quantification measurements and proteomic quantification software tools that have been applied to diabetes studies are comprehensively discussed. Secondly, a number of popular manipulation methods, including transformation, pretreatment (centering, scaling, and normalization) and missing value imputation, and a variety of popular feature selection techniques applied to diabetes proteomic data are reviewed, with an objective evaluation of their advantages and disadvantages. Finally, guidelines for the efficient use of computation-based LFQ technology and feature selection methods in diabetes proteomics are proposed. CONCLUSION: In summary, this review provides guidelines for researchers who will engage in proteomics biomarker discovery; by properly applying these proteomic computational advances, more reliable therapeutic targets will be found in the field of diabetes mellitus.
Affiliation(s)
- Jianbo Fu
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
| | - Yongchao Luo
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
| | - Minjie Mou
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
| | - Hongning Zhang
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
| | - Jing Tang
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China.,School of Pharmaceutical Sciences and Innovative Drug Research Centre, Chongqing University, Chongqing 401331, China
| | - Yunxia Wang
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
| | - Feng Zhu
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China.,School of Pharmaceutical Sciences and Innovative Drug Research Centre, Chongqing University, Chongqing 401331, China
| |
|
20
|
Tang J, Wu X, Mou M, Wang C, Wang L, Li F, Guo M, Yin J, Xie W, Wang X, Wang Y, Ding Y, Xue W, Zhu F. GIMICA: host genetic and immune factors shaping human microbiota. Nucleic Acids Res 2021; 49:D715-D722. [PMID: 33045729 PMCID: PMC7779047 DOI: 10.1093/nar/gkaa851] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 08/14/2020] [Revised: 09/09/2020] [Accepted: 10/08/2020] [Indexed: 01/09/2023] Open
Abstract
Besides the environmental factors that have tremendous impacts on the composition of the microbial community, host factors have recently gained extensive attention for their roles in shaping the human microbiota. There are two major types of host factors: host genetic factors (HGFs) and host immune factors (HIFs). Factors of each type are essential for defining the chemical and physical landscapes inhabited by the microbiota, and the collective consideration of both types has great implications for comprehensive health management. However, no database was available to provide comprehensive factors of both types. Herein, a database entitled 'Host Genetic and Immune Factors Shaping Human Microbiota (GIMICA)' was constructed. Based on the 4257 microbes confirmed to inhabit nine sites of the human body, 2851 HGFs (1368 single nucleotide polymorphisms (SNPs), 186 copy number variations (CNVs), and 1297 non-coding ribonucleic acids (RNAs)) modulating the expression of 370 microbes were collected, and 549 HIFs (126 lymphocytes and phagocytes, 387 immune proteins, and 36 immune pathways) regulating the abundance of 455 microbes were also provided. All in all, GIMICA enables collective consideration not only of different types of host factor but also of host and environmental factors, and it is freely accessible without login requirement at: https://idrblab.org/gimica/.
Affiliation(s)
- Jing Tang
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China.,College of Basic Medicine, Chongqing Medical University, Chongqing 400016, China
| | - Xianglu Wu
- Joint International Research Lab of Reproductive and Development, Department of Reproductive Biology, School of Public Health, Chongqing Medical University, Chongqing 400016, China
| | - Minjie Mou
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
| | - Chuan Wang
- College of Basic Medicine, Chongqing Medical University, Chongqing 400016, China
| | - Lidan Wang
- College of Basic Medicine, Chongqing Medical University, Chongqing 400016, China
| | - Fengcheng Li
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
| | - Maiyuan Guo
- College of Basic Medicine, Chongqing Medical University, Chongqing 400016, China
| | - Jiayi Yin
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
| | - Wenqin Xie
- College of Basic Medicine, Chongqing Medical University, Chongqing 400016, China
| | - Xiaona Wang
- School of Pharmaceutical Sciences, Chongqing University, Chongqing 401331, China
| | - Yingxiong Wang
- College of Basic Medicine, Chongqing Medical University, Chongqing 400016, China.,Joint International Research Lab of Reproductive and Development, Department of Reproductive Biology, School of Public Health, Chongqing Medical University, Chongqing 400016, China
| | - Yubin Ding
- Joint International Research Lab of Reproductive and Development, Department of Reproductive Biology, School of Public Health, Chongqing Medical University, Chongqing 400016, China
| | - Weiwei Xue
- School of Pharmaceutical Sciences, Chongqing University, Chongqing 401331, China
| | - Feng Zhu
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
| |
Collapse
|
21
|
Kumar R, Dhanda SK. Bird Eye View of Protein Subcellular Localization Prediction. Life (Basel) 2020; 10:E347. [PMID: 33327400 PMCID: PMC7764902 DOI: 10.3390/life10120347] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 11/24/2020] [Revised: 12/11/2020] [Accepted: 12/11/2020] [Indexed: 12/12/2022]
Abstract
Proteins are made up of long chains of amino acids and perform a variety of functions in different organisms. The activity of a protein is determined by the nucleotide sequence of its gene and by its 3D structure. In addition, it is essential for proteins to be targeted to their specific locations or compartments to maintain their structure and perform their functions. The challenge of computationally predicting the subcellular localization of proteins is addressed by various in silico methods. In this review, we survey the progress in this field and offer a bird's-eye view consisting of a comprehensive listing of tools, types of input features explored, machine learning approaches employed, and evaluation metrics applied. We hope the review will be useful for researchers working in the field of protein localization prediction.
Affiliation(s)
- Ravindra Kumar
- Biometric Research Program, Division of Cancer Treatment and Diagnosis, National Cancer Institute, NIH, 9609 Medical Center Drive, Rockville, MD 20850, USA
- Sandeep Kumar Dhanda
- Department of Oncology, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA
|
22
|
A Method for Identifying Vesicle Transport Proteins Based on LibSVM and MRMD. Comput Math Methods Med 2020; 2020:8926750. [PMID: 33133228 PMCID: PMC7591939 DOI: 10.1155/2020/8926750] [Citation(s) in RCA: 45] [Impact Index Per Article: 9.0] [Received: 08/04/2020] [Revised: 08/14/2020] [Accepted: 09/16/2020] [Indexed: 12/14/2022]
Abstract
With the development of computer technology, many machine learning algorithms have been applied to the field of biology, forming the discipline of bioinformatics. Protein function prediction is a classic research topic in this subject area. Though many scholars have made progress in identifying proteins with different algorithms, they often extract a large number of feature types and use very complex classification methods to obtain little improvement in classification performance, and this process is very time-consuming. In this research, we attempt to use as few features as possible to classify vesicular transport proteins while still obtaining a comparatively satisfactory classification result. We adopt CTDC, a submethod of the composition, transition, and distribution (CTD) method, to extract only 39 features from each sequence, and LibSVM is used as the classification method. We use the SMOTE method to deal with the problem of dataset imbalance. There are 11619 protein sequences in our dataset. We selected 4428 sequences to train our classification model and another 1832 sequences from our dataset to test the classification performance, and finally achieved an accuracy of 71.77%. After dimension reduction by MRMD, the accuracy is 72.16%.
|
23
|
Shukla N, Siva N, Malik B, Suravajhala P. Current Challenges and Implications of Proteogenomic Approaches in Prostate Cancer. Curr Top Med Chem 2020; 20:1968-1980. [PMID: 32703135 DOI: 10.2174/1568026620666200722112450] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 02/17/2020] [Revised: 05/30/2020] [Accepted: 06/29/2020] [Indexed: 12/16/2022]
Abstract
In the recent past, next-generation sequencing (NGS) approaches have heralded the omics era. With NGS data burgeoning, there arose a need to disseminate omic data better. Proteogenomics has been widely used for characterising the functions of candidate genes and is applied in ascertaining various diseased phenotypes, including cancers. However, not much is known about the role and application of proteogenomics, especially in prostate cancer (PCa). In this review, we outline the need for proteogenomic approaches, their applications and their role in PCa.
Affiliation(s)
- Nidhi Shukla
- Department of Biotechnology and Bioinformatics, Birla Institute of Scientific Research, Statue Circle, Jaipur 302001, RJ, India; Department of Chemistry, School of Basic Sciences, Manipal University Jaipur, Jaipur, India
- Narmadhaa Siva
- Department of Biotechnology and Bioinformatics, Birla Institute of Scientific Research, Statue Circle, Jaipur 302001, RJ, India
- Babita Malik
- Department of Chemistry, School of Basic Sciences, Manipal University Jaipur, Jaipur, India
- Prashanth Suravajhala
- Department of Biotechnology and Bioinformatics, Birla Institute of Scientific Research, Statue Circle, Jaipur 302001, RJ, India
|
24
|
Tang J, Wang Y, Luo Y, Fu J, Zhang Y, Li Y, Xiao Z, Lou Y, Qiu Y, Zhu F. Computational advances of tumor marker selection and sample classification in cancer proteomics. Comput Struct Biotechnol J 2020; 18:2012-2025. [PMID: 32802273 PMCID: PMC7403885 DOI: 10.1016/j.csbj.2020.07.009] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 04/07/2020] [Revised: 07/06/2020] [Accepted: 07/08/2020] [Indexed: 12/11/2022]
Abstract
Cancer proteomics has become a powerful technique for characterizing the protein markers driving malignant transformation, tracing proteome variation triggered by therapeutics, and discovering novel targets and drugs for the treatment of oncologic diseases. To facilitate cancer diagnosis/prognosis and accelerate drug target discovery, a variety of methods for tumor marker identification and sample classification have been developed and successfully applied to cancer proteomic studies. This review article describes the most recent advances in those various approaches together with their current applications in cancer-related studies. Firstly, a number of popular feature selection methods are overviewed with an objective evaluation of their advantages and disadvantages. Secondly, these methods are grouped into three major classes based on their underlying algorithms. Finally, a variety of sample separation algorithms are discussed. This review provides a comprehensive overview of the advances in tumor marker identification and patient/sample/tissue separation, which could serve as guidance for researchers in cancer proteomics.
Key Words
- ANN, Artificial Neural Network
- ANOVA, Analysis of Variance
- CFS, Correlation-based Feature Selection
- Cancer proteomics
- Computational methods
- DAPC, Discriminant Analysis of Principal Component
- DT, Decision Trees
- EDA, Estimation of Distribution Algorithm
- FC, Fold Change
- GA, Genetic Algorithms
- GR, Gain Ratio
- HC, Hill Climbing
- HCA, Hierarchical Cluster Analysis
- IG, Information Gain
- LDA, Linear Discriminant Analysis
- LIMMA, Linear Models for Microarray Data
- MBF, Markov Blanket Filter
- MWW, Mann–Whitney–Wilcoxon test
- OPLS-DA, Orthogonal Partial Least Squares Discriminant Analysis
- PCA, Principal Component Analysis
- PLS-DA, Partial Least Square Discriminant Analysis
- RF, Random Forest
- RF-RFE, Random Forest with Recursive Feature Elimination
- SA, Simulated Annealing
- SAM, Significance Analysis of Microarrays
- SBE, Sequential Backward Elimination
- SFS, Sequential Forward Selection
- SOM, Self-organizing Map
- SU, Symmetrical Uncertainty
- SVM, Support Vector Machine
- SVM-RFE, Support Vector Machine with Recursive Feature Elimination
- Sample classification
- Tumor marker selection
- sPLSDA, Sparse Partial Least Squares Discriminant Analysis
- t-SNE, t-distributed Stochastic Neighbor Embedding
- χ2, Chi-square
Affiliation(s)
- Jing Tang
- Department of Bioinformatics, Chongqing Medical University, Chongqing 400016, China; College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Yunxia Wang
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Yongchao Luo
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Jianbo Fu
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Yang Zhang
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China; School of Pharmaceutical Sciences and Innovative Drug Research Centre, Chongqing University, Chongqing 401331, China
- Yi Li
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Ziyu Xiao
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Yan Lou
- Zhejiang Provincial Key Laboratory for Drug Clinical Research and Evaluation, The First Affiliated Hospital, Zhejiang University, Hangzhou 310000, China
- Yunqing Qiu
- Zhejiang Provincial Key Laboratory for Drug Clinical Research and Evaluation, The First Affiliated Hospital, Zhejiang University, Hangzhou 310000, China
- Feng Zhu
- Department of Bioinformatics, Chongqing Medical University, Chongqing 400016, China; College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
|
25
|
Tang J, Mou M, Wang Y, Luo Y, Zhu F. MetaFS: Performance assessment of biomarker discovery in metaproteomics. Brief Bioinform 2020; 22:5854399. [PMID: 32510556 DOI: 10.1093/bib/bbaa105] [Citation(s) in RCA: 63] [Impact Index Per Article: 12.6] [Received: 03/24/2020] [Revised: 04/17/2020] [Accepted: 05/05/2020] [Indexed: 12/19/2022]
Abstract
Metaproteomics suffers from issues of dimensionality and sparsity. Data reduction methods can identify the relevant subset of significantly differential features and reduce data redundancy, and feature selection (FS) methods are applied to obtain this subset. So far, a variety of FS methods have been developed for metaproteomic studies. However, because the performance of an FS method depends heavily on the data characteristics of a given study, a suitable FS method must be carefully selected to obtain reproducible differential proteins. Moreover, it is critical to evaluate the performance of each FS method according to comprehensive criteria, because a single criterion is not sufficient to reflect the overall performance of an FS method. Therefore, we developed an online tool named MetaFS, which provides 13 types of FS methods and conducts a comprehensive evaluation of the complex FS methods using four widely accepted and independent criteria. Furthermore, the function and reliability of MetaFS were systematically tested and validated via two case studies. In sum, MetaFS could be a distinguished tool for discovering the overall well-performing FS method for selecting potential biomarkers in microbiome studies. The online tool is freely available at https://idrblab.org/metafs/.
|
26
|
Chen H, Li F, Wang L, Jin Y, Chi CH, Kurgan L, Song J, Shen J. Systematic evaluation of machine learning methods for identifying human-pathogen protein-protein interactions. Brief Bioinform 2020; 22:5847611. [PMID: 32459334 DOI: 10.1093/bib/bbaa068] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 01/31/2020] [Revised: 03/31/2020] [Accepted: 04/01/2020] [Indexed: 12/11/2022]
Abstract
In recent years, high-throughput experimental techniques have significantly enhanced the accuracy and coverage of protein-protein interaction identification, including human-pathogen protein-protein interactions (HP-PPIs). Despite this progress, experimental methods are, in general, expensive in terms of both time and labour costs, especially considering that there are enormous numbers of potential protein-interacting partners. Developing computational methods to predict interactions between human and bacterial pathogens has thus become critical and meaningful, both in facilitating the detection of interactions and in mining incomplete interaction maps. In this paper, we present a systematic evaluation of machine learning-based computational methods for human-bacterium protein-protein interactions (HB-PPIs). We first reviewed a vast number of publicly available databases of HP-PPIs and then critically evaluated the availability of these databases. Benefitting from their well-structured nature, we subsequently preprocessed the data and identified six bacterial pathogens that could be used to study bacterium subjects in which a human was the host. Additionally, we thoroughly reviewed the literature on 'host-pathogen interactions' and summarized the existing models, which we used to jointly study the impact of different feature representation algorithms and evaluate the performance of existing machine learning computational models. Owing to the abundance of sequence information and the limited scale of other protein-related information, we adopted the primary protocol from the literature and dedicated our analysis to a comprehensive assessment of sequence information and machine learning models. A systematic evaluation of machine learning models and a wide range of feature representation algorithms based on sequence information is presented as a comparison survey of the prediction performance for HB-PPIs.
|
27
|
Li F, Chen J, Ge Z, Wen Y, Yue Y, Hayashida M, Baggag A, Bensmail H, Song J. Computational prediction and interpretation of both general and specific types of promoters in Escherichia coli by exploiting a stacked ensemble-learning framework. Brief Bioinform 2020; 22:2126-2140. [PMID: 32363397 DOI: 10.1093/bib/bbaa049] [Citation(s) in RCA: 53] [Impact Index Per Article: 10.6] [Received: 12/11/2019] [Revised: 02/25/2020] [Accepted: 03/11/2020] [Indexed: 12/12/2022]
Abstract
Promoters are short consensus sequences of DNA, which are responsible for transcription activation or the repression of all genes. There are many types of promoters in bacteria with important roles in initiating gene transcription. Therefore, solving promoter-identification problems has important implications for improving the understanding of their functions. To this end, computational methods targeting promoter classification have been established; however, their performance remains unsatisfactory. In this study, we present a novel stacked-ensemble approach (termed SELECTOR) for identifying both promoters and their respective classification. SELECTOR combined the composition of k-spaced nucleic acid pairs, parallel correlation pseudo-dinucleotide composition, position-specific trinucleotide propensity based on single-strand, and DNA strand features, and used five popular tree-based ensemble learning algorithms to build a stacked model. Both 5-fold cross-validation tests using benchmark datasets and independent tests using the newly collected independent test dataset showed that SELECTOR outperformed state-of-the-art methods in both general and specific types of promoter prediction in Escherichia coli. Furthermore, this novel framework provides essential interpretations that aid understanding of model success by leveraging the powerful Shapley Additive exPlanation algorithm, thereby highlighting the most important features relevant for predicting both general and specific types of promoters and overcoming the limitations of existing 'Black-box' approaches that are unable to reveal causal relationships from large amounts of initially encoded features.
Affiliation(s)
- Fuyi Li
- Northwest A&F University, China; Department of Biochemistry and Molecular Biology and the Infection and Immunity Program, Biomedicine Discovery Institute, Monash University, Australia
- Jinxiang Chen
- Biomedicine Discovery Institute and the Department of Biochemistry and Molecular Biology, Monash University from the College of Information Engineering, Northwest A&F University, China
- Zongyuan Ge
- Monash University and also serves as a Deep Learning Specialist at NVIDIA AI Technology Centre. Before joining Monash, he was a research scientist at IBM Research Australia doing research in medical AI during 2016-2018. His research interests are AI, computer vision, medical image, robotics and deep learning
- Ya Wen
- computer technology from Ningxia University, China
- Yanwei Yue
- medical science from Southern Medical University, China
- Morihiro Hayashida
- informatics from Kyoto University, Japan, in 2005. He is an Assistant Professor in the Department of Electrical Engineering and Computer Science, National Institute of Technology, Matsue College, Japan
- Abdelkader Baggag
- computer science from the University of Minnesota. He is a Senior Scientist at the Qatar Computing Research Institute (QCRI) and has a joint appointment as an Associate Professor at Hamad Bin Khalifa University (HBKU) in the Division of Information and Computing Technology. His research interests include data mining, linear algebra and machine learning
- Halima Bensmail
- University of Pierre & Marie Curie (Paris 6) in France. She is currently a Principal Scientist at QCRI-HBKU and a joint Associate Professor at the College of Computer and Science Engineering, HBKU
- Jiangning Song
- Monash Biomedicine Discovery Institute, Monash University, Australia. He is also affiliated with the Monash Centre for Data Science, Faculty of Information Technology, Monash University. His research interests include bioinformatics, computational biology, machine learning, data mining, and pattern recognition
|
28
|
Metaproteomics characterizes human gut microbiome function in colorectal cancer. NPJ Biofilms Microbiomes 2020; 6:14. [PMID: 32210237 PMCID: PMC7093434 DOI: 10.1038/s41522-020-0123-4] [Citation(s) in RCA: 90] [Impact Index Per Article: 18.0] [Received: 08/06/2019] [Accepted: 02/26/2020] [Indexed: 02/08/2023]
Abstract
Pathogenesis of colorectal cancer (CRC) is associated with alterations in the gut microbiome. Previous studies have focused on changes in taxonomic abundances measured by metagenomics. Variations in the function of intestinal bacteria in CRC patients compared to healthy individuals remain largely unknown. Here we collected fecal samples from CRC patients and healthy volunteers and characterized their microbiome using a quantitative metaproteomic method. We identified and quantified 91,902 peptides, 30,062 gut microbial protein groups, and 195 genera of microbes. Among the proteins, 341 were found to differ significantly in abundance between the CRC patients and the healthy volunteers. Microbial proteins related to iron intake/transport; oxidative stress; and DNA replication, recombination, and repair were significantly altered in abundance as a result of the high local concentration of iron and high oxidative stress in the large intestine of CRC patients. Our study shows that metaproteomics can provide functional information on intestinal microflora that is of great value for pathogenesis research, and can help guide clinical diagnosis in the future.
|
29
|
Tao J, Hao Y, Li X, Yin H, Nie X, Zhang J, Xu B, Chen Q, Li B. Systematic Identification of Housekeeping Genes Possibly Used as References in Caenorhabditis elegans by Large-Scale Data Integration. Cells 2020; 9:786. [PMID: 32213971 PMCID: PMC7140892 DOI: 10.3390/cells9030786] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 01/24/2020] [Revised: 03/11/2020] [Accepted: 03/11/2020] [Indexed: 12/20/2022]
Abstract
For accurate gene expression quantification, normalization of gene expression data against reliable reference genes is required. It is known that the expression levels of commonly used reference genes vary considerably under different experimental conditions, and therefore, their use for data normalization is limited. In this study, an unbiased identification of reference genes in Caenorhabditis elegans was performed based on 145 microarray datasets (2296 gene array samples) covering different developmental stages, different tissues, drug treatments, lifestyle, and various stresses. As a result, thirteen housekeeping genes (rps-23, rps-26, rps-27, rps-16, rps-2, rps-4, rps-17, rpl-24.1, rpl-27, rpl-33, rpl-36, rpl-35, and rpl-15) with enhanced stability were comprehensively identified by using six popular normalization algorithms and RankAggreg method. Functional enrichment analysis revealed that these genes were significantly overrepresented in GO terms or KEGG pathways related to ribosomes. Validation analysis using recently published datasets revealed that the expressions of newly identified candidate reference genes were more stable than the commonly used reference genes. Based on the results, we recommended using rpl-33 and rps-26 as the optimal reference genes for microarray and rps-2 and rps-4 for RNA-sequencing data validation. More importantly, the most stable rps-23 should be a promising reference gene for both data types. This study, for the first time, successfully displays a large-scale microarray data driven genome-wide identification of stable reference genes for normalizing gene expression data and provides a potential guideline on the selection of universal internal reference genes in C. elegans, for quantitative gene expression analysis.
Affiliation(s)
- Jingxin Tao
- College of Life Sciences, Chongqing Normal University, Chongqing 401331, China
- Youjin Hao
- College of Life Sciences, Chongqing Normal University, Chongqing 401331, China
- Xudong Li
- College of Life Sciences, Chongqing Normal University, Chongqing 401331, China
- Huachun Yin
- College of Life Sciences, Chongqing Normal University, Chongqing 401331, China
- Xiner Nie
- College of Life Sciences, Chongqing Normal University, Chongqing 401331, China
- Jie Zhang
- College of Life Sciences, Chongqing Normal University, Chongqing 401331, China
- Boying Xu
- College of Life Sciences, Chongqing Normal University, Chongqing 401331, China
- Qiao Chen
- Scientific Research Office, Chongqing Normal University, Chongqing 401331, China
- Bo Li
- College of Life Sciences, Chongqing Normal University, Chongqing 401331, China
|
30
|
Zhang Y, Chen C, Duan M, Liu S, Huang L, Zhou F. BioDog, biomarker detection for improving identification power of breast cancer histologic grade in methylomics. Epigenomics 2019; 11:1717-1732. [PMID: 31625763 DOI: 10.2217/epi-2019-0230] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Indexed: 12/12/2022]
Abstract
Aim: Breast cancer histologic grade (HG) is a well-established prognostic factor. This study aimed to select methylomic biomarkers to predict breast cancer HGs. Materials & methods: The proposed algorithm BioDog first used a correlation bias reduction strategy to eliminate redundant features. Then incremental feature selection was applied to find the features with a high HG prediction accuracy. The sequential backward feature elimination strategy was employed to further refine the biomarkers. A comparison with existing algorithms was conducted. The HG-specific somatic mutations were investigated. Results & conclusions: BioDog achieved an accuracy of 0.9973 using 92 methylomic biomarkers for predicting breast cancer HGs. Many of these biomarkers were within the genes and lncRNAs associated with HG development in breast cancer or other cancer types.
Affiliation(s)
- Yexian Zhang
- College of Computer Science & Technology, & Key Laboratory of Symbolic Computation & Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin 130012, PR China
- Chaorong Chen
- College of Software, Jilin University, Changchun, Jilin 130012, PR China
- Meiyu Duan
- College of Computer Science & Technology, & Key Laboratory of Symbolic Computation & Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin 130012, PR China
- Shuai Liu
- College of Computer Science & Technology, & Key Laboratory of Symbolic Computation & Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin 130012, PR China
- Lan Huang
- College of Computer Science & Technology, & Key Laboratory of Symbolic Computation & Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin 130012, PR China
- Fengfeng Zhou
- College of Computer Science & Technology, & Key Laboratory of Symbolic Computation & Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin 130012, PR China
|
31
|
Yang Q, Wang Y, Li F, Zhang S, Luo Y, Li Y, Tang J, Li B, Chen Y, Xue W, Zhu F. Identification of the gene signature reflecting schizophrenia's etiology by constructing artificial intelligence-based method of enhanced reproducibility. CNS Neurosci Ther 2019; 25:1054-1063. [PMID: 31350824 PMCID: PMC6698965 DOI: 10.1111/cns.13196] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.8] [Received: 05/28/2019] [Revised: 06/27/2019] [Accepted: 07/03/2019] [Indexed: 12/15/2022]
Abstract
AIMS: As one of the most fundamental questions in modern science, "what causes schizophrenia (SZ)" remains a profound mystery due to the absence of objective gene markers. The reproducibility of the gene signatures identified by independent studies is found to be extremely low due to the limitations of available feature selection methods and the lack of measures for validating the robustness of signatures. These irreproducible results have significantly limited our understanding of the etiology of SZ. METHODS: In this study, a new feature selection strategy was developed, and a comprehensive analysis was then conducted to ensure reliable signature discovery. In particular, the new strategy (a) combined multiple randomized sampling with consensus scoring and (b) assessed gene ranking consistency among different datasets, and a comprehensive analysis among nine independent studies was conducted. RESULTS: Based on a first-ever evaluation of the methods' reproducibility that was cross-validated by nine independent studies, the newly developed strategy was found to be superior to the traditional ones. As a result, 33 genes were consistently identified from multiple datasets by the new strategy as differentially expressed, which might facilitate our understanding of the mechanism underlying the etiology of SZ. CONCLUSION: A new strategy capable of enhancing the reproducibility of feature selection in current SZ research was successfully constructed and validated. The group of candidate genes identified in this study should be considered as having great potential for revealing the etiology of SZ.
Affiliation(s)
- Qing‐Xia Yang
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
- School of Pharmaceutical Sciences, Chongqing University, Chongqing, China
- Yun‐Xia Wang
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
- Feng‐Cheng Li
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
- Song Zhang
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
- Yong‐Chao Luo
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
- Yi Li
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
- Jing Tang
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
- School of Pharmaceutical Sciences, Chongqing University, Chongqing, China
- Bo Li
- School of Pharmaceutical Sciences, Chongqing University, Chongqing, China
- Yu‐Zong Chen
- Bioinformatics and Drug Design Group, Department of Pharmacy, National University of Singapore, Singapore, Singapore
- Wei‐Wei Xue
- School of Pharmaceutical Sciences, Chongqing University, Chongqing, China
- Feng Zhu
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
- School of Pharmaceutical Sciences, Chongqing University, Chongqing, China
|