1
Shen K, Din AU, Sinha B, Zhou Y, Qian F, Shen B. Translational informatics for human microbiota: data resources, models and applications. Brief Bioinform 2023; 24:7152256. [PMID: 37141135 DOI: 10.1093/bib/bbad168] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 12/26/2022] [Revised: 04/07/2023] [Accepted: 04/11/2023] [Indexed: 05/05/2023] Open
Abstract
With the rapid development of human intestinal microbiology and diverse microbiome-related studies and investigations, a large amount of data have been generated and accumulated. Meanwhile, different computational and bioinformatics models have been developed for pattern recognition and knowledge discovery using these data. Given the heterogeneity of these resources and models, we aimed to provide a landscape of the data resources, a comparison of the computational models and a summary of the translational informatics applied to microbiota data. We first review the existing databases, knowledge bases, knowledge graphs and standardizations of microbiome data. Then, the high-throughput sequencing techniques for the microbiome and the informatics tools for their analyses are compared. Finally, translational informatics for the microbiome, including biomarker discovery, personalized treatment and smart healthcare for complex diseases, are discussed.
Affiliation(s)
- Ke Shen
  - Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, 610212, China
- Ahmad Ud Din
  - Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, 610212, China
- Baivab Sinha
  - Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, 610212, China
- Yi Zhou
  - Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, 610212, China
- Fuliang Qian
  - Center for Systems Biology, Suzhou Medical College of Soochow University, Suzhou 215123, China
  - Jiangsu Province Engineering Research Center of Precision Diagnostics and Therapeutics Development, Suzhou 215123, China
- Bairong Shen
  - Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, 610212, China
2
Shtossel O, Isakov H, Turjeman S, Koren O, Louzoun Y. Ordering taxa in image convolution networks improves microbiome-based machine learning accuracy. Gut Microbes 2023; 15:2224474. [PMID: 37345233 PMCID: PMC10288916 DOI: 10.1080/19490976.2023.2224474] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 12/15/2022] [Accepted: 06/08/2023] [Indexed: 06/23/2023] Open
Abstract
The human gut microbiome is associated with a large number of disease etiologies. As such, it is a natural candidate for machine-learning-based biomarker development for multiple diseases and conditions. The microbiome is often analyzed using 16S rRNA gene sequencing or shotgun metagenomics. However, several properties of microbial sequence-based studies hinder machine learning (ML), including non-uniform representation, a small number of samples compared with the dimension of each sample, and sparsity of the data, with the majority of taxa present in a small subset of samples. We show here using a graph representation that the cladogram structure is as informative as the taxa frequency. We then suggest a novel method to combine information from different taxa and improve data representation for ML using microbial taxonomy. iMic (image microbiome) translates the microbiome to images through an iterative ordering scheme, and applies convolutional neural networks to the resulting image. We show that iMic has a higher precision in static microbiome gene sequence-based ML than state-of-the-art methods. iMic also facilitates the interpretation of the classifiers by applying an explainable artificial intelligence (AI) algorithm to iMic to detect the taxa relevant to each condition. iMic is then extended to dynamic microbiome samples by translating them to movies.
Affiliation(s)
- Oshrit Shtossel
  - Department of Mathematics, Bar-Ilan University, Ramat Gan, Israel
- Haim Isakov
  - Department of Mathematics, Bar-Ilan University, Ramat Gan, Israel
- Sondra Turjeman
  - The Azrieli Faculty of Medicine, Bar-Ilan University, Safed, Israel
- Omry Koren
  - The Azrieli Faculty of Medicine, Bar-Ilan University, Safed, Israel
- Yoram Louzoun
  - Department of Mathematics, Bar-Ilan University, Ramat Gan, Israel
3
Chrisman BS, Paskov KM, Stockham N, Jung JY, Varma M, Washington PY, Tataru C, Iwai S, DeSantis TZ, David M, Wall DP. Improved detection of disease-associated gut microbes using 16S sequence-based biomarkers. BMC Bioinformatics 2021; 22:509. [PMID: 34666677 PMCID: PMC8527694 DOI: 10.1186/s12859-021-04427-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/20/2020] [Accepted: 10/06/2021] [Indexed: 12/31/2022] Open
Abstract
Background: Sequencing partial 16S rRNA genes is a cost-effective method for quantifying the microbial composition of an environment, such as the human gut. However, downstream analysis relies on binning reads into microbial groups by either considering each unique sequence as a different microbe, querying a database to get taxonomic labels from sequences, or clustering similar sequences together. These approaches do not fully capture evolutionary relationships between microbes, limiting the ability to identify differentially abundant groups of microbes between a diseased and control cohort. We present sequence-based biomarkers (SBBs), an aggregation method that groups and aggregates microbes using single variants and combinations of variants within their 16S sequences. We compare SBBs against other existing aggregation methods (OTU clustering and MicroPheno or DiTaxa features) in several benchmarking tasks: biomarker discovery via permutation test, biomarker discovery via linear discriminant analysis, and phenotype prediction power. We demonstrate that SBBs perform on par with or better than the state-of-the-art methods in biomarker discovery and phenotype prediction. Results: On two independent datasets, SBBs identify differentially abundant groups of microbes with similar or higher statistical significance than existing methods in both a permutation-test-based analysis and using linear discriminant analysis effect size. By grouping microbes by SBB, we can identify several differentially abundant microbial groups (FDR < 0.1) between children with autism and neurotypical controls in a set of 115 discordant siblings. Porphyromonadaceae, Ruminococcaceae, and an unnamed species of Blastocystis were significantly enriched in autism, while Veillonellaceae was significantly depleted. Likewise, aggregating microbes by SBB on a dataset of obese and lean twins, we find several significantly differentially abundant microbial groups (FDR < 0.1). We observed Megasphaera and Sutterellaceae highly enriched in obesity, and Phocaeicola significantly depleted. SBBs also perform on par with or better than existing aggregation methods as features in a phenotype prediction model, predicting the autism phenotype with an ROC-AUC score of 0.64 and the obesity phenotype with an ROC-AUC score of 0.84. Conclusions: SBBs provide a powerful method for aggregating microbes to perform differential abundance analysis as well as phenotype prediction. Our source code can be freely downloaded from http://github.com/briannachrisman/16s_biomarkers.
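The permutation-test benchmark mentioned in this abstract can be illustrated with a minimal sketch. This is not the authors' SBB code: it is a generic two-sample permutation test on the difference in mean abundance of one microbial group, and the function name `permutation_test`, its parameters, and the toy abundance values are all assumptions made for illustration.

```python
import random
from statistics import fmean


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means.

    Hypothetical helper for illustration; not part of the SBB code base.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(fmean(group_a) - fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign the case/control labels
        diff = abs(fmean(pooled[:n_a]) - fmean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_permutations + 1)


# Toy, made-up abundances of one microbial group in cases and controls.
cases = [10, 11, 12, 13, 14, 15]
controls = [0, 1, 2, 3, 4, 5]
p_value = permutation_test(cases, controls)
```

In a real analysis one such p-value would be computed per microbial group and then corrected for multiple testing (the abstract reports FDR thresholds); that correction step is omitted here.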
Affiliation(s)
- Brianna S Chrisman
  - Department of Bioengineering, Stanford University, Serra Mall, Stanford, USA
- Kelley M Paskov
  - Department of Biomedical Data Science, Stanford University, Serra Mall, Stanford, USA
- Nate Stockham
  - Department of Neuroscience, Stanford University, Serra Mall, Stanford, USA
- Jae-Yoon Jung
  - Department of Biomedical Data Science, Stanford University, Serra Mall, Stanford, USA
- Maya Varma
  - Department of Computer Science, Stanford University, Serra Mall, Stanford, USA
- Peter Y Washington
  - Department of Bioengineering, Stanford University, Serra Mall, Stanford, USA
- Christine Tataru
  - Department of Computer Science, Oregon State University, SW Campus Way, Corvallis, USA
- Shoko Iwai
  - Second Genome Inc, Allerton Ave, Brisbane, USA
- Maude David
  - Department of Microbiology, Oregon State University, SW Campus Way, Corvallis, USA
- Dennis P Wall
  - Department of Biomedical Data Science, Stanford University, Serra Mall, Stanford, USA; Department of Pediatrics (Systems Medicine), Stanford University, 1265 Welch Road, Stanford, USA
4
Abbasi K, Razzaghi P, Poso A, Ghanbari-Ara S, Masoudi-Nejad A. Deep Learning in Drug Target Interaction Prediction: Current and Future Perspectives. Curr Med Chem 2021; 28:2100-2113. [PMID: 32895036 DOI: 10.2174/0929867327666200907141016] [Citation(s) in RCA: 52] [Impact Index Per Article: 13.0] [Received: 05/14/2020] [Revised: 07/30/2020] [Accepted: 07/30/2020] [Indexed: 11/22/2022]
Abstract
Drug-target Interactions (DTIs) prediction plays a central role in drug discovery. Computational methods in DTIs prediction have gained more attention because carrying out in vitro and in vivo experiments on a large scale is costly and time-consuming. Machine learning methods, especially deep learning, are widely applied to DTIs prediction. In this study, the main goal is to provide a comprehensive overview of deep learning-based DTIs prediction approaches. Here, we investigate the existing approaches from multiple perspectives. We explore these approaches to find out which deep network architectures are utilized to extract features from drug compound and protein sequences. Also, the advantages and limitations of each architecture are analyzed and compared. Moreover, we explore the process of how to combine descriptors for drug and protein features. Likewise, a list of datasets that are commonly used in DTIs prediction is investigated. Finally, current challenges are discussed and a short future outlook of deep learning in DTI prediction is given.
Affiliation(s)
- Karim Abbasi
  - Laboratory of Systems Biology and Bioinformatics (LBB), Institute of Biochemistry and Biophysics, University of Tehran, Tehran 1417614411, Iran
- Parvin Razzaghi
  - Department of Computer Science and Information Technology, Institute for Advanced Studies in Basic Sciences (IASBS), Zanjan, Iran
- Antti Poso
  - School of Pharmacy, Faculty of Health Sciences, University of Eastern Finland, Kuopio 80100, Finland
- Saber Ghanbari-Ara
  - Laboratory of Systems Biology and Bioinformatics (LBB), Institute of Biochemistry and Biophysics, University of Tehran, Tehran 1417614411, Iran
- Ali Masoudi-Nejad
  - Laboratory of Systems Biology and Bioinformatics (LBB), Institute of Biochemistry and Biophysics, University of Tehran, Tehran 1417614411, Iran
5
Song K, Wright FA, Zhou YH. Systematic Comparisons for Composition Profiles, Taxonomic Levels, and Machine Learning Methods for Microbiome-Based Disease Prediction. Front Mol Biosci 2020; 7:610845. [PMID: 33392266 PMCID: PMC7772236 DOI: 10.3389/fmolb.2020.610845] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 09/27/2020] [Accepted: 11/25/2020] [Indexed: 12/12/2022] Open
Abstract
Microbiome composition profiles generated from 16S rRNA sequencing have been extensively studied for their usefulness in phenotype trait prediction, including for complex diseases such as diabetes and obesity. These microbiome compositions have typically been quantified in the form of Operational Taxonomic Unit (OTU) count matrices. However, alternate approaches such as Amplicon Sequence Variants (ASV) have been used, as well as the direct use of k-mer sequence counts. The overall effect of these different types of predictors when used in concert with various machine learning methods has been difficult to assess, due to varied combinations described in the literature. Here we provide an in-depth investigation of more than 1,000 combinations of these three clustering/counting methods, in combination with varied choices for normalization and filtering, grouping at various taxonomic levels, and the use of more than ten commonly used machine learning methods for phenotype prediction. The use of short k-mers, which have computational advantages and conceptual simplicity, is shown to be effective as a source for microbiome-based prediction. Among machine-learning approaches, tree-based methods show consistent, though modest, advantages in prediction accuracy. We describe the various advantages and disadvantages of combinations in analysis approaches, and provide general observations to serve as a useful guide for future trait-prediction explorations using microbiome data.
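The "direct use of k-mer sequence counts" described in this abstract can be sketched as follows. This is an illustrative assumption, not the pipeline evaluated in the paper: the helper names `kmer_counts` and `kmer_vector`, the choice of k, and the example read are hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence, k):
    """Count every overlapping k-mer in one read (hypothetical helper)."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_vector(sequence, k, alphabet="ACGT"):
    """Relative k-mer frequencies in a fixed order, usable as ML features.

    K-mers containing symbols outside `alphabet` (for example N) are ignored.
    """
    counts = kmer_counts(sequence, k)
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


# Made-up read, used only to show the shape of the output.
features = kmer_vector("ACGTAC", k=2)  # 16 values, one per dinucleotide
```

Each sample then becomes a fixed-length vector of length 4^k regardless of read length, which is one reason short k-mers are computationally convenient as predictors.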
Affiliation(s)
- Kuncheng Song
  - Bioinformatics Research Center, North Carolina State University, Raleigh, NC, United States
- Fred A Wright
  - Departments of Statistics and Biological Sciences, North Carolina State University, Raleigh, NC, United States
- Yi-Hui Zhou
  - Department of Biological Sciences, North Carolina State University, Raleigh, NC, United States
6
Kwak MS, Cha JM, Shin HP, Jeon JW, Yoon JY. Development of a Novel Metagenomic Biomarker for Prediction of Upper Gastrointestinal Tract Involvement in Patients With Crohn's Disease. Front Microbiol 2020; 11:1162. [PMID: 32582102 PMCID: PMC7283919 DOI: 10.3389/fmicb.2020.01162] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 03/19/2020] [Accepted: 05/06/2020] [Indexed: 12/28/2022] Open
Abstract
The human gut microbiota is an important component in the pathogenesis of Crohn's disease (CD), promoting host-microbe imbalances and disturbing intestinal and immune homeostasis. We aimed to assess the potential clinical usefulness of the colonic tissue microbiome for obtaining biomarkers for upper gastrointestinal (UGI) tract involvement in CD. We analyzed colonic tissue samples from 26 CD patients (13 with and 13 without UGI involvement at diagnosis) from the Inflammatory Bowel Disease Multi-Omics Database. QIIME1, DiTaxa, linear discriminant analysis effect size (LEfSe), and PICRUSt2 methods were used to examine microbial dysbiosis. Linear support vector machine (SVM) and random forest classifier (RF) algorithms were used to identify the UGI tract involvement-associated biomarkers. There were no statistically significant differences in community richness, phylogenetic diversity, and phylogenetic distance between the two groups of CD patients. DiTaxa analysis predicted significant association of the species Ruminococcus torques with UGI involvement, which was confirmed by the LEfSe analysis (P = 0.025). For the feature ranking method in both linear SVM and RF models, the species R. torques and age at diagnosis contributed to the combined models. The L-methionine biosynthesis III (P = 0.038) and palmitate biosynthesis II (P = 0.050) were under-represented in CD with UGI involvement. These findings suggest that R. torques might serve as a novel potential biomarker for UGI involvement in CD and its correlations, in addition to a range of bacterial species. The mechanisms of interaction between hosts and R. torques should be further investigated.
Affiliation(s)
- Min Seob Kwak
  - Department of Internal Medicine, Kyung Hee University Hospital at Gangdong, College of Medicine, Kyung Hee University, Seoul, South Korea
7
Peng H. CFSP: a collaborative frequent sequence pattern discovery algorithm for nucleic acid sequence classification. PeerJ 2020; 8:e8965. [PMID: 32341900 PMCID: PMC7179567 DOI: 10.7717/peerj.8965] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 01/03/2020] [Accepted: 03/24/2020] [Indexed: 12/19/2022] Open
Abstract
BACKGROUND: Conserved nucleic acid sequences play an essential role in transcriptional regulation. The motifs/templates derived from nucleic acid sequence datasets are usually used as biomarkers to predict biochemical properties such as protein binding sites or to identify specific non-coding RNAs. In many cases, template-based nucleic acid sequence classification performs better than some feature extraction methods, such as N-gram and k-spaced pairs classification. The availability of large-scale experimental data provides an unprecedented opportunity to improve motif extraction methods. The process for pattern extraction from large-scale data is crucial for the creation of predictive models. METHODS: In this article, a Teiresias-like feature extraction algorithm to discover frequent sub-sequences (CFSP) is proposed. Although gaps are allowed in some motif discovery algorithms, the distance and number of gaps are limited. The proposed algorithm can find frequent sequence pairs with a larger gap. The combinations of frequent sub-sequences in given protracted sequences capture the long-distance correlation, which implies a specific molecular biological property. Hence, the proposed algorithm intends to discover the combinations. An ordered set of frequent sub-sequences derived from nucleic acid sequences is used as a base frequent sub-sequence array. Mutation information is attached to each sub-sequence array to implement fuzzy matching. Each mutation entry records a single-nucleotide variant or a nucleotide insertion/deletion (indel), encoding a slight difference between a frequent sequence and the matched subsequence of the sequence under investigation. CONCLUSIONS: The proposed algorithm has been validated with several nucleic acid sequence prediction case studies. These data demonstrate better results than recently available feature-descriptor-based methods on experimental data sets such as miRNA, piRNA, and Sigma 54 promoters.
CFSP is implemented in C++ and shell script; the source code and related data are available at https://github.com/HePeng2016/CFSP.
Affiliation(s)
- He Peng
  - School of Information Science and Engineering, Xiamen University, Xiamen, Fujian, China
8
Meola M, Rifa E, Shani N, Delbès C, Berthoud H, Chassard C. DAIRYdb: a manually curated reference database for improved taxonomy annotation of 16S rRNA gene sequences from dairy products. BMC Genomics 2019; 20:560. [PMID: 31286860 PMCID: PMC6615214 DOI: 10.1186/s12864-019-5914-8] [Citation(s) in RCA: 36] [Impact Index Per Article: 6.0] [Received: 08/07/2018] [Accepted: 06/18/2019] [Indexed: 12/14/2022] Open
Abstract
Background: Read assignment to taxonomic units is a key step in microbiome analysis pipelines. To date, accurate taxonomy annotation of 16S reads, particularly at species rank, is still challenging due to the short size of read sequences and differently curated classification databases. The close phylogenetic relationship between species encountered in dairy products, however, makes it crucial to annotate species accurately to achieve sufficient phylogenetic resolution for further downstream ecological studies or for food diagnostics. Curated databases dedicated to the environment of interest are expected to improve the accuracy and resolution of taxonomy annotation. Results: We provide a manually curated database composed of 10’290 full-length 16S rRNA gene sequences from prokaryotes tailored for dairy products analysis (https://github.com/marcomeola/DAIRYdb). The performance of the DAIRYdb was compared with the universal databases Silva, LTP, RDP and Greengenes. The DAIRYdb significantly outperformed all other databases independently of the classification algorithm by enabling more accurate taxonomy annotation down to the species rank. The DAIRYdb accurately annotates over 90% of the sequences of either single or paired hypervariable regions automatically. The manually curated DAIRYdb strongly improves taxonomic annotation accuracy for microbiome studies in dairy environments. The DAIRYdb is a practical solution that enables automation of this key step, thus facilitating the routine application of NGS microbiome analyses for microbial ecology studies and diagnostics in dairy products.
Affiliation(s)
- Marco Meola
  - Agroscope, Competence Division Methods Development and Analytics, Research Group Fermenting Organisms, Schwarzenburgstrasse 161, Bern, 3003, Switzerland
- Etienne Rifa
  - Université Clermont Auvergne, INRA, VetAgro Sup, UMRF, 20 côte de Reyne, Aurillac, 15000, France
- Noam Shani
  - Agroscope, Competence Division Methods Development and Analytics, Research Group Fermenting Organisms, Schwarzenburgstrasse 161, Bern, 3003, Switzerland
- Céline Delbès
  - Université Clermont Auvergne, INRA, VetAgro Sup, UMRF, 20 côte de Reyne, Aurillac, 15000, France
- Hélène Berthoud
  - Agroscope, Competence Division Methods Development and Analytics, Research Group Fermenting Organisms, Schwarzenburgstrasse 161, Bern, 3003, Switzerland
- Christophe Chassard
  - Université Clermont Auvergne, INRA, VetAgro Sup, UMRF, 20 côte de Reyne, Aurillac, 15000, France
9
Madani A, Bakhaty A, Kim J, Mubarak Y, Mofrad M. Bridging finite element and machine learning modeling: stress prediction of arterial walls in atherosclerosis. J Biomech Eng 2019; 141:2729617. [PMID: 30912802 DOI: 10.1115/1.4043290] [Citation(s) in RCA: 25] [Impact Index Per Article: 4.2] [Received: 12/04/2018] [Indexed: 11/08/2022]
Abstract
Finite element and machine learning modeling are two predictive paradigms that have rarely been bridged. In this study, we develop a parametric model to generate arterial geometries and accumulate a database of over 12,000 finite element simulations of mechanical behaviour and stress distribution in these arterial models representative of atherosclerotic plaques. We formulate the training data to predict the maximum von Mises stress which could indicate risk of plaque rupture. Trained deep learning models are able to accurately predict the max von Mises stress within 9.86% error on a held-out test set. The deep neural networks outperform alternative prediction models and performance scales with amount of training data. Lastly, we examine the importance of attributing features on stress value and location prediction to gain intuitions on the underlying process. Moreover, deep neural networks can capture the functional mapping described by the finite element method which has far-reaching implications for real-time and multi-scale prediction tasks in biomechanics.
Affiliation(s)
- Ali Madani
  - Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California, Berkeley, California, United States of America
- Ahmed Bakhaty
  - Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California, Berkeley, California, United States of America; Department of Civil Engineering, University of California, Berkeley, California, United States of America
- Jiwon Kim
  - Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California, Berkeley, California, United States of America; Department of Electrical Engineering and Computer Science, University of California, Berkeley, California, United States of America
- Yara Mubarak
  - Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California, Berkeley, California, United States of America; Department of Civil Engineering, University of California, Berkeley, California, United States of America
- Mohammad Mofrad
  - Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California, Berkeley, California, United States of America; Molecular Biophysics and Integrative Bioimaging Division, Lawrence Berkeley National Lab, Berkeley, California, United States of America
10
Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX). Sci Rep 2019; 9:3577. [PMID: 30837494 PMCID: PMC6401088 DOI: 10.1038/s41598-019-38746-w] [Citation(s) in RCA: 40] [Impact Index Per Article: 6.7] [Received: 07/27/2018] [Accepted: 12/19/2018] [Indexed: 12/28/2022] Open
Abstract
In this paper, we present peptide-pair encoding (PPE), a general-purpose probabilistic segmentation of protein sequences into commonly occurring variable-length sub-sequences. The idea of PPE segmentation is inspired by the byte-pair encoding (BPE) text compression algorithm, which has recently gained popularity in subword neural machine translation. We modify this algorithm by adding a sampling framework allowing for multiple ways of segmenting a sequence. PPE segmentation steps can be learned over a large set of protein sequences (Swiss-Prot) or even a domain-specific dataset and then applied to a set of unseen sequences. This representation can be widely used as the input to any downstream machine learning tasks in protein bioinformatics. In particular, here, we introduce this representation through protein motif discovery and protein sequence embedding. (i) DiMotif: we present DiMotif as an alignment-free discriminative motif discovery method and evaluate the method for finding protein motifs in three different settings: (1) comparison of DiMotif with two existing approaches on 20 distinct motif discovery problems which are experimentally verified, (2) classification-based approach for the motifs extracted for integrins, integrin-binding proteins, and biofilm formation, and (3) in sequence pattern searching for nuclear localization signal. DiMotif, in general, obtained high recall scores, while having a comparable F1 score with other methods in the discovery of experimentally verified motifs. Having high recall suggests that DiMotif can be used for short-list creation for further experimental investigations on motifs. In the classification-based evaluation, the extracted motifs could reliably detect the integrins, integrin-binding, and biofilm formation-related proteins on a reserved set of sequences with high F1 scores. (ii) ProtVecX: we extend k-mer based protein vector (ProtVec) embedding to variable-length protein embedding using PPE sub-sequences.
We show that the new method of embedding can marginally outperform ProtVec in enzyme prediction as well as toxin prediction tasks. In addition, we conclude that the embeddings are beneficial in protein classification tasks when they are combined with raw amino acids k-mer features.
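The byte-pair-encoding idea that PPE builds on can be sketched in a few lines. This is plain, deterministic BPE applied to amino-acid strings; it does not include the sampling framework that the abstract says PPE adds, and the function names `learn_merges`, `apply_merge` and `segment`, along with the toy sequences, are assumptions made for illustration.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace each adjacent occurrence of `pair` with one merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def learn_merges(sequences, num_merges):
    """Learn BPE merge steps by repeatedly merging the most frequent pair."""
    corpus = [list(s) for s in sequences]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for tokens in corpus:
            pairs.update(zip(tokens, tokens[1:]))
        if not pairs:
            break
        # Ties are broken by the pair itself so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        corpus = [apply_merge(tokens, best) for tokens in corpus]
    return merges


def segment(sequence, merges):
    """Segment an unseen sequence by replaying the learned merges in order."""
    tokens = list(sequence)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


# Made-up toy sequences; real training would use a large set such as Swiss-Prot.
merges = learn_merges(["MKVLA", "MKVLG", "AKVLM"], num_merges=2)
pieces = segment("MKVLT", merges)
```

The resulting variable-length pieces can then serve as tokens for motif discovery or embedding, which is the role the abstract assigns to PPE sub-sequences.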