151
|
Zhang X, Wang W, Ren CX, Dai DQ. Learning representation for multiple biological networks via a robust graph regularized integration approach. Brief Bioinform 2021; 23:6381251. [PMID: 34607360 DOI: 10.1093/bib/bbab409] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/23/2021] [Revised: 08/23/2021] [Accepted: 09/06/2021] [Indexed: 01/18/2023] Open
Abstract
Learning node representation is a fundamental problem in biological network analysis, as compact representation features reveal complicated network structures and carry useful information for downstream tasks such as link prediction and node classification. Recently, multiple networks that profile objects from different aspects are increasingly accumulated, providing the opportunity to learn objects from multiple perspectives. However, the complex common and specific information across different networks pose challenges to node representation methods. Moreover, ubiquitous noise in networks calls for more robust representation. To deal with these problems, we present a representation learning method for multiple biological networks. First, we accommodate the noise and spurious edges in networks using denoised diffusion, providing robust connectivity structures for the subsequent representation learning. Then, we introduce a graph regularized integration model to combine refined networks and compute common representation features. By using the regularized decomposition technique, the proposed model can effectively preserve the common structural property of different networks and simultaneously accommodate their specific information, leading to a consistent representation. A simulation study shows the superiority of the proposed method on different levels of noisy networks. Three network-based inference tasks, including drug-target interaction prediction, gene function identification and fine-grained species categorization, are conducted using representation features learned from our method. Biological networks at different scales and levels of sparsity are involved. Experimental results on real-world data show that the proposed method has robust performance compared with alternatives. Overall, by eliminating noise and integrating effectively, the proposed method is able to learn useful representations from multiple biological networks.
Collapse
Affiliation(s)
- Xiwen Zhang
- Intelligent Data Center, School of Mathematics, Sun Yat-Sen University, 510275, Guangzhou, China
| | - Weiwen Wang
- Intelligent Data Center, School of Mathematics, Sun Yat-Sen University, 510275, Guangzhou, China
| | - Chuan-Xian Ren
- Intelligent Data Center, School of Mathematics, Sun Yat-Sen University, 510275, Guangzhou, China
| | - Dao-Qing Dai
- Intelligent Data Center, School of Mathematics, Sun Yat-Sen University, 510275, Guangzhou, China
| |
Collapse
|
152
|
Elhaj-Abdou MEM, El-Dib H, El-Helw A, El-Habrouk M. Deep_CNN_LSTM_GO: Protein function prediction from amino-acid sequences. Comput Biol Chem 2021; 95:107584. [PMID: 34601431 DOI: 10.1016/j.compbiolchem.2021.107584] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/05/2021] [Revised: 09/08/2021] [Accepted: 09/21/2021] [Indexed: 11/15/2022]
Abstract
Protein amino acid sequences can be used to determine the functions of the protein. However, determining the function of a single protein requires many resources and a tremendous amount of time. Computational Intelligence methods such as Deep learning have been shown to predict the proteins' functions. This paper proposes a hybrid deep neural network model to predict an unknown protein's functions from sequences. The proposed model is named Deep_CNN_LSTM_GO. Deep_CNN_LSTM_GO is an Integration between Convolutional Neural network (CNN) and Long Short-Term Memory (LSTM) Neural Network to learn features from amino acid sequences and outputs the three different Gene Ontology (GO). The gene ontology represents the protein functions in the three sub-ontologies: Molecular Functions (MF), Biological Process (BP), and Cellular Component (CC). The proposed model has been trained and tested using UniProt-SwissProt's dataset. Another test has been done using Computational Assessment of Function Annotation (CAFA) on the three sub-ontologies. The proposed model outperforms different methods proposed in the field with better performance using three different evaluation metrics (Fmax, Smin, and AUPR) in the three sub-ontologies (MF, BP, CC).
Collapse
Affiliation(s)
- Mohamed E M Elhaj-Abdou
- Faculty of Engineering, Arab Academy for Science and Technology and Maritime Transport, Alexandria, Egypt.
| | - Hassan El-Dib
- Faculty of Engineering, Arab Academy for Science and Technology and Maritime Transport, Alexandria, Egypt.
| | - Amr El-Helw
- Faculty of Engineering, Arab Academy for Science and Technology and Maritime Transport, Alexandria, Egypt.
| | | |
Collapse
|
153
|
Vu TTD, Jung J. Protein function prediction with gene ontology: from traditional to deep learning models. PeerJ 2021; 9:e12019. [PMID: 34513334 PMCID: PMC8395570 DOI: 10.7717/peerj.12019] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/13/2021] [Accepted: 07/29/2021] [Indexed: 11/25/2022] Open
Abstract
Protein function prediction is a crucial part of genome annotation. Prediction methods have recently witnessed rapid development, owing to the emergence of high-throughput sequencing technologies. Among the available databases for identifying protein function terms, Gene Ontology (GO) is an important resource that describes the functional properties of proteins. Researchers are employing various approaches to efficiently predict the GO terms. Meanwhile, deep learning, a fast-evolving discipline in data-driven approach, exhibits impressive potential with respect to assigning GO terms to amino acid sequences. Herein, we reviewed the currently available computational GO annotation methods for proteins, ranging from conventional to deep learning approach. Further, we selected some suitable predictors from among the reviewed tools and conducted a mini comparison of their performance using a worldwide challenge dataset. Finally, we discussed the remaining major challenges in the field, and emphasized the future directions for protein function prediction with GO.
Collapse
Affiliation(s)
- Thi Thuy Duong Vu
- Department of Information and Communication Engineering, Myongji University, Yongin-si, Gyeonggi-do, South Korea
| | - Jaehee Jung
- Department of Information and Communication Engineering, Myongji University, Yongin-si, Gyeonggi-do, South Korea
| |
Collapse
|
154
|
Holmgren SD, Boyles RR, Cronk RD, Duncan CG, Kwok RK, Lunn RM, Osborn KC, Thessen AE, Schmitt CP. Catalyzing Knowledge-Driven Discovery in Environmental Health Sciences through a Community-Driven Harmonized Language. INTERNATIONAL JOURNAL OF ENVIRONMENTAL RESEARCH AND PUBLIC HEALTH 2021; 18:8985. [PMID: 34501574 PMCID: PMC8430534 DOI: 10.3390/ijerph18178985] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 07/07/2021] [Revised: 08/13/2021] [Accepted: 08/19/2021] [Indexed: 01/10/2023]
Abstract
Harmonized language is critical for helping researchers to find data, collecting scientific data to facilitate comparison, and performing pooled and meta-analyses. Using standard terms to link data to knowledge systems facilitates knowledge-driven analysis, allows for the use of biomedical knowledge bases for scientific interpretation and hypothesis generation, and increasingly supports artificial intelligence (AI) and machine learning. Due to the breadth of environmental health sciences (EHS) research and the continuous evolution in scientific methods, the gaps in standard terminologies, vocabularies, ontologies, and related tools hamper the capabilities to address large-scale, complex EHS research questions that require the integration of disparate data and knowledge sources. The results of prior workshops to advance a harmonized environmental health language demonstrate that future efforts should be sustained and grounded in scientific need. We describe a community initiative whose mission was to advance integrative environmental health sciences research via the development and adoption of a harmonized language. The products, outcomes, and recommendations developed and endorsed by this community are expected to enhance data collection and management efforts for NIEHS and the EHS community, making data more findable and interoperable. This initiative will provide a community of practice space to exchange information and expertise, be a coordination hub for identifying and prioritizing activities, and a collaboration platform for the development and adoption of semantic solutions. We encourage anyone interested in advancing this mission to engage in this community.
Collapse
Affiliation(s)
- Stephanie D. Holmgren
- Office of Data Science, National Institute of Environmental Health Sciences (NIEHS), Durham, NC 27709, USA;
| | | | | | - Christopher G. Duncan
- Genes, Environment, and Health Branch, Division of Extramural Research and Training, NIEHS, Durham, NC 27709, USA;
| | - Richard K. Kwok
- Epidemiology Branch, Division of Intramural Research, NIEHS, Durham, NC 27709, USA;
- Office of the Director, NIEHS, Bethesda, MD 20892, USA
| | - Ruth M. Lunn
- Integrative Health Assessment Branch, Division of the National Toxicology Program, NIEHS, Durham, NC 27709, USA;
| | | | - Anne E. Thessen
- Environmental and Molecular Toxicology Department, Oregon State University, Corvallis, OR 97331, USA;
| | - Charles P. Schmitt
- Office of Data Science, National Institute of Environmental Health Sciences (NIEHS), Durham, NC 27709, USA;
| |
Collapse
|
155
|
Visani GM, Hughes MC, Hassoun S. Enzyme Promiscuity Prediction Using Hierarchy-Informed Multi-Label Classification. Bioinformatics 2021; 37:2017–2024. [PMID: 33515234 PMCID: PMC8337005 DOI: 10.1093/bioinformatics/btab054] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/31/2020] [Revised: 12/30/2020] [Accepted: 01/22/2021] [Indexed: 11/25/2022] Open
Abstract
MOTIVATION As experimental efforts are costly and time consuming, computational characterization of enzyme capabilities is an attractive alternative. We present and evaluate several machine-learning models to predict which of 983 distinct enzymes, as defined via the Enzyme Commission (EC) numbers, are likely to interact with a given query molecule. Our data consists of enzyme-substrate interactions from the BRENDA database. Some interactions are attributed to natural selection and involve the enzyme's natural substrates. The majority of the interactions however involve non-natural substrates, thus reflecting promiscuous enzymatic activities. RESULTS We frame this "enzyme promiscuity prediction" problem as a multi-label classification task. We maximally utilize inhibitor and unlabelled data to train prediction models that can take advantage of known hierarchical relationships between enzyme classes. We report that a hierarchical multi-label neural network, EPP-HMCNF, is the best model for solving this problem, outperforming k-nearest neighbours similarity-based and other machine learning models. We show that inhibitor information during training consistently improves predictive power, particularly for EPP-HMCNF. We also show that all promiscuity prediction models perform worse under a realistic data split when compared to a random data split, and when evaluating performance on non-natural substrates compared to natural substrates. AVAILABILITY AND IMPLEMENTATION We provide Python code for EPP-HMCNF and other models in a repository termed EPP (Enzyme Promiscuity Prediction) at https://github.com/hassounlab/EPP. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Gian Marco Visani
- Department of Computer Science, Tufts University, Medford, MA 02155, USA
| | - Michael C Hughes
- Department of Computer Science, Tufts University, Medford, MA 02155, USA
| | - Soha Hassoun
- Department of Computer Science, Tufts University, Medford, MA 02155, USA
- Department of Chemical and Biological Engineering, Tufts University, Medford, MA 02155, USA
| |
Collapse
|
156
|
Jang WD, Kim GB, Kim Y, Lee SY. Applications of artificial intelligence to enzyme and pathway design for metabolic engineering. Curr Opin Biotechnol 2021; 73:101-107. [PMID: 34358728 DOI: 10.1016/j.copbio.2021.07.024] [Citation(s) in RCA: 51] [Impact Index Per Article: 12.8] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/05/2021] [Revised: 07/16/2021] [Accepted: 07/17/2021] [Indexed: 01/07/2023]
Abstract
Metabolic engineering for developing industrial strains capable of overproducing bioproducts requires good understanding of cellular metabolism, including metabolic reactions and enzymes. However, metabolic pathways and enzymes involved are still unknown for many products of interest, which presents a key challenge in their biological production. This challenge can be partly overcome by constructing novel biosynthetic pathways through enzyme and pathway design approaches. With the increase in bio-big data, data-driven approaches using artificial intelligence (AI) techniques are allowing more advanced protein and pathway design. In this paper, we review recent studies on AI-aided protein engineering and design, focusing on directed evolution that uses AI approaches to efficiently construct mutant libraries. Also, recent works of AI-aided pathway design strategies, including template-based and template-free approaches, are discussed.
Collapse
Affiliation(s)
- Woo Dae Jang
- Metabolic and Biomolecular Engineering National Research Laboratory, Department of Chemical and Biomolecular Engineering (BK21 four), Korea Advanced Institute of Science and Technology (KAIST), 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea; Systems Metabolic Engineering and Systems Healthcare Cross-Generation Collaborative Laboratory, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea; KAIST Institute for the BioCentury, KAIST Institute for Artificial Intelligence, BioProcess Engineering Research Center and BioInformatics Research Center, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea
| | - Gi Bae Kim
- Metabolic and Biomolecular Engineering National Research Laboratory, Department of Chemical and Biomolecular Engineering (BK21 four), Korea Advanced Institute of Science and Technology (KAIST), 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea; Systems Metabolic Engineering and Systems Healthcare Cross-Generation Collaborative Laboratory, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea
| | - Yeji Kim
- Metabolic and Biomolecular Engineering National Research Laboratory, Department of Chemical and Biomolecular Engineering (BK21 four), Korea Advanced Institute of Science and Technology (KAIST), 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea; Systems Metabolic Engineering and Systems Healthcare Cross-Generation Collaborative Laboratory, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea
| | - Sang Yup Lee
- Metabolic and Biomolecular Engineering National Research Laboratory, Department of Chemical and Biomolecular Engineering (BK21 four), Korea Advanced Institute of Science and Technology (KAIST), 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea; Systems Metabolic Engineering and Systems Healthcare Cross-Generation Collaborative Laboratory, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea; KAIST Institute for the BioCentury, KAIST Institute for Artificial Intelligence, BioProcess Engineering Research Center and BioInformatics Research Center, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea.
| |
Collapse
|
157
|
Yang H, Wang M, Liu X, Zhao XM, Li A. PhosIDN: an integrated deep neural network for improving protein phosphorylation site prediction by combining sequence and protein-protein interaction information. Bioinformatics 2021; 37:4668-4676. [PMID: 34320631 PMCID: PMC8665744 DOI: 10.1093/bioinformatics/btab551] [Citation(s) in RCA: 40] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/23/2020] [Revised: 06/22/2021] [Accepted: 07/27/2021] [Indexed: 11/29/2022] Open
Abstract
Motivation Phosphorylation is one of the most studied post-translational modifications, which plays a pivotal role in various cellular processes. Recently, deep learning methods have achieved great success in prediction of phosphorylation sites, but most of them are based on convolutional neural network that may not capture enough information about long-range dependencies between residues in a protein sequence. In addition, existing deep learning methods only make use of sequence information for predicting phosphorylation sites, and it is highly desirable to develop a deep learning architecture that can combine heterogeneous sequence and protein–protein interaction (PPI) information for more accurate phosphorylation site prediction. Results We present a novel integrated deep neural network named PhosIDN, for phosphorylation site prediction by extracting and combining sequence and PPI information. In PhosIDN, a sequence feature encoding sub-network is proposed to capture not only local patterns but also long-range dependencies from protein sequences. Meanwhile, useful PPI features are also extracted in PhosIDN by a PPI feature encoding sub-network adopting a multi-layer deep neural network. Moreover, to effectively combine sequence and PPI information, a heterogeneous feature combination sub-network is introduced to fully exploit the complex associations between sequence and PPI features, and their combined features are used for final prediction. Comprehensive experiment results demonstrate that the proposed PhosIDN significantly improves the prediction performance of phosphorylation sites and compares favorably with existing general and kinase-specific phosphorylation site prediction methods. Availability and implementation PhosIDN is freely available at https://github.com/ustchangyuanyang/PhosIDN. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Hangyuan Yang
- School of Information Science and Technology, University of Science and Technology of China, Hefei AH230027, China
| | - Minghui Wang
- School of Information Science and Technology, University of Science and Technology of China, Hefei AH230027, China.,Centers for Biomedical Engineering, University of Science and Technology of China, Hefei AH230027, China
| | - Xia Liu
- School of Information Science and Technology, University of Science and Technology of China, Hefei AH230027, China
| | - Xing-Ming Zhao
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China.,MOE Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence and Frontiers Center for Brain Science, China.,Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China
| | - Ao Li
- School of Information Science and Technology, University of Science and Technology of China, Hefei AH230027, China.,Centers for Biomedical Engineering, University of Science and Technology of China, Hefei AH230027, China
| |
Collapse
|
158
|
Kulmanov M, Smaili FZ, Gao X, Hoehndorf R. Semantic similarity and machine learning with ontologies. Brief Bioinform 2021; 22:bbaa199. [PMID: 33049044 PMCID: PMC8293838 DOI: 10.1093/bib/bbaa199] [Citation(s) in RCA: 29] [Impact Index Per Article: 7.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/07/2020] [Revised: 08/03/2020] [Accepted: 08/04/2020] [Indexed: 12/13/2022] Open
Abstract
Ontologies have long been employed in the life sciences to formally represent and reason over domain knowledge and they are employed in almost every major biological database. Recently, ontologies are increasingly being used to provide background knowledge in similarity-based analysis and machine learning models. The methods employed to combine ontologies and machine learning are still novel and actively being developed. We provide an overview over the methods that use ontologies to compute similarity and incorporate them in machine learning methods; in particular, we outline how semantic similarity measures and ontology embeddings can exploit the background knowledge in ontologies and how ontologies can provide constraints that improve machine learning models. The methods and experiments we describe are available as a set of executable notebooks, and we also provide a set of slides and additional resources at https://github.com/bio-ontology-research-group/machine-learning-with-ontologies.
Collapse
Affiliation(s)
| | | | - Xin Gao
- Computational Bioscience Research Center and lead of the Structural and Functional Bioinformatics Group at King Abdullah University of Science and Technology
| | | |
Collapse
|
159
|
You R, Yao S, Mamitsuka H, Zhu S. DeepGraphGO: graph neural network for large-scale, multispecies protein function prediction. Bioinformatics 2021; 37:i262-i271. [PMID: 34252926 PMCID: PMC8294856 DOI: 10.1093/bioinformatics/btab270] [Citation(s) in RCA: 51] [Impact Index Per Article: 12.8] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 04/22/2021] [Indexed: 11/13/2022] Open
Abstract
Motivation Automated function prediction (AFP) of proteins is a large-scale multi-label classification problem. Two limitations of most network-based methods for AFP are (i) a single model must be trained for each species and (ii) protein sequence information is totally ignored. These limitations cause weaker performance than sequence-based methods. Thus, the challenge is how to develop a powerful network-based method for AFP to overcome these limitations. Results We propose DeepGraphGO, an end-to-end, multispecies graph neural network-based method for AFP, which makes the most of both protein sequence and high-order protein network information. Our multispecies strategy allows one single model to be trained for all species, indicating a larger number of training samples than existing methods. Extensive experiments with a large-scale dataset show that DeepGraphGO outperforms a number of competing state-of-the-art methods significantly, including DeepGOPlus and three representative network-based methods: GeneMANIA, deepNF and clusDCA. We further confirm the effectiveness of our multispecies strategy and the advantage of DeepGraphGO over so-called difficult proteins. Finally, we integrate DeepGraphGO into the state-of-the-art ensemble method, NetGO, as a component and achieve a further performance improvement. Availability and implementation https://github.com/yourh/DeepGraphGO. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Ronghui You
- School of Computer Science, Fudan University, Shanghai 200433, China
| | - Shuwei Yao
- School of Computer Science, Fudan University, Shanghai 200433, China
| | - Hiroshi Mamitsuka
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto Prefecture 611-0011, Japan.,Department of Computer Science, Aalto University, Espoo, Finland
| | - Shanfeng Zhu
- Institute of Science and Technology for Brain-Inspired Intelligence and Shanghai Institute of Artificial Intelligence Algorithms, Fudan University, Shanghai 200433, China.,Ministry of Education, Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Shanghai 200433, China.,Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai 200433, China.,MOE Frontiers Center for Brain Science, Fudan University, Shanghai 200433, China.,Zhangjiang Fudan International Innovation Center, Shanghai 200433, China
| |
Collapse
|
160
|
Notaro M, Frasca M, Petrini A, Gliozzo J, Casiraghi E, Robinson PN, Valentini G. HEMDAG: a family of modular and scalable hierarchical ensemble methods to improve Gene Ontology term prediction. Bioinformatics 2021; 37:4526-4533. [PMID: 34240108 DOI: 10.1093/bioinformatics/btab485] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/29/2021] [Revised: 06/15/2021] [Accepted: 07/04/2021] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Automated protein function prediction is a complex multi-class, multi-label, structured classification problem in which protein functions are organized in a controlled vocabulary, according to the Gene Ontology (GO). "Hierarchy-unaware" classifiers, also known as "flat" methods, predict GO terms without exploiting the inherent structure of the ontology, potentially violating the True-Path-Rule (TPR) that governs the GO, while "hierarchy-aware" approaches, even if they obey the TPR, do not always show clear improvements with respect to flat methods, or do not scale well when applied to the full GO. RESULTS To overcome these limitations, we propose Hierarchical Ensemble Methods for Directed Acyclic Graphs (HEMDAG), a family of highly modular hierarchical ensembles of classifiers, able to build upon any flat method and to provide "TPR-safe" predictions, by leveraging a combination of isotonic regression and TPR learning strategies. Extensive experiments on synthetic and real data across several organisms firstly show that HEMDAG can be used as a general tool to improve the predictions of flat classifiers, and secondly that HEMDAG is competitive versus state-of-the-art hierarchy-aware learning methods proposed in the last CAFA international challenges. AVAILABILITY Fully-tested R code freely available at https://anaconda.org/bioconda/r-hemdag. Tutorial and documentation at https://hemdag.readthedocs.io. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Marco Notaro
- AnacletoLab-Dipartimento di Informatica, Università degli Studi di Milano, Via Celoria 18, Milano, 20133, Italy
| | - Marco Frasca
- AnacletoLab-Dipartimento di Informatica, Università degli Studi di Milano, Via Celoria 18, Milano, 20133, Italy
| | - Alessandro Petrini
- AnacletoLab-Dipartimento di Informatica, Università degli Studi di Milano, Via Celoria 18, Milano, 20133, Italy
| | - Jessica Gliozzo
- AnacletoLab-Dipartimento di Informatica, Università degli Studi di Milano, Via Celoria 18, Milano, 20133, Italy
| | - Elena Casiraghi
- AnacletoLab-Dipartimento di Informatica, Università degli Studi di Milano, Via Celoria 18, Milano, 20133, Italy
| | - Peter N Robinson
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, 06032, US
| | - Giorgio Valentini
- AnacletoLab-Dipartimento di Informatica, Università degli Studi di Milano, Via Celoria 18, Milano, 20133, Italy.,CINI, National Laboratory in Artificial Intelligence and Intelligent Systems-AIIS, Roma, Italy.,Data Science Research Center, Università degli Studi di Milano, Milano, 20133, Italy
| |
Collapse
|
161
|
Kulmanov M, Zhapa-Camacho F, Hoehndorf R. DeepGOWeb: fast and accurate protein function prediction on the (Semantic) Web. Nucleic Acids Res 2021; 49:W140-W146. [PMID: 34019664 PMCID: PMC8262746 DOI: 10.1093/nar/gkab373] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/22/2021] [Revised: 04/18/2021] [Accepted: 04/26/2021] [Indexed: 11/24/2022] Open
Abstract
Understanding the functions of proteins is crucial to understand biological processes on a molecular level. Many more protein sequences are available than can be investigated experimentally. DeepGOPlus is a protein function prediction method based on deep learning and sequence similarity. DeepGOWeb makes the prediction model available through a website, an API, and through the SPARQL query language for interoperability with databases that rely on Semantic Web technologies. DeepGOWeb provides accurate and fast predictions and ensures that predicted functions are consistent with the Gene Ontology; it can provide predictions for any protein and any function in Gene Ontology. DeepGOWeb is freely available at https://deepgo.cbrc.kaust.edu.sa/.
Collapse
Affiliation(s)
- Maxat Kulmanov
- Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences & Engineering Division, King Abdullah University of Science and Technology, 4700 King Abdullah University of Science and Technology, Thuwal 23955-6900, Saudi Arabia
| | - Fernando Zhapa-Camacho
- Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences & Engineering Division, King Abdullah University of Science and Technology, 4700 King Abdullah University of Science and Technology, Thuwal 23955-6900, Saudi Arabia
| | - Robert Hoehndorf
- Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences & Engineering Division, King Abdullah University of Science and Technology, 4700 King Abdullah University of Science and Technology, Thuwal 23955-6900, Saudi Arabia
| |
Collapse
|
162
|
Chaudhari M, Thapa N, Roy K, Newman RH, Saigo H, B K C D. DeepRMethylSite: a deep learning based approach for prediction of arginine methylation sites in proteins. Mol Omics 2021; 16:448-454. [PMID: 32555810 DOI: 10.1039/d0mo00025f] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/16/2022]
Abstract
Methylation, which is one of the most prominent post-translational modifications on proteins, regulates many important cellular functions. Though several model-based methylation site predictors have been reported, all existing methods employ machine learning strategies, such as support vector machines and random forest, to predict sites of methylation based on a set of "hand-selected" features. As a consequence, the subsequent models may be biased toward one set of features. Moreover, due to the large number of features, model development can often be computationally expensive. In this paper, we propose an alternative approach based on deep learning to predict arginine methylation sites. Our model, which we termed DeepRMethylSite, is computationally less expensive than traditional feature-based methods while eliminating potential biases that can arise through features selection. Based on independent testing on our dataset, DeepRMethylSite achieved efficiency scores of 68%, 82% and 0.51 with respect to sensitivity (SN), specificity (SP) and Matthew's correlation coefficient (MCC), respectively. Importantly, in side-by-side comparisons with other state-of-the-art methylation site predictors, our method performs on par or better in all scoring metrics tested.
Collapse
Affiliation(s)
- Meenal Chaudhari
- Department of Computational Science and Engineering, North Carolina Agricultural & Technical State University, Greensboro, NC 27411, USA
| | - Niraj Thapa
- Department of Computational Science and Engineering, North Carolina Agricultural & Technical State University, Greensboro, NC 27411, USA
| | - Kaushik Roy
- Department of Computer Science, North Carolina Agricultural & Technical State University, Greensboro, NC 27411, USA
| | - Robert H Newman
- Department of Biology, North Carolina Agricultural & Technical State University, Greensboro, NC 27411, USA
| | - Hiroto Saigo
- Department of Informatics, Kyushu University, Fukuoka 819-0395, Japan
| | - Dukka B K C
- Electrical Engineering and Computer Science Department, Wichita State University, Wichita, KS 67260, USA.
| |
Collapse
|
163
|
Zhao Y, Wang J, Guo M, Zhang X, Yu G. Cross-Species Protein Function Prediction with Asynchronous-Random Walk. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:1439-1450. [PMID: 31562099 DOI: 10.1109/tcbb.2019.2943342] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
Protein function prediction is a fundamental task in the post-genomic era. Available functional annotations of proteins are incomplete and the annotations of two homologous species are complementary to each other. However, how to effectively leverage mutually complementary annotations of different species to further boost the prediction performance is still not well studied. In this paper, we propose a cross-species protein function prediction approach by performing Asynchronous Random Walk on a heterogeneous network (AsyRW). AsyRW first constructs a heterogeneous network to integrate multiple functional association networks derived from different biological data, established homology-relationships between proteins from different species, known annotations of proteins and Gene Ontology (GO). To account for the intrinsic structures of intra- and inter-species of proteins and that of GO, AsyRW quantifies the individual walk lengths of each network node using the gravity-like theory, and then performs asynchronous-random walk with the individual length to predict associations between proteins and GO terms. Experiments on annotations archived in different years show that individual walk length and asynchronous-random walk can effectively leverage the complementary annotations of different species, AsyRW has a significantly improved performance to other related and competitive methods. The codes of AsyRW are available at: http://mlda.swu.edu.cn/codes.php?name=AsyRW.
Collapse
|
164
|
Carrillo-Cabada H, Benson J, Razavi AM, Mulligan B, Cuendet MA, Weinstein H, Taufer M, Estrada T. A Graphic Encoding Method for Quantitative Classification of Protein Structure and Representation of Conformational Changes. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:1336-1349. [PMID: 31603792 PMCID: PMC9119144 DOI: 10.1109/tcbb.2019.2945291] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
In order to successfully predict a proteins function throughout its trajectory, in addition to uncovering changes in its conformational state, it is necessary to employ techniques that maintain its 3D information while performing at scale. We extend a protein representation that encodes secondary and tertiary structure into fix-sized, color images, and a neural network architecture (called GEM-net) that leverages our encoded representation. We show the applicability of our method in two ways: (1) performing protein function prediction, hitting accuracy between 78 and 83 percent, and (2) visualizing and detecting conformational changes in protein trajectories during molecular dynamics simulations.
Collapse
|
165
|
Schaffer LV, Ideker T. Mapping the multiscale structure of biological systems. Cell Syst 2021; 12:622-635. [PMID: 34139169 PMCID: PMC8245186 DOI: 10.1016/j.cels.2021.05.012] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/05/2021] [Revised: 05/04/2021] [Accepted: 05/14/2021] [Indexed: 01/14/2023]
Abstract
Biological systems are by nature multiscale, consisting of subsystems that factor into progressively smaller units in a deeply hierarchical structure. At any level of the hierarchy, an ever-increasing diversity of technologies can be applied to characterize the corresponding biological units and their relations, resulting in large networks of physical or functional proximities-e.g., proximities of amino acids within a protein, of proteins within a complex, or of cell types within a tissue. Here, we review general concepts and progress in using network proximity measures as a basis for creation of multiscale hierarchical maps of biological systems. We discuss the functionalization of these maps to create predictive models, including those useful in translation of genotype to phenotype, along with strategies for model visualization and challenges faced by multiscale modeling in the near future. Collectively, these approaches enable a unified hierarchical approach to biological data, with application from the molecular to the macroscopic.
Collapse
Affiliation(s)
- Leah V Schaffer
- Division of Genetics, Department of Medicine, University of California San Diego, San Diego, La Jolla, CA 92093, USA
| | - Trey Ideker
- Division of Genetics, Department of Medicine, University of California San Diego, San Diego, La Jolla, CA 92093, USA.
| |
Collapse
|
166
|
Lou P, Dong Y, Jimeno Yepes A, Li C. A representation model for biological entities by fusing structured axioms with unstructured texts. Bioinformatics 2021; 37:1156-1163. [PMID: 33107905 DOI: 10.1093/bioinformatics/btaa913] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/26/2020] [Revised: 09/04/2020] [Accepted: 10/13/2020] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Structured semantic resources, for example, biological knowledge bases and ontologies, formally define biological concepts, entities and their semantic relationships, manifested as structured axioms and unstructured texts (e.g. textual definitions). The resources contain accurate expressions of biological reality and have been used by machine-learning models to assist intelligent applications like knowledge discovery. The current methods use both the axioms and definitions as plain texts in representation learning (RL). However, since the axioms are machine-readable while the natural language is human-understandable, difference in meaning of token and structure impedes the representations to encode desirable biological knowledge. RESULTS We propose ERBK, a RL model of bio-entities. Instead of using the axioms and definitions as a textual corpus, our method uses knowledge graph embedding method and deep convolutional neural models to encode the axioms and definitions respectively. The representations could not only encode more underlying biological knowledge but also be further applied to zero-shot circumstance where existing approaches fall short. Experimental evaluations show that ERBK outperforms the existing methods for predicting protein-protein interactions and gene-disease associations. Moreover, it shows that ERBK still maintains promising performance under the zero-shot circumstance. We believe the representations and the method have certain generality and could extend to other types of bio-relation. AVAILABILITY AND IMPLEMENTATION The source code is available at the gitlab repository https://gitlab.com/BioAI/erbk. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Peiliang Lou
- School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China.,Key Laboratory of Intelligent Networks and Network Security (Xi'an Jiaotong University), Ministry of Education, Xi'an, Shaanxi 710049, China
| | - YuXin Dong
- School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China
| | | | - Chen Li
- School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China.,National Engineering Lab for Big Data Analytics, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China
| |
Collapse
|
167
|
Chen H, Shaw D, Bu D, Jiang T. FINER: enhancing the prediction of tissue-specific functions of isoforms by refining isoform interaction networks. NAR Genom Bioinform 2021; 3:lqab057. [PMID: 34169280 PMCID: PMC8219044 DOI: 10.1093/nargab/lqab057] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/04/2021] [Revised: 05/18/2021] [Accepted: 06/03/2021] [Indexed: 12/24/2022] Open
Abstract
Annotating the functions of gene products is a mainstay in biology. A variety of databases have been established to record functional knowledge at the gene level. However, functional annotations at the isoform resolution are in great demand in many biological applications. Although critical information in biological processes such as protein-protein interactions (PPIs) is often used to study gene functions, it does not directly help differentiate the functions of isoforms, as the 'proteins' in the existing PPIs generally refer to 'genes'. On the other hand, the prediction of isoform functions and prediction of isoform-isoform interactions, though inherently intertwined, have so far been treated as independent computational problems in the literature. Here, we present FINER, a unified framework to jointly predict isoform functions and refine PPIs from the gene level to the isoform level, enabling both tasks to benefit from each other. Extensive computational experiments on human tissue-specific data demonstrate that FINER is able to gain at least 5.16% in AUC and 15.1% in AUPRC for functional prediction across multiple tissues by refining noisy PPIs, resulting in significant improvement over the state-of-the-art methods. Some in-depth analyses reveal consistency between FINER's predictions and the tissue specificity as well as subcellular localization of isoforms.
Collapse
Affiliation(s)
- Hao Chen
- Department of Computer Science and Engineering, University of California, Riverside, CA 92521, USA
| | - Dipan Shaw
- Department of Computer Science and Engineering, University of California, Riverside, CA 92521, USA
| | - Dongbo Bu
- Key Lab of Intelligent Information Process, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
- University of Chinese Academy of Sciences, Beijing 100049, China
| | - Tao Jiang
- Department of Computer Science and Engineering, University of California, Riverside, CA 92521, USA
- Bioinformatics Division, BNRIST/Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
| |
Collapse
|
168
|
Structure-based protein function prediction using graph convolutional networks. Nat Commun 2021; 12:3168. [PMID: 34039967 PMCID: PMC8155034 DOI: 10.1038/s41467-021-23303-9] [Citation(s) in RCA: 346] [Impact Index Per Article: 86.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/21/2020] [Accepted: 04/22/2021] [Indexed: 02/04/2023] Open
Abstract
The rapid increase in the number of proteins in sequence databases and the diversity of their functions challenge computational approaches for automated function prediction. Here, we introduce DeepFRI, a Graph Convolutional Network for predicting protein functions by leveraging sequence features extracted from a protein language model and protein structures. It outperforms current leading methods and sequence-based Convolutional Neural Networks and scales to the size of current sequence repositories. Augmenting the training set of experimental structures with homology models allows us to significantly expand the number of predictable functions. DeepFRI has significant de-noising capability, with only a minor drop in performance when experimental structures are replaced by protein models. Class activation mapping allows function predictions at an unprecedented resolution, allowing site-specific annotations at the residue-level in an automated manner. We show the utility and high performance of our method by annotating structures from the PDB and SWISS-MODEL, making several new confident function predictions. DeepFRI is available as a webserver at https://beta.deepfri.flatironinstitute.org/ .
Collapse
|
169
|
Siraj A, Lim DY, Tayara H, Chong KT. UbiComb: A Hybrid Deep Learning Model for Predicting Plant-Specific Protein Ubiquitylation Sites. Genes (Basel) 2021; 12:genes12050717. [PMID: 34064731 PMCID: PMC8151217 DOI: 10.3390/genes12050717] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/15/2021] [Revised: 05/06/2021] [Accepted: 05/07/2021] [Indexed: 12/11/2022] Open
Abstract
Protein ubiquitylation is an essential post-translational modification process that performs a critical role in a wide range of biological functions, even a degenerative role in certain diseases, and is consequently used as a promising target for the treatment of various diseases. Owing to the significant role of protein ubiquitylation, these sites can be identified by enzymatic approaches, mass spectrometry analysis, and combinations of multidimensional liquid chromatography and tandem mass spectrometry. However, these large-scale experimental screening techniques are time consuming, expensive, and laborious. To overcome the drawbacks of experimental methods, machine learning and deep learning-based predictors were considered for prediction in a timely and cost-effective manner. In the literature, several computational predictors have been published across species; however, predictors are species-specific because of the unclear patterns in different species. In this study, we proposed a novel approach for predicting plant ubiquitylation sites using a hybrid deep learning model by utilizing convolutional neural network and long short-term memory. The proposed method uses the actual protein sequence and physicochemical properties as inputs to the model and provides more robust predictions. The proposed predictor achieved the best result with accuracy values of 80% and 81% and F-scores of 79% and 82% on the 10-fold cross-validation and an independent dataset, respectively. Moreover, we also compared the testing of the independent dataset with popular ubiquitylation predictors; the results demonstrate that our model significantly outperforms the other methods in prediction classification results.
Collapse
Affiliation(s)
- Arslan Siraj
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Korea; (A.S.); (D.Y.L.)
| | - Dae Yeong Lim
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Korea; (A.S.); (D.Y.L.)
| | - Hilal Tayara
- School of International Engineering and Science, Jeonbuk National University, Jeonju 54896, Korea
- Correspondence: (H.T.); (K.T.C.)
| | - Kil To Chong
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Korea; (A.S.); (D.Y.L.)
- Advanced Electronics and Information Research Center, Jeonbuk National University, Jeonju 54896, Korea
- Correspondence: (H.T.); (K.T.C.)
| |
Collapse
|
170
|
Giri SJ, Dutta P, Halani P, Saha S. MultiPredGO: Deep Multi-Modal Protein Function Prediction by Amalgamating Protein Structure, Sequence, and Interaction Information. IEEE J Biomed Health Inform 2021; 25:1832-1838. [PMID: 32897865 DOI: 10.1109/jbhi.2020.3022806] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Abstract
Protein is an essential macro-nutrient for perceiving a wide range of biochemical activities and biological regulations in living cells. In this work, we have presented a novel multi-modal approach, named MultiPredGO, for predicting protein functions by utilizing two different kinds of information, namely protein sequence and the protein secondary structure. Here, our contributions are threefold; firstly, along with the protein sequence, we learn the feature representation from the protein structure. Secondly, we develop two different deep learning models after considering the characteristics of the underlying data patterns of the protein sequence and protein 3D structures. Finally, along with these two modalities, we have also utilized protein interaction information for expediting the efficiency of the proposed model in predicting the protein functions. For extracting features from different modalities, we have utilized various variations of the convolutional neural network. As the protein function classes are dependent on each other, we have used a neuro-symbolic hierarchical classification model, which resembles the structure of Gene Ontology (GO), for effectively predicting the dependent protein functions. Finally, to validate the goodness of our proposed method (MultiPredGO), we have compared our results with various uni-modal along with two well-known multi-modal protein function prediction approaches, namely, INGA and DeepGO. Results show that the overall performance of the proposed approach in terms of accuracy, F-measure, precision, and recall metrics are better than those by the state-of-the-art methods. MultiPredGO attains an average 13.05% and 30.87% improvements over the best existing comparing approach (DeepGO) for cellular component and molecular functions, respectively.
Collapse
|
171
|
Villegas-Morcillo A, Makrodimitris S, van Ham RCHJ, Gomez AM, Sanchez V, Reinders MJT. Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function. Bioinformatics 2021; 37:162-170. [PMID: 32797179 PMCID: PMC8055213 DOI: 10.1093/bioinformatics/btaa701] [Citation(s) in RCA: 57] [Impact Index Per Article: 14.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/08/2020] [Revised: 07/10/2020] [Accepted: 08/12/2020] [Indexed: 12/19/2022] Open
Abstract
MOTIVATION Protein function prediction is a difficult bioinformatics problem. Many recent methods use deep neural networks to learn complex sequence representations and predict function from these. Deep supervised models require a lot of labeled training data which are not available for this task. However, a very large amount of protein sequences without functional labels is available. RESULTS We applied an existing deep sequence model that had been pretrained in an unsupervised setting on the supervised task of protein molecular function prediction. We found that this complex feature representation is effective for this task, outperforming hand-crafted features such as one-hot encoding of amino acids, k-mer counts, secondary structure and backbone angles. Also, it partly negates the need for complex prediction models, as a two-layer perceptron was enough to achieve competitive performance in the third Critical Assessment of Functional Annotation benchmark. We also show that combining this sequence representation with protein 3D structure information does not lead to performance improvement, hinting that 3D structure is also potentially learned during the unsupervised pretraining. AVAILABILITY AND IMPLEMENTATION Implementations of all used models can be found at https://github.com/stamakro/GCN-for-Structure-and-Function. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Amelia Villegas-Morcillo
- Department of Signal Theory, Telematics and Communications, University of Granada, 18071 Granada, Spain
| | - Stavros Makrodimitris
- Delft Bioinformatics Lab, Delft University of Technology, 2628XE Delft, The Netherlands
- Keygene N.V., 6708PW Wageningen, The Netherlands
| | - Roeland C H J van Ham
- Delft Bioinformatics Lab, Delft University of Technology, 2628XE Delft, The Netherlands
- Keygene N.V., 6708PW Wageningen, The Netherlands
| | - Angel M Gomez
- Department of Signal Theory, Telematics and Communications, University of Granada, 18071 Granada, Spain
| | - Victoria Sanchez
- Department of Signal Theory, Telematics and Communications, University of Granada, 18071 Granada, Spain
| | - Marcel J T Reinders
- Delft Bioinformatics Lab, Delft University of Technology, 2628XE Delft, The Netherlands
- Leiden Computational Biology Center, Leiden University Medical Center, 2333ZC Leiden, The Netherlands
| |
Collapse
|
172
|
Abstract
INTRODUCTION Knowledge graphs have proven to be promising systems of information storage and retrieval. Due to the recent explosion of heterogeneous multimodal data sources generated in the biomedical domain, and an industry shift toward a systems biology approach, knowledge graphs have emerged as attractive methods of data storage and hypothesis generation. AREAS COVERED In this review, the author summarizes the applications of knowledge graphs in drug discovery. They evaluate their utility; differentiating between academic exercises in graph theory, and useful tools to derive novel insights, highlighting target identification and drug repurposing as two areas showing particular promise. They provide a case study on COVID-19, summarizing the research that used knowledge graphs to identify repurposable drug candidates. They describe the dangers of degree and literature bias, and discuss mitigation strategies. EXPERT OPINION Whilst knowledge graphs and graph-based machine learning have certainly shown promise, they remain relatively immature technologies. Many popular link prediction algorithms fail to address strong biases in biomedical data, and only highlight biological associations, failing to model causal relationships in complex dynamic biological systems. These problems need to be addressed before knowledge graphs reach their true potential in drug discovery.
Collapse
Affiliation(s)
- Finlay MacLean
- Target Identification., BenevolentAI, United Kingdom of Great Britain and Northern Ireland
| |
Collapse
|
173
|
Liu C, Han Z, Zhang ZK, Nussinov R, Cheng F. A network-based deep learning methodology for stratification of tumor mutations. Bioinformatics 2021; 37:82-88. [PMID: 33416857 PMCID: PMC8034530 DOI: 10.1093/bioinformatics/btaa1099] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/30/2020] [Revised: 11/23/2020] [Accepted: 12/28/2020] [Indexed: 12/13/2022] Open
Abstract
MOTIVATION Tumor stratification has a wide range of biomedical and clinical applications, including diagnosis, prognosis and personalized treatment. However, cancer is always driven by the combination of mutated genes, which are highly heterogeneous across patients. Accurately subdividing the tumors into subtypes is challenging. RESULTS We developed a network-embedding based stratification (NES) methodology to identify clinically relevant patient subtypes from large-scale patients' somatic mutation profiles. The central hypothesis of NES is that two tumors would be classified into the same subtypes if their somatic mutated genes located in the similar network regions of the human interactome. We encoded the genes on the human protein-protein interactome with a network embedding approach and constructed the patients' vectors by integrating the somatic mutation profiles of 7344 tumor exomes across 15 cancer types. We firstly adopted the lightGBM classification algorithm to train the patients' vectors. The AUC value is around 0.89 in the prediction of the patient's cancer type and around 0.78 in the prediction of the tumor stage within a specific cancer type. The high classification accuracy suggests that network embedding-based patients' features are reliable for dividing the patients. We conclude that we can cluster patients with a specific cancer type into several subtypes by using an unsupervised clustering algorithm to learn the patients' vectors. Among the 15 cancer types, the new patient clusters (subtypes) identified by the NES are significantly correlated with patient survival across 12 cancer types. In summary, this study offers a powerful network-based deep learning methodology for personalized cancer medicine. AVAILABILITY AND IMPLEMENTATION Source code and data can be downloaded from https://github.com/ChengF-Lab/NES. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Chuang Liu
- Alibaba Research Center for Complexity Sciences, Hangzhou Normal University, Hangzhou 311121, China
| | - Zhen Han
- Alibaba Research Center for Complexity Sciences, Hangzhou Normal University, Hangzhou 311121, China
| | - Zi-Ke Zhang
- Alibaba Research Center for Complexity Sciences, Hangzhou Normal University, Hangzhou 311121, China
- College of Media and International Culture, Zhejiang University, Hangzhou 310028, China
| | - Ruth Nussinov
- Cancer and Inflammation Program, Leidos Biomedical Research, Inc., Frederick National Laboratory for Cancer Research, National Cancer Institute at Frederick, Frederick, MD 21702, USA
- Department of Human Molecular Genetics and Biochemistry, Sackler School of Medicine, Tel Aviv University, Tel Aviv 69978, Israel
| | - Feixiong Cheng
- Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, Cleveland, OH 44195, USA
- Department of Molecular Medicine, Cleveland Clinic Lerner College of Medicine, Case Western Reserve University, Cleveland, OH 44195, USA
- Case Comprehensive Cancer Center, Case Western Reserve University School of Medicine, Cleveland, OH 44106, USA
| |
Collapse
|
174
|
An integrated deep learning and dynamic programming method for predicting tumor suppressor genes, oncogenes, and fusion from PDB structures. Comput Biol Med 2021; 133:104323. [PMID: 33934067 DOI: 10.1016/j.compbiomed.2021.104323] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/17/2020] [Revised: 02/18/2021] [Accepted: 03/07/2021] [Indexed: 11/20/2022]
Abstract
Mutations in proto-oncogenes (ONGO) and the loss of regulatory function of tumor suppression genes (TSG) are the common underlying mechanism for uncontrolled tumor growth. While cancer is a heterogeneous complex of distinct diseases, finding the potentiality of the genes related functionality to ONGO or TSG through computational studies can help develop drugs that target the disease. This paper proposes a classification method that starts with a preprocessing stage to extract the feature map sets from the input 3D protein structural information. The next stage is a deep convolutional neural network stage (DCNN) that outputs the probability of functional classification of genes. We explored and tested two approaches: in Approach 1, all filtered and cleaned 3D-protein-structures (PDB) are pooled together, whereas in Approach 2, the primary structures and their corresponding PDBs are separated according to the genes' primary structural information. Following the DCNN stage, a dynamic programming-based method is used to determine the final prediction of the primary structures' functionality. We validated our proposed method using the COSMIC online database. For the ONGO vs TSG classification problem the AUROC of the DCNN stage for Approach 1 and Approach 2 DCNN are 0.978 and 0.765, respectively. The AUROCs of the final genes' primary structure functionality classification for Approach 1 and Approach 2 are 0.989, and 0.879, respectively. For comparison, the current state-of-the-art reported AUROC is 0.924. Our results warrant further study to apply the deep learning models to humans' (GRCh38) genes, for predicting their corresponding probabilities of functionality in the cancer drivers.
Collapse
|
175
|
Ma Y, Li Q, Hu N, Li L. SeBioGraph: Semi-supervised Deep Learning for the Graph via Sustainable Knowledge Transfer. Front Neurorobot 2021; 15:665055. [PMID: 33867966 PMCID: PMC8047129 DOI: 10.3389/fnbot.2021.665055] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/07/2021] [Accepted: 03/09/2021] [Indexed: 11/17/2022] Open
Abstract
Semi-supervised deep learning for the biomedical graph and advanced manufacturing graph is rapidly becoming an important topic in both academia and industry. Many existing types of research focus on semi-supervised link prediction and node classification, as well as the application of these methods in sustainable development and advanced manufacturing. To date, most manufacturing graph neural networks are mainly evaluated on social and information networks, which improve the quality of network representation y integrating neighbor node descriptions. However, previous methods have not yet been comprehensively studied on biomedical networks. Traditional techniques fail to achieve satisfying results, especially when labeled nodes are deficient in number. In this paper, a new semi-supervised deep learning method for the biomedical graph via sustainable knowledge transfer called SeBioGraph is proposed. In SeBioGraph, both node embedding and graph-specific prototype embedding are utilized as transferable metric space characterized. By incorporating prior knowledge learned from auxiliary graphs, SeBioGraph further promotes the performance of the target graph. Experimental results on the two-class node classification tasks and three-class link prediction tasks demonstrate that the SeBioGraph realizes state-of-the-art results. Finally, the method is thoroughly evaluated.
Collapse
Affiliation(s)
- Yugang Ma
- School of Architecture and Urban Planning, Chongqing University, Chongqing, China
| | - Qing Li
- School of Computer Science, Northwestern Polytechnical University, Shaanxi, China
| | - Nan Hu
- School of Management Science and Real Estate, Chongqing University, Chongqing, China
| | - Lili Li
- China Construction Science & Technology Group Co., Ltd. Shenzhen, China.,College of Civil and Environmental Engineering, Harbin Institute of Technology, Harbin, China
| |
Collapse
|
176
|
Cao Y, Shen Y. TALE: Transformer-based protein function Annotation with joint sequence-Label Embedding. Bioinformatics 2021; 37:2825-2833. [PMID: 33755048 PMCID: PMC8479653 DOI: 10.1093/bioinformatics/btab198] [Citation(s) in RCA: 57] [Impact Index Per Article: 14.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/29/2020] [Revised: 02/08/2021] [Accepted: 03/22/2021] [Indexed: 02/02/2023] Open
Abstract
MOTIVATION Facing the increasing gap between high-throughput sequence data and limited functional insights, computational protein function annotation provides a high-throughput alternative to experimental approaches. However, current methods can have limited applicability while relying on protein data besides sequences, or lack generalizability to novel sequences, species and functions. RESULTS To overcome aforementioned barriers in applicability and generalizability, we propose a novel deep learning model using only sequence information for proteins, named Transformer-based protein function Annotation through joint sequence-Label Embedding (TALE). For generalizability to novel sequences we use self-attention-based transformers to capture global patterns in sequences. For generalizability to unseen or rarely seen functions (tail labels), we embed protein function labels (hierarchical GO terms on directed graphs) together with inputs/features (1D sequences) in a joint latent space. Combining TALE and a sequence similarity-based method, TALE+ outperformed competing methods when only sequence input is available. It even outperformed a state-of-the-art method using network information besides sequence, in two of the three gene ontologies. Furthermore, TALE and TALE+ showed superior generalizability to proteins of low similarity, new species, or rarely annotated functions compared to training data, revealing deep insights into the protein sequence-function relationship. Ablation studies elucidated contributions of algorithmic components toward the accuracy and the generalizability; and a GO term-centric analysis was also provided. AVAILABILITY AND IMPLEMENTATION The data, source codes and models are available at https://github.com/Shen-Lab/TALE. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Yue Cao
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, USA
| | - Yang Shen
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, USA,To whom correspondence should be addressed.
| |
Collapse
|
177
|
Yusuf SM, Zhang F, Zeng M, Li M. DeepPPF: A deep learning framework for predicting protein family. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2020.11.062] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/27/2022]
|
178
|
Seyyedsalehi SF, Soleymani M, Rabiee HR, Mofrad MRK. PFP-WGAN: Protein function prediction by discovering Gene Ontology term correlations with generative adversarial networks. PLoS One 2021; 16:e0244430. [PMID: 33630862 PMCID: PMC7906332 DOI: 10.1371/journal.pone.0244430] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/25/2020] [Accepted: 12/09/2020] [Indexed: 12/12/2022] Open
Abstract
Understanding the functionality of proteins has emerged as a critical problem in recent years due to significant roles of these macro-molecules in biological mechanisms. However, in-laboratory techniques for protein function prediction are not as efficient as methods developed and processed for protein sequencing. While more than 70 million protein sequences are available today, only the functionality of around one percent of them are known. These facts have encouraged researchers to develop computational methods to infer protein functionalities from their sequences. Gene Ontology is the most well-known database for protein functions which has a hierarchical structure, where deeper terms are more determinative and specific. However, the lack of experimentally approved annotations for these specific terms limits the performance of computational methods applied on them. In this work, we propose a method to improve protein function prediction using their sequences by deeply extracting relationships between Gene Ontology terms. To this end, we construct a conditional generative adversarial network which helps to effectively discover and incorporate term correlations in the annotation process. In addition to the baseline algorithms, we compare our method with two recently proposed deep techniques that attempt to utilize Gene Ontology term correlations. Our results confirm the superiority of the proposed method compared to the previous works. Moreover, we demonstrate how our model can effectively help to assign more specific terms to sequences.
Collapse
Affiliation(s)
- Seyyede Fatemeh Seyyedsalehi
- Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
- Department of Mechanical Engineering, University of California Berkeley, Berkeley, California, United States of America
| | - Mahdieh Soleymani
- Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
| | - Hamid R. Rabiee
- Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
| | - Mohammad R. K. Mofrad
- Department of Mechanical Engineering, University of California Berkeley, Berkeley, California, United States of America
| |
Collapse
|
179
|
Zohra Smaili F, Tian S, Roy A, Alazmi M, Arold ST, Mukherjee S, Scott Hefty P, Chen W, Gao X. QAUST: Protein Function Prediction Using Structure Similarity, Protein Interaction, and Functional Motifs. GENOMICS PROTEOMICS & BIOINFORMATICS 2021; 19:998-1011. [PMID: 33631427 PMCID: PMC9403031 DOI: 10.1016/j.gpb.2021.02.001] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 07/11/2018] [Revised: 04/03/2019] [Accepted: 05/17/2019] [Indexed: 11/25/2022]
Abstract
The number of available protein sequences in public databases is increasing exponentially. However, a significant percentage of these sequences lack functional annotation, which is essential for the understanding of how biological systems operate. Here, we propose a novel method, Quantitative Annotation of Unknown STructure (QAUST), to infer protein functions, specifically Gene Ontology (GO) terms and Enzyme Commission (EC) numbers. QAUST uses three sources of information: structure information encoded by global and local structure similarity search, biological network information inferred by protein–protein interaction data, and sequence information extracted from functionally discriminative sequence motifs. These three pieces of information are combined by consensus averaging to make the final prediction. Our approach has been tested on 500 protein targets from the Critical Assessment of Functional Annotation (CAFA) benchmark set. The results show that our method provides accurate functional annotation and outperforms other prediction methods based on sequence similarity search or threading. We further demonstrate that a previously unknown function of human tripartite motif-containing 22 (TRIM22) protein predicted by QAUST can be experimentally validated.
Collapse
Affiliation(s)
- Fatima Zohra Smaili
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia
| | - Shuye Tian
- Department of Biology, Southern University of Science and Technology of China (SUSTC), Shenzhen 518055, China
| | - Ambrish Roy
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
| | - Meshari Alazmi
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia; College of Computer Science and Engineering, University of Hail, Hail 55476, Saudi Arabia
| | - Stefan T Arold
- Biological and Environmental Sciences and Engineering (BESE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia
| | - Srayanta Mukherjee
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
| | - P Scott Hefty
- Department of Molecular Bioscience, University of Kansas, Lawrence, KS 66047, USA
| | - Wei Chen
- Department of Biology, Southern University of Science and Technology of China (SUSTC), Shenzhen 518055, China.
| | - Xin Gao
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia.
| |
Collapse
|
180
|
Makarov I, Kiselev D, Nikitinsky N, Subelj L. Survey on graph embeddings and their applications to machine learning problems on graphs. PeerJ Comput Sci 2021; 7:e357. [PMID: 33817007 PMCID: PMC7959646 DOI: 10.7717/peerj-cs.357] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/21/2020] [Accepted: 12/18/2020] [Indexed: 05/13/2023]
Abstract
Dealing with relational data always required significant computational resources, domain expertise and task-dependent feature engineering to incorporate structural information into a predictive model. Nowadays, a family of automated graph feature engineering techniques has been proposed in different streams of literature. So-called graph embeddings provide a powerful tool to construct vectorized feature spaces for graphs and their components, such as nodes, edges and subgraphs under preserving inner graph properties. Using the constructed feature spaces, many machine learning problems on graphs can be solved via standard frameworks suitable for vectorized feature representation. Our survey aims to describe the core concepts of graph embeddings and provide several taxonomies for their description. First, we start with the methodological approach and extract three types of graph embedding models based on matrix factorization, random-walks and deep learning approaches. Next, we describe how different types of networks impact the ability of models to incorporate structural and attributed data into a unified embedding. Going further, we perform a thorough evaluation of graph embedding applications to machine learning problems on graphs, among which are node classification, link prediction, clustering, visualization, compression, and a family of the whole graph embedding algorithms suitable for graph classification, similarity and alignment problems. Finally, we overview the existing applications of graph embeddings to computer science domains, formulate open problems and provide experiment results, explaining how different networks properties result in graph embeddings quality in the four classic machine learning problems on graphs, such as node classification, link prediction, clustering and graph visualization. As a result, our survey covers a new rapidly growing field of network feature engineering, presents an in-depth analysis of models based on network types, and overviews a wide range of applications to machine learning problems on graphs.
Collapse
Affiliation(s)
- Ilya Makarov
- HSE University, Moscow, Russia
- Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia
| | | | - Nikita Nikitinsky
- Big Data Research Center, National University of Science and Technology MISIS, Moscow, Russia
| | - Lovro Subelj
- Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia
| |
Collapse
|
181
|
Cui F, Zhang Z, Zou Q. Sequence representation approaches for sequence-based protein prediction tasks that use deep learning. Brief Funct Genomics 2021; 20:61-73. [PMID: 33527980 DOI: 10.1093/bfgp/elaa030] [Citation(s) in RCA: 46] [Impact Index Per Article: 11.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/01/2020] [Revised: 12/16/2020] [Accepted: 12/18/2020] [Indexed: 11/12/2022] Open
Abstract
Deep learning has been increasingly used in bioinformatics, especially in sequence-based protein prediction tasks, as large amounts of biological data are available and deep learning techniques have been developed rapidly in recent years. For sequence-based protein prediction tasks, the selection of a suitable model architecture is essential, whereas sequence data representation is a major factor in controlling model performance. Here, we summarized all the main approaches that are used to represent protein sequence data (amino acid sequence encoding or embedding), which include end-to-end embedding methods, non-contextual embedding methods and embedding methods that use transfer learning and others that are applied for some specific tasks (such as protein sequence embedding based on extracted features for protein structure predictions and graph convolutional network-based embedding for drug discovery tasks). We have also reviewed the architectures of various types of embedding models theoretically and the development of these types of sequence embedding approaches to facilitate researchers and users in selecting the model that best suits their requirements.
Collapse
Affiliation(s)
- Feifei Cui
- University of Electronic Science and Technology of China, Chengdu, Sichuan, China
| | - Zilong Zhang
- University of Electronic Science and Technology of China, Chengdu, Sichuan, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
| |
Collapse
|
182
|
Bolleman J, de Castro E, Baratin D, Gehant S, Cuche BA, Auchincloss AH, Coudert E, Hulo C, Masson P, Pedruzzi I, Rivoire C, Xenarios I, Redaschi N, Bridge A. HAMAP as SPARQL rules-A portable annotation pipeline for genomes and proteomes. Gigascience 2021; 9:5731417. [PMID: 32034905 PMCID: PMC7007698 DOI: 10.1093/gigascience/giaa003] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/28/2019] [Revised: 11/30/2019] [Accepted: 01/13/2020] [Indexed: 12/24/2022] Open
Abstract
Background Genome and proteome annotation pipelines are generally custom built and not easily reusable by other groups. This leads to duplication of effort, increased costs, and suboptimal annotation quality. One way to address these issues is to encourage the adoption of annotation standards and technological solutions that enable the sharing of biological knowledge and tools for genome and proteome annotation. Results Here we demonstrate one approach to generate portable genome and proteome annotation pipelines that users can run without recourse to custom software. This proof of concept uses our own rule-based annotation pipeline HAMAP, which provides functional annotation for protein sequences to the same depth and quality as UniProtKB/Swiss-Prot, and the World Wide Web Consortium (W3C) standards Resource Description Framework (RDF) and SPARQL (a recursive acronym for the SPARQL Protocol and RDF Query Language). We translate complex HAMAP rules into the W3C standard SPARQL 1.1 syntax, and then apply them to protein sequences in RDF format using freely available SPARQL engines. This approach supports the generation of annotation that is identical to that generated by our own in-house pipeline, using standard, off-the-shelf solutions, and is applicable to any genome or proteome annotation pipeline. Conclusions HAMAP SPARQL rules are freely available for download from the HAMAP FTP site, ftp://ftp.expasy.org/databases/hamap/sparql/, under the CC-BY-ND 4.0 license. The annotations generated by the rules are under the CC-BY 4.0 license. A tutorial and supplementary code to use HAMAP as SPARQL are available on GitHub at https://github.com/sib-swiss/HAMAP-SPARQL, and general documentation about HAMAP can be found on the HAMAP website at https://hamap.expasy.org.
Collapse
Affiliation(s)
- Jerven Bolleman
- Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Médical Universitaire, 1 rue Michel-Servet, CH-1211 Geneva 4, Switzerland
| | - Edouard de Castro
- Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Médical Universitaire, 1 rue Michel-Servet, CH-1211 Geneva 4, Switzerland
| | - Delphine Baratin
- Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Médical Universitaire, 1 rue Michel-Servet, CH-1211 Geneva 4, Switzerland
| | - Sebastien Gehant
- Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Médical Universitaire, 1 rue Michel-Servet, CH-1211 Geneva 4, Switzerland
| | - Beatrice A Cuche
- Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Médical Universitaire, 1 rue Michel-Servet, CH-1211 Geneva 4, Switzerland
| | - Andrea H Auchincloss
- Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Médical Universitaire, 1 rue Michel-Servet, CH-1211 Geneva 4, Switzerland
| | - Elisabeth Coudert
- Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Médical Universitaire, 1 rue Michel-Servet, CH-1211 Geneva 4, Switzerland
| | - Chantal Hulo
- Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Médical Universitaire, 1 rue Michel-Servet, CH-1211 Geneva 4, Switzerland
| | - Patrick Masson
- Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Médical Universitaire, 1 rue Michel-Servet, CH-1211 Geneva 4, Switzerland
| | - Ivo Pedruzzi
- Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Médical Universitaire, 1 rue Michel-Servet, CH-1211 Geneva 4, Switzerland
| | - Catherine Rivoire
- Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Médical Universitaire, 1 rue Michel-Servet, CH-1211 Geneva 4, Switzerland
| | - Ioannis Xenarios
- Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Médical Universitaire, 1 rue Michel-Servet, CH-1211 Geneva 4, Switzerland.,Centre Hospitalier Universitaire Vaudois/Ludwig Institute for Cancer Research, Agora Centre, CH-1005 Lausanne, Switzerland
| | - Nicole Redaschi
- Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Médical Universitaire, 1 rue Michel-Servet, CH-1211 Geneva 4, Switzerland
| | - Alan Bridge
- Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Médical Universitaire, 1 rue Michel-Servet, CH-1211 Geneva 4, Switzerland
| |
Collapse
|
183
|
Hou Y, Zhang X, Zhou Q, Hong W, Wang Y. Hierarchical Microbial Functions Prediction by Graph Aggregated Embedding. Front Genet 2021; 11:608512. [PMID: 33584804 PMCID: PMC7874084 DOI: 10.3389/fgene.2020.608512] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/21/2020] [Accepted: 11/20/2020] [Indexed: 02/01/2023] Open
Abstract
Matching 16S rRNA gene sequencing data to a metabolic reference database is a meaningful way to predict the metabolic function of bacteria and archaea, bringing greater insight to the working of the microbial community. However, some operational taxonomy units (OTUs) cannot be functionally profiled, especially for microbial communities from non-human samples cultured in defective media. Therefore, we herein report the development of Hierarchical micrObial functions Prediction by graph aggregated Embedding (HOPE), which utilizes co-occurring patterns and nucleotide sequences to predict microbial functions. HOPE integrates topological structures of microbial co-occurrence networks with k-mer compositions of OTU sequences and embeds them into a lower-dimensional continuous latent space, while maximally preserving topological relationships among OTUs. The high imbalance among KEGG Orthology (KO) functions of microbes is recognized in our framework that usually yields poor performance. A hierarchical multitask learning module is used in HOPE to alleviate the challenge brought by the long-tailed distribution among classes. To test the performance of HOPE, we compare it with HOPE-one, HOPE-seq, and GraphSAGE, respectively, in three microbial metagenomic 16s rRNA sequencing datasets, including abalone gut, human gut, and gut of Penaeus monodon. Experiments demonstrate that HOPE outperforms baselines on almost all indexes in all experiments. Furthermore, HOPE reveals significant generalization ability. HOPE's basic idea is suitable for other related scenarios, such as the prediction of gene function based on gene co-expression networks. The source code of HOPE is freely available at https://github.com/adrift00/HOPE.
Collapse
Affiliation(s)
- Yujie Hou
- Department of Automation, Xiamen University, Xiamen, China.,Department of Automation, University of Science and Technology of China, Hefei, China
| | - Xiong Zhang
- Department of Automation, Xiamen University, Xiamen, China.,School of Automation Science and Engineering, South China University of Technology, Guangzhou, China
| | - Qinyan Zhou
- Department of Automation, Xiamen University, Xiamen, China.,Institute of AI and Robotics, Fudan University, Shanghai, China
| | - Wenxing Hong
- Department of Automation, Xiamen University, Xiamen, China.,Xiamen Key Laboratory of Big Data Intelligent Analysis and Decision, Xiamen, China
| | - Ying Wang
- Department of Automation, Xiamen University, Xiamen, China.,Xiamen Key Laboratory of Big Data Intelligent Analysis and Decision, Xiamen, China.,Fujian Key Laboratory of Genetics and Breeding of Marine Organisms, Xiamen, China
| |
Collapse
|
184
|
Shaw D, Chen H, Xie M, Jiang T. DeepLPI: a multimodal deep learning method for predicting the interactions between lncRNAs and protein isoforms. BMC Bioinformatics 2021; 22:24. [PMID: 33461501 PMCID: PMC7814738 DOI: 10.1186/s12859-020-03914-7] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/15/2020] [Accepted: 11/30/2020] [Indexed: 12/14/2022] Open
Abstract
BACKGROUND Long non-coding RNAs (lncRNAs) regulate diverse biological processes via interactions with proteins. Since the experimental methods to identify these interactions are expensive and time-consuming, many computational methods have been proposed. Although these computational methods have achieved promising prediction performance, they neglect the fact that a gene may encode multiple protein isoforms and different isoforms of the same gene may interact differently with the same lncRNA. RESULTS In this study, we propose a novel method, DeepLPI, for predicting the interactions between lncRNAs and protein isoforms. Our method uses sequence and structure data to extract intrinsic features and expression data to extract topological features. To combine these different data, we adopt a hybrid framework by integrating a multimodal deep learning neural network and a conditional random field. To overcome the lack of known interactions between lncRNAs and protein isoforms, we apply a multiple instance learning (MIL) approach. In our experiment concerning the human lncRNA-protein interactions in the NPInter v3.0 database, DeepLPI improved the prediction performance by 4.7% in term of AUC and 5.9% in term of AUPRC over the state-of-the-art methods. Our further correlation analyses between interactive lncRNAs and protein isoforms also illustrated that their co-expression information helped predict the interactions. Finally, we give some examples where DeepLPI was able to outperform the other methods in predicting mouse lncRNA-protein interactions and novel human lncRNA-protein interactions. CONCLUSION Our results demonstrated that the use of isoforms and MIL contributed significantly to the improvement of performance in predicting lncRNA and protein interactions. We believe that such an approach would find more applications in predicting other functional roles of RNAs and proteins.
Collapse
Affiliation(s)
- Dipan Shaw
- Department of Computer Science and Engineering, University of California, Riverside, CA 92521 USA
| | - Hao Chen
- Department of Computer Science and Engineering, University of California, Riverside, CA 92521 USA
| | - Minzhu Xie
- College of Information Science and Engineering, Hunan Normal University, Changsha, China
| | - Tao Jiang
- Department of Computer Science and Engineering, University of California, Riverside, CA 92521 USA
- Bioinformatics Division, BNRIST/Department of Computer Science and Technology, Tsinghua University, Beijing, China
| |
Collapse
|
185
|
Zhou G, Wang J, Zhang X, Guo M, Yu G. Predicting functions of maize proteins using graph convolutional network. BMC Bioinformatics 2020; 21:420. [PMID: 33323113 PMCID: PMC7739465 DOI: 10.1186/s12859-020-03745-6] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/27/2022] Open
Abstract
Background Maize (Zea mays ssp. mays L.) is the most widely grown and yield crop in the world, as well as an important model organism for fundamental research of the function of genes. The functions of Maize proteins are annotated using the Gene Ontology (GO), which has more than 40000 terms and organizes GO terms in a direct acyclic graph (DAG). It is a huge challenge to accurately annotate relevant GO terms to a Maize protein from such a large number of candidate GO terms. Some deep learning models have been proposed to predict the protein function, but the effectiveness of these approaches is unsatisfactory. One major reason is that they inadequately utilize the GO hierarchy. Results To use the knowledge encoded in the GO hierarchy, we propose a deep Graph Convolutional Network (GCN) based model (DeepGOA) to predict GO annotations of proteins. DeepGOA firstly quantifies the correlations (or edges) between GO terms and updates the edge weights of the DAG by leveraging GO annotations and hierarchy, then learns the semantic representation and latent inter-relations of GO terms in the way by applying GCN on the updated DAG. Meanwhile, Convolutional Neural Network (CNN) is used to learn the feature representation of amino acid sequences with respect to the semantic representations. After that, DeepGOA computes the dot product of the two representations, which enable to train the whole network end-to-end coherently. Extensive experiments show that DeepGOA can effectively integrate GO structural information and amino acid information, and then annotates proteins accurately. Conclusions Experiments on Maize PH207 inbred line and Human protein sequence dataset show that DeepGOA outperforms the state-of-the-art deep learning based methods. The ablation study proves that GCN can employ the knowledge of GO and boost the performance. Codes and datasets are available at http://mlda.swu.edu.cn/codes.php?name=DeepGOA.
Collapse
Affiliation(s)
- Guangjie Zhou
- School of Software, Shandong University, Jinan, China.,College of Computer and Information Sciences, Chongqing, China
| | - Jun Wang
- College of Computer and Information Sciences, Chongqing, China
| | - Xiangliang Zhang
- CEMSE, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
| | - Maozu Guo
- School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, China.
| | - Guoxian Yu
- School of Software, Shandong University, Jinan, China. .,College of Computer and Information Sciences, Chongqing, China. .,CEMSE, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia.
| |
Collapse
|
186
|
DeepPheno: Predicting single gene loss-of-function phenotypes using an ontology-aware hierarchical classifier. PLoS Comput Biol 2020; 16:e1008453. [PMID: 33206638 PMCID: PMC7710064 DOI: 10.1371/journal.pcbi.1008453] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/16/2020] [Revised: 12/02/2020] [Accepted: 10/20/2020] [Indexed: 12/21/2022] Open
Abstract
Predicting the phenotypes resulting from molecular perturbations is one of the key challenges in genetics. Both forward and reverse genetic screen are employed to identify the molecular mechanisms underlying phenotypes and disease, and these resulted in a large number of genotype–phenotype association being available for humans and model organisms. Combined with recent advances in machine learning, it may now be possible to predict human phenotypes resulting from particular molecular aberrations. We developed DeepPheno, a neural network based hierarchical multi-class multi-label classification method for predicting the phenotypes resulting from loss-of-function in single genes. DeepPheno uses the functional annotations with gene products to predict the phenotypes resulting from a loss-of-function; additionally, we employ a two-step procedure in which we predict these functions first and then predict phenotypes. Prediction of phenotypes is ontology-based and we propose a novel ontology-based classifier suitable for very large hierarchical classification tasks. These methods allow us to predict phenotypes associated with any known protein-coding gene. We evaluate our approach using evaluation metrics established by the CAFA challenge and compare with top performing CAFA2 methods as well as several state of the art phenotype prediction approaches, demonstrating the improvement of DeepPheno over established methods. Furthermore, we show that predictions generated by DeepPheno are applicable to predicting gene–disease associations based on comparing phenotypes, and that a large number of new predictions made by DeepPheno have recently been added as phenotype databases. Gene–phenotype associations can help to understand the underlying mechanisms of many genetic diseases. However, experimental identification, often involving animal models, is time consuming and expensive. Computational methods that predict gene–phenotype associations can be used instead. We developed DeepPheno, a novel approach for predicting the phenotypes resulting from a loss of function of a single gene. We use gene functions and gene expression as information to prediction phenotypes. Our method uses a neural network classifier that is able to account for hierarchical dependencies between phenotypes. We extensively evaluate our method and compare it with related approaches, and we show that DeepPheno results in better performance in several evaluations. Furthermore, we found that many of the new predictions made by our method have been added to phenotype association databases released one year later. Overall, DeepPheno simulates some aspects of human physiology and how molecular and physiological alterations lead to abnormal phenotypes.
Collapse
|
187
|
Li G, Luo J, Wang D, Liang C, Xiao Q, Ding P, Chen H. Potential circRNA-disease association prediction using DeepWalk and network consistency projection. J Biomed Inform 2020; 112:103624. [PMID: 33217543 DOI: 10.1016/j.jbi.2020.103624] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/06/2020] [Revised: 11/08/2020] [Accepted: 11/12/2020] [Indexed: 12/11/2022]
Abstract
A growing body of experimental studies have reported that circular RNAs (circRNAs) are of interest in pathogenicity mechanism research and are becoming new diagnostic biomarkers. As experimental techniques for identifying disease-circRNA interactions are costly and laborious, some computational predictors have been advanced on the basis of the integration of biological features about circRNAs and diseases. However, the existing circRNA-disease relationships are not well exploited. To solve this issue, a novel method named DeepWalk and network consistency projection for circRNA-disease association prediction (DWNCPCDA) is proposed. Specifically, our method first reveals features of nodes learned by the deep learning method DeepWalk based on known circRNA-disease associations to calculate circRNA-circRNA similarity and disease-disease similarity, and then these two similarity networks are further employed to feed to the network consistency projection method to predict unobserved circRNA-disease interactions. As a result, DWNCPCDA shows high-accuracy performances for disease-circRNA interaction prediction: an AUC of 0.9647 with leave-one-out cross validation and an average AUC of 0.9599 with five-fold cross validation. We further perform case studies to prioritize latent circRNAs related to complex human diseases. Overall, this proposed method is able to provide a promising solution for disease-circRNA interaction prediction, and is capable of enhancing existing similarity-based prediction methods.
Collapse
Affiliation(s)
- Guanghui Li
- School of Information Engineering, East China Jiaotong University, Nanchang, China.
| | - Jiawei Luo
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, China
| | - Diancheng Wang
- School of Information Engineering, East China Jiaotong University, Nanchang, China
| | - Cheng Liang
- College of Information Science and Engineering, Shandong Normal University, Jinan, China
| | - Qiu Xiao
- College of Information Science and Engineering, Hunan Normal University, Changsha, China
| | - Pingjian Ding
- School of Computer Science, University of South China, Hengyang, China
| | - Hailin Chen
- School of Software, East China Jiaotong University, Nanchang, China
| |
Collapse
|
188
|
Foroozandeh Shahraki M, Farhadyar K, Kavousi K, Azarabad MH, Boroomand A, Ariaeenejad S, Hosseini Salekdeh G. A generalized machine-learning aided method for targeted identification of industrial enzymes from metagenome: A xylanase temperature dependence case study. Biotechnol Bioeng 2020; 118:759-769. [PMID: 33095441 DOI: 10.1002/bit.27608] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/08/2020] [Revised: 09/23/2020] [Accepted: 10/11/2020] [Indexed: 11/08/2022]
Abstract
Growing industrial utilization of enzymes and the increasing availability of metagenomic data highlight the demand for effective methods of targeted identification and verification of novel enzymes from various environmental microbiota. Xylanases are a class of enzymes with numerous industrial applications and are involved in the degradation of xylose, a component of lignocellulose. The optimum temperature of enzymes is an essential factor to be considered when choosing appropriate biocatalysts for a particular purpose. Therefore, in silico prediction of this attribute is a significant cost and time-effective step in the effort to characterize novel enzymes. The objective of this study was to develop a computational method to predict the thermal dependence of xylanases. This tool was then implemented for targeted screening of putative xylanases with specific thermal dependencies from metagenomic data and resulted in the identification of three novel xylanases from sheep and cow rumen microbiota. Here we present thermal activity prediction for xylanase, a new sequence-based machine learning method that has been trained using a selected combination of various protein features. This random forest classifier discriminates non-thermophilic, thermophilic, and hyper-thermophilic xylanases. The model's performance was evaluated through multiple iterations of sixfold cross-validations as well as holdout tests, and it is freely accessible as a web-service at arimees.com.
Collapse
Affiliation(s)
- Mehdi Foroozandeh Shahraki
- Laboratory of Complex Biological Systems and Bioinformatics (CBB), Institute of Biochemistry and Biophysics (IBB), University of Tehran, Tehran, Iran
| | - Kiana Farhadyar
- Laboratory of Complex Biological Systems and Bioinformatics (CBB), Institute of Biochemistry and Biophysics (IBB), University of Tehran, Tehran, Iran
| | - Kaveh Kavousi
- Laboratory of Complex Biological Systems and Bioinformatics (CBB), Institute of Biochemistry and Biophysics (IBB), University of Tehran, Tehran, Iran
| | - Mohammad H Azarabad
- Laboratory of Complex Biological Systems and Bioinformatics (CBB), Institute of Biochemistry and Biophysics (IBB), University of Tehran, Tehran, Iran
| | - Amin Boroomand
- School of Natural Sciences, University of California Merced, Merced, California, USA
| | - Shohreh Ariaeenejad
- Department of Systems and Synthetic Biology, Agricultural Biotechnology Research Institute of Iran (ABRII), Agricultural Research Education and Extension Organization (AREEO), Karaj, Iran
| | - Ghasem Hosseini Salekdeh
- Department of Systems and Synthetic Biology, Agricultural Biotechnology Research Institute of Iran (ABRII), Agricultural Research Education and Extension Organization (AREEO), Karaj, Iran.,Department of Molecular Sciences, Macquarie University, Sydney, New South Wales, Australia
| |
Collapse
|
189
|
Deng L, Lin W, Wang J, Zhang J. DeepciRGO: functional prediction of circular RNAs through hierarchical deep neural networks using heterogeneous network features. BMC Bioinformatics 2020; 21:519. [PMID: 33183227 PMCID: PMC7659092 DOI: 10.1186/s12859-020-03748-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/23/2019] [Accepted: 09/11/2020] [Indexed: 12/28/2022] Open
Abstract
Background Circular RNAs (circRNAs) are special noncoding RNA molecules with closed loop structures. Compared with the traditional linear RNA, circRNA is more stable and not easily degraded. Many studies have shown that circRNAs are involved in the regulation of various diseases and cancers. Determining the functions of circRNAs in mammalian cells is of great significance for revealing their mechanism of action in physiological and pathological processes, diagnosis and treatment of diseases. However, determining the functions of circRNAs on a large scale is a challenging task because of the high experimental costs. Results In this paper, we present a hierarchical deep learning model, DeepciRGO, which can effectively predict gene ontology functions of circRNAs. We build a heterogeneous network containing circRNA co-expressions, protein–protein interactions and protein–circRNA interactions. The topology features of proteins and circRNAs are calculated using a novel representation learning approach HIN2Vec across the heterogeneous network. Then, a deep multi-label hierarchical classification model is trained with the topology features to predict the biological process function in the gene ontology for each circRNA. In particular, we manually curated a benchmark dataset containing 185 GO annotations for 62 circRNAs, namely, circRNA2GO-62. The DeepciRGO achieves promising performance on the circRNA2GO-62 dataset with a maximum F-measure of 0.412, a recall score of 0.400, and an accuracy of 0.425, which are significantly better than other state-of-the-art RNA function prediction methods. In addition, we demonstrate the considerable potential of integrating multiple interactions and association networks. Conclusions DeepciRGO will be a useful tool for accurately annotating circRNAs. The experimental results show that integrating multi-source data can help to improve the predictive performance of DeepciRGO. Moreover, The model also can combine RNA structure and sequence information to further optimize predictive performance.
Collapse
Affiliation(s)
- Lei Deng
- School of Computer Science and Engineering, Central South University, Changsha, 410075, China
| | - Wei Lin
- School of Computer Science and Engineering, Central South University, Changsha, 410075, China
| | - Jiacheng Wang
- School of Computer Science and Engineering, Central South University, Changsha, 410075, China
| | - Jingpu Zhang
- School of Computer and Data Science, Henan University of Urban Construction, Pingdingshan, 467000, China.
| |
Collapse
|
190
|
Du Z, He Y, Li J, Uversky VN. DeepAdd: Protein function prediction from k-mer embedding and additional features. Comput Biol Chem 2020; 89:107379. [PMID: 33011616 DOI: 10.1016/j.compbiolchem.2020.107379] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/13/2019] [Revised: 09/15/2020] [Accepted: 09/17/2020] [Indexed: 10/23/2022]
Abstract
With the application of new high throughput sequencing technology, a large number of protein sequences is becoming available. Determination of the functional characteristics of these proteins by experiments is an expensive endeavor that requires a lot of time. Furthermore, at the organismal level, such kind of experimental functional analyses can be conducted only for a very few selected model organisms. Computational function prediction methods can be used to fill this gap. The functions of proteins are classified by Gene Ontology (GO), which contains more than 40,000 classifications in three domains, Molecular Function (MF), Biological Process (BP), and Cellular Component (CC). Additionally, since proteins have many functions, function prediction represents a multi-label and multi-class problem. We developed a new method to predict protein function from sequence. To this end, natural language model was used to generate word embedding of sequence and learn features from it by deep learning, and additional features to locate every protein. Our method uses the dependencies between GO classes as background information to construct a deep learning model. We evaluate our method using the standards established by the Computational Assessment of Function Annotation (CAFA) and have noticeable improvement over several algorithms, such as FFPred, DeepGO, GoFDR and other methods compared on the CAFA3 datasets.
Collapse
Affiliation(s)
- Zhihua Du
- Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen University, Guangdong Province, PR China.
| | - Yufeng He
- Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen University, Guangdong Province, PR China
| | - Jianqiang Li
- Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen University, Guangdong Province, PR China
| | - Vladimir N Uversky
- Department of Molecular Medicine, Morsani College of Medicine, University of South Florida, 12901 Bruce B. Downs Blvd. MDC07, Tampa, FL, USA; USF Health Byrd Alzheimer's Research Institute, Morsani College of Medicine, University of South Florida, 12901 Bruce B. Downs Blvd. MDC07, Tampa, FL, USA; Laboratory of New Methods in Biology, Institute for Biological Instrumentation, Russian Academy of Sciences, Institutskaya Str., 7, Pushchino, Moscow Region, 142290, Russia.
| |
Collapse
|
191
|
Zhang F, Cetin Karayumak S, Hoffmann N, Rathi Y, Golby AJ, O'Donnell LJ. Deep white matter analysis (DeepWMA): Fast and consistent tractography segmentation. Med Image Anal 2020; 65:101761. [PMID: 32622304 PMCID: PMC7483951 DOI: 10.1016/j.media.2020.101761] [Citation(s) in RCA: 48] [Impact Index Per Article: 9.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/23/2019] [Revised: 06/16/2020] [Accepted: 06/18/2020] [Indexed: 02/07/2023]
Abstract
White matter tract segmentation, i.e. identifying tractography fibers (streamline trajectories) belonging to anatomically meaningful fiber tracts, is an essential step to enable tract quantification and visualization. In this study, we present a deep learning tractography segmentation method (DeepWMA) that allows fast and consistent identification of 54 major deep white matter fiber tracts from the whole brain. We create a large-scale training tractography dataset of 1 million labeled fiber samples, and we propose a novel 2D multi-channel feature descriptor (FiberMap) that encodes spatial coordinates of points along each fiber. We learn a convolutional neural network (CNN) fiber classification model based on FiberMap and obtain a high fiber classification accuracy of 90.99% on the training tractography data with ground truth fiber labels. Then, the method is evaluated on a test dataset of 597 diffusion MRI scans from six independently acquired populations across genders, the lifespan (1 day - 82 years), and different health conditions (healthy control, neuropsychiatric disorders, and brain tumor patients). We perform comparisons with two state-of-the-art tract segmentation methods. Experimental results show that our method obtains a highly consistent tract segmentation result, where on average over 99% of the fiber tracts are successfully identified across all subjects under study, most importantly, including neonates and patients with space-occupying brain tumors. We also demonstrate good generalization of the method to tractography data from multiple different fiber tracking methods. The proposed method leverages deep learning techniques and provides a fast and efficient tool for brain white matter segmentation in large diffusion MRI tractography datasets.
Collapse
Affiliation(s)
- Fan Zhang
- Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA.
| | | | - Nico Hoffmann
- Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
| | - Yogesh Rathi
- Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
| | - Alexandra J Golby
- Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
| | | |
Collapse
|
192
|
Dutta P, Mishra P, Saha S. Incomplete multi-view gene clustering with data regeneration using Shape Boltzmann Machine. Comput Biol Med 2020; 125:103965. [PMID: 32931989 DOI: 10.1016/j.compbiomed.2020.103965] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/30/2020] [Revised: 08/08/2020] [Accepted: 08/08/2020] [Indexed: 11/17/2022]
Abstract
Deciphering patterns in the structural and functional anatomy of genes can prove to be very helpful in understanding genetic biology and genomics. Also, the availability of the multiple omics data, along with the advent of machine learning techniques, aids medical professionals in gaining insights about various biological regulations. Gene clustering is one of the many such computation techniques that can help in understanding gene behavior. However, more comprehensive and reliable insights can be gained if different modalities/views of biomedical data are considered. However, in most multi-view cases, each view contains some missing data, leading to incomplete multi-view clustering. In this study, we have presented a deep Boltzmann machine-based incomplete multi-view clustering framework for gene clustering. Here, we seek to regenerate the data of the three NCBI datasets in the incomplete modalities using Shape Boltzmann Machines. The overall performance of the proposed multi-view clustering technique has been evaluated using the Silhouette index and Davies-Bouldin index, and the comparative analysis shows an improvement over state-of-the-art methods. Finally, to prove that the improvement attained by the proposed incomplete multi-view clustering is statistically significant, we perform Welch's t-test. AVAILABILITY OF DATA AND MATERIALS: https://github.com/piyushmishra12/IMC.
Collapse
Affiliation(s)
- Pratik Dutta
- Department of Computer Science and Engineering, Indian Institute of Technology, Patna, India
| | - Piyush Mishra
- Department of Computer Science and Engineering, IIIT, Bhubaneswar, India
| | - Sriparna Saha
- Department of Computer Science and Engineering, Indian Institute of Technology, Patna, India.
| |
Collapse
|
193
|
Hew B, Tan QW, Goh W, Ng JWX, Mutwil M. LSTrAP-Crowd: prediction of novel components of bacterial ribosomes with crowd-sourced analysis of RNA sequencing data. BMC Biol 2020; 18:114. [PMID: 32883264 PMCID: PMC7470450 DOI: 10.1186/s12915-020-00846-9] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/21/2020] [Accepted: 08/12/2020] [Indexed: 12/12/2022] Open
Abstract
BACKGROUND Bacterial resistance to antibiotics is a growing health problem that is projected to cause more deaths than cancer by 2050. Consequently, novel antibiotics are urgently needed. Since more than half of the available antibiotics target the structurally conserved bacterial ribosomes, factors involved in protein synthesis are thus prime targets for the development of novel antibiotics. However, experimental identification of these potential antibiotic target proteins can be labor-intensive and challenging, as these proteins are likely to be poorly characterized and specific to few bacteria. Here, we use a bioinformatics approach to identify novel components of protein synthesis. RESULTS In order to identify these novel proteins, we established a Large-Scale Transcriptomic Analysis Pipeline in Crowd (LSTrAP-Crowd), where 285 individuals processed 26 terabytes of RNA-sequencing data of the 17 most notorious bacterial pathogens. In total, the crowd processed 26,269 RNA-seq experiments and used the data to construct gene co-expression networks, which were used to identify more than a hundred uncharacterized genes that were transcriptionally associated with protein synthesis. We provide the identity of these genes together with the processed gene expression data. CONCLUSIONS We identified genes related to protein synthesis in common bacterial pathogens and thus provide a resource of potential antibiotic development targets for experimental validation. The data can be used to explore additional vulnerabilities of bacteria, while our approach demonstrates how the processing of gene expression data can be easily crowd-sourced.
Collapse
Affiliation(s)
- Benedict Hew
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore, 637551, Singapore
| | - Qiao Wen Tan
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore, 637551, Singapore
| | - William Goh
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore, 637551, Singapore
| | - Jonathan Wei Xiong Ng
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore, 637551, Singapore
| | - Marek Mutwil
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore, 637551, Singapore.
| |
Collapse
|
194
|
Ranjan A, Fahad MS, Fernandez-Baca D, Deepak A, Tripathi S. Deep Robust Framework for Protein Function Prediction Using Variable-Length Protein Sequences. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:1648-1659. [PMID: 30998479 DOI: 10.1109/tcbb.2019.2911609] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/09/2023]
Abstract
The order of amino acids in a protein sequence enables the protein to acquire a conformation suitable for performing functions, thereby motivating the need to analyze these sequences for predicting functions. Although machine learning based approaches are fast compared to methods using BLAST, FASTA, etc., they fail to perform well for long protein sequences (with more than 300 amino acids). In this paper, we introduce a novel method for construction of two separate feature sets for protein using bi-directional long short-term memory network based on the analysis of fixed 1) single-sized segments and 2) multi-sized segments. The model trained on the proposed feature set based on multi-sized segments is combined with the model trained using state-of-the-art Multi-label Linear Discriminant Analysis (MLDA) features to further improve the accuracy. Extensive evaluations using separate datasets for biological processes and molecular functions demonstrate not only improved results for long sequences, but also significantly improve the overall accuracy over state-of-the-art method. The single-sized approach produces an improvement of +3.37 percent for biological processes and +5.48 percent for molecular functions over the MLDA based classifier. The corresponding numbers for multi-sized approach are +5.38 and +8.00 percent. Combining the two models, the accuracy further improves to +7.41 and +9.21 percent, respectively.
Collapse
|
195
|
Fan K, Zhang Y. Pseudo2GO: A Graph-Based Deep Learning Method for Pseudogene Function Prediction by Borrowing Information From Coding Genes. Front Genet 2020; 11:807. [PMID: 33014009 PMCID: PMC7461887 DOI: 10.3389/fgene.2020.00807] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/26/2020] [Accepted: 07/06/2020] [Indexed: 12/16/2022] Open
Abstract
Pseudogenes are indicating more and more functional potentials recently, though historically were regarded as relics of evolution. Computational methods for predicting pseudogene functions on Gene Ontology is important for directing experimental discovery. However, no pseudogene-specific computational methods have been proposed to directly predict their Gene Ontology (GO) terms. The biggest challenge for pseudogene function prediction is the lack of enough features and functional annotations, making training a predictive model difficult. Considering the close functional similarity between pseudogenes and their parent coding genes that share great amount of DNA sequence, as well as that coding genes have rich annotations, we aim to predict pseudogene functions by borrowing information from coding genes in a graph-based way. Here we propose Pseudo2GO, a graph-based deep learning semi-supervised model for pseudogene function prediction. A sequence similarity graph is first constructed to connect pseudogenes and coding genes. Multiple features are incorporated into the model as the node attributes to enable the graph an attributed graph, including expression profiles, interactions with microRNAs, protein-protein interactions (PPIs), and genetic interactions. Graph convolutional networks are used to propagate node attributes across the graph to make classifications on pseudogenes. Comparing Pseudo2GO with other frameworks adapted from popular protein function prediction methods, we demonstrated that our method has achieved state-of-the-art performance, significantly outperforming other methods in terms of the M-AUPR metric.
Collapse
Affiliation(s)
- Kunjie Fan
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, United States
| | - Yan Zhang
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, United States
- The Ohio State University Comprehensive Cancer Center, Columbus, OH, United States
| |
Collapse
|
196
|
Saha S, Prasad A, Chatterjee P, Basu S, Nasipuri M. Protein function prediction from dynamic protein interaction network using gene expression data. J Bioinform Comput Biol 2020; 17:1950025. [PMID: 31617461 DOI: 10.1142/s0219720019500252] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
Computational prediction of functional annotation of proteins is an uphill task. There is an ever increasing gap between functional characterization of protein sequences and deluge of protein sequences generated by large-scale sequencing projects. The dynamic nature of protein interactions is frequently observed which is mostly influenced by any new change of state or change in stimuli. Functional characterization of proteins can be inferred from their interactions with each other, which is dynamic in nature. In this work, we have used a dynamic protein-protein interaction network (PPIN), time course gene expression data and protein sequence information for prediction of functional annotation of proteins. During progression of a particular function, it has also been observed that not all the proteins are active at all time points. For unannotated active proteins, our proposed methodology explores the dynamic PPIN consisting of level-1 and level-2 neighboring proteins at different time points, filtered by Damerau-Levenshtein edit distance to estimate the similarity between two protein sequences and coefficient variation methods to assess the strength of an edge in a network. Finally, from the filtered dynamic PPIN, at each time point, functional annotations of the level-2 proteins are assigned to the unknown and unannotated active proteins through the level-1 neighbor, following a bottom-up strategy. Our proposed methodology achieves an average precision, recall and F-Score of 0.59, 0.76 and 0.61 respectively, which is significantly higher than the reported state-of-the-art methods.
Collapse
Affiliation(s)
- Sovan Saha
- Department of Computer Science & Engineering, Dr. Sudhir Chandra Sur Degree Engineering College, 540, Dum Dum Road, Near Dum Dum Jn. Station, Surermath, Kolkata 700074, India
| | - Abhimanyu Prasad
- Department of Computer Science & Engineering, Dr. Sudhir Chandra Sur Degree Engineering College, 540, Dum Dum Road, Near Dum Dum Jn. Station, Surermath, Kolkata 700074, India
| | - Piyali Chatterjee
- Department of Computer Science & Engineering, Netaji Subhash Engineering College, Techno City, Panchpota, Garia, Kolkata 700152, India
| | - Subhadip Basu
- Department of Computer Science & Engineering, Jadavpur University, 188, Raja S.C. Mallick Road, Kolkata 700032, India
| | - Mita Nasipuri
- Department of Computer Science & Engineering, Netaji Subhash Engineering College, Techno City, Panchpota, Garia, Kolkata 700152, India
| |
Collapse
|
197
|
Fan K, Guan Y, Zhang Y. Graph2GO: a multi-modal attributed network embedding method for inferring protein functions. Gigascience 2020; 9:giaa081. [PMID: 32770210 PMCID: PMC7414417 DOI: 10.1093/gigascience/giaa081] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/14/2019] [Revised: 04/30/2020] [Indexed: 01/17/2023] Open
Abstract
BACKGROUND Identifying protein functions is important for many biological applications. Since experimental functional characterization of proteins is time-consuming and costly, accurate and efficient computational methods for predicting protein functions are in great demand for generating the testable hypotheses guiding large-scale experiments." RESULTS Here, we propose Graph2GO, a multi-modal graph-based representation learning model that can integrate heterogeneous information, including multiple types of interaction networks (sequence similarity network and protein-protein interaction network) and protein features (amino acid sequence, subcellular location, and protein domains) to predict protein functions on gene ontology. Comparing Graph2GO to BLAST, as a baseline model, and to two popular protein function prediction methods (Mashup and deepNF), we demonstrated that our model can achieve state-of-the-art performance. We show the robustness of our model by testing on multiple species. We also provide a web server supporting function query and downstream analysis on-the-fly. CONCLUSIONS Graph2GO is the first model that has utilized attributed network representation learning methods to model both interaction networks and protein features for predicting protein functions, and achieved promising performance. Our model can be easily extended to include more protein features to further improve the performance. Besides, Graph2GO is also applicable to other application scenarios involving biological networks, and the learned latent representations can be used as feature inputs for machine learning tasks in various downstream analyses.
Collapse
Affiliation(s)
- Kunjie Fan
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
| | - Yuanfang Guan
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
| | - Yan Zhang
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
- The Ohio State University Comprehensive Cancer Center (OSUCCC - James), Columbus, OH 43210, USA
| |
Collapse
|
198
|
Le DH. UFO: A tool for unifying biomedical ontology-based semantic similarity calculation, enrichment analysis and visualization. PLoS One 2020; 15:e0235670. [PMID: 32645039 PMCID: PMC7347127 DOI: 10.1371/journal.pone.0235670] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/17/2019] [Accepted: 06/22/2020] [Indexed: 02/06/2023] Open
Abstract
Background Biomedical ontologies have been growing quickly and proven to be useful in many biomedical applications. Important applications of those data include estimating the functional similarity between ontology terms and between annotated biomedical entities, analyzing enrichment for a set of biomedical entities. Many semantic similarity calculation and enrichment analysis methods have been proposed for such applications. Also, a number of tools implementing the methods have been developed on different platforms. However, these tools have implemented a small number of the semantic similarity calculation and enrichment analysis methods for a certain type of biomedical ontology. Note that the methods can be applied to all types of biomedical ontologies. More importantly, each method can be dominant in different applications; thus, users have more choice with more number of methods implemented in tools. Also, more functions would facilitate their task with ontology. Results In this study, we developed a Cytoscape app, named UFO, which unifies most of the semantic similarity measures for between-term and between-entity similarity calculation for all types of biomedical ontologies in OBO format. Based on the similarity calculation, UFO can calculate the similarity between two sets of entities and weigh imported entity networks as well as generate functional similarity networks. Besides, it can perform enrichment analysis of a set of entities by different methods. Moreover, UFO can visualize structural relationships between ontology terms, annotating relationships between entities and terms, and functional similarity between entities. Finally, we demonstrated the ability of UFO through some case studies on finding the best semantic similarity measures for assessing the similarity between human disease phenotypes, constructing biomedical entity functional similarity networks for predicting disease-associated biomarkers, and performing enrichment analysis on a set of similar phenotypes. Conclusions Taken together, UFO is expected to be a tool where biomedical ontologies can be exploited for various biomedical applications. Availability UFO is distributed as a Cytoscape app, and can be downloaded freely at Cytoscape App (http://apps.cytoscape.org/apps/ufo) for non-commercial use
Collapse
Affiliation(s)
- Duc-Hau Le
- Department of Computational Biomedicine, Vingroup Big Data Institute, Hanoi, Vietnam
- School of Computer Science and Engineering, Thuyloi University, Hanoi, Vietnam
- * E-mail:
| |
Collapse
|
199
|
Liu Z, Chen Q, Lan W, Liang J, Chen YPP, Chen B. A Survey of Network Embedding for Drug Analysis and Prediction. Curr Protein Pept Sci 2020; 22:CPPS-EPUB-107859. [PMID: 32614745 DOI: 10.2174/1389203721666200702145701] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/01/2020] [Revised: 04/05/2020] [Accepted: 05/21/2020] [Indexed: 11/22/2022]
Abstract
Traditional network-based computational methods have shown good results in drug analysis and prediction. However, these methods are time consuming and lack universality, and it is difficult to exploit the auxiliary information of nodes and edges. Network embedding provides a promising way for alleviating the above problems by transforming network into a low-dimensional space while preserving network structure and auxiliary information. This thus facilitates the application of machine learning algorithms for subsequent processing. Network embedding has been introduced into drug analysis and prediction in the last few years, and has shown superior performance over traditional methods. However, there is no systematic review of this issue. This article offers a comprehensive survey of the primary network embedding methods and their applications in drug analysis and prediction. The network embedding technologies applied in homogeneous network and heterogeneous network are investigated and compared, including matrix decomposition, random walk, and deep learning. Especially, the Graph neural network (GNN) methods in deep learning are highlighted. Further, the applications of network embedding in drug similarity estimation, drug-target interaction prediction, adverse drug reactions prediction, protein function and therapeutic peptides prediction are discussed. Several future potential research directions are also discussed.
Collapse
Affiliation(s)
- Zhixian Liu
- School of Medical, Guangxi University, Nanning. China
| | - Qingfeng Chen
- School of Computer, Electronic and Information, Guangxi University, Nanning. China
| | - Wei Lan
- School of Computer, Electronic and Information, Guangxi University, Nanning. China
| | - Jiahai Liang
- School of Electronics and Information Engineering, Beibu Gulf University, Qinzhou. China
| | - Yi-Ping Phoebe Chen
- Department of Computer Science and Information Technology, La Trobe University, Melbourne. Australia
| | - Baoshan Chen
- State Key Laboratory for Conservation and Utilization of Subtropical Agro-bioresources, Guangxi University, Nanning. China
| |
Collapse
|
200
|
Lennox M, Robertson N, Devereux B. Expanding the Vocabulary of a Protein: Application of Subword Algorithms to Protein Sequence Modelling. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2020; 2020:2361-2367. [PMID: 33018481 DOI: 10.1109/embc44109.2020.9176380] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
Deep learning has proven to be a useful tool for modelling protein properties. However, given the variability in the length of proteins, it can be difficult to summarise the sequence of amino acids effectively. In many cases, as a result of using fixed-length representations, information about long proteins can be lost through truncation, or model training can be slow due to the use of excessive padding. In this work, we aim to overcome these problems by expanding upon the original vocabulary used to represent the protein sequence. To this end, we utilise two prominent subword algorithms that have been previously used to reach state-of-the-art results in various Natural Language Processing tasks. The algorithms are used to encode the original protein sequence into a set of subsequences before they are analysed by a Doc2Vec model. The pre-trained encodings produced by each algorithm are tested on a variety of downstream tasks: four protein property prediction tasks (plasma membrane localization, thermostability, peak absorption wavelength, enantioselectivity) as well as drug-target affinity prediction tasks over two datasets. Our results significantly improve on the state-of-the-art for these tasks, demonstrating the benefits of using subword compression algorithms for modelling proteins.
Collapse
|