1
Bonetta R, Valentino G. Machine learning techniques for protein function prediction. Proteins 2019; 88:397-413. [PMID: 31603244 DOI: 10.1002/prot.25832] [Citation(s) in RCA: 76] [Impact Index Per Article: 12.7] [Received: 03/19/2019] [Revised: 07/05/2019] [Accepted: 09/17/2019] [Indexed: 12/17/2022]
Abstract
Proteins play important roles in living organisms, and their function is directly linked with their structure. Due to the growing gap between the number of proteins being discovered and their functional characterization (in particular as a result of experimental limitations), reliable prediction of protein function through computational means has become crucial. This paper reviews the machine learning techniques used in the literature, following their evolution from simple algorithms such as logistic regression to more advanced methods like support vector machines and modern deep neural networks. Hyperparameter optimization methods adopted to boost prediction performance are presented. In parallel, the metamorphosis in the features used by these algorithms, from classical physicochemical properties and amino acid composition up to text-derived features from biomedical literature and learned feature representations using autoencoders, is also reviewed, together with feature selection and dimensionality reduction techniques. The success stories in the application of these techniques to both general and specific protein function prediction are discussed.
Affiliation(s)
- Rosalin Bonetta
  - Centre for Molecular Medicine and Biobanking, University of Malta, Msida, Malta
- Gianluca Valentino
  - Department of Communications and Computer Engineering, University of Malta, Msida, Malta
2
Taha K, Iraqi Y, Al Aamri A. Predicting protein functions by applying predicate logic to biomedical literature. BMC Bioinformatics 2019; 20:71. [PMID: 30736739 PMCID: PMC6368809 DOI: 10.1186/s12859-019-2594-y] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 06/11/2018] [Accepted: 01/03/2019] [Indexed: 12/02/2022] Open
Abstract
BACKGROUND A large number of computational methods have been proposed for predicting protein functions. The underlying techniques adopted by most of these methods revolve around predicting the functions of an unannotated protein p from already annotated proteins that have characteristics similar to those of p. Recent Information Extraction methods take advantage of the huge growth of biomedical literature to predict protein functions. They extract biological molecule terms that directly describe protein functions from biomedical texts. However, they consider only explicitly mentioned terms that co-occur with proteins in texts. We observe that some important biological molecule terms pertaining to functional categories may implicitly co-occur with proteins in texts. Therefore, the methods that rely solely on explicitly mentioned terms in texts may miss vital functional information implicitly mentioned in the texts. RESULTS To overcome the limitations of methods that rely solely on explicitly mentioned terms in texts to predict protein functions, we propose in this paper an Information Extraction system called PL-PPF. The proposed system employs techniques for predicting the functions of proteins based on their co-occurrences with explicitly and implicitly mentioned biological molecule terms that pertain to functional categories in biomedical literature. That is, PL-PPF employs a combination of statistical-based explicit term extraction techniques and logic-based implicit term extraction techniques. The statistical component of PL-PPF predicts some of the functions of a protein by extracting the explicitly mentioned functional terms that directly describe the functions of the protein from the biomedical texts associated with the protein. The logic-based component of PL-PPF predicts additional functions of the protein by inferring the functional terms that co-occur implicitly with the protein in the biomedical texts associated with it.
First, the system employs its statistical-based component to extract the explicitly mentioned functional terms. Then, it employs its logic-based component to infer additional functions of the protein. Our hypothesis is that important biological molecule terms pertaining to functional categories of proteins are likely to co-occur implicitly with the proteins in biomedical texts. We evaluated PL-PPF experimentally and compared it with five systems. Results revealed better prediction performance. CONCLUSIONS The experimental results showed that PL-PPF outperformed the other five systems. This is an indication of the effectiveness and practical viability of PL-PPF's combination of explicit and implicit techniques. We also evaluated two versions of PL-PPF: one adopting the complete techniques (i.e., adopting both the implicit and explicit techniques) and the other adopting only the explicit term co-occurrence extraction techniques (i.e., without the inference rules of predicate logic). The experimental results showed that the complete version significantly outperformed the other version. This is attributed to the effectiveness of the rules of predicate logic in inferring functional terms that co-occur implicitly with proteins in biomedical texts. A demo application of PL-PPF can be accessed through the following link: http://ecesrvr.kustar.ac.ae:8080/plppf/.
Affiliation(s)
- Kamal Taha
  - Department of Electrical and Computer Engineering, Khalifa University, Abu Dhabi, United Arab Emirates
- Youssef Iraqi
  - Department of Electrical and Computer Engineering, Khalifa University, Abu Dhabi, United Arab Emirates
- Amira Al Aamri
  - Department of Electrical and Computer Engineering, Khalifa University, Abu Dhabi, United Arab Emirates
3
Fodeh SJ, Tiwari A. Exploiting MEDLINE for gene molecular function prediction via NMF based multi-label classification. J Biomed Inform 2018; 86:160-166. [PMID: 30130573 DOI: 10.1016/j.jbi.2018.08.009] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 08/18/2017] [Revised: 08/13/2018] [Accepted: 08/17/2018] [Indexed: 11/25/2022]
Abstract
Gene ontology (GO) provides a representation of terms and categories used to describe genes and their molecular functions, cellular components and biological processes. GO has been the standard for describing the functions of specific genes in different model organisms. GO annotation, or the tagging of genes with GO terms, has mostly been a manual and time-consuming curation process. Although many automated approaches have been proposed for annotation, few have utilized knowledge available in the literature. In this manuscript, we describe the development and evaluation of an innovative predictive system to automatically assign molecular functions (GO terms) to genes using the biomedical literature. Because genes could be associated with multiple molecular functions, we posed the GO molecular function annotation as a multi-label classification problem with several classes. We used non-negative matrix factorization (NMF) for feature reduction and then classified the genes. To address the multi-label aspect of the data, we used the binary-relevance method. Although we experimented with several classifiers, the combination of binary-relevance and K-nearest neighbor (KNN) classifier performed best. Our evaluation on UniProtKB/Swiss-Prot dataset showed the best performance of 0.84 in terms of F1-measure.
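The binary-relevance step described in this abstract can be illustrated with a minimal sketch. It assumes the feature vectors have already been reduced (the NMF step is omitted here), and the toy vectors, label names, and function names are hypothetical, not taken from the paper. Binary relevance simply turns a multi-label problem into one independent yes/no KNN problem per label:

```python
from math import dist


def knn_predict_binary(train_x, train_y, query, k=3):
    """Majority vote among the k nearest training points for ONE binary label."""
    nearest = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], query))[:k]
    votes = sum(train_y[i] for i in nearest)
    return 1 if 2 * votes > k else 0


def binary_relevance_knn(train_x, train_labels, query, k=3):
    """Binary relevance: solve one independent binary KNN problem per label."""
    all_labels = sorted({label for labels in train_labels for label in labels})
    predicted = set()
    for label in all_labels:
        # Recode the multi-label targets as 0/1 for this label only.
        y = [1 if label in labels else 0 for labels in train_labels]
        if knn_predict_binary(train_x, y, query, k):
            predicted.add(label)
    return predicted


# Hypothetical reduced feature vectors (stand-ins for NMF output) and label sets.
train_x = [(0.9, 0.1), (0.8, 0.2), (0.85, 0.3), (0.1, 0.9), (0.2, 0.8), (0.15, 0.7)]
train_labels = [
    {"binding"}, {"binding", "catalysis"}, {"binding"},
    {"transport"}, {"transport", "catalysis"}, {"transport"},
]

print(binary_relevance_knn(train_x, train_labels, (0.8, 0.15)))
```

Because each label is decided separately, a gene can receive zero, one, or several labels, which is what makes the scheme suitable for genes with multiple molecular functions.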
Affiliation(s)
- Samah Jamal Fodeh
  - Yale Center for Medical Informatics, Yale University, 300 George st, Suite 501, New Haven, CT 06511, United States
4
Bada M, Vasilevsky N, Baumgartner WA, Haendel M, Hunter LE. Gold-standard ontology-based anatomical annotation in the CRAFT Corpus. Database: The Journal of Biological Databases and Curation 2017; 2017:4780291. [PMID: 31725864 PMCID: PMC7243923 DOI: 10.1093/database/bax087] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 07/28/2017] [Revised: 10/25/2017] [Accepted: 10/27/2017] [Indexed: 12/24/2022]
Abstract
Gold-standard annotated corpora have become important resources for the training and testing of natural-language-processing (NLP) systems designed to support biocuration efforts, and ontologies are increasingly used to facilitate curational consistency and semantic integration across disparate resources. Bringing together the respective power of these, the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of full-length, open-access biomedical journal articles with extensive manually created syntactic, formatting and semantic markup, was previously created and released. This initial public release has already been used in multiple projects to drive development of systems focused on a variety of biocuration, search, visualization, and semantic and syntactic NLP tasks. Building on its demonstrated utility, we have expanded the CRAFT Corpus with a large set of manually created semantic annotations relying on Uberon, an ontology representing anatomical entities and life-cycle stages of multicellular organisms across species as well as types of multicellular organisms defined in terms of life-cycle stage and sexual characteristics. This newly created set of annotations, which has been added for v2.1 of the corpus, is by far the largest publicly available collection of gold-standard anatomical markup and is the first large-scale effort at manual markup of biomedical text relying on the entirety of an anatomical terminology, as opposed to annotation with a small number of high-level anatomical categories, as performed in previous corpora. In addition to presenting and discussing this newly available resource, we apply it to provide a performance baseline for the automatic annotation of anatomical concepts in biomedical text using a prominent concept recognition system. The full corpus, released with a CC BY 3.0 license, may be downloaded from http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml. 
Database URL: http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml
Affiliation(s)
- Michael Bada
  - School of Medicine, Department of Pharmacology, University of Colorado Anschutz Medical Campus, 12801 E. 17th Ave., P.O. Box 6511, MS 8303, Aurora, CO 80045-0511, USA
- Nicole Vasilevsky
  - Ontology Development Group, Library, Oregon Health & Science University, 318 SW Sam Jackson, Park Road, Portland, OR 97239, USA
- William A Baumgartner
  - School of Medicine, Department of Pharmacology, University of Colorado Anschutz Medical Campus, 12801 E. 17th Ave., P.O. Box 6511, MS 8303, Aurora, CO 80045-0511, USA
- Melissa Haendel
  - Ontology Development Group, Library, Oregon Health & Science University, 318 SW Sam Jackson, Park Road, Portland, OR 97239, USA
- Lawrence E Hunter
  - School of Medicine, Department of Pharmacology, University of Colorado Anschutz Medical Campus, 12801 E. 17th Ave., P.O. Box 6511, MS 8303, Aurora, CO 80045-0511, USA
5
Funk CS, Cohen KB, Hunter LE, Verspoor KM. Gene Ontology synonym generation rules lead to increased performance in biomedical concept recognition. J Biomed Semantics 2016; 7:52. [PMID: 27613112 PMCID: PMC5018193 DOI: 10.1186/s13326-016-0096-7] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 04/10/2015] [Accepted: 08/05/2016] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Gene Ontology (GO) terms represent the standard for annotation and representation of molecular functions, biological processes and cellular compartments, but a large gap exists between the way concepts are represented in the ontology and how they are expressed in natural language text. The construction of highly specific GO terms is formulaic, consisting of parts and pieces from more simple terms. RESULTS We present two different types of manually generated rules to help capture the variation of how GO terms can appear in natural language text. The first set of rules takes into account the compositional nature of GO and recursively decomposes the terms into their smallest constituent parts. The second set of rules generates derivational variations of these smaller terms and compositionally combines all generated variants to form the original term. By applying both types of rules, new synonyms are generated for two-thirds of all GO terms and an increase in F-measure performance for recognition of GO on the CRAFT corpus from 0.498 to 0.636 is observed. Additionally, we evaluated the combination of both types of rules over one million full text documents from Elsevier; manual validation and error analysis show we are able to recognize GO concepts with reasonable accuracy (88 %) based on random sampling of annotations. CONCLUSIONS In this work we present a set of simple synonym generation rules that utilize the highly compositional and formulaic nature of the Gene Ontology concepts. We illustrate how the generated synonyms aid in improving recognition of GO concepts on two different biomedical corpora. We discuss other applications of our rules for GO ontology quality assurance, explore the issue of overgeneration, and provide examples of how similar methodologies could be applied to other biomedical terminologies. Additionally, we provide all generated synonyms for use by the text-mining community.
Affiliation(s)
- Christopher S. Funk
  - Computational Bioscience, University of Colorado School of Medicine, Aurora, CO 80045 USA
- K. Bretonnel Cohen
  - Computational Bioscience, University of Colorado School of Medicine, Aurora, CO 80045 USA
- Lawrence E. Hunter
  - Computational Bioscience, University of Colorado School of Medicine, Aurora, CO 80045 USA
- Karin M. Verspoor
  - Department of Computing and Information Systems, University of Melbourne, Parkville, Melbourne, 3010 Australia
  - Health and Biomedical Informatics Centre, University of Melbourne, Parkville, Melbourne, 3010 Australia
6
Verspoor KM, Heo GE, Kang KY, Song M. Establishing a baseline for literature mining human genetic variants and their relationships to disease cohorts. BMC Med Inform Decis Mak 2016; 16 Suppl 1:68. [PMID: 27454860 PMCID: PMC4959367 DOI: 10.1186/s12911-016-0294-3] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The Variome corpus, a small collection of published articles about inherited colorectal cancer, includes annotations of 11 entity types and 13 relation types related to the curation of the relationship between genetic variation and disease. Due to the richness of these annotations, the corpus provides a good testbed for evaluation of biomedical literature information extraction systems. METHODS In this paper, we focus on assessing performance on extracting the relations in the corpus, using gold standard entities as a starting point, to establish a baseline for extraction of relations important for extraction of genetic variant information from the literature. We test the application of the Public Knowledge Discovery Engine for Java (PKDE4J) system, a natural language processing system designed for information extraction of entities and relations in text, on the relation extraction task using this corpus. RESULTS For the relations which are attested at least 100 times in the Variome corpus, we realise a performance ranging from 0.78-0.84 Precision-weighted F-score, depending on the relation. We find that the PKDE4J system adapted straightforwardly to the range of relation types represented in the corpus; some extensions to the original methodology were required to adapt to the multi-relational classification context. The results are competitive with state-of-the-art relation extraction performance on more heavily studied corpora, although the analysis shows that the Recall of a co-occurrence baseline outweighs the benefit of improved Precision for many relations, indicating the value of simple semantic constraints on relations. CONCLUSIONS This work represents the first attempt to apply relation extraction methods to the Variome corpus. 
The results demonstrate that automated methods have good potential to structure the information expressed in the published literature related to genetic variants, connecting mutations to genes, diseases, and patient cohorts. Further development of such approaches will facilitate more efficient biocuration of genetic variant information into structured databases, leveraging the knowledge embedded in the vast publication literature.
Affiliation(s)
- Karin M Verspoor
  - Department of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
- Go Eun Heo
  - Department of Library and Information Science, Yonsei University, Seoul, Korea
- Keun Young Kang
  - Department of Library and Information Science, Yonsei University, Seoul, Korea
- Min Song
  - Department of Library and Information Science, Yonsei University, Seoul, Korea
7
Cao R, Cheng J. Integrated protein function prediction by mining function associations, sequences, and protein-protein and gene-gene interaction networks. Methods 2016; 93:84-91. [PMID: 26370280 PMCID: PMC4894840 DOI: 10.1016/j.ymeth.2015.09.011] [Citation(s) in RCA: 66] [Impact Index Per Article: 7.3] [Received: 03/31/2015] [Revised: 09/03/2015] [Accepted: 09/10/2015] [Indexed: 11/30/2022] Open
Abstract
MOTIVATIONS Protein function prediction is an important and challenging problem in bioinformatics and computational biology. Functionally relevant biological information such as protein sequences, gene expression, and protein-protein interactions has been used mostly separately for protein function prediction. One of the major challenges is how to effectively integrate multiple sources of both traditional and new information such as spatial gene-gene interaction networks generated from chromosomal conformation data together to improve protein function prediction. RESULTS In this work, we developed three different probabilistic scores (MIS, SEQ, and NET score) to combine protein sequence, function associations, and protein-protein interaction and spatial gene-gene interaction networks for protein function prediction. The MIS score is mainly generated from homologous proteins found by PSI-BLAST search, and also association rules between Gene Ontology terms, which are learned by mining the Swiss-Prot database. The SEQ score is generated from protein sequences. The NET score is generated from protein-protein interaction and spatial gene-gene interaction networks. These three scores were combined in a new Statistical Multiple Integrative Scoring System (SMISS) to predict protein function. We tested SMISS on the data set of 2011 Critical Assessment of Function Annotation (CAFA). The method performed substantially better than three base-line methods and an advanced method based on protein profile-sequence comparison, profile-profile comparison, and domain co-occurrence networks according to the maximum F-measure.
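The maximum F-measure used for the comparison above can be sketched as follows. This is a minimal illustration of the protein-centric definition commonly used in CAFA-style evaluations, not the SMISS authors' code; the function name, the threshold grid, and the placeholder term identifiers are assumptions. Precision is averaged only over proteins with at least one prediction at the threshold, recall is averaged over all proteins, and the best harmonic mean over thresholds is reported:

```python
def f_max(predictions, truth, thresholds=None):
    """Protein-centric maximum F-measure.

    predictions: {protein: {term: score in [0, 1]}}
    truth:       {protein: non-empty set of true terms}
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for protein, true_terms in truth.items():
            scored = predictions.get(protein, {})
            called = {term for term, score in scored.items() if score >= t}
            hits = len(called & true_terms)
            if called:  # precision only counts proteins with a prediction at t
                precisions.append(hits / len(called))
            recalls.append(hits / len(true_terms))  # recall counts every protein
        if not precisions:
            continue
        pr = sum(precisions) / len(precisions)
        rc = sum(recalls) / len(recalls)
        if pr + rc > 0:
            best = max(best, 2 * pr * rc / (pr + rc))
    return best


# Placeholder terms "GO:A" and "GO:B" are hypothetical, for illustration only.
toy_predictions = {"P1": {"GO:A": 0.9, "GO:B": 0.4}, "P2": {"GO:A": 0.6}}
toy_truth = {"P1": {"GO:A"}, "P2": {"GO:B"}}
print(f_max(toy_predictions, toy_truth))
```

Taking the maximum over thresholds lets methods with differently calibrated scores be compared on a single number, at the cost of hiding how sensitive each method is to the threshold choice.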
Affiliation(s)
- Renzhi Cao
  - Computer Science Department, Informatics Institute, University of Missouri, Columbia, MO 65211, USA
- Jianlin Cheng
  - Computer Science Department, Informatics Institute, University of Missouri, Columbia, MO 65211, USA
8
Taha K, Yoo PD. Predicting the functions of a protein from its ability to associate with other molecules. BMC Bioinformatics 2016; 17:34. [PMID: 26767846 PMCID: PMC4714473 DOI: 10.1186/s12859-016-0882-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 05/12/2015] [Accepted: 01/05/2016] [Indexed: 11/25/2022] Open
Abstract
BACKGROUND All proteins associate with other molecules. These associated molecules are highly predictive of the potential functions of proteins. The association of a protein and a molecule can be determined from their co-occurrences in biomedical abstracts. Extensive semantically related co-occurrences of a protein's name and a molecule's name in the sentences of biomedical abstracts can be considered as indicative of the association between the protein and the molecule. Dependency parsers extract textual relations from a text by determining the grammatical relations between words in a sentence. They can be used for determining the textual relations between proteins and molecules. Despite their success, they may extract textual relations with low precision. This is because they do not consider the semantic relationships between terms in a sentence (i.e., they consider only the structural relationships between the terms). Moreover, they may not be well suited for complex sentences and for long-distance textual relations. RESULTS We introduce an information extraction system called PPFBM that predicts the functions of unannotated proteins from the molecules that associate with these proteins. PPFBM represents each protein by the other molecules that associate with it in the abstracts referenced in the protein's entries in reliable biological databases. It automatically extracts each co-occurrence of a protein-molecule pair that represents a semantic relationship between the pair. Towards this, we present novel semantic rules that identify the semantic relationship between each co-occurrence of a protein-molecule pair using the syntactic structures of sentences and linguistic theories. PPFBM determines the functions of an unannotated protein p as follows. First, it determines the set S_r of annotated proteins that is semantically similar to p by matching the molecules representing p and the annotated proteins.
Then, it assigns p the functional category FC if the significance of the frequency of occurrences of S_r in abstracts associated with proteins annotated with FC is statistically significantly different from the significance of the frequency of occurrences of S_r in abstracts associated with proteins annotated with all other functional categories. We evaluated the quality of PPFBM by comparing it experimentally with two other systems. Results showed marked improvement. CONCLUSIONS The experimental results demonstrated that PPFBM outperforms other systems that predict protein function from the textual information found within biomedical abstracts. This is because these systems do not consider the semantic relationships between terms in a sentence (i.e., they consider only the structural relationships between the terms). PPFBM's performance over these systems increases steadily as the number of training proteins increases. That is, PPFBM's predictions become steadily more accurate as the set of training proteins grows. This is because each new set of test proteins is added to the current set of training proteins. A demo of PPFBM that annotates each input yeast protein from the Saccharomyces Genome Database (SGD; available at http://www.yeastgenome.org/download-data/curation) with the functions of Gene Ontology terms is available at http://ecesrvr.kustar.ac.ae:8080/PPFBM/ (see the Appendix for more details about the demo).
Affiliation(s)
- Kamal Taha
  - Department of Electrical and Computer Engineering, Khalifa University, Abu Dhabi, United Arab Emirates
- Paul D Yoo
  - Faculty of Science and Technology, Bournemouth University, Bournemouth, UK
9
Shehu A, Barbará D, Molloy K. A Survey of Computational Methods for Protein Function Prediction. Big Data Analytics in Genomics 2016:225-298. [DOI: 10.1007/978-3-319-41279-5_7] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Indexed: 01/03/2025]
10
GoFDR: A sequence alignment based method for predicting protein functions. Methods 2016; 93:3-14. [DOI: 10.1016/j.ymeth.2015.08.009] [Citation(s) in RCA: 42] [Impact Index Per Article: 4.7] [Received: 06/01/2015] [Revised: 07/27/2015] [Accepted: 08/11/2015] [Indexed: 01/01/2023] Open
11
Soldatova LN, Collier N, Oellrich A, Groza T, Verspoor K, Rocca-Serra P, Dumontier M, Shah NH. Special issue on bio-ontologies and phenotypes. J Biomed Semantics 2015; 6:40. [PMID: 26682035 PMCID: PMC4682270 DOI: 10.1186/s13326-015-0040-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 04/19/2015] [Accepted: 11/15/2015] [Indexed: 11/10/2022] Open
Abstract
The bio-ontologies and phenotypes special issue includes eight papers selected from the 11 papers presented at the Bio-Ontologies SIG (Special Interest Group) and the Phenotype Day at ISMB (Intelligent Systems for Molecular Biology) conference in Boston in 2014. The selected papers span a wide range of topics including the automated re-use and update of ontologies, quality assessment of ontological resources, and the systematic description of phenotype variation, driven by manual, semi- and fully automatic means.
Affiliation(s)
- Tudor Groza
  - The Garvan Institute of Medical Research, Sydney, Australia
12
Kahanda I, Funk CS, Ullah F, Verspoor KM, Ben-Hur A. A close look at protein function prediction evaluation protocols. Gigascience 2015; 4:41. [PMID: 26380075 PMCID: PMC4570743 DOI: 10.1186/s13742-015-0082-5] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Received: 01/01/2015] [Accepted: 08/24/2015] [Indexed: 01/04/2023] Open
Abstract
Background The recently held Critical Assessment of Function Annotation challenge (CAFA2) required its participants to submit predictions for a large number of target proteins regardless of whether they have previous annotations or not. This is in contrast to the original CAFA challenge in which participants were asked to submit predictions for proteins with no existing annotations. The CAFA2 task is more realistic, in that it more closely mimics the accumulation of annotations over time. In this study we compare these tasks in terms of their difficulty, and determine whether cross-validation provides a good estimate of performance. Results The CAFA2 task is a combination of two subtasks: making predictions on annotated proteins and making predictions on previously unannotated proteins. In this study we analyze the performance of several function prediction methods in these two scenarios. Our results show that several methods (structured support vector machine, binary support vector machines and guilt-by-association methods) do not usually achieve the same level of accuracy on these two tasks as that achieved by cross-validation, and that predicting novel annotations for previously annotated proteins is a harder problem than predicting annotations for uncharacterized proteins. We also find that different methods have different performance characteristics in these tasks, and that cross-validation is not adequate at estimating performance and ranking methods. Conclusions These results have implications for the design of computational experiments in the area of automated function prediction and can provide useful insight for the understanding and design of future CAFA competitions. Electronic supplementary material The online version of this article (doi:10.1186/s13742-015-0082-5) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Indika Kahanda
  - Department of Computer Science, Colorado State University, Fort Collins, 80523 CO USA
- Christopher S Funk
  - Computational Bioscience Program, University of Colorado School of Medicine, Aurora, 80045 CO USA
- Fahad Ullah
  - Department of Computer Science, Colorado State University, Fort Collins, 80523 CO USA
- Karin M Verspoor
  - Department of Computing and Information Systems, University of Melbourne, 3010 Parkville, Victoria, Australia
- Asa Ben-Hur
  - Department of Computer Science, Colorado State University, Fort Collins, 80523 CO USA
13
Zheng W, Blake C. Using distant supervised learning to identify protein subcellular localizations from full-text scientific articles. J Biomed Inform 2015. [PMID: 26220461 DOI: 10.1016/j.jbi.2015.07.013] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Indexed: 01/01/2023]
Abstract
Databases of curated biomedical knowledge, such as the protein-locations reflected in the UniProtKB database, provide an accurate and useful resource to researchers and decision makers. Our goal is to augment the manual efforts currently used to curate knowledge bases with automated approaches that leverage the increased availability of full-text scientific articles. This paper describes experiments that use distant supervised learning to identify protein subcellular localizations, which are important to understand protein function and to identify candidate drug targets. Experiments consider Swiss-Prot, the manually annotated subset of the UniProtKB protein knowledge base, and 43,000 full-text articles from the Journal of Biological Chemistry that contain just under 11.5 million sentences. The system achieves 0.81 precision and 0.49 recall at sentence level and an accuracy of 57% on held-out instances in a test set. Moreover, the approach identifies 8210 instances that are not in the UniProtKB knowledge base. Manual inspection of the 50 most likely relations showed that 41 (82%) were valid. These results have immediate benefit to researchers interested in protein function, and suggest that distant supervision should be explored to complement other manual data curation efforts.
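The abstract reports precision and recall separately rather than a combined score. As a rough reading aid, the sketch below derives the balanced F-measure implied by those two figures and re-checks the manual-inspection percentage; the function name is hypothetical and the F1 value is computed here, not reported by the authors:

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta=1 gives F1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Sentence-level figures quoted in the abstract.
sentence_f1 = f_measure(0.81, 0.49)

# 41 of the 50 top-ranked novel relations were judged valid on manual inspection.
valid_fraction = 41 / 50

print(round(sentence_f1, 3), valid_fraction)
```

The derived F1 of roughly 0.61 sits well below the precision because the harmonic mean is pulled toward the lower of the two values, here the recall of 0.49.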
Affiliation(s)
- Wu Zheng
  - Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign, USA
- Catherine Blake
  - Graduate School of Library and Information Science and Medical Information Science, Center for Informatics Research in Science and Scholarship (CIRSS), University of Illinois at Urbana-Champaign, USA
14
Kahanda I, Funk C, Verspoor K, Ben-Hur A. PHENOstruct: Prediction of human phenotype ontology terms using heterogeneous data sources. F1000Res 2015; 4:259. [PMID: 26834980 PMCID: PMC4722686 DOI: 10.12688/f1000research.6670.1] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.9] [Accepted: 07/06/2015] [Indexed: 01/21/2023] Open
Abstract
The human phenotype ontology (HPO) was recently developed as a standardized vocabulary for describing the phenotype abnormalities associated with human diseases. At present, only a small fraction of human protein coding genes have HPO annotations. However, researchers believe that a large portion of currently unannotated genes are related to disease phenotypes. Therefore, it is important to predict gene-HPO term associations using accurate computational methods. In this work we demonstrate the performance advantage of the structured SVM approach, which was shown to be highly effective for Gene Ontology term prediction, in comparison to several baseline methods. Furthermore, we highlight a collection of informative data sources suitable for the problem of predicting gene-HPO associations, including large scale literature mining data.
Affiliation(s)
- Indika Kahanda
  - Department of Computer Science, Colorado State University, Fort Collins, CO, 80523, USA
- Christopher Funk
  - Computational Bioscience Program, University of Colorado School of Medicine, Aurora, CO, 80045, USA
- Karin Verspoor
  - Department of Computing and Information Systems, University of Melbourne, Parkville, Victoria, 3010, Australia; Health and Biomedical Informatics Centre, University of Melbourne, Parkville, Victoria, 3010, Australia
- Asa Ben-Hur
  - Department of Computer Science, Colorado State University, Fort Collins, CO, 80523, USA