Reference Citation Analysis

For: Peng Y, Arighi C, Wu CH, Vijay-Shanker K. BioC-compatible full-text passage detection for protein-protein interactions using extended dependency graph. Database (Oxford) 2016;2016:baw072. [PMID: 27170286 PMCID: PMC4915133 DOI: 10.1093/database/baw072] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 12/04/2015] [Accepted: 04/12/2016] [Indexed: 12/04/2022]

Cited by Other Article(s)

Badal VD, Kundrotas PJ, Vakser IA. Text mining for modeling of protein complexes enhanced by machine learning. Bioinformatics 2021;37:497-505. [PMID: 32960948 PMCID: PMC8088328 DOI: 10.1093/bioinformatics/btaa823] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 08/26/2019] [Revised: 09/04/2020] [Accepted: 09/08/2020] [Indexed: 11/14/2022]

Abstract

MOTIVATION

Procedures for structural modeling of protein-protein complexes (protein docking) produce a number of models which need to be further analyzed and scored. Scoring can be based on independently determined constraints on the structure of the complex, such as knowledge of amino acids essential for the protein interaction. Previously, we showed that text mining of residues in freely available PubMed abstracts of papers on studies of protein-protein interactions may generate such constraints. However, absence of post-processing of the spotted residues reduced usability of the constraints, as a significant number of the residues were not relevant for the binding of the specific proteins.

RESULTS

We explored filtering of the irrelevant residues by two machine learning approaches, Deep Recursive Neural Network (DRNN) and Support Vector Machine (SVM) models with different training/testing schemes. The results showed that the DRNN model is superior to the SVM model when training is performed on the PMC-OA full-text articles and applied to classification (interface or non-interface) of the residues spotted in the PubMed abstracts. The reason is that SVM success is often determined by the similarity in data/text patterns in the training and the testing sets, whereas the sentence structures in the abstracts are, in general, different from those in the full-text articles. When both training and testing are performed on full-text articles or on abstracts, the performance of these models is similar. Thus, in such cases, there is no need to use the DRNN approach, which is computationally expensive, especially at the training stage.

AVAILABILITY AND IMPLEMENTATION

The code and the datasets generated in this study are available at https://gitlab.ku.edu/vakser-lab-public/text-mining/-/tree/2020-09-04.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Zhang Y, Chen Q, Yang Z, Lin H, Lu Z. BioWordVec, improving biomedical word embeddings with subword information and MeSH. Sci Data 2019;6:52. [PMID: 31076572 PMCID: PMC6510737 DOI: 10.1038/s41597-019-0055-0] [Citation(s) in RCA: 171] [Impact Index Per Article: 28.5] [Received: 09/24/2018] [Accepted: 03/27/2019] [Indexed: 11/10/2022]

Ding R, Qu Y, Wu CH, Vijay-Shanker K. Automatic gene annotation using GO terms from cellular component domain. BMC Med Inform Decis Mak 2018;18:119. [PMID: 30526566 PMCID: PMC6284271 DOI: 10.1186/s12911-018-0694-7] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Indexed: 01/06/2023]

Abstract

Background

The Gene Ontology (GO) is a resource that supplies information about gene product function using ontologies to represent biological knowledge. These ontologies cover three domains: Cellular Component (CC), Molecular Function (MF), and Biological Process (BP). GO annotation is a process which assigns gene functional information using GO terms to relevant genes in the literature. It is a common task among the Model Organism Database (MOD) groups. Manual GO annotation relies on human curators assigning gene functional information using GO terms by reading the biomedical literature. This process is very time-consuming and labor-intensive. As a result, many MODs can afford to curate only a fraction of relevant articles.

Methods

GO terms from the CC domain can be essentially divided into two sub-hierarchies: subcellular location terms, and protein complex terms. We cast the task of gene annotation using GO terms from the CC domain as relation extraction between gene and other entities: (1) extract cases where a protein is found to be in a subcellular location, and (2) extract cases where a protein is a subunit of a protein complex. For each relation extraction task, we use an approach based on triggers and syntactic dependencies to extract the desired relations among entities.

Results

We tested our approach on the BC4GO test set, a publicly available corpus for GO annotation. Our approach obtains an F1-score of 71%, a precision of 91% and a recall of 58% for predicting GO terms from the CC domain for given genes.
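
The three reported figures can be cross-checked against each other. The following is a minimal sketch, assuming the abstract uses the standard F1 definition (the harmonic mean of precision and recall); the function name `f1_score` is illustrative and does not come from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Standard F1: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2 * precision * recall / (precision + recall)


# Values as reported in the abstract (91% precision, 58% recall).
reported_precision = 0.91
reported_recall = 0.58

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 = {f1:.3f}")  # prints "F1 = 0.708"
```

Under that assumption, the result rounds to the 71% quoted in the abstract.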

Conclusions

We have described a novel approach of treating gene annotation with GO terms from the CC domain as two relation extraction subtasks. Evaluation results show that our approach achieves an F1-score of 71% for predicting GO terms for given genes. Our approach can therefore be used to accelerate the process of GO annotation for the bio-annotators.

Islamaj Dogan R, Kim S, Chatr-Aryamontri A, Chang CS, Oughtred R, Rust J, Wilbur WJ, Comeau DC, Dolinski K, Tyers M. The BioC-BioGRID corpus: full text articles annotated for curation of protein-protein and genetic interactions. Database (Oxford) 2017;2017:baw147. [PMID: 28077563 PMCID: PMC5225395 DOI: 10.1093/database/baw147] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.6] [Received: 07/30/2016] [Revised: 10/14/2016] [Accepted: 10/18/2016] [Indexed: 11/13/2022]

Abstract

A great deal of information on the molecular genetics and biochemistry of model organisms has been reported in the scientific literature. However, this data is typically described in free text form and is not readily amenable to computational analyses. To this end, the BioGRID database systematically curates the biomedical literature for genetic and protein interaction data. This data is provided in a standardized computationally tractable format and includes structured annotation of experimental evidence. BioGRID curation necessarily involves substantial human effort by expert curators who must read each publication to extract the relevant information. Computational text-mining methods offer the potential to augment and accelerate manual curation. To facilitate the development of practical text-mining strategies, a new challenge was organized in BioCreative V for the BioC task, the collaborative Biocurator Assistant Task. This was a non-competitive, cooperative task in which the participants worked together to build BioC-compatible modules into an integrated pipeline to assist BioGRID curators. As an integral part of this task, a test collection of full text articles was developed that contained both biological entity annotations (gene/protein and organism/species) and molecular interaction annotations (protein–protein and genetic interactions (PPIs and GIs)). This collection, which we call the BioC-BioGRID corpus, was annotated by four BioGRID curators over three rounds of annotation and contains 120 full text articles curated in a dataset representing two major model organisms, namely budding yeast and human. The BioC-BioGRID corpus contains annotations for 6409 mentions of genes and their Entrez Gene IDs, 186 mentions of organism names and their NCBI Taxonomy IDs, 1867 mentions of PPIs and 701 annotations of PPI experimental evidence statements, 856 mentions of GIs and 399 annotations of GI evidence statements. The purpose, characteristics and possible future uses of the BioC-BioGRID corpus are detailed in this report.

Database URL: http://bioc.sourceforge.net/BioC-BioGRID.html

Kim S, Islamaj Doğan R, Chatr-Aryamontri A, Chang CS, Oughtred R, Rust J, Batista-Navarro R, Carter J, Ananiadou S, Matos S, Santos A, Campos D, Oliveira JL, Singh O, Jonnagaddala J, Dai HJ, Su ECY, Chang YC, Su YC, Chu CH, Chen CC, Hsu WL, Peng Y, Arighi C, Wu CH, Vijay-Shanker K, Aydın F, Hüsünbeyi ZM, Özgür A, Shin SY, Kwon D, Dolinski K, Tyers M, Wilbur WJ, Comeau DC. BioCreative V BioC track overview: collaborative biocurator assistant task for BioGRID. Database (Oxford) 2016;2016:baw121. [PMID: 27589962 PMCID: PMC5009341 DOI: 10.1093/database/baw121] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Received: 05/06/2016] [Revised: 07/29/2016] [Accepted: 08/02/2016] [Indexed: 11/14/2022]

Affiliation(s)

Sun Kim National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
Rezarta Islamaj Doğan National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
Andrew Chatr-Aryamontri Institute for Research in Immunology and Cancer, Université de Montréal, Montréal, QC H3C 3J7, Canada
Christie S Chang Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
Rose Oughtred Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
Jennifer Rust Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
Riza Batista-Navarro National Centre for Text Mining, School of Computer Science, University of Manchester, Manchester, UK
Jacob Carter National Centre for Text Mining, School of Computer Science, University of Manchester, Manchester, UK
Sophia Ananiadou National Centre for Text Mining, School of Computer Science, University of Manchester, Manchester, UK
Sérgio Matos DETI/IEETA, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
André Santos DETI/IEETA, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
David Campos BMD Software, Lda, Rua Calouste Gulbenkian 1, 3810-074 Aveiro, Portugal
José Luís Oliveira DETI/IEETA, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
Onkar Singh Graduate Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei, Taiwan
Jitendra Jonnagaddala School of Public Health and Community Medicine, University of New South Wales, Kensington NSW 2033, Australia; Prince of Wales Clinical School, University of New South Wales, Kensington NSW 2033, Australia
Hong-Jie Dai Department of Computer Science and Information Engineering, National Taitung University, Taitung, Taiwan
Emily Chia-Yu Su Graduate Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei, Taiwan
Yung-Chun Chang Institute of Information Science, Academia Sinica, Taipei, Taiwan; Department of Information Management, National Taiwan University, Taipei, Taiwan
Yu-Chen Su Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan
Chun-Han Chu Institute of Information Science, Academia Sinica, Taipei, Taiwan
Chien Chin Chen Department of Information Management, National Taiwan University, Taipei, Taiwan
Wen-Lian Hsu Institute of Information Science, Academia Sinica, Taipei, Taiwan
Yifan Peng Computer & Information Sciences, University of Delaware, Newark, DE 19716, USA
Cecilia Arighi Computer & Information Sciences, University of Delaware, Newark, DE 19716, USA; Center for Bioinformatics & Computational Biology, University of Delaware, Newark, DE 19716, USA
Cathy H Wu Computer & Information Sciences, University of Delaware, Newark, DE 19716, USA; Center for Bioinformatics & Computational Biology, University of Delaware, Newark, DE 19716, USA
K Vijay-Shanker Computer & Information Sciences, University of Delaware, Newark, DE 19716, USA
Ferhat Aydın Department of Computer Engineering, Boğaziçi University, Bebek, 34342 Istanbul, Turkey
Zehra Melce Hüsünbeyi Department of Computer Engineering, Boğaziçi University, Bebek, 34342 Istanbul, Turkey
Arzucan Özgür Department of Computer Engineering, Boğaziçi University, Bebek, 34342 Istanbul, Turkey
Soo-Yong Shin Department of Biomedical Informatics, Asan Medical Center, 138-736 Seoul, South Korea
Dongseop Kwon Department of Computer Engineering, Myongji University, 449-728 Yongin, South Korea
Kara Dolinski Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
Mike Tyers Institute for Research in Immunology and Cancer, Université de Montréal, Montréal, QC H3C 3J7, Canada; The Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital, Toronto, Ontario, Canada
W John Wilbur National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
Donald C Comeau National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
