1
Vazquez M, Krallinger M, Leitner F, Kuiper M, Valencia A, Laegreid A. ExTRI: Extraction of transcription regulation interactions from literature. Biochimica et Biophysica Acta. Gene Regulatory Mechanisms 2022; 1865:194778. [PMID: 34875418 DOI: 10.1016/j.bbagrm.2021.194778] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 05/07/2021] [Revised: 11/22/2021] [Accepted: 11/29/2021] [Indexed: 11/18/2022]
Abstract
The regulation of gene transcription by transcription factors is a fundamental biological process, yet the relations between transcription factors (TF) and their target genes (TG) are still only sparsely covered in databases. Text-mining tools can offer broad and complementary solutions to help locate and extract mentions of these biological relationships in articles. We have generated ExTRI, a knowledge graph of TF-TG relationships, by applying a high recall text-mining pipeline to MedLine abstracts identifying over 100,000 candidate sentences with TF-TG relations. Validation procedures indicated that about half of the candidate sentences contain true TF-TG relationships. Post-processing identified 53,000 high confidence sentences containing TF-TG relationships, with a cross-validation F1-score close to 75%. The resulting collection of TF-TG relationships covers 80% of the relations annotated in existing databases. It adds 11,000 other potential interactions, including relationships for ~100 TFs currently not in public TF-TG relation databases. The high confidence abstract sentences contribute 25,000 literature references not available from other resources and offer a wealth of direct pointers to functional aspects of the TF-TG interactions. Our compiled resource encompassing ExTRI together with publicly available resources delivers literature-derived TF-TG interactions for more than 900 of the 1500-1600 proteins considered to function as specific DNA binding TFs. The obtained result can be used by curators, for network analysis and modelling, for causal reasoning or knowledge graph mining approaches, or serve to benchmark text mining strategies.
Affiliation(s)
- Martin Kuiper
- Department of Biology, Norwegian University of Science and Technology (NTNU), Trondheim, Norway
- Alfonso Valencia
- Barcelona Supercomputing Center, Barcelona, Spain; ICREA, Barcelona, Spain
- Astrid Laegreid
- Department of Clinical and Molecular Medicine, Norwegian University of Science and Technology (NTNU), Trondheim 7491, Norway

2
Karimi K, Agalakov S, Telmer CA, Beatman TR, Pells TJ, Arshinoff BI, Ku CJ, Foley S, Hinman VF, Ettensohn CA, Vize PD. Classifying domain-specific text documents containing ambiguous keywords. Database: The Journal of Biological Databases and Curation 2021; 2021:6377760. [PMID: 34585729 PMCID: PMC8588847 DOI: 10.1093/database/baab062] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 05/20/2021] [Revised: 08/23/2021] [Accepted: 09/16/2021] [Indexed: 11/14/2022]
Abstract
A keyword-based search of comprehensive databases such as PubMed may return
irrelevant papers, especially if the keywords are used in multiple fields of
study. In such cases, domain experts (curators) need to verify the results and
remove the irrelevant articles. Automating this filtering process will save
time, but it has to be done well enough to ensure few relevant papers are
rejected and few irrelevant papers are accepted. A good solution would be fast,
work with the limited amount of data freely available (full paper body may be
missing), handle ambiguous keywords and be as domain-neutral as possible. In
this paper, we evaluate a number of classification algorithms for identifying a
domain-specific set of papers about echinoderm species and show that the
resulting tool satisfies most of the abovementioned requirements. Echinoderms
consist of a number of very different organisms, including brittle stars, sea
stars (starfish), sea urchins and sea cucumbers. While their taxonomic
identifiers are specific, the common names are used in many other contexts,
creating ambiguity and making a keyword search prone to error. We try
classifiers using Linear, Naïve Bayes, Nearest Neighbor, Tree, SVM,
Bagging, AdaBoost and Neural Network learning models and compare their
performance. We show how effective the resulting classifiers are in filtering
irrelevant articles returned from PubMed. The methodology used is more dependent
on the good selection of training data and is a practical solution that can be
applied to other fields of study facing similar challenges. Database URL: The code and data reported in this paper are freely available at
http://xenbaseturbofrog.org/pub/Text-Topic-Classifier/
Affiliation(s)
- Kamran Karimi
- Department of Biological Sciences, University of Calgary, Calgary, AB T2N 1N4, Canada
- Sergei Agalakov
- Department of Biological Sciences, University of Calgary, Calgary, AB T2N 1N4, Canada
- Cheryl A Telmer
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Thomas R Beatman
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Troy J Pells
- Department of Biological Sciences, University of Calgary, Calgary, AB T2N 1N4, Canada
- Bradley Im Arshinoff
- Department of Biological Sciences, University of Calgary, Calgary, AB T2N 1N4, Canada
- Carolyn J Ku
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Saoirse Foley
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Veronica F Hinman
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Charles A Ettensohn
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Peter D Vize
- Department of Biological Sciences, University of Calgary, Calgary, AB T2N 1N4, Canada

3
Arango-Argoty GA, Guron GKP, Garner E, Riquelme MV, Heath LS, Pruden A, Vikesland PJ, Zhang L. ARGminer: a web platform for the crowdsourcing-based curation of antibiotic resistance genes. Bioinformatics 2020; 36:2966-2973. [PMID: 32058567 DOI: 10.1093/bioinformatics/btaa095] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Received: 03/21/2019] [Revised: 01/31/2020] [Accepted: 02/08/2020] [Indexed: 12/20/2022]
Affiliation(s)
- G K P Guron
- Department of Civil and Environmental Engineering; Department of Food Science and Technology, Virginia Tech, Blacksburg, VA 24061-0217, USA
- E Garner
- Department of Civil and Environmental Engineering
- M V Riquelme
- Department of Civil and Environmental Engineering
- A Pruden
- Department of Civil and Environmental Engineering
- L Zhang
- Department of Computer Science

4
Wang CCN, Jin J, Chang JG, Hayakawa M, Kitazawa A, Tsai JJP, Sheu PCY. Identification of most influential co-occurring gene suites for gastrointestinal cancer using biomedical literature mining and graph-based influence maximization. BMC Med Inform Decis Mak 2020; 20:208. [PMID: 32883271 PMCID: PMC7469322 DOI: 10.1186/s12911-020-01227-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 02/19/2020] [Accepted: 08/20/2020] [Indexed: 12/02/2022]
Abstract
Background: Gastrointestinal (GI) cancers, including colorectal cancer, gastric cancer and pancreatic cancer, are among the most frequent malignancies diagnosed annually and represent a major public health problem worldwide. Methods: This paper reports an aided curation pipeline to identify potentially influential genes for gastrointestinal cancer. The curation pipeline mines biomedical literature to identify named entities using Bi-LSTM-CNN-CRF methods. The entities and their associations are used to construct a graph, from which we compute the sets of co-occurring genes that are the most influential according to an influence maximization algorithm. Results: The most influential sets of co-occurring genes that we discover include RARA - CRBP1, CASP3 - BCL2, BCL2 - CASP3 - CRBP1, RARA - CASP3 - CRBP1, FOXJ1 - RASSF3 - ESR1, FOXJ1 - RASSF1A - ESR1 and FOXJ1 - RASSF1A - TNFAIP8 - ESR1. With TCGA and functional and pathway enrichment analysis, we prove the proposed approach works well in the context of gastrointestinal cancer. Conclusions: Our pipeline, which uses text mining to identify objects and relationships for graph construction and graph-based influence maximization to discover the most influential co-occurring genes, presents a viable direction for assisting knowledge discovery in clinical applications.
Affiliation(s)
- Charles C N Wang
- Department of Bioinformatics and Medical Engineering, Asia University, Taichung, Taiwan; Center for Artificial Intelligence in Precision Medicine, Asia University, Taichung, Taiwan
- Jennifer Jin
- Department of EECS and BME, University of California, Irvine, USA
- Jan-Gowth Chang
- Department of Laboratory Medicine, China Medical University Hospital, Taichung, Taiwan; Center for Precision Medicine, China Medical University Hospital, Taichung, Taiwan; Graduate Institute of Clinical Medical Science, School of Medicine, College of Medicine, China Medical University, Taichung, Taiwan
- Jeffrey J P Tsai
- Department of Bioinformatics and Medical Engineering, Asia University, Taichung, Taiwan
- Phillip C-Y Sheu
- Department of EECS and BME, University of California, Irvine, USA

5
Salimi N, Edwards L, Foos G, Greenbaum JA, Martini S, Reardon B, Shackelford D, Vita R, Zalman L, Peters B, Sette A. A behind-the-scenes tour of the IEDB curation process: an optimized process empirically integrating automation and human curation efforts. Immunology 2020; 161:139-147. [PMID: 32615639 PMCID: PMC7496777 DOI: 10.1111/imm.13234] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 04/13/2020] [Revised: 06/11/2020] [Accepted: 06/22/2020] [Indexed: 12/13/2022]
Abstract
The Immune Epitope Database and Analysis Resource (IEDB) provides the scientific community with open access to epitope data, as well as epitope prediction and analysis tools. The IEDB houses the most extensive collection of experimentally validated B‐cell and T‐cell epitope data, sourced primarily from published literature by expert curation. The data procurement requires systematic identification, categorization, curation and quality‐checking processes. Here, we provide insights into these processes, with particular focus on the dividends they have paid in terms of attaining project milestones, as well as how objective analyses of our processes have identified opportunities for process optimization. These experiences are shared as a case study of the benefits of process implementation and review in biomedical big data, as well as to encourage idea‐sharing among players in this ever‐growing space.
Affiliation(s)
- Nima Salimi
- Division of Vaccine Discovery, La Jolla Institute for Immunology, La Jolla, CA, USA
- Lindy Edwards
- Division of Vaccine Discovery, La Jolla Institute for Immunology, La Jolla, CA, USA
- Gabriele Foos
- Division of Vaccine Discovery, La Jolla Institute for Immunology, La Jolla, CA, USA
- Jason A Greenbaum
- Division of Vaccine Discovery, La Jolla Institute for Immunology, La Jolla, CA, USA
- Sheridan Martini
- Division of Vaccine Discovery, La Jolla Institute for Immunology, La Jolla, CA, USA
- Brian Reardon
- Division of Vaccine Discovery, La Jolla Institute for Immunology, La Jolla, CA, USA
- Deborah Shackelford
- Division of Vaccine Discovery, La Jolla Institute for Immunology, La Jolla, CA, USA
- Randi Vita
- Division of Vaccine Discovery, La Jolla Institute for Immunology, La Jolla, CA, USA
- Leora Zalman
- Division of Vaccine Discovery, La Jolla Institute for Immunology, La Jolla, CA, USA
- Bjoern Peters
- Division of Vaccine Discovery, La Jolla Institute for Immunology, La Jolla, CA, USA; Department of Medicine, University of California, San Diego, San Diego, CA, USA
- Alessandro Sette
- Division of Vaccine Discovery, La Jolla Institute for Immunology, La Jolla, CA, USA; Department of Medicine, University of California, San Diego, San Diego, CA, USA

6
Allot A, Chen Q, Kim S, Vera Alvarez R, Comeau DC, Wilbur WJ, Lu Z. LitSense: making sense of biomedical literature at sentence level. Nucleic Acids Res 2020; 47:W594-W599. [PMID: 31020319 DOI: 10.1093/nar/gkz289] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Received: 02/08/2019] [Revised: 04/05/2019] [Accepted: 04/10/2019] [Indexed: 11/15/2022]
Abstract
Literature search is a routine practice for scientific studies as new discoveries build on knowledge from the past. Current tools (e.g. PubMed, PubMed Central), however, generally require significant effort in query formulation and optimization (especially in searching the full-length articles) and do not allow direct retrieval of specific statements, which is key for tasks such as comparing/validating new findings with previous knowledge and performing evidence attribution in biocuration. Thus, we introduce LitSense, which is the first web-based system that specializes in sentence retrieval for biomedical literature. LitSense provides unified access to PubMed and PMC content with over a half-billion sentences in total. Given a query, LitSense returns best-matching sentences using both a traditional term-weighting approach that up-weights sentences that contain more of the rare terms in the user query as well as a novel neural embedding approach that enables the retrieval of semantically relevant results without explicit keyword match. LitSense provides a user-friendly interface that assists its users to quickly browse the returned sentences in context and/or further filter search results by section or publication date. LitSense also employs PubTator to highlight biomedical entities (e.g. gene/proteins) in the sentences for better result visualization. LitSense is freely available at https://www.ncbi.nlm.nih.gov/research/litsense.
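The term-weighting idea mentioned in this abstract (up-weighting sentences that contain more of the query's rare terms) can be illustrated with a minimal sketch. This is not LitSense's implementation: the smoothed inverse-document-frequency formula, the simple tokenizer, and the names `tokenize`, `idf_weights`, `score_sentence` and `rank` are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase runs of letters and digits only.
    return re.findall(r"[a-z0-9]+", text.lower())


def idf_weights(sentences):
    # Smoothed inverse document frequency computed over the sentence collection;
    # terms that occur in fewer sentences receive larger weights.
    n = len(sentences)
    df = Counter()
    for s in sentences:
        df.update(set(tokenize(s)))
    return {t: math.log((n + 1) / (c + 1)) + 1.0 for t, c in df.items()}


def score_sentence(query, sentence, idf):
    # Sum the weights of the distinct query terms found in the sentence,
    # so a match on a rare term counts more than a match on a common one.
    terms = set(tokenize(sentence))
    return sum(idf.get(t, 0.0) for t in set(tokenize(query)) if t in terms)


def rank(query, sentences):
    # Return the sentences ordered from best to worst match.
    idf = idf_weights(sentences)
    return sorted(sentences, key=lambda s: score_sentence(query, s, idf), reverse=True)
```

Calling `rank("the BRCA1", sentences)` on a small collection places sentences mentioning the rare term `BRCA1` ahead of those that only share the common word `the`. The neural embedding approach described in the abstract is a separate component and is not covered by this sketch.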
Affiliation(s)
- Alexis Allot
- National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Qingyu Chen
- National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Sun Kim
- National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Roberto Vera Alvarez
- National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Donald C Comeau
- National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- W John Wilbur
- National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA

7
Wei CH, Allot A, Leaman R, Lu Z. PubTator central: automated concept annotation for biomedical full text articles. Nucleic Acids Res 2020; 47:W587-W593. [PMID: 31114887 DOI: 10.1093/nar/gkz389] [Citation(s) in RCA: 228] [Impact Index Per Article: 45.6] [Received: 02/17/2019] [Revised: 04/08/2019] [Accepted: 04/30/2019] [Indexed: 11/12/2022]
Abstract
PubTator Central (https://www.ncbi.nlm.nih.gov/research/pubtator/) is a web service for viewing and retrieving bioconcept annotations in full text biomedical articles. PubTator Central (PTC) provides automated annotations from state-of-the-art text mining systems for genes/proteins, genetic variants, diseases, chemicals, species and cell lines, all available for immediate download. PTC annotates PubMed (29 million abstracts) and the PMC Text Mining subset (3 million full text articles). The new PTC web interface allows users to build full text document collections and visualize concept annotations in each document. Annotations are downloadable in multiple formats (XML, JSON and tab delimited) via the online interface, a RESTful web service and bulk FTP. Improved concept identification systems and a new disambiguation module based on deep learning increase annotation accuracy, and the new server-side architecture is significantly faster. PTC is synchronized with PubMed and PubMed Central, with new articles added daily. The original PubTator service has served annotated abstracts for ∼300 million requests, enabling third-party research in use cases such as biocuration support, gene prioritization, genetic disease analysis, and literature-based knowledge discovery. We demonstrate the full text results in PTC significantly increase biomedical concept coverage and anticipate this expansion will both enhance existing downstream applications and enable new use cases.
Affiliation(s)
- Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
- Alexis Allot
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
- Robert Leaman
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
- Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA

8
Chen Q, Lee K, Yan S, Kim S, Wei CH, Lu Z. BioConceptVec: Creating and evaluating literature-based biomedical concept embeddings on a large scale. PLoS Comput Biol 2020; 16:e1007617. [PMID: 32324731 PMCID: PMC7237030 DOI: 10.1371/journal.pcbi.1007617] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Received: 06/06/2019] [Revised: 05/19/2020] [Accepted: 12/19/2019] [Indexed: 12/14/2022]
Abstract
A massive number of biological entities, such as genes and mutations, are mentioned in the biomedical literature. The capturing of the semantic relatedness of biological entities is vital to many biological applications, such as protein-protein interaction prediction and literature-based discovery. Concept embeddings, which involve the learning of vector representations of concepts using machine learning models, have been employed to capture the semantics of concepts. To develop concept embeddings, named-entity recognition (NER) tools are first used to identify and normalize concepts from the literature, and then different machine learning models are used to train the embeddings. Despite multiple attempts, existing biomedical concept embeddings generally suffer from suboptimal NER tools, small-scale evaluation, and limited availability. In response, we employed high-performance machine learning-based NER tools for concept recognition and trained our concept embeddings, BioConceptVec, via four different machine learning models on ~30 million PubMed abstracts. BioConceptVec covers over 400,000 biomedical concepts mentioned in the literature and is one of the largest among the publicly available biomedical concept embeddings to date. To evaluate the validity and utility of BioConceptVec, we performed two intrinsic evaluations (identifying related concepts based on drug-gene and gene-gene interactions) and two extrinsic evaluations (protein-protein interaction prediction and drug-drug interaction extraction), collectively using over 25 million instances from nine independent datasets (17 million instances from six intrinsic evaluation tasks and 8 million instances from three extrinsic evaluation tasks), which is, to the best of our knowledge, by far the most comprehensive evaluation to date. The intrinsic evaluation results demonstrate that BioConceptVec consistently has, by a large margin, better performance than existing concept embeddings in identifying similar and related concepts.
More importantly, the extrinsic evaluation results demonstrate that using BioConceptVec with advanced deep learning models can significantly improve performance in downstream bioinformatics studies and biomedical text-mining applications. Our BioConceptVec embeddings and benchmarking datasets are publicly available at https://github.com/ncbi-nlp/BioConceptVec. Capturing the semantics of related biological concepts, such as genes and mutations, is of significant importance to many research tasks in computational biology such as protein-protein interaction detection, gene-drug association prediction, and biomedical literature-based discovery. Here, we propose to leverage state-of-the-art text mining tools and machine learning models to learn the semantics via vector representations (aka. embeddings) of over 400,000 biological concepts mentioned in the entire PubMed abstracts. Our learned embeddings, namely BioConceptVec, can capture related concepts based on their surrounding contextual information in the literature, which is beyond exact term match or co-occurrence-based methods. BioConceptVec has been thoroughly evaluated in multiple bioinformatics tasks consisting of over 25 million instances from nine different biological datasets. The evaluation results demonstrate that BioConceptVec has better performance than existing methods in all tasks. Finally, BioConceptVec is made freely available to the research community and general public.
Affiliation(s)
- Qingyu Chen
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
- Kyubum Lee
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
- Shankai Yan
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
- Sun Kim
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
- Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
- Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America

9
Islamaj Dogan R, Kim S, Chatr-Aryamontri A, Wei CH, Comeau DC, Antunes R, Matos S, Chen Q, Elangovan A, Panyam NC, Verspoor K, Liu H, Wang Y, Liu Z, Altinel B, Hüsünbeyi ZM, Özgür A, Fergadis A, Wang CK, Dai HJ, Tran T, Kavuluru R, Luo L, Steppi A, Zhang J, Qu J, Lu Z. Overview of the BioCreative VI Precision Medicine Track: mining protein interactions and mutations for precision medicine. Database (Oxford) 2019; 2019:5303240. [PMID: 30689846 PMCID: PMC6348314 DOI: 10.1093/database/bay147] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.8] [Received: 03/02/2018] [Accepted: 12/19/2018] [Indexed: 12/16/2022]
Abstract
The Precision Medicine Initiative is a multicenter effort aiming at formulating personalized treatments leveraging on individual patient data (clinical, genome sequence and functional genomic data) together with the information in large knowledge bases (KBs) that integrate genome annotation, disease association studies, electronic health records and other data types. The biomedical literature provides a rich foundation for populating these KBs, reporting genetic and molecular interactions that provide the scaffold for the cellular regulatory systems and detailing the influence of genetic variants in these interactions. The goal of BioCreative VI Precision Medicine Track was to extract this particular type of information and was organized in two tasks: (i) document triage task, focused on identifying scientific literature containing experimentally verified protein-protein interactions (PPIs) affected by genetic mutations and (ii) relation extraction task, focused on extracting the affected interactions (protein pairs). To assist system developers and task participants, a large-scale corpus of PubMed documents was manually annotated for this task. Ten teams worldwide contributed 22 distinct text-mining models for the document triage task, and six teams worldwide contributed 14 different text-mining systems for the relation extraction task. When comparing the text-mining system predictions with human annotations, for the triage task, the best F-score was 69.06%, the best precision was 62.89%, the best recall was 98.0% and the best average precision was 72.5%. For the relation extraction task, when taking homologous genes into account, the best F-score was 37.73%, the best precision was 46.5% and the best recall was 54.1%. Submitted systems explored a wide range of methods, from traditional rule-based, statistical and machine learning systems to state-of-the-art deep learning methods. 
Given the level of participation and the individual team results we find the precision medicine track to be successful in engaging the text-mining research community. In the meantime, the track produced a manually annotated corpus of 5509 PubMed documents developed by BioGRID curators and relevant for precision medicine. The data set is freely available to the community, and the specific interactions have been integrated into the BioGRID data set. In addition, this challenge provided the first results of automatically identifying PubMed articles that describe PPI affected by mutations, as well as extracting the affected relations from those articles. Still, much progress is needed for computer-assisted precision medicine text mining to become mainstream. Future work should focus on addressing the remaining technical challenges and incorporating the practical benefits of text-mining tools into real-world precision medicine information-related curation.
Affiliation(s)
- Rezarta Islamaj Dogan
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Sun Kim
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Chih-Hsuan Wei
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Donald C Comeau
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Rui Antunes
- Department of Electronics, Telecommunications and Informatics (DETI)/Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Aveiro, Portugal
- Sérgio Matos
- Department of Electronics, Telecommunications and Informatics (DETI)/Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Aveiro, Portugal
- Qingyu Chen
- School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, Australia
- Aparna Elangovan
- School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, Australia
- Nagesh C Panyam
- School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, Australia
- Karin Verspoor
- School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, Australia
- Hongfang Liu
- Department of Health Science Research, Mayo Clinic, Rochester, MN, USA
- Yanshan Wang
- Department of Health Science Research, Mayo Clinic, Rochester, MN, USA
- Zhuang Liu
- School of Computer Science and Technology, Dalian University of Technology, Dalian, China
- Berna Altinel
- Department of Computer Engineering, Marmara University, Istanbul, Turkey
- Aris Fergadis
- School of Electrical and Computer Engineering, National Technical University of Athens, Zografou, Athens, Greece
- Chen-Kai Wang
- Graduate Institute of Biomedical Informatics, Taipei Medical University, Taipei, Taiwan
- Hong-Jie Dai
- Department of Electrical Engineering, National Kaohsiung University of Science and Technology, Kaohsiung, Taiwan
- Tung Tran
- Department of Computer Science, University of Kentucky, Lexington, KY, USA
- Ramakanth Kavuluru
- Division of Biomedical Informatics, Department of Internal Medicine, University of Kentucky, Lexington, KY, USA
- Ling Luo
- College of Computer Science and Technology, Dalian University of Technology, Dalian, China
- Albert Steppi
- Department of Statistics, Florida State University, Florida, USA
- Jinfeng Zhang
- Department of Statistics, Florida State University, Florida, USA
- Jinchan Qu
- Department of Statistics, Florida State University, Florida, USA
- Zhiyong Lu
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA

10
Jiang X, Ringwald M, Blake JA, Arighi C, Zhang G, Shatkay H. An effective biomedical document classification scheme in support of biocuration: addressing class imbalance. Database (Oxford) 2019; 2019:baz045. [PMID: 31032839 PMCID: PMC6482935 DOI: 10.1093/database/baz045] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 10/31/2018] [Revised: 02/26/2019] [Accepted: 03/18/2019] [Indexed: 01/01/2023]
Abstract
Published literature is an important source of knowledge supporting biomedical research. Given the large and increasing number of publications, automated document classification plays an important role in biomedical research. Effective biomedical document classifiers are especially needed for bio-databases, in which the information stems from many thousands of biomedical publications that curators must read in detail and annotate. In addition, biomedical document classification often amounts to identifying a small subset of relevant publications within a much larger collection of available documents. As such, addressing class imbalance is essential to a practical classifier. We present here an effective classification scheme for automatically identifying papers among a large pool of biomedical publications that contain information relevant to a specific topic, which the curators are interested in annotating. The proposed scheme is based on a meta-classification framework using cluster-based under-sampling combined with named-entity recognition and statistical feature selection strategies. We examined the performance of our method over a large imbalanced data set that was originally manually curated by the Jackson Laboratory's Gene Expression Database (GXD). The set consists of more than 90 000 PubMed abstracts, of which about 13 000 documents are labeled as relevant to GXD while the others are not relevant. Our results, 0.72 precision, 0.80 recall and 0.75 f-measure, demonstrate that our proposed classification scheme effectively categorizes such a large data set in the face of data imbalance.
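For readers relating the three reported figures, the f-measure is conventionally the harmonic mean of precision and recall. The sketch below assumes that standard definition (the abstract does not state which variant was used), and the function name `f_measure` is hypothetical.

```python
def f_measure(precision, recall, beta=1.0):
    # F-beta score; beta=1 gives the harmonic mean of precision and recall.
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Rounded values quoted in the abstract above.
print(round(f_measure(0.72, 0.80), 2))  # prints 0.76
```

Recomputing from the rounded precision and recall gives about 0.76, slightly above the reported 0.75. The gap could come from rounding of the inputs or from averaging across evaluation folds; the abstract does not say which.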
Affiliation(s)
- Xiangying Jiang
  - Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA
- Judith A Blake
  - The Jackson Laboratory, 600 Main St., Bar Harbor, ME, USA
- Cecilia Arighi
  - Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA
  - Center of Bioinformatics and Computational Biology, Delaware Biotechnology Institute, Newark, DE, USA
- Gongbo Zhang
  - Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA
- Hagit Shatkay
  - Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA
  - Center of Bioinformatics and Computational Biology, Delaware Biotechnology Institute, Newark, DE, USA
|
11
|
Yang X, Song Z, Wu C, Wang W, Li G, Zhang W, Wu L, Lu K. Constructing a database for the relations between CNV and human genetic diseases via systematic text mining. BMC Bioinformatics 2018; 19:528. [PMID: 30598077 PMCID: PMC6311945 DOI: 10.1186/s12859-018-2526-2] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Indexed: 12/13/2022]
Abstract
BACKGROUND The detection and interpretation of CNVs are of clinical importance in genetic testing. Several databases and web services are already being used by clinical geneticists to interpret the medical relevance of identified CNVs in patients. However, geneticists or physicians would like to obtain the original literature context for more detailed information, especially for rare CNVs that were not included in databases. RESULTS The resulting CNVdigest database includes 440,485 sentences describing CNV-disease relationships. A total of 1582 CNVs and 2425 diseases are involved. Sentences describing CNV-disease correlations are indexed in CNVdigest, with CNV mentions and disease mentions annotated. CONCLUSIONS In this paper, we use a systematic text mining method to construct a database for the relationship between CNVs and diseases. Based on that, we also developed a concise front-end to facilitate the analysis of CNV/disease association, providing a user-friendly web interface for convenient queries. The resulting system is publicly available at http://cnv.gtxlab.com/.
Affiliation(s)
- Xi Yang
  - School of Computer Science, National University of Defense Technology, Changsha, 410073, China
- Zhuo Song
  - Genetalks Biotech Inc., Beijing, 100176, China
- Chengkun Wu
  - School of Computer Science, National University of Defense Technology, Changsha, 410073, China
  - Institute for Quantum Information & State Key Laboratory of High Performance Computing, College of Computer, National University of Defense Technology, Changsha, 410073, China
- Wei Wang
  - School of Computer Science, National University of Defense Technology, Changsha, 410073, China
- Gen Li
  - Genetalks Biotech Inc., Beijing, 100176, China
- Wei Zhang
  - Genetalks Biotech Inc., Beijing, 100176, China
- Lingqian Wu
  - Center for Medical Genetics, Central South University, 110 Xiangya Road, Changsha, 410078, Hunan, China
- Kai Lu
  - School of Computer Science, National University of Defense Technology, Changsha, 410073, China
|
12
|
Lee K, Famiglietti ML, McMahon A, Wei CH, MacArthur JAL, Poux S, Breuza L, Bridge A, Cunningham F, Xenarios I, Lu Z. Scaling up data curation using deep learning: An application to literature triage in genomic variation resources. PLoS Comput Biol 2018; 14:e1006390. [PMID: 30102703 PMCID: PMC6107285 DOI: 10.1371/journal.pcbi.1006390] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.0] [Received: 04/05/2018] [Revised: 08/23/2018] [Accepted: 07/24/2018] [Indexed: 11/18/2022]
Abstract
Manually curating biomedical knowledge from publications is necessary to build a knowledge-based service that provides highly precise and organized information to users. The process of retrieving relevant publications for curation, which is also known as document triage, is usually carried out by querying and reading articles in PubMed. However, this query-based method often obtains unsatisfactory precision and recall on the retrieved results, and it is difficult to manually generate optimal queries. To address this, we propose a machine-learning assisted triage method. We collect previously curated publications from two databases, UniProtKB/Swiss-Prot and the NHGRI-EBI GWAS Catalog, and use them as a gold-standard dataset for training deep learning models based on convolutional neural networks. We then use the trained models to classify and rank new publications for curation. For evaluation, we apply our method to the real-world manual curation process of UniProtKB/Swiss-Prot and the GWAS Catalog. We demonstrate that our machine-assisted triage method outperforms the current query-based triage methods, improves efficiency, and enriches curated content. Our method achieves a precision 1.81 and 2.99 times higher than that obtained by the current query-based triage methods of UniProtKB/Swiss-Prot and the GWAS Catalog, respectively, without compromising recall. In fact, our method retrieves many additional relevant publications that the query-based method of UniProtKB/Swiss-Prot could not find. As these results show, our machine learning-based method can make the triage process more efficient and is being implemented in production so that human curators can focus on more challenging tasks to improve the quality of knowledge bases. As the volume of literature on genomic variants continues to grow at an increasing rate, it is becoming more difficult for a curator of a variant knowledge base to keep up with and curate all the published papers.
Here, we suggest a deep learning-based literature triage method for genomic variation resources. Our method achieves state-of-the-art performance on the triage task. Moreover, our model does not require any laborious preprocessing or feature engineering steps, which are required for traditional machine learning triage methods. We applied our method to the literature triage process of UniProtKB/Swiss-Prot and the NHGRI-EBI GWAS Catalog for genomic variation by collaborating with the database curators. Both the manual curation teams confirmed that our method achieved higher precision than their previous query-based triage methods without compromising recall. Both results show that our method is more efficient and can replace the traditional query-based triage methods of manually curated databases. Our method can give human curators more time to focus on more challenging tasks such as actual curation as well as the discovery of novel papers/experimental techniques to consider for inclusion.
Affiliation(s)
- Kyubum Lee
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
- Aoife McMahon
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, United Kingdom
- Chih-Hsuan Wei
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
- Jacqueline Ann Langdon MacArthur
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, United Kingdom
- Sylvain Poux
  - Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
- Lionel Breuza
  - Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
- Alan Bridge
  - Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
- Fiona Cunningham
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, United Kingdom
- Ioannis Xenarios
  - Center for Integrative Genomics, University of Lausanne, Lausanne, Switzerland
  - Department of Chemistry and Biochemistry, University of Geneva, Geneva, Switzerland
- Zhiyong Lu
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
|
13
|
Thompson P, Daikou S, Ueno K, Batista-Navarro R, Tsujii J, Ananiadou S. Annotation and detection of drug effects in text for pharmacovigilance. J Cheminform 2018; 10:37. [PMID: 30105604 PMCID: PMC6089860 DOI: 10.1186/s13321-018-0290-y] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Received: 01/12/2018] [Accepted: 07/20/2018] [Indexed: 02/02/2023]
Abstract
Pharmacovigilance (PV) databases record the benefits and risks of different drugs, as a means to ensure their safe and effective use. Creating and maintaining such resources can be complex, since a particular medication may have divergent effects in different individuals, due to specific patient characteristics and/or interactions with other drugs being administered. Textual information from various sources can provide important evidence to curators of PV databases about the usage and effects of drug targets in different medical subjects. However, the efficient identification of relevant evidence can be challenging, due to the increasing volume of textual data. Text mining (TM) techniques can support curators by automatically detecting complex information, such as interactions between drugs, diseases and adverse effects. This semantic information supports the quick identification of documents containing information of interest (e.g., the different types of patients in which a given adverse drug reaction has been observed to occur). TM tools are typically adapted to different domains by applying machine learning methods to corpora that are manually labelled by domain experts using annotation guidelines to ensure consistency. We present a semantically annotated corpus of 597 MEDLINE abstracts, PHAEDRA, encoding rich information on drug effects and their interactions, whose quality is assured through the use of detailed annotation guidelines and the demonstration of high levels of inter-annotator agreement (e.g., 92.6% F-Score for identifying named entities and 78.4% F-Score for identifying complex events, when relaxed matching criteria are applied). To our knowledge, the corpus is unique in the domain of PV, according to the level of detail of its annotations. To illustrate the utility of the corpus, we have trained TM tools based on its rich labels to recognise drug effects in text automatically. 
The corpus and annotation guidelines are available at: http://www.nactem.ac.uk/PHAEDRA/.
Affiliation(s)
- Paul Thompson
  - National Centre for Text Mining, School of Computer Science, Manchester Institute of Biotechnology, University of Manchester, 131 Princess Street, Manchester, M1 7DN UK
- Sophia Daikou
  - National Centre for Text Mining, School of Computer Science, Manchester Institute of Biotechnology, University of Manchester, 131 Princess Street, Manchester, M1 7DN UK
- Kenju Ueno
  - Artificial Intelligence Research Center, National Research and Development Agency (AIST), Tokyo Waterfront 2-3-2 Aomi, Koto-ku, Tokyo, 135-0064 Japan
- Riza Batista-Navarro
  - National Centre for Text Mining, School of Computer Science, Manchester Institute of Biotechnology, University of Manchester, 131 Princess Street, Manchester, M1 7DN UK
- Jun’ichi Tsujii
  - National Centre for Text Mining, School of Computer Science, Manchester Institute of Biotechnology, University of Manchester, 131 Princess Street, Manchester, M1 7DN UK
  - Artificial Intelligence Research Center, National Research and Development Agency (AIST), Tokyo Waterfront 2-3-2 Aomi, Koto-ku, Tokyo, 135-0064 Japan
- Sophia Ananiadou
  - National Centre for Text Mining, School of Computer Science, Manchester Institute of Biotechnology, University of Manchester, 131 Princess Street, Manchester, M1 7DN UK
|
14
|
Müller HM, Van Auken KM, Li Y, Sternberg PW. Textpresso Central: a customizable platform for searching, text mining, viewing, and curating biomedical literature. BMC Bioinformatics 2018; 19:94. [PMID: 29523070 PMCID: PMC5845379 DOI: 10.1186/s12859-018-2103-8] [Citation(s) in RCA: 44] [Impact Index Per Article: 6.3] [Received: 07/13/2017] [Accepted: 03/01/2018] [Indexed: 11/10/2022]
Abstract
BACKGROUND The biomedical literature continues to grow at a rapid pace, making the challenge of knowledge retrieval and extraction ever greater. Tools that provide a means to search and mine the full text of literature thus represent an important way by which the efficiency of these processes can be improved. RESULTS We describe the next generation of the Textpresso information retrieval system, Textpresso Central (TPC). TPC builds on the strengths of the original system by expanding the full text corpus to include the PubMed Central Open Access Subset (PMC OA), as well as the WormBase C. elegans bibliography. In addition, TPC allows users to create a customized corpus by uploading and processing documents of their choosing. TPC is UIMA compliant, to facilitate compatibility with external processing modules, and takes advantage of Lucene indexing and search technology for efficient handling of millions of full text documents. Like Textpresso, TPC searches can be performed using keywords and/or categories (semantically related groups of terms), but to provide better context for interpreting and validating queries, search results may now be viewed as highlighted passages in the context of full text. To facilitate biocuration efforts, TPC also allows users to select text spans from the full text and annotate them, create customized curation forms for any data type, and send resulting annotations to external curation databases. As an example of such a curation form, we describe integration of TPC with the Noctua curation tool developed by the Gene Ontology (GO) Consortium. CONCLUSION Textpresso Central is an online literature search and curation platform that enables biocurators and biomedical researchers to search and mine the full text of literature by integrating keyword and category searches with viewing search results in the context of the full text. 
It also allows users to create customized curation interfaces, use those interfaces to make annotations linked to supporting evidence statements, and then send those annotations to any database in the world. Textpresso Central URL: http://www.textpresso.org/tpc.
Affiliation(s)
- H.-M. Müller
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125 USA
- K. M. Van Auken
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125 USA
- Y. Li
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125 USA
- P. W. Sternberg
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125 USA
|
15
|
Hsu YY, Wei CH, Lu Z. Assisting document triage for human kinome curation via machine learning. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2018; 2018:5094578. [PMID: 30239677 PMCID: PMC6146134 DOI: 10.1093/database/bay091] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 02/28/2018] [Accepted: 08/13/2018] [Indexed: 11/16/2022]
Abstract
In the era of data explosion, the increasing frequency of published articles presents unorthodox challenges in fulfilling specific curation requirements for bio-literature databases. Recognizing these demands, we designed a document triage system with automatic methods that can improve the efficiency of retrieving the most relevant articles in curation workflows and reduce workloads for biocurators. Since BioCreative VI (2017), we have implemented text mining processing in our system in hopes of providing higher effectiveness for curating articles related to human kinase proteins. We tested several machine learning methods together with state-of-the-art concept extraction tools. For features, we extracted rich co-occurrence and linguistic information to model the curation process of human kinome articles by the neXtProt database. As shown in the official evaluation on the human kinome curation task in BioCreative VI, our system can effectively retrieve 5.2 and 6.5 kinase articles with the relevant disease (DIS) and biological process (BP) information, respectively, among the top 100 returned results. Compared with neXtA5, our system demonstrates significant improvements in prioritizing kinome-related articles: it achieves 0.458 and 0.109 for the DIS axis, whereas neXtA5's best-reported mean average precision (MAP) and maximum precision observed are 0.41 and 0.04. Our system also outperforms neXtA5 on the BP axis, with a MAP of 0.195 against neXtA5's reported value of 0.11. These results suggest that our system may be able to assist neXtProt biocurators in practice.
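Mean average precision (MAP), the ranking measure quoted in the abstract above, can be illustrated with a small sketch. This is not the BioCreative VI evaluation script: the function names are hypothetical, and average precision is normalised here by the number of relevant documents that appear in the ranked list, which is one common convention and an assumption about the task's exact definition.

```python
from typing import Sequence


def average_precision(ranked_relevance: Sequence[bool]) -> float:
    """Average of precision@k taken at each rank k that holds a relevant item."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / hits if hits else 0.0


def mean_average_precision(runs: Sequence[Sequence[bool]]) -> float:
    """Mean of the per-query average precision values."""
    return sum(average_precision(run) for run in runs) / len(runs)


# Two toy queries: relevance flags of the returned documents, best rank first.
toy_runs = [[True, False, True], [False, True]]
print(mean_average_precision(toy_runs))
```

A ranking that places relevant articles earlier receives a higher score, which is why the measure suits triage systems that curators read from the top down.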
Affiliation(s)
- Yi-Yu Hsu
  - National Center for Biotechnology Information, Bethesda, MD, USA
- Chih-Hsuan Wei
  - National Center for Biotechnology Information, Bethesda, MD, USA
- Zhiyong Lu
  - National Center for Biotechnology Information, Bethesda, MD, USA
|
16
|
Chen Q, Panyam NC, Elangovan A, Verspoor K. BioCreative VI Precision Medicine Track system performance is constrained by entity recognition and variations in corpus characteristics. Database (Oxford) 2018; 2018:5255181. [PMID: 30576491 PMCID: PMC6301335 DOI: 10.1093/database/bay122] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 04/02/2018] [Revised: 09/24/2018] [Accepted: 10/16/2018] [Indexed: 01/01/2023]
Abstract
Precision medicine aims to provide personalized treatments based on individual patient profiles. One critical step towards precision medicine is leveraging knowledge derived from biomedical publications, a tremendous literature resource presenting the latest scientific discoveries on genes, mutations and diseases. Biomedical natural language processing (BioNLP) plays a vital role in supporting automation of this process. BioCreative VI Track 4 brings community effort to the task of automatically identifying and extracting protein-protein interactions (PPI) affected by mutations (PPIm), important in the precision medicine context for capturing individual genotype variation related to disease. We present the READ-BioMed team's approach to identifying PPIm-related publications and to extracting specific PPIm information from those publications in the context of the BioCreative VI PPIm track. We observe that current BioNLP tools are insufficient to recognise entities for these two tasks; the best existing mutation recognition tool achieves only 55% recall in the document triage training set, while relation extraction performance is limited by the low recall performance of gene entity recognition. We develop the models accordingly: for document triage, we develop term lists capturing interactions and mutations to complement BioNLP tools, and select effective features via a feature contribution study, whereas an ensemble of BioNLP tools is employed for relation extraction. Our best document triage model achieves an F-score of 66.77% while our best model for relation extraction achieved an F-score of 35.09% over the final (updated post-task) test set. Impacting the document triage task, the characteristics of mutations are statistically different in the training and testing sets.
While a vital new direction for biomedical text mining research, this early attempt to tackle the problem of identifying genetic variation of substantial biological significance highlights the importance of representative training data and the cascading impact of tool limitations in a modular system.
Affiliation(s)
- Qingyu Chen
  - School of Computing and Information Systems, The University of Melbourne, Parkville VIC Australia
- Nagesh C Panyam
  - School of Computing and Information Systems, The University of Melbourne, Parkville VIC Australia
- Aparna Elangovan
  - School of Computing and Information Systems, The University of Melbourne, Parkville VIC Australia
- Karin Verspoor
  - School of Computing and Information Systems, The University of Melbourne, Parkville VIC Australia
|
17
|
Jiang X, Ringwald M, Blake J, Shatkay H. Effective biomedical document classification for identifying publications relevant to the mouse Gene Expression Database (GXD). DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2017; 2017:3084695. [PMID: 28365740 PMCID: PMC5467553 DOI: 10.1093/database/bax017] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 11/01/2016] [Accepted: 02/13/2017] [Indexed: 12/16/2022]
Abstract
The Gene Expression Database (GXD) is a comprehensive online database within the Mouse Genome Informatics resource, aiming to provide available information about endogenous gene expression during mouse development. The information stems primarily from many thousands of biomedical publications that database curators must go through and read. Given the very large number of biomedical papers published each year, automatic document classification plays an important role in biomedical research. Specifically, an effective and efficient document classifier is needed for supporting the GXD annotation workflow. We present here an effective yet relatively simple classification scheme, which uses readily available tools while employing feature selection, aiming to assist curators in identifying publications relevant to GXD. We examine the performance of our method over a large manually curated dataset, consisting of more than 25 000 PubMed abstracts, of which about half are curated as relevant to GXD and the other half as irrelevant to GXD. In addition to text from title-and-abstract, we also consider image captions, an important information source that we integrate into our method. We apply a captions-based classifier to a subset of about 3300 documents, for which the full text of the curated articles is available. The results demonstrate that our proposed approach is robust and effectively addresses the GXD document classification. Moreover, using information obtained from image captions clearly improves performance, compared to title and abstract alone, affirming the utility of image captions as a substantial evidence source for automatically determining the relevance of biomedical publications to a specific subject area. Database URL: www.informatics.jax.org
Affiliation(s)
- Xiangying Jiang
  - Department of Computer and Information Sciences, University of Delaware, 101 Smith Hall, Newark, DE, USA
- Martin Ringwald
  - Department of Computer and Information Sciences, The Jackson Laboratory, 600 Main Street, Bar Harbor, ME, USA
- Judith Blake
  - Department of Computer and Information Sciences, The Jackson Laboratory, 600 Main Street, Bar Harbor, ME, USA
- Hagit Shatkay
  - Department of Computer and Information Sciences, University of Delaware, 101 Smith Hall, Newark, DE, USA
|
18
|
Hassani-Pak K, Rawlings C. Knowledge Discovery in Biological Databases for Revealing Candidate Genes Linked to Complex Phenotypes. J Integr Bioinform 2017; 14:/j/jib.ahead-of-print/jib-2016-0002/jib-2016-0002.xml. [PMID: 28609292 PMCID: PMC6042805 DOI: 10.1515/jib-2016-0002] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.8] [Received: 01/12/2017] [Accepted: 02/16/2017] [Indexed: 02/06/2023]
Abstract
Genetics and “omics” studies designed to uncover genotype to phenotype relationships often identify large numbers of potential candidate genes, among which the causal genes are hidden. Scientists generally lack the time and technical expertise to review all relevant information available from the literature, from key model species and from a potentially wide range of related biological databases in a variety of data formats with variable quality and coverage. Computational tools are needed for the integration and evaluation of heterogeneous information in order to prioritise candidate genes and components of interaction networks that, if perturbed through potential interventions, have a positive impact on the biological outcome in the whole organism without producing negative side effects. Here we review several bioinformatics tools and databases that play an important role in biological knowledge discovery and candidate gene prioritization. We conclude with several key challenges that need to be addressed in order to facilitate biological knowledge discovery in the future.
|
19
|
Abstract
Easy access to a vast collection of experimental data on immune epitopes can greatly facilitate the development of therapeutics and vaccines. The Immune Epitope Database and Analysis Resource (IEDB) was developed to provide such a resource as a free service to the biomedical research community. The IEDB contains epitope and assay information related to infectious diseases, autoimmune diseases, allergic diseases, and transplant/alloantigens for humans, nonhuman primates, mice, and any other species studied. It contains T cell, B cell, MHC binding, and MHC ligand elution experiments. Its data are curated primarily from the published literature and also include direct submissions from researchers involved in epitope discovery. This article describes the process of capturing data from these sources and how the information is organized in the IEDB data. Different approaches for querying the data are then presented, using the home page search interface and the various specialized search interfaces. Specific examples covering diverse applications of interest are given to highlight the power and functionality of the IEDB.
|
20
|
Mao Y, Lu Z. MeSH Now: automatic MeSH indexing at PubMed scale via learning to rank. J Biomed Semantics 2017; 8:15. [PMID: 28412964 PMCID: PMC5392968 DOI: 10.1186/s13326-017-0123-3] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.5] [Received: 06/30/2016] [Accepted: 03/16/2017] [Indexed: 01/15/2023]
Abstract
BACKGROUND MeSH indexing is the task of assigning relevant MeSH terms based on a manual reading of scholarly publications by human indexers. The task is highly important for improving literature retrieval and many other scientific investigations in biomedical research. Unfortunately, given its manual nature, the process of MeSH indexing is both time-consuming (new articles are typically not indexed until 2 or 3 months later) and costly (approximately ten dollars per article). In response, automatic indexing by computers has been previously proposed and attempted but remains challenging. In order to advance the state of the art in automatic MeSH indexing, a community-wide shared task called BioASQ was recently organized. METHODS We propose MeSH Now, an integrated approach that first uses multiple strategies to generate a combined list of candidate MeSH terms for a target article. Through a novel learning-to-rank framework, MeSH Now then ranks the list of candidate terms based on their relevance to the target article. Finally, MeSH Now selects the highest-ranked MeSH terms via a post-processing module. RESULTS We assessed MeSH Now on two separate benchmarking datasets using traditional precision, recall and F1-score metrics. In both evaluations, MeSH Now consistently achieved over 0.60 in F-score, ranging from 0.610 to 0.612. Furthermore, additional experiments show that MeSH Now can be optimized by parallel computing in order to process MEDLINE documents on a large scale. CONCLUSIONS We conclude that MeSH Now is a robust approach with state-of-the-art performance for automatic MeSH indexing and that MeSH Now is capable of processing PubMed scale documents within a reasonable time frame. AVAILABILITY http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/MeSHNow/.
Affiliation(s)
- Yuqing Mao
  - Nanjing University of Chinese Medicine, 138 Xianlin Avenue, Nanjing, Jiangsu, 210023, China
  - National Center for Biotechnology Information (NCBI), 8600 Rockville Pike, Bethesda, MD, 20894, USA
- Zhiyong Lu
  - National Center for Biotechnology Information (NCBI), 8600 Rockville Pike, Bethesda, MD, 20894, USA
|
21
|
Liu X, Yang Z, Lin H, Simmons M, Lu Z. DIGNiFI: Discovering causative genes for orphan diseases using protein-protein interaction networks. BMC SYSTEMS BIOLOGY 2017; 11:23. [PMID: 28361678 PMCID: PMC5374555 DOI: 10.1186/s12918-017-0402-8] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Indexed: 01/25/2023]
Abstract
BACKGROUND An orphan disease is any disease that affects a small percentage of the population. Orphan diseases are a great burden to patients and society, and most of them are genetic in origin. Unfortunately, our current understanding of the genes responsible for inherited orphan diseases is still quite limited. Developing effective computational algorithms to discover disease-causing genes would help unveil disease mechanisms and may enable better diagnosis and treatment. RESULTS We have developed a novel method, named DIGNiFI (Disease causIng GeNe FInder), which uses Protein-Protein Interaction (PPI) network-based features to discover and rank candidate disease-causing genes. Specifically, our approach computes topologically similar genes by taking into account both local and global connected paths in PPI networks via Direct Neighbors and Local Random Walks, respectively. Furthermore, since genes with similar phenotypes tend to be functionally related, we have integrated PPI data with gene ontology (GO) annotations and protein complex data to further improve the performance of this approach. Results of 128 orphan diseases with 1184 known disease genes collected from the Orphanet show that our proposed methods outperform existing state-of-the-art methods for discovering candidate disease-causing genes. We also show that further performance improvement can be achieved when enriching the human-curated PPI network data with text-mined interactions from the biomedical literature. Finally, we demonstrate the utility of our approach by applying our method to identifying novel candidate genes for a set of four inherited retinal dystrophies. In this study, we found the top predictions for these retinal dystrophies consistent with literature reports and online databases of other retinal dystrophies. CONCLUSIONS Our method successfully prioritizes orphan-disease-causative genes.
This method has great potential to benefit the field of orphan disease research, where resources are scarce and greatly needed.
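As an illustration of the network-based ranking idea described in this abstract, the following is a minimal sketch of a fixed-length random walk that scores candidate genes by how much probability mass reaches them from known disease genes. It is not the DIGNiFI implementation: the function names, the toy graph, the step count and the scoring rule (summing the mass that visits each node) are all assumptions made for illustration.

```python
from collections import defaultdict


def local_random_walk(adjacency, seeds, steps=3):
    """Accumulate the probability mass that visits each node within `steps` steps.

    Hypothetical helper: `adjacency` maps a gene to its interaction partners,
    `seeds` is the set of known disease genes.
    """
    # Start with the mass split evenly over the seed genes.
    prob = {gene: 1.0 / len(seeds) for gene in seeds}
    visited_mass = defaultdict(float)
    for _ in range(steps):
        nxt = defaultdict(float)
        for gene, mass in prob.items():
            neighbors = adjacency.get(gene, [])
            if not neighbors:
                nxt[gene] += mass  # an isolated node keeps its mass
                continue
            share = mass / len(neighbors)
            for nb in neighbors:
                nxt[nb] += share
        prob = dict(nxt)
        for gene, mass in prob.items():
            visited_mass[gene] += mass
    return dict(visited_mass)


def rank_candidates(adjacency, seeds, steps=3):
    """Rank non-seed genes by accumulated mass, highest first (ties by name)."""
    scores = local_random_walk(adjacency, seeds, steps)
    candidates = [(g, s) for g, s in scores.items() if g not in seeds]
    return sorted(candidates, key=lambda item: (-item[1], item[0]))


# Toy undirected PPI graph with made-up gene names.
toy_ppi = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C", "E"],
    "E": ["D"],
}
ranked = rank_candidates(toy_ppi, {"A"}, steps=2)
```

With a short walk, genes close to the seed and well connected to its neighborhood rank first, and genes beyond the walk length (here `E`) receive no score at all. A longer walk trades this locality for broader coverage of the network.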
Affiliation(s)
- Xiaoxia Liu
- College of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning, 116024, China; National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health, Bethesda, 20894, MD, USA
- Zhihao Yang
- College of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning, 116024, China
- Hongfei Lin
- College of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning, 116024, China
- Michael Simmons
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health, Bethesda, 20894, MD, USA
- Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health, Bethesda, 20894, MD, USA
|
22
|
Islamaj Dogan R, Kim S, Chatr-Aryamontri A, Chang CS, Oughtred R, Rust J, Wilbur WJ, Comeau DC, Dolinski K, Tyers M. The BioC-BioGRID corpus: full text articles annotated for curation of protein-protein and genetic interactions. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2017; 2017:baw147. [PMID: 28077563 PMCID: PMC5225395 DOI: 10.1093/database/baw147] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.6] [Received: 07/30/2016] [Revised: 10/14/2016] [Accepted: 10/18/2016] [Indexed: 11/13/2022]
Abstract
A great deal of information on the molecular genetics and biochemistry of model organisms has been reported in the scientific literature. However, this data is typically described in free text form and is not readily amenable to computational analyses. To this end, the BioGRID database systematically curates the biomedical literature for genetic and protein interaction data. This data is provided in a standardized computationally tractable format and includes structured annotation of experimental evidence. BioGRID curation necessarily involves substantial human effort by expert curators who must read each publication to extract the relevant information. Computational text-mining methods offer the potential to augment and accelerate manual curation. To facilitate the development of practical text-mining strategies, a new challenge was organized in BioCreative V for the BioC task, the collaborative Biocurator Assistant Task. This was a non-competitive, cooperative task in which the participants worked together to build BioC-compatible modules into an integrated pipeline to assist BioGRID curators. As an integral part of this task, a test collection of full text articles was developed that contained both biological entity annotations (gene/protein and organism/species) and molecular interaction annotations (protein–protein and genetic interactions (PPIs and GIs)). This collection, which we call the BioC-BioGRID corpus, was annotated by four BioGRID curators over three rounds of annotation and contains 120 full text articles curated in a dataset representing two major model organisms, namely budding yeast and human. The BioC-BioGRID corpus contains annotations for 6409 mentions of genes and their Entrez Gene IDs, 186 mentions of organism names and their NCBI Taxonomy IDs, 1867 mentions of PPIs and 701 annotations of PPI experimental evidence statements, 856 mentions of GIs and 399 annotations of GI evidence statements. 
The purpose, characteristics and possible future uses of the BioC-BioGRID corpus are detailed in this report. Database URL: http://bioc.sourceforge.net/BioC-BioGRID.html
Affiliation(s)
- Rezarta Islamaj Dogan
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Sun Kim
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Andrew Chatr-Aryamontri
- Institute for Research in Immunology and Cancer, Université de Montréal, Montréal, QC H3C 3J7, Canada
- Christie S Chang
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
- Rose Oughtred
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
- Jennifer Rust
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
- W John Wilbur
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Donald C Comeau
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Kara Dolinski
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
- Mike Tyers
- Institute for Research in Immunology and Cancer, Université de Montréal, Montréal, QC H3C 3J7, Canada; Mount Sinai Hospital, The Lunenfeld-Tanenbaum Research Institute, Canada
|
23
|
Abstract
In this chapter, we explain how text mining can support the curation of molecular biology databases dealing with protein functions. We also show how curated data can play a disruptive role in the development of text mining methods. We review a decade of efforts to improve the automatic assignment of Gene Ontology (GO) descriptors, the reference ontology for the characterization of genes and gene products. To illustrate the high potential of this approach, we compare the performance of an automatic text categorizer and show a large improvement of +225% in both precision and recall on benchmarked data. We argue that automatic text categorization functions can ultimately be embedded into a Question-Answering (QA) system to answer questions related to protein functions. Because GO descriptors can be relatively long and specific, traditional QA systems cannot answer such questions. A new type of QA system, so-called Deep QA, which uses machine learning methods trained with curated contents, is thus emerging. Finally, future advances of text mining instruments are directly dependent on the availability of high-quality annotated contents at every curation step. Database workflows must start recording explicitly all the data they curate and ideally also some of the data they do not curate.
Affiliation(s)
- Patrick Ruch
- SIB Text Mining, Swiss Institute of Bioinformatics, Geneva, Switzerland.
- BiTeM Group, HES-SO\HEG Genève, 7 route de Drize, CH-1227, Carouge, Switzerland.
|
24
|
Singhal A, Leaman R, Catlett N, Lemberger T, McEntyre J, Polson S, Xenarios I, Arighi C, Lu Z. Pressing needs of biomedical text mining in biocuration and beyond: opportunities and challenges. Database (Oxford) 2016; 2016:baw161. [PMID: 28025348 PMCID: PMC5199160 DOI: 10.1093/database/baw161] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.9] [Received: 04/19/2016] [Revised: 11/10/2016] [Accepted: 11/11/2016] [Indexed: 12/24/2022]
Abstract
Text mining in the biomedical sciences is rapidly transitioning from small-scale evaluation to large-scale application. In this article, we argue that text-mining technologies have become essential tools in real-world biomedical research. We describe four large scale applications of text mining, as showcased during a recent panel discussion at the BioCreative V Challenge Workshop. We draw on these applications as case studies to characterize common requirements for successfully applying text-mining techniques to practical biocuration needs. We note that system 'accuracy' remains a challenge and identify several additional common difficulties and potential research directions including (i) the 'scalability' issue due to the increasing need of mining information from millions of full-text articles, (ii) the 'interoperability' issue of integrating various text-mining systems into existing curation workflows and (iii) the 'reusability' issue on the difficulty of applying trained systems to text genres that are not seen previously during development. We then describe related efforts within the text-mining community, with a special focus on the BioCreative series of challenge workshops. We believe that focusing on the near-term challenges identified in this work will amplify the opportunities afforded by the continued adoption of text-mining tools. Finally, in order to sustain the curation ecosystem and have text-mining systems adopted for practical benefits, we call for increased collaboration between text-mining researchers and various stakeholders, including researchers, publishers and biocurators.
Affiliation(s)
- Ayush Singhal
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Robert Leaman
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Johanna McEntyre
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Shawn Polson
- Center for Bioinformatics and Computational Biology and Department of Computer and Information Sciences, Delaware Biotechnology Institute, University of Delaware, Newark, DE 19711, USA
- Cecilia Arighi
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Center for Bioinformatics and Computational Biology and Department of Computer and Information Sciences, Delaware Biotechnology Institute, University of Delaware, Newark, DE 19711, USA
- Zhiyong Lu
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
|
25
|
Peng Y, Wei CH, Lu Z. Improving chemical disease relation extraction with rich features and weakly labeled data. J Cheminform 2016; 8:53. [PMID: 28316651 PMCID: PMC5054544 DOI: 10.1186/s13321-016-0165-z] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.9] [Received: 03/22/2016] [Accepted: 09/28/2016] [Indexed: 01/08/2023]
Abstract
Background: Due to the importance of identifying relations between chemicals and diseases for new drug discovery and improving chemical safety, there has been a growing interest in developing automatic relation extraction systems for capturing these relations from the rich and rapidly growing biomedical literature. In this work we aim to build on current advances in named entity recognition and a recent BioCreative effort to further improve the state of the art in biomedical relation extraction, in particular for the chemical-induced disease (CID) relations. Results: We propose a rich-feature approach with Support Vector Machine to aid in the extraction of CIDs from PubMed articles. Our feature vector includes novel statistical features, linguistic knowledge, and domain resources. We also incorporate the output of a rule-based system as features, thus combining the advantages of rule- and machine learning-based systems. Furthermore, we augment our approach with automatically generated labeled text from an existing knowledge base to improve performance without additional cost for corpus construction. To evaluate our system, we perform experiments on the human-annotated BioCreative V benchmarking dataset and compare with previous results. When trained using only BioCreative V training and development sets, our system achieves an F-score of 57.51%, which already compares favorably to previous methods. Our system performance was further improved to 61.01% in F-score when augmented with additional automatically generated weakly labeled data. Conclusions: Our text-mining approach demonstrates state-of-the-art performance in disease-chemical relation extraction. More importantly, this work exemplifies the use of (freely available) curated document-level annotations in existing biomedical databases, which are largely overlooked in text-mining system development.
Affiliation(s)
- Yifan Peng
- National Center for Biotechnology Information, Bethesda, MD 20894 USA; Computer and Information Sciences, University of Delaware, Newark, DE 19716 USA
- Chih-Hsuan Wei
- National Center for Biotechnology Information, Bethesda, MD 20894 USA
- Zhiyong Lu
- National Center for Biotechnology Information, Bethesda, MD 20894 USA
|
26
|
Wang Q, S Abdul S, Almeida L, Ananiadou S, Balderas-Martínez YI, Batista-Navarro R, Campos D, Chilton L, Chou HJ, Contreras G, Cooper L, Dai HJ, Ferrell B, Fluck J, Gama-Castro S, George N, Gkoutos G, Irin AK, Jensen LJ, Jimenez S, Jue TR, Keseler I, Madan S, Matos S, McQuilton P, Milacic M, Mort M, Natarajan J, Pafilis E, Pereira E, Rao S, Rinaldi F, Rothfels K, Salgado D, Silva RM, Singh O, Stefancsik R, Su CH, Subramani S, Tadepally HD, Tsaprouni L, Vasilevsky N, Wang X, Chatr-Aryamontri A, Laulederkind SJF, Matis-Mitchell S, McEntyre J, Orchard S, Pundir S, Rodriguez-Esteban R, Van Auken K, Lu Z, Schaeffer M, Wu CH, Hirschman L, Arighi CN. Overview of the interactive task in BioCreative V. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2016; 2016:baw119. [PMID: 27589961 PMCID: PMC5009325 DOI: 10.1093/database/baw119] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.8] [Received: 04/04/2016] [Accepted: 07/28/2016] [Indexed: 11/14/2022]
Abstract
Fully automated text mining (TM) systems promote efficient literature searching, retrieval, and review but are not sufficient to produce ready-to-consume curated documents. These systems are not meant to replace biocurators, but instead to assist them in one or more literature curation steps. To do so, the user interface is an important aspect that needs to be considered for tool adoption. The BioCreative Interactive task (IAT) is a track designed for exploring user-system interactions, promoting development of useful TM tools, and providing a communication channel between the biocuration and the TM communities. In BioCreative V, the IAT track followed a format similar to previous interactive tracks, where the utility and usability of TM tools, as well as the generation of use cases, have been the focal points. The proposed curation tasks are user-centric and formally evaluated by biocurators. In BioCreative V IAT, seven TM systems and 43 biocurators participated. Two levels of user participation were offered to broaden curator involvement and obtain more feedback on usability aspects. The full level participation involved training on the system, curation of a set of documents with and without TM assistance, tracking of time-on-task, and completion of a user survey. The partial level participation was designed to focus on usability aspects of the interface and not the performance per se. In this case, biocurators navigated the system by performing pre-designed tasks and then were asked whether they were able to achieve the task and the level of difficulty in completing the task. In this manuscript, we describe the development of the interactive task, from planning to execution and discuss major findings for the systems tested. Database URL: http://www.biocreative.org
Affiliation(s)
- Qinghua Wang
- Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, 19711, USA; Department of Computer and Information Sciences, University of Delaware, Newark, DE, 19711, USA
- Shabbir S Abdul
- International Centre of Health Information Technology, Taipei Medical University, Taipei, Taiwan
- Lara Almeida
- DETI/IEETA, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
- Sophia Ananiadou
- National Centre for Text Mining, University of Manchester, Manchester, UK
- Lucy Chilton
- Northern Institute for Cancer Research, Newcastle University, Newcastle, UK
- Hui-Jou Chou
- Rutgers University-Camden, Camden, NJ 08102, USA
- Gabriela Contreras
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, 04510 Ciudad de México, México
- Laurel Cooper
- Department of Botany and Plant Pathology, Oregon State University Corvallis, OR 97331, USA
- Hong-Jie Dai
- Department of Computer Science and Information Engineering, National Taitung University, Taitung, Taiwan
- Barbra Ferrell
- College of Agriculture and Natural Resources, University of Delaware, Newark, DE 19711, USA
- Juliane Fluck
- Fraunhofer Institute for Algorithms and Scientific Computing, Schloss Birlinghoven, 53754 St. Augustin, Germany
- Socorro Gama-Castro
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, 04510 Ciudad de México, México
- Georgios Gkoutos
- College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, Centre for Computational Biology, University of Birmingham, Birmingham B15 2TT, UK; Institute of Translational Medicine, University Hospitals Birmingham NHS Foundation Trust, Birmingham B15 2TT, UK
- Afroza K Irin
- Life Science Informatics, University of Bonn, Bonn, Germany
- Lars J Jensen
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Silvia Jimenez
- Blue Brain Project, École Polytechnique Fédérale de Lausanne (EPFL) Biotech Campus, Geneva, Switzerland
- Toni R Jue
- Prince of Wales Clinical School, University of New South Wales NSW, Sydney, New South Wales, Australia
- Sumit Madan
- Fraunhofer Institute for Algorithms and Scientific Computing, Schloss Birlinghoven, 53754 St. Augustin, Germany
- Sérgio Matos
- DETI/IEETA, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
- Marija Milacic
- Department of Informatics and Bio-Computing, Ontario Institute for Cancer Research, Toronto, ON M5G0A3, Canada
- Matthew Mort
- HGMD, Institute of Medical Genetics, Cardiff University, Heath Park, Cardiff, UK
- Jeyakumar Natarajan
- Department of Bioinformatics, Bharathiar University, Coimbatore, Tamil Nadu, India
- Evangelos Pafilis
- Institute of Marine Biology, Biotechnology and Aquaculture, Hellenic Centre for Marine Research, Heraklion, Crete, Greece
- Emiliano Pereira
- Microbial Genomics and Bioinformatics Group, Max Planck Institute for Marine Microbiology, Bremen, Germany
- Shruti Rao
- Innovation Center for Biomedical Informatics (ICBI), Georgetown University, Washington, DC 20007, USA
- Fabio Rinaldi
- Institute of Computational Linguistics, University of Zurich, Zurich, Switzerland
- Karen Rothfels
- Department of Informatics and Bio-Computing, Ontario Institute for Cancer Research, Toronto, ON M5G0A3, Canada
- David Salgado
- GMGF, Aix-Marseille Universite, 13385 Marseille, France; Inserm, UMR_S 910, 13385 Marseille, France
- Raquel M Silva
- Department of Medical Sciences, iBiMED & IEETA, University of Aveiro, 3810-193 Aveiro, Portugal
- Onkar Singh
- Taipei Medical University Graduate Institute of Biomedical informatics, Taipei, Taiwan
- Chu-Hsien Su
- Institute of Information Science, Academia Sinica, Taipei, Taiwan
- Suresh Subramani
- Department of Bioinformatics, Bharathiar University, Coimbatore, Tamil Nadu, India
- Loukia Tsaprouni
- Institute of Sport and Physical Activity Research (ISPAR), University of Bedfordshire, Bedford, UK
- Nicole Vasilevsky
- Ontology Development Group, Oregon Health & Science University, Portland, OR 97239, USA
- Xiaodong Wang
- WormBase Consortium, Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125, USA
- Sandra Orchard
- European Bioinformatics Institute (EMBL-EBI), Hinxton, UK
- Sangya Pundir
- European Bioinformatics Institute (EMBL-EBI), Hinxton, UK
- Kimberly Van Auken
- WormBase Consortium, Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125, USA
- Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Institutes of Health, Bethesda, MD 20894, USA
- Mary Schaeffer
- MaizeGDB USDA ARS and University of Missouri, Columbia, MO 65211, USA
- Cathy H Wu
- Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, 19711, USA; Department of Computer and Information Sciences, University of Delaware, Newark, DE, 19711, USA
- Cecilia N Arighi
- Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, 19711, USA; Department of Computer and Information Sciences, University of Delaware, Newark, DE, 19711, USA
|
27
|
Fluck J, Madan S, Ansari S, Kodamullil AT, Karki R, Rastegar-Mojarad M, Catlett NL, Hayes W, Szostak J, Hoeng J, Peitsch M. Training and evaluation corpora for the extraction of causal relationships encoded in biological expression language (BEL). DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2016; 2016:baw113. [PMID: 27554092 PMCID: PMC4995071 DOI: 10.1093/database/baw113] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Received: 12/23/2015] [Accepted: 07/07/2016] [Indexed: 01/21/2023]
Abstract
Success in extracting biological relationships is mainly dependent on the complexity of the task as well as the availability of high-quality training data. Here, we describe the new corpora in the systems biology modeling language BEL for training and testing biological relationship extraction systems that we prepared for the BioCreative V BEL track. BEL was designed to capture relationships not only between proteins or chemicals, but also complex events such as biological processes or disease states. A BEL nanopub is the smallest unit of information and represents a biological relationship with its provenance. In BEL relationships (called BEL statements), the entities are normalized to defined namespaces mainly derived from public repositories, such as sequence databases, MeSH or publicly available ontologies. In the BEL nanopubs, the BEL statements are associated with citation information and supportive evidence such as a text excerpt. To enable the training of extraction tools, we prepared BEL resources and made them available to the community. We selected a subset of these resources focusing on a reduced set of namespaces, namely, human and mouse genes, ChEBI chemicals, MeSH diseases and GO biological processes, as well as relationship types ‘increases’ and ‘decreases’. The published training corpus contains 11 000 BEL statements from over 6000 supportive text excerpts. For method evaluation, we selected and re-annotated two smaller subcorpora containing 100 text excerpts. For this re-annotation, the inter-annotator agreement was measured by the BEL track evaluation environment and resulted in a maximal F-score of 91.18% for full statement agreement. In addition, for a set of 100 BEL statements, we do not only provide the gold standard expert annotations, but also text excerpts pre-selected by two automated systems. Those text excerpts were evaluated and manually annotated as true or false supportive in the course of the BioCreative V BEL track task. 
Database URL: http://wiki.openbel.org/display/BIOC/Datasets
Affiliation(s)
- Juliane Fluck
- Fraunhofer Institute for Algorithms and Scientific Computing, Schloss Birlinghoven, Sankt Augustin, Germany
- Sumit Madan
- Fraunhofer Institute for Algorithms and Scientific Computing, Schloss Birlinghoven, Sankt Augustin, Germany
- Sam Ansari
- Philip Morris International R&D, Philip Morris Products S.A, Quai Jeanrenaud 5, Neuchâtel, 2000, Switzerland
- Alpha T Kodamullil
- Fraunhofer Institute for Algorithms and Scientific Computing, Schloss Birlinghoven, Sankt Augustin, Germany
- Reagon Karki
- Fraunhofer Institute for Algorithms and Scientific Computing, Schloss Birlinghoven, Sankt Augustin, Germany
- William Hayes
- Selventa, One Alewife Center, Cambridge, MA 02140, USA
- Justyna Szostak
- Philip Morris International R&D, Philip Morris Products S.A, Quai Jeanrenaud 5, Neuchâtel, 2000, Switzerland
- Julia Hoeng
- Philip Morris International R&D, Philip Morris Products S.A, Quai Jeanrenaud 5, Neuchâtel, 2000, Switzerland
- Manuel Peitsch
- Philip Morris International R&D, Philip Morris Products S.A, Quai Jeanrenaud 5, Neuchâtel, 2000, Switzerland
|
28
|
Pérez-Pérez M, Pérez-Rodríguez G, Rabal O, Vazquez M, Oyarzabal J, Fdez-Riverola F, Valencia A, Krallinger M, Lourenço A. The Markyt visualisation, prediction and benchmark platform for chemical and gene entity recognition at BioCreative/CHEMDNER challenge. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2016; 2016:baw120. [PMID: 27542845 PMCID: PMC5001550 DOI: 10.1093/database/baw120] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 04/25/2016] [Accepted: 08/02/2016] [Indexed: 01/08/2023]
Abstract
Biomedical text mining methods and technologies have improved significantly in the last decade. Considerable efforts have been invested in understanding the main challenges of biomedical literature retrieval and extraction and proposing solutions to problems of practical interest. Most notably, community-oriented initiatives such as the BioCreative challenge have enabled controlled environments for the comparison of automatic systems while pursuing practical biomedical tasks. Under this scenario, the present work describes the Markyt Web-based document curation platform, which has been implemented to support the visualisation, prediction and benchmarking of chemical and gene mention annotations at the BioCreative/CHEMDNER challenge. Creating this platform is an important step for the systematic and public evaluation of automatic prediction systems and the reusability of the knowledge compiled for the challenge. Markyt was not only critical to support the manual annotation and annotation revision process but also facilitated the comparative visualisation of automated results against the manually generated Gold Standard annotations and comparative assessment of generated results. We expect that future biomedical text mining challenges and the text mining community may benefit from the Markyt platform to better explore and interpret annotations and improve automatic system predictions. Database URL: http://www.markyt.org, https://github.com/sing-group/Markyt.
Affiliation(s)
- Martin Pérez-Pérez
- ESEI - Department of Computer Science, University of Vigo, Ourense, Spain
- Obdulia Rabal
- Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra, Pamplona, Spain
- Miguel Vazquez
- Structural Computational Biology Group, Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre, Madrid, Spain
- Julen Oyarzabal
- Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra, Pamplona, Spain
- Alfonso Valencia
- Structural Computational Biology Group, Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre, Madrid, Spain
- Martin Krallinger
- Structural Computational Biology Group, Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre, Madrid, Spain
- Anália Lourenço
- ESEI - Department of Computer Science, University of Vigo, Ourense, Spain; Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra, Pamplona, Spain
|
29
|
Mottin L, Gobeill J, Pasche E, Michel PA, Cusin I, Gaudet P, Ruch P. neXtA5: accelerating annotation of articles via automated approaches in neXtProt. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2016; 2016:baw098. [PMID: 27374119 PMCID: PMC4930835 DOI: 10.1093/database/baw098] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 11/10/2015] [Accepted: 05/31/2016] [Indexed: 11/13/2022]
Abstract
The rapid increase in the number of published articles poses a challenge for curated databases to remain up-to-date. To help the scientific community and database curators deal with this issue, we have developed an application, neXtA5, which prioritizes the literature for specific curation requirements. Our system, neXtA5, is a curation service composed of three main elements. The first component is a named-entity recognition module, which annotates MEDLINE over some predefined axes. This report focuses on three axes: Diseases, the Molecular Function and Biological Process sub-ontologies of the Gene Ontology (GO). The automatic annotations are then stored in a local database, BioMed, for each annotation axis. Additional entities such as species and chemical compounds are also identified. The second component is an existing search engine, which retrieves the most relevant MEDLINE records for any given query. The third component uses the content of BioMed to generate an axis-specific ranking, which takes into account the density of named-entities as stored in the Biomed database. The two ranked lists are ultimately merged using a linear combination, which has been specifically tuned to support the annotation of each axis. The fine-tuning of the coefficients is formally reported for each axis-driven search. Compared with PubMed, which is the system used by most curators, the improvement is the following: +231% for Diseases, +236% for Molecular Functions and +3153% for Biological Process when measuring the precision of the top-returned PMID (P0 or mean reciprocal rank). The current search methods significantly improve the search effectiveness of curators for three important curation axes. Further experiments are being performed to extend the curation types, in particular protein-protein interactions, which require specific relationship extraction capabilities. 
In parallel, user-friendly interfaces powered with a set of JSON web services are currently being implemented into the neXtProt annotation pipeline. Available on: http://babar.unige.ch:8082/neXtA5 Database URL: http://babar.unige.ch:8082/neXtA5/fetcher.jsp
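The merging step described in this abstract, where a search-engine ranking and an axis-specific ranking are combined linearly and then evaluated by mean reciprocal rank, can be sketched as follows. This is an illustrative sketch only, not the neXtA5 code: the min-max normalisation, the weight `alpha`, the function names and the document identifiers are all assumptions.

```python
def min_max(scores):
    """Rescale scores to [0, 1] so the two rankings are comparable (an assumption)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (value - lo) / (hi - lo) for doc, value in scores.items()}


def merge_rankings(search_scores, axis_scores, alpha=0.5):
    """Linear combination: alpha * search score + (1 - alpha) * axis score."""
    a = min_max(search_scores)
    b = min_max(axis_scores)
    combined = {
        doc: alpha * a.get(doc, 0.0) + (1 - alpha) * b.get(doc, 0.0)
        for doc in set(a) | set(b)
    }
    # Highest combined score first; ties are broken by identifier.
    return sorted(combined, key=lambda doc: (-combined[doc], doc))


def mean_reciprocal_rank(rankings, relevant):
    """Average of 1 / rank of the first relevant document per query (0 if none)."""
    total = 0.0
    for query, ranked_docs in rankings.items():
        for position, doc in enumerate(ranked_docs, start=1):
            if doc in relevant[query]:
                total += 1.0 / position
                break
    return total / len(rankings)


# Hypothetical scores for three documents returned for one query.
search_scores = {"doc_a": 3.0, "doc_b": 2.0, "doc_c": 1.0}
axis_scores = {"doc_a": 1.0, "doc_b": 5.0, "doc_c": 4.0}
merged = merge_rankings(search_scores, axis_scores, alpha=0.5)
mrr = mean_reciprocal_rank({"q1": merged}, {"q1": {"doc_a"}})
```

In this toy case the entity-dense `doc_b` overtakes the search engine's top hit `doc_a`. The weight `alpha` plays the role of the coefficient that the abstract says is tuned separately for each annotation axis.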
Collapse
Affiliation(s)
- Luc Mottin
- BiTeM Group, University of Applied Sciences, Western Switzerland-HEG Genève, Information Science Department SIB Text Mining, Swiss Institute of Bioinformatics
| | - Julien Gobeill
- BiTeM Group, University of Applied Sciences, Western Switzerland-HEG Genève, Information Science Department SIB Text Mining, Swiss Institute of Bioinformatics
| | - Emilie Pasche
- BiTeM Group, University of Applied Sciences, Western Switzerland-HEG Genève, Information Science Department SIB Text Mining, Swiss Institute of Bioinformatics
| | | | | | | | - Patrick Ruch
- BiTeM Group, University of Applied Sciences, Western Switzerland-HEG Genève, Information Science Department SIB Text Mining, Swiss Institute of Bioinformatics
|
30
|
Text Mining for Precision Medicine: Bringing Structure to EHRs and Biomedical Literature to Understand Genes and Health. ADVANCES IN EXPERIMENTAL MEDICINE AND BIOLOGY 2016; 939:139-166. [PMID: 27807747 DOI: 10.1007/978-981-10-1503-8_7] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.1] [Indexed: 01/30/2023]
Abstract
The key question of precision medicine is whether it is possible to find clinically actionable granularity in diagnosing disease and classifying patient risk. The advent of next-generation sequencing and the widespread adoption of electronic health records (EHRs) have provided clinicians and researchers a wealth of data and made possible the precise characterization of individual patient genotypes and phenotypes. Unstructured text, found in biomedical publications and clinical notes, is an important component of genotype and phenotype knowledge. Publications in the biomedical literature provide essential information for interpreting genetic data. Likewise, clinical notes contain the richest source of phenotype information in EHRs. Text mining can render these texts computationally accessible and support information extraction and hypothesis generation. This chapter reviews the mechanics of text mining in precision medicine and discusses several specific use cases, including database curation for personalized cancer medicine, patient outcome prediction from EHR-derived cohorts, and pharmacogenomic research. Taken as a whole, these use cases demonstrate how text mining enables effective utilization of existing knowledge sources and thus promotes increased value for patients and healthcare systems. Text mining is an indispensable tool for translating genotype-phenotype data into effective clinical care that will undoubtedly play an important role in the eventual realization of precision medicine.
|
31
|
Liu RL. Passage-Based Bibliographic Coupling: An Inter-Article Similarity Measure for Biomedical Articles. PLoS One 2015; 10:e0139245. [PMID: 26440794 PMCID: PMC4595445 DOI: 10.1371/journal.pone.0139245] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 03/05/2015] [Accepted: 09/09/2015] [Indexed: 11/19/2022]
Abstract
Biomedical literature is an essential source of biomedical evidence. To translate the evidence for biomedicine study, researchers often need to carefully read multiple articles about specific biomedical issues. These articles thus need to be highly related to each other. They should share similar core contents, including research goals, methods, and findings. However, given an article r, it is challenging for search engines to retrieve highly related articles for r. In this paper, we present a technique PBC (Passage-based Bibliographic Coupling) that estimates inter-article similarity by seamlessly integrating bibliographic coupling with the information collected from context passages around important out-link citations (references) in each article. Empirical evaluation shows that PBC can significantly improve the retrieval of those articles that biomedical experts believe to be highly related to specific articles about gene-disease associations. PBC can thus be used to improve search engines in retrieving the highly related articles for any given article r, even when r is cited by very few (or even no) articles. The contribution is essential for those researchers and text mining systems that aim at cross-validating the evidence about specific gene-disease associations.
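As background for the measure named in this abstract, plain bibliographic coupling scores two articles by the references they share. The sketch below shows only that classical baseline, with a cosine-style normalization chosen here as an assumption; it does not reproduce the passage-based weighting that PBC adds, which the abstract does not specify. Reference identifiers are hypothetical.

```python
from math import sqrt
from typing import Set


def coupling_strength(refs_a: Set[str], refs_b: Set[str]) -> int:
    """Classical bibliographic coupling: number of shared references."""
    return len(refs_a & refs_b)


def coupling_similarity(refs_a: Set[str], refs_b: Set[str]) -> float:
    """Shared references normalized by the geometric mean of list sizes."""
    if not refs_a or not refs_b:
        return 0.0
    return len(refs_a & refs_b) / sqrt(len(refs_a) * len(refs_b))


# Two toy articles, each citing four references, two of them in common.
article_r = {"ref1", "ref2", "ref3", "ref4"}
article_s = {"ref3", "ref4", "ref5", "ref6"}
similarity = coupling_similarity(article_r, article_s)
```

Because the score depends only on each article's own reference list, it can be computed for an article that has not yet been cited by anyone, which matches the use case the abstract highlights.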
Affiliation(s)
- Rey-Long Liu
- Department of Medical Informatics, Tzu Chi University, Hualien, Taiwan, R. O. C
|
32
|
GNormPlus: An Integrative Approach for Tagging Genes, Gene Families, and Protein Domains. BIOMED RESEARCH INTERNATIONAL 2015; 2015:918710. [PMID: 26380306 PMCID: PMC4561873 DOI: 10.1155/2015/918710] [Citation(s) in RCA: 119] [Impact Index Per Article: 11.9] [Received: 01/15/2015] [Revised: 04/03/2015] [Accepted: 04/04/2015] [Indexed: 02/01/2023]
Abstract
The automatic recognition of gene names and their associated database identifiers from biomedical text has been widely studied in recent years, as these tasks play an important role in many downstream text-mining applications. Despite significant previous research, only a small number of tools are publicly available and these tools are typically restricted to detecting only mention level gene names or only document level gene identifiers. In this work, we report GNormPlus: an end-to-end and open source system that handles both gene mention and identifier detection. We created a new corpus of 694 PubMed articles to support our development of GNormPlus, containing manual annotations for not only gene names and their identifiers, but also closely related concepts useful for gene name disambiguation, such as gene families and protein domains. GNormPlus integrates several advanced text-mining techniques, including SimConcept for resolving composite gene names. As a result, GNormPlus compares favorably to other state-of-the-art methods when evaluated on two widely used public benchmarking datasets, achieving 86.7% F1-score on the BioCreative II Gene Normalization task dataset and 50.1% F1-score on the BioCreative III Gene Normalization task dataset. The GNormPlus source code and its annotated corpus are freely available, and the results of applying GNormPlus to the entire PubMed are freely accessible through our web-based tool PubTator.
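Several abstracts in this list report F1-scores for gene normalization or GO annotation. As a reminder of how such a figure is typically derived, the sketch below computes precision, recall and F1 over sets of (document, gene identifier) pairs. The pairs are invented for illustration, and the official BioCreative evaluation scripts may differ in detail (for example in how scores are averaged across documents).

```python
from typing import Set, Tuple

Pair = Tuple[str, str]  # (document id, gene identifier), both hypothetical


def precision_recall_f1(predicted: Set[Pair],
                        gold: Set[Pair]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over annotation pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


predicted = {("doc1", "7157"), ("doc1", "672"), ("doc2", "1956")}
gold = {("doc1", "7157"), ("doc2", "1956"), ("doc2", "2064"), ("doc3", "5728")}
precision, recall, f1 = precision_recall_f1(predicted, gold)
```

Here two of three predictions are correct and two of four gold pairs are found, so F1 is the harmonic mean of 2/3 and 1/2.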
|
33
|
Huang CC, Lu Z. Community challenges in biomedical text mining over 10 years: success, failure and the future. Brief Bioinform 2015; 17:132-44. [PMID: 25935162 DOI: 10.1093/bib/bbv024] [Citation(s) in RCA: 107] [Impact Index Per Article: 10.7] [Received: 01/29/2015] [Indexed: 11/13/2022]
Abstract
One effective way to improve the state of the art is through competitions. Following the success of the Critical Assessment of protein Structure Prediction (CASP) in bioinformatics research, a number of challenge evaluations have been organized by the text-mining research community to assess and advance natural language processing (NLP) research for biomedicine. In this article, we review the different community challenge evaluations held from 2002 to 2014 and their respective tasks. Furthermore, we examine these challenge tasks through their targeted problems in NLP research and biomedical applications, respectively. Next, we describe the general workflow of organizing a Biomedical NLP (BioNLP) challenge and involved stakeholders (task organizers, task data producers, task participants and end users). Finally, we summarize the impact and contributions by taking into account different BioNLP challenges as a whole, followed by a discussion of their limitations and difficulties. We conclude with future trends in BioNLP challenge evaluations.
|
34
|
Khare R, Good BM, Leaman R, Su AI, Lu Z. Crowdsourcing in biomedicine: challenges and opportunities. Brief Bioinform 2015; 17:23-32. [PMID: 25888696 DOI: 10.1093/bib/bbv021] [Citation(s) in RCA: 60] [Impact Index Per Article: 6.0] [Indexed: 01/06/2023]
Abstract
The use of crowdsourcing to solve important but complex problems in biomedical and clinical sciences is growing and encompasses a wide variety of approaches. The crowd is diverse and includes online marketplace workers, health information seekers, science enthusiasts and domain experts. In this article, we review and highlight recent studies that use crowdsourcing to advance biomedicine. We classify these studies into two broad categories: (i) mining big data generated from a crowd (e.g. search logs) and (ii) active crowdsourcing via specific technical platforms, e.g. labor markets, wikis, scientific games and community challenges. Through describing each study in detail, we demonstrate the applicability of different methods in a variety of domains in biomedical research, including genomics, biocuration and clinical research. Furthermore, we discuss and highlight the strengths and limitations of different crowdsourcing platforms. Finally, we identify important emerging trends, opportunities and remaining challenges for future crowdsourcing research in biomedicine.
|
35
|
Liu W, Laulederkind SJF, Hayman GT, Wang SJ, Nigam R, Smith JR, De Pons J, Dwinell MR, Shimoyama M. OntoMate: a text-mining tool aiding curation at the Rat Genome Database. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2015; 2015:bau129. [PMID: 25619558 PMCID: PMC4305386 DOI: 10.1093/database/bau129] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Indexed: 01/24/2023]
Abstract
The Rat Genome Database (RGD) is the premier repository of rat genomic, genetic and physiologic data. Converting data from free text in the scientific literature to a structured format is one of the main tasks of all model organism databases. RGD spends considerable effort manually curating gene, Quantitative Trait Locus (QTL) and strain information. The rapidly growing volume of biomedical literature and the active research in the biological natural language processing (bioNLP) community have given RGD the impetus to adopt text-mining tools to improve curation efficiency. Recently, RGD has initiated a project to use OntoMate, an ontology-driven, concept-based literature search engine developed at RGD, as a replacement for the PubMed (http://www.ncbi.nlm.nih.gov/pubmed) search engine in the gene curation workflow. OntoMate tags abstracts with gene names, gene mutations, organism name and most of the 16 ontologies/vocabularies used at RGD. All terms/entities tagged to an abstract are listed with the abstract in the search results. All listed terms are linked both to data entry boxes and a term browser in the curation tool. OntoMate also provides user-activated filters for species, date and other parameters relevant to the literature search. Using the system for literature search and import has streamlined the process compared to using PubMed. The system was built with a scalable and open architecture, including features specifically designed to accelerate the RGD gene curation process. With the use of bioNLP tools, RGD has added more automation to its curation workflow. Database URL: http://rgd.mcw.edu
Affiliation(s)
- Weisong Liu
- Human and Molecular Genetics Center, Medical College of Wisconsin, Department of Quantitative Health Sciences, University of Massachusetts Medical School, Department of Physiology, Medical College of Wisconsin and Department of Surgery, Medical College of Wisconsin, 8701 Watertown Plank Rd, Milwaukee, WI 53226-3548, USA
- Stanley J F Laulederkind
- Human and Molecular Genetics Center, Medical College of Wisconsin, Department of Quantitative Health Sciences, University of Massachusetts Medical School, Department of Physiology, Medical College of Wisconsin and Department of Surgery, Medical College of Wisconsin, 8701 Watertown Plank Rd, Milwaukee, WI 53226-3548, USA
- G Thomas Hayman
- Human and Molecular Genetics Center, Medical College of Wisconsin, Department of Quantitative Health Sciences, University of Massachusetts Medical School, Department of Physiology, Medical College of Wisconsin and Department of Surgery, Medical College of Wisconsin, 8701 Watertown Plank Rd, Milwaukee, WI 53226-3548, USA
- Shur-Jen Wang
- Human and Molecular Genetics Center, Medical College of Wisconsin, Department of Quantitative Health Sciences, University of Massachusetts Medical School, Department of Physiology, Medical College of Wisconsin and Department of Surgery, Medical College of Wisconsin, 8701 Watertown Plank Rd, Milwaukee, WI 53226-3548, USA
- Rajni Nigam
- Human and Molecular Genetics Center, Medical College of Wisconsin, Department of Quantitative Health Sciences, University of Massachusetts Medical School, Department of Physiology, Medical College of Wisconsin and Department of Surgery, Medical College of Wisconsin, 8701 Watertown Plank Rd, Milwaukee, WI 53226-3548, USA
- Jennifer R Smith
- Human and Molecular Genetics Center, Medical College of Wisconsin, Department of Quantitative Health Sciences, University of Massachusetts Medical School, Department of Physiology, Medical College of Wisconsin and Department of Surgery, Medical College of Wisconsin, 8701 Watertown Plank Rd, Milwaukee, WI 53226-3548, USA
- Jeff De Pons
- Human and Molecular Genetics Center, Medical College of Wisconsin, Department of Quantitative Health Sciences, University of Massachusetts Medical School, Department of Physiology, Medical College of Wisconsin and Department of Surgery, Medical College of Wisconsin, 8701 Watertown Plank Rd, Milwaukee, WI 53226-3548, USA
- Melinda R Dwinell
- Human and Molecular Genetics Center, Medical College of Wisconsin, Department of Quantitative Health Sciences, University of Massachusetts Medical School, Department of Physiology, Medical College of Wisconsin and Department of Surgery, Medical College of Wisconsin, 8701 Watertown Plank Rd, Milwaukee, WI 53226-3548, USA
- Mary Shimoyama
- Human and Molecular Genetics Center, Medical College of Wisconsin, Department of Quantitative Health Sciences, University of Massachusetts Medical School, Department of Physiology, Medical College of Wisconsin and Department of Surgery, Medical College of Wisconsin, 8701 Watertown Plank Rd, Milwaukee, WI 53226-3548, USA
|
36
|
Emadzadeh E, Nikfarjam A, Ginn RE, Gonzalez G. Unsupervised gene function extraction using semantic vectors. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2014; 2014:bau084. [PMID: 25209025 PMCID: PMC4160099 DOI: 10.1093/database/bau084] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/17/2022]
Abstract
Finding gene functions discussed in the literature is an important task of information extraction (IE) from biomedical documents. Automated computational methodologies can significantly reduce the need for manual curation and improve quality of other related IE systems. We propose an open-IE method for the BioCreative IV GO shared task (subtask b), focused on finding gene function terms [Gene Ontology (GO) terms] for different genes in an article. The proposed open-IE approach is based on distributional semantic similarity over the GO terms. The method does not require annotated data for training, which makes it highly generalizable. We achieve an F-measure of 0.26 on the test-set in the official submission for BioCreative-GO shared task, the third highest F-measure among the seven participants in the shared task. Database URL: https://code.google.com/p/rainbow-nlp/
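The abstract describes ranking GO terms by distributional semantic similarity without needing annotated training data. One common way to realize that idea is to compare a vector for the text against a vector for each term using cosine similarity, as sketched below. The three-dimensional vectors and term labels are toy placeholders, and how the authors actually build their semantic vectors is not stated in the abstract.

```python
from math import sqrt
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_terms(text_vec: Vector,
               term_vecs: Dict[str, Vector]) -> List[Tuple[str, float]]:
    """Order candidate terms by decreasing similarity to the text vector."""
    scored = [(term, cosine(text_vec, vec)) for term, vec in term_vecs.items()]
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Hypothetical vectors standing in for a sentence and three GO terms.
sentence_vec = [1.0, 0.0, 1.0]
candidate_terms = {
    "term_a": [1.0, 0.0, 1.0],
    "term_b": [0.0, 1.0, 0.0],
    "term_c": [1.0, 1.0, 0.0],
}
ranking = rank_terms(sentence_vec, candidate_terms)
```

Since nothing here is fitted to labelled examples, the same ranking function applies unchanged to any term vocabulary for which vectors are available.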
Affiliation(s)
- Ehsan Emadzadeh
- Department of Biomedical Informatics, Arizona State University, AZ 85259, USA
- Azadeh Nikfarjam
- Department of Biomedical Informatics, Arizona State University, AZ 85259, USA
- Rachel E Ginn
- Department of Biomedical Informatics, Arizona State University, AZ 85259, USA
- Graciela Gonzalez
- Department of Biomedical Informatics, Arizona State University, AZ 85259, USA
|
37
|
Gobeill J, Pasche E, Vishnyakova D, Ruch P. Closing the loop: from paper to protein annotation using supervised Gene Ontology classification. Database (Oxford) 2014; 2014:bau088. [PMID: 25190367 PMCID: PMC4154439 DOI: 10.1093/database/bau088] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 01/31/2014] [Revised: 07/22/2014] [Accepted: 07/22/2014] [Indexed: 11/29/2022]
Abstract
Gene function curation of the literature with Gene Ontology (GO) concepts is one particularly time-consuming task in genomics, and the help from bioinformatics is highly requested to keep up with the flow of publications. In 2004, the first BioCreative challenge already designed a task of automatic GO concepts assignment from a full text. At this time, results were judged far from reaching the performances required by real curation workflows. In particular, supervised approaches produced the most disappointing results because of lack of training data. Ten years later, the available curation data have massively grown. In 2013, the BioCreative IV GO task revisited the automatic GO assignment task. For this issue, we investigated the power of our supervised classifier, GOCat. GOCat computes similarities between an input text and already curated instances contained in a knowledge base to infer GO concepts. The subtask A consisted in selecting GO evidence sentences for a relevant gene in a full text. For this, we designed a state-of-the-art supervised statistical approach, using a naïve Bayes classifier and the official training set, and obtained fair results. The subtask B consisted in predicting GO concepts from the previous output. For this, we applied GOCat and reached leading results, up to 65% for hierarchical recall in the top 20 outputted concepts. Contrary to previous competitions, machine learning has this time outperformed standard dictionary-based approaches. Thanks to BioCreative IV, we were able to design a complete workflow for curation: given a gene name and a full text, this system is able to select evidence sentences for curation and to deliver highly relevant GO concepts. Observed performances are sufficient for being used in a real semiautomatic curation workflow. GOCat is available at http://eagl.unige.ch/GOCat/.
Database URL: http://eagl.unige.ch/GOCat4FT/.
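The abstract says GOCat infers GO concepts by computing similarities between an input text and already curated instances in a knowledge base. That description fits a nearest-neighbour scheme, sketched below under stated assumptions: token-set Jaccard overlap as the similarity, a similarity-weighted vote over the `k` closest instances, and an invented three-entry knowledge base with placeholder concept labels. None of these choices is confirmed by the abstract as GOCat's actual implementation.

```python
from collections import Counter
from typing import List, Set, Tuple

Instance = Tuple[str, List[str]]  # (curated text, assigned concepts)


def tokens(text: str) -> Set[str]:
    """Lowercased whitespace tokens; a real system would normalize more."""
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def predict_concepts(text: str, knowledge_base: List[Instance],
                     k: int = 2, top_n: int = 3) -> List[str]:
    """Vote for concepts among the k most similar curated instances."""
    query = tokens(text)
    scored = [(jaccard(query, tokens(kb_text)), concepts)
              for kb_text, concepts in knowledge_base]
    # Keep the k nearest neighbours (stable sort on descending similarity).
    neighbours = sorted(scored, key=lambda item: -item[0])[:k]
    votes: Counter = Counter()
    for similarity, concepts in neighbours:
        for concept in concepts:
            votes[concept] += similarity
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [concept for concept, _ in ranked[:top_n]]


# Hypothetical curated instances with placeholder concept labels.
kb = [
    ("protein kinase activity in cell cycle", ["concept_kinase", "concept_cell_cycle"]),
    ("dna binding transcription factor", ["concept_dna_binding"]),
    ("kinase activity regulates signaling", ["concept_kinase"]),
]
predicted_concepts = predict_concepts("kinase activity in signaling", kb)
```

A scheme of this kind improves as the knowledge base grows, which is consistent with the abstract's remark that the growth of curation data changed what supervised approaches can achieve.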
Affiliation(s)
- Julien Gobeill
- BiTeM group, University of Applied Sciences-HEG, Library and Information Sciences, Rte de Drize 7, 1227 Geneva, Switzerland, Division of Medical Information Sciences, University and Hospitals of Geneva, Geneva, Switzerland and SIBtex group, SIB Swiss Institute of Bioinformatics, Rue Michel-Servet 1, 1206 Geneva, Switzerland
- Emilie Pasche
- BiTeM group, University of Applied Sciences-HEG, Library and Information Sciences, Rte de Drize 7, 1227 Geneva, Switzerland, Division of Medical Information Sciences, University and Hospitals of Geneva, Geneva, Switzerland and SIBtex group, SIB Swiss Institute of Bioinformatics, Rue Michel-Servet 1, 1206 Geneva, Switzerland
- Dina Vishnyakova
- BiTeM group, University of Applied Sciences-HEG, Library and Information Sciences, Rte de Drize 7, 1227 Geneva, Switzerland, Division of Medical Information Sciences, University and Hospitals of Geneva, Geneva, Switzerland and SIBtex group, SIB Swiss Institute of Bioinformatics, Rue Michel-Servet 1, 1206 Geneva, Switzerland
- Patrick Ruch
- BiTeM group, University of Applied Sciences-HEG, Library and Information Sciences, Rte de Drize 7, 1227 Geneva, Switzerland, Division of Medical Information Sciences, University and Hospitals of Geneva, Geneva, Switzerland and SIBtex group, SIB Swiss Institute of Bioinformatics, Rue Michel-Servet 1, 1206 Geneva, Switzerland
|
38
|
Zhu D, Li D, Carterette B, Liu H. Integrating information retrieval with distant supervision for gene ontology annotation. Database (Oxford) 2014; 2014:bau087. [PMID: 25183856 PMCID: PMC4150992 DOI: 10.1093/database/bau087] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 01/29/2014] [Revised: 07/30/2014] [Accepted: 07/30/2014] [Indexed: 01/08/2023]
Abstract
This article describes our participation in the Gene Ontology Curation task (GO task) in BioCreative IV, where we took part in both subtasks: A) identification of GO evidence sentences (GOESs) for relevant genes in full-text articles and B) prediction of GO terms for relevant genes in full-text articles. For subtask A, we trained a logistic regression model to detect GOESs based on annotations in the training data supplemented with more noisy negatives from an external resource. Then, a greedy approach was applied to associate genes with sentences. For subtask B, we designed two types of systems: (i) search-based systems, which predict GO terms based on existing annotations for GOESs that are of different textual granularities (i.e., full-text articles, abstracts, and sentences) using state-of-the-art information retrieval techniques (i.e., a novel application of the idea of distant supervision) and (ii) a similarity-based system, which assigns GO terms based on the distance between words in sentences and GO terms/synonyms. Our best performing system for subtask A achieves an F1 score of 0.27 based on exact match and 0.387 allowing relaxed overlap match. Our best performing system for subtask B, a search-based system, achieves an F1 score of 0.075 based on exact match and 0.301 considering hierarchical matches. Our search-based systems for subtask B significantly outperformed the similarity-based system. Database URL: https://github.com/noname2020/Bioc.
Affiliation(s)
- Dongqing Zhu
- Department of Health Sciences Research, Mayo Clinic, 200 First St SW, Rochester, MN 55905 and Department of Computer & Information Sciences, University of Delaware, 101 SMITH HALL, Newark, DE 19716, USA
- Dingcheng Li
- Department of Health Sciences Research, Mayo Clinic, 200 First St SW, Rochester, MN 55905 and Department of Computer & Information Sciences, University of Delaware, 101 SMITH HALL, Newark, DE 19716, USA
- Ben Carterette
- Department of Health Sciences Research, Mayo Clinic, 200 First St SW, Rochester, MN 55905 and Department of Computer & Information Sciences, University of Delaware, 101 SMITH HALL, Newark, DE 19716, USA
- Hongfang Liu
- Department of Health Sciences Research, Mayo Clinic, 200 First St SW, Rochester, MN 55905 and Department of Computer & Information Sciences, University of Delaware, 101 SMITH HALL, Newark, DE 19716, USA
|
39
|
Liu RL, Shih CC. Identification of highly related references about gene-disease association. BMC Bioinformatics 2014; 15:286. [PMID: 25155502 PMCID: PMC4162969 DOI: 10.1186/1471-2105-15-286] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 12/07/2013] [Accepted: 08/12/2014] [Indexed: 02/03/2023]
Abstract
BACKGROUND: Curation of gene-disease associations published in literature should be based on careful and frequent survey of the references that are highly related to specific gene-disease associations. Retrieval of the references is thus essential for timely and complete curation. RESULTS: We present a technique CRFref (Conclusive, Rich, and Focused References) that, given a gene-disease pair <g, d>, ranks high those biomedical references that are likely to provide conclusive, rich, and focused results about g and d. Such references are expected to be highly related to the association between g and d. CRFref ranks candidate references based on their scores. To estimate the score of a reference r, CRFref estimates and integrates three measures: degree of conclusiveness, degree of richness, and degree of focus of r with respect to <g, d>. To evaluate CRFref, experiments are conducted on over one hundred thousand references for over one thousand gene-disease pairs. Experimental results show that CRFref performs significantly better than several typical types of baselines in ranking high those references that expert curators select to develop the summaries for specific gene-disease associations. CONCLUSION: CRFref is a good technique to rank high those references that are highly related to specific gene-disease associations. It can be incorporated into existing search engines to prioritize biomedical references for curators and researchers, as well as those text mining systems that aim at the study of gene-disease associations.
Affiliation(s)
- Rey-Long Liu
- Department of Medical Informatics, Tzu Chi University, Hualien, Taiwan.
|
40
|
Mao Y, Van Auken K, Li D, Arighi CN, McQuilton P, Hayman GT, Tweedie S, Schaeffer ML, Laulederkind SJF, Wang SJ, Gobeill J, Ruch P, Luu AT, Kim JJ, Chiang JH, Chen YD, Yang CJ, Liu H, Zhu D, Li Y, Yu H, Emadzadeh E, Gonzalez G, Chen JM, Dai HJ, Lu Z. Overview of the gene ontology task at BioCreative IV. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2014; 2014:bau086. [PMID: 25157073 PMCID: PMC4142793 DOI: 10.1093/database/bau086] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.9] [Indexed: 11/20/2022]
Abstract
Gene Ontology (GO) annotation is a common task among model organism databases (MODs) for capturing gene function data from journal articles. It is a time-consuming and labor-intensive task, and is thus often considered as one of the bottlenecks in literature curation. There is a growing need for semiautomated or fully automated GO curation techniques that will help database curators to rapidly and accurately identify gene function information in full-length articles. Despite multiple attempts in the past, few studies have proven to be useful with regard to assisting real-world GO curation. The shortage of sentence-level training data and opportunities for interaction between text-mining developers and GO curators has limited the advances in algorithm development and corresponding use in practical circumstances. To this end, we organized a text-mining challenge task for literature-based GO annotation in BioCreative IV. More specifically, we developed two subtasks: (i) to automatically locate text passages that contain GO-relevant information (a text retrieval task) and (ii) to automatically identify relevant GO terms for the genes in a given article (a concept-recognition task). With the support from five MODs, we provided teams with >4000 unique text passages that served as the basis for each GO annotation in our task data. Such evidence text information has long been recognized as critical for text-mining algorithm development but was never made available because of the high cost of curation. In total, seven teams participated in the challenge task. From the team results, we conclude that the state of the art in automatically mining GO terms from literature has improved over the past decade while much progress is still needed for computer-assisted GO curation. 
Future work should focus on addressing remaining technical challenges for improved performance of automatic GO concept recognition and incorporating practical benefits of text-mining tools into real-world GO annotation. Database URL: http://www.biocreative.org/tasks/biocreative-iv/track-4-GO/.
Affiliation(s)
- Yuqing Mao
- National Center for Biotechnology Information (NCBI), National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20817, USA WormBase, Division of Biology, California Institute of Technology, 1200 E. California Boulevard, Pasadena, CA 91125, USA, TAIR, Department of Plant Biology, The Arabidopsis Information Resource, Carnegie Institution for Science, Stanford, CA 94305, USA, Center for Bioinformatics and Computational Biology, University of Delaware, 15 Innovation Way, Newark, DE 19711, USA, FlyBase, Department of Genetics, University of Cambridge, Downing Street, Cambridge CB2 3EH, UK, Rat Genome Database, Human and Molecular Genetics Center, Medical College of Wisconsin, 8701 Watertown Plank Road, Milwaukee, WI 53226, USA, USDA-ARS Plant Genetics Research Unit and Division of Plant Sciences, Department of Agronomy, University of Missouri, Columbia, MO 65211, USA, HES-SO, HEG, Library and Information Sciences, 7 route de Drize, CH-1227 Carouge, Switzerland, SIBtex, Swiss Institute of Bioinformatics, Rue Michel Servet 1, 1211 Geneva 4, Switzerland, School of Computer Engineering, Nanyang Technological University, Block N4, #02a-32, Nanyang Avenue, Singapore 639798, Department of Computer Science and Information Engineering, National Cheng-Kung University, No. 1, University Rd., Tainan 701, Taiwan, Republic of China, Department of Radiology, Mackay Memorial Hospital, Taitung Branch, Lane 303 Chang Sha St. 
Taitung, Taiwan, Republic of China, Department of Health Sciences Research, Mayo Clinic, 200 First Street SW, Rochester, MN 55905, USA, Department of Computer Science, University of Delaware, 101 Smith Hall, Newark, DE 19716, USA, Department of Quantitative Health Sciences, University of Massachusetts Medical School, 55 Lake Avenue North (AC7-059), Worcester, MA 01655 USA, Department of Biomedical Informatics, Arizona State University, 13212 East Shea Boulevard Scottsdale, AZ 85259 USA, Institute of Information Science, Academia Sinica, 128 Academia Road, Secti
| | - Kimberly Van Auken
| | - Donghui Li
| | - Cecilia N Arighi
| | - Peter McQuilton
| | - G Thomas Hayman
| | - Susan Tweedie
| | - Mary L Schaeffer
| | - Stanley J F Laulederkind
| | - Shur-Jen Wang
| | - Julien Gobeill
| | - Patrick Ruch
| | - Anh Tuan Luu
| | - Jung-Jae Kim
| | - Jung-Hsien Chiang
| | - Yu-De Chen
| | - Chia-Jung Yang
| | - Hongfang Liu
| | - Dongqing Zhu
| | - Yanpeng Li
| | - Hong Yu
| | - Ehsan Emadzadeh
| | - Graciela Gonzalez
| | - Jian-Ming Chen
| | - Hong-Jie Dai
| | - Zhiyong Lu
| |
|
41
|
Van Auken K, Schaeffer ML, McQuilton P, Laulederkind SJF, Li D, Wang SJ, Hayman GT, Tweedie S, Arighi CN, Done J, Müller HM, Sternberg PW, Mao Y, Wei CH, Lu Z. BC4GO: a full-text corpus for the BioCreative IV GO task. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2014; 2014:bau074. [PMID: 25070993 PMCID: PMC4112614 DOI: 10.1093/database/bau074] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.1] [Indexed: 01/08/2023]
Abstract
Gene function curation via Gene Ontology (GO) annotation is a common task among Model Organism Database groups. Owing to its manual nature, this task is considered one of the bottlenecks in literature curation. There have been many previous attempts at automatic identification of GO terms and supporting information from full text. However, few systems have delivered an accuracy that is comparable with humans. One recognized challenge in developing such systems is the lack of marked sentence-level evidence text that provides the basis for making GO annotations. We aim to create a corpus that includes the GO evidence text along with the three core elements of GO annotations: (i) a gene or gene product, (ii) a GO term and (iii) a GO evidence code. To ensure our results are consistent with real-life GO data, we recruited eight professional GO curators and asked them to follow their routine GO annotation protocols. Our annotators marked up more than 5000 text passages in 200 articles for 1356 distinct GO terms. For evidence sentence selection, the inter-annotator agreement (IAA) results are 9.3% (strict) and 42.7% (relaxed) in F1-measures. For GO term selection, the IAAs are 47% (strict) and 62.9% (hierarchical). Our corpus analysis further shows that abstracts contain ∼10% of relevant evidence sentences and 30% distinct GO terms, while the Results/Experiment section has nearly 60% relevant sentences and >70% GO terms. Further, of those evidence sentences found in abstracts, less than one-third contain enough experimental detail to fulfill the three core criteria of a GO annotation. This result demonstrates the need of using full-text articles for text mining GO annotations. Through its use at the BioCreative IV GO (BC4GO) task, we expect our corpus to become a valuable resource for the BioNLP research community. Database URL: http://www.biocreative.org/resources/corpora/bc-iv-go-task-corpus/.
Affiliation(s)
- Kimberly Van Auken
- WormBase, Division of Biology, California Institute of Technology, 1200 E. California Blvd., Pasadena, CA 91125, USA, USDA-ARS Plant Genetics Research Unit and Division of Plant Sciences, Department of Agronomy, University of Missouri, Columbia, MO 65211, USA, FlyBase, Department of Genetics, University of Cambridge, Downing Street, Cambridge CB2 3EH, UK, Rat Genome Database, Human and Molecular Genetics Center, Medical College of Wisconsin, 8701 Watertown Plank Road, Milwaukee, WI 53226, USA, TAIR, Department of Plant Biology, Carnegie Institution for Science, 260 Panama Street, Stanford, CA 94305, USA, Center for Bioinformatics and Computational Biology, University of Delaware, 15 Innovation Way, Newark, DE 19711, USA, Howard Hughes Medical Institute, California Institute of Technology, 1200 E. California Blvd., Pasadena, CA 91125, USA, National Center for Biotechnology Information (NCBI), 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Mary L Schaeffer
- WormBase, Division of Biology, California Institute of Technology, 1200 E. California Blvd., Pasadena, CA 91125, USA, USDA-ARS Plant Genetics Research Unit and Division of Plant Sciences, Department of Agronomy, University of Missouri, Columbia, MO 65211, USA, FlyBase, Department of Genetics, University of Cambridge, Downing Street, Cambridge CB2 3EH, UK, Rat Genome Database, Human and Molecular Genetics Center, Medical College of Wisconsin, 8701 Watertown Plank Road, Milwaukee, WI 53226, USA, TAIR, Department of Plant Biology, Carnegie Institution for Science, 260 Panama Street, Stanford, CA 94305, USA, Center for Bioinformatics and Computational Biology, University of Delaware, 15 Innovation Way, Newark, DE 19711, USA, Howard Hughes Medical Institute, California Institute of Technology, 1200 E. California Blvd., Pasadena, CA 91125, USA, National Center for Biotechnology Information (NCBI), 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Peter McQuilton
- WormBase, Division of Biology, California Institute of Technology, 1200 E. California Blvd., Pasadena, CA 91125, USA, USDA-ARS Plant Genetics Research Unit and Division of Plant Sciences, Department of Agronomy, University of Missouri, Columbia, MO 65211, USA, FlyBase, Department of Genetics, University of Cambridge, Downing Street, Cambridge CB2 3EH, UK, Rat Genome Database, Human and Molecular Genetics Center, Medical College of Wisconsin, 8701 Watertown Plank Road, Milwaukee, WI 53226, USA, TAIR, Department of Plant Biology, Carnegie Institution for Science, 260 Panama Street, Stanford, CA 94305, USA, Center for Bioinformatics and Computational Biology, University of Delaware, 15 Innovation Way, Newark, DE 19711, USA, Howard Hughes Medical Institute, California Institute of Technology, 1200 E. California Blvd., Pasadena, CA 91125, USA, National Center for Biotechnology Information (NCBI), 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Stanley J F Laulederkind
- WormBase, Division of Biology, California Institute of Technology, 1200 E. California Blvd., Pasadena, CA 91125, USA, USDA-ARS Plant Genetics Research Unit and Division of Plant Sciences, Department of Agronomy, University of Missouri, Columbia, MO 65211, USA, FlyBase, Department of Genetics, University of Cambridge, Downing Street, Cambridge CB2 3EH, UK, Rat Genome Database, Human and Molecular Genetics Center, Medical College of Wisconsin, 8701 Watertown Plank Road, Milwaukee, WI 53226, USA, TAIR, Department of Plant Biology, Carnegie Institution for Science, 260 Panama Street, Stanford, CA 94305, USA, Center for Bioinformatics and Computational Biology, University of Delaware, 15 Innovation Way, Newark, DE 19711, USA, Howard Hughes Medical Institute, California Institute of Technology, 1200 E. California Blvd., Pasadena, CA 91125, USA, National Center for Biotechnology Information (NCBI), 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Donghui Li
- WormBase, Division of Biology, California Institute of Technology, 1200 E. California Blvd., Pasadena, CA 91125, USA, USDA-ARS Plant Genetics Research Unit and Division of Plant Sciences, Department of Agronomy, University of Missouri, Columbia, MO 65211, USA, FlyBase, Department of Genetics, University of Cambridge, Downing Street, Cambridge CB2 3EH, UK, Rat Genome Database, Human and Molecular Genetics Center, Medical College of Wisconsin, 8701 Watertown Plank Road, Milwaukee, WI 53226, USA, TAIR, Department of Plant Biology, Carnegie Institution for Science, 260 Panama Street, Stanford, CA 94305, USA, Center for Bioinformatics and Computational Biology, University of Delaware, 15 Innovation Way, Newark, DE 19711, USA, Howard Hughes Medical Institute, California Institute of Technology, 1200 E. California Blvd., Pasadena, CA 91125, USA, National Center for Biotechnology Information (NCBI), 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Shur-Jen Wang
- WormBase, Division of Biology, California Institute of Technology, 1200 E. California Blvd., Pasadena, CA 91125, USA, USDA-ARS Plant Genetics Research Unit and Division of Plant Sciences, Department of Agronomy, University of Missouri, Columbia, MO 65211, USA, FlyBase, Department of Genetics, University of Cambridge, Downing Street, Cambridge CB2 3EH, UK, Rat Genome Database, Human and Molecular Genetics Center, Medical College of Wisconsin, 8701 Watertown Plank Road, Milwaukee, WI 53226, USA, TAIR, Department of Plant Biology, Carnegie Institution for Science, 260 Panama Street, Stanford, CA 94305, USA, Center for Bioinformatics and Computational Biology, University of Delaware, 15 Innovation Way, Newark, DE 19711, USA, Howard Hughes Medical Institute, California Institute of Technology, 1200 E. California Blvd., Pasadena, CA 91125, USA, National Center for Biotechnology Information (NCBI), 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - G Thomas Hayman
- WormBase, Division of Biology, California Institute of Technology, 1200 E. California Blvd., Pasadena, CA 91125, USA, USDA-ARS Plant Genetics Research Unit and Division of Plant Sciences, Department of Agronomy, University of Missouri, Columbia, MO 65211, USA, FlyBase, Department of Genetics, University of Cambridge, Downing Street, Cambridge CB2 3EH, UK, Rat Genome Database, Human and Molecular Genetics Center, Medical College of Wisconsin, 8701 Watertown Plank Road, Milwaukee, WI 53226, USA, TAIR, Department of Plant Biology, Carnegie Institution for Science, 260 Panama Street, Stanford, CA 94305, USA, Center for Bioinformatics and Computational Biology, University of Delaware, 15 Innovation Way, Newark, DE 19711, USA, Howard Hughes Medical Institute, California Institute of Technology, 1200 E. California Blvd., Pasadena, CA 91125, USA, National Center for Biotechnology Information (NCBI), 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Susan Tweedie
- WormBase, Division of Biology, California Institute of Technology, 1200 E. California Blvd., Pasadena, CA 91125, USA, USDA-ARS Plant Genetics Research Unit and Division of Plant Sciences, Department of Agronomy, University of Missouri, Columbia, MO 65211, USA, FlyBase, Department of Genetics, University of Cambridge, Downing Street, Cambridge CB2 3EH, UK, Rat Genome Database, Human and Molecular Genetics Center, Medical College of Wisconsin, 8701 Watertown Plank Road, Milwaukee, WI 53226, USA, TAIR, Department of Plant Biology, Carnegie Institution for Science, 260 Panama Street, Stanford, CA 94305, USA, Center for Bioinformatics and Computational Biology, University of Delaware, 15 Innovation Way, Newark, DE 19711, USA, Howard Hughes Medical Institute, California Institute of Technology, 1200 E. California Blvd., Pasadena, CA 91125, USA, National Center for Biotechnology Information (NCBI), 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Cecilia N Arighi
- WormBase, Division of Biology, California Institute of Technology, 1200 E. California Blvd., Pasadena, CA 91125, USA, USDA-ARS Plant Genetics Research Unit and Division of Plant Sciences, Department of Agronomy, University of Missouri, Columbia, MO 65211, USA, FlyBase, Department of Genetics, University of Cambridge, Downing Street, Cambridge CB2 3EH, UK, Rat Genome Database, Human and Molecular Genetics Center, Medical College of Wisconsin, 8701 Watertown Plank Road, Milwaukee, WI 53226, USA, TAIR, Department of Plant Biology, Carnegie Institution for Science, 260 Panama Street, Stanford, CA 94305, USA, Center for Bioinformatics and Computational Biology, University of Delaware, 15 Innovation Way, Newark, DE 19711, USA, Howard Hughes Medical Institute, California Institute of Technology, 1200 E. California Blvd., Pasadena, CA 91125, USA, National Center for Biotechnology Information (NCBI), 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - James Done
- WormBase, Division of Biology, California Institute of Technology, 1200 E. California Blvd., Pasadena, CA 91125, USA, USDA-ARS Plant Genetics Research Unit and Division of Plant Sciences, Department of Agronomy, University of Missouri, Columbia, MO 65211, USA, FlyBase, Department of Genetics, University of Cambridge, Downing Street, Cambridge CB2 3EH, UK, Rat Genome Database, Human and Molecular Genetics Center, Medical College of Wisconsin, 8701 Watertown Plank Road, Milwaukee, WI 53226, USA, TAIR, Department of Plant Biology, Carnegie Institution for Science, 260 Panama Street, Stanford, CA 94305, USA, Center for Bioinformatics and Computational Biology, University of Delaware, 15 Innovation Way, Newark, DE 19711, USA, Howard Hughes Medical Institute, California Institute of Technology, 1200 E. California Blvd., Pasadena, CA 91125, USA, National Center for Biotechnology Information (NCBI), 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Hans-Michael Müller
- WormBase, Division of Biology, California Institute of Technology, 1200 E. California Blvd., Pasadena, CA 91125, USA, USDA-ARS Plant Genetics Research Unit and Division of Plant Sciences, Department of Agronomy, University of Missouri, Columbia, MO 65211, USA, FlyBase, Department of Genetics, University of Cambridge, Downing Street, Cambridge CB2 3EH, UK, Rat Genome Database, Human and Molecular Genetics Center, Medical College of Wisconsin, 8701 Watertown Plank Road, Milwaukee, WI 53226, USA, TAIR, Department of Plant Biology, Carnegie Institution for Science, 260 Panama Street, Stanford, CA 94305, USA, Center for Bioinformatics and Computational Biology, University of Delaware, 15 Innovation Way, Newark, DE 19711, USA, Howard Hughes Medical Institute, California Institute of Technology, 1200 E. California Blvd., Pasadena, CA 91125, USA, National Center for Biotechnology Information (NCBI), 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Paul W Sternberg
- WormBase, Division of Biology, California Institute of Technology, 1200 E. California Blvd., Pasadena, CA 91125, USA, USDA-ARS Plant Genetics Research Unit and Division of Plant Sciences, Department of Agronomy, University of Missouri, Columbia, MO 65211, USA, FlyBase, Department of Genetics, University of Cambridge, Downing Street, Cambridge CB2 3EH, UK, Rat Genome Database, Human and Molecular Genetics Center, Medical College of Wisconsin, 8701 Watertown Plank Road, Milwaukee, WI 53226, USA, TAIR, Department of Plant Biology, Carnegie Institution for Science, 260 Panama Street, Stanford, CA 94305, USA, Center for Bioinformatics and Computational Biology, University of Delaware, 15 Innovation Way, Newark, DE 19711, USA, Howard Hughes Medical Institute, California Institute of Technology, 1200 E. California Blvd., Pasadena, CA 91125, USA, National Center for Biotechnology Information (NCBI), 8600 Rockville Pike, Bethesda, MD 20894, USAWormBase, Division of Biology, California Institute of Technology, 1200 E. California Blvd., Pasadena, CA 91125, USA, USDA-ARS Plant Genetics Research Unit and Division of Plant Sciences, Department of Agronomy, University of Missouri, Columbia, MO 65211, USA, FlyBase, Department of Genetics, University of Cambridge, Downing Street, Cambridge CB2 3EH, UK, Rat Genome Database, Human and Molecular Genetics Center, Medical College of Wisconsin, 8701 Watertown Plank Road, Milwaukee, WI 53226, USA, TAIR, Department of Plant Biology, Carnegie Institution for Science, 260 Panama Street, Stanford, CA 94305, USA, Center for Bioinformatics and Computational Biology, University of Delaware, 15 Innovation Way, Newark, DE 19711, USA, Howard Hughes Medical Institute, California Institute of Technology, 1200 E. California Blvd., Pasadena, CA 91125, USA, National Center for Biotechnology Information (NCBI), 8600 Rockville Pike, Bethesda, MD 20894, USA
- Yuqing Mao
- WormBase, Division of Biology, California Institute of Technology, 1200 E. California Blvd., Pasadena, CA 91125, USA, USDA-ARS Plant Genetics Research Unit and Division of Plant Sciences, Department of Agronomy, University of Missouri, Columbia, MO 65211, USA, FlyBase, Department of Genetics, University of Cambridge, Downing Street, Cambridge CB2 3EH, UK, Rat Genome Database, Human and Molecular Genetics Center, Medical College of Wisconsin, 8701 Watertown Plank Road, Milwaukee, WI 53226, USA, TAIR, Department of Plant Biology, Carnegie Institution for Science, 260 Panama Street, Stanford, CA 94305, USA, Center for Bioinformatics and Computational Biology, University of Delaware, 15 Innovation Way, Newark, DE 19711, USA, Howard Hughes Medical Institute, California Institute of Technology, 1200 E. California Blvd., Pasadena, CA 91125, USA, National Center for Biotechnology Information (NCBI), 8600 Rockville Pike, Bethesda, MD 20894, USA
- Chih-Hsuan Wei
- WormBase, Division of Biology, California Institute of Technology, 1200 E. California Blvd., Pasadena, CA 91125, USA, USDA-ARS Plant Genetics Research Unit and Division of Plant Sciences, Department of Agronomy, University of Missouri, Columbia, MO 65211, USA, FlyBase, Department of Genetics, University of Cambridge, Downing Street, Cambridge CB2 3EH, UK, Rat Genome Database, Human and Molecular Genetics Center, Medical College of Wisconsin, 8701 Watertown Plank Road, Milwaukee, WI 53226, USA, TAIR, Department of Plant Biology, Carnegie Institution for Science, 260 Panama Street, Stanford, CA 94305, USA, Center for Bioinformatics and Computational Biology, University of Delaware, 15 Innovation Way, Newark, DE 19711, USA, Howard Hughes Medical Institute, California Institute of Technology, 1200 E. California Blvd., Pasadena, CA 91125, USA, National Center for Biotechnology Information (NCBI), 8600 Rockville Pike, Bethesda, MD 20894, USA
- Zhiyong Lu
- WormBase, Division of Biology, California Institute of Technology, 1200 E. California Blvd., Pasadena, CA 91125, USA, USDA-ARS Plant Genetics Research Unit and Division of Plant Sciences, Department of Agronomy, University of Missouri, Columbia, MO 65211, USA, FlyBase, Department of Genetics, University of Cambridge, Downing Street, Cambridge CB2 3EH, UK, Rat Genome Database, Human and Molecular Genetics Center, Medical College of Wisconsin, 8701 Watertown Plank Road, Milwaukee, WI 53226, USA, TAIR, Department of Plant Biology, Carnegie Institution for Science, 260 Panama Street, Stanford, CA 94305, USA, Center for Bioinformatics and Computational Biology, University of Delaware, 15 Innovation Way, Newark, DE 19711, USA, Howard Hughes Medical Institute, California Institute of Technology, 1200 E. California Blvd., Pasadena, CA 91125, USA, National Center for Biotechnology Information (NCBI), 8600 Rockville Pike, Bethesda, MD 20894, USA
|
42
|
Arighi CN, Wu CH, Cohen KB, Hirschman L, Krallinger M, Valencia A, Lu Z, Wilbur JW, Wiegers TC. BioCreative-IV virtual issue. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2014; 2014:bau039. [PMID: 24852177 PMCID: PMC4030502 DOI: 10.1093/database/bau039] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.5] [Indexed: 11/14/2022]
Affiliation(s)
- Cecilia N Arighi
- Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, USA
- Cathy H Wu
- Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, USA
- Kevin B Cohen
- Center for Computational Pharmacology, University of Colorado Denver School of Medicine, Aurora, CO, USA
- Martin Krallinger
- Structural and Computational Biology Group, Spanish National Cancer Research Centre, Madrid, Spain
- Alfonso Valencia
- Structural and Computational Biology Group, Spanish National Cancer Research Centre, Madrid, Spain
- Zhiyong Lu
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD, USA
- John W Wilbur
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD, USA
- Thomas C Wiegers
- Department of Biological Sciences, North Carolina State University, Raleigh, NC, USA
|
43
|
Lourenço A, Coenye T, Goeres DM, Donelli G, Azevedo AS, Ceri H, Coelho FL, Flemming HC, Juhna T, Lopes SP, Oliveira R, Oliver A, Shirtliff ME, Sousa AM, Stoodley P, Pereira MO, Azevedo NF. Minimum information about a biofilm experiment (MIABiE): standards for reporting experiments and data on sessile microbial communities living at interfaces. Pathog Dis 2014; 70:250-6. [PMID: 24478124 DOI: 10.1111/2049-632x.12146] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.5] [Received: 11/27/2013] [Revised: 01/15/2014] [Accepted: 01/15/2014] [Indexed: 02/04/2023]
Abstract
The minimum information about a biofilm experiment (MIABiE) initiative has arisen from the need to find an adequate and scientifically sound way to control the quality of the documentation accompanying the public deposition of biofilm-related data, particularly those obtained using high-throughput devices and techniques. To this end, the MIABiE consortium has initiated the identification and organization of a set of modules containing the minimum information that needs to be reported to guarantee the interpretability and independent verification of experimental results and their integration with knowledge coming from other fields. MIABiE does not intend to propose specific standards on how biofilm experiments should be performed, because it is acknowledged that specific research questions require specific conditions which may deviate from any standardization. Instead, MIABiE presents guidelines about the data to be recorded and published in order for the procedure and results to be easily and unequivocally interpreted and reproduced. Overall, MIABiE opens up the discussion about a number of particular areas of interest and attempts to achieve a broad consensus about which biofilm data and metadata should be reported in scientific journals in a systematic, rigorous and understandable manner.
Affiliation(s)
- Anália Lourenço
- Departamento de Informática, Universidade de Vigo, ESEI - Escuela Superior de Ingeniería Informática, Ourense, Spain; IBB - Institute for Biotechnology and Bioengineering, Centre of Biological Engineering, University of Minho, Braga, Portugal
|
44
|
Khare R, Leaman R, Lu Z. Accessing biomedical literature in the current information landscape. Methods Mol Biol 2014; 1159:11-31. [PMID: 24788259 PMCID: PMC4593617 DOI: 10.1007/978-1-4939-0709-0_2] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.5] [Indexed: 01/22/2023]
Abstract
Biomedical and life sciences literature is unique because of its exponentially increasing volume and interdisciplinary nature. Biomedical literature access is essential for several types of users including biomedical researchers, clinicians, database curators, and bibliometricians. In the past few decades, several online search tools and literature archives, generic as well as biomedicine specific, have been developed. We present this chapter in the light of three consecutive steps of literature access: searching for citations, retrieving full text, and viewing the article. The first section presents the current state of practice of biomedical literature access, including an analysis of the most frequently used search tools (PubMed, Google Scholar, Web of Science, Scopus, and Embase) and a study on biomedical literature archives such as PubMed Central. The next section describes current research and the state-of-the-art systems motivated by the challenges a user faces during query formulation and interpretation of search results. The research solutions are classified into five key areas related to text and data mining: text similarity search, semantic search, query support, relevance ranking, and clustering results. Finally, the last section describes some predicted future trends for improving biomedical literature access, such as searching and reading articles on portable devices, and adoption of the open access policy.
Affiliation(s)
- Ritu Khare
- National Center for Biotechnology Information, U.S. National Library of Medicine, NIH, Blg 38 A, Rm 1003B, 8600 Rockville Pike, Bethesda, MD 20894
- Robert Leaman
- National Center for Biotechnology Information, U.S. National Library of Medicine, NIH, Blg 38 A, Rm 1003E, 8600 Rockville Pike, Bethesda, MD 20894
- Zhiyong Lu
- National Center for Biotechnology Information, U.S. National Library of Medicine, NIH, Blg 38 A, Rm 1003A, 8600 Rockville Pike, Bethesda, MD 20894
|
45
|
Hsiao JC, Wei CH, Kao HY. Gene Name Disambiguation Using Multi-Scope Species Detection. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2014; 11:55-62. [PMID: 26355507 DOI: 10.1109/tcbb.2013.139] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/05/2023]
Abstract
Species detection is an important topic in the text mining field. Given the importance of individual research topics (e.g., species assignment to genes and document focus species detection), some studies are dedicated to a single topic. However, no researcher to date has discussed species detection as a general problem. Therefore, we developed a multi-scope species detection model to identify the focus species for different scopes (i.e., gene mention, sentence, paragraph, and global scope of the entire article). Species assignment is one of the bottlenecks of gene name disambiguation. In our evaluation, recognizing the focus species of a gene mention in four different scopes improved gene name disambiguation. We used the species cue words extracted from articles to estimate the relevance between an article and a species. The relevance score was calculated by our proposed entities frequency-augmented invert species frequency (EF-AISF) formula, which represents the importance of an entity to a species. We also defined a relation guide factor (RGF) to normalize the relevance score. Our method not only achieved better performance than previous methods but can also handle articles that do not specifically mention a species. In the DECA corpus, we outperformed previous studies and obtained an accuracy of 88.22 percent.
|
46
|
Gour P, Garg P, Jain R, Joseph SV, Tyagi AK, Raghuvanshi S. Manually curated database of rice proteins. Nucleic Acids Res 2013; 42:D1214-21. [PMID: 24214963 PMCID: PMC3964970 DOI: 10.1093/nar/gkt1072] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.5] [Indexed: 11/13/2022]
Abstract
'Manually Curated Database of Rice Proteins' (MCDRP), available at http://www.genomeindia.org/biocuration, is a unique curated database based on published experimental data. Semantic integration of scientific data is essential to gain a higher level of understanding of biological systems. Since the majority of scientific data is available as published literature, text mining is an essential step before the data can be integrated and made available for computer-based search in various databases. However, text mining is a tedious exercise, and thus there is a large gap between the data available in curated databases and in the published literature. Moreover, data in an experiment can be perceived from several perspectives, which may not be reflected in text-based curation. In order to address such issues, we have demonstrated the feasibility of digitizing the experimental data itself by creating a database on rice proteins based on in-house developed data curation models. Using these models, data of individual experiments have been digitized with the help of universal ontologies. Currently, the database has data for over 1800 rice proteins curated from >4000 different experiments of over 400 research articles. Since every aspect of the experiment such as gene name, plant type, tissue and developmental stage has been digitized, experimental data can be rapidly accessed and integrated.
Affiliation(s)
- Pratibha Gour
- Department of Plant Molecular Biology, University of Delhi South Campus, Benito Juarez Road, New Delhi - 110021, India
|
47
|
Comeau DC, Islamaj Doğan R, Ciccarese P, Cohen KB, Krallinger M, Leitner F, Lu Z, Peng Y, Rinaldi F, Torii M, Valencia A, Verspoor K, Wiegers TC, Wu CH, Wilbur WJ. BioC: a minimalist approach to interoperability for biomedical text processing. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2013; 2013:bat064. [PMID: 24048470 PMCID: PMC3889917 DOI: 10.1093/database/bat064] [Citation(s) in RCA: 108] [Impact Index Per Article: 9.0] [Indexed: 12/31/2022]
Abstract
A vast amount of scientific information is encoded in natural language text, and the quantity of such text has become so great that it is no longer economically feasible to have a human as the first step in the search process. Natural language processing and text mining tools have become essential to facilitate the search for and extraction of information from text. This has led to vigorous research efforts to create useful tools and to create humanly labeled text corpora, which can be used to improve such tools. To encourage combining these efforts into larger, more powerful and more capable systems, a common interchange format to represent, store and exchange the data in a simple manner between different language processing systems and text mining tools is highly desirable. Here we propose a simple extensible mark-up language format to share text documents and annotations. The proposed annotation approach allows a large number of different annotations to be represented including sentences, tokens, parts of speech, named entities such as genes or diseases and relationships between named entities. In addition, we provide simple code to hold this data, read it from and write it back to extensible mark-up language files and perform some sample processing. We also describe completed as well as ongoing work to apply the approach in several directions. Code and data are available at http://bioc.sourceforge.net/. Database URL: http://bioc.sourceforge.net/
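The kind of document-plus-annotation record this abstract describes can be illustrated with a short sketch. The Python below builds a small XML document in the style of BioC using only the standard library. The element names (`collection`, `source`, `date`, `key`, `document`, `passage`, `annotation`, `infon`, `location`) reflect the BioC DTD as it is commonly documented and should be checked against the distribution at the URL above; the function names, the metadata values, and the example sentence are hypothetical and not taken from the paper.

```python
import xml.etree.ElementTree as ET


def build_bioc_collection(doc_id, text, annotations):
    """Build a BioC-style collection holding one document with one passage.

    `annotations` is a list of (offset, length, type) tuples that refer to
    character positions in `text`. This helper is illustrative only.
    """
    collection = ET.Element("collection")
    # Collection-level metadata; the values here are placeholders.
    ET.SubElement(collection, "source").text = "example"
    ET.SubElement(collection, "date").text = "20130101"
    ET.SubElement(collection, "key").text = "example.key"

    document = ET.SubElement(collection, "document")
    ET.SubElement(document, "id").text = doc_id

    passage = ET.SubElement(document, "passage")
    ET.SubElement(passage, "offset").text = "0"
    ET.SubElement(passage, "text").text = text

    for index, (offset, length, ann_type) in enumerate(annotations):
        annotation = ET.SubElement(passage, "annotation", id=str(index))
        # An infon is a key/value pair; here it records the entity type.
        ET.SubElement(annotation, "infon", key="type").text = ann_type
        ET.SubElement(
            annotation, "location", offset=str(offset), length=str(length)
        )
        ET.SubElement(annotation, "text").text = text[offset:offset + length]
    return collection


def annotation_texts(collection):
    """Return the annotated strings in document order."""
    return [a.findtext("text") for a in collection.iter("annotation")]


if __name__ == "__main__":
    sentence = "TP53 regulates CDKN1A."
    col = build_bioc_collection("doc1", sentence, [(0, 4, "gene"), (15, 6, "gene")])
    print(ET.tostring(col, encoding="unicode"))
    print(annotation_texts(col))
```

Storing offsets alongside the annotated text lets a consumer verify that an annotation still lines up with the passage after the file has been read back.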
Affiliation(s)
- Donald C Comeau
- National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894, USA, Department of Neurology, Massachusetts General Hospital, Boston, MA 02114, Harvard Medical School, Harvard University, Boston, MA 02115 USA, Center for Computational Pharmacology, University of Colorado Denver School of Medicine, Aurora, CO 80045, USA, Structural and Computational Biology Group, Spanish National Cancer Research Centre, Madrid E-28029, Spain, Center for Bioinformatics and Computational Biology, Department of Computer and Information Sciences, University of Delaware, Newark, DE 19711, USA, Institute of Computational Linguistics, University of Zurich, Zurich 8050, Switzerland, National ICT Australia (NICTA), Victoria Research Laboratory, The University of Melbourne, Parkville VIC 3010, Australia and Department of Biology, North Carolina State University, Raleigh, NC 27695, USA
|
48
|
Gobeill J, Pasche E, Vishnyakova D, Ruch P. Managing the data deluge: data-driven GO category assignment improves while complexity of functional annotation increases. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2013; 2013:bat041. [PMID: 23842461 PMCID: PMC3706742 DOI: 10.1093/database/bat041] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Indexed: 11/12/2022]
Abstract
The available curated data lag behind current biological knowledge contained in the literature. Text mining can assist biologists and curators to locate and access this knowledge, for instance by characterizing the functional profile of publications. Gene Ontology (GO) category assignment in free text already supports various applications, such as powering ontology-based search engines, finding curation-relevant articles (triage) or helping the curator to identify and encode functions. Popular text mining tools for GO classification are based on so-called thesaurus-based (or dictionary-based) approaches, which exploit similarities between the input text and GO terms themselves. But their effectiveness remains limited owing to the complex nature of GO terms, which rarely occur in text. In contrast, machine learning approaches exploit similarities between the input text and already curated instances contained in a knowledge base to infer a functional profile. GO Annotations (GOA) and MEDLINE make it possible to exploit a growing amount of curated abstracts (97 000 in November 2012) for populating this knowledge base. Our study compares a state-of-the-art thesaurus-based system with a machine learning system (based on a k-Nearest Neighbours algorithm) for the task of proposing a functional profile for unseen MEDLINE abstracts, and shows how resources and performances have evolved. Systems are evaluated on their ability to propose for a given abstract the GO terms (2.8 on average) used for curation in GOA. We show that since 2006, although a massive effort was put into adding synonyms in GO (+300%), the effectiveness of our thesaurus-based system has remained rather constant, moving from 0.28 to 0.31 for Recall at 20 (R20). In contrast, thanks to its knowledge base growth, our machine learning system has steadily improved, rising from 0.38 in 2006 to 0.56 for R20 in 2012. Integrated in semi-automatic workflows or in fully automatic pipelines, such systems are increasingly effective at providing assistance to biologists. DATABASE URL: http://eagl.unige.ch/GOCat/
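The Recall at 20 (R20) figures quoted in this abstract can be made concrete with a minimal sketch, assuming the usual definition of recall at k: the fraction of an abstract's curated GO terms that appear among the system's top k proposals. The abstract does not state how scores are averaged across abstracts, so the per-abstract mean below is an assumption, and the function names and the example GO identifiers are chosen for illustration only.

```python
from typing import Iterable, Sequence, Set, Tuple


def recall_at_k(ranked_terms: Sequence[str], gold_terms: Set[str], k: int = 20) -> float:
    """Fraction of gold terms found among the top-k ranked proposals."""
    if not gold_terms:
        raise ValueError("gold_terms must not be empty")
    top_k = set(ranked_terms[:k])
    return len(top_k & gold_terms) / len(gold_terms)


def mean_recall_at_k(
    runs: Iterable[Tuple[Sequence[str], Set[str]]], k: int = 20
) -> float:
    """Average recall at k over (ranked proposals, gold terms) pairs.

    Averaging per abstract is an assumption, not the paper's stated method.
    """
    scores = [recall_at_k(ranked, gold, k) for ranked, gold in runs]
    if not scores:
        raise ValueError("runs must not be empty")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical system output for one abstract, best proposal first.
    ranked = ["GO:0006355", "GO:0003700", "GO:0005634"]
    # Hypothetical curated terms for the same abstract.
    gold = {"GO:0006355", "GO:0005515"}
    print(recall_at_k(ranked, gold, k=20))
```

In this example one of the two curated terms is among the proposals, so the function returns 0.5.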
Affiliation(s)
- Julien Gobeill
- Library and Information Sciences, University of Applied Sciences - HEG, CH-1227 Geneva, Switzerland.
|
49
|
Jimeno-Yepes AJ, Sticco JC, Mork JG, Aronson AR. GeneRIF indexing: sentence selection based on machine learning. BMC Bioinformatics 2013; 14:171. [PMID: 23725347 PMCID: PMC3687823 DOI: 10.1186/1471-2105-14-171] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.3] [Received: 10/24/2012] [Accepted: 05/22/2013] [Indexed: 11/16/2022]
Abstract
BACKGROUND: A Gene Reference Into Function (GeneRIF) describes novel functionality of genes. GeneRIFs are available from the National Center for Biotechnology Information (NCBI) Gene database. GeneRIF indexing is performed manually, and the intention of our work is to provide methods to support creating the GeneRIF entries. The creation of GeneRIF entries involves the identification of the genes mentioned in MEDLINE® citations and the sentences describing a novel function. RESULTS: We have compared several learning algorithms and several features extracted or derived from MEDLINE sentences to determine if a sentence should be selected for GeneRIF indexing. Features are derived from the sentences or using mechanisms to augment the information provided by them: assigning a discourse label using a previously trained model, for example. We show that machine learning approaches with specific feature combinations achieve results close to one of the annotators. We have evaluated different feature sets and learning algorithms. In particular, Naïve Bayes achieves better performance with a selection of features similar to one used in related work, which considers the location of the sentence, the discourse of the sentence and the functional terminology in it. CONCLUSIONS: The current performance is at a level similar to human annotation and it shows that machine learning can be used to automate the task of sentence selection for GeneRIF annotation. The current experiments are limited to the human species. We would like to see how the methodology can be extended to other species, specifically the normalization of gene mentions in other species.
Affiliation(s)
- Antonio J Jimeno-Yepes
- National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894, USA
- NICTA Victoria Research Lab, Melbourne VIC 3010, Australia
- J Caitlin Sticco
- National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894, USA
- James G Mork
- National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Alan R Aronson
- National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894, USA
|
50
|
Wei CH, Kao HY, Lu Z. PubTator: a web-based text mining tool for assisting biocuration. Nucleic Acids Res 2013; 41:W518-22. [PMID: 23703206 PMCID: PMC3692066 DOI: 10.1093/nar/gkt441] [Citation(s) in RCA: 333] [Impact Index Per Article: 27.8] [Indexed: 11/13/2022]
Abstract
Manually curating knowledge from biomedical literature into structured databases is highly expensive and time-consuming, making it difficult to keep pace with the rapid growth of the literature. There is therefore a pressing need to assist biocuration with automated text mining tools. Here, we describe PubTator, a web-based system for assisting biocuration. PubTator is different from the few existing tools by featuring a PubMed-like interface, which many biocurators find familiar, and being equipped with multiple challenge-winning text mining algorithms to ensure the quality of its automatic results. Through a formal evaluation with two external user groups, PubTator was shown to be capable of improving both the efficiency and accuracy of manual curation. PubTator is publicly available at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/.
Affiliation(s)
- Chih-Hsuan Wei
- National Center for Biotechnology Information, US National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894, USA
|