1. Zhang L, Starr J, Ford B, Reznicek A, Zhou Y, Léveillé-Bourret É, Lacroix-Carignan É, Cayouette J, Smith TW, Sutherland D, Catling P, Saarela JM, Cui H, Macklin J. Helping authors produce FAIR taxonomic data: evaluation of an author-driven phenotype data production prototype. Database (Oxford) 2025; 2025:baae097. [PMID: 39879563 PMCID: PMC11928229 DOI: 10.1093/database/baae097] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2024] [Revised: 08/01/2024] [Accepted: 08/28/2024] [Indexed: 01/31/2025]
Abstract
It is well known that the use of vocabulary in phenotype treatments is often inconsistent. An earlier survey of biologists who create or use phenotypic characters revealed that this lack of standardization leads to ambiguities, frustrating both the consumers and producers of phenotypic data. Such ambiguities are challenging for biologists, and more so for Artificial Intelligence, to resolve. That survey also indicated a strong interest in a new authoring workflow supported by ontologies to ensure published phenotype data are FAIR (Findable, Accessible, Interoperable, and Reusable) and suitable for large-scale computational analyses. In this article, we introduce a prototype software system designed for authors to produce computational phenotype data. This platform includes a web-based, ontology-enhanced editor for taxonomic characters (Character Recorder), an Ontology Backend holding standardized vocabulary (the Carex Ontology), and a mobile application for resolving ontological conflicts (Conflict Resolver). We present two formal user evaluations of Character Recorder, the main interface authors would interact with to produce FAIR data. The evaluations were conducted with undergraduate biology students and Carex experts. We evaluated Character Recorder against Microsoft Excel on their effectiveness, efficiency, and the cognitive demands of the users in producing computable taxon-by-character matrices. The evaluations showed that Character Recorder is quickly learnable for both student and professional participants, with its cognitive demand comparable to Excel's. Participants agreed that the quality of the data Character Recorder yielded was superior. Students praised Character Recorder's educational value, while Carex experts were keen to recommend it and help evolve it from a prototype into a comprehensive tool. Feature improvements recommended by expert participants have been implemented after the evaluation.
Affiliation(s)
- Limin Zhang
  - School of Information, University of Arizona, 1103 E. 2nd Street, Tucson, AZ 85719, USA
  - School of Fine Arts, Huaiyin Normal University, 71 Jiaotong Road, Huaian, Jiangsu 223001, China
- Julian Starr
  - Department of Biology, University of Ottawa, 30 Marie Curie, Ottawa, ON K1N 6N5, Canada
- Bruce Ford
  - Department of Biological Sciences, University of Manitoba, 50 Sifton Road, Winnipeg, MB R3T 2N2, Canada
- Anton Reznicek
  - University Herbarium, University of Michigan, 3600 Varsity Drive, Ann Arbor, MI 48108, USA
- Yuxuan Zhou
  - School of Information, University of Arizona, 1103 E. 2nd Street, Tucson, AZ 85719, USA
- Étienne Léveillé-Bourret
  - Department of Biological Sciences, Université de Montréal, 1375 Avenue Thérèse-Lavoie-Roux, Montréal, QC H3A 2B3, Canada
- Étienne Lacroix-Carignan
  - Department of Biological Sciences, Université de Montréal, 1375 Avenue Thérèse-Lavoie-Roux, Montréal, QC H3A 2B3, Canada
- Jacques Cayouette
  - Research and Development Centre, Agriculture and Agri-Food Canada, 960 Carling Avenue, Ottawa, ON K1A 0C6, Canada
- Tyler W Smith
  - Research and Development Centre, Agriculture and Agri-Food Canada, 960 Carling Avenue, Ottawa, ON K1A 0C6, Canada
- Donald Sutherland
  - Natural Heritage Information Centre, Ontario Ministry of Natural Resources, P.O. Box 7000, Peterborough, Ontario K9J 8M5, Canada
- Paul Catling
  - Research and Development Centre, Agriculture and Agri-Food Canada, 960 Carling Avenue, Ottawa, ON K1A 0C6, Canada
- Jeffery M Saarela
  - Research and Collections, Canadian Museum of Nature, 240 McLeod St, Ottawa, ON K1P 6P4, Canada
- Hong Cui
  - School of Information, University of Arizona, 1103 E. 2nd Street, Tucson, AZ 85719, USA
- James Macklin
  - Research and Development Centre, Agriculture and Agri-Food Canada, 960 Carling Avenue, Ottawa, ON K1A 0C6, Canada
2. Kim S, Yu B, Li Q, Bolton EE. PubChem synonym filtering process using crowdsourcing. J Cheminform 2024; 16:69. [PMID: 38880887 PMCID: PMC11181558 DOI: 10.1186/s13321-024-00868-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2024] [Accepted: 06/09/2024] [Indexed: 06/18/2024]
Abstract
PubChem (https://pubchem.ncbi.nlm.nih.gov) is a public chemical information resource containing more than 100 million unique chemical structures. One of the most requested tasks in PubChem and other chemical databases is to search chemicals by name (also commonly called a "chemical synonym"). PubChem performs this task by looking up chemical synonym-structure associations provided by individual depositors to PubChem. In addition, these synonyms are used for many purposes, including creating links between chemicals and PubMed articles (using Medical Subject Headings (MeSH) terms). However, these depositor-provided name-structure associations are subject to substantial discrepancies within and between depositors, making it difficult to unambiguously map a chemical name to a specific chemical structure. The present paper describes PubChem's crowdsourcing-based synonym filtering strategy, which resolves inter- and intra-depositor discrepancies in synonym-structure associations as well as in the chemical-MeSH associations. The PubChem synonym filtering process was developed based on the analysis of four crowd-voting strategies, which differ in the consistency threshold value employed (60% vs. 70%) and how to resolve intra-depositor discrepancies (a single vote vs. multiple votes per depositor) prior to inter-depositor crowd-voting. The agreement of voting was determined at six levels of chemical equivalency, which considers varying isotopic composition, stereochemistry, and connectivity of chemical structures and their primary components. While all four strategies showed comparable results, Strategy I (one vote per depositor with a 60% consistency threshold) resulted in the most synonyms assigned to a single chemical structure as well as the most synonym-structure associations disambiguated at the six chemical equivalency contexts.
Based on the results of this study, Strategy I was implemented in PubChem's filtering process that cleans up synonym-structure associations as well as chemical-MeSH associations. This consistency-based filtering process is designed to look for a consensus in name-structure associations but cannot attest to their correctness. As a result, it can fail to recognize correct name-structure associations (or incorrect ones), for example, when a synonym is provided by only one depositor or when many contributors are incorrect. However, this filtering process is an important starting point for quality control in name-structure associations in large chemical databases like PubChem.
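The consensus idea behind Strategy I (one vote per depositor, 60% consistency threshold) can be illustrated with a minimal Python sketch. It is not PubChem's implementation: the function name `consensus_synonyms`, the record layout, the tie rule inside a single depositor, the "at least 60%" comparison, and the depositor names and structure IDs are all assumptions made for illustration, and the six chemical-equivalency levels are not modeled.

```python
from collections import Counter, defaultdict


def consensus_synonyms(associations, threshold=0.6):
    """Map each synonym to one structure ID when depositors agree strongly enough.

    associations: iterable of (depositor, synonym, structure_id) tuples.
    Returns {synonym: structure_id} for synonyms that pass the threshold.
    """
    # Group the structures each depositor supplied for each synonym.
    per_depositor = defaultdict(Counter)
    for depositor, synonym, structure_id in associations:
        per_depositor[(synonym, depositor)][structure_id] += 1

    # Intra-depositor step (assumed rule): a depositor casts one vote, for the
    # structure it most often attached to the synonym; an exact tie casts no vote.
    votes = defaultdict(Counter)
    for (synonym, _depositor), counts in per_depositor.items():
        ranked = counts.most_common(2)
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            continue
        votes[synonym][ranked[0][0]] += 1

    # Inter-depositor step: keep the synonym only if the leading structure
    # holds at least `threshold` of the votes cast (assumed ">=" comparison).
    accepted = {}
    for synonym, counts in votes.items():
        structure_id, top = counts.most_common(1)[0]
        if top / sum(counts.values()) >= threshold:
            accepted[synonym] = structure_id
    return accepted


# Made-up depositors and structure IDs, for illustration only.
records = [
    ("dep_a", "aspirin", 101),
    ("dep_b", "aspirin", 101),
    ("dep_c", "aspirin", 101),
    ("dep_d", "aspirin", 999),
    ("dep_a", "salicylate", 201),
    ("dep_b", "salicylate", 202),
]
result = consensus_synonyms(records)
print(result)
```

In this toy input, three of four depositors agree on "aspirin" (75%), so it is kept, while "salicylate" is split evenly and is filtered out. As the abstract notes, such a filter measures agreement, not correctness: a synonym supplied by a single depositor passes trivially.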
Affiliation(s)
- Sunghwan Kim
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Bo Yu
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Qingliang Li
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Evan E Bolton
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
3. Dessimoz C, Thomas PD. AI and the democratization of knowledge. Sci Data 2024; 11:268. [PMID: 38443367 PMCID: PMC10915151 DOI: 10.1038/s41597-024-03099-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/11/2023] [Accepted: 02/28/2024] [Indexed: 03/07/2024]
Affiliation(s)
- Christophe Dessimoz
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
  - Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
- Paul D Thomas
  - Department of Population and Public Health Sciences, University of Southern California, Los Angeles, USA
4. Boosting biomedical document classification through the use of domain entity recognizers and semantic ontologies for document representation: The case of gluten bibliome. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2021.10.100] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2022]
5. Ramsey J, McIntosh B, Renfro D, Aleksander SA, LaBonte S, Ross C, Zweifel AE, Liles N, Farrar S, Gill JJ, Erill I, Ades S, Berardini TZ, Bennett JA, Brady S, Britton R, Carbon S, Caruso SM, Clements D, Dalia R, Defelice M, Doyle EL, Friedberg I, Gurney SMR, Hughes L, Johnson A, Kowalski JM, Li D, Lovering RC, Mans TL, McCarthy F, Moore SD, Murphy R, Paustian TD, Perdue S, Peterson CN, Prüß BM, Saha MS, Sheehy RR, Tansey JT, Temple L, Thorman AW, Trevino S, Vollmer AC, Walbot V, Willey J, Siegele DA, Hu JC. Crowdsourcing biocuration: The Community Assessment of Community Annotation with Ontologies (CACAO). PLoS Comput Biol 2021; 17:e1009463. [PMID: 34710081 PMCID: PMC8553046 DOI: 10.1371/journal.pcbi.1009463] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 11/19/2022]
Abstract
Experimental data about gene functions curated from the primary literature have enormous value for research scientists in understanding biology. Using the Gene Ontology (GO), manual curation by experts has provided an important resource for studying gene function, especially within model organisms. Unprecedented expansion of the scientific literature and validation of the predicted proteins have increased both data value and the challenges of keeping pace. Capturing literature-based functional annotations is limited by the ability of biocurators to handle the massive and rapidly growing scientific literature. Within the community-oriented wiki framework for GO annotation called the Gene Ontology Normal Usage Tracking System (GONUTS), we describe an approach to expand biocuration through crowdsourcing with undergraduates. This multiplies the number of high-quality annotations in international databases, enriches our coverage of the literature on normal gene function, and pushes the field in new directions. From an intercollegiate competition judged by experienced biocurators, Community Assessment of Community Annotation with Ontologies (CACAO), we have contributed nearly 5,000 literature-based annotations. Many of those annotations are to organisms not currently well-represented within GO. Over a 10-year history, our community contributors have spurred changes to the ontology not traditionally covered by professional biocurators. The CACAO principle of relying on community members to participate in and shape the future of biocuration in GO is a powerful and scalable model used to promote the scientific enterprise. It also provides undergraduate students with a unique and enriching introduction to critical reading of primary literature and acquisition of marketable skills.
Affiliation(s)
- Jolene Ramsey
  - Department of Biochemistry & Biophysics, Texas A&M University, College Station, Texas, United States of America
  - Center for Phage Technology, Texas A&M University, College Station, Texas, United States of America
- Brenley McIntosh
  - Department of Biochemistry & Biophysics, Texas A&M University, College Station, Texas, United States of America
- Daniel Renfro
  - Department of Biochemistry & Biophysics, Texas A&M University, College Station, Texas, United States of America
- Suzanne A. Aleksander
  - Department of Biochemistry & Biophysics, Texas A&M University, College Station, Texas, United States of America
- Sandra LaBonte
  - Department of Biochemistry & Biophysics, Texas A&M University, College Station, Texas, United States of America
- Curtis Ross
  - Department of Biochemistry & Biophysics, Texas A&M University, College Station, Texas, United States of America
  - Center for Phage Technology, Texas A&M University, College Station, Texas, United States of America
- Adrienne E. Zweifel
  - Department of Biochemistry & Biophysics, Texas A&M University, College Station, Texas, United States of America
- Nathan Liles
  - Department of Biochemistry & Biophysics, Texas A&M University, College Station, Texas, United States of America
- Shabnam Farrar
  - Department of Biochemistry & Biophysics, Texas A&M University, College Station, Texas, United States of America
- Jason J. Gill
  - Center for Phage Technology, Texas A&M University, College Station, Texas, United States of America
  - Department of Animal Science, Texas A&M University, College Station, Texas, United States of America
- Ivan Erill
  - Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, Maryland, United States of America
  - Department of Computer Science and Electrical Engineering, University of Maryland Baltimore County, Baltimore, Maryland, United States of America
- Sarah Ades
  - Department of Biochemistry & Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania, United States of America
- Tanya Z. Berardini
  - The Arabidopsis Information Resource, Phoenix Bioinformatics, Newark, California, United States of America
- Jennifer A. Bennett
  - Department of Biology and Earth Science, Otterbein University, Westerville, Ohio, United States of America
- Siobhan Brady
  - Department of Plant Biology and Genome Center, University of California Davis, Davis, California, United States of America
- Robert Britton
  - Department of Microbiology and Molecular Genetics, Michigan State University, East Lansing, Michigan, United States of America
- Seth Carbon
  - Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, California, United States of America
- Steven M. Caruso
  - Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, Maryland, United States of America
- Dave Clements
  - Department of Biology, Johns Hopkins University, Baltimore, Maryland, United States of America
- Ritu Dalia
  - Department of Biology, Drexel University, Philadelphia, Pennsylvania, United States of America
- Meredith Defelice
  - Department of Biochemistry & Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania, United States of America
- Erin L. Doyle
  - Biology Department, Doane University, Crete, Nebraska, United States of America
- Iddo Friedberg
  - Department of Microbiology, Miami University, Oxford, Ohio, United States of America
- Susan M. R. Gurney
  - Department of Biology, Drexel University, Philadelphia, Pennsylvania, United States of America
- Lee Hughes
  - Department of Biological Sciences, University of North Texas, Denton, Texas, United States of America
- Allison Johnson
  - Center for the Study of Biological Complexity, Virginia Commonwealth University, Richmond, Virginia, United States of America
- Jason M. Kowalski
  - Biological Sciences Department, University of Wisconsin-Parkside, Kenosha, Wisconsin, United States of America
- Donghui Li
  - The Arabidopsis Information Resource, Phoenix Bioinformatics, Newark, California, United States of America
- Ruth C. Lovering
  - Institute of Cardiovascular Science, University College London, London, United Kingdom
- Tamara L. Mans
  - Department of Biochemistry and Biotechnology, Minnesota State University Moorhead, Brooklyn Park, Minnesota, United States of America
- Fiona McCarthy
  - Department of Basic Science, College of Veterinary Medicine, Mississippi State University, Starkville, Mississippi, United States of America
- Sean D. Moore
  - Burnett School of Biomedical Sciences, University of Central Florida, Orlando, Florida, United States of America
- Rebecca Murphy
  - Department of Biology, Centenary College of Louisiana, Shreveport, Louisiana, United States of America
- Timothy D. Paustian
  - Department of Bacteriology, University of Wisconsin, Madison, Wisconsin, United States of America
- Sarah Perdue
  - Biological Sciences Department, University of Wisconsin-Parkside, Kenosha, Wisconsin, United States of America
- Celeste N. Peterson
  - Biology Department, Suffolk University, Boston, Massachusetts, United States of America
- Birgit M. Prüß
  - Microbiological Sciences Department, North Dakota State University, Fargo, North Dakota, United States of America
- Margaret S. Saha
  - Department of Biology, College of William & Mary, Williamsburg, Virginia, United States of America
- Robert R. Sheehy
  - Biology Department, Radford University, Radford, Virginia, United States of America
- John T. Tansey
  - Department of Biochemistry and Molecular Biology, Otterbein University, Westerville, Ohio, United States of America
- Louise Temple
  - School of Integrated Sciences, James Madison University, Harrisonburg, Virginia, United States of America
- Alexander William Thorman
  - Department of Environmental and Public Health Sciences, University of Cincinnati, Cincinnati, Ohio, United States of America
- Saul Trevino
  - Department of Chemistry, Math, and Physics, Houston Baptist University, Houston, Texas, United States of America
- Amy Cheng Vollmer
  - Department of Biology, Swarthmore College, Swarthmore, Pennsylvania, United States of America
- Virginia Walbot
  - Department of Biology, Stanford University, Stanford, California, United States of America
- Joanne Willey
  - Department of Science Education, Donald and Barbara Zucker School of Medicine at Hofstra/Northwell, Hempstead, New York, United States of America
- Deborah A. Siegele
  - Department of Biology, Texas A&M University, College Station, Texas, United States of America
- James C. Hu
  - Department of Biochemistry & Biophysics, Texas A&M University, College Station, Texas, United States of America
  - Center for Phage Technology, Texas A&M University, College Station, Texas, United States of America
6. Cui H, Zhang L, Ford B, Cheng HL, Macklin JA, Reznicek A, Starr J. Measurement Recorder: developing a useful tool for making species descriptions that produces computable phenotypes. Database (Oxford) 2020; 2020:5995854. [PMID: 33216896 PMCID: PMC7678789 DOI: 10.1093/database/baaa079] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 05/07/2020] [Revised: 08/24/2020] [Accepted: 08/27/2020] [Indexed: 12/31/2022]
Abstract
To use published phenotype information in computational analyses, there have been efforts to convert descriptions of phenotype characters from human languages to ontologized statements. This postpublication curation process is not only slow and costly, it is also burdened with significant intercurator variation (including curator-author variation), due to different interpretations of a character by various individuals. This problem is inherent in any human-based intellectual activity. To address this problem, making scientific publications semantically clear (i.e. computable) by the authors at the time of publication is a critical step if we are to avoid postpublication curation. To help authors efficiently produce species phenotypes while producing computable data, we are experimenting with an author-driven ontology development approach and developing and evaluating a series of ontology-aware software modules that would create publishable species descriptions that are readily useable in scientific computations. The first software module prototype called Measurement Recorder has been developed to assist authors in defining continuous measurements and reported in this paper. Two usability studies of the software were conducted with 22 undergraduate students majoring in information science and 32 in biology. Results suggest that participants can use Measurement Recorder without training and they find it easy to use after limited practice. Participants also appreciate the semantic enhancement features. Measurement Recorder's character reuse features facilitate character convergence among participants by 48% and have the potential to further reduce user errors in defining characters. A set of software design issues have also been identified and then corrected. Measurement Recorder enables authors to record measurements in a semantically clear manner and enriches phenotype ontology along the way. 
Future work includes representing the semantic data as Resource Description Framework (RDF) knowledge graphs and characterizing the division of work between authors as domain knowledge providers and ontology engineers as knowledge formalizers in this new author-driven ontology development approach.
Affiliation(s)
- Hong Cui
  - School of Information, University of Arizona, Tucson, AZ 85705, USA
- Limin Zhang
  - School of Information, University of Arizona, Tucson, AZ 85705, USA
- Bruce Ford
  - Department of Biological Sciences, University of Manitoba, Winnipeg, MB R3T 2N2, Canada
- Hsin-Liang Cheng
  - Curtis Laws Wilson Library, Missouri University of Science and Technology, Rolla, MO 65409, USA
- James A Macklin
  - Ottawa Research and Development Centre, Agriculture and Agri-Food Canada, Ottawa, ON K1A 0C6, Canada
- Anton Reznicek
  - LSA Herbarium, University of Michigan, Ann Arbor, MI 48019, USA
- Julian Starr
  - Department of Biology, University of Ottawa, Ottawa, ON K1N 6N5, Canada
7. Salimi N, Edwards L, Foos G, Greenbaum JA, Martini S, Reardon B, Shackelford D, Vita R, Zalman L, Peters B, Sette A. A behind-the-scenes tour of the IEDB curation process: an optimized process empirically integrating automation and human curation efforts. Immunology 2020; 161:139-147. [PMID: 32615639 PMCID: PMC7496777 DOI: 10.1111/imm.13234] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 04/13/2020] [Revised: 06/11/2020] [Accepted: 06/22/2020] [Indexed: 12/13/2022]
Abstract
The Immune Epitope Database and Analysis Resource (IEDB) provides the scientific community with open access to epitope data, as well as epitope prediction and analysis tools. The IEDB houses the most extensive collection of experimentally validated B‐cell and T‐cell epitope data, sourced primarily from published literature by expert curation. The data procurement requires systematic identification, categorization, curation and quality‐checking processes. Here, we provide insights into these processes, with particular focus on the dividends they have paid in terms of attaining project milestones, as well as how objective analyses of our processes have identified opportunities for process optimization. These experiences are shared as a case study of the benefits of process implementation and review in biomedical big data, as well as to encourage idea‐sharing among players in this ever‐growing space.
Affiliation(s)
- Nima Salimi
  - Division of Vaccine Discovery, La Jolla Institute for Immunology, La Jolla, CA, USA
- Lindy Edwards
  - Division of Vaccine Discovery, La Jolla Institute for Immunology, La Jolla, CA, USA
- Gabriele Foos
  - Division of Vaccine Discovery, La Jolla Institute for Immunology, La Jolla, CA, USA
- Jason A Greenbaum
  - Division of Vaccine Discovery, La Jolla Institute for Immunology, La Jolla, CA, USA
- Sheridan Martini
  - Division of Vaccine Discovery, La Jolla Institute for Immunology, La Jolla, CA, USA
- Brian Reardon
  - Division of Vaccine Discovery, La Jolla Institute for Immunology, La Jolla, CA, USA
- Deborah Shackelford
  - Division of Vaccine Discovery, La Jolla Institute for Immunology, La Jolla, CA, USA
- Randi Vita
  - Division of Vaccine Discovery, La Jolla Institute for Immunology, La Jolla, CA, USA
- Leora Zalman
  - Division of Vaccine Discovery, La Jolla Institute for Immunology, La Jolla, CA, USA
- Bjoern Peters
  - Division of Vaccine Discovery, La Jolla Institute for Immunology, La Jolla, CA, USA
  - Department of Medicine, University of California, San Diego, San Diego, CA, USA
- Alessandro Sette
  - Division of Vaccine Discovery, La Jolla Institute for Immunology, La Jolla, CA, USA
  - Department of Medicine, University of California, San Diego, San Diego, CA, USA
8. Lock A, Harris MA, Rutherford K, Hayles J, Wood V. Community curation in PomBase: enabling fission yeast experts to provide detailed, standardized, sharable annotation from research publications. Database (Oxford) 2020; 2020:5827230. [PMID: 32353878 PMCID: PMC7192550 DOI: 10.1093/database/baaa028] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 01/10/2020] [Revised: 02/28/2020] [Accepted: 03/22/2020] [Indexed: 11/22/2022]
Abstract
Maximizing the impact and value of scientific research requires efficient knowledge distribution, which increasingly depends on the integration of standardized published data into online databases. To make data integration more comprehensive and efficient for fission yeast research, PomBase has pioneered a community curation effort that engages publication authors directly in FAIR-sharing of data representing detailed biological knowledge from hypothesis-driven experiments. Canto, an intuitive online curation tool that enables biologists to describe their detailed functional data using shared ontologies, forms the core of PomBase’s system. With 8 years’ experience, and as the author response rate reaches 50%, we review community curation progress and the insights we have gained from the project. We highlight incentives and nudges we deploy to maximize participation, and summarize project outcomes, which include increased knowledge integration and dissemination as well as the unanticipated added value arising from co-curation by publication authors and professional curators.
Affiliation(s)
- Antonia Lock
  - Department of Genetics, Evolution and Environment, University College London, Gower Street, London WC1E 6BT, UK
- Midori A Harris
  - Department of Biochemistry, University of Cambridge, Tennis Court Road, Cambridge CB2 1GA, UK
- Kim Rutherford
  - Department of Biochemistry, University of Cambridge, Tennis Court Road, Cambridge CB2 1GA, UK
- Jacqueline Hayles
  - Department of Biochemistry, University of Cambridge, Tennis Court Road, Cambridge CB2 1GA, UK
- Valerie Wood
  - Cell Cycle Laboratory, The Francis Crick Institute, Midland Rd, London NW1 1AT, UK
9. Arnaboldi V, Raciti D, Van Auken K, Chan JN, Müller HM, Sternberg PW. Text mining meets community curation: a newly designed curation platform to improve author experience and participation at WormBase. Database (Oxford) 2020; 2020:baaa006. [PMID: 32185395 PMCID: PMC7078066 DOI: 10.1093/database/baaa006] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 10/29/2019] [Revised: 01/08/2020] [Accepted: 01/14/2020] [Indexed: 01/17/2023]
Abstract
Biological knowledgebases rely on expert biocuration of the research literature to maintain up-to-date collections of data organized in machine-readable form. To enter information into knowledgebases, curators need to follow three steps: (i) identify papers containing relevant data, a process called triaging; (ii) recognize named entities; and (iii) extract and curate data in accordance with the underlying data models. WormBase (WB), the authoritative repository for research data on Caenorhabditis elegans and other nematodes, uses text mining (TM) to semi-automate its curation pipeline. In addition, WB engages its community, via an Author First Pass (AFP) system, to help recognize entities and classify data types in their recently published papers. In this paper, we present a new WB AFP system that combines TM and AFP into a single application to enhance community curation. The system employs string-searching algorithms and statistical methods (e.g. support vector machines (SVMs)) to extract biological entities and classify data types, and it presents the results to authors in a web form where they validate the extracted information, rather than enter it de novo as the previous form required. With this new system, we lessen the burden for authors, while at the same time receive valuable feedback on the performance of our TM tools. The new user interface also links out to specific structured data submission forms, e.g. for phenotype or expression pattern data, giving the authors the opportunity to contribute a more detailed curation that can be incorporated into WB with minimal curator review. Our approach is generalizable and could be applied to additional knowledgebases that would like to engage their user community in assisting with the curation. In the five months succeeding the launch of the new system, the response rate has been comparable with that of the previous AFP version, but the quality and quantity of the data received has greatly improved.
Affiliation(s)
- Valerio Arnaboldi
  - Division of Biology and Biological Engineering 156–29, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
- Daniela Raciti
  - Division of Biology and Biological Engineering 156–29, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
- Kimberly Van Auken
  - Division of Biology and Biological Engineering 156–29, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
- Juancarlos N Chan
  - Division of Biology and Biological Engineering 156–29, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
- Hans-Michael Müller
  - Division of Biology and Biological Engineering 156–29, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
- Paul W Sternberg
  - Division of Biology and Biological Engineering 156–29, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
10.
Abstract
Data, including information generated from them by processing and analysis, are an asset with measurable value. The assets that biological research funding produces are the data generated, the information derived from these data, and, ultimately, the discoveries and knowledge these lead to. From the time when Henry Oldenburg published the first scientific journal in 1665 (Proceedings of the Royal Society) to the founding of the United States National Library of Medicine in 1879 to the present, there has been a sustained drive to improve how researchers can record and discover what is known. Researchers’ experimental work builds upon years and (collectively) billions of dollars’ worth of earlier work. Today, researchers are generating data at ever-faster rates because of advances in instrumentation and technology, coupled with decreases in production costs. Unfortunately, the ability of researchers to manage and disseminate their results has not kept pace, so their work cannot achieve its maximal impact. Strides have recently been made, but more awareness is needed of the essential role that biological data resources, including biocuration, play in maintaining and linking this ever-growing flood of data and information. The aim of this paper is to describe the nature of data as an asset, the role biocurators play in increasing its value, and consistent, practical means to measure effectiveness that can guide planning and justify costs in biological research information resources’ development and management.
|
11
|
Gabella C, Durinx C, Appel R. Funding knowledgebases: Towards a sustainable funding model for the UniProt use case. F1000Res 2017; 6. [PMID: 29333230 PMCID: PMC5747334 DOI: 10.12688/f1000research.12989.2] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Accepted: 03/19/2018] [Indexed: 11/30/2022]
Abstract
Millions of life scientists across the world rely on bioinformatics data resources for their research projects. Data resources can be very expensive, especially those with a high added value such as expert-curated knowledgebases. Despite the increasing need for such highly accurate and reliable sources of scientific information, most of them do not have secured funding over the near future and often depend on short-term grants that are much shorter than their planning horizon. Additionally, they are often evaluated as research projects rather than as research infrastructure components. In this work, twelve funding models for data resources are described and applied to the case study of the Universal Protein Resource (UniProt), a key resource for protein sequences and functional information knowledge. We show that most of the models present inconsistencies with open access or equity policies, and that while some models do not allow the total costs to be covered, they could potentially be used as a complementary income source. We propose the Infrastructure Model as a sustainable and equitable model for all core data resources in the life sciences. With this model, funding agencies would set aside a fixed percentage of their research grant volumes, which would subsequently be redistributed to core data resources according to well-defined selection criteria. This model, compatible with the principles of open science, is in agreement with several international initiatives such as the Human Frontiers Science Program Organisation (HFSPO) and the OECD Global Science Forum (GSF) project. Here, we have estimated that less than 1% of the total amount dedicated to research grants in the life sciences would be sufficient to cover the costs of the core data resources worldwide, including both knowledgebases and deposition databases.
Affiliation(s)
- Chiara Gabella
- ELIXIR-Switzerland, SIB Swiss Institute of Bioinformatics, Lausanne, 1015, Switzerland
- Christine Durinx
- ELIXIR-Switzerland, SIB Swiss Institute of Bioinformatics, Lausanne, 1015, Switzerland
- Ron Appel
- ELIXIR-Switzerland, SIB Swiss Institute of Bioinformatics, Lausanne, 1015, Switzerland
|
12
|
Abstract
Two decades after the first Plasmodium transfection, attempts have been made to disrupt more than 3,151 genes in malaria parasites, across five Plasmodium species. While results from rodent malaria transfections have been curated and systematised, empowering large-scale analysis, phenotypic data from human malaria parasite transfections currently exist as individual reports scattered across the literature. To facilitate systematic analysis of published experimental genetic data across Plasmodium species, we have built PhenoPlasm (http://www.phenoplasm.org), a database of phenotypes generated by transfection experiments in all Plasmodium parasites. The site provides a simple interface linking citation-backed Plasmodium reverse-genetic phenotypes to gene IDs. The database has been populated with phenotypic data on 367 P. falciparum genes, curated from 176 individual publications, as well as existing data on rodent Plasmodium species from RMgmDB and PlasmoGEM. This is the first time that all available data on P. falciparum transfection experiments have been brought together in a single place. These data are presented using ortholog mapping to allow a researcher interested in a gene in one species to see results across other Plasmodium species. The collaborative nature of the database enables any researcher to add new phenotypes as they are discovered. As an example of database utility, we use the currently available datasets to identify RAP (RNA-binding domain abundant in Apicomplexa)-domain containing proteins as crucial to parasite survival.
Affiliation(s)
- Theo Sanderson
- Malaria Programme, Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Julian C Rayner
- Malaria Programme, Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
|
13
|
Sanderson T, Rayner JC. PhenoPlasm: a database of disruption phenotypes for malaria parasite genes. Wellcome Open Res 2017; 2:45. [PMID: 28748223 PMCID: PMC5500895 DOI: 10.12688/wellcomeopenres.11896.2] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.5] [Accepted: 07/24/2017] [Indexed: 12/12/2022]
Abstract
Two decades after the first Plasmodium transfection, attempts have been made to disrupt more than 3,151 genes in malaria parasites, across five Plasmodium species. While results from rodent malaria transfections have been curated and systematised, empowering large-scale analysis, phenotypic data from human malaria parasite transfections currently exist as individual reports scattered across the literature. To facilitate systematic analysis of published experimental genetic data across Plasmodium species, we have built PhenoPlasm (http://www.phenoplasm.org), a database of phenotypes generated by transfection experiments in all Plasmodium parasites. The site provides a simple interface linking citation-backed Plasmodium reverse-genetic phenotypes to gene IDs. The database has been populated with phenotypic data on 367 P. falciparum genes, curated from 176 individual publications, as well as existing data on rodent Plasmodium species from RMgmDB and PlasmoGEM. This is the first time that all available data on P. falciparum transfection experiments have been brought together in a single place. These data are presented using ortholog mapping to allow a researcher interested in a gene in one species to see results across other Plasmodium species. The collaborative nature of the database enables any researcher to add new phenotypes as they are discovered. As an example of database utility, we use the currently available datasets to identify RAP (RNA-binding domain abundant in Apicomplexa)-domain containing proteins as crucial to parasite survival.
Affiliation(s)
- Theo Sanderson
- Malaria Programme, Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Julian C. Rayner
- Malaria Programme, Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
|